A Semantic Relevance Based Neural Network for Text Summarization and Text Simplification

Shuming Ma (shumingma@pku.edu.cn) and Xu Sun (xusun@pku.edu.cn), MOE Key Laboratory of Computational Linguistics, School of Electronics Engineering and Computer Science, Peking University, Beijing, China
Abstract

Text summarization and text simplification are two major ways to simplify texts for poor readers, including children, non-native speakers, and the functionally illiterate. Text summarization aims to produce a brief summary of the main ideas of a text, while text simplification aims to reduce the linguistic complexity of a text while retaining its original meaning. Recently, most approaches to text summarization and text simplification have been based on the sequence-to-sequence model, which achieves much success in many text generation tasks. However, although the generated simplified texts are literally similar to the source texts, they often have low semantic relevance. In this work, our goal is to improve the semantic relevance between source texts and simplified texts for text summarization and text simplification. We introduce a Semantic Relevance Based neural model to encourage high semantic similarity between texts and summaries. In our model, the source text is represented by a gated attention encoder, while the summary representation is produced by a decoder, and the similarity score between the two representations is maximized during training. Our experiments show that the proposed model outperforms the state-of-the-art systems on two benchmark corpora. Our code is available at https://github.com/shumingma/SRB.


1 Introduction

Text summarization and text simplification both aim to make texts easier to read and understand, especially for poor readers, including children, non-native speakers, and the functionally illiterate. Text summarization simplifies texts at the document level. The source texts often consist of many sentences or paragraphs, and the simplified texts are a few brief sentences expressing the main ideas of the source texts. Text simplification simplifies texts at the sentence level. It aims to rewrite sentences so as to reduce their lexical and structural complexity. Unlike text summarization, it does not require the simplified sentences to be shorter, but it requires the words to be simple to understand.

In some previous work, extractive summarization achieves satisfying performance by selecting a few sentences from the source texts Radev et al. (2004); Cheng and Lapata (2016); Cao et al. (2015). By extracting sentences, the generated texts are grammatical and retain the same meaning as the source texts. However, this does not simplify the texts but only shortens them. Some related work regards text simplification as a combination of three operations: splitting, deletion, and paraphrasing, which requires rule-based models or heavy syntactic features Zhu, Bernhard, and Gurevych (2010); Woodsend and Lapata (2011); Filippova et al. (2015).

Most recent approaches use the sequence-to-sequence model for text summarization Rush, Chopra, and Weston (2015); Hu, Chen, and Zhu (2015) and text simplification Nisioi et al. (2017); Cao et al. (2017); Zhang and Lapata (2017). The sequence-to-sequence model is a widely used end-to-end framework for text generation tasks such as machine translation. It compresses the source text information into dense vectors with a neural encoder, and the neural decoder generates the target text from the compressed vectors.

For both text summarization and text simplification, the simplified texts must have high semantic relevance to the source texts. However, current sequence-to-sequence models tend to produce grammatical and coherent simplified texts regardless of their semantic relevance to the source texts. Table 1 shows that the summary generated by an LSTM sequence-to-sequence model (Seq2seq) is literally similar to the source text, but it has low semantic relevance.

Text 昨晚,中联航空成都飞北京一架航班被发现有多人吸烟。后因天气原因,飞机备降太原机场。有乘客要求重新安检,机长决定继续飞行,引起机组人员与未吸烟乘客冲突。
Last night, several people were caught smoking on a China United Airlines flight from Chengdu to Beijing. Later, due to the weather, the flight made an unscheduled landing at Taiyuan Airport. Some passengers asked for a new security check, but the captain decided to continue the flight, which led to a conflict between the crew and the non-smoking passengers.
Seq2seq 中联航空机场发生爆炸致多人死亡。
An explosion occurred at the China United Airlines airport, leaving several people dead.
Gold 航班多人吸烟机组人员与乘客冲突。
Several people smoked on a flight, which led to a conflict between the crew and the passengers.
Table 1: An example of the simplified text generated by Seq2seq for summarization. The summary is literally similar to the source text, but has low semantic relevance.

In this work, our goal is to improve the semantic relevance between source texts and generated simplified texts for text summarization and text simplification. To achieve this goal, we propose a Semantic Relevance Based neural network model (SRB). In our model, the encoder compresses the source texts into dense vectors, and the decoder decodes the dense vectors into simplified texts. The encoder produces the representation of the source texts, and the decoder produces the representation of the generated texts. A similarity evaluation component is introduced to measure the relevance of the source texts and the generated texts. During training, we maximize the similarity score to encourage high semantic relevance between source texts and simplified texts. In order to better represent a long source text, we introduce a self-gated attention encoder to memorize the input text. We conduct experiments on three corpora, namely LCSTS, PWKP, and EW-SEW. Experiments show that our proposed model performs better than the state-of-the-art systems on two benchmark corpora.

The contributions of this work are as follows:

  • We propose a Semantic Relevance Based neural network model (SRB) to improve the semantic relevance between source texts and generated simplified texts for text summarization and text simplification. A similarity evaluation component is introduced to measure the relevance of the source texts and the generated texts, and the similarity score is maximized to encourage high semantic relevance between source texts and simplified texts.

  • We introduce a self-gated encoder to better represent a long, redundant text. We perform experiments on three corpora, namely LCSTS, PWKP, and EW-SEW. Experiments show that our proposed model outperforms the state-of-the-art systems on two benchmark corpora.

2 Background: Sequence-to-sequence Model

Most recent models for text summarization and text simplification are based on the sequence-to-sequence model. The sequence-to-sequence model compresses the source text into a continuous vector representation with an encoder, and then generates the simplified text with a decoder. In previous work Nisioi et al. (2017); Hu, Chen, and Zhu (2015), the encoder is a two-layer Long Short-term Memory Network (LSTM) Hochreiter and Schmidhuber (1997), which maps the source text $x = \{x_1, x_2, \dots, x_N\}$ into the hidden vectors $\{h_1, h_2, \dots, h_N\}$. The decoder is a uni-directional LSTM, producing the hidden output $s_t$, which is the dense representation of the words at the $t$-th time step. Finally, the word generator computes the distribution of output words $y_t$ with the hidden state $s_t$ and the parameter matrix $W$:

$$p(y_t \mid y_{<t}, x) = \mathrm{softmax}(W s_t) \qquad (1)$$

The attention mechanism is introduced to better capture the context information of source texts Bahdanau, Cho, and Bengio (2014). The attention vector $c_t$ is calculated as the weighted sum of the encoder hidden states:

$$c_t = \sum_{i=1}^{N} \alpha_{ti} h_i \qquad (2)$$
$$\alpha_{ti} = \frac{\exp\big(g(s_t, h_i)\big)}{\sum_{j=1}^{N} \exp\big(g(s_t, h_j)\big)} \qquad (3)$$

where $g(s_t, h_i)$ is an attentive score between the decoder hidden state $s_t$ and the encoder hidden state $h_i$. When predicting an output word, the decoder takes account of the attention vector, which contains the alignment information between source texts and simplified texts. With the attention mechanism, the word generator computes the distribution of output words $y_t$ as:

$$\tilde{s}_t = \tanh\big(W_c [s_t; c_t]\big) \qquad (4)$$
$$p(y_t \mid y_{<t}, x) = \mathrm{softmax}(W \tilde{s}_t) \qquad (5)$$
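For illustration, the following is a minimal PyTorch sketch of one attention step, assuming a bilinear attentive score $g(s_t, h_i) = s_t^\top W_a h_i$ and a Luong-style combination; the variable names and the exact scoring function are illustrative and may differ from any released implementation.

```python
import torch
import torch.nn.functional as F

def attention_step(s_t, encoder_states, W_a, W_c, W_s):
    """One decoding step with attention (Eqs. 2-5).

    s_t:            decoder hidden state, shape (batch, hidden)
    encoder_states: encoder outputs h_1..h_N, shape (batch, N, hidden)
    W_a, W_c, W_s:  torch.nn.Linear layers (assumed parameterization)
    """
    # attentive scores g(s_t, h_i), shape (batch, N)
    scores = torch.bmm(encoder_states, W_a(s_t).unsqueeze(2)).squeeze(2)
    alpha = F.softmax(scores, dim=-1)                                # Eq. 3
    c_t = torch.bmm(alpha.unsqueeze(1), encoder_states).squeeze(1)   # Eq. 2
    s_att = torch.tanh(W_c(torch.cat([s_t, c_t], dim=-1)))           # Eq. 4
    p_vocab = F.softmax(W_s(s_att), dim=-1)                          # Eq. 5
    return p_vocab, alpha, c_t
```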
Figure 1: Our Semantic Relevance Based neural model. It consists of a decoder (above), an encoder (below), and a cosine similarity function.

3 Proposed Model

Our goal is to improve the semantic relevance between source texts and simplified texts, so our proposed model encourages high similarity between their representations. Figure 1 shows our proposed model. The model consists of three components: an encoder, a decoder, and a similarity function. The encoder compresses the source texts into semantic vectors, and the decoder generates summaries and produces the semantic vectors of the generated summaries. Finally, the similarity function evaluates the relevance between the semantic vectors of the source texts and the generated summaries. Our training objective is to maximize the similarity score so that the generated summaries have high semantic relevance to the source texts.

Figure 2: The self-gated encoder. It measures the importance of each word and decides how much information is reserved in the representation of the text.

3.1 Self-gated Encoder

The goal of the complex text encoder is to provide a series of dense representations of the source text for the decoder and the semantic relevance component. In previous work Nisioi et al. (2017), the complex text encoder is a two-layer uni-directional Long Short-term Memory Network (LSTM), which produces the dense representations $\{h_1, h_2, \dots, h_N\}$ from the source text $x = \{x_1, x_2, \dots, x_N\}$.

However, in text summarization and text simplification, the source texts are usually very long and noisy, so some of the encoded information from the beginning of a text vanishes by the end of the text, which leads to poor representations. A bi-directional LSTM is an alternative way to deal with this problem, but it takes twice as long to encode the source texts, and it does not represent the middle of a text well when the text is too long. To solve this problem, we propose a self-gated encoder to better represent a long text.

In text summarization and text simplification, some words or information in the source texts are unimportant, so they need to be simplified or discarded. Therefore, we introduce a self-gated encoder, which can reduce the unnecessary information and enhance the important information to represent a long text.

The self-gated encoder tries to measure the importance of each word and decide how much of its information is reserved in the representation of the text. At each time step, the upcoming word $x_t$ is fed into the LSTM cell, which outputs the dense vector $g_t$:

$$g_t = f(x_t, g_{t-1}) \qquad (6)$$

where $f$ is the LSTM function, and $g_t$ is the output vector of the LSTM cell. A feed-forward neural network is used to measure the importance of the word and decide how much information is reserved:

$$p_t = \sigma\big(\mathrm{FNN}(g_t)\big) \qquad (7)$$

where $\mathrm{FNN}$ is the feed-forward neural network function, and $p_t$ measures the proportion of the reserved information. Finally, the reserved information is computed by multiplying by $p_t$:

$$h_t = p_t \cdot g_t \qquad (8)$$
$$\tilde{e}_t = p_t \cdot e_t \qquad (9)$$

where $h_t$ is the representation at the $t$-th time step, and $e_t$ is the input embedding of $x_t$ at the $t$-th time step.
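A minimal PyTorch sketch of the self-gated encoder is given below. It assumes a single-layer LSTM and a sigmoid feed-forward gate producing a scalar $p_t$ per position, which is then multiplied into both the LSTM output and the input embedding as in Eqs. (6)-(9); the actual model may differ in layer count and gate form.

```python
import torch.nn as nn

class SelfGatedEncoder(nn.Module):
    """Self-gated encoder sketch: an LSTM whose outputs are scaled by a
    learned importance gate p_t (Eqs. 6-9)."""

    def __init__(self, vocab_size, embed_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(hidden_size, 1), nn.Sigmoid())

    def forward(self, x):
        e = self.embedding(x)   # input embeddings e_t, shape (batch, N, embed)
        g, _ = self.lstm(e)     # LSTM outputs g_t, shape (batch, N, hidden)
        p = self.gate(g)        # importance p_t in (0, 1), shape (batch, N, 1)
        h = p * g               # reserved representation h_t = p_t * g_t (Eq. 8)
        e_gated = p * e         # gated input embedding (Eq. 9)
        return h, e_gated, p
```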

3.2 Simplified Text Decoder

The goal of the simplified text decoder is to generate a series of simplified words from the dense representations of the source text. In our model, the dense representations of the source text are fed into an attention layer to generate the context vector $c_t$:

$$c_t = \sum_{i=1}^{N} \alpha_{ti} h_i \qquad (10)$$
$$\alpha_{ti} = \frac{\exp\big(g(s_t, h_i)\big)}{\sum_{j=1}^{N} \exp\big(g(s_t, h_j)\big)} \qquad (11)$$

where $s_t$ is the dense representation of the generated simplified words, computed by a two-layer LSTM.

In this way, $c_t$ and $s_t$ respectively represent the context information of the source text and the target text at the $t$-th time step. To predict the $t$-th word, the decoder uses $c_t$ and $s_t$ to generate the probability distribution of the candidate words:

$$o_t = \tanh\big(W_c [s_t; c_t]\big) \qquad (12)$$
$$p(y_t \mid y_{<t}, x) = \mathrm{softmax}(W_s o_t) \qquad (13)$$

where $W_c$ and $W_s$ are the parameter matrices of the output layer. Finally, the word with the highest probability is predicted:

$$\hat{y}_t = \operatorname*{arg\,max}_{y_t} \; p(y_t \mid y_{<t}, x) \qquad (14)$$
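The prediction rule of Eq. (14) amounts to greedy decoding. The sketch below reuses the attention_step function from Section 2; the decoder module, embedding table, and BOS token id are hypothetical placeholders, and beam search or an EOS-based stopping criterion could replace the fixed-length loop.

```python
import torch

def greedy_decode(decoder_lstm, embed, encoder_states, W_a, W_c, W_s,
                  bos_id, max_len=100):
    """Greedy decoding: at each step emit argmax p(y_t | y_<t, x) (Eq. 14).
    Stopping at the EOS symbol is omitted for brevity."""
    batch = encoder_states.size(0)
    y_prev = torch.full((batch,), bos_id, dtype=torch.long)
    state = None
    outputs = []
    for _ in range(max_len):
        # one step of the decoder LSTM
        s_t, state = decoder_lstm(embed(y_prev).unsqueeze(1), state)
        p_vocab, _, _ = attention_step(s_t.squeeze(1), encoder_states, W_a, W_c, W_s)
        y_prev = p_vocab.argmax(dim=-1)      # \hat{y}_t
        outputs.append(y_prev)
    return torch.stack(outputs, dim=1)       # (batch, max_len) word indices
```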

3.3 Semantic Relevance

Our goal is to compute the semantic relevance of the source text and the generated text given the source semantic vector $V_t$ and the generated semantic vector $V_s$. Here, we use cosine similarity to measure the semantic relevance, which is expressed with a dot product and magnitudes:

$$\cos(V_s, V_t) = \frac{V_s \cdot V_t}{\lVert V_s \rVert \, \lVert V_t \rVert} \qquad (15)$$

Source texts and generated texts share the same language, so it is reasonable to assume that their semantic vectors are distributed in the same space. Cosine similarity is a good way to measure the similarity of two vectors in the same space.

Given the semantic relevance metric, the problem is how to obtain the semantic vectors $V_t$ and $V_s$. There are several methods to represent a text or a sentence, such as mean pooling of the LSTM outputs or reserving the last state of the LSTM. In our model, we select the last state of the encoder as the representation of the source text:

$$V_t = h_N \qquad (16)$$

A natural idea to get the semantic vector of a summary is to feed it into the encoder as well. However, this method wastes much time because we would encode the same sentence twice. Actually, the last output of the decoder contains the information of both the source text and the generated summary. We simply compute the semantic vector of the summary by subtracting $V_t$ from the last decoder output $s_M$:

$$V_s = s_M - V_t \qquad (17)$$

Previous work has shown that it is effective to represent a span of words without encoding it a second time Wang and Chang (2016).
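Under the notation above, computing the two semantic vectors and their cosine similarity takes only a few lines of PyTorch; the sketch below assumes batched encoder and decoder outputs and follows Eqs. (15)-(17).

```python
import torch.nn.functional as F

def semantic_relevance(encoder_states, decoder_states):
    """Cosine similarity between the source vector V_t and the summary
    vector V_s (Eqs. 15-17).

    encoder_states: (batch, N, hidden) self-gated encoder outputs
    decoder_states: (batch, M, hidden) decoder hidden outputs
    """
    v_text = encoder_states[:, -1, :]    # V_t: last encoder state (Eq. 16)
    s_last = decoder_states[:, -1, :]    # last decoder output s_M
    v_summary = s_last - v_text          # V_s = s_M - V_t (Eq. 17)
    return F.cosine_similarity(v_summary, v_text, dim=-1)   # Eq. 15
```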

3.4 Training

Given the model parameters $\theta$ and the input text $x$, the model produces the corresponding summary $y$ and the semantic vectors $V_t$ and $V_s$. The objective is to minimize the loss function:

$$L = -\log p(y \mid x; \theta) - \lambda \cos(V_s, V_t) \qquad (18)$$

where $p(y \mid x; \theta)$ is the conditional probability of the summary given the source text, computed by the encoder-decoder model, and $\cos(V_s, V_t)$ is the cosine similarity of the semantic vectors $V_s$ and $V_t$. This term tries to maximize the semantic relevance between the source input and the target output, and $\lambda$ is a hyper-parameter that balances the two terms.

We use the Adam optimization method to train the model, with the default hyper-parameters: the learning rate $\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 1 \times 10^{-8}$.
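A sketch of the training objective in Eq. (18) is given below, combining the word-level negative log-likelihood with the semantic relevance term. The weight lam plays the role of $\lambda$ (its default here mirrors the 0.0001 setting reported in Section 4.2), and the padding id is an assumption.

```python
import torch
import torch.nn.functional as F

def srb_loss(log_probs, targets, v_summary, v_text, lam=1e-4, pad_id=0):
    """Eq. 18: negative log-likelihood minus lambda * cos(V_s, V_t).

    log_probs: (batch, M, vocab) log p(y_t | y_<t, x) from the decoder
    targets:   (batch, M) gold word indices
    """
    nll = F.nll_loss(log_probs.view(-1, log_probs.size(-1)),
                     targets.view(-1), ignore_index=pad_id)
    cos = F.cosine_similarity(v_summary, v_text, dim=-1).mean()
    return nll - lam * cos

# Adam with its default hyper-parameters, as stated above:
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
#                              betas=(0.9, 0.999), eps=1e-8)
```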

4 Experiments

In this section, we present the evaluation of our model and show its performance on three popular corpora. Besides, we perform a case study to illustrate the semantic relevance between the generated summary and the source text.

4.1 Datasets

We introduce a Chinese text summarization dataset and two popular text simplification datasets. The simplification datasets are both built from alignments between the English Wikipedia website (http://en.wikipedia.org) and the Simple English Wikipedia website (http://simple.wikipedia.org). The Simple English Wikipedia is built for “the children and adults who are learning the English language”, and its articles are composed of “easy words and short sentences”. Therefore, Simple English Wikipedia is a natural public simplified text corpus, and most text simplification benchmark datasets are constructed from it.

Large Scale Chinese Short Text Summarization Dataset (LCSTS). LCSTS is constructed by Hu, Chen, and Zhu (2015). The dataset consists of more than 2.4 million text-summary pairs, constructed from a famous Chinese social media website called Sina Weibo (weibo.sina.com). It is split into three parts, with 2,400,591 pairs in PART I, 10,666 pairs in PART II, and 1,106 pairs in PART III. All the text-summary pairs in PART II and PART III are manually annotated with relevance scores ranging from 1 to 5, and we only keep pairs with scores no less than 3. Following previous work, we use PART I as the training set, PART II as the development set, and PART III as the test set.

Parallel Wikipedia Simplification Corpus (PWKP). PWKP Zhu, Bernhard, and Gurevych (2010) is a widely used benchmark for evaluating text simplification systems. It consists of aligned complex text from English Wikipedia (as of Aug. 22nd, 2009) and simple text from Simple English Wikipedia (as of Aug. 17th, 2009). The dataset contains 108,016 sentence pairs, with 25.01 words per complex sentence and 20.87 words per simple sentence on average. Following previous work Zhang and Lapata (2017), we remove duplicate sentence pairs and split the corpus into 89,042 pairs for training, 205 pairs for development, and 100 pairs for test.

English Wikipedia and Simple English Wikipedia (EW-SEW). EW-SEW is a publicly available dataset provided by Hwang et al. (2015). To build the corpus, they first align the complex-simple sentence pairs, score the semantic similarity between the complex sentence and the simple sentence, and classify each sentence pair as a good, good partial, partial, or bad match. Following previous work Nisioi et al. (2017), we discard the unclassified matches, and use the good matches and the partial matches with a scaled threshold greater than 0.45. The corpus contains about 150K good matches and 130K good partial matches. We use this corpus as the training set, and the dataset provided by Xu et al. (2016) as the development set and the test set. The development set consists of 2,000 sentence pairs, and the test set contains 359 sentence pairs. Besides, each complex sentence is paired with 8 reference simplified sentences provided by Amazon Mechanical Turk workers.

4.2 Settings

We describe the experimental details of text summarization and text simplification in turn.

Text Summarization. To alleviate the risk of word segmentation mistakes Xu and Sun (2016); Sun, Wang, and Li (2012), we use Chinese character sequences as both source inputs and target outputs. We limit the model vocabulary size to 4,000, which covers most of the common characters. Each character is represented by a randomly initialized embedding. We tune our parameters on the development set. In our model, the embedding size is 400, the hidden state size of the encoder-decoder is 500, and the size of the gated attention network is 1,000. We use the Adam optimizer to learn the model parameters, and the batch size is set to 32. The parameter $\lambda$ is 0.0001. Both the encoder and the decoder are based on LSTM units. Following previous work Hu, Chen, and Zhu (2015), our evaluation metrics are the F-scores of ROUGE: ROUGE-1, ROUGE-2, and ROUGE-L Lin and Hovy (2003).
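For reference, the ROUGE-1 F-score at the character level reduces to clipped unigram overlap between the generated summary and the reference. The reported results use the standard ROUGE toolkit; the snippet below only illustrates what the metric measures.

```python
from collections import Counter

def rouge_1_f(candidate, reference):
    """Character-level ROUGE-1 F-score between two strings of Chinese characters."""
    cand, ref = Counter(candidate), Counter(reference)
    overlap = sum((cand & ref).values())   # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```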

Text Simplification. The text simplification datasets contain many named entities, which make the vocabulary very large. To reduce the vocabulary size, we follow the setting of Zhang and Lapata (2017). We recognize the named entities with the Stanford CoreNLP tagger Manning et al. (2014), and replace them with anonymous symbols NE@N, where NE ∈ {PER, LOC, ORG, MISC} is the entity type and N indexes the entity in the sentence. To limit the vocabulary size, we prune the vocabulary to the 50,000 most frequent words, and replace the remaining words with the UNK symbol. At test time, we replace each UNK symbol with the source word that has the highest score in the attention alignment matrix, following Jean et al. (2015). We filter out sentence pairs whose lengths exceed 100 words in the training set. The encoder is implemented as an LSTM, and the decoder is an LSTM with Luong-style attention Luong, Pham, and Manning (2015). We tune the hyper-parameters on the development set. The model has two LSTM layers. The hidden size of the LSTM is 256, and the embedding size is 256. We use the Adam optimizer Kingma and Ba (2014) to learn the parameters, and the batch size is set to 64. We set the dropout rate Srivastava et al. (2014) to 0.4. All of the gradients are clipped when the norm exceeds 5. The evaluation metric is the BLEU score.
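The UNK post-processing step described above can be sketched as follows: for every UNK produced by the decoder, copy the source word with the highest attention weight at that decoding step. The function is a hypothetical illustration of the procedure from Jean et al. (2015), not the released implementation.

```python
def replace_unk(output_tokens, source_tokens, attention, unk_token="UNK"):
    """Replace each UNK in the output with the most-attended source word.

    attention[t][i] is the attention weight on source position i when the
    decoder generated output position t.
    """
    replaced = []
    for t, token in enumerate(output_tokens):
        if token == unk_token:
            best_i = max(range(len(source_tokens)), key=lambda i: attention[t][i])
            token = source_tokens[best_i]
        replaced.append(token)
    return replaced
```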

4.3 Baseline Systems

Seq2seq. We first compare our model with a basic sequence-to-sequence model Sutskever, Vinyals, and Le (2014). It is a widely used model for text generation, so it is an important baseline.

Seq2seq-Attention. Seq2seq-Attention Bahdanau, Cho, and Bengio (2014) is a sequence-to-sequence framework with neural attention. The attention mechanism helps capture the context information of source texts, so this model is a stronger baseline system.

4.4 Results

Model ROUGE-1 ROUGE-2 ROUGE-L
Seq2seq (W) Hu, Chen, and Zhu (2015) 17.7 8.5 15.8
Seq2seq (C) Hu, Chen, and Zhu (2015) 21.5 8.9 18.6
Seq2seq-Attention (W) Hu, Chen, and Zhu (2015) 26.8 16.1 24.1
Seq2seq-Attention (C) Hu, Chen, and Zhu (2015) 29.9 17.4 27.2
Seq2seq-Attention (C) (our implementation) 30.1 17.9 27.2
SRB (C) (our proposal) 33.3 20.0 30.1
Table 2: Results of our model and the baseline systems. Our model achieves substantial improvements in all ROUGE scores over the baseline systems. The results are reported on the test set. (W: Word level; C: Character level).
 PWKP BLEU
 Seq2seq-Attention Nisioi et al. (2017) 47.52
 Seq2seq-Attention-w2v Nisioi et al. (2017) 48.10
 Seq2seq-Attention (our implementation) 48.26
 SRB (our proposal) 50.18
 EW-SEW BLEU
 Seq2seq-Attention Nisioi et al. (2017) 84.70
 Seq2seq-Attention-w2v Nisioi et al. (2017) 87.50
 Seq2seq-Attention (our implementation) 88.97
 SRB (our proposal) 89.84
Table 3: Comparison of our model with recent neural models for text simplification. Our model achieves substantial improvements in BLEU score over the baseline systems. The results are reported on the test sets.
Model ROUGE-1 ROUGE-2 ROUGE-L
Seq2seq (W) Hu, Chen, and Zhu (2015) 17.7 8.5 15.8
Seq2seq (C) Hu, Chen, and Zhu (2015) 21.5 8.9 18.6
Seq2seq-Attention (W) Hu, Chen, and Zhu (2015) 26.8 16.1 24.1
Seq2seq-Attention (C) Hu, Chen, and Zhu (2015) 29.9 17.4 27.2
COPYNET (C) Gu et al. (2016) 35.0 22.3 32.0
SRB (C) (our proposal) 33.3 20.0 30.1
Table 4: Results of our model and state-of-the-art systems. COPYNET incorporates the copying mechanism to solve the out-of-vocabulary problem, so it has higher ROUGE scores. Our model does not currently incorporate this mechanism; in future work, we will implement this technique to further improve the performance. The results are reported on the test set. (W: Word level; C: Character level).
PWKP BLEU
NTS Nisioi et al. (2017) 47.52
NTS-w2v Nisioi et al. (2017) 48.10
DRESS Zhang and Lapata (2017) 34.53
DRESS-LS Zhang and Lapata (2017) 36.32
SRB (our proposal) 50.18
EW-SEW BLEU
PBMT-R Wubben, van den Bosch, and Krahmer (2012) 67.79
SBMT-SARI Xu et al. (2016) 73.62
NTS Nisioi et al. (2017) 84.70
NTS-w2v Nisioi et al. (2017) 87.50
DRESS Zhang and Lapata (2017) 77.18
DRESS-LS Zhang and Lapata (2017) 80.12
SRB (our proposal) 89.84
Table 5: Results of our model and state-of-the-art systems. SRB achieves the best BLEU scores compared with the related systems on PWKP and EW-SEW. The results are reported on the test sets.
Text 仔细一算,上海的互联网公司不乏成功案例,但最终成为BAT一类巨头的几乎没有,这也能解释为何纳税百强的榜单中鲜少互联网公司的身影。有一类是被并购,比如:易趣、土豆网、PPS、PPTV、一号店等;有一类是数年偏安于细分市场。
With careful calculation, there are many successful Internet companies in Shanghai, but few of them have become giants like BAT, which also explains why Internet companies rarely appear in the list of the top hundred taxpayers. Some of them were acquired, such as Ebay, Tudou, PPS, PPTV, Yihaodian and so on; others have been content with niche segment markets for years.
Reference 为什么上海出不了互联网巨头?
Why hasn't Shanghai produced an Internet giant?
Seq2seq-A 上海的互联网巨头。
Shanghai’s giant company.
SRB 上海鲜少互联网巨头的身影。
Shanghai has few giant companies.
Table 6: An example of an SRB-generated summary on the LCSTS dataset, compared with the output of Seq2seq-Attention and the reference.
Source Depending on the context, another closely-related meaning of constituent is that of a citizen residing in the area governed, represented, or otherwise served by a politician; sometimes this is restricted to citizens who elected the politician.
Reference The word constituent can also be used to refer to a citizen who lives in the area that is governed, represented, or otherwise served by a politician; sometimes the word is restricted to citizens who elected the politician.
NTS Depending on the context, another closely-related meaning of constituent is that of a citizen living in the area governed, represented, or otherwise served by a politician; sometimes this is restricted to citizens who elected the politician.
NTS-w2v This is restricted to citizens who elected the politician.
PBMT-R Depending on the context and meaning of closely-related siemens-martin -rrb- is a citizen living in the area, or otherwise, was governed by a 1924-1930 shurba; this is restricted to people who elected it.
SBMT-SARI In terms of the context, another closely-related sense of the component is that of a citizen living in the area covered, make up, or if not, served by a policy; sometimes this is limited to the people who elected the policy.
SRB Depending on the context, another closely-related meaning of constituent is that of a citizen living in the area governed, represented, or otherwise served by a politician; sometimes the word is restricted to citizens who elected the politician.
Table 7: Examples of different text simplification system outputs on the EW-SEW dataset. Differences from the source text are shown in bold.

We compare our model with the above baseline systems, Seq2seq and Seq2seq-Attention. We refer to our proposed Semantic Relevance Based neural model as SRB. Table 2 shows the results of our model and the baseline systems on LCSTS. As shown in Table 2, the models at the character level achieve better performance than the models at the word level, so we implement our model at the character level. For a fair comparison, we also implement a Seq2seq-Attention model following the details in the previous work Hu, Chen, and Zhu (2015). Our implementation of Seq2seq-Attention achieves better scores, mainly because we tune the hyper-parameters well on the development set. SRB outperforms both Seq2seq and Seq2seq-Attention, with F-scores of 33.3 ROUGE-1, 20.0 ROUGE-2, and 30.1 ROUGE-L.

Table 3 shows the results on the two text simplification corpora. We compare our model with Seq2seq-Attention and Seq2seq-Attention-w2v, which is a Seq2seq-Attention model with pretrained word embeddings. We also implement a Seq2seq-Attention model and carefully tune it on the development set. Our implementation gets a 48.26 BLEU score on PWKP and an 88.97 BLEU score on EW-SEW. Our SRB outperforms all of the baseline systems, with a BLEU score of 50.18 on PWKP and 89.84 on EW-SEW.

Table 4 summarizes the results of our model and the state-of-the-art systems. COPYNET has the highest scores, because it incorporates the copying mechanism to deal with the out-of-vocabulary problem. In this paper, we do not implement this mechanism in our model. Our model can also be improved with such additional techniques, which, however, are not the focus of this paper.

We also compare SRB with other models for text simplification, which are not limited to neural models. Table 5 summarizes the results of SRB and the related systems. On the PWKP dataset, we compare SRB with NTS, NTS-w2v, DRESS, and DRESS-LS. We run the publicly released code of NTS and NTS-w2v provided by Nisioi et al. (2017), and get BLEU scores of 47.52 and 48.10, respectively. As for DRESS and DRESS-LS, we use the scores reported by Zhang and Lapata (2017). The goal of DRESS is not to generate outputs closer to the references, so the BLEU scores of DRESS and DRESS-LS are relatively lower than those of NTS and NTS-w2v. SRB achieves a BLEU score of 50.18, outperforming all of the previous systems. On the EW-SEW dataset, we compare SRB with PBMT-R, SBMT-SARI, and the neural models described above. We did not find any publicly released code of PBMT-R and SBMT-SARI. Fortunately, Xu et al. (2016) provide the predictions of PBMT-R and SBMT-SARI on the EW-SEW test set, so we can compare our model with these systems. The results show that the neural models perform better in BLEU, and SRB achieves the best BLEU score of 89.84.

4.5 Case Study

Table 6 shows an example of the semantic relevance between the source text and the summary. The main idea of the source text is the reason why Shanghai has few giant Internet companies. Seq2seq-Attention produces “Shanghai's giant company”, which is literally similar to the source text, while SRB generates “Shanghai has few giant companies”, which is closer to the main idea in semantics. This suggests that SRB produces summaries with higher semantic relevance to the source texts.

Table 7 shows examples of different text simplification system outputs on EW-SEW. NTS-w2v omits so many words that it loses much of the information. PBMT-R generates some irrelevant tokens, like 'siemens-martin', '-rrb-', and 'shurba', which hurt the fluency and adequacy of the generated sentence. SBMT-SARI is able to generate a fluent sentence, but the meaning differs from the source text and is even more difficult to understand. Compared with the statistical models, SRB generates a more fluent sentence. Besides, SRB improves the semantic relevance between the source texts and the generated texts, so the generated sentence is semantically correct and very close to the original meaning.

5 Related Work

Abstractive text summarization has achieved successful performance thanks to the sequence-to-sequence model Sutskever, Vinyals, and Le (2014) and the attention mechanism Bahdanau, Cho, and Bengio (2014). Rush, Chopra, and Weston (2015) first used an attention-based encoder to compress texts and a neural network language decoder to generate summaries. Following this work, recurrent encoders were introduced to text summarization and gained better performance Lopyrev (2015); Chopra, Auli, and Rush (2016). For Chinese texts, Hu, Chen, and Zhu (2015) built a large corpus for Chinese short text summarization. To deal with the unknown word problem, Nallapati et al. (2016) proposed a generator-pointer model so that the decoder is able to generate words from the source texts, and Gu et al. (2016) also addressed this issue by incorporating a copying mechanism. Besides, Ayana et al. (2016) propose a minimum risk training method which optimizes the parameters with respect to the ROUGE scores.

Zhu, Bernhard, and Gurevych (2010) construct a Wikipedia dataset and propose a tree-based simplification model, which is the first statistical simplification model that integrally covers splitting, dropping, reordering, and substitution. Woodsend and Lapata (2011) introduce a data-driven model based on quasi-synchronous grammar, which captures structural mismatches and complex rewrite operations. Wubben, van den Bosch, and Krahmer (2012) present a method for text simplification using phrase-based machine translation with re-ranking of the outputs. Kauchak (2013) proposes a text simplification corpus, and evaluates language modeling for text simplification on the proposed corpus.

Narayan and Gardent (2014) propose a hybrid approach to sentence simplification which combines deep semantics and monolingual machine translation. Hwang et al. (2015) introduce a parallel simplification corpus by evaluating the similarity between the source text and the simplified text based on WordNet. Glavaš and Štajner (2015) propose an unsupervised approach to lexical simplification that makes use of word vectors and requires only regular corpora. Xu et al. (2016) design automatic metrics for text simplification, and introduce a statistical machine translation system tuned with the proposed automatic metrics.

Recently, most work focuses on the neural sequence-to-sequence model. Nisioi et al. (2017) present a sequence-to-sequence model and re-rank the predictions with BLEU and SARI. Zhang and Lapata (2017) propose a deep reinforcement learning model to improve the simplicity, fluency, and adequacy of the simplified texts. Cao et al. (2017) introduce a novel sequence-to-sequence model that joins copying and restricted generation for text simplification.

Our work is also related to the encoder-decoder framework Cho et al. (2014) and the attention mechanism Bahdanau, Cho, and Bengio (2014). The encoder-decoder framework, such as the sequence-to-sequence model, has achieved success in machine translation Sutskever, Vinyals, and Le (2014); Jean et al. (2015); Luong, Pham, and Manning (2015), text summarization Rush, Chopra, and Weston (2015); Chopra, Auli, and Rush (2016); Nallapati et al. (2016); Cao et al. (2016), and other natural language processing tasks. The neural attention model was first proposed by Bahdanau, Cho, and Bengio (2014), and there are many other methods to improve the neural attention model Jean et al. (2015); Luong, Pham, and Manning (2015).

6 Conclusion

In this work, our goal is to improve the semantic relevance between source texts and generated simplified texts for text summarization and text simplification. To achieve this goal, we propose a Semantic Relevance Based neural network model (SRB). A similarity evaluation component is introduced to measure the relevance of the source texts and the generated texts, and during training we maximize the similarity score to encourage high semantic relevance between source texts and simplified texts. In order to better represent a long source text, we introduce a self-gated attention encoder to memorize the input text. We conduct experiments on three corpora, namely LCSTS, PWKP, and EW-SEW. Experiments show that our proposed model performs better than the state-of-the-art systems on the benchmark corpora.

Acknowledgements

This work was supported in part by National Natural Science Foundation of China (No. 61673028), and an Okawa Research Grant (2016). Xu Sun is the corresponding author of this paper. This work is a substantial extension of the conference version presented at ACL 2017 Ma et al. (2017).


References

  • Ayana et al. (2016) Ayana, Shiqi Shen, Zhiyuan Liu, and Maosong Sun. 2016. Neural headline generation with minimum risk training. CoRR, abs/1604.01904.
  • Bahdanau, Cho, and Bengio (2014) Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.
  • Cao et al. (2016) Cao, Ziqiang, Wenjie Li, Sujian Li, Furu Wei, and Yanran Li. 2016. Attsum: Joint learning of focusing and summarization with neural attention. In COLING 2016, 26th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, December 11-16, 2016, Osaka, Japan, pages 547–556.
  • Cao et al. (2017) Cao, Ziqiang, Chuwei Luo, Wenjie Li, and Sujian Li. 2017. Joint copying and restricted generation for paraphrase. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pages 3152–3158.
  • Cao et al. (2015) Cao, Ziqiang, Furu Wei, Sujian Li, Wenjie Li, Ming Zhou, and Houfeng Wang. 2015. Learning summary prior representation for extractive summarization. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 2: Short Papers, pages 829–833.
  • Cheng and Lapata (2016) Cheng, Jianpeng and Mirella Lapata. 2016. Neural summarization by extracting sentences and words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers.
  • Cho et al. (2014) Cho, Kyunghyun, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, pages 1724–1734.
  • Chopra, Auli, and Rush (2016) Chopra, Sumit, Michael Auli, and Alexander M. Rush. 2016. Abstractive sentence summarization with attentive recurrent neural networks. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 93–98.
  • Filippova et al. (2015) Filippova, Katja, Enrique Alfonseca, Carlos A. Colmenares, Lukasz Kaiser, and Oriol Vinyals. 2015. Sentence compression by deletion with lstms. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP, pages 360–368.
  • Glavaš and Štajner (2015) Glavaš, Goran and Sanja Štajner. 2015. Simplifying lexical simplification: Do we need simplified corpora? In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, ACL, pages 63–68.
  • Gu et al. (2016) Gu, Jiatao, Zhengdong Lu, Hang Li, and Victor O. K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016.
  • Hochreiter and Schmidhuber (1997) Hochreiter, Sepp and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
  • Hu, Chen, and Zhu (2015) Hu, Baotian, Qingcai Chen, and Fangze Zhu. 2015. LCSTS: A large scale chinese short text summarization dataset. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pages 1967–1972.
  • Hwang et al. (2015) Hwang, William, Hannaneh Hajishirzi, Mari Ostendorf, and Wei Wu. 2015. Aligning sentences from standard wikipedia to simple wikipedia. In NAACL HLT 2015, pages 211–217.
  • Jean et al. (2015) Jean, Sébastien, KyungHyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. On using very large target vocabulary for neural machine translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, ACL 2015, pages 1–10.
  • Kauchak (2013) Kauchak, David. 2013. Improving text simplification language modeling using unsimplified text data. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, ACL, pages 1537–1546.
  • Kingma and Ba (2014) Kingma, Diederik P. and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.
  • Lin and Hovy (2003) Lin, Chin-Yew and Eduard H. Hovy. 2003. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, HLT-NAACL 2003.
  • Lopyrev (2015) Lopyrev, Konstantin. 2015. Generating news headlines with recurrent neural networks. CoRR, abs/1512.01712.
  • Luong, Pham, and Manning (2015) Luong, Thang, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, pages 1412–1421.
  • Ma et al. (2017) Ma, Shuming, Xu Sun, Jingjing Xu, Houfeng Wang, Wenjie Li, and Qi Su. 2017. Improving semantic relevance for sequence-to-sequence learning of chinese social media text summarization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 2: Short Papers, pages 635–640.
  • Manning et al. (2014) Manning, Christopher D., Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. 2014. The stanford corenlp natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL, pages 55–60.
  • Nallapati et al. (2016) Nallapati, Ramesh, Bowen Zhou, Cícero Nogueira dos Santos, Çaglar Gülçehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence rnns and beyond. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, CoNLL 2016, Berlin, Germany, August 11-12, 2016, pages 280–290.
  • Narayan and Gardent (2014) Narayan, Shashi and Claire Gardent. 2014. Hybrid simplification using deep semantics and machine translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL, pages 435–445.
  • Nisioi et al. (2017) Nisioi, Sergiu, Sanja Stajner, Simone Paolo Ponzetto, and Liviu P. Dinu. 2017. Exploring neural text simplification models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL, pages 85–91.
  • Radev et al. (2004) Radev, Dragomir R., Timothy Allison, Sasha Blair-Goldensohn, John Blitzer, Arda Çelebi, Stanko Dimitrov, Elliott Drábek, Ali Hakim, Wai Lam, Danyu Liu, Jahna Otterbacher, Hong Qi, Horacio Saggion, Simone Teufel, Michael Topper, Adam Winkel, and Zhu Zhang. 2004. MEAD - A platform for multidocument multilingual text summarization. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004.
  • Rush, Chopra, and Weston (2015) Rush, Alexander M., Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pages 379–389.
  • Srivastava et al. (2014) Srivastava, Nitish, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958.
  • Sun, Wang, and Li (2012) Sun, Xu, Houfeng Wang, and Wenjie Li. 2012. Fast online training with frequency-adaptive learning rates for chinese word segmentation and new word detection. In Proceedings of ACL’12, pages 253–262.
  • Sutskever, Vinyals, and Le (2014) Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, pages 3104–3112.
  • Wang and Chang (2016) Wang, Wenhui and Baobao Chang. 2016. Graph-based dependency parsing with bidirectional LSTM. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL.
  • Woodsend and Lapata (2011) Woodsend, Kristian and Mirella Lapata. 2011. Learning to simplify sentences with quasi-synchronous grammar and integer programming. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, EMNLP, pages 409–420.
  • Wubben, van den Bosch, and Krahmer (2012) Wubben, Sander, Antal van den Bosch, and Emiel Krahmer. 2012. Sentence simplification by monolingual machine translation. In The 50th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, pages 1015–1024.
  • Xu and Sun (2016) Xu, Jingjing and Xu Sun. 2016. Dependency-based gated recursive neural network for chinese word segmentation. In Meeting of the Association for Computational Linguistics, pages 567–572.
  • Xu et al. (2016) Xu, Wei, Courtney Napoles, Ellie Pavlick, Quanze Chen, and Chris Callison-Burch. 2016. Optimizing statistical machine translation for text simplification. TACL, 4:401–415.
  • Zhang and Lapata (2017) Zhang, Xingxing and Mirella Lapata. 2017. Sentence simplification with deep reinforcement learning. CoRR, abs/1703.10931.
  • Zhu, Bernhard, and Gurevych (2010) Zhu, Zhemin, Delphine Bernhard, and Iryna Gurevych. 2010. A monolingual tree-based translation model for sentence simplification. In COLING 2010, pages 1353–1361.