Sequence-to-sequence Pre-training with Data Augmentation
for Sentence Rewriting

Yi Zhang      Tao Ge      Furu Wei     Ming Zhou     Xu Sun
Peking University, Beijing, China
Microsoft Research Asia, Beijing, China
{zhangyi16, xusun}@pku.edu.cn
{tage, fuwei, mingzhou}@microsoft.com
This work was done during the first author's internship at Microsoft Research Asia.
Equal contribution.
Abstract

We study sequence-to-sequence (seq2seq) pre-training with data augmentation for sentence rewriting. Instead of training a seq2seq model with gold training data and augmented data simultaneously, we separate them to train in different phases: pre-training with the augmented data and fine-tuning with the gold data. We also introduce multiple data augmentation methods to help model pre-training for sentence rewriting. We evaluate our approach in two typical well-defined sentence rewriting tasks: Grammatical Error Correction (GEC) and Formality Style Transfer (FST). Experiments demonstrate our approach can better utilize augmented data without hurting the model’s trust in gold data and further improve the model’s performance with our proposed data augmentation methods.

Our approach substantially advances the state-of-the-art results on well-recognized sentence rewriting benchmarks for both GEC and FST. Specifically, it pushes the CoNLL-2014 benchmark's F0.5 score and the JFLEG Test GLEU score to 62.61 and 63.54 in the restricted training setting, and to 66.77 and 65.22 respectively in the unrestricted setting, and advances the GYAFC benchmark's BLEU to 74.24 (2.23 absolute improvement) in the E&M domain and 77.97 (2.64 absolute improvement) in the F&R domain.

1 Introduction

Data augmentation proves effective in alleviating the issue of insufficient training data because it can improve the model's generalization ability and reduce the risk of overfitting. For sequence-to-sequence (seq2seq) learning in Natural Language Processing (NLP) tasks, previous studies Sennrich et al. (2016a); Edunov et al. (2018); Karakanta et al. (2018); Wang et al. (2018) using data augmentation tend to train with the gold data and augmented data simultaneously. Despite the effectiveness of these approaches, we find that they suffer from a limitation when applied to sentence rewriting: since simultaneous training does not discriminate between gold and augmented data, the noisy, unnecessary and even erroneous edits introduced by the augmented data tend to make the model aggressively rewrite content that should not be edited, as Figure 1 shows, which is undesirable for sentence rewriting.

Figure 1: An augmented sentence pair generated through back-translation (BT) for GEC. Though it includes useful rewriting knowledge (the underlined text) for GEC, it additionally introduces an undesirable edit (the bold text), which may lead the model to learn to rewrite content that should not be edited.

To address the issue for better utilizing the augmented data for seq2seq learning in sentence rewriting, we study seq2seq pre-training with data augmentation. Instead of training with gold and augmented data simultaneously, our approach trains the model with augmented and gold data in two phases: pre-training and fine-tuning. In the pre-training phase, we train a seq2seq model from scratch with augmented data to help the model learn contextualized representation (encoding), sentence generation (decoding) and potentially useful transformation knowledge (mapping from the source to the target); while in the fine-tuning phase, the model can fully concentrate on the gold training data. In contrast to the previous approaches that train the model with gold and augmented data in the same phase, our approach can not only learn useful information from the augmented data, but also avoid the risk of being overwhelmed and adversely affected by the augmented data.

Moreover, we introduce three data augmentation methods to help seq2seq pre-training for sentence rewriting: back translation, feature discrimination and multi-task transfer, which are helpful in improving the model’s generalization ability, and also introduce additional rewriting knowledge, as depicted in Figure 2.

We evaluate our approach on two typical well-defined sentence rewriting tasks: Grammatical Error Correction (GEC) and Formality Style Transfer (FST). Experiments show our approach is more effective at utilizing the various augmented data and significantly improves the model, obtaining state-of-the-art results in both tasks.

Figure 2: An example that Formality Style Transfer (FST) benefits from data augmented via feature discrimination (F-Dis) and multi-task transfer (M-Task). F-Dis identifies useful sentence pairs whose target’s formality score (the numbers in the parentheses) is higher than the source, from paraphrase sentences generated by cross-lingual MT, while M-Task utilizes training data for GEC to help formality improvement.

Our contributions are summarized as follows:

  • We study seq2seq pre-training with data augmentation by empirically comparing it with other training paradigms, confirming its effectiveness and advantages for sentence rewriting tasks.

  • We introduce multiple data augmentation ideas for sentence rewriting, which can improve the quality and diversity of the augmented data and introduce additional rewriting knowledge to benefit model pre-training.

  • Our approach substantially advances the state-of-the-art on all three important benchmarks (CoNLL-2014, JFLEG and GYAFC) for the GEC and FST sentence rewriting tasks.

2 Background

2.1 Sequence-to-sequence learning

Sequence-to-sequence (seq2seq) learning Sutskever et al. (2014); Cho et al. (2014) has achieved tremendous success in various NLP tasks. Given a source sentence $x$, a seq2seq model learns to generate its target sentence $y$. The model is usually trained by maximizing the log-likelihood of the training source-target sentence pairs:

$\mathcal{L}(\theta) = \sum_{(x, y) \in \mathcal{D}} \log P(y \mid x; \theta)$    (1)

where $\mathcal{D}$ denotes the training set (i.e., source-target parallel sentence pairs) and $\theta$ denotes the parameters of the model.

During inference, the decoder generates the output autoregressively by maximizing $P(y \mid x; \theta)$:

$\hat{y} = \operatorname*{arg\,max}_{y} P(y \mid x; \theta)$    (2)
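
As a concrete illustration of these two equations, the following is a minimal PyTorch sketch, assuming a decoder that already produces per-token logits under teacher forcing; the random tensors merely stand in for a real model's outputs:

```python
import torch
import torch.nn.functional as F

def mle_loss(logits, target, pad_id=0):
    """Negative log-likelihood corresponding to Eq. (1), summed over target tokens.

    logits: (batch, tgt_len, vocab) decoder outputs under teacher forcing.
    target: (batch, tgt_len) gold target token ids.
    """
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        target.reshape(-1),
        ignore_index=pad_id,   # padding tokens do not contribute to the loss
        reduction="sum",
    )

# Toy usage: random logits stand in for a real decoder's outputs.
logits = torch.randn(2, 5, 1000)         # batch of 2, 5 target positions, vocab of 1,000
target = torch.randint(1, 1000, (2, 5))
print(mle_loss(logits, target))

# Greedy approximation of Eq. (2): pick the most likely token at each position
# (beam search is typically used in practice).
prediction = logits.argmax(dim=-1)       # (batch, tgt_len)
```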

2.2 Data Augmentation

To train a well-performing neural network, sufficient training data is indispensable. However, many tasks lack annotated training data. As a result, the model may suffer from unsatisfactory generalization ability, as well as limited robustness against perturbations outside the training data.

To improve the model’s generalization ability, data augmentation is employed to enrich the training set with additional augmented data which is usually artificially generated:

$\mathcal{D}' = \mathcal{D} \cup \mathcal{D}_{aug}$    (3)

where $\mathcal{D}$ and $\mathcal{D}_{aug}$ denote the original training set and the augmented training data respectively.

3 Approach

3.1 Seq2seq Pre-training & Fine-tuning

(a) Simultaneous Training
(b) Pre-training & Fine-tuning
Figure 3: Comparison between (a) Simultaneous Training and (b) Pre-training & Fine-tuning framework.

In general, massive augmented data can help a seq2seq model to learn contextualized representations, sentence generation and source-target alignments. However, it is usually noisier and less valuable than gold training data. In simultaneous training (Figure 3(a)), the massive augmented data tends to overwhelm the gold data and introduce unnecessary and even erroneous editing knowledge, which is undesirable for sentence rewriting.

To better exploit the augmented data, we propose to first pre-train the model with augmented data and then fine-tune the model with gold training data (Figure 3(b)). In our pre-training & fine-tuning approach, the augmented data is not treated equally to the gold data; instead it only serves as prior knowledge that can be updated and even overwritten during the fine-tuning phase. Then the model can better learn from the gold data without being overwhelmed or distracted by the augmented data. Moreover, separating the augmented and gold data into different training phases makes the model become more tolerant to noise in augmented data, which reduces the quality requirement for the augmented data and enables the model to use noisier augmented data and even training data from other tasks (See Section 3.2.3).
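
To make the two-phase schedule concrete, below is a schematic Python sketch. The train_steps helper is only a placeholder for a standard seq2seq training loop (not the authors' actual code), and the step counts, learning rates and dropout values shown are the GEC settings reported later in Section 4.1.1:

```python
def train_steps(model, data, steps, lr, dropout):
    """Placeholder for an ordinary seq2seq training loop over `data`."""
    print(f"training {model} on {data}: {steps} steps, lr={lr}, dropout={dropout}")
    return model

def pretrain_then_finetune(model, augmented_data, gold_data):
    # Phase 1: pre-train from scratch on the (noisier) augmented pairs, so the
    # model learns encoding, decoding and rough source-to-target mappings.
    model = train_steps(model, augmented_data, steps=200_000, lr=5e-4, dropout=0.3)
    # Phase 2: fine-tune on the gold pairs only, with a smaller learning rate,
    # so that the gold data can update or even overwrite the prior knowledge
    # acquired from the augmented data.
    model = train_steps(model, gold_data, steps=50_000, lr=1e-4, dropout=0.2)
    return model

pretrain_then_finetune("seq2seq", "augmented pairs", "gold pairs")
```

In simultaneous training (Figure 3(a)), by contrast, the two data sources would be mixed into a single training stream.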

3.2 Data Augmentation for Text Rewriting

We study three data augmentation methods for seq2seq sentence rewriting: back translation (Section 3.2.1), feature discrimination (Section 3.2.2) and multi-task transfer (Section 3.2.3).

3.2.1 Back translation

The original idea of back translation Sennrich et al. (2016a) is to train a target-to-source seq2seq model using bi-lingual parallel corpora, and then use the model to generate source-language sentences from target-side monolingual sentences, yielding synthetic parallel sentence pairs.

Although back translation is originally proposed for machine translation (MT), it can be easily generalized to sentence rewriting tasks where source and target are in the same language. In this paper, we use back translation as our basic data augmentation method.
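
As a sketch of how back translation produces augmented pairs for a rewriting task, the snippet below assumes a trained target-to-source model exposed through a hypothetical translate(sentence, n_best) method; the dummy class only stands in for such a model:

```python
from typing import List, Tuple

class DummyTargetToSourceModel:
    """Stand-in for a trained target-to-source model (e.g., correct -> errorful for GEC)."""
    def translate(self, sentence: str, n_best: int = 1) -> List[str]:
        # A real model would return n_best noisy rewrites of `sentence`.
        return [sentence] * n_best

def back_translate(targets: List[str], model) -> List[Tuple[str, str]]:
    """Turn target-side monolingual sentences into synthetic (source, target) pairs."""
    pairs = []
    for y in targets:
        synthetic_source = model.translate(y, n_best=1)[0]
        pairs.append((synthetic_source, y))   # synthetic source, clean target
    return pairs

print(back_translate(["She goes to school every day ."], DummyTargetToSourceModel()))
```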

3.2.2 Feature discrimination

For a well-defined sentence rewriting task, the target sentence is usually expected to improve the source sentence in some aspects without changing its meaning. For instance, for GEC, the target sentence should be grammatically correct, more fluent and native-sounding than the source sentence; while for FST, the target sentence should look more formal than the source sentence. With this motivation, we propose feature discrimination which identifies valuable sentence pairs using a feature-based discriminator from paraphrased sentences for a specific sentence rewriting task.

To make it easy to understand, we present two examples of feature discrimination for data augmentation in GEC and FST respectively.

Figure 4: Fluency discrimination for GEC. The paraphrased sentences are generated by a back translation model trained on GEC parallel data. The scores right after the sentences are their fluency scores. The fluency discriminator only chooses the sentence whose fluency score is lower than the correct sentence and pairs them (by the red dashed arrow) as augmented data.

GEC: Fluency discrimination

For GEC, data augmentation creates new parallel sentence pairs by deriving a source sentence containing grammatical errors from small modifications to a correct sentence. However, the derived source sentence sometimes has no grammatical issues; instead, it is just a paraphrase of the correct sentence. Training with such pairs will make the model prone to edit a sentence even if it has no grammatical issues, which may introduce unnecessary and undesirable edits, as depicted in Figure 4.

To address this challenge, we use fluency discrimination, whose idea was first proposed by Ge et al. (2018a). Fluency discrimination is motivated by the observation that the correct target sentence should be more fluent than the source sentence. It uses a discriminator to evaluate sentences' fluency, defined in Eq. (4) by Ge et al. (2018a), and only chooses sentences whose fluency score is lower than that of the correct sentence, pairing them as augmented data. In this way, the undesirable sentence pairs for GEC can be filtered out, as shown in Figure 4.

$f(x) = \frac{1}{1 + H(x)}$    (4)
$H(x) = -\frac{1}{|x|}\sum_{i=1}^{|x|} \log P(x_i \mid x_{<i})$    (5)

where $f(x)$ is the fluency score of sentence $x$, $P(x_i \mid x_{<i})$ is the probability of $x_i$ given the context $x_{<i}$, computed by a pre-trained language model, and $|x|$ is the length of sentence $x$.
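
A small Python sketch of this score is given below, assuming the per-token log-probabilities log P(x_i | x_<i) have already been obtained from a pre-trained language model (here passed in as a plain list):

```python
import math
from typing import List

def fluency(token_logprobs: List[float]) -> float:
    """f(x) = 1 / (1 + H(x)) as in Eqs. (4)-(5)."""
    h = -sum(token_logprobs) / len(token_logprobs)   # Eq. (5): per-token cross-entropy
    return 1.0 / (1.0 + h)                           # Eq. (4)

def keep_as_augmented_pair(candidate_logprobs, correct_logprobs) -> bool:
    """Fluency discrimination: keep the pair only if the generated candidate
    is less fluent than the correct sentence, as illustrated in Figure 4."""
    return fluency(candidate_logprobs) < fluency(correct_logprobs)

# Toy usage: the sentence whose tokens the language model finds more probable
# receives the higher fluency score.
fluent = [math.log(0.4), math.log(0.3), math.log(0.5)]
disfluent = [math.log(0.05), math.log(0.1), math.log(0.2)]
print(fluency(fluent), fluency(disfluent))
print(keep_as_augmented_pair(disfluent, fluent))   # True -> usable augmented pair
```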

FST: Formality discrimination

For FST, we propose a novel data augmentation method called formality discrimination. The idea, depicted in Figure 2, is motivated by the observation that cross-lingual machine translation (MT) often changes the formality of a sentence.

We collect a number of informal English sentences from Twitter and online forums, denoted as $X = \{x^{(i)}\}_{i=1}^{N}$ where $x^{(i)}$ denotes the $i$-th sentence. We first translate them into a pivot language (e.g., Chinese) using Google Translate (https://translate.google.com/) and then translate them back into English, as Figure 2 shows. In this way, we obtain a rewritten sentence $\hat{y}^{(i)}$ for each sentence $x^{(i)}$.

To verify whether $\hat{y}^{(i)}$ improves the formality of $x^{(i)}$, we introduce a formality discriminator, which is a binary classifier trained with formal text (e.g., news) and informal text (e.g., tweets) to quantify the formality level of a sentence. If the discriminator finds that $\hat{y}^{(i)}$ largely improves the formality of $x^{(i)}$, then $(x^{(i)}, \hat{y}^{(i)})$ will be selected as augmented data:

$\mathcal{D}_{aug} = \{(x^{(i)}, \hat{y}^{(i)}) \mid P_{formal}(\hat{y}^{(i)}) - P_{formal}(x^{(i)}) > \sigma\}$    (6)

where $P_{formal}(s)$ is the probability of sentence $s$ being formal, predicted by the discriminator, and $\sigma$ is the threshold for augmented data selection.

With this method, we can obtain a large amount of augmented data with valuable rewriting knowledge for FST that is not covered by the original training data, which helps the model generalize.
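
The selection rule in Eq. (6) can be sketched as follows; the formality discriminator is represented by a plain callable returning P_formal(s), the toy scorer at the bottom is purely illustrative, and the default threshold value 0.5 is the one used later in Section 4.2.1:

```python
from typing import Callable, List, Tuple

def select_formality_pairs(
    pairs: List[Tuple[str, str]],            # (informal x, round-trip rewrite y_hat)
    formality_prob: Callable[[str], float],  # discriminator: P_formal(s)
    sigma: float = 0.5,                      # selection threshold in Eq. (6)
) -> List[Tuple[str, str]]:
    """Keep (x, y_hat) only if the rewrite clearly improves formality."""
    return [
        (x, y_hat)
        for x, y_hat in pairs
        if formality_prob(y_hat) - formality_prob(x) > sigma
    ]

# Toy usage with a fake discriminator that treats longer sentences as more formal.
fake_prob = lambda s: min(1.0, len(s.split()) / 8)
pairs = [("wanna go?", "Would you like to go with me?")]
print(select_formality_pairs(pairs, fake_prob))
```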

3.2.3 Multi-task transfer

In addition to back translation and feature discrimination, which use artificially generated sentence pairs for data augmentation, we introduce multi-task transfer, which uses annotated data from other seq2seq tasks that may involve useful rewriting knowledge as augmented data to benefit the target sentence rewriting task. A typical example is shown in Figure 2, in which GEC annotated data provides knowledge that helps the model correct grammatical errors in the input informal sentence for the FST task.

For multi-task transfer, the augmented data from other tasks is supplementary and should not distract the model from the gold data. Fortunately, our pre-training & fine-tuning approach allows the model to absorb useful knowledge from the augmented data without hurting the trust in the gold data. Therefore, we can introduce annotated data from other tasks that can potentially benefit the target rewriting task as augmented data to help pre-train the model.

4 Experiments

We use the Transformer Vaswani et al. (2017) as our default seq2seq model and evaluate our approach in two important well-defined sentence rewriting tasks: Grammatical Error Correction (GEC) and Formality Style Transfer (FST), both of which have high-quality benchmark datasets with reliable references from multiple human annotators and evaluation metrics.

4.1 GEC Evaluation

4.1.1 Setting

We test our approach on two well-known GEC benchmarks: CoNLL-2014 Ng et al. (2014) and JFLEG Napoles et al. (2017). CoNLL-2014 contains 1,312 test sentences while JFLEG contains 747 test sentences. Consistent with the official evaluation metrics, we use MaxMatch (M2) Precision, Recall and F0.5 Dahlmeier and Ng (2012) for CoNLL-2014 and GLEU Napoles et al. (2015) for JFLEG evaluation. As in previous studies, we use the CoNLL-2013 test set and the JFLEG dev set as our development sets for CoNLL-2014 and JFLEG respectively. Since most systems Sakaguchi et al. (2017); Chollampatt and Ng (2018a); Grundkiewicz and Junczys-Dowmunt (2018) use an additional spell checker to resolve spelling errors in JFLEG, we follow Ge et al. (2018a) in using the public Bing spell checker (https://azure.microsoft.com/en-us/services/cognitive-services/spell-check/) to fix spelling errors in JFLEG as preprocessing.

We follow the restricted setting where only public resources can be used, and use the public Lang-8 Mizumoto et al. (2011); Tajiri et al. (2012) and NUCLE Dahlmeier et al. (2013) datasets as gold parallel data, as most previous work did. For data augmentation, we use a combination of back translation (Section 3.2.1) and fluency discrimination (Section 3.2.2) to generate 118M augmented sentence pairs from English Wikipedia and the News Crawl corpus (2007-2013). Specifically, for a correct sentence, a back translation model trained with the public GEC data first generates the 10-best outputs; then a 5-gram language model Junczys-Dowmunt and Grundkiewicz (2016) trained on Common Crawl works as the fluency discriminator to select an output whose fluency score is lower than that of the correct sentence and pair them as augmented data.

We use the Transformer (big) of Vaswani et al. (2017) as our error correction model and back translation model, which has a 6-layer encoder and decoder with a dimensionality of 1,024 for both input and output and 4,096 for inner layers, and 16 self-attention heads. We use a shared source-target vocabulary of 30,000 BPE Sennrich et al. (2016b) tokens and train the model on 8 Nvidia V100 GPUs, using the Adam optimizer with $\beta_1$=0.9, $\beta_2$=0.98. We allow each batch to have at most 4,096 tokens per GPU. In pre-training, the learning rate is set to 0.0005 with warmup over the first 8,000 steps, then decreases proportionally to the inverse square root of the number of steps, and the dropout probability is set to 0.3; in the fine-tuning phase, the learning rate is set to 0.0001 with warmup over the first 4,000 steps and inverse square root decay after warmup, and the dropout probability is set to 0.2. We pre-train the model for 200k steps and fine-tune it for up to 50k steps. For inference, we follow Chollampatt and Ng (2018a) to generate 12-best predictions and choose the best sentence after re-ranking with edit operation and language model scores computed by the 5-gram Common Crawl language model.

4.1.2 Results

We compare our pre-training & fine-tuning approach to simultaneous training with both gold and augmented data. According to Table 1, the 118M augmented sentence pairs derived through back translation and fluency discrimination are less valuable than the gold pairs, achieving only 44.75 F0.5 and 57.54 GLEU on their own. Training with the gold data and the augmented data simultaneously does not bring large improvements; instead, it leads to a decrease of both Precision and F0.5 on the CoNLL-2014 test set. When we use up-sampling or down-sampling to balance the original and augmented data (see the balancing sketch after this paragraph), we see modest absolute improvements (roughly 1-3 points) on CoNLL and JFLEG. In contrast, our pre-training & fine-tuning approach significantly improves over the model trained with only the original data, achieving 61.11 F0.5 (+6.87) and 62.93 GLEU (+2.77), which is much larger than the improvements obtained by the simultaneous training approaches.
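
For reference, the two balancing strategies compared in Table 1 can be sketched as follows, assuming the datasets are plain Python lists of sentence pairs; this is an illustration, not the exact implementation used in our experiments:

```python
import random

def downsample(augmented, gold, seed=0):
    """Down-sampling: randomly sample augmented pairs to match the gold data size."""
    rng = random.Random(seed)
    return gold + rng.sample(augmented, k=len(gold))

def upsample(augmented, gold):
    """Up-sampling: repeat the gold data so its size matches the augmented data."""
    repeats = max(1, len(augmented) // max(1, len(gold)))
    return augmented + gold * repeats

# Toy usage with 3 gold pairs and 9 augmented pairs.
gold = [(f"src{i}", f"tgt{i}") for i in range(3)]
augmented = [(f"aug_src{i}", f"aug_tgt{i}") for i in range(9)]
print(len(downsample(augmented, gold)), len(upsample(augmented, gold)))  # 6 18
```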

Also, we confirm that fluency discrimination benefits GEC data augmentation by comparing the last two models in Table 1, because it can help filter out unnecessary and undesirable edits, which makes the augmented data more informative and helpful in improving the performance.

Model | P (CoNLL-2014) | R (CoNLL-2014) | F0.5 (CoNLL-2014) | GLEU (JFLEG)
Original data | 60.15 | 38.94 | 54.24 | 60.16
Augmented data | 48.25 | 34.68 | 44.75 | 57.54
ST | 59.27 | 39.41 | 53.84 | 60.59
ST (down-sampling) | 61.90 | 39.04 | 55.41 | 61.02
ST (up-sampling) | 64.29 | 39.18 | 56.98 | 61.37
PT&FT | 68.05 | 43.40 | 61.11 | 62.93
PT&FT (w/o F-Dis) | 67.23 | 42.94 | 60.40 | 62.42
Table 1: The performance comparison of models trained with simultaneous training (ST) and our pre-training & fine-tuning (PT&FT) approach, and the ablation test for fluency discrimination (F-Dis). For ST, down-sampling and up-sampling are used to balance the sizes of the augmented and original data: down-sampling samples the augmented data so that it has the same size as the original data, while up-sampling increases the frequency of the original data so that it matches the size of the augmented data.
System | Setting | CoNLL-2014 (F0.5) | JFLEG (GLEU)
No edit | - | - | 40.54
NUS18-CNN Chollampatt and Ng (2018a) | R | 54.79 | 57.47
NUS18-NeuQE Chollampatt and Ng (2018b) | R | 56.52 | -
Adapted-transformer Junczys-Dowmunt et al. (2018) | R | 55.8 | 59.9
SMT-NMT hybrid Grundkiewicz and Junczys-Dowmunt (2018) | R | 56.25 | 61.50
Wiki edit + Round-trip translation Lichtarge et al. (2019) | R | 60.4 | 63.3
Copy-Augmented Transformer Zhao et al. (2019) | R | 61.15 | 61.00
Our approach (R) | R | 62.61 | 63.54
Nested-RNN-seq2seq Ji et al. (2017) | U | 45.15 | 53.41
Fluency Boost Learning Ge et al. (2018b) | U | 61.34 | 61.41
Wiki edit + Round-trip translation Lichtarge et al. (2019) | U | 62.8 | 65.0
Our approach (U) | U | 66.77 | 65.22
Table 2: Comparison to the state-of-the-art GEC systems. R denotes the restricted setting where only public GEC data can be used for training, while U denotes the unrestricted setting where any data can be used.

We compare our approach to the top-performing GEC systems in CoNLL and JFLEG. (The results of some recent work, e.g., Grundkiewicz et al. (2019), using the W&I and LOCNESS corpus Yannakoudakis et al. (2018) for training are not reported.) In addition to the restricted setting in which only public GEC data can be used for training, we also evaluate our approach in the unrestricted setting in which any data can be used. In the unrestricted setting, we additionally include 1.4M Cambridge Learner Corpus Nicholls (2003) and 2.9M non-public Lang-8 sentence pairs as gold data, as Ge et al. (2018b) did, and 85M sentence pairs augmented from English Gigaword using the same data augmentation methods as in the restricted setting, except that the back translation model is replaced with one trained with both public and non-public GEC data. Like most state-of-the-art GEC systems, we train 4 models with different random initializations for ensemble decoding.

Table 2 shows the results on the CoNLL and JFLEG benchmarks. According to Table 2, our approach obtains state-of-the-art results in both the restricted and unrestricted settings. In the restricted setting, it achieves 62.61 F0.5 in CoNLL-2014. In JFLEG, it achieves a 63.54 GLEU score, a new state-of-the-art result, even outperforming the multi-round decoding results of Grundkiewicz and Junczys-Dowmunt (2018) and Lichtarge et al. (2019). In the unrestricted setting, our approach significantly outperforms the previous state-of-the-art GEC systems and achieves the best results reported on the GEC benchmarks to date.

4.2 FST Evaluation

Formality style transfer is a practical sentence rewriting task, aiming to paraphrase an input sentence into the desired formality. In this paper, we focus on informal-to-formal style transfer since it is more practical in real application scenarios.

4.2.1 Setting

We use the GYAFC benchmark dataset Rao and Tetreault (2018) for training and evaluation. GYAFC's training split contains a total of 110K informal-formal parallel sentences, annotated via crowd-sourcing in two domains: Entertainment & Music (E&M) and Family & Relationships (F&R). In its test split, there are 1,146 and 1,332 informal sentences in the E&M and F&R domains respectively, and each informal sentence has 4 formal reference rewrites. Following prior work Niu et al. (2018), we use the GYAFC dev split as our development set and use tokenized BLEU as our automatic evaluation metric.

We use all three data augmentation methods we introduced and obtain a total of 4.9M augmented parallel sentences. Among them, 1.6M are generated by back-translating formal sentences of the E&M and F&R domains in the Yahoo Answers L6 corpus, 1.5M are derived by formality discrimination (with threshold $\sigma$ = 0.5), and 1.8M are from the public GEC data (Lang-8 and NUCLE).

We use the Transformer (base) model of Vaswani et al. (2017) as the seq2seq model, which has 6-layer transformer blocks with an embedding dimension of 512 for input and output and 2,048 for inner layers, and 8 self-attention heads. We build a shared vocabulary of 20K BPE Sennrich et al. (2016b) tokens and adopt the Adam optimizer to train the model with a batch size of 4,096 tokens per GPU, as in Section 4.1. In pre-training, the dropout probability is set to 0.1 and the learning rate is set to 0.0005 with 8,000 warmup steps and an inverse square root decay after warmup; during fine-tuning, the learning rate is set to 0.00025. We pre-train the model for 80k steps and fine-tune it for a total of 15k steps.

4.2.2 Results

Table 3 shows the results of the models trained with simultaneous training and with our pre-training & fine-tuning approach. As with the results in GEC, simultaneously training with the augmented and original data leads to a performance decline, because the noisy augmented data cannot achieve desirable performance by itself and may hinder the model from learning from the gold data in simultaneous training. In contrast, PT&FT only uses the augmented data in the pre-training phase and treats it as prior knowledge that is supplementary to the gold training data, reducing the negative effects of the augmented data and improving the results.

Model | E&M (BLEU) | F&R (BLEU)
Original data | 69.44 | 74.19
Augmented data | 51.83 | 55.66
ST | 59.93 | 63.16
ST (up-sampling) | 68.43 | 73.04
ST (down-sampling) | 68.54 | 73.69
PT&FT | 72.63 | 77.01
Table 3: The comparison of simultaneous training (ST) and Pre-train & Fine-tuning (PT&FT) for FST.

Table 4 compares the results of our pre-training & fine-tuning approach with different data augmentation methods. Compared with back translation, the improvements from formality discrimination and multi-task transfer are more significant since they introduce new rewriting knowledge and valuable training signals. Combining the augmented data further improves the performance, obtaining an absolute improvement of more than 2.5 BLEU over the baseline trained with only the original data.

Model | E&M (BLEU) | F&R (BLEU)
Original data | 69.44 | 74.19
Pre-training & Fine-tuning:
+ BT | 71.18 | 75.34
+ F-Dis | 71.72 | 76.24
+ M-Task | 71.91 | 76.21
+ M-Task + F-Dis | 72.40 | 76.92
+ BT + M-Task + F-Dis | 72.63 | 77.01
Table 4: The comparison of different data augmentation methods for FST.

We compare our approach to the following previous approaches in GYAFC benchmarks:

  • Rule, PBMT, NMT, PBMT-NMT: the rule-based, phrase-based MT, NMT, and PBMT-NMT hybrid models in Rao and Tetreault (2018).

  • NMT-MTL: The state-of-the-art NMT model with multi-task learning Niu et al. (2018).

According to the results in Table 5, our single model outperforms the previous state-of-the-art ensemble model Niu et al. (2018), and our ensemble model achieves a new state-of-the-art result on the GYAFC benchmark: 74.24 BLEU in the E&M domain and 77.97 in the F&R domain.

System | E&M (BLEU) | F&R (BLEU)
No-edit | 50.28 | 51.67
Rule | 60.37 | 66.40
PBMT | 66.88 | 72.40
NMT | 58.27 | 68.26
NMT-PBMT | 67.51 | 73.78
NMT-MTL | 71.29 (72.01) | 74.51 (75.33)
Our approach | 72.63 (74.24) | 77.01 (77.97)
Table 5: The comparison of our approach to the state-of-the-art result for FST. Numbers in parentheses are the results of ensemble of 4 models with different random initializations.
Model Formality Fluency Meaning
Original data 1.31 1.77 1.80
NMT-MTL 1.34 1.78
Ours
Table 6: Results of human evaluation of FST. Scores marked with */ are significantly different from the Original data/NMT-MTL scores ( in t-test).

We also conduct a human evaluation. Following previous work Rao and Tetreault (2018), we assess the model output on three criteria: formality, fluency and meaning preservation. We compare our baseline model trained only with the original data (in Table 3), the previous state-of-the-art model (NMT-MTL), and our PT&FT approach. We randomly sample 300 items; each item includes an input and three corresponding outputs, which are shuffled to anonymize model identities. Two annotators are asked to rate these outputs on a discrete scale of 0 to 2.

Table 6 presents the human evaluation results, showing that our model is consistently well rated in the human evaluation. It significantly improves over our baseline model trained with only the original data in all three aspects, and outperforms the previous state-of-the-art model in terms of fluency (significant in a t-test), confirming that our pre-training & fine-tuning approach with data augmentation is helpful for the FST task.

4.3 Discussion

Given the success of the pre-training & fine-tuning approach in sentence rewriting, we study generalizing it to other seq2seq tasks. We use the WMT14 English-German benchmark as our testbed, training and evaluating on the standard WMT14 English-German dataset. As in previous work Vaswani et al. (2017), we validate on newstest2013. After removing sentences longer than 250 words and sentence pairs with a source/target length ratio exceeding 1.5 from the training data, we obtain 3.9M parallel sentences as the original training data. For data augmentation, we back-translate 37M German monolingual sentences from the 2013 News Crawl corpus.
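
The length-based filtering of the WMT14 training data can be sketched as follows; this is a simple illustration under the stated thresholds, operating on whitespace-tokenized sentence pairs:

```python
def keep_pair(source: str, target: str, max_len: int = 250, max_ratio: float = 1.5) -> bool:
    """Drop overly long sentences and pairs with an extreme source/target length ratio."""
    src_len, tgt_len = len(source.split()), len(target.split())
    if src_len > max_len or tgt_len > max_len:
        return False
    ratio = max(src_len, tgt_len) / max(1, min(src_len, tgt_len))
    return ratio <= max_ratio

print(keep_pair("This is a short sentence .", "Dies ist ein kurzer Satz ."))  # True
```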

We use the same model architecture and training configuration as in Section 4.1 and compare the results of our pre-training & fine-tuning approach and the simultaneous training approaches. Table 7 reports the tokenized BLEU of our approach on the WMT14 English-German dataset. Our PT&FT approach still outperforms simultaneous training and achieves 32.2 BLEU. As far as we know, it is the best result for an MT model that uses only WMT14's mono- and bi-lingual data for training, inferior only to the commercial translation engine DeepL (https://www.deepl.com/press.html) and FAIR's model Edunov et al. (2018), which is trained simultaneously (with up-sampling) on the larger WMT18 dataset (containing 5.2M bi-lingual sentence pairs) together with 226M sentence pairs augmented through back translation.

Group | Model | BLEU
SOTA | DeepL | 33.3
SOTA | FAIR Edunov et al. (2018) | 35.0
Ours | Original data | 28.7
Ours | Augmented data | 28.8
Ours | ST | 29.3
Ours | ST (up-sampling) | 31.3
Ours | ST (down-sampling) | 31.0
Ours | PT&FT | 32.2
Table 7: Results in WMT14 English-German dataset.

One interesting observation in Table 7 is that the augmented data by itself can achieve performance comparable to the original training data in MT. This is quite different from the results in the sentence rewriting tasks (i.e., GEC and FST), where the augmented data yields only low performance by itself. The reason is that for many sentence rewriting tasks, most parts of a source sentence should not be edited unless necessary. Since the augmented data may contain various noisy and unnecessary editing signals, it is likely to make the model aggressive in making erroneous rewrites, resulting in low performance. Therefore, for sentence rewriting, it is better to pre-train on the augmented data than to train on it together with the original training data.

5 Related Work

Pre-training approaches Dai and Le (2015); Conneau et al. (2017); McCann et al. (2017); Howard and Ruder (2018) have drawn much attention recently. Among them, the most successful ones are ELMo Peters et al. (2018), OpenAI-GPT Radford et al. (2018) and BERT Devlin et al. (2018), which are all based on pre-training a language model on massive unlabeled text data and fine-tuning with task-specific gold data. While some previous work studies initializing a seq2seq model with a pre-trained language model Ramachandran et al. (2017) and multi-task seq2seq learning Luong et al. (2015), there was not much work on seq2seq pre-training with data augmentation until the last few months, when several studies Lichtarge et al. (2019); Zhao et al. (2019); Grundkiewicz et al. (2019) started to explore pre-training with augmented data for GEC. Different from these studies, which report better GEC performance through pre-training with augmented data, we focus on how to best utilize the augmented data by empirically comparing the effects of different training paradigms (i.e., simultaneous training vs. pre-training & fine-tuning) given the same augmented data on the final performance, analyzing the necessity of pre-training & fine-tuning for seq2seq sentence rewriting tasks.

Our work is also related to research exploring data augmentation methods in NLP. In addition to word substitution Fadaee et al. (2017); Zhou et al. (2019) and paraphrasing Dong et al. (2017), back translation Bojar and Tamchyna (2011); Sennrich et al. (2016a), including its variations He et al. (2016); Zhang et al. (2018), has attracted much attention owing to its success in MT Poncelas et al. (2018); Edunov et al. (2018). For sentence rewriting, an important research branch of data augmentation is artificial error generation for GEC Brockett et al. (2006); Foster and Andersen (2009); Rozovskaya and Roth (2010, 2011); Rozovskaya et al. (2012); Felice et al. (2014); Yuan et al. (2016); Rei et al. (2017); Xie et al. (2018), which studies generating source sentences with grammatical errors. Also, recent work uses back translation to obtain style-reduced paraphrases Prabhumoye et al. (2018) and employs the data from other tasks with the same target language to enhance the model in terms of target language modeling for sentence rewriting tasks Niu et al. (2018).

6 Conclusion and Future Work

In this paper, we study seq2seq pre-training & fine-tuning with various data augmentation methods in sentence rewriting. Extensive experiments demonstrate that our proposed data augmentation methods can effectively improve the performance and that pre-training & fine-tuning with data augmentation has advantages over the conventional simultaneous training approaches. It achieves new state-of-the-art results in multiple benchmarks in GEC and FST sentence rewriting tasks. In the future, we plan to generalize the current task-specific seq2seq pre-training approach so that we could pre-train a task-independent seq2seq model as a base for any monolingual sentence rewriting task.

References

  • O. Bojar and A. Tamchyna (2011) Improving translation model by monolingual data. In Proceedings of the Sixth Workshop on Statistical Machine Translation, WMT@EMNLP 2011, Edinburgh, Scotland, UK, July 30-31, 2011, pp. 330–336. External Links: Link Cited by: §5.
  • C. Brockett, W. B. Dolan, and M. Gamon (2006) Correcting ESL errors using phrasal SMT techniques. In ACL 2006, 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, Sydney, Australia, 17-21 July 2006, External Links: Link Cited by: §5.
  • K. Cho, B. van Merrienboer, Ç. Gülçehre, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR abs/1406.1078. External Links: Link, 1406.1078 Cited by: §2.1.
  • S. Chollampatt and H. T. Ng (2018a) A multilayer convolutional encoder-decoder neural network for grammatical error correction. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §4.1.1, Table 2.
  • S. Chollampatt and H. T. Ng (2018b) Neural quality estimation of grammatical error correction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2528–2539. Cited by: Table 2.
  • A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes (2017) Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, pp. 670–680. External Links: Link Cited by: §5.
  • D. Dahlmeier, H. T. Ng, and S. M. Wu (2013) Building a large annotated corpus of learner english: the nus corpus of learner english. In Proceedings of the eighth workshop on innovative use of NLP for building educational applications, pp. 22–31. Cited by: §4.1.1.
  • D. Dahlmeier and H. T. Ng (2012) Better evaluation for grammatical error correction. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 568–572. Cited by: §4.1.1.
  • A. M. Dai and Q. V. Le (2015) Semi-supervised sequence learning. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pp. 3079–3087. External Links: Link Cited by: §5.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805. External Links: Link, 1810.04805 Cited by: §5.
  • L. Dong, J. Mallinson, S. Reddy, and M. Lapata (2017) Learning to paraphrase for question answering. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, pp. 875–886. External Links: Link Cited by: §5.
  • S. Edunov, M. Ott, M. Auli, and D. Grangier (2018) Understanding back-translation at scale. arXiv preprint arXiv:1808.09381. Cited by: §1, Table 7, §5.
  • M. Fadaee, A. Bisazza, and C. Monz (2017) Data augmentation for low-resource neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 2: Short Papers, pp. 567–573. External Links: Link, Document Cited by: §5.
  • M. Felice, Z. Yuan, Ø. E. Andersen, H. Yannakoudakis, and E. Kochmar (2014) Grammatical error correction using hybrid systems and type filtering. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task, CoNLL 2014, Baltimore, Maryland, USA, June 26-27, 2014, pp. 15–24. External Links: Link Cited by: §5.
  • J. Foster and Ø. E. Andersen (2009) GenERRate: generating errors for use in grammatical error detection. In Proceedings of the Fourth Workshop on Innovative Use of NLP for Building Educational Applications, BEA@NAACL-HLT 2009, Boulder, CO, USA, June 5, 2009, pp. 82–90. External Links: Link Cited by: §5.
  • T. Ge, F. Wei, and M. Zhou (2018a) Fluency boost learning and inference for neural grammatical error correction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 1055–1065. Cited by: §3.2, §4.1.1.
  • T. Ge, F. Wei, and M. Zhou (2018b) Reaching human-level performance in automatic grammatical error correction: an empirical study. arXiv preprint arXiv:1807.01270. Cited by: Table 2.
  • R. Grundkiewicz, M. Junczys-Dowmunt, and K. Heafield (2019) Neural grammatical error correction systems with unsupervised pre-training on synthetic data. In Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 252–263. Cited by: §5, footnote 3.
  • R. Grundkiewicz and M. Junczys-Dowmunt (2018) Near human-level performance in grammatical error correction with hybrid machine translation. arXiv preprint arXiv:1804.05945. Cited by: §4.1.1, Table 2.
  • D. He, Y. Xia, T. Qin, L. Wang, N. Yu, T. Liu, and W. Ma (2016) Dual learning for machine translation. In Advances in Neural Information Processing Systems, pp. 820–828. Cited by: §5.
  • J. Howard and S. Ruder (2018) Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pp. 328–339. External Links: Link Cited by: §5.
  • J. Ji, Q. Wang, K. Toutanova, Y. Gong, S. Truong, and J. Gao (2017) A nested attention neural hybrid model for grammatical error correction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pp. 753–762. External Links: Link, Document Cited by: Table 2.
  • M. Junczys-Dowmunt, R. Grundkiewicz, S. Guha, and K. Heafield (2018) Approaching neural grammatical error correction as a low-resource machine translation task. arXiv preprint arXiv:1804.05940. Cited by: Table 2.
  • M. Junczys-Dowmunt and R. Grundkiewicz (2016) Phrase-based machine translation is state-of-the-art for automatic grammatical error correction. arXiv preprint arXiv:1605.06353. Cited by: §4.1.1.
  • A. Karakanta, J. Dehdari, and J. van Genabith (2018) Neural machine translation for low-resource languages without parallel corpora. Machine Translation 32 (1-2), pp. 167–189. External Links: Link, Document Cited by: §1.
  • J. Lichtarge, C. Alberti, S. Kumar, N. Shazeer, N. Parmar, and S. Tong (2019) Corpora generation for grammatical error correction. CoRR abs/1904.05780. External Links: Link, 1904.05780 Cited by: Table 2, §5.
  • M. Luong, Q. V. Le, I. Sutskever, O. Vinyals, and L. Kaiser (2015) Multi-task sequence to sequence learning. arXiv preprint arXiv:1511.06114. Cited by: §5.
  • B. McCann, J. Bradbury, C. Xiong, and R. Socher (2017) Learned in translation: contextualized word vectors. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pp. 6297–6308. External Links: Link Cited by: §5.
  • T. Mizumoto, M. Komachi, M. Nagata, and Y. Matsumoto (2011) Mining revision log of language learning sns for automated japanese error correction of second language learners. In Proceedings of 5th International Joint Conference on Natural Language Processing, pp. 147–155. Cited by: §4.1.1.
  • C. Napoles, K. Sakaguchi, M. Post, and J. Tetreault (2015) Ground truth for grammatical error correction metrics. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Vol. 2, pp. 588–593. Cited by: §4.1.1.
  • C. Napoles, K. Sakaguchi, and J. Tetreault (2017) JFLEG: a fluency corpus and benchmark for grammatical error correction. arXiv preprint arXiv:1702.04066. Cited by: §4.1.1.
  • H. T. Ng, S. M. Wu, T. Briscoe, C. Hadiwinoto, R. H. Susanto, and C. Bryant (2014) The conll-2014 shared task on grammatical error correction. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task, pp. 1–14. Cited by: §4.1.1.
  • D. Nicholls (2003) The cambridge learner corpus: error coding and analysis for lexicography and elt. In Proceedings of the Corpus Linguistics 2003 conference, Vol. 16, pp. 572–581. Cited by: §4.1.2.
  • X. Niu, S. Rao, and M. Carpuat (2018) Multi-task neural models for translating between styles within and across languages. In Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, Santa Fe, New Mexico, USA, August 20-26, 2018, pp. 1008–1021. External Links: Link Cited by: 2nd item, §4.2.1, §4.2.2, §5.
  • M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pp. 2227–2237. External Links: Link Cited by: §5.
  • A. Poncelas, D. Shterionov, A. Way, G. M. de Buy Wenniger, and P. Passban (2018) Investigating backtranslation in neural machine translation. CoRR abs/1804.06189. External Links: Link, 1804.06189 Cited by: §5.
  • S. Prabhumoye, Y. Tsvetkov, R. Salakhutdinov, and A. W. Black (2018) Style transfer through back-translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pp. 866–876. External Links: Link Cited by: §5.
  • A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training. In Technical report, OpenAI, Cited by: §5.
  • P. Ramachandran, P. J. Liu, and Q. V. Le (2017) Unsupervised pretraining for sequence to sequence learning. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, pp. 383–391. External Links: Link Cited by: §5.
  • S. Rao and J. R. Tetreault (2018) Dear sir or madam, may I introduce the GYAFC dataset: corpus, benchmarks and metrics for formality style transfer. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pp. 129–140. External Links: Link Cited by: §4.2.1, §4.2.2.
  • M. Rei, M. Felice, Z. Yuan, and T. Briscoe (2017) Artificial error generation with machine translation and syntactic patterns. In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, BEA@EMNLP 2017, Copenhagen, Denmark, September 8, 2017, pp. 287–292. External Links: Link Cited by: §5.
  • A. Rozovskaya and D. Roth (2010) Training paradigms for correcting errors in grammar and usage. In Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, June 2-4, 2010, Los Angeles, California, USA, pp. 154–162. External Links: Link Cited by: §5.
  • A. Rozovskaya and D. Roth (2011) Algorithm selection and model adaptation for ESL correction tasks. In The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference, 19-24 June, 2011, Portland, Oregon, USA, pp. 924–933. External Links: Link Cited by: §5.
  • A. Rozovskaya, M. Sammons, and D. Roth (2012) The UI system in the HOO 2012 shared task on error correction. In Proceedings of the Seventh Workshop on Building Educational Applications Using NLP, BEA@NAACL-HLT 2012, June 7, 2012, Montréal, Canada, pp. 272–280. External Links: Link Cited by: §5.
  • K. Sakaguchi, M. Post, and B. Van Durme (2017) Grammatical error correction with neural reinforcement learning. arXiv preprint arXiv:1707.00299. Cited by: §4.1.1.
  • R. Sennrich, B. Haddow, and A. Birch (2016a) Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers, External Links: Link Cited by: §1, §3.2.1, §5.
  • R. Sennrich, B. Haddow, and A. Birch (2016b) Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers, External Links: Link Cited by: §4.1.1, §4.2.1.
  • I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pp. 3104–3112. External Links: Link Cited by: §2.1.
  • T. Tajiri, M. Komachi, and Y. Matsumoto (2012) Tense and aspect error correction for esl learners using global context. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers-Volume 2, pp. 198–202. Cited by: §4.1.1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008. Cited by: §4.3, §4.
  • X. Wang, H. Pham, Z. Dai, and G. Neubig (2018) SwitchOut: an efficient data augmentation algorithm for neural machine translation. CoRR abs/1808.07512. External Links: Link, 1808.07512 Cited by: §1.
  • Z. Xie, G. Genthial, S. Xie, A. Ng, and D. Jurafsky (2018) Noising and denoising natural language: diverse backtranslation for grammar correction. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Vol. 1, pp. 619–628. Cited by: §5.
  • H. Yannakoudakis, Ø. E. Andersen, A. Geranpayeh, T. Briscoe, and D. Nicholls (2018) Developing an automated writing placement system for esl learners. Applied Measurement in Education 31 (3), pp. 251–267. Cited by: footnote 3.
  • Z. Yuan, T. Briscoe, and M. Felice (2016) Candidate re-ranking for smt-based grammatical error correction. In Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications, BEA@NAACL-HLT 2016, June 16, 2016, San Diego, California, USA, pp. 256–266. External Links: Link Cited by: §5.
  • Z. Zhang, S. Liu, M. Li, M. Zhou, and E. Chen (2018) Joint training for neural machine translation models with monolingual data. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §5.
  • W. Zhao, L. Wang, K. Shen, R. Jia, and J. Liu (2019) Improving grammatical error correction via pre-training a copy-augmented architecture with unlabeled data. CoRR abs/1903.00138. External Links: Link, 1903.00138 Cited by: Table 2, §5.
  • W. Zhou, T. Ge, K. Xu, F. Wei, and M. Zhou (2019) BERT-based lexical substitution. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3368–3373. Cited by: §5.