Automatic Fact-guided Sentence Modification

Darsh J Shah*    Tal Schuster*    Regina Barzilay
Computer Science and Artificial Intelligence Lab
Massachusetts Institute of Technology
{darsh, tals, regina}@csail.mit.edu
Abstract

Online encyclopedias like Wikipedia contain large amounts of text that need frequent corrections and updates. The new information may contradict existing encyclopedia content. In this paper, we focus on rewriting such dynamically changing articles. This is a challenging constrained generation task, as the output must be consistent with the new information and fit into the rest of the existing document. To this end, we propose a two-step solution: (1) we identify and remove the contradicting components in a target text for a given claim, using a neutralizing stance model; (2) we expand the remaining text to be consistent with the given claim, using a novel two-encoder sequence-to-sequence model with copy attention. Applied to a Wikipedia fact update dataset, our method successfully generates updated sentences for new claims, achieving the highest SARI score. Furthermore, we demonstrate that generating synthetic data through such rewritten sentences can successfully augment the FEVER fact-checking training dataset, leading to a relative error reduction of 13%.

* Order decided by a coin toss.

1 Introduction

Online text resources like Wikipedia contain millions of articles that must be continually updated. Some updates involve expansions of existing articles, while others modify the content. In this work, we are interested in the latter scenario, where the modification contradicts the current article. Such changes are common in online sources and cover a broad spectrum of subjects, ranging from changes to event dates to modifications of the relationship between entities. In these cases, simple solutions like negating the original text or concatenating it with the new information would not apply. In this work, our goal is to automate these updates. Specifically, given a claim and an outdated sentence from an article, we rewrite the sentence to be consistent with the given claim while preserving non-contradicting content.

Figure 1: Our fact-guided update pipeline. Given a claim which refutes incorrect information, a masker is applied to remove the contradicting parts from the original text while preserving the rest of the context. Then, the residual neutral text and claim are fused to create an updated text that is consistent with the claim.

Consider the Wikipedia update scenario depicted in Figure 1. The claim, stating that 23 of 43 minority stakeholdings are significant, contradicts the old information in the Wikipedia sentence, requiring a modification. Directly learning a model for this task would demand supervision, i.e. demonstrated updates with the corresponding claims. For Wikipedia, however, the underlying claims that drive the changes are not easily accessible. Therefore, we need to utilize other available sources of supervision.

In order to make the corresponding update, we develop a two-step solution: (1) identify and remove the contradicting segments of the text (in this case, 28 of their 42 minority stakeholdings); (2) rewrite the residual sentence to include the updated information (e.g. the fraction of significant stakeholdings) while also preserving the rest of the content.

For the first step, we utilize a neutrality stance classifier as indirect supervision to identify the polarizing spans in the target sentence. We consider a sentence span as polarizing if its absence increases the neutrality of the claim-sentence pair. To identify and mask such sentence spans, we introduce an interpretability-inspired (Lei et al., 2016) neural architecture to effectively explore the space of possible spans. We formulate our objective so that the masking is minimal, thus preserving the context of the sentence.

For the second step, we introduce a novel, two-encoder decoder architecture, where two encoders fuse the claim and the residual sentence with a more refined control over their interaction.

We apply our method to two tasks: automatic fact-guided modifications and data augmentation for fact-checking. On the first task, our method is able to generate corrected Wikipedia sentences guided by unstructured textual claims. Evaluation on Wikipedia modifications demonstrates that our model’s outputs were the most successful in making the requisite updates, compared to strong baselines. On the FEVER fact-checking dataset, our model is able to successfully generate new claim-evidence supporting pairs, starting with claim-evidence refuting pairs — intended to reduce the bias in the dataset. Using these outputs to augment the dataset, we attain a 13% decrease in relative error on an unbiased evaluation set.

2 Related Work

Text Rewriting

There have been several recent advancements in the field of text rewriting, including style transfer  (Shen et al., 2017; Zhang et al., 2018; Chen et al., 2018) and sentence fusion  (Barzilay and McKeown, 2005; Narayan et al., 2017; Geva et al., 2019). Unlike previous approaches, our sentence modification task addresses potential contradictions between two sources of information.

Our work is closely related to the approach of Li et al. (2018), which separates the task of sentiment transfer into deleting strong markers of sentiment in a sentence and retrieving markers of the target label to generate a sentence with the opposite sentiment. In contrast to such work, where the requisite modification is along a fixed aspect (e.g. sentiment), in our setting an arbitrary input sentence (the claim) dictates the space of desired modifications. Therefore, in order to succeed at our task, a system should understand the varying degrees of polarization in the spans of the outdated sentence against the claim before modifying the sentence to be consistent with the claim.

Wikipedia Edits

Wikipedia's edit history has been analyzed for insights into the kinds of modifications made (Daxenberger and Gurevych, 2013; Yang et al., 2017; Faruqui et al., 2018). The edit history has also been used for text generation tasks such as sentence compression and simplification (Yatskar et al., 2010), paraphrasing (Max and Wisniewski, 2010) and writing assistance (Cahill et al., 2013). In this work, we are interested in the novel task of automating the editing process with the guidance of a textual claim.

Fact Verification Datasets

The growing interest in automatic fake news detection has led to the development of several fact verification datasets (Vlachos and Riedel, 2014; Wang, 2017; Rashkin et al., 2017; Thorne et al., 2018). FEVER, the largest fact-checking dataset, contains 185K human-written fake and real claims, generated by crowd-workers in the context of sentences from Wikipedia articles. This dataset contains biases that allow a model to identify many of the false claims without any evidence (Schuster et al., 2019). This bias affects the generalization capabilities of models trained on such data. In this work, we show that our automatic modification method can be used to augment a fact-checking dataset and to improve the inference of models trained on it.

Data Augmentation

Methods for data augmentation are commonly used in computer vision (Perez and Wang, 2017). There have been recent successes in NLP where augmentation techniques such as paraphrasing and word replacement were applied to text classification  (Kobayashi, 2018; Wu et al., 2018). Adversarial examples in NLI with syntactic modifications can also be considered as methods of data augmentation  (Iyyer et al., 2018; Zhang et al., 2019). In this work, we create constrained modifications, based on a reference claim, to augment data for our task at hand. Our additions are specifically aimed towards reducing the bias in the training data, by having a false claim appear in both “Agrees” and “Disagrees” classes.

3 Model

Problem Statement

We assume access to a corpus of claims and knowledge-book sentences. Specifically, $D = \{(c_i, s_i)\}_{i=1}^{N}$, where $c_i$ is a short factual sentence (claim), and $s_i$ is a sentence from Wikipedia. Each pair of claim and Wikipedia sentence has a relation $\text{rel}(c_i, s_i)$, of either agree ($\text{agr}$), disagree ($\text{dis}$) or neutral ($\text{neu}$). In this corpus, a Wikipedia sentence $s$ is defined as outdated with respect to $c$ if $\text{rel}(c, s) = \text{dis}$ and updated if $\text{rel}(c, s) = \text{agr}$. The neutral relation holds for pairs in which the sentence doesn't contain specific information about the claim.

Our goal is to automatically update a given sentence $s$, which is outdated with respect to a claim $c$. Specifically, given a claim $c$ and a pair $(c, s)$ for which $\text{rel}(c, s) = \text{dis}$, our objective is to apply minimal modifications to $s$ such that the relation of the modified sentence $s'$ will be: $\text{rel}(c, s') = \text{agr}$. In addition, $s'$ should be structurally similar to $s$.

Framework

Currently, to the best of our knowledge, there is no large dataset for fact-guided modifications. Instead, we utilize a large dataset with pairs of claims and sentences that are labeled as consistent, inconsistent or neutral. In order to compensate for the lack of direct supervision, we develop a two-step solution. First, using a pretrained fact-checking classifier for indirect supervision, we identify the polarizing spans of the outdated sentence and mask them to get a residual sentence $s^{\text{neu}}$ such that $\text{rel}(c, s^{\text{neu}}) = \text{neu}$. Then, we fuse this pair to generate the updated sentence $s'$ which is consistent with the claim. This is done with a sequence-to-sequence model trained on consistent pairs through an auto-encoder style objective. The two steps are trained independently to simplify optimization. Our overall pipeline is depicted in Figure 3.
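
To make the pipeline concrete, the following Python sketch wires the two trained components together at inference time. The masker, fuser and their interfaces are hypothetical placeholders for the two modules described below, not the authors' released code.

def update_sentence(claim: str, outdated: str, masker, fuser) -> str:
    """Rewrite `outdated` so that it agrees with `claim` (a sketch)."""
    # Step 1 (Section 3.1): mask the polarizing spans so that the
    # residual text is neutral with respect to the claim, e.g.
    # "... considers <mask> minority stakeholdings ..."
    residual = masker(claim, outdated)
    # Step 2 (Section 3.2): fuse the claim into the residual text to
    # produce an updated sentence that agrees with the claim.
    return fuser(claim, residual)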

Figure 2: Illustrating the flow of the masker module.

3.1 Masker: Eliminate Polarizing Spans

In this section we describe the module to identify the polarizing spans within a Wikipedia sentence. Masking these spans ensures that the residual sentence-claim pairs attain a neutral relation. Here, neutrality is determined by a classifier trained on claim and Wikipedia sentence pairs as described below. Using this classifier, the masking module is trained to identify the polarizing spans by maximizing the neutrality of the residual-sentence and claim pairs. In order to preserve the context of the original sentence, we include optimization constraints to ensure minimal deletions. This approach is similar to neural rationale-based models (Lei et al., 2016), where a module tries to identify the spans of the input that justify the model’s prediction.

Neutrality Masker

Given a knowledge-book sentence $s$ and a claim $c$, the masker's goal is to create $s^{\text{neu}}$ such that $\text{rel}(c, s^{\text{neu}}) = \text{neu}$. For the original sentence with $n$ tokens, $s = (w_1, \ldots, w_n)$, the output is a mask $z = (z_1, \ldots, z_n)$. The neutral sentence is constructed as:

$s^{\text{neu}}_j = \begin{cases} w_j & \text{if } z_j = 0 \\ \langle\text{mask}\rangle & \text{if } z_j = 1 \end{cases}$   (1)

where $\langle\text{mask}\rangle$ is a special token. (The special token is treated as an out-of-vocabulary token for the following models.) The details of the masker architecture are stated below and depicted in Figure 2.

Encoding

We encode $s$ with a sequence encoder to get hidden states $h^s_1, \ldots, h^s_n$. Since the neutrality of the sentence needs to be measured with respect to a claim, we also encode the claim and enhance $s$'s representations with those of $c$ using an attention mechanism. Formally, we compute

$\hat{h}^s_j = \big[ h^s_j \, ; \, \sum_k \alpha_{jk} h^c_k \big]$   (2)

where $h^c_1, \ldots, h^c_m$ are the encoded representations of the claim and $\alpha_{jk}$ are the parameterized bilinear attention (Kim et al., 2018) weights computed by:

$e_{jk} = (h^s_j)^\top W \, h^c_k$   (3)
$\alpha_{jk} = \exp(e_{jk}) \, / \, \sum_{k'} \exp(e_{jk'})$   (4)

Finally, the aggregated representations $\hat{h}^s$ are used as input to a second sequence encoder.
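
A possible PyTorch realization of this claim-conditioned encoding (Eqs. 2-4) is sketched below; the module name, dimensions and initialization are illustrative assumptions, while the bilinear scoring and concatenation follow the equations above.

import torch
import torch.nn as nn

class ClaimConditionedEncoding(nn.Module):
    # Sketch of Eqs. 2-4: attend from sentence states to claim states.
    def __init__(self, dim: int = 100):
        super().__init__()
        self.W = nn.Parameter(torch.empty(dim, dim))  # bilinear weights (Eq. 3)
        nn.init.xavier_uniform_(self.W)

    def forward(self, h_s: torch.Tensor, h_c: torch.Tensor) -> torch.Tensor:
        # h_s: (n, dim) sentence states; h_c: (m, dim) claim states.
        scores = h_s @ self.W @ h_c.T                   # Eq. 3: (n, m)
        alpha = torch.softmax(scores, dim=-1)           # Eq. 4
        claim_context = alpha @ h_c                     # (n, dim)
        return torch.cat([h_s, claim_context], dim=-1)  # Eq. 2: (n, 2*dim)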

Masking

The encoded sentence is used to predict a per-token masking probability:

$p_j = \sigma(w_p^\top \hat{h}^s_j + b_p)$   (5)

Then, the mask $z$ is applied to achieve the residual sentence:

$s^{\text{neu}} = s \odot (1 - z)$   (6)

where $\odot$ denotes element-wise multiplication. During training, we set $z_j = p_j$ and perform soft deletions to allow a simple optimization solution. During inference, the values of $p_j$ are rounded to create a discrete mask.
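
The soft-versus-hard masking behavior might be implemented as below: soft deletion scales token embeddings during training, while inference rounds the probabilities and substitutes the special token (Eq. 1). This is an illustrative sketch under those assumptions, not the released implementation.

import torch

MASK_TOKEN = "<mask>"  # special out-of-vocabulary token (name is an assumption)

def soft_mask(embeddings: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
    # Training-time soft deletion (Eq. 6): scale each token's embedding by
    # one minus its masking probability, keeping gradients intact.
    return embeddings * (1.0 - p).unsqueeze(-1)

def hard_mask(tokens, p: torch.Tensor):
    # Inference-time discrete mask (Eq. 1): round the probabilities and
    # replace masked tokens with the special token.
    return [MASK_TOKEN if p_j >= 0.5 else w
            for w, p_j in zip(tokens, p.tolist())]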

Figure 3: A summary of our pipeline. Given a sentence that is inconsistent with a claim, a masker is applied to mask out the contradicting parts from the original text while preserving the rest of the content. Then, the residual neutral text and claim are fused to create an updated text that is consistent with the claim. The Masker and the Two-Encoder Generator are trained separately.

Training

A pretrained fact-checking neutrality classifier's prediction is used to guide the training of the masker. In order to encourage maximal retention of the context, we utilize a regularization term to minimize the fraction of masked words. The joint objective is to minimize:

$\mathcal{L} = \mathcal{L}_{\text{neu}}(c, s^{\text{neu}}) + \lambda \cdot \frac{1}{n} \sum_{j=1}^{n} z_j$   (7)

where $\mathcal{L}_{\text{neu}}$ is the classifier's negative log-likelihood of the neutral label for the $(c, s^{\text{neu}})$ pair and $\lambda$ controls the strength of the deletion penalty.
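
One way to compute this objective in PyTorch, assuming the frozen neutrality classifier exposes log-probabilities over the three relations, is sketched here; the neutral_idx convention and the default λ value are illustrative assumptions.

import torch

def masker_loss(logprobs: torch.Tensor, p: torch.Tensor,
                neutral_idx: int = 2, lam: float = 0.4) -> torch.Tensor:
    # logprobs: (3,) log-probabilities from the frozen fact-checking
    # classifier for the (claim, soft-masked sentence) pair.
    # p: (n,) per-token masking probabilities from Eq. 5.
    neutrality_nll = -logprobs[neutral_idx]  # push the pair toward neutral
    deletion_penalty = p.mean()              # fraction of (softly) masked words
    return neutrality_nll + lam * deletion_penalty  # Eq. 7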

Fact-checking Neutrality Classifier

Our fact-checking classifier is pretrained on agreeing and disagreeing pairs from $D$, in addition to neutral examples constructed through negative sampling. For each claim, we construct a neutral pair by sampling a random sentence from the same paragraph as the polarizing sentence, making it contextually close to the claim but unlikely to polarize it. We pretrain the classifier on these examples and fix its parameters during the training of the masker.
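
The negative sampling step could look like the following sketch, where paragraph holds the sentences surrounding the polarizing evidence; the function name and interface are assumptions for illustration.

import random

def sample_neutral_pair(claim, polarizing, paragraph):
    # Draw a sentence from the same paragraph as the polarizing evidence:
    # contextually close to the claim, but unlikely to support or refute it.
    candidates = [s for s in paragraph if s != polarizing]
    return claim, random.choice(candidates)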

Optional Syntactic Regularization

Currently the model is trained with distant supervision, so we pre-compute a valid neutrality mask as an additional signal, when possible. To this end, we parse the original sentences using a constituency parser and iterate over contiguous syntactic phrases by length. For each sentence, the shortest successful neutrality mask (if any) is selected as a target mask. (If there are several successful masks of the same length, we use the one with the highest neutrality score.) In the event of successfully finding such a mask, the masking module is regularized to emulate the target mask by adding the following term to Eq. 7:

$\mathcal{L}_{\text{syn}} = -\frac{1}{n} \sum_{j=1}^{n} \big[ z^*_j \log p_j + (1 - z^*_j) \log(1 - p_j) \big]$   (8)

where $z^*$ is the target mask.

Empirically, we find that the model can perform well even without this regularization, but it can help to stabilize the training. Additional details and analysis are available in the appendix.
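
The target-mask search can be sketched as follows: enumerate contiguous constituent spans from shortest to longest, test each candidate with the frozen neutrality classifier, and keep the shortest mask that succeeds. The spans and is_neutral interfaces, and tie-breaking by first hit, are assumptions for illustration.

def find_target_mask(tokens, spans, is_neutral):
    # tokens: the original sentence; spans: contiguous constituency-parser
    # phrases as (start, end) index pairs; is_neutral(masked_tokens) -> bool,
    # backed by the frozen neutrality classifier.
    for start, end in sorted(spans, key=lambda se: se[1] - se[0]):
        if not 2 <= end - start <= 10:  # span lengths considered (Section 4.2)
            continue
        masked = tokens[:start] + ["<mask>"] * (end - start) + tokens[end:]
        if is_neutral(masked):
            return [1 if start <= j < end else 0 for j in range(len(tokens))]
    return None  # no valid syntactic mask was found for this sentence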

3.2 Two-encoder Pointer Generator: Constructing a Fact-updated Sentence

In this section we describe our method to generate an output which agrees with the claim. If the earlier masking step is done perfectly, the merging boils down to a simple fusion task. However, in certain cases, especially ones with a strong contradiction, our minimal deletion constraint might leave some residual contradictions in $s^{\text{neu}}$. Thus, we develop a model which can control the amount of information to consider from each input.

We extend the pointer-generator model of See et al. (2017) to enable multiple encoders. While sequence-to-sequence models support the encoding of multiple sentences by simply concatenating them, our use of a per input encoder allows the decoder to better control the use of each source. This is especially of interest to our task, where the context of the claim must be translated to the output while ignoring contradicting spans from the outdated Wikipedia sentence.

Next, we describe the details of our generator's architecture. Here, we use one encoder for the claim and one encoder for the outdated sentence. In order to reduce the size of the model, we share the parameters of the two encoders. The model can be similarly extended to any number of encoders.

Encoding

At each time step $t$, the decoder output $o_t$ is a function of a weighted combination $c^*_t$ of the two encoders' context representations, the decoder output in the previous step $o_{t-1}$ and the representation $x_t$ of the word output at the end of the previous step:

$o_t = f(o_{t-1}, x_t, c^*_t)$   (9)

As the decoder should decide at each time step which encoder to attend to more, we introduce an encoder weight $\gamma_t$. The shared encoder context representation $c^*_t$ is based on the individual representations $c^{\text{claim}}_t$ and $c^{\text{text}}_t$:

$c^*_t = \gamma_t \, c^{\text{claim}}_t + (1 - \gamma_t) \, c^{\text{text}}_t$   (10)

The context representation $c^e_t$ of each encoder $e \in \{\text{claim}, \text{text}\}$ is the attention-weighted sum of that encoder's representations $h^e$ for a particular decoder state:

$c^e_t = \sum_j a^e_{tj} \, h^e_j$   (11)
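
A PyTorch sketch of the γ-weighted dual context (Eqs. 10-11) appears below; the attention scoring is simplified to dot products for brevity, which is an assumption rather than the paper's exact attention form.

import torch

def dual_context(state, h_claim, h_text, gamma):
    # state: (dim,) decoder state; h_claim, h_text: (len, dim) encoder
    # states; gamma: scalar in [0, 1] weighting the claim encoder (Eq. 10).
    def context(h):  # Eq. 11 with simplified dot-product scores
        a = torch.softmax(h @ state, dim=0)
        return a @ h
    return gamma * context(h_claim) + (1 - gamma) * context(h_text)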

Decoding

Following the standard copy mechanism, predicting the next word $w_t$ involves deciding whether to generate (with probability $p_{\text{gen}}$) or copy, based on the decoder input $x_t$, the decoder state $o_t$ and the context vector $c^*_t$:

$p_{\text{gen}} = \sigma\big(w_g^\top [x_t \, ; \, o_t \, ; \, c^*_t] + b_g\big)$   (12)

In case of copying, we need an additional gating mechanism to select between the two sources:

$p_{\text{src}} = \sigma\big(w_s^\top [x_t \, ; \, o_t \, ; \, c^*_t] + b_s\big)$   (13)

When generating a new word, the probability over words from the vocabulary is computed by:

$P_{\text{vocab}} = \mathrm{softmax}(V o_t + b_v)$   (14)

The final output distribution of the decoder at each time step is then computed by:

$P(w_t) = p_{\text{gen}} P_{\text{vocab}}(w_t) + (1 - p_{\text{gen}}) \big[ p_{\text{src}} \sum_{j: w^{\text{claim}}_j = w_t} a^{\text{claim}}_{tj} + (1 - p_{\text{src}}) \sum_{j: w^{\text{text}}_j = w_t} a^{\text{text}}_{tj} \big]$   (15)

where $a^{\text{claim}}_{tj}$ and $a^{\text{text}}_{tj}$ are the input sequence attention scores from Eq. 11.
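
The mixture in Eq. 15 can be assembled with a scatter-add over the copy positions, as in the original pointer-generator; the sketch below assumes the gates and attention weights have already been computed.

import torch

def final_distribution(p_vocab, p_gen, p_src,
                       a_claim, claim_ids, a_text, text_ids):
    # p_vocab: (V,) from Eq. 14; p_gen, p_src: scalars from Eqs. 12-13;
    # a_claim, a_text: (len,) attention weights from Eq. 11;
    # claim_ids, text_ids: (len,) LongTensors of vocabulary indices.
    dist = p_gen * p_vocab
    copy = torch.zeros_like(p_vocab)
    copy.scatter_add_(0, claim_ids, (1 - p_gen) * p_src * a_claim)
    copy.scatter_add_(0, text_ids, (1 - p_gen) * (1 - p_src) * a_text)
    return dist + copy  # Eq. 15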

Training

Since we have no training data for claim-guided sentence updates, we train the generator module to reconstruct a sentence $s$ that is consistent with an agreeing claim $c$. The training input is the residual up-to-date neutral sentence $s^{\text{neu}}$ and the guiding claim $c$.

During inference, we utilize the guiding claims and residual outdated sentences to create $s'$. While generating the updated sentences, we would like to preserve as much context as possible from the contradicting sentence while ensuring the correct relation with the claim. Therefore, for each case, if the latter goal is not achieved, we gradually increase the focus on the claim by increasing the $\gamma_t$ and $p_{\text{src}}$ values until the output satisfies $\text{rel}(c, s') = \text{agr}$, or until a predefined maximum weight is reached.
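
This inference-time ramp can be expressed as a simple loop; the generate and rel interfaces, the step size and the cap are illustrative assumptions.

def guided_decode(claim, residual, generate, rel,
                  step=0.1, max_bias=1.0):
    # generate(claim, residual, bias) is assumed to decode with gamma and
    # p_src increased by `bias` toward the claim encoder.
    bias = 0.0
    while True:
        output = generate(claim, residual, bias)
        if rel(claim, output) == "agree" or bias >= max_bias:
            return output
        bias += step  # shift attention and copying toward the claim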

4 Experimental Setup

We evaluate our model on two tasks: (1) automatic fact updates of Wikipedia sentences, where we update outdated Wikipedia sentences using guiding fact claims; and (2) generation of synthetic claim-evidence pairs to augment an existing biased fact-checking dataset, in order to improve the performance of trained classifiers on an unbiased dataset.

4.1 Datasets

Training Data from FEVER

We use FEVER (Thorne et al., 2018), the largest available Wikipedia-based fact-checking dataset, to train our models for both tasks. We use the claim-evidence pairs of this dataset as our claim-sentence samples and use the dataset's "refutes", "not enough information" and "supports" labels as our disagree, neutral and agree relations, respectively.

Evaluation Data for Automatic Fact Updates

We evaluate the automatic fact updates task on an evaluation set based on part of the symmetric dataset from  Schuster et al. (2019) and the fact-based cases from a Wikipedia updates dataset (Yang et al., 2017). For the symmetric dataset, we use the modified Wikipedia sentences with their guiding claims to generate the true Wikipedia sentence. For the cases from the updates dataset, we have human annotators write a guiding claim for each update and use it, together with the outdated sentence, to generate the updated Wikipedia sentence. Overall we have a total of 201 tuples of fact update claims, outdated sentences and updated sentences.

Evaluation Data for Augmentation

To measure the proficiency of our generated outputs for data augmentation, we use the unbiased FEVER-based evaluation set of Schuster et al. (2019). As shown by Schuster et al. (2019), the claims in the FEVER dataset contain give-away phrases that can make FEVER-trained models overly rely on them, resulting in decreased performance when evaluated on unbiased datasets.

The classifiers trained on our augmented dataset are evaluated on the unbiased symmetric dataset of Schuster et al. (2019). This dataset (version 0.2) contains 531 claim-evidence pairs for validation and 534 claim-evidence pairs for testing.

In addition, we extend the symmetric test set by creating additional FEVER-based pairs. We hired crowd-workers on Amazon Mechanical Turk and asked them to simulate the process of generating synthetic training pairs. Specifically, for a "refutes" claim-evidence FEVER pair, the workers were asked to generate a modified supporting evidence while preserving as much information as possible from the original evidence. We collected workers' responses for 500 refuting pairs from the FEVER training set. This process extends the symmetric test set (+TURK) by 1000 cases: 500 "refutes" pairs, and the corresponding 500 "supports" pairs generated by turkers.

Model        SARI   Keep   Add    Del    Grammar  Agreement
Fact updates:
Paraphrase   15.9   18.7   4.2    50.7   3.75     3.65
Claim Ext.   12.9   22.6   1.9    50.4   1.75     2.65
M. Concat    26.5   61.7   6.7    44.9   3.28     2.75
Ours         31.5   45.4   13.2   52.1   3.85     4.00
Human        -      -      -      -      4.80     4.70
Data augmentation:
Paraphrase   18.2   12.5   10.6   45.7   4.12     3.92
Claim Ext.   12.2   9.8    4.0    46.4   1.58     2.84
M. Concat    22.1   71.6   6.8    22.3   4.45     2.05
Ours         34.4   33.0   26.0   47.5   4.14     3.98
Human        -      -      -      -      4.69     4.15
Table 1: Automatic and human evaluation results for our model's outputs on the fact update task (top) and the data augmentation task (bottom). The SARI, Keep, Add and Del columns are automatic evaluation: the geometric SARI score and the three F1 scores that construct it. The Grammar and Agreement columns are human scores on a 1-5 Likert scale for grammaticality of the output sentence and agreement with the given claim.

4.2 Implementation Details

Masker

We implemented the masker using the AllenNLP framework (Gardner et al., 2018). For the neutrality classifier, we train an ESIM model (Chen et al., 2017) to classify a relation of agree, disagree or neutral. To train this classifier, we use the agreeing and disagreeing pairs from the FEVER dataset, and for each claim we add a neutral sentence sampled from the sentences in the same document as the polarizing one. The classifier and masker are trained with GloVe (Pennington et al., 2014) word embeddings. We use BiLSTM (Sak et al., 2014) encoders with a hidden dimension of 100 and share the parameters of the claim and original sentence encoders. The model is trained for up to 100 epochs with a patience value of 10, where the stopping condition is defined as the highest delta between accuracy and deletion size on the development set.

For syntactic guidance, we use the constituency parser of Stern et al. (2017) and consider contiguous spans of length 2 to 10 as masking candidates (without combinations). By doing so, we obtain valid neutrality masks for 38% of the agreeing and disagreeing pairs from the FEVER training dataset. These masks are used for Eq. 8.

Two-Encoder Pointer Generator

We implemented our proposed multi-sequence-to-sequence model based on the pointer-generator framework (https://github.com/atulkum/pointer_summarizer). We use a one-layer BiLSTM for encoding and decoding with a hidden dimension of 256. The parameters of the two encoders are shared. The model is trained with batches of size 64 for a total of 50K steps.

BERT Fact-Checking Classifier

We use a BERT (Devlin et al., 2019) classifier, which takes as input a claim-evidence pair separated by a special token, to predict one of three labels (agree, disagree or neutral). The model is fine-tuned for 3 epochs, which is sufficient to perform well on the task.
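
A minimal sketch of such a pair classifier with the HuggingFace transformers library is shown below; the paper does not specify this exact implementation, and the model name and example sentences here are illustrative.

import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3)  # agree / disagree / neutral

# The claim and evidence are packed into one input, separated by [SEP].
inputs = tokenizer("Rio's sequel was released on April 11, 2014.",
                   "A sequel, Rio 2, was released on April 11, 2014.",
                   return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
predicted_relation = logits.argmax(dim=-1)  # index of the predicted label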

Evidence Regeneration

Since we are interested in using the generated supporting pairs for data augmentation, we add the machine-generated cases to the "supports" set of the dataset. Adding machine-generated sentences to only one of the labels in the data can be ineffective. Therefore, we balance this by regenerating paraphrased refuting evidence for the false claims. This is then added, along with all models' outputs, for a balanced augmentation.

4.3 Baselines

We consider the following baselines for constructing a fact-guided updated sentence:

  • Copy Claim The claim sentence itself is copied and used as the updated sentence (used only for data augmentation).

  • Paraphrase The claim is paraphrased using the back-translation method of Wieting and Gimpel (2018) (https://github.com/vsuthichai/paraphraser), and the output is used as the updated sentence.

  • Claim Extension [Claim Ext.] A pointer-generator network is trained to generate the updated sentence from the input claim alone. The model is trained on FEVER's agreeing pairs and applied to the guiding claims during inference.

  • Masked Concatenation [M. Concat] Instead of our Two-Encoder Generator, we use a pointer-generator network. The residual sentence (output from the masker module) and the claim are concatenated and used as input.

5 Results

We report the performance of the model outputs for automatic fact updates by comparing them to the corresponding correct Wikipedia sentences. We also have crowd-workers score the outputs on grammar and on agreement with the claim. Additionally, we report the results of a fact-checking classifier trained with model outputs on the FEVER training set as data augmentation.

Model Dev Test +Turk
No Augmentation 62.7 66.1 77.0
Paraphrase 60.8 64.6 77.4
Copy Claim 62.1 63.6 77.4
Claim Ext. 62.5 65.0 76.8
M. Concat 60.1 63.7 78.5
Ours 63.8 67.8 80.0
Table 2: Classifiers’ accuracy on the symmetric Dev and Test splits. The right column (+Turk) shows the accuracy on the Test set extended to include the 500 responses of turkers for the simulated process and the refuted pairs that they originated from. The BERT classifiers were trained on the FEVER training dataset augmented by outputs of the different methods.

Fact Updates

Following recent text simplification work, we use the SARI (Xu et al., 2016) metric. SARI takes three inputs: (i) the original sentence, (ii) the human-written updated sentence and (iii) the model output. It measures the similarity of the machine-generated and human reference sentences based on the n-grams deleted, added and kept (we use the default setting of up to 4-grams) with respect to the original sentence. Following Geva et al. (2019), we use the F1 measure for all three sets, including deletions. The final SARI score is the geometric mean of the Add, Del and Keep scores. For human evaluation of the model's outputs, 20% of the evaluation dataset was used. Crowd-workers were provided with the model outputs and the corresponding supposedly consistent claims. They were instructed to score the model outputs from 1 to 5 (1 being the poorest and 5 the highest) on grammaticality and agreement with the claim.
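
For intuition, the sketch below computes a simplified unigram-only variant of SARI; the actual metric aggregates n-grams up to length 4 and supports multiple references, so this is an illustrative approximation rather than the evaluation code used in the paper.

def f1(prec, rec):
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def sari_unigram(original, output, reference):
    # original, output, reference: sets of unigrams for the outdated
    # sentence, the system output and the human-written update.
    # Keep: original words retained by both the output and the reference.
    keep = f1(len(original & output & reference) / max(len(original & output), 1),
              len(original & output & reference) / max(len(original & reference), 1))
    # Add: new words introduced by the output that the reference also adds.
    add = f1(len((output - original) & reference) / max(len(output - original), 1),
             len((output - original) & reference) / max(len(reference - original), 1))
    # Del: original words dropped by both the output and the reference.
    dele = f1(len(original - output - reference) / max(len(original - output), 1),
              len(original - output - reference) / max(len(original - reference), 1))
    return (keep * add * dele) ** (1 / 3)  # geometric mean of the three F1s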

Table 1 reports the automatic and human evaluation results. Our model achieves the highest SARI score, showing that it is the closest to the human references in modifying the text for the corresponding tasks. Humans also score our outputs the highest for consistency with the claim, an essential criterion for our task. In addition, the outputs are more grammatically sound compared to those from other methods.

Examining the gold answers, we notice that many of them include very minimal and local modifications, keeping much of the original sentence. The M. Concat model keeps most of the original sentence as is, even at the cost of being inconsistent with the claim. This corresponds to a high Keep score but a lower SARI score overall, and a low human score on supporting the claim. Claim Ext. and Paraphrase do not maintain the structure of the original sentence, and perform poorly on Keep, leading to a low SARI score.

Data Augmentation

For each of the 41,850 refuting pairs in the FEVER training data, our method generates a synthetic supporting evidence sentence, leading to 41,850 new supporting pairs. We train the BERT fact-checking classifier with this augmented data and report the performance on the symmetric dataset in Table 2. In addition, we repeat the human evaluation process on the generated augmentation pairs and report it in Table 1.

Our method's outputs are effective for augmentation, outperforming a classifier trained only on the original biased training data by an absolute 1.7% on the Test set and an absolute 3.0% on the +Turk set. The outputs of the Paraphrase and Copy Claim baselines are not Wikipedia-like, making them ineffective for augmentation. All the baseline approaches augment the false claims with supporting evidence. However, the success of our method in producing supporting evidence while maintaining a Wikipedia-like structure leads to more effective augmentations.

Masker Analysis

λ     Acc    Size   Δ     Prec   Rec    F1
0.5   5.1    0.0    5     0.0    0.0    0.0
0.4   80.0   26.3   54    27.2   75.1   39.9
0.3   77.0   27.5   50    25.9   71.6   38.0
0.2   81.6   31.1   51    23.1   74.8   35.3
Table 3: Results of different values of the regularization coefficient λ for the masker with syntactic regularization. The left columns report the neutrality accuracy (Acc), the average mask size as a percentage of the sentence (Size), and their difference (Δ = Acc − Size) over the FEVER development set with the masked evidence and a neutral target label. The right three columns contain the precision, recall and F1 of the masks against the pairs for which we have human annotations. For results without syntactic regularization, see the appendix.

To evaluate the performance of the masker model, we test its capacity to modify agreeing and disagreeing pairs from the FEVER development set to a neutral relation. We measure the accuracy of the pretrained classifier in predicting neutral versus the percentage of masked words in the sentence. For a finer evaluation, we manually annotated 75 agreeing and 76 disagreeing pairs with the minimal required mask for neutrality, and compute the per-token F1 score of the masker against them.

The results for different values of the regularization coefficient λ are reported in Table 3. Increasing the regularization coefficient helps to minimize the mask size and to improve the precision while maintaining the classifier accuracy and the mask recall. However, setting λ too large can collapse the solution to no masking at all.

6 Conclusion

In this paper, we introduce the task of automatic fact-guided sentence modification. Given a claim and an outdated sentence, we learn to rewrite the sentence to produce its updated version. Our method overcomes the challenges of this conditional generation task by breaking it into two steps. First, we identify the polarizing components in the original sentence and mask them. Then, using the residual sentence and the claim, we generate a new sentence which is consistent with the claim. Applied to a Wikipedia fact update evaluation set, our method successfully generates correct Wikipedia sentences from the guiding claims. Our method can also be used for data augmentation, alleviating the bias in fact verification datasets without any external data and reducing the relative error by 13%.

7 Acknowledgments

We thank the MIT NLP group for their helpful discussion and comments. This work is supported by DSO grant DSOCL18002.

References

  • Y. Bao, S. Chang, M. Yu, and R. Barzilay (2018) Deriving machine attention from human rationales. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 1903–1913.
  • R. Barzilay and K. R. McKeown (2005) Sentence fusion for multidocument news summarization. Computational Linguistics 31(3), pp. 297–328.
  • A. Cahill, N. Madnani, J. Tetreault, and D. Napolitano (2013) Robust systems for preposition error correction using Wikipedia revisions. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Atlanta, Georgia, pp. 507–517.
  • Q. Chen, X. Zhu, Z. Ling, S. Wei, H. Jiang, and D. Inkpen (2017) Enhanced LSTM for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1657–1668.
  • W. Chen, H. Wachsmuth, K. Al Khatib, and B. Stein (2018) Learning to flip the bias of news headlines. In Proceedings of the 11th International Conference on Natural Language Generation, Tilburg University, The Netherlands, pp. 79–88.
  • J. Daxenberger and I. Gurevych (2013) Automatically classifying edit categories in Wikipedia revisions. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Washington, USA, pp. 578–589.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186.
  • M. Faruqui, E. Pavlick, I. Tenney, and D. Das (2018) WikiAtomicEdits: a multilingual corpus of Wikipedia edits for modeling language and discourse. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 305–315.
  • M. Gardner, J. Grus, M. Neumann, O. Tafjord, P. Dasigi, N. F. Liu, M. Peters, M. Schmitz, and L. Zettlemoyer (2018) AllenNLP: a deep semantic natural language processing platform. In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), Melbourne, Australia, pp. 1–6.
  • M. Geva, E. Malmi, I. Szpektor, and J. Berant (2019) DiscoFuse: a large-scale dataset for discourse-based sentence fusion. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, Minnesota, pp. 3443–3455.
  • M. Iyyer, J. Wieting, K. Gimpel, and L. Zettlemoyer (2018) Adversarial example generation with syntactically controlled paraphrase networks. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 1875–1885.
  • J. Kim, J. Jun, and B. Zhang (2018) Bilinear attention networks. In Advances in Neural Information Processing Systems, pp. 1564–1574.
  • S. Kobayashi (2018) Contextual augmentation: data augmentation by words with paradigmatic relations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 452–457.
  • T. Lei, R. Barzilay, and T. Jaakkola (2016) Rationalizing neural predictions. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 107–117.
  • J. Li, R. Jia, H. He, and P. Liang (2018) Delete, retrieve, generate: a simple approach to sentiment and style transfer. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 1865–1874.
  • A. Max and G. Wisniewski (2010) Mining naturally-occurring corrections and paraphrases from Wikipedia's revision history. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC), Valletta, Malta.
  • S. Narayan, C. Gardent, S. B. Cohen, and A. Shimorina (2017) Split and rephrase. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 606–616.
  • J. Pennington, R. Socher, and C. Manning (2014) GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1532–1543.
  • L. Perez and J. Wang (2017) The effectiveness of data augmentation in image classification using deep learning. arXiv preprint.
  • H. Rashkin, E. Choi, J. Y. Jang, S. Volkova, and Y. Choi (2017) Truth of varying shades: analyzing language in fake news and political fact-checking. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 2931–2937.
  • H. Sak, A. Senior, and F. Beaufays (2014) Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In Fifteenth Annual Conference of the International Speech Communication Association.
  • T. Schuster, D. J. Shah, Y. J. S. Yeo, D. Filizzola, E. Santus, and R. Barzilay (2019) Towards debiasing fact verification models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China.
  • A. See, P. J. Liu, and C. D. Manning (2017) Get to the point: summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 1073–1083.
  • T. Shen, T. Lei, R. Barzilay, and T. Jaakkola (2017) Style transfer from non-parallel text by cross-alignment. In Advances in Neural Information Processing Systems, pp. 6830–6841.
  • M. Stern, J. Andreas, and D. Klein (2017) A minimal span-based neural constituency parser. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 818–827.
  • J. Thorne, A. Vlachos, C. Christodoulopoulos, and A. Mittal (2018) FEVER: a large-scale dataset for fact extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 809–819.
  • A. Vlachos and S. Riedel (2014) Fact checking: task definition and dataset construction. In Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science, Baltimore, MD, USA, pp. 18–22.
  • W. Y. Wang (2017) "Liar, liar pants on fire": a new benchmark dataset for fake news detection. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Vancouver, Canada, pp. 422–426.
  • J. Wieting and K. Gimpel (2018) ParaNMT-50M: pushing the limits of paraphrastic sentence embeddings with millions of machine translations. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 451–462.
  • X. Wu, S. Lv, L. Zang, J. Han, and S. Hu (2018) Conditional BERT contextual augmentation. arXiv preprint.
  • W. Xu, C. Napoles, E. Pavlick, Q. Chen, and C. Callison-Burch (2016) Optimizing statistical machine translation for text simplification. Transactions of the Association for Computational Linguistics 4, pp. 401–415.
  • D. Yang, A. Halfaker, R. Kraut, and E. Hovy (2017) Identifying semantic edit intentions from revisions in Wikipedia. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 2000–2010.
  • M. Yatskar, B. Pang, C. Danescu-Niculescu-Mizil, and L. Lee (2010) For the sake of simplicity: unsupervised extraction of lexical simplifications from Wikipedia. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Los Angeles, California, pp. 365–368.
  • Y. Zhang, J. Baldridge, and L. He (2019) PAWS: paraphrase adversaries from word scrambling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 1298–1308.
  • Z. Zhang, S. Ren, S. Liu, J. Wang, P. Chen, M. Li, M. Zhou, and E. Chen (2018) Style transfer as unsupervised machine translation. arXiv preprint.

Appendix A Additional Masker analysis

The masker model makes finding a valid mask in the space of options tractable. However, as mentioned in Bao et al. (2018), training an objective of the type shown in Eq. 7 is unstable. An alternative tractable approach is to enumerate a set of syntactic components of the evidence and score them as potential masks for neutrality. Although this approach is insufficient and might not always work, the cases where the continuous spans satisfy neutrality can help guide the masker training.

Table 4 shows results for the masker model with and without syntactic regularization. The syntactic regularization helps to stabilize the performance, allowing a reasonable solution even without any additional constraint on the mask size. Without syntactic regularization, better accuracy can be achieved, but the learning is very unstable and can lead to solutions that mask the whole sentence or keep it as is.

λ     Acc    Size   Δ     Prec   Rec    F1
With syntactic regularization:
0.5   5.1    0.0    5     0.0    0.0    0.0
0.4   80.0   26.3   54    27.2   75.1   39.9
0.3   77.0   27.5   50    25.9   71.6   38.0
0.2   81.6   31.1   51    23.1   74.8   35.3
0.1   80.5   34.7   46    21.9   77.8   34.2
0     80.0   37.1   43    22.6   81.7   35.5
Without syntactic regularization:
0.5   5.1    0.0    5     0.0    0.0    0.0
0.4   87.8   25.0   63    25.9   68.5   37.6
0.3   5.1    0.0    5     0.0    0.0    0.0
0.2   90.1   35.0   55    22.4   78.7   34.8
0.1   91.2   48.9   42    17.0   85.3   28.4
0     91.6   100    -8    9.3    100    17.1
Table 4: Results of different values of the regularization coefficient λ for the masker with and without syntactic guidance. The left columns report the neutrality accuracy (Acc), the average mask size as a percentage of the sentence (Size), and their difference (Δ = Acc − Size) over the FEVER development set with the masked evidence and a neutral target label. The right three columns contain the precision, recall and F1 of the masks against the pairs for which we have human annotations.

Appendix B Example Outputs

Original Text Born in Lawton , Oklahoma and raised in Anaheim , California , Hillenburg became fascinated with the sky as a child and also developed an interest in art .
Claim Stephen Hillenburg was fascinated with the ocean as a child .
Claim Ext. He in Huntington , Trinidad City Tommy in the , Hillenburg developed he became the of the stage , a senior . business in the adopted in 1847 .
Concat Born in Lawton , Oklahoma and raised in Anaheim Anaheim , , Hillenburg became fascinated with the sky as a child and also developed an interest in art .
M. Concat Born in Lawton , Oklahoma and raised in Anaheim , California , Hillenburg became the with the United as the condition and also developed an interest in art .
Ours Born in Lawton , Oklahoma and raised in Anaheim , California , Hillenburg became fascinated with the ocean as a child and also developed an interest in art .
Original Text German Startups Group considers 28 of their 42 minority stakeholdings in operationally active companies to be of particular significance to the group.
Claim It considers 23 of 43 minority stakeholdings to be significant .
Claim Ext. The - soon are the days the eighth capital , is the spending , , find divided active by ’s the original ,
Concat German Startups Group considers 28 of their their minority stakeholdings in operationally active companies to be of particular significance to the group .
M. Concat German Startups Group considers 23 of 18 minority million ‘ in operationally active companies to be of particular significance to the group .
Ours German Startups Group considers 23 of 43 minority stakeholdings beginning in operationally active companies to be of particular significance to the group .
Original Text A sequel , Rio 2 , was released on April 11 , 2012 .
Claim Rio ’s sequel was released on April 11 , 2014 .
Claim Ext. In series , Rio is is is released on January 4 , 2014 ,
Concat A sequel , Rio Rio 2 , was released on April 11 , 2012
M. Concat A sequel , Rio 2 , was released on August 11 , 2014 .
Ours A sequel , Rio 2 , was released on April 11 , 2014 .
Original Text Albert S. Ruddy -LRB- born March 28 , 1940 -RRB- is a Canadian - born film and television producer .
Claim In 1930, Albert S. Ruddy is born.
Claim Ext. Albert S. S. -LRB- -LSB- Hiram 23 , 1939 -RRB- is an former actor born theoretical marketer American . .
Concat Albert S. Ruddy -LRB- born March March , , 1940 -RRB- is a Canadian - born film and television producer
M. Concat Albert S. Ruddy -LRB- born Hiram 12 , 1930 -RRB- is a German - American film and television producer .
Ours Albert S. Ruddy -LRB- born December 18 , 1930 -RRB- is a Chinese - born film and television producer .
Table 5: We compare our model outputs against different models. Each example shows the two input sentences followed by the output of each model. The Concat model setting is similar to M. Concat, but the original text is left unmasked. For the Claim Ext. model, only the claim sentence is given as input.

Examples of outputs from different models are provided in Table 5. For the first three examples, our model produces a perfect update. In the last example, even though our model gets the year 1930 correct, it modifies the month and nationality to made-up, incorrect values. This is a result of an overly aggressive deletion by the masker. The Claim Ext. model typically produces wrong and ungrammatical sentences. The Concat model doesn't capture the polarizing relation between the two inputs and mostly ignores the claim. The M. Concat model tends to generate made-up content instead of copying it from the claim.
