Towards a Unified Natural Language Inference Framework to Evaluate Sentence Representations


Adam Poliak     Aparajita Haldar     Rachel Rudinger     J. Edward Hu
Ellie Pavlick     Aaron Steven White    Benjamin Van Durme
Johns Hopkins University, BITS Pilani, Goa Campus, India
Brown University, University of Rochester

We present a large-scale unified natural language inference (NLI) dataset for providing insight into how well sentence representations capture distinct types of reasoning. We generate this dataset by recasting 11 existing datasets from 7 different semantic tasks. We use our dataset of approximately half a million context-hypothesis pairs to test how well sentence encoders capture distinct semantic phenomena that are necessary for general language understanding. Some phenomena that we consider are event factuality, named entity recognition, figurative language, gendered anaphora resolution, and sentiment analysis, extending prior work that included semantic roles and frame semantic parsing. Our dataset will be made publicly available and will grow over time as additional resources are recast.

1 Introduction

How well do sentence representation learning models capture semantic phenomena necessary for general natural language understanding? Can models trained to make inferences based on natural language perform intelligent reasoning? Do competitive models for popular natural language inference (NLI) datasets carry out human-like reasoning? (NLI is the task of determining whether a human would likely infer a textual hypothesis from a context, or premise; Dagan et al., 2006, 2013.) For example, can these models determine whether an event occurred, correctly differentiate between figurative and literal language, or accurately identify and categorize named entities?

We introduce a large-scale NLI dataset to help answer these questions. Our dataset contains context-hypothesis pairs automatically labeled by recasting semantic annotations from structured prediction tasks into NLI sentence pairs. We extend prior work on challenge datasets, e.g. Zhang et al. (2017) and White et al. (2017). We recast annotations from a total of 11 datasets across 7 NLP tasks into labeled NLI examples. The tasks include event factuality, named entity recognition, gendered anaphora resolution, sentiment analysis, relation extraction, pun detection, and lexico-syntactic inference. Currently, our dataset contains approximately half a million labeled sentence pairs. Table 1 includes a sample of NLI pairs that test specific types of reasoning.

Our experiments demonstrate how this dataset can be used to probe how well a model captures different types of semantic reasoning necessary for general natural language understanding (NLU). As a baseline, we use an NLI model with access to just the hypothesis sentences, since such a model has recently been shown to be a strong baseline on many NLI datasets and may be used to detect statistical irregularities within class labels (Poliak et al., 2018b); by design, it cannot reason about any relationship between a pair of texts. Our recast dataset, trained baseline models, and leaderboard will be publicly available, and the site will be continuously updated as we recast more datasets.

Semantic Phenomena: example NLI pairs (C = context, H = hypothesis)

Event Factuality
  ✓ C: Find him before he finds the dog food  H: The finding did not happen
  ✗ C: I'll need to ponder  H: The pondering happened

Named Entity Recognition
  ✓ C: She travels to Liberia on Sunday  H: Liberia is a location
  ✗ C: She travels to Liberia on Sunday  H: Liberia is a day of the week

Gendered Anaphora
  ✓ C: The chef came out to apologize to the guest who was unhappy with his/her dinner.  H: The guest was having dinner.
  ✗ C: The paramedic performed CPR on the passenger even though he/she was already dead.  H: The paramedic was already dead.

Lexicosyntactic Inference
  ✓ C: a particular person remembered to do a particular thing  H: that person did that thing
  ✗ C: a particular person wished to have a particular thing  H: that person had that thing

Relation Extraction
  ✓ C: Ward went with Ledger to their native Perth.  H: Ward was born in Perth.
  ✗ C: Stefan had visited his son in Bulgaria.  H: Stefan was born in Bulgaria.

Sentiment Analysis
  ✓ C: When asked about the restaurant, Liam said, "The food came out at a good pace."  H: Liam liked the restaurant
  ✗ C: When asked about the movie, Eyana said, "The character developments lacked depth."  H: Eyana liked the movie

Puns
  ✓ C: Kim heard masks have no face value  H: Kim heard a pun
  ✗ C: Tod heard that thrift is better than an annuity  H: Tod heard a pun

Thematic Roles
  ✓ C: I searched in the cave for treasure.  H: The cave was searched.
  ✗ C: John informed me of the situation.  H: John received information.

Table 1: Example sentence pairs for the different semantic phenomena. ✓ marks pairs where the context entails the hypothesis; ✗ marks pairs where the context does not entail the hypothesis. Examples here are slightly modified for aesthetic reasons.

2 Motivation & Background

2.1 Why automatic recasting?

Compared to other methods used to construct NLI datasets, we believe that recasting, i.e. leveraging existing annotations to automatically create labeled NLI pairs, can 1) help determine whether an NLU model performs distinct types of reasoning, 2) limit biases in NLI data, and 3) generate labeled NLI examples efficiently at large scales.

NLU insights

Recasting allows us to create targeted subsets of our NLI dataset that can be used as a benchmark to test a model's ability to perform distinct types of reasoning. For example, by recasting annotations focused on labeling puns, we can determine whether a model understands if a given text is figurative or literal.

Popular NLI datasets, e.g. the Stanford Natural Language Inference corpus (SNLI; Bowman et al., 2015) and its successor Multi-NLI (Williams et al., 2017), were created by eliciting hypotheses from humans. Crowdsourced workers were tasked with writing one sentence that is entailed by, one that is neutral with, and one that is contradicted by a given context extracted from the Flickr30k corpus (Young et al., 2014). Although these datasets are widely used to evaluate sentence representations, high accuracy on them is not indicative of what types of reasoning NLI models perform. Workers were free to create any type of hypothesis sentence as long as it was neutral, entailed, or contradicted with respect to the context. These datasets therefore cannot be used to determine how well an NLI model captures paraphrastic inference, complex anaphora resolution (White et al., 2017), or compositionality (Pavlick and Callison-Burch, 2016; Dasgupta et al., 2018), which are among the many desired capabilities of language understanding systems.

Limit biases

Recent studies demonstrate that many NLI datasets contain significant biases. For example, sentences in SNLI contain racial, gender, and age-related stereotypes (Rudinger et al., 2017). Also, statistical irregularities, and annotation artifacts, within class labels allow a model with access only to the hypotheses to significantly outperform the majority baseline (Poliak et al., 2018b; Gururangan et al., 2018). Class label biases may be attributed to the human-elicited protocol used to generate hypotheses. The human-elicitation protocol can also cause inconsistencies in how different NLI examples deal with some linguistic phenomena, e.g. questions and person (Appendix A further discusses this issue).

We limit some biases by not relying on humans to generate hypotheses. Recast NLI datasets may still contain some biases, e.g. non-uniform distributions over NLI labels caused by the distribution of labels in the original dataset that we recast. (This does not invalidate the recast dataset; for example, in a corpus annotated with part-of-speech tags, the distribution of labels for the word "the" will likely peak at the Det tag.) Experimental results using Poliak et al. (2018b)'s hypothesis-only model indicate to what degree the recast datasets retain biases that may be present in the original semantic datasets.

Large-scale NLI costs

Generating NLI datasets from scratch is costly. Humans must be paid to generate or label natural language text, so costs scale linearly with the number of generated NLI pairs. Existing annotations for a wide array of semantic NLP tasks are freely available. By leveraging such datasets, we can automatically generate and label NLI pairs at little cost.

2.2 Why these semantic phenomena?

A long-term goal is to develop NLU systems that can achieve human levels of understanding and reasoning. Investigating how different architectures and training corpora can help a system perform human-level general NLU is an important step in this direction. Our dataset contains recast NLI pairs that are easily understandable by humans and can be used to evaluate different sentence encoders and NLU systems. These semantic phenomena cover distinct types of reasoning that an NLU system may often encounter in the wild. While higher performance on these benchmarks might not be conclusive proof of a system achieving human-level reasoning, a system that performs poorly should not be viewed as performing human-level NLU. We argue that these semantic phenomena, e.g. event factuality, figurative language, lexical entailment, and presupposition, play integral roles in NLU, while acknowledging that understanding other semantic phenomena is also integral to NLU.

3 Recasting Semantic Phenomena

We recast 7 semantic phenomena from a total of 11 datasets. Here we describe efforts to recast event factuality, named entity recognition, figurative language, sentiment analysis, and relation extraction into labeled NLI examples. We are currently recasting more semantic phenomena from other datasets, e.g. thematic roles in VerbNet (Schuler, 2005), and will release them online as they become available. These 7 semantic phenomena represent distinct types of reasoning and inferences that play a key role in general NLU.

Many of the recasting methods rely on simple templates that do not include the nuances and variance typical of natural language, e.g. the event factuality templates in Section 3.1. This allows us to specifically test how sentence representations capture distinct types of reasoning. When recasting, we preserve each dataset's train/development/test split. If a dataset does not contain such a split, we randomly sample pairs to create a split with an 80:10:10 ratio. Table 2 reports statistics about each recast NLI dataset.
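The splitting step above can be sketched as follows (a minimal illustration; the function name and fixed seed are our own, not from the paper):

```python
import random

def split_pairs(pairs, seed=0):
    """Randomly partition labeled NLI pairs into train/dev/test splits
    with an 80:10:10 ratio (used only when the original dataset
    provides no split of its own)."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    n_train = int(0.8 * len(shuffled))
    n_dev = int(0.1 * len(shuffled))
    return {
        "train": shuffled[:n_train],
        "dev": shuffled[n_train:n_train + n_dev],
        "test": shuffled[n_train + n_dev:],
    }
```

Fixing the seed keeps the split reproducible across runs.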

3.1 Event Factuality (EF)

Event factuality prediction is the task of determining whether an event described in text actually occurred. Determining whether an event occurred enables accurate inferences based on the event (Rudinger et al., 2018b). For example, consider the following sentences:

(1) a. She walked a beagle.
    b. She walked a dog.
    c. She walked a brown beagle.

If the walking occurred, (1a) entails (1b) but not (1c). If we negate the action in (1a), (1b), and (1c) to respectively become:

(2) a. She did not walk a beagle.
    b. She did not walk a dog.
    c. She did not walk a brown beagle.

the new hypothesis (2c) is now entailed by the context (2a) while (2b) is not. In order to make monotonic inferences based on text that describes an event, a system must be able to latently predict whether the event happened. Incorporating factuality has been shown to improve natural language inference (Saurí and Pustejovsky, 2007).

We recast event factuality annotations from three existing datasets: UW (Lee et al., 2015), MEANTIME (Minard et al., 2016), and the JHU Universal Decompositional Semantics (Decomp) It Happened v2 (Rudinger et al., 2018b). (The last dataset provides larger coverage of White et al. (2016)'s event factuality annotations on top of the English Universal Dependencies treebank; Silveira et al., 2014.) Context sentences are simple sentences from the three datasets. We use the templates in (3) as hypotheses, such that each context is paired with two hypotheses:

(3) a. The Event happened.
    b. The Event did not happen.

If the predicate denoting the Event was annotated as having happened in the factuality dataset, the context paired with (3a) is labeled entailed and the context paired with (3b) is labeled not-entailed. If the Event was labeled as not having happened, we swap the labels.
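This labeling scheme reduces to a small function. The sketch below (with our own hypothetical names) pairs each annotated context with both template hypotheses:

```python
def recast_factuality(context, event, happened):
    """Recast one factuality annotation into two labeled NLI pairs:
    the 'happened' hypothesis is entailed iff the event occurred,
    and the negated hypothesis receives the opposite label."""
    positive = f"The {event} happened"
    negative = f"The {event} did not happen"
    return [
        (context, positive, "entailed" if happened else "not-entailed"),
        (context, negative, "not-entailed" if happened else "entailed"),
    ]
```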


| Annotations | Dataset | # pairs |
|---|---|---|
| Event Factuality | Universal Dependencies (Rudinger et al., 2018b) | 42K (41,888) |
| | MeanTime (Minard et al., 2016) | 0.7K (738) |
| | UW (Lee et al., 2015) | 5K (5,094) |
| Named Entity Recognition | Groningen (Bos et al., 2017a) | 260K (261,406) |
| | CoNLL (Tjong Kim Sang and De Meulder, 2003) | 60K (59,970) |
| Gendered Anaphora | Winogender (Rudinger et al., 2018a) | 0.4K (464) |
| Lexicosyntactic Inference | MegaAttitude (White and Rawlins, 2016, 2018) | 11K (11,814) |
| Relationship Extraction | FACC1 (Gabrilovich et al., 2013) | 30K |
| Sentiment Analysis | Kotzias et al. (2015) | 6K |
| Puns | Yang et al. (2015) | 7K (7,706) |
| | SemEval 2017 Task 7 (Miller et al., 2017) | 8K (8,054) |
| Combined | Unified NLI Framework | 433K (433,134) |
| | SNLI (Bowman et al., 2015) | 570K |
| | Multi-NLI (Williams et al., 2017) | 433K |

Table 2: Statistics summarizing the recast datasets. The 'Annotations' column refers to the original annotation that was recast, and the 'Combined' row refers to the combination of our recast datasets. The 'Dataset' column indicates the datasets that were recast, and the '# pairs' column reports how many labeled NLI pairs were extracted from the corresponding dataset (numbers in parentheses indicate the exact number of NLI pairs). We include the number of examples in Multi-NLI and SNLI in order to compare the scale of our Unified NLI Framework to popular NLI datasets.

3.2 Named Entity Recognition

Distinct types of entities have different properties and relational objects Prince (1978). Relational objects can be used to help infer facts from a given context. For example, if a system can detect that an entity is a name of a nation, then that entity likely has a leader, a language, and a culture Prince (1978); Van Durme (2010). When classifying NLI pairs, a model can determine if an object mentioned in the hypothesis can be a relational object typically associated with the type of entity described in the context. NER tags can also be directly used to determine if a hypothesis is not entailed by a context. If entities in contexts and hypotheses do not share NER tags, e.g. PERSON or ORGANIZATION, then the hypothesis may be unlikely to be entailed by the context. Pipeline approaches to NLI successfully incorporate this intuition Castillo and Alemany (2008); Sammons et al. (2009); Pakray et al. (2010).

When recasting NER annotations, we preserve the original sentences as premises and create hypotheses using the template in (4) (we ensure grammatical hypotheses by appropriately conjugating "is a" when needed):

(4) NP is a Label

To generate an entailed hypothesis we replace Label with the correct NER label of the NP and to generate a not-entailed hypothesis we replace it with an incorrect label. When creating not-entailed hypotheses, we choose the incorrect label from the prior distribution of NER tags for the given phrase. This prevents us from adding additional annotation artifacts besides any class-label statistical irregularities present in the original NER data.
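A minimal sketch of this sampling step (function and variable names are ours; the per-phrase tag counts stand in for the prior distribution computed from the original NER corpus):

```python
import random
from collections import Counter

def sample_incorrect_label(phrase, gold_label, tag_counts, rng=None):
    """Draw an incorrect NER label for `phrase`, weighted by the prior
    distribution of tags observed for that phrase, so that not-entailed
    hypotheses mirror the corpus statistics rather than adding artifacts."""
    rng = rng or random.Random(0)
    prior = Counter(tag_counts.get(phrase, {}))
    prior.pop(gold_label, None)  # never sample the correct tag
    if not prior:
        # phrase only ever seen with its gold tag: back off to all other tags
        others = sorted({t for tags in tag_counts.values() for t in tags} - {gold_label})
        return rng.choice(others)
    labels, weights = zip(*sorted(prior.items()))
    return rng.choices(labels, weights=weights)[0]
```

The back-off branch is our own assumption for phrases that only ever appear with their gold tag; the paper does not specify how that case is handled.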

We apply this recasting strategy to the Groningen Meaning Bank (Bos et al., 2017b) and the CoNLL-2003 NER Shared Task (Tjong Kim Sang and De Meulder, 2003). In total, we generate over 300K labeled context-hypothesis pairs from NER annotations (Table 2).

3.3 Gendered Anaphora Resolution

The ability to perform pronoun resolution is essential to language understanding, in many cases requiring common-sense reasoning about the world, as demonstrated by the Winograd Schema Challenge Levesque et al. (2012). White et al. (2017) show that this task can be directly recast as a natural language inference problem by transforming Winograd schemas into NLI sentence pairs; in the transformation, the pronoun in question is simply replaced with one of the two possible antecedents.

Using a formula similar to Winograd schemas, Rudinger et al. (2018a) introduce Winogender schemas, minimal sentence pairs that differ only by pronoun gender. (A similar resource for detecting gender bias in coreference resolution, WinoBias, was released by Zhao et al. (2018).) Using this adapted pronoun resolution task, Rudinger et al. (2018a) demonstrate the presence of systematic gender bias in coreference resolution systems. Here we recast Winogender schemas as an NLI task, introducing a potential method of detecting gender bias in NLI systems or sentence embeddings. In recasting, the context is the original, unmodified Winogender sentence; the hypothesis is a short, manually constructed sentence that entails one of the two possible pronoun resolutions.

3.4 Lexicosyntactic Inference

While many inferences in natural language are triggered by lexical items alone, there exist pervasive inferences that arise from interactions between lexical items and their syntactic contexts. This is particularly apparent among propositional attitude verbs – e.g. think, want, know – which display complex distributional profiles White and Rawlins (2016). For instance, the verb remember can take both finite clausal complements and infinitival clausal complements.


(5) a. Jo didn't remember that she ate.
    b. Jo didn't remember to eat.

This small change in syntactic structure gives rise to large changes in the inferences that are licensed: (5a) presupposes that Jo ate, while (5b) entails that Jo didn't eat.

We recast data collected by White and Rawlins (2018), which targets these sorts of lexicosyntactic interactions. White and Rawlins selected verbs from the MegaAttitude dataset (White and Rawlins, 2016) based on their acceptability in the [NP _ that S] and [NP was _ed that S] frames (a rating of 4 out of 7 or better; NP is always instantiated by someone, and S by a particular thing happened). They then asked annotators to answer questions of the form in (6) using three possible responses: yes, maybe or maybe not, and no (cp. Karttunen et al., 2014).

(6) a. Someone {knew, didn't know} that a particular thing happened.
    b. Did that thing happen?

We use the same procedure to annotate sentences containing verbs that take various types of infinitival complement: [NP _ for NP to VP], [NP _ to VP], [NP _ NP to VP], and [NP was _ed to VP]. (Here NP is always instantiated by someone, a particular person, or a particular thing; and VP by happen, do a particular thing, or have a particular thing.)

To recast these annotations, we treat sentences like (6a) as context sentences and assign each to the majority class – yes, maybe or maybe not, or no – across 10 different annotators, after applying an ordinal model-based normalization to those annotators' responses. We then pair each context sentence with three hypothesis sentences:

(7) a. That thing happened.
    b. That thing may or may not have happened.
    c. That thing didn't happen.

If the context sentence is assigned to yes, its pairing with (7a) is labeled entailed and the other pairings not-entailed; if it is assigned maybe or maybe not, its pairing with (7b) is labeled entailed and the others not-entailed; and if it is assigned no, its pairing with (7c) is labeled entailed and the others not-entailed. Training, development, and test split labels are randomly assigned by context sentence to every pair that context sentence appears in.
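The label assignment just described can be sketched as follows (names are ours; only the hypothesis matching the normalized majority response is entailed):

```python
# Template hypotheses keyed by the majority annotator response.
RESPONSE_TO_HYPOTHESIS = {
    "yes": "That thing happened.",
    "maybe or maybe not": "That thing may or may not have happened.",
    "no": "That thing didn't happen.",
}

def recast_lexicosyntactic(context, majority_response):
    """Pair a context with all three template hypotheses, labeling as
    entailed only the hypothesis matching the majority response."""
    return [
        (context, hyp, "entailed" if resp == majority_response else "not-entailed")
        for resp, hyp in RESPONSE_TO_HYPOTHESIS.items()
    ]
```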

3.5 Relation extraction (KG relations)

The goal of the relation extraction task is to infer the real-world relationships between pairs of entities from natural language text. The task is "grounded" in the sense that the input is natural language text and the output is tuples defined in the schema of some knowledge base. Relation extraction requires a system to understand many different surface forms which entail the same underlying relation, and to distinguish those from surface forms which involve the same entities but do not entail the relation of interest. For example, (8a) is entailed by both (8b) and (8c) but not by (8d).

(8) a. Name was born in Place
    b. Name is from Place
    c. Name (b. Year, Place)
    d. Name visited Place

Natural language surface forms are often used in relation extraction in a weak-supervision setting (Mintz et al., 2009; Hoffmann et al., 2011; Riedel et al., 2013). That is, if entity1 and entity2 are known to be related by relation, every observed sentence mentioning both entity1 and entity2 is assumed to be a realization of relation: i.e. (8d) would (falsely) be taken as evidence of the birthPlace relation. Therefore, in order to ensure high quality in our recast data, we have human annotators vet each generated context/hypothesis pair. Our procedure for recasting the data is described below.

Unlike the previous recasting methods, here we first generate hypotheses and then corresponding contexts. To generate our hypotheses, we begin with entity-relation triples extracted from DBpedia infoboxes, e.g. (Barack_Obama, birthPlace, Hawaii). In order to recast each tuple as a natural language string, we use the canonical forms of entities given by DBpedia, and generate a gloss for the relation as follows: for a given relation, we enumerate all pairs of entities appearing with that relation and compute, on aggregate, the natural language predicate most likely to relate entity1 to entity2 according to a large OpenIE database built using the OpenIE 5.0 software (Christensen et al., 2011; Pal et al., 2016; Saha et al., 2017). This translates the tuple above into the string "Barack Obama was born in Hawaii".
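A toy sketch of this gloss-selection step (the tiny in-memory dictionary stands in for the OpenIE 5.0 extraction database; all names are ours):

```python
from collections import Counter

def gloss_relation(entity_pairs, openie_predicates):
    """Choose, on aggregate, the predicate most often linking the entity
    pairs that hold a given relation, then render each pair as a
    natural-language hypothesis string."""
    counts = Counter()
    for pair in entity_pairs:
        counts.update(openie_predicates.get(pair, []))
    predicate = counts.most_common(1)[0][0]
    return {(e1, e2): f"{e1} {predicate} {e2}" for e1, e2 in entity_pairs}
```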

For each hypothesis generated as described above, we create a number of contexts, both true and false, as follows. We begin with the FACC1 corpus (Gabrilovich et al., 2013), which contains natural language sentences from ClueWeb in which entities have been automatically linked to disambiguated Freebase entities, when possible. Then, given a tuple (entity1, relation, entity2), we find every sentence which contains both entity1 and entity2 (this results in a dataset whose hypotheses appear multiple times, similar to SciTail; Khot et al., 2018). Since many of these sentences are false positives (see (8d)), we have human annotators vet each context/hypothesis pair, using the ordinal entailment scale described in Zhang et al. (2017). In total, we generate roughly 30K labeled context-hypothesis pairs using this process (Table 2), 38% of which are entailments and 62% of which are non-entailments.

3.6 Subjectivity

Some of the previously discussed semantic phenomena deal with objective information – did an event occur, or what type of entity does a specific name represent. Subjective information is often expressed differently than objective information (Wiebe et al., 2005), making it important to use different tests to probe whether an NLU system understands language that expresses subjective information. Therefore, we are interested in determining whether general NLU models capture 'subjective clues' that can help identify and understand emotions, opinions, and sentiment within a subjective text (Wilson et al., 2006), as opposed to differentiating between subjective and objective information (Yu and Hatzivassiloglou, 2003; Riloff et al., 2003).

We recast a sentiment analysis dataset since the task concerns the "expression of subjectivity as either a positive or negative opinion" (Taboada, 2016). We extract sentences from product, movie, and restaurant reviews labeled with positive or negative sentiment towards the reviewed item (Kotzias et al., 2015). NLI contexts (9a) and hypotheses (9b), (9c) are generated using the following templates:

(9) a. When asked about the Item, Name said "Review"
    b. Name liked the Item
    c. Name did not like the Item

where Item is the reviewed product, movie, or restaurant. We sample names from a distribution based on data provided by the U.S. Social Security Administration. If the original sentence was labeled as containing positive sentiment, the (9a)-(9b) pair is labeled entailed and the (9a)-(9c) pair not-entailed; otherwise, the labels are swapped. We use this method to generate roughly 6K examples (Table 2). A model performing well on this recast dataset may be able to sufficiently identify polarity in subjective text.
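The template filling reduces to a few string substitutions; a sketch under our own naming:

```python
def recast_sentiment(review, item, name, positive):
    """Recast one sentiment-labeled review sentence into two NLI pairs
    built from the context/hypothesis templates."""
    context = f'When asked about the {item}, {name} said, "{review}"'
    liked = f"{name} liked the {item}"
    disliked = f"{name} did not like the {item}"
    return [
        (context, liked, "entailed" if positive else "not-entailed"),
        (context, disliked, "not-entailed" if positive else "entailed"),
    ]
```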

| Model \ Recast Data | Factuality | NER | Puns | Sentiment | Lex | GAR | KG |
|---|---|---|---|---|---|---|---|
| Majority baseline | 50.00 | 50.00 | 50.00 | 50.00 | 66.67 | 50.00 | 50.00 |
| InferSent | 82.98 | 92.65 | 50.00 | 50.00 | 84.26 | – | 83.23 |
| InferSent (pre-train, update) | 92.55 | 90.08 | 84.67 | 85.19 | | – | 91.10 |
| InferSent (pre-train, fixed) | 16.74 | 45.14 | 24.26 | 29.50 | 34.60 | 47.63 | |
| Hyp-only | 69.14 | 91.40 | 50.00 | 50.00 | 78.00 | – | 86.46 |
| Hyp-only (pre-train, update) | 70.68 | 91.25 | 50.00 | 48.17 | 78.00 | – | |
| Hyp-only (pre-train, fixed) | 31.23 | 46.76 | 29.54 | 24.17 | 33.25 | 42.24 | 44.27 |

Table 3: NLI accuracies on test data for each targeted semantic phenomenon. Columns correspond to semantic phenomena and rows to models. 'Lex' refers to Lexicosyntactic Inference, 'GAR' to Gendered Anaphora Resolution, and 'KG' to Relation Extraction. (pre-train, fixed) refers to a model trained on Multi-NLI and then tested on these datasets; (pre-train, update) refers to a model pre-trained on Multi-NLI and then further trained on the corresponding row's train set. Dashes indicate scenarios that require training on the recast GAR data, which is too small to train on.

3.7 Figurative Language

Figurative language demonstrates natural language's expressiveness and wide variation. Understanding and recognizing figurative language "entail[s] cognitive capabilities to abstract and meta-represent meanings beyond physical words" (Reyes et al., 2012). Puns are prime examples of figurative language that may perplex general NLU systems, as they are one of the more regular uses of linguistic ambiguity (Binsted, 1996) and rely on a wide range of phonetic, morphological, syntactic, and semantic ambiguity (Pepicello and Green, 1984; Binsted, 1996; Bekinschtein et al., 2011). To understand a pun, an NLU system may need to know that two words (or phrases) with different spellings may sound the same, associate misspelled words with multiple real words, or understand how a polysemous word's senses may simultaneously be used in context. (For decades, solutions to NLP tasks have not been robust to figurative language, e.g. idioms in machine translation; Bar-Hillel, 1953; Santos, 1990; Arnold et al., 1993; Salton et al., 2014; Shao et al., 2017; Isabelle et al., 2017.)

We recast puns from Yang et al. (2015) (sentences containing puns, along with pun-free sentences from newswire and proverbs, each labeled as containing a pun or not) and Miller et al. (2017) (puns sampled from prior pun detection datasets (Miller and Gurevych, 2015; Miller and Turković, 2016) plus new examples generated from scratch for the shared task; labels denote whether a sentence contains a homographic pun, a heterographic pun, or no pun at all). We use the templates in (10) to generate contexts (10a) and hypotheses (10b), (10c). Each context appears in two labeled NLI pairs.

(10) a. Name heard that Pun
     b. Name heard a pun
     c. Name did not hear a pun

For the contexts, we insert the original pun into Pun. We use the same method as previously described to insert names into the template. If the original sentence was labeled as containing a pun, the (10a)-(10b) pair is labeled entailed and the (10a)-(10c) pair not-entailed; otherwise we swap the NLI labels. In total, we generate roughly 16K labeled pairs (Table 2).

4 Experiments

Our experiments demonstrate how these recast datasets can be used to evaluate how well models capture different types of semantic reasoning necessary for general language understanding. We also include results from a model with access to just hypotheses as a strong baseline. This may indicate whether the recast datasets retain statistical irregularities within class labels from the original, task-specific annotations.

4.1 Models

To demonstrate how well an NLI model performs these fine-grained types of reasoning, we use InferSent (Conneau et al., 2017). Our hypothesis-only model is a modified version of InferSent that only accesses hypotheses (Poliak et al., 2018b); it determines whether a model can perform well on each recast NLI dataset without ever considering the context. We use the same training settings and hyper-parameters as described in Conneau et al. (2017) and Poliak et al. (2018b).

4.2 Results

Table 3 reports the models' accuracies across the recast NLI datasets. We use each model under three different scenarios. In the first two scenarios, we train and test a model separately on each recast dataset; the model's parameters are either initialized randomly or with parameters extracted from a model pre-trained on Multi-NLI (pre-train, update). In the third scenario, we train a model only on Multi-NLI and test it on each semantic phenomenon (pre-train, fixed). Because of the small size of the recast Gendered Anaphora Resolution dataset, we do not use it for training.

Evaluating NLI models

When evaluating NLI models, our baseline is the maximum of the accuracies of the hypothesis-only model and of always predicting the majority class label. The hypothesis-only model can demonstrate how likely an NLI label applies to a hypothesis, regardless of its context. Our strong baseline indicates how well each recast dataset tests a model's ability to perform each specific type of reasoning when determining whether a context entails a hypothesis. Our results suggest that InferSent captures some semantic phenomena better than others. InferSent seems to learn the most about determining whether a described event occurred, since the difference between its accuracy and the baseline is largest on the recast Event Factuality dataset compared to the other recast datasets.
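Concretely, the per-dataset baseline is just the larger of two accuracies (a minimal sketch with our own names):

```python
def nli_baseline(hyp_only_accuracy, label_counts):
    """Strong baseline for a recast dataset: the better of the
    hypothesis-only model's accuracy and the accuracy of always
    predicting the most frequent class label."""
    majority_accuracy = max(label_counts.values()) / sum(label_counts.values())
    return max(hyp_only_accuracy, majority_accuracy)
```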

The high hypothesis-only accuracy on the recast NER dataset may demonstrate that the hypothesis-only model is able to detect that the distribution of class labels for a given word may be peaky. Based on this, we may need to consider different methods to recast NER annotations into labeled NLI examples, or limit the size of the dataset’s training partition.

Models trained on Multi-NLI

Williams et al. (2017) argue that Multi-NLI makes "it possible to evaluate systems on nearly the full complexity of the language." However, how well does Multi-NLI test a model's ability to understand distinct semantic phenomena that often occur in language? We posit that if a model trained on, and performing well on, Multi-NLI does not perform well on our recast datasets, then Multi-NLI might not evaluate a model's ability to understand the "full complexity" of language as thought.

When trained on Multi-NLI, our InferSent model achieves an accuracy of 70.22% on Multi-NLI. When we test the model on the recast datasets, we see significant drops, as it performs below every majority baseline (InferSent (pre-train, fixed) in Table 3). This might suggest that Multi-NLI does not evaluate whether sentence representations capture these distinct semantic phenomena. Multi-NLI's train set contains text from five domains: fiction, government, slate, telephone, and travel. We would expect the fiction section, and especially its humor subset, to contain some figurative language that might be similar to puns, and the travel guides (and possibly telephone conversations) to contain text related to sentiment.

Initializing with pre-trained parameters

We would like to know whether initializing models with parameters trained on Multi-NLI improves scores. On three of the recast datasets, there is minimal difference when using a pre-trained model, while we notice a large improvement on the Puns and Sentiment recast datasets.

5 Related Work

Targeted Tests for Natural Language Understanding

We follow a long line of work focused on building datasets to test how well NLU systems perform distinct types of semantic reasoning. FraCaS uses a limited number of sentence pairs to test whether systems understand semantic phenomena such as generalized quantifiers, temporal references, and (nominal) anaphora Cooper et al. (1996). FraCaS cannot be used to train neural models, as it includes just a small number of high-quality instances manually created by linguists. \newcitemaccartney2009natural created the FraCaS textual inference test suite by automatically “convert[ing] each FraCaS question into a declarative hypothesis.” \newcitelevesque2012winograd’s Winograd Schema Challenge forces a model to choose between two possible answers to a question based on a sentence describing an event. Additionally, public benchmarks exist that test whether RTE models can handle adjective-noun composition Pavlick and Callison-Burch (2016), other types of composition Dasgupta et al. (2018), paraphrastic inference, anaphora resolution, and semantic proto-roles White et al. (2017).

Exploring what linguistic phenomena neural models learn

Many tests have been used to probe how well neural models learn different linguistic phenomena. \newcitelinzen2016assessing use “number agreement in English subject-verb dependencies” to show that LSTMs learn about syntax-sensitive dependencies. In addition to syntax Shi et al. (2016), researchers have used other labeling tasks to investigate whether neural machine translation (NMT) models learn other types of linguistic phenomena Belinkov et al. (2017b, a); Dalvi et al. (2017); Marvin and Koehn (2018). Recently, \newcitepoliakNAACL18 argued for using \newcitewhite-EtAl:2017:I17-1’s recast NLI datasets to investigate whether NMT encoders capture semantic phenomena.

6 Conclusion

We have described how we recast a wide range of semantic phenomena from many NLP datasets into labeled NLI sentence pairs. These examples serve as a unified NLI framework that may help diagnose whether NLU models capture and perform distinct types of reasoning. Our experiments demonstrate how this framework can be used as an NLU benchmark. Our dataset is actively growing as we continue recasting more datasets into labeled NLI examples. The dataset, along with baselines, trained models, and a leaderboard, will be publicly available at


We thank Diyi Yang for help with the PunsOfTheDay dataset and Tongfei Chen for suggestions regarding designing the figures and tables.


  • Arnold et al. (1993) D.J. Arnold, Lorna Balkan, Siety Meijer, R.Lee Humphreys, and Louisa Sadler. 1993. Machine Translation: an Introductory Guide. Blackwells-NCC, London.
  • Bar-Hillel (1953) Yehoshua Bar-Hillel. 1953. Some linguistic problems connected with machine translation. Philosophy of Science, 20(3):217–225.
  • Bekinschtein et al. (2011) Tristan A Bekinschtein, Matthew H Davis, Jennifer M Rodd, and Adrian M Owen. 2011. Why clowns taste funny: the relationship between humor and semantic ambiguity. Journal of Neuroscience, 31(26):9665–9671.
  • Belinkov et al. (2017a) Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James Glass. 2017a. What do neural machine translation models learn about morphology? In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 861–872. Association for Computational Linguistics.
  • Belinkov et al. (2017b) Yonatan Belinkov, Lluís Màrquez, Hassan Sajjad, Nadir Durrani, Fahim Dalvi, and James Glass. 2017b. Evaluating layers of representation in neural machine translation on part-of-speech and semantic tagging tasks. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1–10, Taipei, Taiwan. Asian Federation of Natural Language Processing.
  • Binsted (1996) Kim Binsted. 1996. Machine humour: An implemented model of puns. Ph.D. thesis, University of Edinburgh, Edinburgh, Scotland.
  • Bos et al. (2017a) Johan Bos, Valerio Basile, Kilian Evang, Noortje Venhuizen, and Johannes Bjerva. 2017a. The groningen meaning bank. In Nancy Ide and James Pustejovsky, editors, Handbook of Linguistic Annotation, volume 2, pages 463–496. Springer.
  • Bos et al. (2017b) Johan Bos, Valerio Basile, Kilian Evang, Noortje J Venhuizen, and Johannes Bjerva. 2017b. The groningen meaning bank. In Handbook of Linguistic Annotation, pages 463–496. Springer.
  • Bowman et al. (2015) Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.
  • Castillo and Alemany (2008) Julio Javier Castillo and Laura Alonso Alemany. 2008. An approach using named entities for recognizing textual entailment. In Notebook Papers of the Text Analysis Conference, TAC Workshop.
  • Christensen et al. (2011) Janara Christensen, Stephen Soderland, Oren Etzioni, et al. 2011. An analysis of open information extraction based on semantic role labeling. In Proceedings of the sixth international conference on Knowledge capture, pages 113–120. ACM.
  • Conneau et al. (2017) Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670–680, Copenhagen, Denmark. Association for Computational Linguistics.
  • Cooper et al. (1996) Robin Cooper, Dick Crouch, Jan Van Eijck, Chris Fox, Johan Van Genabith, Jan Jaspars, Hans Kamp, David Milward, Manfred Pinkal, Massimo Poesio, et al. 1996. Using the framework. Technical report.
  • Dagan et al. (2006) Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The pascal recognising textual entailment challenge. In Machine learning challenges. evaluating predictive uncertainty, visual object classification, and recognising tectual entailment, pages 177–190. Springer.
  • Dagan et al. (2013) Ido Dagan, Dan Roth, Mark Sammons, and Fabio Massimo Zanzotto. 2013. Recognizing textual entailment: Models and applications. Synthesis Lectures on Human Language Technologies, 6(4):1–220.
  • Dalvi et al. (2017) Fahim Dalvi, Nadir Durrani, Hassan Sajjad, Yonatan Belinkov, and Stephan Vogel. 2017. Understanding and improving morphological learning in the neural machine translation decoder. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 142–151, Taipei, Taiwan. Asian Federation of Natural Language Processing.
  • Dasgupta et al. (2018) I. Dasgupta, D. Guo, A. Stuhlmüller, S. J. Gershman, and N. D. Goodman. 2018. Evaluating Compositionality in Sentence Embeddings. ArXiv e-prints.
  • Gabrilovich et al. (2013) Evgeniy Gabrilovich, Michael Ringgaard, and Amarnag Subramanya. 2013. Facc1: Freebase annotation of clueweb corpora, version 1 (release date 2013-06-26, format version 1, correction level 0). Note: http://lemurproject. org/clueweb09/FACC1/Cited by, 5.
  • Gururangan et al. (2018) Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In Proc. of NAACL.
  • Hoffmann et al. (2011) Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, and Daniel S. Weld. 2011. Knowledge-based weak supervision for information extraction of overlapping relations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 541–550, Portland, Oregon, USA. Association for Computational Linguistics.
  • Isabelle et al. (2017) Pierre Isabelle, Colin Cherry, and George Foster. 2017. A challenge set approach to evaluating machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2486–2496, Copenhagen, Denmark. Association for Computational Linguistics.
  • Karttunen et al. (2014) Lauri Karttunen, Stanley Peters, Annie Zaenen, and Cleo Condoravdi. 2014. The Chameleon-like Nature of Evaluative Adjectives. In Empirical Issues in Syntax and Semantics 10, pages 233–250. CSSP-CNRS.
  • Khot et al. (2018) Tushar Khot, Ashish Sabharwal, and Peter Clark. 2018. SciTail: A textual entailment dataset from science question answering. In AAAI.
  • Kotzias et al. (2015) Dimitrios Kotzias, Misha Denil, Nando De Freitas, and Padhraic Smyth. 2015. From group to individual labels using deep features. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 597–606. ACM.
  • Lee et al. (2015) Kenton Lee, Yoav Artzi, Yejin Choi, and Luke Zettlemoyer. 2015. Event detection and factuality assessment with non-expert supervision. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1643–1648.
  • Levesque et al. (2012) Hector J Levesque, Ernest Davis, and Leora Morgenstern. 2012. The winograd schema challenge. In Proceedings of the Thirteenth International Conference on Principles of Knowledge Representation and Reasoning, pages 552–561. AAAI Press.
  • Linzen et al. (2016) Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521–535.
  • MacCartney (2009) Bill MacCartney. 2009. Natural language inference. Ph.D. thesis, Stanford University.
  • Marvin and Koehn (2018) Rebecca Marvin and Philipp Koehn. 2018. Exploring word sense disambiguation abilities of neural machine translation systems. In Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track), pages 125–131.
  • Miller and Gurevych (2015) Tristan Miller and Iryna Gurevych. 2015. Automatic disambiguation of english puns. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pages 719–729.
  • Miller et al. (2017) Tristan Miller, Christian Hempelmann, and Iryna Gurevych. 2017. Semeval-2017 task 7: Detection and interpretation of english puns. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 58–68, Vancouver, Canada. Association for Computational Linguistics.
  • Miller and Turković (2016) Tristan Miller and Mladen Turković. 2016. Towards the automatic detection and identification of english puns. The European Journal of Humour Research, 4(1):59–75.
  • Minard et al. (2016) Anne-Lyse Minard, Manuela Speranza, Ruben Urizar, Begona Altuna, Marieke van Erp, Anneleen Schoen, and Chantal van Son. 2016. Meantime, the newsreader multilingual event and time corpus. In Language Resources and Evaluation Conference (LREC).
  • Mintz et al. (2009) Mike Mintz, Steven Bills, Rion Snow, and Daniel Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 1003–1011, Suntec, Singapore. Association for Computational Linguistics.
  • Pakray et al. (2010) Partha Pakray, Santanu Pal, Soujanya Poria, Sivaji Bandyopadhyay, and Alexander F Gelbukh. 2010. Ju_cse_tac: Textual entailment recognition system at tac rte-6. In TAC Workshop.
  • Pal et al. (2016) Harinder Pal et al. 2016. Demonyms and compound relational nouns in nominal open ie. In Proceedings of the 5th Workshop on Automated Knowledge Base Construction, pages 35–39.
  • Pavlick and Callison-Burch (2016) Ellie Pavlick and Chris Callison-Burch. 2016. Most “babies” are “little” and most “problems” are “huge”: Compositional entailment in adjective-nouns. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2164–2173. Association for Computational Linguistics.
  • Pepicello and Green (1984) William J Pepicello and Thomas A Green. 1984. Language of riddles: new perspectives. The Ohio State University Press.
  • Poliak et al. (2018a) Adam Poliak, Yonatan Belinkov, James Glass, and Benjamin Van Durme. 2018a. On the evaluation of semantic phenomena in neural machine translation using natural language inference. In Proceedings of the Annual Meeting of the North American Association of Computational Linguistics (NAACL).
  • Poliak et al. (2018b) Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme. 2018b. Hypothesis only baselines for natural language inference. In The Seventh Joint Conference on Lexical and Computational Semantics (*SEM).
  • Prince (1978) Ellen F Prince. 1978. On the function of existential presupposition in discourse. In Papers from the… Regional Meeting. Chicago Ling. Soc. Chicago, Ill., volume 14, pages 362–376.
  • Reyes et al. (2012) Antonio Reyes, Paolo Rosso, and Davide Buscaldi. 2012. From humor recognition to irony detection: The figurative language of social media. Data & Knowledge Engineering, 74:1–12.
  • Riedel et al. (2013) Sebastian Riedel, Limin Yao, Andrew McCallum, and Benjamin M. Marlin. 2013. Relation extraction with matrix factorization and universal schemas. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 74–84, Atlanta, Georgia. Association for Computational Linguistics.
  • Riloff et al. (2003) Ellen Riloff, Janyce Wiebe, and Theresa Wilson. 2003. Learning subjective nouns using extraction pattern bootstrapping. In Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003-Volume 4, pages 25–32. Association for Computational Linguistics.
  • Rudinger et al. (2017) Rachel Rudinger, Chandler May, and Benjamin Van Durme. 2017. Social bias in elicited natural language inferences. In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, pages 74–79, Valencia, Spain. Association for Computational Linguistics.
  • Rudinger et al. (2018a) Rachel Rudinger, Jason Naradowsky, Brian Leonard, and Benjamin Van Durme. 2018a. Gender bias in coreference resolution. In Proceedings of the Annual Meeting of the North American Association of Computational Linguistics (NAACL).
  • Rudinger et al. (2018b) Rachel Rudinger, Aaron Steven White, and Benjamin Van Durme. 2018b. Neural models of factuality. In Proceedings of the Annual Meeting of the North American Association of Computational Linguistics (NAACL).
  • Saha et al. (2017) Swarnadeep Saha, Harinder Pal, et al. 2017. Bootstrapping for numerical open ie. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 317–323.
  • Salton et al. (2014) Giancarlo Salton, Robert Ross, and John Kelleher. 2014. Evaluation of a substitution method for idiom transformation in statistical machine translation. In Proceedings of the 10th Workshop on Multiword Expressions (MWE), pages 38–42.
  • Sammons et al. (2009) Mark Sammons, VG Vinod Vydiswaran, Tim Vieira, Nikhil Johri, Ming-Wei Chang, Dan Goldwasser, Vivek Srikumar, Gourab Kundu, Yuancheng Tu, Kevin Small, et al. 2009. Relation alignment for textual entailment recognition. In TAC Workshop.
  • Santos (1990) Diana Santos. 1990. Lexical gaps and idioms in machine translation. In COLNG 1990 Volume 2: Papers presented to the 13th International Conference on Computational Linguistics, volume 2.
  • Saurí and Pustejovsky (2007) Roser Saurí and James Pustejovsky. 2007. Determining modality and factuality for text entailment. In Semantic Computing, 2007. ICSC 2007. International Conference on, pages 509–516. IEEE.
  • Schuler (2005) Karin Kipper Schuler. 2005. Verbnet: A broad-coverage, comprehensive verb lexicon.
  • Shao et al. (2017) Yutong Shao, Rico Sennrich, Bonnie L. Webber, and Federico Fancellu. 2017. Evaluating machine translation performance on chinese idioms with a blacklist method. CoRR, abs/1711.07646.
  • Shi et al. (2016) Xing Shi, Inkit Padhi, and Kevin Knight. 2016. Does string-based neural mt learn source syntax? In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1526–1534, Austin, Texas. Association for Computational Linguistics.
  • Silveira et al. (2014) Natalia Silveira, Timothy Dozat, Marie-Catherine de Marneffe, Samuel Bowman, Miriam Connor, John Bauer, and Christopher D. Manning. 2014. A gold standard dependency corpus for English. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014).
  • Taboada (2016) Maite Taboada. 2016. Sentiment analysis: an overview from linguistics. Annual Review of Linguistics, 2:325–347.
  • Tjong Kim Sang and De Meulder (2003) Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the conll-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, CONLL ’03, pages 142–147, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • Van Durme (2010) Benjamin Van Durme. 2010. Extracting Implicit Knowledge from Text. Ph.D. thesis, University of Rochester, Rochester, NY 14627.
  • White et al. (2017) Aaron Steven White, Pushpendre Rastogi, Kevin Duh, and Benjamin Van Durme. 2017. Inference is everything: Recasting semantic resources into a unified evaluation framework. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 996–1005, Taipei, Taiwan. Asian Federation of Natural Language Processing.
  • White and Rawlins (2016) Aaron Steven White and Kyle Rawlins. 2016. A computational model of s-selection. In Semantics and linguistic theory, volume 26, pages 641–663.
  • White and Rawlins (2018) Aaron Steven White and Kyle Rawlins. 2018. The role of veridicality and factivity in clause selection. In Proceedings of the 48th Annual Meeting of the North East Linguistic Society, page to appear, Amherst, MA. GLSA Publications.
  • White et al. (2016) Aaron Steven White, Drew Reisinger, Keisuke Sakaguchi, Tim Vieira, Sheng Zhang, Rachel Rudinger, Kyle Rawlins, and Benjamin Van Durme. 2016. Universal decompositional semantics on universal dependencies. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1713–1723, Austin, Texas. Association for Computational Linguistics.
  • Wiebe et al. (2005) Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. Annotating expressions of opinions and emotions in language. Language resources and evaluation, 39(2-3):165–210.
  • Williams et al. (2017) Adina Williams, Nikita Nangia, and Samuel R Bowman. 2017. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426.
  • Wilson et al. (2006) Theresa Wilson, Janyce Wiebe, and Rebecca Hwa. 2006. Recognizing strong and weak opinion clauses. Computational intelligence, 22(2):73–99.
  • Yang et al. (2015) Diyi Yang, Alon Lavie, Chris Dyer, and Eduard Hovy. 2015. Humor recognition and humor anchor extraction. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2367–2376. Association for Computational Linguistics.
  • Young et al. (2014) Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78.
  • Yu and Hatzivassiloglou (2003) Hong Yu and Vasileios Hatzivassiloglou. 2003. Towards answering opinion questions: Separating facts from opinions and identifying the polarity of opinion sentences. In Proceedings of the 2003 conference on Empirical methods in natural language processing, pages 129–136. Association for Computational Linguistics.
  • Zhang et al. (2017) Sheng Zhang, Rachel Rudinger, Kevin Duh, and Benjamin Van Durme. 2017. Ordinal common-sense inference. Transactions of the Association of Computational Linguistics, 5(1):379–395.
  • Zhao et al. (2018) Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. 2018. Gender bias in coreference resolution: Evaluation and debiasing methods. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, Louisiana. Association for Computational Linguistics.

Appendix A Linguistic considerations


\newcitewilliams2017broad define labels for context-hypothesis pairs in SNLI and Multi-NLI as:

“one which is necessarily true or appropriate in the same situations as the premise (to be paired with the premise under the label entailment), one which is necessarily false or inappropriate whenever the premise is true (contradiction), and one where neither condition applies (neutral).”

Depending on the linguistic phenomena in a context, there can be multiple equally valid, yet contradictory, ways to understand what the same situations actually are. Multi-NLI labels such contexts inconsistently when different linguistic structures make the same situations ambiguous. Table 4 shows some examples where Multi-NLI treats the same situations inconsistently. We briefly discuss some issues that may arise in NLI due to these linguistic phenomena.

a.1 Person

Consider the following sentences, where (1a) is the context and (1b) and (1c) are two hypotheses entailed by the context:

(1) a. She walked a beagle.
    b. A dog got exercise.
    c. A woman exercised.
    d. You walked a beagle.

When we replace She in context (1a) with You, yielding (1d), (1b) would still be entailed by the new context, but what about (1c)? From the single-sentence context, it is unclear whether the person being addressed is the woman who walked the beagle or someone else.

There are at least three potential solutions. The conservative approach ignores all premise sentences that are not in the third person. This solution prevents issues that arise in Multi-NLI but decreases the size of our dataset. The second solution, which we refer to as the observer approach, is to change personal pronouns when generating hypotheses. Under this design, (1c) would now become (2):

(2) I exercised.

The third approach can be considered the speaker’s perspective. Here, we define our entailed hypotheses as new utterances that any agent who provided the natural language context would be willing to commit to as well. Under this design, (1c) is changed to (3):

(3) You exercised.
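The observer and speaker approaches above amount to different pronoun rewrites when generating hypotheses from a second-person context. The sketch below illustrates the contrast; the function name and mapping tables are hypothetical and are not the dataset's actual recasting code.

```python
# Illustrative contrast between the observer and speaker approaches
# for hypotheses generated from a second-person ("You ...") context.
MAPPINGS = {
    "observer": {"you": "I"},   # the addressee restates the claim
    "speaker": {"you": "you"},  # the speaker commits to the claim
}

def rewrite_pronoun(subject: str, approach: str) -> str:
    # Rewrite the hypothesis subject according to the chosen perspective;
    # subjects outside the mapping (e.g. third person) pass through unchanged.
    return MAPPINGS[approach].get(subject.lower(), subject)

print(rewrite_pronoun("You", "observer") + " exercised.")  # I exercised.
print(rewrite_pronoun("You", "speaker") + " exercised.")   # you exercised.
```

The conservative approach needs no rewriting at all, since it simply filters out non-third-person contexts.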


Phenomenon  Context                         Label          Hypothesis
Person      You killed my friend            contradiction  I killed my friend
Person      do you work for TI in any way   neutral        Does TI pay you?
Question    how can you prove it            entailment     Can you tell me how to prove it?
Question    Who could there be?             neutral        The speaker doesn’t know who it is.

Table 4: Example context-hypothesis pairs from Multi-NLI that demonstrate some linguistic inconsistencies. Each row pairs a context with a corresponding hypothesis and its Multi-NLI label.

a.2 Questions

Imagine changing the context sentence (1a) so that it is now posed as the question (4):

(4) She walked a beagle?

In NLI, is a question a rhetorical statement invalidating the plausibility of a scenario (the speaker of (4) claiming there is no way she walked the beagle), or a way to clarify understanding about a situation that occurred (the speaker of (4) wanting to know whether she walked a beagle or a different type of dog)?

The context-hypothesis pair’s entailment label depends on how one understands this question: under the clarification reading, (1b) is arguably entailed by (4), but under the rhetorical reading it is not.

The most obvious solution is to remove questions from the set of context sentences. A downside of this solution is that potentially many NLI contexts will be disregarded. Other solutions include forcing valid hypotheses to answer the question, or forcing valid hypotheses to be other questions that are subsets (entailed), supersets (not-entailed), or unrelated (not-entailed) with respect to the question in the context.
