Using syntactical and logical forms to evaluate textual inference competence
In the light of recent breakthroughs in transfer learning for Natural Language Processing, much progress has been achieved on Natural Language Inference. Different models now achieve high accuracy on popular inference datasets such as SNLI, MNLI and SciTail. At the same time, several indicators suggest that those datasets can be exploited through simple linguistic patterns. This fact complicates our understanding of the actual capacity of machine learning models to solve the complex task of textual inference. We propose a new set of tasks that require specific capacities over linguistic logical forms, namely: i) Boolean coordination, ii) quantifiers, iii) definite description, and iv) counting operators. By evaluating a model on our stratified dataset, we can better pinpoint its specific inferential difficulties on each kind of textual structure. We evaluate two kinds of neural models that implicitly exploit language structure: recurrent models and the Transformer network BERT. We show that although BERT clearly generalizes better over most logical forms, there is room for improvement when dealing with counting operators.
Natural Language Inference (NLI) is a complex problem of Natural Language Understanding which is usually defined as follows: given a pair of textual inputs P and H, we need to determine whether P entails H, P contradicts H, or P and H have no logical relationship (they are neutral) [Consortium et al.1996]. P and H, known as the “premise” and the “hypothesis” respectively, can be either simple sentences or full texts.
The task can focus either on the entailment or on the contradiction part. The former, known as Recognizing Textual Entailment (RTE), classifies the pair (P, H) as “entailment” or “non-entailment”. The latter, known as Contradiction Detection (CD), classifies the pair as “contradiction” or “non-contradiction”. Regardless of how we frame the problem, the concept of inference is the critical issue here.
With this formulation, NLI has been treated as a text classification problem suitable to be solved by a variety of machine learning techniques [Bar-Haim et al.2014, Bowman et al.2015a, Williams et al.2017]. Inference itself is also a complex problem, as the following sentence pairs show:
(1) “A woman plays with my dog”, “A person plays with my dog”
(2) “Jenny and Sally play with my dog”, “Jenny plays with my dog”
Both examples are cases of entailment, but with different properties. In (1) the entailment is caused by the hypernym relationship between “person” and “woman”. Example (2) deals with the interpretation of the coordinating conjunction “and” as a Boolean connective. Since (1) relies on the meaning of the noun phrases, we call it “lexical inference”. Since (2) is invariant under substitution of the noun phrases, we call it “structural inference”. The latter is the focus of this work.
In this paper, we propose a new synthetic dataset that enables us to:
compare the NLI accuracy of different neural models.
diagnose the structural (logical and syntactic) competence of each model.
verify cross-linguistic structural competence of each method.
The contributions of this paper are: i) the presentation of a structure-oriented dataset; ii) a comparison of traditional recurrent neural models against the Transformer network BERT, showing a clear advantage for the latter while still identifying specific gaps in its performance; and iii) a successful case of cross-language transfer learning for structural NLI between English and Portuguese.
The size of NLI datasets has been increasing since the initial proposition of the FraCaS test suite [Consortium et al.1996]. Older datasets like RTE-6 [Bentivogli et al.2009] and SICK [Marelli et al.2014] are relatively small compared with current ones like SNLI [Bowman et al.2015a] and MNLI [Williams et al.2017]. This increase was made possible by the use of crowdsourcing platforms like Amazon Mechanical Turk [Bowman et al.2015a, Williams et al.2017]. Hence the annotation by highly specialized researchers, as in RTE 1-3, which was done by formal semanticists [Giampiccolo et al.2007, Bar-Haim et al.2014], was replaced by labelling done by average English speakers. This approach has been criticised with the argument that it is hard for an average speaker to produce different and creative examples of entailment and contradiction pairs [Gururangan et al.2018]. By looking at the premise alone, a simple text classifier can achieve an accuracy significantly higher than a random classifier on datasets such as SNLI and MNLI. This was explained by a high correlation of negative words (“no”, “nobody”, “never”, “nothing”) with contradiction instances, and of generic words (such as “animal”, “instrument”, “outdoors”) with entailment instances. So, despite the large size of the corpora, the task is easier to perform than expected.
The new wave of pre-trained models [Howard and Ruder2018, Devlin et al.2018, Liu et al.2019] poses both a challenge and an opportunity for the NLI field. The challenge is that the large-scale datasets are close to being solved (state-of-the-art accuracies on SNLI, MNLI, and SciTail are reported in [Liu et al.2019]), giving the impression that NLI will become a trivial problem. The opportunity lies in the fact that, by using pre-trained models, training will no longer need such large datasets. We can then focus our efforts on creating small, well-thought-out datasets that reflect the variety of inferential tasks, and so determine the real competence of a model.
Here we present a collection of small datasets designed to measure the competence of detecting contradictions in structural inferences. We have chosen the CD task because it is harder for an average annotator to create examples of contradictions without excessively relying on the same patterns. At the same time, CD has practical importance since it can be used to improve consistency in real case applications, such as chat-bots [Welleck et al.2018].
We choose to focus on structural inference because we have detected that the current datasets do not appropriately address this particular feature. In an experiment, we verified the deficiencies reported in [Gururangan et al.2018, Glockner et al.2018]. First, we transformed the SNLI and MNLI datasets into a CD task. The transformation is done by converting all instances of entailment and neutral into non-contradiction, and by balancing the classes in both training and test data. Second, we applied a simple Bag-of-Words classifier, destroying any structural information. The accuracy was significantly higher than that of a random classifier for both SNLI and MNLI. Even the recent dataset focusing on contradiction, Dialog NLI [Welleck et al.2018], presents a similar pattern: the same Bag-of-Words model also achieved well-above-random accuracy on this corpus.
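The label conversion and class balancing described above can be sketched in a few lines of Python (the triple layout `(premise, hypothesis, label)` is an assumption, not the SNLI/MNLI storage format):

```python
import random

def to_cd_task(examples, seed=0):
    """Convert 3-class NLI examples into a balanced binary CD task.

    `examples` is a list of (premise, hypothesis, label) triples where
    label is one of "entailment", "neutral", "contradiction".
    """
    # Collapse entailment and neutral into a single non-contradiction class.
    binary = [(p, h, "contradiction" if y == "contradiction" else "non-contradiction")
              for p, h, y in examples]
    # Downsample the majority class so both classes have equal size.
    by_class = {"contradiction": [], "non-contradiction": []}
    for ex in binary:
        by_class[ex[2]].append(ex)
    n = min(len(v) for v in by_class.values())
    rng = random.Random(seed)
    balanced = []
    for v in by_class.values():
        balanced.extend(rng.sample(v, n))
    rng.shuffle(balanced)
    return balanced
```

Downsampling the majority class (rather than upsampling the minority) keeps every retained example unique, which matters when the converted dataset is also used as test data.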
3 Data Collection
The different datasets that we propose are divided by tasks, such that each task introduces a new linguistic construct. Each task is designed by applying structurally dependent rules to automatically generate the sentence pairs. We first define the pairs in a formal language and then we use it to generate instances in natural language. In this paper, we have decided to work with English and Portuguese.
There are two main reasons to use a formal language as a basis for the dataset. First, this approach allows us to minimize the influence of common knowledge and lexical knowledge, highlighting structural features. Second, we can obtain a structural symmetry in both corpora.
Hence, our dataset is a tool to measure inference in two dimensions: one defined by the structural forms, which correspond to different levels in our hierarchical corpus; and the other defined by the instantiation of these forms in multiple natural languages.
3.1 Template Language
The template language is a formal language used to generate instances of contradictions and non-contradictions in a natural language. This language is composed of two basic kinds of entities, people and places, together with three binary relations. It is a simplistic universe, with the intended readings of the binary relations being “x has visited y”, “x is taller than y” and “x is as tall as y”, respectively.
A realisation of the template language is a function that maps the people and place symbols to nouns, and maps the relation symbols and logic operators to their corresponding forms in some natural language.
Each task is defined by the introduction of a new structural and logical operator. We define the tasks in a hierarchical fashion: if a logical operator is introduced in task i, it can also appear in any later task j (with j > i). The main advantage of our approach compared to other datasets is that we can isolate the occurrences of each operator to have a clear notion of what makes the models fail (or succeed).
For each task, we provide training and test data with 10K and 1K examples, respectively. All data is balanced and, as usual, the model’s accuracy is evaluated on the test data. To test the model’s generalization capability, we have defined two distinct realisation functions, one for the training data and one for the test data, with disjoint vocabularies. For example, in the English version the training realisation uses common English masculine names and names of countries, while the test realisation uses feminine names and names of cities from the United States. In the Portuguese version we have done a similar construction, using common masculine and feminine names together with names of countries and names of Brazilian cities.
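A realisation function can be sketched as follows; the name lists and the rendering of the “visited” relation are illustrative assumptions, not the dataset’s actual vocabulary:

```python
import random

# Illustrative vocabularies; in the actual dataset the train and test
# name sets are disjoint, as described above.
TRAIN_REALISATION = {"people": ["Charles", "Joe", "Felix", "Tyler"],
                     "places": ["Chile", "Japan", "Bolivia", "France"]}
TEST_REALISATION = {"people": ["Jenny", "Sally", "Carla", "Lana"],
                    "places": ["Boston", "Denver", "Austin", "Reno"]}

def realise_pair(premise_facts, hypothesis_fact, realisation, rng):
    """Instantiate template variables (p0, p1, ... for people; l0, l1, ...
    for places) with concrete nouns, consistently across premise and
    hypothesis, so the same variable always denotes the same entity."""
    binding = {}

    def noun(var):
        pool = realisation["people"] if var.startswith("p") else realisation["places"]
        if var not in binding:
            # Pick a fresh noun: distinct variables get distinct entities.
            binding[var] = rng.choice([n for n in pool if n not in binding.values()])
        return binding[var]

    def render(fact):
        relation, subj, obj = fact  # only "visited" is rendered in this sketch
        return "{} has visited {}".format(noun(subj), noun(obj))

    premise = ", ".join(render(f) for f in premise_facts)
    return premise, render(hypothesis_fact)
```

Because the binding is shared between premise and hypothesis, the structural relationship between the two formal sentences survives realisation in any language.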
3.2 Data Generation
A logical rule can be seen as a mapping that transforms a premise P into a conclusion. To obtain examples of contradiction we start with a premise P and define the hypothesis H as a negation that violates P. The examples of non-contradiction are different negations that do not necessarily violate P. We repeat this process for each task. What distinguishes one task from another is the introduction of new logical and linguistic operators and, subsequently, new rules. We have used more than one template pair to define each task; however, for the sake of brevity, the description below gives only a brief overview of each task.
The full dataset in both languages, together with the code to generate it, can be found online [Salvatore2019].
Task 1: Simple negation We introduce the negation operator ¬ (“not”). The premise P is a collection of facts about some agents visiting different places, e.g., P = “Charles has visited Chile, Joe has visited Japan”. The hypothesis H can be either the negation of one fact that appears in P (“Joe didn’t visit Japan”), or the negation of a new fact not related to P (“Lana didn’t visit France”). The number of facts that appear in P varies from two to twelve.
Task 2: Boolean coordination In this task, we add the Boolean conjunction ∧, realised as the coordinating conjunction “and”. Example: P = “Felix, Ronnie, and Tyler have visited Bolivia”. The hypothesis can state that one of the mentioned agents did not travel to a mentioned place (“Tyler didn’t visit Bolivia”), or it can negate a new fact (“Bruce didn’t visit Bolivia”).
Task 3: Quantification By adding the quantifiers ∀ and ∃ (“for every” and “some”, respectively), we can construct examples of inferences that explicitly exploit the difference between the two basic kinds of entities, people and places. Example: P states a general fact about all people, “Everyone has visited every place”. H can be the negation of one particular instance of P (“Timothy didn’t visit El Salvador”), or a fact that does not violate P (“Timothy didn’t visit Anthony”).
Task 4: Definite description One way to test whether a model can capture reference is by using definite descriptions, i.e., by adding a description operator together with the equality relation, so that a definite description reads as “the one that has property Q”. Here we describe one property of one agent and ask the model to combine the description with a new fact. For example, P = “Carlos is the person that has visited every place, Carlos has visited John”. Two new hypotheses can be introduced: “Carlos did not visit Germany” or “John did not visit Germany”. Only the first hypothesis is a contradiction. Although both names “Carlos” and “John” appear in the premise, we expect the model to relate the property “being the one that has visited every place” to “Carlos” and not to “John”.
Task 5: Comparatives In this task we are interested in whether the model can recognise a basic property of a binary relation: transitivity. The premise is composed of a collection of simple facts, e.g., “Francis is taller than Joe, Joe is taller than Ryan”. Assuming the transitivity of “taller than”, the hypothesis can be a consequence of P (“Francis is taller than Ryan”), or a fact that violates transitivity (“Ryan is taller than Francis”). The size of the premise varies from four to ten facts. Negation is not employed here.
Task 6: Counting In Task 3 we added only the basic quantifiers ∀ and ∃, but there is a broader family of operators called generalised quantifiers. In this task we introduce the counting quantifier “exactly n”. Example: P = “Philip has visited only three places and only two people”. H can be information consistent with P (“Philip has visited John”), or something that contradicts P (“Philip has visited John, Carla, and Bruce”). We have added counting quantifiers corresponding to the numbers from one to thirty.
Task 7: Mixed To avoid overexposing the same linguistic structures, we created a dataset composed of different samples of the previous tasks.
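As a concrete illustration of the generation scheme, a generator for Task 1 (simple negation) might look like the following sketch; the name lists are illustrative placeholders:

```python
import random

PEOPLE = ["Charles", "Joe", "Lana", "Felix"]
PLACES = ["Chile", "Japan", "France", "Bolivia"]

def make_task1_pair(rng, contradiction):
    """Build one (premise, hypothesis, label) triple for Task 1.

    The premise lists facts "X has visited Y"; the hypothesis negates
    either one of those facts (contradiction) or a fact absent from the
    premise (non-contradiction)."""
    n_facts = rng.randint(2, 12)
    facts, used = [], set()
    while len(facts) < n_facts:
        pair = (rng.choice(PEOPLE), rng.choice(PLACES))
        if pair not in used:
            used.add(pair)
            facts.append(pair)
    premise = ", ".join("{} has visited {}".format(p, l) for p, l in facts)
    if contradiction:
        p, l = rng.choice(facts)        # negate a stated fact
    else:
        while True:                     # negate a fact absent from the premise
            p, l = rng.choice(PEOPLE), rng.choice(PLACES)
            if (p, l) not in used:
                break
    hypothesis = "{} didn't visit {}".format(p, l)
    label = "contradiction" if contradiction else "non-contradiction"
    return premise, hypothesis, label
```

The other tasks follow the same pattern, with the premise and the negated hypothesis built from the operators each task introduces.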
Basic statistics for the English and Portuguese realisations of all tasks can be found in Table 1.
|Task 1 (Eng)||3561||77||230.6||459|
|Task 2 (Eng)||4117||128||151.4||343|
|Task 3 (Eng)||3117||70||101.5||329|
|Task 4 (Eng)||1878||62||100.81||134|
|Task 5 (Eng)||1311||25||208.8||377|
|Task 6 (Eng)||3900||150||168.4||468|
|Task 7 (Eng)||3775||162||160.6||466|
|Task 1 (Pt)||7762||254||209.4||445|
|Task 2 (Pt)||9990||393||148.5||388|
|Task 3 (Pt)||5930||212||102.7||395|
|Task 4 (Pt)||5540||135||91.8||140|
|Task 5 (Pt)||5970||114||235.2||462|
|Task 6 (Pt)||9535||386||87.8||531|
|Task 7 (Pt)||8880||391||159.9||487|
4 Models and Evaluation
To evaluate the accuracy of each CD task we employed three kinds of models:
Baseline The baseline model (Base) is a Random Forest classifier that models the input text, the concatenation of P and H, as a Bag-of-Words. Since we constructed the dataset centered on the notion of structure-based contradictions, we believe that it should perform only slightly better than random. At the same time, such a baseline lets us certify that the proposed tasks do indeed require structural knowledge.
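The structure-destroying property of this baseline can be seen directly (a minimal sketch; the actual baseline feeds such counts to a Random Forest classifier):

```python
from collections import Counter

def bow(text):
    """Bag-of-Words: token counts, with all order information discarded."""
    return Counter(text.lower().replace(",", "").split())

# Two structurally different sentences collapse to the same representation,
# so a BoW classifier cannot tell them apart.
a = bow("Francis is taller than Joe")
b = bow("Joe is taller than Francis")
print(a == b)  # True
```

Any accuracy the baseline achieves above random must therefore come from word occurrence statistics, not from sentence structure.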
Recurrent Models The dominant family of neural models in Natural Language Processing specialised in modelling sequential data is composed of the Recurrent Neural Network (RNN) and its variations, the Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU) [Goldberg2015]. We consider both the standard and the bidirectional variants of this family of models. Traditional multilayer recurrent models are not the best choice to improve the benchmarks on NLI [Glockner et al.2018]. However, recent work has reported that recurrent models are better than Transformer-based models at capturing structural patterns for logical inference [Evans et al.2018, Tran et al.2018]. We want to investigate whether the same result holds when using our tasks as the basis of comparison.
Transformer Based Models A recent non-recurrent family of neural models known as Transformer networks was introduced in [Vaswani et al.2017]. Unlike recurrent models, which recursively summarise all previous input into a single representation, the Transformer network employs a self-attention mechanism to attend directly to all previous inputs (more details of this architecture can be found in [Vaswani et al.2017]). Although regular training of this architecture alone does not yield surprising results in inference prediction [Evans et al.2018, Tran et al.2018], pre-training a Transformer network on the language modeling task and fine-tuning it afterwards on an inference task yields a significant improvement [Radford et al.2018, Devlin et al.2018].
Among the different Transformer based models we focus our analysis on the multilayer bidirectional architecture known as Bidirectional Encoder Representations from Transformers (BERT) [Devlin et al.2018]. This bidirectional model, pre-trained as a masked language model and as a next sentence predictor, has two versions: BERT-base and BERT-large. The difference lies in the size of each architecture, i.e., the number of layers and self-attention heads. Since BERT-large is unstable on small datasets [Devlin et al.2018], we have used only BERT-base.
The strategy to perform NLI classification using BERT is the same as the one presented in [Devlin et al.2018]: to the pair (P, H) we add the special tokens [CLS] (classification token) and [SEP] (sentence separator). Hence, the textual input is the concatenation [CLS] P [SEP] H [SEP]. We then take the vector representation of the [CLS] token and pass it through a classification layer to obtain the predicted class (contradiction/non-contradiction). We fine-tune the model for the CD task in the standard way: the original weights are co-trained with the weights of the new layer.
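The input packing can be sketched at the token level as follows (the function name is ours; segment ids distinguishing the two sentences are part of BERT’s standard input, alongside WordPiece tokenization, which is omitted here):

```python
def bert_cd_input(premise_tokens, hypothesis_tokens):
    """Pack a (premise, hypothesis) token pair into BERT's input format,
    [CLS] P [SEP] H [SEP], plus segment ids (0 for the premise side,
    1 for the hypothesis side) used by BERT's segment embeddings."""
    tokens = ["[CLS]"] + premise_tokens + ["[SEP]"] + hypothesis_tokens + ["[SEP]"]
    segment_ids = [0] * (len(premise_tokens) + 2) + [1] * (len(hypothesis_tokens) + 1)
    return tokens, segment_ids
```

The classifier head sees only the final hidden state at position 0, i.e., the [CLS] token.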
By comparing BERT with other models we are not only comparing different architectures but also different training techniques. The baseline model uses no additional information. The recurrent models use only a soft version of transfer learning: fine-tuning of pre-trained embeddings (the fine-tuning of one layer only). BERT, on the other hand, is pre-trained on a large corpus as a language model. This pre-training is expected to help the model capture some general properties of language [Howard and Ruder2018]. Since the tasks that we propose are basic and cover very specific aspects of reasoning, we can use them to evaluate which properties are being learned in the pre-training phase.
The simplicity of the tasks motivated us to use transfer learning differently: instead of simply using the multilingual version of BERT (a model trained on the concatenation of the entire Wikipedias of 100 languages, Portuguese included; https://github.com/google-research/bert/blob/master/multilingual.md) and fine-tuning it on the Portuguese version of the tasks, we have decided to compare how differently pre-trained versions of BERT can be fine-tuned on the Portuguese corpus. This is possible because each pre-trained model comes with a tokenizer that transforms the Portuguese input into a collection of tokens that the model can process. Thus, we use the regular version of BERT trained on an English corpus, the already mentioned Multilingual BERT, and the version of BERT trained on a Chinese corpus.
We hypothesize that the most basic logical patterns learned by the model in English can be transferred to Portuguese. By the same reasoning, we believe that the Chinese-pretrained BERT should perform poorly: not only will its tokenizer add noise to the input text, but Portuguese and Chinese are also grammatically different; for example, the latter is overwhelmingly right-branching while the former is more mixed [Levy and Manning2003].
4.1 Experimental settings
The experiments were done in two stages. In the first stage we evaluated the performance of the models in different ways:
We trained each model on different proportions of the dataset, by increasingly sampling the size of the training data. In this case, the training and test vocabularies are disjoint.
To understand how much the different models rely on the occurrence of specific noun phrases, we also trained the models on a version of the dataset where the training and test vocabularies fully intersect.
For the Portuguese corpus, we fine-tuned the three pre-trained models mentioned above: the English, Multilingual, and Chinese versions of BERT.
In the second stage, we trained the best model from the first stage on corrupted examples to check that it is learning the intended structures and not some unexpected text pattern. We trained the model on the following modified versions of the dataset: (Noise label) each pair (P, H) is unchanged but we randomly label the pair as contradiction or non-contradiction. (Premise only) we keep the labels but omit the hypothesis H. (Hypothesis only) the premise P is removed, but the labels remain intact.
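These three controls can be written as simple dataset transforms (the `(premise, hypothesis, label)` triple layout is an assumption):

```python
import random

def noise_label(examples, seed=0):
    """Keep each (premise, hypothesis) pair; assign a random label."""
    rng = random.Random(seed)
    return [(p, h, rng.choice(["contradiction", "non-contradiction"]))
            for p, h, _ in examples]

def premise_only(examples):
    """Drop the hypothesis; keep premise and label."""
    return [(p, "", y) for p, h, y in examples]

def hypothesis_only(examples):
    """Drop the premise; keep hypothesis and label."""
    return [("", h, y) for p, h, y in examples]
```

A model that performs above random on any of these corrupted versions would be exploiting an artifact rather than the structural relation between P and H.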
4.2 Implementation and Hyperparameters
All deep learning architectures were implemented using the PyTorch library [Paszke et al.2017]. To make use of the pre-trained versions of BERT we based our implementation on the public repository https://github.com/huggingface/pytorch-pretrained-BERT.
The different recurrent architectures were optimized with Adam [Kingma and Ba2014]. We used pre-trained word embeddings from GloVe [Pennington et al.2014] and fastText [Joulin et al.2016], as well as randomly initialized embeddings. We performed random search over the embedding dimension, the hidden layer size of the recurrent model, the number of recurrent layers, the learning rate, the dropout rate, and the batch size. The hyperparameter search for BERT follows the one presented in [Devlin et al.2018], which uses Adam with learning rate warmup and linear decay; we randomly searched over the learning rate, the batch size, and the number of epochs.
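Random search itself can be sketched as follows; the value grids shown are illustrative placeholders, not the search spaces actually used in the experiments:

```python
import random

# Illustrative search space for the recurrent models; the actual grids
# used in the experiments are not reproduced here.
SEARCH_SPACE = {
    "embedding_dim": [50, 100, 300],
    "hidden_size": [128, 256, 512],
    "num_layers": [1, 2, 3],
    "learning_rate": [1e-4, 3e-4, 1e-3],
    "dropout": [0.0, 0.2, 0.5],
    "batch_size": [32, 64, 128],
}

def sample_configs(space, n_trials, seed=0):
    """Random search: draw n_trials independent configurations, each
    choosing one value per hyperparameter uniformly at random."""
    rng = random.Random(seed)
    return [{k: rng.choice(v) for k, v in space.items()} for _ in range(n_trials)]
```

Each sampled configuration is then trained and evaluated on held-out data, and the best-scoring one is kept.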
All the code for the experiments is publicly available [Salvatore2019].
5 Results
For stage 1(a), in most of the tasks, BERT presents a clear advantage over all other models. Tasks 3 and 6 are the only ones where the difference in accuracy between BERT and the recurrent models is small, as can be seen in Table 2. Even when we look at BERT’s results on the Portuguese corpus, which are slightly worse than the English ones, we still see a similar pattern. Pre-training plays a role here: when we fine-tuned BERT on the Portuguese versions of Tasks 6 and 7, we achieved new, higher accuracies on both.
Surprisingly, the Chinese-pretrained BERT is able to solve some of the simpler tasks but, overall, its performance across Tasks 1 to 7 is only middling.
| Task | Base | RNN | LSTM | GRU | BERT |
|---|---|---|---|---|---|
| Task 1 (Eng) | 52.1 | 50.1 | 50.6 | 50.4 | 99.8 |
| Task 2 (Eng) | 50.7 | 50.2 | 50.2 | 50.8 | 100 |
| Task 3 (Eng) | 63.5 | 50.3 | 66.1 | 63.5 | 90.5 |
| Task 4 (Eng) | 51.0 | 51.7 | 52.7 | 51.6 | 100 |
| Task 5 (Eng) | 50.6 | 50.1 | 50.2 | 50.2 | 100 |
| Task 6 (Eng) | 55.5 | 84.4 | 82.7 | 75.1 | 87.5 |
| Task 7 (Eng) | 54.1 | 50.9 | 53.7 | 50.0 | 94.6 |
| Task 1 (Pt) | 53.9 | 50.1 | 50.2 | 50.0 | 99.9 |
| Task 2 (Pt) | 49.8 | 50.0 | 50.0 | 50.0 | 99.9 |
| Task 3 (Pt) | 61.7 | 50.0 | 70.6 | 50.1 | 78.7 |
| Task 4 (Pt) | 50.9 | 50.0 | 50.4 | 50.0 | 100 |
| Task 5 (Pt) | 49.9 | 50.1 | 50.8 | 50.0 | 99.8 |
| Task 6 (Pt) | 58.9 | 66.4 | 79.7 | 67.2 | 79.1 |
| Task 7 (Pt) | 55.4 | 51.1 | 51.6 | 51.1 | 82.7 |
In stage 1(b), with the full intersection of the vocabulary, we observed that the average accuracy improvement differs from model to model. This may indicate that the recurrent models rely more on noun phrases than BERT does. However, since we have results in opposite directions, more investigation is required.
Figure 1 shows the average accuracy over all tasks: BERT is the only model that improves when trained on more data. All other models remain close to random regardless of the amount of training data.
The accuracy improvement over training size indicates the difference in difficulty between tasks. On the one hand, Tasks 1, 2 and 4 are practically solved by BERT using only 4K training examples. On the other hand, the results for Tasks 3 and 6 remain below average, as seen in Figure 2.
In stage 2, taking BERT as the best classifier, we repeated the training using all the listed data modifications. The results, shown in Figure 3, indicate that BERT is neither memorizing random textual patterns nor excessively relying on information that appears only in the premise P or the hypothesis H: on these modified versions of the data, BERT behaves as a random classifier.
The results that we have found here point in the opposite direction to the ones reported in [Evans et al.2018, Tran et al.2018]: BERT, a Transformer-based model, is significantly better at capturing sentence structure than the recurrent models. We offer the following explanation. In both previous papers the Transformer models are trained from scratch, while here we used pre-trained models. This difference in results seems to indicate that language model pre-training is of vital importance for obtaining structural knowledge.
Table 2 seems to confirm our initial hypothesis on the effectiveness of transfer learning in a cross-language fashion. We expected these results for English to Portuguese, but we did not expect the results for Chinese to Portuguese on Tasks 1, 2 and 4. This can be explained by the following remarks. Take the contradiction pair defined in the template language:
P = “x is the person that has visited everybody, x has visited y”
H = “x didn’t visit z”
If we take one possible Portuguese realization of the pair above and apply the different tokenizers we have the following strings:
Original sentence: “[CLS] gabrielle é a pessoa que visitou todo mundo gabrielle visitou luís [SEP] gabrielle não visitou ianesis [SEP]”.
Multilingual tokenizer: “[CLS] gabrielle a pessoa que visito u todo mundo gabrielle visito u lu s [SEP] gabrielle no visito u ian esis [SEP]”
English tokenizer: “[CLS] gabrielle a pe sso a que visit ou tod o mundo gabrielle visit ou lu s [SEP] gabrielle no visit ou ian esis [SEP]”
Chinese tokenizer: “[CLS] ga b rie lle a pe ss oa q ue vi sit ou to do mu nd o ga b rie lle vi sit ou lu s [SEP] ga b rie lle no vi sit ou ian es is [SEP]”
Although the Portuguese words are broken up by the tokenizers, the model is still able to learn, in the fine-tuning phase, the simple structural pattern between the resulting tokens. This may also explain why Task 6 (Counting) presents the highest difficulty for BERT: there is some structural grounding for finding contradictions in counting expressions, but to detect contradictions in all cases one must fully grasp the meaning of the multiple counting operators.
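The tokenizations above follow from WordPiece’s greedy longest-match-first segmentation, which can be sketched as follows (the toy vocabulary is illustrative; real BERT vocabularies mark word-internal pieces with a “##” prefix, omitted in the renderings above):

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first WordPiece tokenization of one word.
    Word-internal pieces are looked up with the '##' continuation prefix."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            piece = word[i:j] if i == 0 else "##" + word[i:j]
            if piece in vocab:
                pieces.append(piece)
                i = j
                break
        else:
            return ["[UNK]"]  # no matching piece: the whole word is unknown
    return pieces

# Toy vocabulary: "visitou" has no entry of its own, but "visito" + "##u" do,
# mimicking the Multilingual tokenizer's output above.
vocab = {"visito", "##u", "gabrielle", "visit", "##ou"}
print(wordpiece("visitou", vocab))  # ['visito', '##u']
```

Note that the greedy strategy prefers the longer initial piece “visito” over “visit”, which is why a tokenizer trained on a different language still yields consistent, learnable token sequences.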
6 Related Work
The use of a synthetic benchmark to measure inference is present in early works on NLI such as the FraCaS Test Suite [Consortium et al.1996] (a public version of this dataset has been made available by the Stanford NLP group: https://nlp.stanford.edu/~wcmac/downloads/fracas.xml). The move towards the creation of large and realistic datasets was very beneficial to the field, because real-life applications that deal with contradiction and inference must rely on linguistic and common background knowledge [Bentivogli et al.2009, Bar-Haim et al.2014, Marelli et al.2014, Bowman et al.2015a, Williams et al.2017, Khot et al.2018, Welleck et al.2018]. Our approach of isolating structural forms by using synthetic data to analyze the logical and syntactical competence of different neural models is similar to [Bowman et al.2015b, Evans et al.2018, Tran et al.2018]. One main difference between their approach and ours is that we are interested in using a formal language as a tool for performing a cross-language analysis.
7 Conclusion and Further Work
With the possibility of using pre-trained models, we can successfully craft small datasets (around 10K sentences) to perform fine-grained analysis of machine learning models. In this paper, we have presented a new dataset that is able to isolate specific competence issues regarding structural inference. It also allows us to bring to the surface some interesting comparisons between recurrent and Transformer-based neural models. As our results show, compared to the recurrent models, BERT presents a considerable advantage in learning structural inference. The same result appears even when we fine-tune a version of the model that was not pre-trained on the target language.
Thanks to the stratified nature of our dataset, we can pinpoint BERT’s inference difficulties: there is room for improving the model’s understanding of counting. Hence, we can either craft a more realistic NLI dataset centered on the notion of counting, or modify BERT’s training to achieve better results on Task 6. We plan to explore these paths in future work.
- [Bar-Haim et al.2014] Roy Bar-Haim, Ido Dagan, and Idan Szpektor. Benchmarking applied semantic inference: The PASCAL recognising textual entailment challenges. In Language, Culture, Computation. Computing - Theory and Technology - Essays Dedicated to Yaacov Choueka on the Occasion of His 75th Birthday, Part I, pages 409–424, 2014.
- [Bentivogli et al.2009] Luisa Bentivogli, Peter Clark, Ido Dagan, and Danilo Giampiccolo. The sixth PASCAL recognizing textual entailment challenge. In TAC, 2009.
- [Bowman et al.2015a] Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pages 632–642, 2015.
- [Bowman et al.2015b] Samuel R. Bowman, Christopher D. Manning, and Christopher Potts. Tree-structured composition in neural networks without tree-structured architectures. CoRR, abs/1506.04834, 2015.
- [Consortium et al.1996] The Fracas Consortium, Robin Cooper, Dick Crouch, Jan Van Eijck, Chris Fox, Josef Van Genabith, Jan Jaspars, Hans Kamp, David Milward, Manfred Pinkal, Massimo Poesio, Steve Pulman, Ted Briscoe, Holger Maier, and Karsten Konrad. Using the framework, 1996.
- [Devlin et al.2018] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018.
- [Evans et al.2018] Richard Evans, David Saxton, David Amos, Pushmeet Kohli, and Edward Grefenstette. Can neural networks understand logical entailment? CoRR, abs/1802.08535, 2018.
- [Giampiccolo et al.2007] Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL@ACL 2007 Workshop on Textual Entailment and Paraphrasing, Prague, Czech Republic, June 28-29, 2007, pages 1–9, 2007.
- [Glockner et al.2018] Max Glockner, Vered Shwartz, and Yoav Goldberg. Breaking NLI systems with sentences that require simple lexical inferences. CoRR, abs/1805.02266, 2018.
- [Goldberg2015] Yoav Goldberg. A primer on neural network models for natural language processing. CoRR, abs/1510.00726, 2015.
- [Gururangan et al.2018] Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith. Annotation artifacts in natural language inference data. CoRR, abs/1803.02324, 2018.
- [Howard and Ruder2018] Jeremy Howard and Sebastian Ruder. Fine-tuned language models for text classification. CoRR, abs/1801.06146, 2018.
- [Joulin et al.2016] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. Bag of tricks for efficient text classification. CoRR, abs/1607.01759, 2016.
- [Khot et al.2018] Tushar Khot, Ashish Sabharwal, and Peter Clark. Scitail: A textual entailment dataset from science question answering. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 5189–5197, 2018.
- [Kingma and Ba2014] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
- [Levy and Manning2003] Roger Levy and Christopher Manning. Is it harder to parse chinese, or the chinese treebank? In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1, ACL ’03, pages 439–446, Stroudsburg, PA, USA, 2003. Association for Computational Linguistics.
- [Liu et al.2019] Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. Multi-task deep neural networks for natural language understanding. CoRR, abs/1901.11504, 2019.
- [Marelli et al.2014] Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, and Roberto Zamparelli. A SICK cure for the evaluation of compositional distributional semantic models. In LREC, 2014.
- [Paszke et al.2017] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
- [Pennington et al.2014] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. Glove: Global vectors for word representation. In In EMNLP, 2014.
- [Radford et al.2018] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018.
- [Salvatore2019] Felipe Salvatore. Contrabert. https://github.com/felipessalvatore/ContraBERT, 2019.
- [Tran et al.2018] Ke M. Tran, Arianna Bisazza, and Christof Monz. The importance of being recurrent for modeling hierarchical structure. CoRR, abs/1803.03585, 2018.
- [Vaswani et al.2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017.
- [Welleck et al.2018] Sean Welleck, Jason Weston, Arthur Szlam, and Kyunghyun Cho. Dialogue natural language inference. CoRR, abs/1811.00671, 2018.
- [Williams et al.2017] Adina Williams, Nikita Nangia, and Samuel R. Bowman. A broad-coverage challenge corpus for sentence understanding through inference. CoRR, abs/1704.05426, 2017.