A Logic-Driven Framework for Consistency of Neural Models
Abstract
While neural models show remarkable accuracy on individual predictions, their internal beliefs can be inconsistent across examples. In this paper, we formalize such inconsistency as a generalization of prediction error. We propose a learning framework for constraining models using logic rules to regularize them away from inconsistency. Our framework can leverage both labeled and unlabeled examples and is directly compatible with off-the-shelf learning schemes without model redesign. We instantiate our framework on natural language inference, where experiments show that enforcing invariants stated in logic can help make the predictions of neural models both accurate and consistent.
1 Introduction
Recent NLP advances have been powered by improved representations (e.g., ELMo and BERT; peters2018deep; devlin2018bert), novel neural architectures (e.g., cheng2016long; seo2016bidirectional; parikh2016decomposable; vaswani2017attention), and large labeled corpora (e.g., bowman2015large; rajpurkar2016squad; N181101). Consequently, we have seen progressively improving performance on benchmarks such as GLUE wang2018glue. But are models really becoming better? We take the position that, while tracking performance on a leaderboard is necessary to characterize model quality, it is not sufficient. Reasoning about language requires that a system not only draw correct inferences about textual inputs, but also be consistent in its beliefs across various inputs.
To illustrate this notion of consistency, let us consider the task of natural language inference (NLI), which seeks to identify whether a premise entails, contradicts or is unrelated to a hypothesis dagan2013recognizing. Suppose we have three sentences $P$, $H$ and $Z$, where $P$ entails $H$ and $H$ contradicts $Z$. Using these two facts, we can infer that $P$ contradicts $Z$. In other words, these three decisions are not independent of each other. Any model for textual inference should not violate this invariant defined over any three sentences, even if they are not labeled.
Today’s models are neither trained to be consistent in this fashion, nor evaluated for consistency. The decomposable attention model of parikh2016decomposable updated with ELMo violates the above constraint for the following sentences (Footnote 1: We used the model available through the Allen NLP online demo: http://demo.allennlp.org/textualentailment):

$P$: John is on a train to Berlin.

$H$: John is traveling to Berlin.

$Z$: John is having lunch in Berlin.
Highly accurate models can be inconsistent in their beliefs over groups of examples. For example, using a BERT-based NLI model that achieves about 90% F-score on the SNLI test set bowman2015large, we found that in about 46% of unlabeled sentence triples where $P$ entails $H$ and $H$ contradicts $Z$, the first sentence does not contradict the third. Observations of a similar spirit were also made by minervini2018adversarially, glockner2018breaking and nie2018analyzing.
To characterize and eliminate such errors, first, we define a method to measure the inconsistency of models with respect to invariants stated as first-order logic formulas over model predictions. We show that our definition of inconsistency strictly generalizes the standard definition of model error.
Second, we develop a systematic framework for mitigating inconsistency in models by compiling the invariants into differentiable loss functions, using t-norms klement2013triangular; gupta1991theory to soften logic. This allows us to take advantage of unlabeled examples and enforce consistency of model predictions over them. We show that the commonly used cross-entropy loss emerges as a specific instance of our framework. Our framework can be easily instantiated with modern neural network architectures.
To show the effectiveness of our approach, we instantiate it on the NLI task. We show that even state-of-the-art models can be highly inconsistent in their predictions, but our approach significantly reduces inconsistency.
In summary, our contributions are:

We define a mechanism to measure model inconsistency with respect to declaratively specified invariants.

We present a framework that compiles knowledge stated in first-order logic to loss functions that mitigate inconsistency.

We show that our learning framework can reduce prediction inconsistencies even with a small amount of annotated examples, without sacrificing predictive accuracy. (Footnote 2: Our code to replay our experiments is archived at https://github.com/utahnlp/consistency.)
2 A Framework for (In)consistency
In this section, we will present a systematic approach for measuring and mitigating inconsistent predictions. A prediction is incorrect if it disagrees with what is known to be true. Similarly, predictions are inconsistent if they do not follow a known rule. Therefore, a model’s errors can be defined by their concordance with declarative knowledge. We will formalize this intuition by first developing a uniform representation for both labeled examples and consistency constraints (§2.1). Then, we will present a general definition of errors in the context of such a representation (§2.2). Finally, we will show a logic-driven approach for designing training losses (§2.3).
As a running example, we will use the NLI task, whose goal is to predict one of three labels: Entailment (E), Contradiction (C), or Neutral (N).
2.1 Representing Knowledge
Suppose $x$ is a collection of examples (perhaps labeled). We write constraints about them as a conjunction of statements in logic:
$$\bigwedge_{(L,R) \in r} L(x) \rightarrow R(x) \qquad (1)$$
Here, $r$ is the set of rules, and $L$ and $R$ are Boolean formulas, i.e., antecedents and consequents, constructed from model predictions on examples in $x$.
One example of such an invariant is the constraint from §1, which can be written as $E(P,H) \wedge C(H,Z) \rightarrow C(P,Z)$, where, e.g., the predicate $E(P,H)$ denotes that the model predicts the label Entailment for the premise $P$ and the hypothesis $H$. We can also represent labeled examples as constraints: “If an example $x$ is annotated with label $Y^\star$, then the model should predict so.” In logic, we write $\top \rightarrow Y^\star(x)$. (Footnote 3: The symbol $\top$ denotes the Boolean true.) Seen this way, the expression (1) could represent labeled data, unlabeled groups of examples with constraints between them, or a combination.
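The representation above can be sketched in a few lines of Python. This is only an illustration, not the released implementation: the encoding of a collection as a dictionary from sentence pairs to predicted labels, and the helper names `predicted`, `conj`, `always_true` and `holds`, are assumptions made here for the example.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical encoding: a collection x maps a sentence pair to the label
# ("E", "C" or "N") that the model predicted for it.
Pair = Tuple[str, str]
Predictions = Dict[Pair, str]
Formula = Callable[[Predictions], bool]
Rule = Tuple[Formula, Formula]  # (antecedent L, consequent R)


def predicted(label: str, pair: Pair) -> Formula:
    """Predicate that holds when the model predicted `label` for `pair`."""
    return lambda x: x[pair] == label


def conj(*formulas: Formula) -> Formula:
    """Conjunction of Boolean formulas."""
    return lambda x: all(f(x) for f in formulas)


def always_true(x: Predictions) -> bool:
    """The Boolean constant true, used as the antecedent of labeled examples."""
    return True


def holds(rules: List[Rule], x: Predictions) -> bool:
    """Evaluate the conjunction of L(x) -> R(x) over all rules."""
    return all((not L(x)) or R(x) for L, R in rules)


PH, HZ, PZ = ("P", "H"), ("H", "Z"), ("P", "Z")
rules: List[Rule] = [
    # Invariant from the introduction: E(P,H) and C(H,Z) imply C(P,Z).
    (conj(predicted("E", PH), predicted("C", HZ)), predicted("C", PZ)),
    # Labeled example: the pair (P,H) is annotated as Entailment.
    (always_true, predicted("E", PH)),
]

consistent = {PH: "E", HZ: "C", PZ: "C"}
inconsistent = {PH: "E", HZ: "C", PZ: "N"}
print(holds(rules, consistent), holds(rules, inconsistent))
```

Both the invariant over three sentences and the labeled example are plain (antecedent, consequent) pairs here, which is the point of the unified representation.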
2.2 Generalizing Errors as Inconsistencies
Using the representation defined above, we can define how to evaluate predictors. We seek two properties of an evaluation metric: it should 1) quantify the inconsistency of predictions, and 2) generalize classification error. To this end, we define two types of errors: global and conditional violation. Both are defined for a dataset $D$ consisting of example collections $x$ as described above. In both definitions, $(L, R)$ ranges over the antecedent-consequent pairs of the constraints in (1).
Global Violation ($\rho$)
The global violation is the fraction of examples in a dataset where any constraint is violated. We have:
$$\rho = \frac{\sum_{x \in D} \left[\bigvee_{(L,R) \in r} \neg\left(L(x) \rightarrow R(x)\right)\right]}{|D|} \qquad (2)$$
Here, $[\cdot]$ is the indicator function.
Conditional Violation ($\tau$)
For a conditional statement, if the antecedent is not satisfied, the statement becomes trivially true. Thus, with complex antecedents, the number of examples where the constraint is true can be trivially large. To only consider those examples where the antecedent holds, we define the conditional violation as:
$$\tau = \frac{\sum_{x \in D} \left[\bigvee_{(L,R) \in r} \neg\left(L(x) \rightarrow R(x)\right)\right]}{\sum_{x \in D} \left[\bigvee_{(L,R) \in r} L(x)\right]} \qquad (3)$$
Discussion
The two metrics are complementary to each other. On one hand, to lower the global metric $\rho$, a model could avoid satisfying the antecedents. In this case, the conditional metric $\tau$ is more informative. On the other hand, the global metric reflects the impact of domain knowledge in a given dataset, while the conditional one does not. Ideally, both should be low.
Both violations strictly generalize classification error. If all the knowledge we have takes the form of labeled examples, as exemplified at the end of §2.1, both violation metrics are identical to model error. The appendix formally shows this.
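To make the two metrics concrete, the following is a minimal sketch of how they could be computed, assuming each example collection is a dictionary of discrete predictions and each constraint is an (antecedent, consequent) pair of Boolean functions. The function names and the toy data are illustrative assumptions, not part of the released code.

```python
from typing import Callable, Dict, List, Tuple

Example = Dict[str, str]
Formula = Callable[[Example], bool]
Rule = Tuple[Formula, Formula]  # (antecedent L, consequent R)


def is_violated(x: Example, rules: List[Rule]) -> bool:
    """True if some rule has a satisfied antecedent and an unsatisfied consequent."""
    return any(L(x) and not R(x) for L, R in rules)


def global_violation(dataset: List[Example], rules: List[Rule]) -> float:
    """Fraction of all collections in which at least one rule is violated."""
    return sum(is_violated(x, rules) for x in dataset) / len(dataset)


def conditional_violation(dataset: List[Example], rules: List[Rule]) -> float:
    """Violations counted only over collections where some antecedent holds."""
    triggered = sum(any(L(x) for L, _ in rules) for x in dataset)
    violated = sum(is_violated(x, rules) for x in dataset)
    return violated / triggered if triggered else 0.0


# Toy symmetry constraint: C(P,H) <-> C(H,P), written as two implications.
sym_rules: List[Rule] = [
    (lambda x: x["ph"] == "C", lambda x: x["hp"] == "C"),
    (lambda x: x["hp"] == "C", lambda x: x["ph"] == "C"),
]
pairs = [
    {"ph": "C", "hp": "C"},
    {"ph": "C", "hp": "N"},  # the only violation
    {"ph": "E", "hp": "E"},
    {"ph": "N", "hp": "N"},
]
print(global_violation(pairs, sym_rules), conditional_violation(pairs, sym_rules))

# Labeled examples only: the antecedent is always true, so both metrics
# reduce to the classification error.
label_rules: List[Rule] = [(lambda x: True, lambda x: x["pred"] == x["gold"])]
labeled = [
    {"pred": "E", "gold": "E"},
    {"pred": "C", "gold": "N"},  # the only error
    {"pred": "N", "gold": "N"},
    {"pred": "C", "gold": "C"},
]
print(global_violation(labeled, label_rules), conditional_violation(labeled, label_rules))
```

In the toy symmetry data, one of four pairs violates the constraint, but only two pairs trigger an antecedent, so the conditional violation is twice the global one. With labeled examples alone, the two values coincide with the error rate.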
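Table 1: Boolean operations and their relaxations under three t-norms. Upper-case letters are Boolean variables; lower-case letters are the corresponding values in [0, 1].
Name  Boolean Logic  Product  Gödel  Łukasiewicz
Negation  $\neg A$  $1 - a$  $1 - a$  $1 - a$
T-norm  $A \wedge B$  $ab$  $\min(a, b)$  $\max(0, a + b - 1)$
T-conorm  $A \vee B$  $a + b - ab$  $\max(a, b)$  $\min(1, a + b)$
Residuum  $A \rightarrow B$  $\min(1, b/a)$  $1$ if $a \le b$, else $b$  $\min(1, 1 - a + b)$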
2.3 Learning by Minimizing Inconsistencies
With the notion of errors, we can now focus on how to train models to minimize them. A key technical challenge involves the unification of discrete declarative constraints with the standard loss-driven learning paradigm.
To address this, we will use relaxations of logic in the form of t-norms to deterministically compile rules into differentiable loss functions. (Footnote 4: A full description of t-norms is beyond the scope of this paper; we refer the interested reader to klement2013triangular.) We treat predicted label probabilities as soft surrogates for Boolean decisions. In the rest of the paper, we will use lower case for model probabilities (e.g., $e(P,H)$) and upper case (e.g., $E(P,H)$) for Boolean predicates.
Different t-norms map the standard Boolean operations into different continuous functions. Table 1 summarizes this mapping for three t-norms: product, Gödel, and Łukasiewicz. Complex Boolean expressions can be constructed from these four operations. Thus, with t-norms to relax logic, we can systematically convert rules as in (1) into differentiable functions, which in turn serve as learning objectives to minimize constraint violations. We can use any off-the-shelf optimizer (e.g., Adam kingma2014adam). We will see concrete examples in the NLI case study in §3.
Picking a t-norm is both a design choice and an algorithmic one. Different t-norms have different numerical characteristics and their comparison is a question for future research. (Footnote 5: For example, the Gödel t-norm, used by minervini2018adversarially, has a discontinuous but semi-differentiable residuum. The Łukasiewicz t-norm can lead to zero gradients for large disjunctions, rendering learning difficult.) Here, we will focus on the product t-norm to allow comparisons to previous work: as we will see in the next section, the product t-norm strictly generalizes the widely used cross-entropy loss.
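For readers unfamiliar with t-norms, the sketch below writes out the standard textbook definitions of the three families discussed in this section as plain Python functions. The dictionary layout and key names are assumptions made for illustration; the check at the end confirms that each relaxation agrees with Boolean logic when its inputs are exactly 0 or 1.

```python
import itertools


def negation(a: float) -> float:
    """Soft negation, shared by all three t-norms."""
    return 1.0 - a


# Each entry relaxes conjunction, disjunction and implication to [0, 1].
T_NORMS = {
    "product": {
        "and": lambda a, b: a * b,
        "or": lambda a, b: a + b - a * b,
        # min(1, b / a); the a <= b branch also covers a = 0.
        "implies": lambda a, b: 1.0 if a <= b else b / a,
    },
    "godel": {
        "and": lambda a, b: min(a, b),
        "or": lambda a, b: max(a, b),
        "implies": lambda a, b: 1.0 if a <= b else b,
    },
    "lukasiewicz": {
        "and": lambda a, b: max(0.0, a + b - 1.0),
        "or": lambda a, b: min(1.0, a + b),
        "implies": lambda a, b: min(1.0, 1.0 - a + b),
    },
}

# On the corners of the unit square, every relaxation reduces to Boolean logic.
for name, ops in T_NORMS.items():
    for a, b in itertools.product([0.0, 1.0], repeat=2):
        A, B = bool(a), bool(b)
        assert ops["and"](a, b) == float(A and B)
        assert ops["or"](a, b) == float(A or B)
        assert ops["implies"](a, b) == float((not A) or B)

# Away from the corners the three relaxations differ.
for name, ops in T_NORMS.items():
    print(name, ops["and"](0.9, 0.6), ops["implies"](0.9, 0.6))
```

The soft values differ across families: for inputs 0.9 and 0.6, the conjunction is their product under the product t-norm, their minimum under Gödel, and their sum minus one under Łukasiewicz.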
3 Case Study: NLI
We study our framework using the NLI task as a case study. First, in §3.1, we will show how to represent a training set as in (1). We will also introduce two classes of domain constraints that apply to groups of premise-hypothesis pairs. Next, we will show how to compile these declaratively stated learning objectives to loss functions (§3.2). Finally, we will end this case study with a discussion about practical issues (§3.3).
3.1 Learning Objectives in Logic
Our goal is to build models that minimize inconsistency with domain knowledge stated in logic. Let us look at three such consistency requirements.
Annotation Consistency
For labeled examples, we expect that a model should predict what an annotator specifies. That is, we require
$$\forall (P,H), Y^\star \in D, \quad \top \rightarrow Y^\star(P,H) \qquad (4)$$
where $Y^\star$ represents the ground truth label for the example $(P,H)$. As mentioned at the end of §2.2, for the annotation consistency, both global and conditional violation rates are the same, and minimizing them is maximizing accuracy. In our experiments, we will report accuracy instead of violation rate for annotation consistency (to align with the literature).
Symmetry Consistency
Given any premise-hypothesis pair, the grounds for a model to predict Contradiction is that the events in the premise and the hypothesis cannot coexist simultaneously. That is, a $(P,H)$ pair is a contradiction if, and only if, the $(H,P)$ pair is also a contradiction:
$$\forall (P,H) \in D, \quad C(P,H) \leftrightarrow C(H,P) \qquad (5)$$
Transitivity Consistency
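This constraint is applicable to any three related sentences $P$, $H$ and $Z$. If we group the sentences into three pairs, namely $(P,H)$, $(H,Z)$ and $(P,Z)$, the label definitions mandate that not all of the assignments to these three pairs are allowed. The example in §1 is an allowed label assignment. We can enumerate all such valid labels as the conjunction:
$$\forall (P,H,Z) \in D, \quad \left(E(P,H) \wedge E(H,Z) \rightarrow E(P,Z)\right) \wedge \left(E(P,H) \wedge C(H,Z) \rightarrow C(P,Z)\right) \wedge \left(N(P,H) \wedge E(H,Z) \rightarrow \neg C(P,Z)\right) \wedge \left(N(P,H) \wedge C(H,Z) \rightarrow \neg E(P,Z)\right) \qquad (6)$$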
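As a small check of what such a constraint rules out, the sketch below enumerates all label assignments to the three pairs and filters them. It assumes the transitivity constraint consists of four statements: E and E imply E, E and C imply C, N and E imply not C, and N and C imply not E. The function name `allowed` is an illustrative choice.

```python
from itertools import product

LABELS = ("E", "C", "N")


def allowed(ph: str, hz: str, pz: str) -> bool:
    """Check one label assignment to (P,H), (H,Z), (P,Z) against the four statements."""
    rules = [
        (ph == "E" and hz == "E", pz == "E"),
        (ph == "E" and hz == "C", pz == "C"),
        (ph == "N" and hz == "E", pz != "C"),
        (ph == "N" and hz == "C", pz != "E"),
    ]
    return all((not antecedent) or consequent for antecedent, consequent in rules)


valid = [labels for labels in product(LABELS, repeat=3) if allowed(*labels)]
print(len(valid), "of", len(LABELS) ** 3, "assignments are allowed")
# prints "21 of 27 assignments are allowed"
```

Under these four statements, six of the 27 possible assignments are excluded, and the example from the introduction (Entailment, Contradiction, Contradiction) is among the allowed ones.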
3.2 Inconsistency Losses
Using the consistency constraints stated in §3.1, we can now derive the inconsistency losses to minimize. For brevity, we will focus on the annotation and symmetry consistencies.
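First, let us examine annotation consistency. We can write the universal quantifier in (4) as a conjunction to get:
$$\bigwedge_{(P,H), Y^\star \in D} \top \rightarrow Y^\star(P,H) \qquad (7)$$
Using the product t-norm from Table 1, we get the learning objective of maximizing the probability of the true labels, where $y^\star(P,H)$ is the model probability of the true label:
$$\prod_{(P,H), Y^\star \in D} y^\star(P,H) \qquad (8)$$
Or equivalently, by transforming to the negative log space, we get the annotation loss:
$$L_{ann} = \sum_{(P,H), Y^\star \in D} -\log y^\star(P,H) \qquad (9)$$
We see that we get the familiar cross-entropy loss function using the definition of inconsistency with the product t-norm! (Footnote 6: rocktaschel2015injecting had a similar finding.)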
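The equivalence with cross-entropy can be checked numerically. The sketch below relaxes each labeled example with the product residuum, multiplies the results as the product t-norm prescribes for a conjunction, and compares the negative log of that product with the usual sum of negative log probabilities. The function names and the three probabilities are made up for illustration.

```python
import math


def product_implies(a: float, b: float) -> float:
    """Residuum of the product t-norm: min(1, b / a), with a = 0 mapped to 1."""
    return 1.0 if a <= b else b / a


def annotation_truth(true_label_probs):
    """Soft truth value of the conjunction of (True -> Y*(P,H)) over the data."""
    truth = 1.0
    for p in true_label_probs:
        truth *= product_implies(1.0, p)  # with antecedent 1, this is just p
    return truth


def cross_entropy(true_label_probs):
    """Standard cross-entropy: sum of negative log probabilities of true labels."""
    return -sum(math.log(p) for p in true_label_probs)


# Hypothetical probabilities a model assigns to the true labels of three examples.
probs = [0.9, 0.6, 0.75]
annotation_loss = -math.log(annotation_truth(probs))
print(annotation_loss, cross_entropy(probs))
```

The two printed numbers agree up to floating-point error, since the log of a product is the sum of the logs.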
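Next, let us look at symmetry consistency:
$$\bigwedge_{(P,H) \in D} C(P,H) \leftrightarrow C(H,P) \qquad (10)$$
Using the product t-norm, and writing the biconditional as two implications over the contradiction probabilities $c(P,H)$ and $c(H,P)$, we get:
$$\prod_{(P,H) \in D} \min\left(1, \frac{c(H,P)}{c(P,H)}\right) \min\left(1, \frac{c(P,H)}{c(H,P)}\right) \qquad (11)$$
Transforming to the negative log space as before, we get the symmetry loss:
$$L_{sym} = \sum_{(P,H) \in D} \left|\log c(P,H) - \log c(H,P)\right| \qquad (12)$$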
The loss for transitivity can also be similarly derived. We refer the reader to the appendix for details.
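The symmetry loss admits the same kind of numerical check. In the sketch below, the biconditional is relaxed as two product residua, one per direction, and the negative log of their product is compared with the absolute difference of the log contradiction probabilities. The function names are illustrative, and the probabilities are assumed to lie strictly between 0 and 1.

```python
import math


def product_implies(a: float, b: float) -> float:
    """Residuum of the product t-norm: min(1, b / a), with a = 0 mapped to 1."""
    return 1.0 if a <= b else b / a


def symmetry_loss_from_tnorm(c_ph: float, c_hp: float) -> float:
    """-log of the relaxed C(P,H) <-> C(H,P), written as two implications."""
    forward = product_implies(c_ph, c_hp)
    backward = product_implies(c_hp, c_ph)
    return -math.log(forward * backward)


def symmetry_loss(c_ph: float, c_hp: float) -> float:
    """Closed form: absolute difference of the log contradiction probabilities."""
    return abs(math.log(c_ph) - math.log(c_hp))


for c_ph, c_hp in [(0.8, 0.2), (0.2, 0.8), (0.3, 0.3)]:
    print(c_ph, c_hp, symmetry_loss_from_tnorm(c_ph, c_hp), symmetry_loss(c_ph, c_hp))
```

One of the two residua always equals 1, so only the direction in which the probability drops contributes to the loss. The loss is zero exactly when the model assigns the same contradiction probability to a pair and its mirrored version.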
The important point is that we can systematically convert logical statements to loss functions, and cross-entropy is only one such loss. To enforce some or all of these constraints, we add their corresponding losses. In our case study, with all constraints, the goal of learning is to minimize:
$$L = L_{ann} + \lambda_{sym} L_{sym} + \lambda_{tran} L_{tran} \qquad (13)$$
Here, $L_{ann}$, $L_{sym}$ and $L_{tran}$ are the annotation, symmetry and transitivity losses, and the $\lambda$’s are hyperparameters to control the influence of each loss term.
3.3 Training Constrained Models
The derived loss functions are directly compatible with off-the-shelf optimizers. The symmetry/transitivity consistencies admit using unlabeled examples, while annotation consistency requires labeled examples. Thus, in §4, we will use both labeled and unlabeled data to power training.
Ideally, we want the unlabeled dataset to be absolutely informative, meaning a model learns from every example. Unfortunately, obtaining such a dataset remains an open question since new examples are required to be both linguistically meaningful and difficult enough for the model. minervini2018adversarially used a language model to generate unlabeled adversarial examples. Another way is via pivoting through a different language, which has a long history in machine translation (e.g., kay1997proper; mallinson2017paraphrasing).
Since our focus is to study inconsistency, as an alternative, we propose a simple method to create unlabeled examples: we randomly sample sentences from the same topic. In §4, we will show that even random sentences can be surprisingly informative because the derived losses operate in real-valued space instead of on discrete decisions.
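Table 2: Symmetry and transitivity violation rates (%) of baseline models trained with 5% and 100% of the labeled data. $\rho$ and $\tau$ are the global and conditional violations; the subscripts S and T denote symmetry and transitivity.
Config  $\rho_S$ (5%)  $\tau_S$ (5%)  $\rho_T$ (5%)  $\tau_T$ (5%)  $\rho_S$ (100%)  $\tau_S$ (100%)  $\rho_T$ (100%)  $\tau_T$ (100%)
BERT w/ SNLI  26.3  64.4  4.9  14.8  18.6  60.3  4.7  14.9
BERT w/ MultiNLI  28.4  69.3  7.0  18.5  20.6  58.9  5.6  17.5
BERT w/ SNLI+MultiNLI  25.3  62.4  4.8  14.8  18.1  59.6  4.5  14.8
BERT w/ SNLI+MultiNLI (fine-tuned twice)  22.1  67.1  4.1  13.7  19.3  59.7  4.5  15.2
LSTM w/ SNLI+MultiNLI  25.8  69.5  9.9  21.0  16.8  53.6  5.3  16.0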
4 Experiments
In this section, we evaluate our framework using (near) state-of-the-art approaches for NLI, primarily based on BERT, and also compare to an LSTM model. We use the SNLI and MultiNLI wang2018glue datasets to define annotation consistency. Our LSTM model is based on the decomposable attention model with a BiLSTM encoder and GloVe embeddings pennington2014glove. Our BERT model is based on the pretrained BERT, fine-tuned on SNLI/MultiNLI. The constrained models are initialized with the fine-tuned BERT and fine-tuned again with inconsistency losses. (Footnote 7: This is critical when label supervision is limited.) For fair comparison, we also show results of BERT models fine-tuned twice.
Our constrained models are trained on both labeled and unlabeled examples. We expect that the different inconsistencies do not conflict with each other. Hence, we select hyperparameters (e.g., the $\lambda$’s) using development accuracy only (i.e., annotation consistency). We refer the reader to the appendix for details of our experimental setup.
4.1 Datasets
To be comprehensive, we will use both SNLI and MultiNLI to train our models, but we also show individual results.
We study the impact of the amount of label supervision by randomly sampling different percentages of labeled examples. For each case, we also sample the same percentages from the corresponding development sets for model selection. For the MultiNLI dataset, we use the matched dev for validation and mismatched dev for evaluation.
Mirrored Instances (M)
Given a labeled example, we construct its mirrored version by swapping the premise and the hypothesis. This results in the same number of unlabeled sentence pairs as the annotated dataset. When sampling by percentage, we will only use the sampled examples to construct mirrored examples. We use this dataset for symmetry consistency.
Unlabeled Instance Triples (T)
For the transitivity constraint, we sample k sentence triples from MS COCO lin2014microsoft captions. From these, we construct three examples as in §3.1: sentences $(P,H,Z)$ give the pairs $(P,H)$, $(H,Z)$, and $(P,Z)$. In all, we have k unlabeled example triples for the transitivity constraint.
Unlabeled Instance Pairs (U)
For each sentence triple in the dataset T, we take the first example $(P,H)$ and construct its mirrored example, i.e., $(H,P)$. This yields k unlabeled instance pairs for training with the symmetry loss.
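The construction of the three unlabeled datasets can be sketched as follows. This is an illustration under stated assumptions rather than the released data pipeline: the grouping of captions by topic (a dictionary here), the toy sentences, and the function names `mirror`, `sample_triples` and `triple_to_pairs` are all hypothetical.

```python
import random
from typing import Dict, List, Tuple

Pair = Tuple[str, str]
Triple = Tuple[str, str, str]


def mirror(pairs: List[Pair]) -> List[Pair]:
    """Swap premise and hypothesis of every pair (datasets M and U)."""
    return [(h, p) for p, h in pairs]


def sample_triples(by_topic: Dict[str, List[str]], n: int, seed: int = 0) -> List[Triple]:
    """Draw n sentence triples, each from a single randomly chosen topic (dataset T)."""
    rng = random.Random(seed)
    topics = [t for t, sentences in by_topic.items() if len(sentences) >= 3]
    triples = []
    for _ in range(n):
        p, h, z = rng.sample(by_topic[rng.choice(topics)], 3)
        triples.append((p, h, z))
    return triples


def triple_to_pairs(triple: Triple) -> List[Pair]:
    """Turn sentences (P, H, Z) into the examples (P,H), (H,Z), (P,Z)."""
    p, h, z = triple
    return [(p, h), (h, z), (p, z)]


# Toy stand-in for captions grouped by topic.
captions = {
    "kitchen": ["A man cooks.", "A chef is at a stove.", "Someone washes dishes.", "A woman bakes bread."],
    "street": ["A bus is parked.", "People cross the road.", "A dog runs on the sidewalk."],
}
triples = sample_triples(captions, n=5)
first_pairs = [triple_to_pairs(t)[0] for t in triples]  # the first example (P, H)
mirrored_pairs = mirror(first_pairs)                     # its mirrored example (H, P)
print(triples[0])
print(first_pairs[0], mirrored_pairs[0])
```

The same `mirror` helper covers the mirrored instances built from labeled data, where the premise and hypothesis of each sampled labeled example are swapped and the label is dropped.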
Evaluation Dataset
We sample a different set of 100k example triples for measuring transitivity consistency. For symmetry consistency, we follow the above procedure for the dataset U to construct evaluation instance pairs. Recall that the definition of inconsistency allows measuring model quality with unlabeled data.
4.2 Inconsistency of Neural Models
In Table 2, we report the impact of the amount of annotated data on symmetry/transitivity consistencies by using different percentages of labeled examples. We see that both LSTM and BERT models have symmetry consistency violations, while the transitivity consistency has lower violations. Surprisingly, the LSTM model performed on par with BERT in terms of symmetry/transitivity consistency; stronger representations do not necessarily mean more consistent models.
The table shows that, given an example and its mirrored version, if the BERT baseline predicts a Contradiction on one, it has about a 60% chance ($\tau_S$) of making an inconsistent judgement on the other. Further, we see that the inconsistencies are not affected much by different datasets. Models trained on SNLI are as inconsistent as ones trained on MultiNLI. Combining them only gives slight improvements. Also, fine-tuning twice does not improve much over models fine-tuned once.
Finally, with more annotation, a model has fewer symmetry consistency violations. However, the same observation does not apply to the transitivity consistency. In the following sections, we will show that we can almost annihilate these inconsistencies using the losses from §3.2.
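Table 3: Test accuracy (%) on SNLI and MultiNLI of BERT baselines and constrained models, with different percentages of labeled data.
Config  SNLI (1%)  MultiNLI (1%)  SNLI (5%)  MultiNLI (5%)  SNLI (20%)  MultiNLI (20%)  SNLI (100%)  MultiNLI (100%)
SNLI+MultiNLI  79.7  70.1  84.6  77.2  87.8  80.6  90.1  83.5
SNLI+MultiNLI (fine-tuned twice)  80.3  71.0  85.3  77.4  87.9  80.7  90.3  84.0
w/ M  80.1  71.0  85.3  77.8  88.1  80.6  90.3  84.1
w/ M,U  80.2  71.0  85.4  77.2  88.1  80.9  90.5  84.3
w/ M,U,T  80.6  71.1  85.4  77.2  88.1  80.9  90.2  84.2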
4.3 Reducing Inconsistencies
We will study the effect of symmetry and transitivity consistency losses in turn using the BERT models. To the baseline models, we incrementally include the M, U, and T datasets. We expect that the constrained models should have accuracies at least on par with the baseline (though one of the key points of this paper is that accuracy by itself is not a comprehensive metric).
In Fig. 1, we present both the global and conditional violation rates of baselines and the constrained models. We see that mirrored examples (i.e., the w/ M curve) greatly reduced the symmetry inconsistency. Further, with k unlabeled example pairs (the w/ M,U curve), we can further reduce the error rate. The same observation also applies when combining symmetry with the transitivity constraint.
Fig. 2 shows the results for transitivity inconsistency. The transitivity inconsistency is, again, greatly reduced for both the global and conditional violations. We refer the reader to the appendix for exact numbers.
We see that with our augmented losses, even a model using 1% label supervision can be much more consistent than the baselines trained on the 100% training set! This suggests that label supervision does not explicitly encode the notion of consistency, and consequently models do not get this information from the training data.
With the simultaneous decline in global and conditional violation rates, the constrained models learn to agree with the consistency requirements specified declaratively. As we will see in the next section, doing so does not sacrifice model accuracy.
4.4 Interaction of Losses
In Table 3, we show the impact of symmetry and transitivity consistency on test accuracy. The interaction between symmetry and transitivity consistency is covered in Figs. 1 and 2.
Our goal is to minimize all inconsistencies without sacrificing one for another. In Table 3, we see that lower symmetry/transitivity inconsistency generally does not reduce test accuracy, but we do not observe substantial improvement either. In conjunction with the observations from above, this suggests that test sets do not explicitly measure symmetry/transitivity consistency.
From Figs. 1 and 2, we see that models constrained by both symmetry and transitivity losses are generally more consistent than models using symmetry loss alone. Further, we see in Fig. 2 that using the mirrored dataset alone can even mitigate the transitivity errors. With dataset U, the transitivity inconsistency is strongly reduced by the symmetry inconsistency loss. These observations suggest that the compositionality of constraints does not pose internal conflict to the model. They are in fact beneficial to each other.
Interestingly, in Fig 2, the models trained with mirrored dataset (w/ M) become more inconsistent in transitivity measurement when using more training data. We believe there are two factors causing this. Firstly, there is a vocabulary gap between SNLI/MultiNLI data and our unlabeled datasets (U and T). Secondly, the w/ M models are trained with symmetry consistency but evaluated with transitivity consistency. The slightly rising inconsistency implies that, without vocabulary coverage, training with one consistency might not always benefit another consistency, even using more training data.
When label supervision is limited (i.e., 1%), the models can easily overfit via the transitivity loss. As a result, models trained on the combined losses (i.e., w/ M,U,T) have slightly larger transitivity inconsistency than models trained with mirrored data (i.e., w/ M) alone. In fact, if we use no label supervision at all, the symmetry and transitivity losses can push every prediction towards the label Neutral. But such predictions sacrifice annotation consistency. Therefore, we believe that some amount of label supervision is necessary.
5 Analysis
In this section, we present an analysis of how the different losses affect model prediction and how informative they are during training.
5.1 Coverage of Unlabeled Dataset
Table 4 shows the coverage of the three unlabeled datasets during the first training epoch. Specifically, we count the percentage of unlabeled examples where the symmetry/transitivity loss is positive. The coverage decreases in subsequent epochs as the model learns to minimize constraint violations. We see that both datasets M and U have high coverage. This is because, as mentioned in §2, our loss function works in real-valued space instead of on discrete decisions. The coverage of the dataset T is much lower because the compositional antecedent in transitivity statements holds less often than the unary antecedent for symmetry.
Table 4: Coverage (%) of the unlabeled datasets M, U and T during the first training epoch.
Data  M  U  T
5% w/ M,U,T  99.8  99.4  12.0
100% w/ M,U,T  98.7  97.6  6.8
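The coverage statistic can be illustrated with a short sketch. It assumes the symmetry loss takes the form of an absolute difference of log contradiction probabilities and, as a simplification of taking the most probable label, treats a probability above 0.5 as a discrete Contradiction decision. The function names and probabilities are made up for the example.

```python
import math


def symmetry_loss(c_ph: float, c_hp: float) -> float:
    """Absolute difference of the log contradiction probabilities."""
    return abs(math.log(c_ph) - math.log(c_hp))


def coverage(losses, tol: float = 0.0) -> float:
    """Percentage of examples whose loss is greater than tol."""
    return 100.0 * sum(loss > tol for loss in losses) / len(losses)


# Hypothetical contradiction probabilities for a pair and its mirrored version.
contradiction_probs = [(0.10, 0.12), (0.70, 0.20), (0.30, 0.30), (0.05, 0.40)]

soft_losses = [symmetry_loss(a, b) for a, b in contradiction_probs]
# Discrete view: a violation only when the two thresholded decisions disagree.
discrete_violations = [float((a > 0.5) != (b > 0.5)) for a, b in contradiction_probs]

print(coverage(soft_losses), coverage(discrete_violations))
# prints "75.0 25.0"
```

Three of the four toy pairs produce a positive loss although only one of them flips a discrete decision, which mirrors the explanation given above for the high coverage of datasets M and U.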
5.2 Distribution of Predictions
In Table 5, we present the distribution of model predictions on the 100k evaluation example pairs for symmetry consistency. Clearly, the number of constraint-violating (off-diagonal) predictions dropped significantly. Also note that the number of Neutral predictions nearly doubled in our constrained model. This meets our expectation because the example pairs are constructed from randomly sampled sentences under the same topic.
We also present the distribution of predictions on example triples for the transitivity consistency in Table 6. As expected, with our transitivity consistency, the proportion of the label Neutral gets significantly higher as well. Further, in Table 7, we show the error rates of each individual transitivity consistency statement. Clearly, our framework mitigated the violation rates on all four statements.
While the logic-derived regularization pushes model prediction on unlabeled datasets towards Neutral, the accuracies on labeled test sets are not compromised. We believe this relates to the design of current NLI datasets where the three labels are balanced. But in the real world, neutrality represents potentially infinite negative space while entailments and contradictions are rarer. The total number of neutral examples across both the SNLI and MultiNLI test sets is about 7k. Can we use these 7k examples to evaluate the nearly infinite negative space? We believe not.
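Table 5: Distribution of predictions on the evaluation example pairs for symmetry consistency. Rows are the labels predicted for $(P,H)$ and columns are the labels predicted for the mirrored pair $(H,P)$.
Label  E (BERT)  C (BERT)  N (BERT)  E (w/ M,U,T)  C (w/ M,U,T)  N (w/ M,U,T)
E  4649  1491  14708  2036  29  9580
C  1508  10712  6459  33  4025  627
N  14609  6633  39231  9632  613  73425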
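Table 6: Distribution of predictions on the evaluation example triples for transitivity consistency.
Model  Example  E  C  N
BERT  $(P,H)$  20848  18679  60473
BERT  $(H,Z)$  20919  18768  60313
BERT  $(P,Z)$  20779  18721  60500
w/ M,U,T  $(P,H)$  11645  4685  83670
w/ M,U,T  $(H,Z)$  11671  4703  83626
w/ M,U,T  $(P,Z)$  11585  4597  83818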
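Table 7: Violation rates (%) of each individual transitivity statement.
Transitivity  $\rho_T$ (BERT)  $\tau_T$ (BERT)  $\rho_T$ (w/ M,U,T)  $\tau_T$ (w/ M,U,T)
$E \wedge E \rightarrow E$  0.7  16.0  0.2  15.1
$E \wedge C \rightarrow C$  1.8  49.6  0.2  46.5
$N \wedge E \rightarrow \neg C$  1.2  9.0  0.2  1.8
$N \wedge C \rightarrow \neg E$  1.0  9.3  0.1  4.8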
6 Related Works and Discussion
Logic, Knowledge and Statistical Models
Using soft relaxations of Boolean formulas as loss functions has a rich history in AI. The Łukasiewicz t-norm drives knowledge-driven learning and inference in probabilistic soft logic kimmig2012short. li2019augmenting show how to augment existing neural network architectures with domain knowledge using the Łukasiewicz t-norm. pmlrv80xu18h proposed a general framework for designing a semantically informed loss, without t-norms, for constraining a complex output space. In the same vein, fischer2019dl2 also proposed a framework for designing losses with logic, but using a bespoke mapping of the Boolean operators.
Our work is also conceptually related to posterior regularization ganchev2010posterior and constrained conditional models chang2012structured, which integrate knowledge with statistical models. Using posterior regularization with imitation learning, hu2016harnessing transferred knowledge from rules into neural parameters. rocktaschel2015injecting embedded logic into distributed representations for entity relation extraction. alberti2019synthetic imposed answer consistency over generated questions for machine comprehension. Ad-hoc regularizers have been proposed for process comprehension du2019consistent, semantic role labeling mehta2018towards, and summarization hsu2018unified.
Natural Language Inference
In the literature, it has been shown that even highly accurate models show a decline in performance with perturbed examples. This lack of robustness of NLI models has been shown by comparing model performance on pre-defined propositional rules for swapped datasets wang2018if or outlining large-scale stress tests to measure stability of models to semantic, lexical and random perturbations naik2018stress. Moreover, adversarial training examples produced by paraphrasing training data iyyer2018adversarial or inserting additional seemingly important, yet unrelated, information to training instances jia2017adversarial have been used to show model inconsistency. Finally, adversarially labeled examples have been shown to improve prediction accuracy kang2018adventure. Also related in this vein is the idea of dataset inoculation liu2019inoculation, where models are fine-tuned by exposing them to a challenging dataset.
The closest related work to this paper is probably that of minervini2018adversarially, which uses the Gödel t-norm to discover adversarial examples that violate constraints. There are three major differences compared to this paper: 1) our definition of inconsistency is a strict generalization of errors of model predictions, giving us a unified framework that includes cross-entropy as a special case, 2) our framework does not rely on the construction of adversarial datasets, and 3) we studied the interaction of annotated examples vs. unlabeled examples via constraints, showing that our constraints can yield a strongly consistent model with even a small amount of label supervision.
7 Conclusion
In this paper, we proposed a general framework to measure and mitigate model inconsistencies. Our framework systematically derives loss functions from domain knowledge stated in logic rules to constrain model training. As a case study, we instantiated the framework on a state-of-the-art model for the NLI task, showing that models can be highly accurate and consistent at the same time. Our framework is easily extensible to other domains with rich output structure, e.g., entity relation extraction and multi-label classification.
Acknowledgements
We thank members of the NLP group at the University of Utah for their valuable insights and suggestions, especially Mattia Medina Grespan for pointing out fuzzy logic; and reviewers for pointers to related works, corrections, and helpful comments. We also acknowledge the support of NSF SaTC-1801446, and gifts from Google and NVIDIA.
References
Appendix A Appendices
A.1 Violations as Generalizing Errors
Both global and conditional violations defined in the body of the paper generalize classifier error. In this section, we will show that for a dataset with only labeled examples, and no additional constraints, both are identical to error.
Recall that an example $x$ annotated with label $Y^\star$ can be written as $\top \rightarrow Y^\star(x)$. If we have a dataset $D$ of such examples and no constraints, in our unified representation of examples, we can write this as the following conjunction:
$$\bigwedge_{x \in D} \top \rightarrow Y^\star(x)$$
We can now evaluate the two definitions of violation for this dataset.
First, note that the denominator in the definition of the conditional violation $\tau$ counts the number of examples because the antecedent for all examples is always true. This makes $\rho$ and $\tau$ equal. Moreover, the numerator is the number of examples where the predicted label for an example is not $Y^\star$. In other words, the value of $\rho$ and $\tau$ represents the fraction of examples in $D$ that are mislabeled.
The strength of the unified representation and the definition of violation comes from the fact that they apply to arbitrary constraints.
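A.2 Loss for Transitivity Consistency
This section shows the loss associated with the transitivity consistency in the NLI case study. For an individual example $(P,H,Z)$, applying the product t-norm to the definition of the transitivity consistency constraint and transforming to the negative log space, we get the loss
$$L_{tran}(P,H,Z) = \mathrm{ReLU}\left(\log e(P,H) + \log e(H,Z) - \log e(P,Z)\right) + \mathrm{ReLU}\left(\log e(P,H) + \log c(H,Z) - \log c(P,Z)\right) + \mathrm{ReLU}\left(\log n(P,H) + \log e(H,Z) - \log\left(1 - c(P,Z)\right)\right) + \mathrm{ReLU}\left(\log n(P,H) + \log c(H,Z) - \log\left(1 - e(P,Z)\right)\right) \qquad (14)$$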
That is, the total transitivity loss is the sum of this expression over the entire dataset.
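A minimal sketch of this per-example loss is given below. It assumes the transitivity constraint consists of four statements (E and E imply E, E and C imply C, N and E imply not C, N and C imply not E) and uses the fact that the negative log of the product residuum is a ReLU of log-probability differences. The function names and the probabilities are illustrative, and all probabilities are assumed to lie strictly between 0 and 1.

```python
import math
from typing import Dict, List


def relu(v: float) -> float:
    return max(0.0, v)


def implication_loss(antecedents: List[float], consequent: float) -> float:
    """-log of the product residuum: ReLU(sum_i log a_i - log b)."""
    return relu(sum(math.log(a) for a in antecedents) - math.log(consequent))


def transitivity_loss(ph: Dict[str, float], hz: Dict[str, float], pz: Dict[str, float]) -> float:
    """Loss for one triple; each argument maps the labels E, C, N to probabilities."""
    return (
        implication_loss([ph["E"], hz["E"]], pz["E"])          # E and E -> E
        + implication_loss([ph["E"], hz["C"]], pz["C"])        # E and C -> C
        + implication_loss([ph["N"], hz["E"]], 1.0 - pz["C"])  # N and E -> not C
        + implication_loss([ph["N"], hz["C"]], 1.0 - pz["E"])  # N and C -> not E
    )


entails = {"E": 0.90, "C": 0.05, "N": 0.05}
contradicts = {"E": 0.05, "C": 0.90, "N": 0.05}
neutral = {"E": 0.05, "C": 0.05, "N": 0.90}

# P entails H, H contradicts Z: predicting Contradiction for (P,Z) is consistent,
# predicting Neutral is penalized by the second statement.
print(transitivity_loss(entails, contradicts, contradicts))
print(transitivity_loss(entails, contradicts, neutral))
```

The first call incurs zero loss, while the second is penalized because the model is confident in both antecedents but assigns low probability to the required consequent. Summing this quantity over all triples gives the total loss.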
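A.3 Details of Experiments
Table 8: Full results on natural language inference with different percentages of labeled data: test accuracy (%) on SNLI and MultiNLI, followed by the symmetry ($\rho_S$, $\tau_S$) and transitivity ($\rho_T$, $\tau_T$) violation rates (%). The first four rows are BERT baselines; "na" marks a test set that does not apply.
Train  SNLI (1%)  MultiNLI (1%)  $\rho_S$ (1%)  $\tau_S$ (1%)  $\rho_T$ (1%)  $\tau_T$ (1%)  SNLI (5%)  MultiNLI (5%)  $\rho_S$ (5%)  $\tau_S$ (5%)  $\rho_T$ (5%)  $\tau_T$ (5%)
SNLI  79.3  na  36.7  70.6  6.1  17.1  84.5  na  26.3  64.4  4.9  14.8
MultiNLI  na  69.0  29.1  83.1  8.2  18.4  na  76.1  28.4  69.3  7.0  18.5
SNLI+MultiNLI  79.7  70.1  38.6  71.7  4.3  13.4  84.6  77.2  25.3  62.4  4.8  14.8
SNLI+MultiNLI (fine-tuned twice)  80.3  71.0  32.4  75.0  3.9  12.8  85.3  77.4  22.1  67.1  4.1  13.7
w/ M  80.1  71.0  7.5  39.2  2.1  9.1  85.3  76.8  7.1  34.8  2.8  10.5
w/ M,U  80.2  71.0  6.1  38.2  2.5  9.8  85.4  77.2  4.6  32.5  2.0  8.3
w/ M,U,T  80.6  71.1  7.8  34.0  2.6  10.4  85.4  77.2  3.2  31.0  1.8  7.9

Train  SNLI (20%)  MultiNLI (20%)  $\rho_S$ (20%)  $\tau_S$ (20%)  $\rho_T$ (20%)  $\tau_T$ (20%)  SNLI (100%)  MultiNLI (100%)  $\rho_S$ (100%)  $\tau_S$ (100%)  $\rho_T$ (100%)  $\tau_T$ (100%)
SNLI  87.5  na  21.2  63.0  4.1  13.6  90.1  na  18.6  60.3  4.7  14.9
MultiNLI  na  80.4  25.8  58.1  5.1  16.5  na  83.7  20.6  58.9  5.6  17.5
SNLI+MultiNLI  87.8  80.6  18.6  64.3  4.4  14.4  90.1  83.5  18.1  59.6  4.5  14.8
SNLI+MultiNLI (fine-tuned twice)  87.9  80.7  19.0  64.0  4.3  14.5  90.3  84.0  19.3  59.7  4.5  15.2
w/ M  88.1  80.6  7.3  34.0  3.2  11.7  90.3  84.1  6.2  28.1  3.0  11.6
w/ M,U  88.1  80.9  1.4  31.2  1.3  5.8  90.5  84.3  1.4  26.8  1.3  6.3
w/ M,U,T  88.1  80.9  1.3  29.6  1.2  5.7  90.2  84.2  1.1  25.5  0.6  4.2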
A.3.1 Setup
For BERT baselines, we finetune them for epochs with learning rate , warmed up for all gradient updates. For constrained models, we further finetune them for another epochs with lowered learning rate . When dataset is present, we further lower the learning rate to . Optimizer is Adam across all runs. During training, we adopt Dropout rate srivastava2014dropout inside of BERT transformer encoder while at the final linear layer of classification.
For different types of data and different consistency constraints, we used different weighting factors $\lambda$. In general, we found that the smaller the amount of labeled examples, the smaller the $\lambda$ for the symmetry and transitivity consistency. In Table 9, we see that the $\lambda$’s for U and T grow exponentially with the size of annotated examples. In contrast, the $\lambda$ for the M dataset can be much higher. We found a good value for M is . This is because the size of datasets U and T is fixed to be k, while the size of dataset M is the same as the amount of labeled examples.
Having a larger $\lambda$ leads to significantly worse accuracy on the development set, especially that of SNLI. Therefore, we did not select such models for evaluation. We hypothesize that this is because SNLI and MultiNLI are crowdsourced from different domains, while MS COCO shares the same domain as SNLI. A larger scaling factor could push unlabeled examples towards Neutral, thus sacrificing the annotation consistency on SNLI examples.
Data  %  %  %  % 

M  
U  
T 
A.3.2 Results
We present the full experiment results on the natural language inference task in Table 8. Note that the accuracies of baselines fine-tuned twice are slightly better than those of models only fine-tuned once, while their symmetry/transitivity consistencies are roughly on par. We found that this observation is consistent across different fine-tuning hyperparameters (e.g., warm-up, epochs, learning rate).