A Logic-Driven Framework for Consistency of Neural Models


Tao Li, Vivek Gupta, Maitrey Mehta, Vivek Srikumar
School of Computing, University of Utah
{tli,vgupta,maitrey,svivek}@cs.utah.edu
Abstract

While neural models show remarkable accuracy on individual predictions, their internal beliefs can be inconsistent across examples. In this paper, we formalize such inconsistency as a generalization of prediction error. We propose a learning framework for constraining models using logic rules to regularize them away from inconsistency. Our framework can leverage both labeled and unlabeled examples and is directly compatible with off-the-shelf learning schemes without model redesign. We instantiate our framework on natural language inference, where experiments show that enforcing invariants stated in logic can help make the predictions of neural models both accurate and consistent.

1 Introduction

Recent NLP advances have been powered by improved representations (e.g., ELMo and BERT; peters2018deep; devlin2018bert), novel neural architectures (e.g., cheng2016long; seo2016bidirectional; parikh2016decomposable; vaswani2017attention), and large labeled corpora (e.g., bowman2015large; rajpurkar2016squad; N18-1101). Consequently, we have seen progressively improving performance on benchmarks such as GLUE wang2018glue. But, are models really becoming better? We take the position that, while tracking performance on a leaderboard is necessary to characterize model quality, it is not sufficient. Reasoning about language requires that a system not only draws correct inferences about individual textual inputs, but is also consistent in its beliefs across various inputs.

To illustrate this notion of consistency, let us consider the task of natural language inference (NLI), which seeks to identify whether a premise entails, contradicts or is unrelated to a hypothesis dagan2013recognizing. Suppose we have three sentences $P_1$, $P_2$ and $P_3$, where $P_1$ entails $P_2$ and $P_2$ contradicts $P_3$. Using these two facts, we can infer that $P_1$ contradicts $P_3$. In other words, these three decisions are not independent of each other. Any model for textual inference should not violate this invariant, defined over any three sentences, even if they are not labeled.

Today's models are neither trained to be consistent in this fashion, nor is their consistency evaluated. For instance, the decomposable attention model of parikh2016decomposable, updated with ELMo, violates the above constraint for the following sentences (we used the model available through the AllenNLP online demo: http://demo.allennlp.org/textual-entailment):

  • John is on a train to Berlin.

  • John is traveling to Berlin.

  • John is having lunch in Berlin.

Highly accurate models can be inconsistent in their beliefs over groups of examples. For example, using a BERT-based NLI model that achieves about 90% F-score on the SNLI test set bowman2015large, we found that in about 46% of unlabeled sentence triples where the first sentence entails the second and the second contradicts the third, the model fails to predict that the first sentence contradicts the third. Observations of a similar spirit were also made by minervini2018adversarially, glockner2018breaking and nie2018analyzing.

To characterize and eliminate such errors, first, we define a method to measure the inconsistency of models with respect to invariants stated as first-order logic formulas over model predictions. We show that our definition of inconsistency strictly generalizes the standard definition of model error.

Second, we develop a systematic framework for mitigating inconsistency in models by compiling the invariants into differentiable loss functions using t-norms klement2013triangular; gupta1991theory to soften logic. This allows us to take advantage of unlabeled examples and enforce consistency of model predictions over them. We show that the commonly used cross-entropy loss emerges as a specific instance of our framework. Our framework can be easily instantiated with modern neural network architectures.

To show the effectiveness of our approach, we instantiate it on the NLI task. We show that even state-of-the-art models can be highly inconsistent in their predictions, but our approach significantly reduces inconsistency.

In summary, our contributions are:

  1. We define a mechanism to measure model inconsistency with respect to declaratively specified invariants.

  2. We present a framework that compiles knowledge stated in first-order logic to loss functions that mitigate inconsistency.

  3. We show that our learning framework can reduce prediction inconsistencies even with a small amount of annotated examples, without sacrificing predictive accuracy. (Our code for replicating our experiments is archived at https://github.com/utahnlp/consistency.)

2 A Framework for (In)consistency

In this section, we will present a systematic approach for measuring and mitigating inconsistent predictions. A prediction is incorrect if it disagrees with what is known to be true. Similarly, predictions are inconsistent if they do not follow a known rule. Therefore, both kinds of errors can be characterized by how well model predictions agree with declarative knowledge. We will formalize this intuition by first developing a uniform representation for both labeled examples and consistency constraints (§2.1). Then, we will present a general definition of errors in the context of such a representation (§2.2). Finally, we will show a logic-driven approach for designing training losses (§2.3).

As a running example, we will use the NLI task, whose goal is to predict one of three labels: Entailment (E), Contradiction (C), or Neutral (N).

2.1 Representing Knowledge

Suppose $X$ is a collection of examples (perhaps labeled). We write constraints about them as a conjunction of statements in logic:

$\bigwedge_{i} \left( A_i \rightarrow C_i \right)$   (1)

Here, the $A_i$'s and $C_i$'s are Boolean formulas, i.e., antecedents and consequents, constructed from model predictions on examples in $X$.

One example of such an invariant is the constraint from §1, which can be written as $\left( E(P_1, P_2) \land C(P_2, P_3) \right) \rightarrow C(P_1, P_3)$, where, e.g., the predicate $C(P_1, P_3)$ denotes that the model predicts the label Contradiction for the pair $(P_1, P_3)$. We can also represent labeled examples as constraints: "If an example $x$ is annotated with label $y^*$, then the model should predict so." In logic, we write $\top \rightarrow Y^*(x)$, where the symbol $\top$ denotes the Boolean true and $Y^*(x)$ denotes the model predicting $y^*$ on $x$. Seen this way, the expression (1) could represent labeled data, unlabeled groups of examples with constraints between them, or a combination.

2.2 Generalizing Errors as Inconsistencies

Using the representation defined above, we can define how to evaluate predictors. We seek two properties of an evaluation metric: it should 1) quantify the inconsistency of predictions, and 2) also generalize classification error. To this end, we define two types of errors: the global and the conditional violation. Both are defined for a dataset $D$ consisting of example collections as described above.

Global Violation ($\rho$)

The global violation is the fraction of examples in a dataset where any constraint is violated. We have:

$\rho = \frac{1}{|D|} \sum_{d \in D} \mathbb{1}\big[\, \exists i: \; A_i \land \neg C_i \,\big]$   (2)

Here, $\mathbb{1}[\cdot]$ is the indicator function.

Conditional Violation ($\sigma$)

For a conditional statement, if the antecedent is not satisfied, the statement becomes trivially true. Thus, with complex antecedents, the number of examples where the constraint is true can be trivially large. To only consider those examples where the antecedent holds, we define the conditional violation as:

$\sigma = \frac{\sum_{d \in D} \mathbb{1}\big[\, \exists i: \; A_i \land \neg C_i \,\big]}{\sum_{d \in D} \mathbb{1}\big[\, \exists i: \; A_i \,\big]}$   (3)
Discussion

The two metrics are complementary to each other. On one hand, to lower the global metric $\rho$, a model could simply avoid satisfying the antecedents; in this case, the conditional metric $\sigma$ is more informative. On the other hand, the global metric reflects the impact of domain knowledge on a given dataset, while the conditional one does not. Ideally, both should be low.

Both violations strictly generalize classification error. If all the knowledge we have takes the form of labeled examples, as exemplified at the end of §2.1, both violation metrics are identical to model error. The appendix formally shows this.
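To make these definitions concrete, the following is a minimal Python sketch (with purely illustrative names, not part of our released code) that computes both metrics for a single constraint A -> C, given Boolean judgements of the antecedent and consequent on each example collection.

# A minimal sketch of the two violation metrics for a single constraint
# A -> C, given Boolean judgements of the antecedent and consequent on
# each example collection. Names here are illustrative.

def global_violation(antecedents, consequents):
    """Fraction of collections where the constraint A -> C is violated."""
    violated = sum(1 for a, c in zip(antecedents, consequents) if a and not c)
    return violated / len(antecedents)

def conditional_violation(antecedents, consequents):
    """Violations among only those collections where the antecedent holds."""
    fired = [(a, c) for a, c in zip(antecedents, consequents) if a]
    if not fired:
        return 0.0
    return sum(1 for a, c in fired if not c) / len(fired)

# Example: symmetry constraint C(P,H) -> C(H,P) on three unlabeled pairs.
antecedents = [True, True, False]   # model predicted Contradiction on (P, H)
consequents = [True, False, True]   # model predicted Contradiction on (H, P)
print(global_violation(antecedents, consequents))       # 1/3
print(conditional_violation(antecedents, consequents))  # 1/2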

Name       Boolean Logic   Product          Gödel                  Łukasiewicz
Negation   ¬A              1 - a            1 - a                  1 - a
T-norm     A ∧ B           a b              min(a, b)              max(0, a + b - 1)
T-conorm   A ∨ B           a + b - a b      max(a, b)              min(1, a + b)
Residuum   A → B           min(1, b/a)      1 if a ≤ b, else b     min(1, 1 - a + b)
Table 1: Mapping discrete statements to differentiable functions using t-norms. Literals are upper-cased (e.g., A) while real-valued probabilities are lower-cased (e.g., a). In this paper, we focus on the product t-norm.

2.3 Learning by Minimizing Inconsistencies

With the notion of errors, we can now focus on how to train models to minimize them. A key technical challenge involves the unification of discrete declarative constraints with the standard loss-driven learning paradigm.

To address this, we will use relaxations of logic in the form of t-norms to deterministically compile rules into differentiable loss functions. (A full description of t-norms is beyond the scope of this paper; we refer the interested reader to klement2013triangular.) We treat predicted label probabilities as soft surrogates for Boolean decisions. In the rest of the paper, we will use lower case (e.g., $a$) for model probabilities and upper case (e.g., $A$) for Boolean predicates.

Different t-norms map the standard Boolean operations into different continuous functions. Table 1 summarizes this mapping for three t-norms: product, Gödel, and Łukasiewicz. Complex Boolean expressions can be constructed from these four operations. Thus, with t-norms to relax logic, we can systematically convert rules such as (1) into differentiable functions, which in turn serve as learning objectives that minimize constraint violations. We can use any off-the-shelf optimizer (e.g., Adam kingma2014adam). We will see concrete examples in the NLI case study in §3.
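As an illustration of this compilation step, the sketch below relaxes the Boolean connectives with the product t-norm from Table 1 and scores the transitivity rule $E(P_1,P_2) \land C(P_2,P_3) \rightarrow C(P_1,P_3)$ on example probabilities; the tensor setup and the numbers are purely illustrative.

import torch

# Product t-norm relaxations of the Boolean connectives (Table 1).
# Inputs are probabilities in [0, 1]; outputs are differentiable.
def t_not(a):            # negation
    return 1.0 - a

def t_and(a, b):         # t-norm (conjunction)
    return a * b

def t_or(a, b):          # t-conorm (disjunction)
    return a + b - a * b

def t_implies(a, b, eps=1e-12):  # residuum (implication)
    return torch.clamp(b / (a + eps), max=1.0)

# Relaxed truth value of the rule E(P1,P2) and C(P2,P3) -> C(P1,P3),
# given predicted probabilities for the three pairs.
e_12 = torch.tensor(0.9)   # p(Entailment) for (P1, P2)
c_23 = torch.tensor(0.8)   # p(Contradiction) for (P2, P3)
c_13 = torch.tensor(0.3)   # p(Contradiction) for (P1, P3)
score = t_implies(t_and(e_12, c_23), c_13)
loss = -torch.log(score)   # larger when the rule is more strongly violated

Because the relaxation is differentiable, gradients of this loss flow back into whatever model produced the probabilities.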

Picking a t-norm is both a design choice and an algorithmic one. Different t-norms have different numerical characteristics, and their comparison is a question for future research. (For example, the Gödel t-norm, used by minervini2018adversarially, has a discontinuous but semi-differentiable residuum, while the Łukasiewicz t-norm can lead to zero gradients for large disjunctions, rendering learning difficult.) Here, we will focus on the product t-norm, which also allows comparisons to previous work: as we will see in the next section, the product t-norm strictly generalizes the widely used cross-entropy loss.

3 Case Study: NLI

We study our framework using the NLI task as a case study. First, in §3.1, we will show how to represent a training set as in (1). We will also introduce two classes of domain constraints that apply to groups of premise-hypothesis pairs. Next, we will show how to compile these declaratively stated learning objectives to loss functions (§3.2). Finally, we will end this case study with a discussion about practical issues (§3.3).

3.1 Learning Objectives in Logic

Our goal is to build models that minimize inconsistency with domain knowledge stated in logic. Let us look at three such consistency requirements.

Annotation Consistency

For labeled examples, we expect that a model should predict what an annotator specifies. That is, we require

$\forall (x, y^*) \in D: \quad \top \rightarrow Y^*(x)$   (4)

where $y^*$ represents the ground-truth label for the example $x$ and $Y^*(x)$ denotes the model predicting it. As mentioned at the end of §2.2, for the annotation consistency, the global and conditional violation rates are the same, and minimizing them is equivalent to maximizing accuracy. In our experiments, we will report accuracy instead of violation rate for annotation consistency (to align with the literature).

Symmetry Consistency

Given any premise-hypothesis pair, the grounds for a model to predict Contradiction are that the events in the premise and the hypothesis cannot coexist simultaneously. That is, a pair $(P, H)$ is a contradiction if, and only if, the mirrored pair $(H, P)$ is also a contradiction:

$\forall (P, H): \quad C(P, H) \leftrightarrow C(H, P)$   (5)
Transitivity Consistency

This constraint is applicable to any three related sentences $P_1$, $P_2$ and $P_3$. If we group the sentences into three pairs, namely $(P_1, P_2)$, $(P_2, P_3)$ and $(P_1, P_3)$, the label definitions mandate that not all label assignments to these three pairs are allowed. The example in §1 is an allowed label assignment. We can enumerate all such valid assignments as the conjunction:

$\big( E(P_1, P_2) \land E(P_2, P_3) \rightarrow E(P_1, P_3) \big)$
$\land \big( E(P_1, P_2) \land C(P_2, P_3) \rightarrow C(P_1, P_3) \big)$
$\land \big( N(P_1, P_2) \land E(P_2, P_3) \rightarrow \neg C(P_1, P_3) \big)$
$\land \big( N(P_1, P_2) \land C(P_2, P_3) \rightarrow \neg E(P_1, P_3) \big)$   (6)
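For concreteness, the following small sketch (an illustrative helper, not part of our released code) checks whether a discrete label assignment to the three pairs satisfies the four implications in (6).

# Check whether labels assigned to the pairs (P1,P2), (P2,P3), (P1,P3)
# satisfy the four transitivity implications in (6). Labels are strings
# 'E' (Entailment), 'C' (Contradiction), 'N' (Neutral).
def transitivity_consistent(y12, y23, y13):
    rules = [
        (y12 == 'E' and y23 == 'E', y13 == 'E'),   # E and E -> E
        (y12 == 'E' and y23 == 'C', y13 == 'C'),   # E and C -> C
        (y12 == 'N' and y23 == 'E', y13 != 'C'),   # N and E -> not C
        (y12 == 'N' and y23 == 'C', y13 != 'E'),   # N and C -> not E
    ]
    return all(consequent for antecedent, consequent in rules if antecedent)

print(transitivity_consistent('E', 'C', 'C'))  # True: the example from Section 1
print(transitivity_consistent('E', 'C', 'N'))  # False: violates E and C -> C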

3.2 Inconsistency Losses

Using the consistency constraints stated in §3.1, we can now derive the inconsistency losses to minimize. For brevity, we will focus on the annotation and symmetry consistencies.

First, let us examine annotation consistency. We can write the universal quantifier in (4) as a conjunction to get:

$\bigwedge_{(x, y^*) \in D} \big( \top \rightarrow Y^*(x) \big)$   (7)

Using the product t-norm from Table 1, we get the learning objective of maximizing the probability of the true labels:

$\max \; \prod_{(x, y^*) \in D} p_{y^*}(x)$   (8)

where $p_{y^*}(x)$ is the probability the model assigns to the label $y^*$ for the example $x$. Or equivalently, by transforming to the negative log space, we get the annotation loss:

$L_{ann} = -\sum_{(x, y^*) \in D} \log p_{y^*}(x)$   (9)

We see that we get the familiar cross-entropy loss function from the definition of inconsistency with the product t-norm (rocktaschel2015injecting had a similar finding)!
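In code, the annotation loss (9) is just the summed negative log-likelihood of the gold labels; a minimal PyTorch sketch (with illustrative logits and label indices):

import torch
import torch.nn.functional as F

# Annotation loss (9): the summed negative log-probability of the gold labels,
# which is exactly PyTorch's cross-entropy loss with sum reduction.
logits = torch.randn(4, 3)            # 4 examples, 3 NLI labels
gold = torch.tensor([0, 2, 1, 1])     # annotated labels (indices are illustrative)
log_probs = F.log_softmax(logits, dim=-1)
l_ann = -log_probs[torch.arange(4), gold].sum()

assert torch.allclose(l_ann, F.cross_entropy(logits, gold, reduction="sum"))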

Next, let us look at symmetry consistency:

$\bigwedge_{(P, H)} \big( C(P, H) \rightarrow C(H, P) \big) \land \big( C(H, P) \rightarrow C(P, H) \big)$   (10)

Using the product t-norm, we get:

$\max \; \prod_{(P, H)} \min\!\left(1, \frac{c(H, P)}{c(P, H)}\right) \cdot \min\!\left(1, \frac{c(P, H)}{c(H, P)}\right)$   (11)

Transforming to the negative log space as before, where $c(P, H)$ denotes the predicted probability of Contradiction for the pair $(P, H)$, we get the symmetry loss:

$L_{sym} = \sum_{(P, H)} \big| \log c(P, H) - \log c(H, P) \big|$   (12)
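A minimal PyTorch sketch of the symmetry loss (12), assuming the model exposes label probabilities for a pair and its mirrored version (the function name and the Contradiction index are illustrative):

import torch

def symmetry_loss(probs_ph, probs_hp, contra_idx=1, eps=1e-12):
    """Symmetry loss (12): sum over the batch of |log c(P,H) - log c(H,P)|.

    probs_ph, probs_hp: (batch, 3) label probabilities for (P, H) and (H, P).
    contra_idx: index of the Contradiction label (an illustrative choice).
    """
    c_ph = probs_ph[:, contra_idx].clamp_min(eps)
    c_hp = probs_hp[:, contra_idx].clamp_min(eps)
    return (c_ph.log() - c_hp.log()).abs().sum()

Note that the loss only penalizes asymmetry in the Contradiction probabilities; it does not require knowing the correct label for either direction.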

The loss for transitivity can also be similarly derived. We refer the reader to the appendix for details.

The important point is that we can systematically convert logical statements to loss functions, and cross-entropy is only one such loss. To enforce some or all of these constraints, we add their corresponding losses. In our case study, with all constraints, the goal of learning is to minimize:

$L = \lambda_{ann} L_{ann} + \lambda_{sym} L_{sym} + \lambda_{tran} L_{tran}$   (13)

Here, the $\lambda$'s are hyperparameters that control the influence of each loss term.

3.3 Training Constrained Models

The derived loss functions are directly compatible with off-the-shelf optimizers. The symmetry and transitivity consistencies admit unlabeled examples, while the annotation consistency requires labeled examples. Thus, in §4, we will use both labeled and unlabeled data for training.
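A schematic training step under these assumptions might interleave a labeled batch with unlabeled sentence pairs for the symmetry loss; the model, optimizer, data loaders, and $\lambda$ values below are all placeholders rather than our actual implementation.

import torch.nn.functional as F

# One schematic training step mixing a labeled batch with unlabeled mirrored pairs.
# `model` maps (premise, hypothesis) batches to logits; every name is a placeholder.
def training_step(model, optimizer, labeled_batch, unlabeled_pairs,
                  lambda_ann=1.0, lambda_sym=1.0, contra_idx=1, eps=1e-12):
    premises, hypotheses, gold = labeled_batch
    loss = lambda_ann * F.cross_entropy(model(premises, hypotheses), gold)

    u_prem, u_hyp = unlabeled_pairs
    c_ph = model(u_prem, u_hyp).softmax(dim=-1)[:, contra_idx].clamp_min(eps)
    c_hp = model(u_hyp, u_prem).softmax(dim=-1)[:, contra_idx].clamp_min(eps)
    loss = loss + lambda_sym * (c_ph.log() - c_hp.log()).abs().sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()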

Ideally, we want the unlabeled dataset to be maximally informative, meaning that a model learns from every example. Unfortunately, obtaining such a dataset remains an open question, since new examples are required to be both linguistically meaningful and difficult enough for the model. minervini2018adversarially used a language model to generate unlabeled adversarial examples. Another option is pivoting through a different language, which has a long history in machine translation (e.g., kay1997proper; mallinson2017paraphrasing).

Since our focus is to study inconsistency, we propose a simpler alternative for creating unlabeled examples: we randomly sample sentences from the same topic. In §4, we will show that even random sentences can be surprisingly informative, because the derived losses operate in real-valued space instead of on discrete decisions.

                                 5%                              100%
Config                      ρ_sym  σ_sym  ρ_tran  σ_tran    ρ_sym  σ_sym  ρ_tran  σ_tran
BERT w/ SNLI                26.3   64.4   4.9     14.8      18.6   60.3   4.7     14.9
BERT w/ MultiNLI            28.4   69.3   7.0     18.5      20.6   58.9   5.6     17.5
BERT w/ SNLI+MultiNLI       25.3   62.4   4.8     14.8      18.1   59.6   4.5     14.8
BERT w/ SNLI+MultiNLI†      22.1   67.1   4.1     13.7      19.3   59.7   4.5     15.2
LSTM w/ SNLI+MultiNLI       25.8   69.5   9.9     21.0      16.8   53.6   5.3     16.0
Table 2: Inconsistencies (%) of models on our evaluation datasets (100k pairs for symmetry, 100k triples for transitivity). Each number represents the average of three random runs. Models are trained using 5% and 100% of the training sets. †: finetuned twice. ρ_sym and σ_sym: global and conditional symmetry violations. ρ_tran and σ_tran: global and conditional transitivity violations.

4 Experiments

In this section, we evaluate our framework using (near) state-of-the-art approaches for NLI, primarily based on BERT, and also compare to an LSTM model. We use the SNLI and MultiNLI wang2018glue datasets to define annotation consistency. Our LSTM model is based on the decomposable attention model with a BiLSTM encoder and GloVe embeddings pennington2014glove. Our BERT model is based on the pretrained BERT, finetuned on SNLI/MultiNLI. The constrained models are initialized with the finetuned BERT and finetuned again with the inconsistency losses; this is critical when label supervision is limited. For a fair comparison, we also show results of BERT baselines finetuned twice.

Our constrained models are trained on both labeled and unlabeled examples. We expect that the different inconsistencies do not conflict with each other; hence, we select hyperparameters (e.g., the $\lambda$'s) using development accuracy only (i.e., annotation consistency). We refer the reader to the appendix for the details of our experimental setup.

4.1 Datasets

To be comprehensive, we use both SNLI and MultiNLI to train our models, but we also show results for models trained on each dataset individually.

We study the impact of the amount of label supervision by randomly sampling different percentages of labeled examples. For each case, we also sample the same percentage from the corresponding development sets for model selection. For the MultiNLI dataset, we use the matched dev set for validation and the mismatched dev set for evaluation.

Mirrored Instances (M)

Given a labeled example, we construct its mirrored version by swapping the premise and the hypothesis. This results in the same number of unlabeled sentence pairs as the annotated dataset. When sampling by percentage, we will only use the sampled examples to construct mirrored examples. We use this dataset for symmetry consistency.

Unlabeled Instance Triples (T)

For the transitivity constraint, we sample sentence triples from the MS COCO lin2014microsoft captions. From each triple $(P_1, P_2, P_3)$, we construct three examples as in §3.1: the pairs $(P_1, P_2)$, $(P_2, P_3)$, and $(P_1, P_3)$. These form our unlabeled triples for the transitivity constraint.

Unlabeled Instance Pairs (U)

For each sentence triple in the dataset T, we take the first pair and construct its mirrored example, i.e., $(P_2, P_1)$ from $(P_1, P_2)$. This yields one unlabeled instance pair per triple for training with the symmetry loss.
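The following sketch shows how the three unlabeled datasets might be assembled from labeled pairs and caption triples; the data structures are purely illustrative.

# Illustrative construction of the three unlabeled datasets described above.
def mirrored_dataset(labeled_pairs):
    """M: swap the premise and hypothesis of each labeled example."""
    return [(h, p) for p, h, _label in labeled_pairs]

def triple_dataset(caption_triples):
    """T: each sentence triple yields the three pairs used by the transitivity constraint (6)."""
    return [((p1, p2), (p2, p3), (p1, p3)) for p1, p2, p3 in caption_triples]

def pair_dataset(caption_triples):
    """U: mirrored versions of the first pair of each triple, for the symmetry loss."""
    return [((p1, p2), (p2, p1)) for p1, p2, _p3 in caption_triples]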

Evaluation Dataset

We sample a different set of 100k example triples for measuring transitivity consistency. For symmetry consistency, we follow the above procedure for the dataset U to construct 100k evaluation instance pairs. Recall that the definition of inconsistency allows measuring model quality with unlabeled data.

4.2 Inconsistency of Neural Models

In Table 2, we report the impact of the amount of annotated data on the symmetry and transitivity consistencies by using different percentages of labeled examples. We see that both LSTM and BERT models have substantial symmetry consistency violations, while the transitivity consistency has lower violations. Surprisingly, the LSTM model performs on par with BERT in terms of symmetry/transitivity consistency; stronger representations do not necessarily mean more consistent models.

The table shows that, given an example and its mirrored version, if the BERT baseline predicts a Contradiction on one, it has roughly a 60% chance (the conditional violation σ_sym) of making an inconsistent judgement on the other. Further, we see that the inconsistencies are not affected much by the choice of training data: models trained on SNLI are about as inconsistent as ones trained on MultiNLI, and combining them gives only slight improvements. Also, finetuning twice does not improve much over finetuning once.

Finally, with more annotation, a model has fewer symmetry consistency violations. However, the same observation does not apply to the transitivity consistency. In the following sections, we will show that we can almost entirely eliminate these inconsistencies using the losses from §3.2.

Figure 1: Symmetry inconsistencies on the 100k evaluation example pairs. Each point represents the average of three random runs. M, U, and T are our unlabeled datasets with their corresponding losses.
Figure 2: Transitivity inconsistencies on the 100k evaluation example triples. Each point represents the average of three random runs. M, U, and T are our unlabeled datasets with their corresponding losses.
                       1%              5%              20%             100%
Config               SNLI  MultiNLI  SNLI  MultiNLI  SNLI  MultiNLI  SNLI  MultiNLI
SNLI+MultiNLI        79.7  70.1      84.6  77.2      87.8  80.6      90.1  83.5
SNLI+MultiNLI†       80.3  71.0      85.3  77.4      87.9  80.7      90.3  84.0
w/ M                 80.1  71.0      85.3  77.8      88.1  80.6      90.3  84.1
w/ M,U               80.2  71.0      85.4  77.2      88.1  80.9      90.5  84.3
w/ M,U,T             80.6  71.1      85.4  77.2      88.1  80.9      90.2  84.2
Table 3: Impact of symmetry/transitivity consistencies on test set accuracies. Each number represents the average of three random runs of BERT. Columns are accuracies on our SNLI/MultiNLI test sets. †: finetuned twice. M, U, and T are unlabeled datasets with their respective inconsistency losses.

4.3 Reducing Inconsistencies

We will study the effect of the symmetry and transitivity consistency losses in turn using the BERT models. To the baseline models, we incrementally add the M, U, and T datasets. We expect the constrained models to have accuracies at least on par with the baseline (though one of the key points of this paper is that accuracy by itself is not a comprehensive metric).

In Fig. 1, we present both the global and conditional violation rates of the baselines and the constrained models. We see that the mirrored examples (i.e., the w/ M curve) greatly reduce the symmetry inconsistency. Further, with the unlabeled example pairs (the w/ M,U curve), we can reduce the violation rate even more. The same observations also hold when combining the symmetry constraint with the transitivity constraint.

Fig. 2 shows the results for transitivity inconsistency. The transitivity violations are, again, greatly reduced, both for the global and the conditional metrics. We refer the reader to the appendix for exact numbers.

We see that with our augmented losses, even a model using 1% label supervision can be much more consistent than the baselines trained on 100% of the training set! This suggests that label supervision does not explicitly encode the notion of consistency, and consequently models do not acquire it from the training data alone.

With the simultaneous decline in the global and conditional violation rates, the constrained models learn to agree with the consistency requirements specified declaratively. As we will see in the next section, doing so does not sacrifice model accuracy.

4.4 Interaction of Losses

In Table 3, we show the impact of the symmetry and transitivity consistencies on test accuracy. The interaction between the symmetry and transitivity consistencies is covered in Figs. 1 and 2.

Our goal is to minimize all inconsistencies without sacrificing one for another. In Table 3, we see that lower symmetry/transitivity inconsistency generally does not reduce test accuracy, but we do not observe substantial improvement either. In conjunction with the observations from above, this suggests that test sets do not explicitly measure symmetry/transitivity consistency.

From Figs. 1 and 2, we see that models constrained by both the symmetry and transitivity losses are generally more consistent than models using the symmetry loss alone. Further, Fig. 2 shows that using the mirrored dataset alone already mitigates the transitivity errors: the transitivity inconsistency is strongly reduced by the symmetry loss, even without any transitivity supervision. These observations suggest that composing constraints does not create internal conflicts for the model; the constraints are in fact beneficial to each other.

Interestingly, in Fig. 2, the models trained only with the mirrored dataset (w/ M) become more inconsistent on the transitivity measurement when using more training data. We believe two factors cause this. First, there is a vocabulary gap between the SNLI/MultiNLI data and our unlabeled datasets (U and T). Second, the w/ M models are trained with the symmetry consistency but evaluated on the transitivity consistency. The slightly rising inconsistency implies that, without vocabulary coverage, training with one consistency might not always benefit another, even with more training data.

When label supervision is limited (i.e., 1%), the models can easily overfit via the transitivity loss. As a result, models trained on the combined losses (i.e., w/ M,U,T) have slightly larger transitivity inconsistency than models trained with mirrored data alone (i.e., w/ M). In fact, if we use no label supervision at all, the symmetry and transitivity losses can push every prediction towards the label Neutral, but such predictions sacrifice annotation consistency. Therefore, we believe that some amount of label supervision is necessary.

5 Analysis

In this section, we present an analysis of how the different losses affect model prediction and how informative they are during training.

5.1 Coverage of Unlabeled Dataset

Table 4 shows the coverage of the three unlabeled datasets during the first training epoch. Specifically, we count the percentage of unlabeled examples where the symmetry/transitivity loss is positive. The coverage decreases in subsequent epochs as the model learns to minimize constraint violations. We see that both the M and U datasets have high coverage. This is because, as mentioned in §2, our loss functions operate in real-valued space instead of on discrete decisions. The coverage of the dataset T is much lower because the compositional antecedent of the transitivity statements holds less often than the unary antecedent for symmetry.

Data M U T
5% w/ M,U,T 99.8 99.4 12.0
100% w/ M,U,T 98.7 97.6 6.8
Table 4: Coverage (%) of unlabeled training sentences during the first epoch of training. Percentages are calculated from models with random seed 1.

5.2 Distribution of Predictions

In Table 5, we present the distribution of model predictions on the 100k evaluation example pairs for symmetry consistency. Clearly, the number of constraint-violating (off-diagonal) predictions drops significantly. Also note that the number of Neutral predictions nearly doubles in our constrained model. This meets our expectation because the example pairs are constructed from randomly sampled sentences under the same topic.

We also present the distribution of predictions on the evaluation example triples for the transitivity consistency in Table 6. As expected, with our transitivity consistency, the proportion of the label Neutral becomes significantly higher as well. Further, in Table 7, we show the violation rates of each individual transitivity statement. Clearly, our framework mitigates the violation rates for all four statements.

While the logic-derived regularization pushes model predictions on the unlabeled datasets towards Neutral, the accuracies on the labeled test sets are not compromised. We believe this relates to the design of current NLI datasets, where the three labels are balanced. But in the real world, neutrality represents a potentially infinite negative space, while entailments and contradictions are rarer. The total number of neutral examples across both the SNLI and MultiNLI test sets is about 7k. Can we use these 7k examples to evaluate the nearly infinite negative space? We believe not.

                 BERT                      w/ M,U,T
                 E      C      N           E      C      N
E                4649   1491   14708       2036   29     9580
C                1508   10712  6459        33     4025   627
N                14609  6633   39231       9632   613    73425
Table 5: Distribution of predictions on the 100k evaluation example pairs using BERT trained on SNLI+MultiNLI data with random seed 1. Rows and columns are the predicted labels on a pair and its mirrored version, respectively. Off-diagonal entries are symmetrically inconsistent.
Model        Example       E      C      N
BERT         (P1, P2)      20848  18679  60473
             (P2, P3)      20919  18768  60313
             (P1, P3)      20779  18721  60500
w/ M,U,T     (P1, P2)      11645  4685   83670
             (P2, P3)      11671  4703   83626
             (P1, P3)      11585  4597   83818
Table 6: Distribution of predictions on the 100k evaluation example triples, broken down by the three pairs constructed from each triple. BERT: trained on the full SNLI+MultiNLI data. Predictions are from the random run with seed 1.
                          BERT           w/ M,U,T
Transitivity statement    ρ      σ       ρ      σ
E ∧ E → E                 0.7    16.0    0.2    15.1
E ∧ C → C                 1.8    49.6    0.2    46.5
N ∧ E → ¬C                1.2    9.0     0.2    1.8
N ∧ C → ¬E                1.0    9.3     0.1    4.8
Table 7: Individual transitivity inconsistencies (%) on the 100k evaluation example triples, for the four implications in (6). ρ: global violation; σ: conditional violation. BERT: trained on the full SNLI+MultiNLI data. Predictions are from the random run with seed 1.

6 Related Work and Discussion

Logic, Knowledge and Statistical Models

Using soft relaxations of Boolean formulas as loss functions has a rich history in AI. The Łukasiewicz t-norm drives knowledge-driven learning and inference in probabilistic soft logic kimmig2012short. li2019augmenting show how to augment existing neural network architectures with domain knowledge using the Łukasiewicz t-norm. pmlr-v80-xu18h proposed a general framework for designing a semantically informed loss, without t-norms, for constraining a complex output space. In the same vein, fischer2019dl2 also proposed a framework for designing losses with logic, but using a bespoke mapping of the Boolean operators.

Our work is also conceptually related to posterior regularization ganchev2010posterior and constrained conditional models chang2012structured, which integrate knowledge with statistical models. Using posterior regularization with imitation learning, hu2016harnessing transferred knowledge from rules into neural parameters. rocktaschel2015injecting embedded logic into distributed representations for entity relation extraction. alberti2019synthetic imposed answer consistency over generated questions for machine comprehension. Ad-hoc regularizers have been proposed for process comprehension du2019consistent, semantic role labeling mehta2018towards, and summarization hsu2018unified.

Natural Language Inference

In the literature, it has been shown that even highly accurate models show a decline in performance on perturbed examples. This lack of robustness of NLI models has been demonstrated by comparing model performance against pre-defined propositional rules on swapped datasets wang2018if, and by large-scale stress tests that measure the stability of models under semantic, lexical and random perturbations naik2018stress. Moreover, adversarial training examples produced by paraphrasing training data iyyer2018adversarial or by inserting additional, seemingly important yet unrelated, information into training instances jia2017adversarial have been used to show model inconsistency. Finally, adversarially labeled examples have been shown to improve prediction accuracy kang2018adventure. Also related in this vein is the idea of dataset inoculation liu2019inoculation, where models are finetuned by exposing them to a challenging dataset.

The closest related work to this paper is probably that of minervini2018adversarially, which uses the Gödel t-norm to discover adversarial examples that violate constraints. There are three major differences from our work: 1) our definition of inconsistency is a strict generalization of model prediction error, giving us a unified framework that includes cross-entropy as a special case; 2) our framework does not rely on the construction of adversarial datasets; and 3) we study the interaction of annotated and unlabeled examples via constraints, showing that our approach can yield strongly consistent models even with a small amount of label supervision.

7 Conclusion

In this paper, we proposed a general framework to measure and mitigate model inconsistencies. Our framework systematically derives loss functions from domain knowledge stated in logic rules to constrain model training. As a case study, we instantiated the framework on a state-of-the-art model for the NLI task, showing that models can be highly accurate and consistent at the same time. Our framework is easily extensible to other domains with rich output structure, e.g., entity relation extraction and multi-label classification.

Acknowledgements

We thank members of the NLP group at the University of Utah for their valuable insights and suggestions; and reviewers for pointers to related works, corrections, and helpful comments. We also acknowledge the support of NSF SaTC-1801446, and gifts from Google and NVIDIA.

References

Appendix A Appendices

A.1 Violations as Generalizing Errors

Both global and conditional violations defined in the body of the paper generalize classifier error. In this section, we will show that for a dataset with only labeled examples, and no additional constraints, both are identical to error.

Recall that an example $x$ annotated with label $y^*$ can be written as the constraint $\top \rightarrow Y^*(x)$. If we have a dataset $D$ of such examples and no additional constraints, in our unified representation of examples, we can write this as the following conjunction:

$\bigwedge_{(x, y^*) \in D} \big( \top \rightarrow Y^*(x) \big)$

We can now evaluate the two definitions of violation for this dataset.

First, note that the denominator in the definition of the conditional violation counts all the examples, because the antecedent is always true. This makes $\rho$ and $\sigma$ equal. Moreover, the numerator is the number of examples where the predicted label does not match the annotation $y^*$. In other words, both $\rho$ and $\sigma$ equal the fraction of examples in $D$ that are mislabeled.

The strength of the unified representation and the definition of violation comes from the fact that they apply to arbitrary constraints.

A.2 Loss for Transitivity Consistency

This section shows the loss associated with the transitivity consistency in the NLI case study. For an individual triple $(P_1, P_2, P_3)$, applying the product t-norm to the definition of the transitivity consistency constraint (6) and transforming to the negative log space, we get the loss

$L_{tran}(P_1, P_2, P_3) = \max\big(0, \log e_{12} + \log e_{23} - \log e_{13}\big)$
$+ \max\big(0, \log e_{12} + \log c_{23} - \log c_{13}\big)$
$+ \max\big(0, \log n_{12} + \log e_{23} - \log(1 - c_{13})\big)$
$+ \max\big(0, \log n_{12} + \log c_{23} - \log(1 - e_{13})\big)$   (14)

where $e_{ij}$, $c_{ij}$, and $n_{ij}$ denote the model probabilities of Entailment, Contradiction and Neutral for the pair $(P_i, P_j)$.

That is, the total transitivity loss is the sum of this expression over the entire dataset.
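A PyTorch sketch of the per-triple loss (14), assuming the model's label probabilities for the three pairs are available as vectors (the function name and label indices are illustrative):

import torch

def transitivity_loss(p12, p23, p13, E=0, C=1, N=2, eps=1e-12):
    """Transitivity loss (14) for one triple; p12, p23, p13 are (3,) probability vectors."""
    def log_(x):
        return torch.log(x.clamp_min(eps))
    def implies(log_ante, log_cons):
        # -log of the product t-norm residuum min(1, cons/ante)
        return torch.relu(log_ante - log_cons)
    return (
        implies(log_(p12[E]) + log_(p23[E]), log_(p13[E]))        # E and E -> E
        + implies(log_(p12[E]) + log_(p23[C]), log_(p13[C]))      # E and C -> C
        + implies(log_(p12[N]) + log_(p23[E]), log_(1 - p13[C]))  # N and E -> not C
        + implies(log_(p12[N]) + log_(p23[C]), log_(1 - p13[E]))  # N and C -> not E
    )

Each term is zero whenever the corresponding implication is satisfied under the relaxation, so only violated statements contribute gradients.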

A.3 Details of Experiments

                      1%                                         5%
Train               SNLI  MultiNLI  ρ_sym  σ_sym  ρ_tran σ_tran  SNLI  MultiNLI  ρ_sym  σ_sym  ρ_tran σ_tran
SNLI                79.3  na        36.7   70.6   6.1    17.1    84.5  na        26.3   64.4   4.9    14.8
MultiNLI            na    69.0      29.1   83.1   8.2    18.4    na    76.1      28.4   69.3   7.0    18.5
SNLI+MultiNLI       79.7  70.1      38.6   71.7   4.3    13.4    84.6  77.2      25.3   62.4   4.8    14.8
SNLI+MultiNLI†      80.3  71.0      32.4   75.0   3.9    12.8    85.3  77.4      22.1   67.1   4.1    13.7
w/ M                80.1  71.0      7.5    39.2   2.1    9.1     85.3  76.8      7.1    34.8   2.8    10.5
w/ M,U              80.2  71.0      6.1    38.2   2.5    9.8     85.4  77.2      4.6    32.5   2.0    8.3
w/ M,U,T            80.6  71.1      7.8    34.0   2.6    10.4    85.4  77.2      3.2    31.0   1.8    7.9

                      20%                                        100%
Train               SNLI  MultiNLI  ρ_sym  σ_sym  ρ_tran σ_tran  SNLI  MultiNLI  ρ_sym  σ_sym  ρ_tran σ_tran
SNLI                87.5  na        21.2   63.0   4.1    13.6    90.1  na        18.6   60.3   4.7    14.9
MultiNLI            na    80.4      25.8   58.1   5.1    16.5    na    83.7      20.6   58.9   5.6    17.5
SNLI+MultiNLI       87.8  80.6      18.6   64.3   4.4    14.4    90.1  83.5      18.1   59.6   4.5    14.8
SNLI+MultiNLI†      87.9  80.7      19.0   64.0   4.3    14.5    90.3  84.0      19.3   59.7   4.5    15.2
w/ M                88.1  80.6      7.3    34.0   3.2    11.7    90.3  84.1      6.2    28.1   3.0    11.6
w/ M,U              88.1  80.9      1.4    31.2   1.3    5.8     90.5  84.3      1.4    26.8   1.3    6.3
w/ M,U,T            88.1  80.9      1.3    29.6   1.2    5.7     90.2  84.2      1.1    25.5   0.6    4.2
Table 8: Symmetry/transitivity inconsistencies (%) for models using 1%, 5%, 20%, and 100% of the training data. Each number represents the average of three random runs. †: BERT finetuned twice for fair comparison. SNLI/MultiNLI columns: accuracies on the corresponding test sets. ρ_sym, σ_sym: global and conditional symmetry violations; ρ_tran, σ_tran: global and conditional transitivity violations. M: mirrored labeled examples. U: unlabeled instance pairs. T: unlabeled instance triples.

A.3.1 Setup

For the BERT baselines, we finetune for a fixed number of epochs with a warmed-up learning rate. For the constrained models, we further finetune for additional epochs with a lowered learning rate, and we lower the learning rate again when the additional unlabeled data is present. The optimizer is Adam across all runs. During training, we apply dropout srivastava2014dropout both inside the BERT transformer encoder and at the final linear classification layer.

For the different types of data and the different consistency constraints, we used different weighting factors $\lambda$. In general, we found that the smaller the amount of labeled examples, the smaller the $\lambda$'s for the symmetry and transitivity consistencies. In Table 9, we see that the $\lambda$'s for U and T grow exponentially with the size of the annotated data. In contrast, the $\lambda$ for the M dataset can be much higher; we found that a single, much larger value works well for M. This is because the sizes of the datasets U and T are fixed, while the size of the dataset M is the same as the amount of labeled examples.

Having larger $\lambda$'s leads to significantly worse accuracy on the development sets, especially that of SNLI; therefore we did not select such models for evaluation. We hypothesize that this is because SNLI and MultiNLI are crowdsourced from different domains, while MS COCO shares the same domain as SNLI. A larger scaling factor could push unlabeled examples towards Neutral, thus sacrificing the annotation consistency on SNLI examples.

Data    1%    5%    20%    100%
M
U
T
Table 9: Choice of the $\lambda$'s for the different consistencies and their corresponding unlabeled datasets. For different sizes of annotation and different types of data, we adopt different $\lambda$'s.

A.3.2 Results

We present the full experimental results on the natural language inference task in Table 8. Note that the accuracies of the baselines finetuned twice are slightly better than those of models finetuned only once, while their symmetry/transitivity consistencies are roughly on par. We found this observation to be consistent across different finetuning hyperparameters (e.g., warmup, epochs, learning rate).
