Explainable Natural Language Reasoning via Conceptual Unification
This paper presents an abductive framework for multi-hop and interpretable textual inference. The reasoning process is guided by the notions of unification power and plausibility of an explanation, computed through the interaction of two major architectural components: (a) An analogical reasoning model that ranks explanatory facts by leveraging unification patterns in a corpus of explanations; (b) An abductive reasoning model that performs a search for the best explanation, which is realised via conceptual abstraction and subsequent unification.
We demonstrate that the Step-wise Conceptual Unification can be effective for unsupervised question answering, and as an explanation extractor in combination with state-of-the-art Transformers. An empirical evaluation on the Worldtree corpus and the ARC Challenge resulted in the following conclusions: (1) The question answering model outperforms competitive neural and multi-hop baselines without requiring any explicit training on answer prediction; (2) When used as an explanation extractor, the proposed model significantly improves the performance of Transformers, leading to state-of-the-art results on the Worldtree corpus; (3) Analogical and abductive reasoning are highly complementary for achieving sound explanatory inference, a feature that demonstrates the impact of the unification patterns on performance and interpretability.
A central research line in the field aims at developing explainable inference models capable of performing accurate predictions and, at the same time, generating explanations for the underlying reasoning process Miller (2019); Biran and Cotton (2017). The construction of explanations for science questions is typically framed as a multi-hop reasoning problem, where multiple pieces of evidence need to be combined to arrive at the final answer. Recent approaches adopt global and local semantic constraints to guide the generation of plausible multi-hop explanations Khashabi et al. (2018); Jansen et al. (2017); Khashabi et al. (2016). However, the use of explicit constraints for reasoning with natural language often results in semantic drift – i.e. the tendency to compose spurious inference chains that lead to wrong conclusions Khashabi et al. (2019). To deal with semantic drift, recent work has proposed the crowd-sourcing of explanation-centred corpora Xie et al. (2020); Jansen et al. (2018, 2016), which can enable the identification of common explanatory patterns. Although these resources have been applied to explanation regeneration Valentino et al. (2020); Jansen and Ustalov (2019), it is not yet clear how they can support the downstream answer prediction task. In this paper, we move a step forward in this direction, exploring how explanatory patterns can be leveraged for multi-hop reasoning.
Research in Philosophy of Science suggests that explanations act through unification Friedman (1974); Kitcher (1989). The function of an explanation is to unify a set of disconnected phenomena showing that they are the expression of a common regularity – e.g. Newton’s law of universal gravitation unifies the motion of planets and falling bodies showing that they obey the same law. The higher the number of distinct phenomena explained by a given statement, the higher its unification power. Therefore, explanations with high unification power tend to create unification patterns – i.e. the same statement is reused to explain a large variety of similar phenomena. We hypothesise that the unification patterns emerging in a corpus of explanations can ultimately guide the abductive reasoning process. Consider the example in Figure 1. The explanation performs unification by connecting a concrete phenomenon – i.e. a ball falling on the floor, to a general regularity that applies to a broader set of phenomena and that explains a large number of questions – i.e. gravity affects all the objects that have mass. As a result, multi-hop inference for science questions can be modelled as an abstraction from the original context in search of an underlying explanatory law, which in turn manifests its unification power by being frequently reused in explanations for similar questions. 
In this paper, we build upon the concept of explanatory unification and provide the following contributions: (1) We present the Step-wise Conceptual Unification, an abductive framework that combines explicit semantic constraints with the notion of unification power for multi-hop inference, computed via analogical reasoning on a corpus of explanations; (2) We empirically show the efficacy of the framework for unsupervised question answering and explanation extraction; (3) We study the impact of the unification patterns on abductive reasoning, demonstrating their role in improving the accuracy of prediction and the soundness of explanations.
1 Step-wise Conceptual Unification
A multiple-choice science question is a tuple $Q = (q, C)$ characterised by a question $q$ and a set of candidate answers $C = \{c_1, c_2, \ldots, c_n\}$. A set of hypotheses $H = \{h_1, h_2, \ldots, h_n\}$ can be derived by concatenating $q$ with each $c_i \in C$ – i.e. $h_i = q \oplus c_i$, where $\oplus$ denotes concatenation. Given $H$, we frame multi-hop inference for multiple-choice question answering as the problem of selecting the hypothesis that is supported by the best explanation. The Step-wise Conceptual Unification constructs and scores explanations by composing multiple sentences from a knowledge base, which we refer to as Facts KB ($F_{kb}$). This resource includes the knowledge necessary to answer and explain science questions, ranging from common sense and taxonomic relations (e.g. a ball is a kind of object) to scientific statements and laws (e.g. gravity; gravitational force causes objects that have mass; substances to be pulled down; to fall on a planet). The abductive process is guided by the unification patterns emerging in a second knowledge base, named Explanations KB ($E_{kb}$), which contains a set of true hypotheses with their respective explanations. The framework is based on the following research hypotheses: RH1: Explanations can be constructed through two major inference steps, namely abstraction and unification: (1) Retrieving a set of abstractive facts whose role is to expand the context of the hypothesis in search of an underlying regularity; (2) Selecting a unification fact, which represents an explanatory scientific statement; RH2: The best explanation can be determined by considering two properties of the unification: (a) The plausibility of the unification, which is a measure of the semantic connection between the unification statement and the original hypothesis; (b) The unification power, which depends on how often the unification explains similar hypotheses.
1.1 The Structure of Explanations
In general, we consider the facts in the Facts KB ($F_{kb}$) and the hypotheses in $H$ as natural language statements composed of a set of distinct concepts (e.g. “gravity”, “ball”, “living thing”); we write $CP(s)$ for the set of concepts in a statement $s$. To formalise our research hypotheses, we divide the facts in $F_{kb}$ into two categories. The sentences expressing taxonomic relations between concepts (i.e. “x is a kind of y”), synonyms (i.e. “x means y”) and antonyms (i.e. “x is the opposite of y”) are classified as abstractive, while all the other facts (e.g. properties, causes, processes, scientific laws) are considered for unification. We say that two arbitrary facts $f_i$ and $f_j$ are conceptually connected if the intersection between $CP(f_i)$ and $CP(f_j)$ is not empty, $CP(f_i) \cap CP(f_j) \neq \emptyset$. On the other hand, we say that two facts $f_i$ and $f_j$ are indirectly connected if $CP(f_i) \cap CP(f_j) = \emptyset$ and there exists a fact $f_z$ such that $CP(f_i) \cap CP(f_z) \neq \emptyset$ and $CP(f_z) \cap CP(f_j) \neq \emptyset$. We consider as part of the explanation only pairs of facts that are at least indirectly connected. Moreover, following our first research hypothesis (RH1), we consider compositions formed by an arbitrary number of abstractive facts and one unification fact. Therefore, a generic explanation $E_i$ for $h_i$ can be reformulated as a tuple $E_i = (F_{abs}, F_{uni})$ defined by the following elements:
$F_{abs}$ is a set of abstractive facts such that $|F_{abs}| \geq 0$;
$F_{uni}$ is a singleton including one unification fact – i.e. $F_{uni} = \{f_u\}$.
Additional constraints are determined by the conceptual connections between these sets. Specifically, each abstractive fact $f_a \in F_{abs}$ must be conceptually connected with both the hypothesis $h_i$ and the unification fact $f_u$. On the other hand, $f_u$ must be conceptually connected with each abstractive fact if $F_{abs} \neq \emptyset$, and with the hypothesis if $F_{abs} = \emptyset$. In other words, to ensure that the unification fact is semantically plausible, we force it to be linked with $h_i$ in one or two hops through abstractive facts (e.g. Fig. 1).
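The connection constraints above can be illustrated with a minimal Python sketch. It assumes that each statement has already been reduced to a set of concept strings; the function names and the toy concept sets are hypothetical and are not taken from the paper's implementation.

```python
from typing import FrozenSet, Iterable, List

Concepts = FrozenSet[str]  # a statement reduced to its set of concepts


def connected(a: Concepts, b: Concepts) -> bool:
    """Conceptual connection: the two statements share at least one concept."""
    return bool(a & b)


def indirectly_connected(a: Concepts, b: Concepts, kb: Iterable[Concepts]) -> bool:
    """No shared concept, but some third fact in kb overlaps with both."""
    if connected(a, b):
        return False
    return any(connected(a, z) and connected(z, b) for z in kb)


def valid_explanation(h: Concepts, abstractive: List[Concepts], unification: Concepts) -> bool:
    """Check the structural constraints on (abstractive facts, unification fact)."""
    if not abstractive:
        # With no abstractive facts, the unification must touch the hypothesis directly.
        return connected(unification, h)
    # Otherwise every abstractive fact must bridge hypothesis and unification.
    return all(connected(f, h) and connected(f, unification) for f in abstractive)


# Toy concept sets (illustrative only).
hypothesis = frozenset({"ball", "fall", "floor"})
abstractive = frozenset({"ball", "object"})          # "a ball is a kind of object"
unification = frozenset({"gravity", "object", "mass"})

print(indirectly_connected(hypothesis, unification, [abstractive]))  # two-hop link
print(valid_explanation(hypothesis, [abstractive], unification))
```

In this toy case the unification fact shares no concept with the hypothesis, so it is only accepted because the abstractive fact bridges the two.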
1.2 Inference to the Best Explanation
To determine which hypothesis in $H$ is supported by the best explanation, we define a framework consisting of four major algorithmic steps (Fig. 2). For each hypothesis $h_i \in H$, the first step is aimed at retrieving a set of candidate explanatory facts. The architectural component responsible for this task performs analogical reasoning by leveraging unification patterns for similar hypotheses in the Explanations KB ($E_{kb}$). The output of this component is represented by two distinct subsets of the Facts KB ($F_{kb}$): (a) A set of candidate abstractive facts $\hat{F}_{abs}$; (b) A set of candidate unification facts $\hat{F}_{uni}$. Each $f_u \in \hat{F}_{uni}$ is associated with an analogical score $as(f_u, h_i)$ computed with respect to the hypothesis $h_i$ and reflecting the unification power of $f_u$. The second step uses the output of the analogical component to perform abductive reasoning. Specifically, the elements of $\hat{F}_{abs}$ and $\hat{F}_{uni}$ are combined to build a set of plausible explanations for $h_i$. For each explanation $E$ in this set, the abductive component computes an explanatory score $es(E)$ by taking into account the analogical score $as(f_u, h_i)$ and the plausibility of the unification $f_u$, which is derived from the conceptual connections with $h_i$. The top $K$ unifications ranked according to their explanatory scores are adopted to determine the final score for the hypothesis $h_i$. Finally, the answer selection component (Step 3) collects the scores computed for each $h_i \in H$ and selects the candidate answer associated with the best hypothesis. For explainability, the predicted answer can be enriched with the unification performed by the system (Step 4).
1.3 Analogical Reasoning
The analogical reasoning component adopts the Unification-based Reconstruction model Valentino et al. (2020). For each fact $f_i$ in the Facts KB ($F_{kb}$), the model computes an analogical score (i.e. $as(f_i, h)$) that is derived by combining its lexical relevance, i.e. the Relevance Score $rs(f_i, h)$, and its unification power, defined as the Unification Score $us(f_i, h)$ (Fig. 2).
The unification score is described by the following formula:

$$us(f_i, h) = \sum_{(h_z, E_z) \in kNN(h)} sim(h, h_z) \cdot in(f_i, E_z) \tag{2}$$

$kNN(h)$ is the set of k-nearest neighbours of $h$ in the Explanations KB ($E_{kb}$), which includes hypothesis and explanation pairs ($h_z$, $E_z$) retrieved according to a similarity measure $sim(h, h_z)$, while $in(f_i, E_z)$ is a function that returns 1 if $f_i$ is used to explain $h_z$, 0 otherwise. Therefore, the more a fact explains similar hypotheses in $E_{kb}$, the higher its unification score. In our experiments, both $rs$ and $sim$ are implemented using BM25 vectors and cosine similarity.
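As a rough illustration of the unification score, the sketch below ranks the hypotheses of a toy Explanations KB by similarity to a new hypothesis and credits a fact each time it appears in the explanation of one of the k nearest neighbours. Two assumptions are made here and are not taken from the paper: each credit is weighted by the neighbour's similarity, and plain bag-of-words cosine similarity stands in for BM25 vectors. The toy KB and all function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Set, Tuple


def cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity (a simple stand-in for BM25 vectors)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0


def unification_score(fact: str, hypothesis: str,
                      explanations_kb: List[Tuple[str, Set[str]]], k: int = 2) -> float:
    """Sum the similarity of each of the k nearest hypotheses explained by `fact`."""
    ranked = sorted(explanations_kb, key=lambda pair: cosine(hypothesis, pair[0]), reverse=True)
    # ASSUMPTION: similarity-weighted count over the k nearest neighbours.
    return sum((cosine(hypothesis, h_z) for h_z, e_z in ranked[:k] if fact in e_z), 0.0)


# Toy Explanations KB: (true hypothesis, set of facts in its explanation).
kb = [
    ("a rock falls to the ground because of gravity",
     {"gravity pulls objects down", "a rock is a kind of object"}),
    ("an apple falls from a tree because of gravity",
     {"gravity pulls objects down", "an apple is a kind of object"}),
    ("ice melts when heated", {"heat causes melting"}),
]
new_hypothesis = "a ball falls to the floor because of gravity"

us_gravity = unification_score("gravity pulls objects down", new_hypothesis, kb)
us_heat = unification_score("heat causes melting", new_hypothesis, kb)
print(us_gravity, us_heat)
```

The gravity fact is reused by both nearest neighbours and therefore receives a positive score, whereas the heat fact explains no similar hypothesis and scores zero.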
|Information Retrieval (IR)|
|BM25 IR solver Clark et al. (2018)||Yes||41.22||44.94||32.99|
|BM25 Unification-based IR solver||Yes||43.86||49.94||30.41|
|Multi-hop Inference|
|BM25 IR + PathNet Kundu et al. (2019)||No||41.50||43.32||36.42|
|BM25 Unification-based IR + PathNet Kundu et al. (2019)||No||43.64||47.38||34.50|
|Transformers|
|BERT-base Devlin et al. (2019); Valentino et al. (2020)||No||41.78||48.54||26.28|
|RoBERTa-large Liu et al. (2019)||No||50.20||57.04||35.05|
|Transformers with Explanation|
|BM25 IR + BERT-base Valentino et al. (2020)||No||49.39||53.20||40.97|
|BM25 Unification-based IR + BERT-base Valentino et al. (2020)||No||51.62||55.46||41.97|
|BM25 IR + RoBERTa-large||No||56.86||60.88||47.94|
|BM25 Unification-based IR + RoBERTa-large||No||58.54||65.42||43.30|
|Step-wise Conceptual Unification|
|SWCU (K = 1)||Yes||52.36||56.93||42.27|
|SWCU (K = 2)||Yes||55.65||61.23||43.30|
|SWCU (K = 3)||Yes||53.49||59.25||40.72|
|SWCU (K = 2) + BERT-base||No||52.29||56.00||44.07|
|SWCU (K = 2) + RoBERTa-large||No||63.59||69.38||50.77|
|Model||Explanation||Unsupervised||Pre-trained||External||Accuracy|
|TupleInf Khot et al. (2017)||Yes||Yes||No||Yes||23.83|
|TableILP Khashabi et al. (2016)||Yes||Yes||No||Yes||26.97|
|DGEM Clark et al. (2018)||Yes||No||Yes||Yes||27.11|
|KG Zhang et al. (2018)||Yes||No||No||Yes||31.70|
|Bi-LSTM max-out Mihaylov et al. (2018)||No||No||Yes||Yes||33.87|
|Unsupervised AHE Yadav et al. (2019a)||Yes||Yes||Yes||No||33.87|
|Supervised AHE Yadav et al. (2019a)||Yes||No||Yes||No||34.47|
|BERT-large Yadav et al. (2019b)||No||No||Yes||No||35.11|
|ET-RR Ni et al. (2019)||Yes||No||Yes||Yes||36.60|
|BERT-large + AutoROCC Yadav et al. (2019b)||Yes||No||Yes||No||41.24|
|Reading Strategies Sun et al. (2019)||No||No||Yes||Yes||42.32|
|SWCU (K = 1)||Yes||Yes||No||Yes||34.64|
|SWCU (K = 2)||Yes||Yes||No||Yes||35.32|
|SWCU (K = 3)||Yes||Yes||No||Yes||36.01|
1.4 Abductive Reasoning
The abductive reasoning model constructs and scores a set of explanations using abstraction and unification steps. For each concept $c$ in the hypothesis $h_i$, the abstraction step computes an expansion set $exp(c)$ considering each candidate abstractive fact $f_a \in \hat{F}_{abs}$:

$$exp(c) = \{c\} \cup abs(c), \qquad abs(c) = \bigcup_{f_a \in \hat{F}_{abs} \,:\, c \in CP(f_a)} CP(f_a)$$

The set $abs(c)$ represents the union of all the concepts that occur in abstractive facts mentioning $c$, where $CP(f_a)$ denotes the set of concepts in $f_a$. For example, considering the hypothesis in Figure 1, the expansion set for “ball” will include the concept “object” extracted from the fact “a ball is a kind of object”. Therefore, $exp(c)$ will include $c$ plus its hypernyms, hyponyms, synonyms and opposite concepts contained in $\hat{F}_{abs}$. In the unification step, the abductive component analyses each candidate unification fact $f_u$ in $\hat{F}_{uni}$ and checks whether there exists at least one concept $c \in CP(h_i)$ such that $exp(c) \cap CP(f_u) \neq \emptyset$. If this condition is respected (e.g. Fig. 1), the component adds a new explanation $E$ to the set of plausible explanations, composed of the unification $f_u$ and all the abstractive facts that are connected with $h_i$ and $f_u$. Conversely, if the condition is not respected, the unification fact is discarded. Once the set is created, the abductive component assigns an explanatory score $es(E)$ to each explanation by considering the unification fact $f_u$.
This score combines $as(f_u, h_i)$, the analogical score computed for $f_u$, with $ps(f_u, h_i)$, the plausibility score defined as follows:

$$ps(f_u, h_i) = \frac{|\{c \in CP(h_i) : exp(c) \cap CP(f_u) \neq \emptyset\}|}{|CP(h_i)|}$$

The plausibility score represents the percentage of concepts in the hypothesis that have at least an indirect link with the unification fact $f_u$. Therefore, the higher the degree of conceptual coverage between the unification and the original hypothesis, the higher the plausibility score. In line with our research hypotheses (RH2), the full explanatory score of a unification fact jointly depends on its semantic plausibility and unification power. Finally, the abductive model computes the hypothesis score $hs(h_i)$ by considering the top $K$ unifications for $h_i$ ranked by their explanatory scores.
The final answer is selected by considering the hypothesis in $H$ with the highest score:

$$\hat{h} = \arg\max_{h_i \in H} hs(h_i) \tag{8}$$
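The abstraction, unification and answer selection steps can be sketched in a few lines of Python. The expansion set and the plausibility score follow the prose description above. Two choices are assumptions made only for illustration: the explanatory score is taken as the product of the analogical and plausibility scores, and the hypothesis score as the sum of the top K explanatory scores. The concept sets and analogical scores are toy values.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Fact = FrozenSet[str]  # a fact or hypothesis reduced to its concepts


def expansion(concept: str, abstractive: List[Fact]) -> Set[str]:
    """The concept plus every concept co-occurring with it in an abstractive fact."""
    expanded = {concept}
    for fact in abstractive:
        if concept in fact:
            expanded |= fact
    return expanded


def plausibility(h: Fact, unification: Fact, abstractive: List[Fact]) -> float:
    """Fraction of hypothesis concepts linked (directly or via abstraction) to the unification."""
    if not h:
        return 0.0
    linked = sum(1 for c in h if expansion(c, abstractive) & unification)
    return linked / len(h)


def hypothesis_score(h: Fact, candidates: List[Tuple[Fact, float]],
                     abstractive: List[Fact], top_k: int = 2) -> float:
    scores = []
    for unification, analogical in candidates:
        ps = plausibility(h, unification, abstractive)
        if ps > 0:  # unifications with no link to the hypothesis are discarded
            scores.append(analogical * ps)  # ASSUMPTION: product of the two scores
    return sum(sorted(scores, reverse=True)[:top_k])  # ASSUMPTION: sum over the top K


abstractive = [frozenset({"ball", "object"})]  # "a ball is a kind of object"
candidates = [  # (unification fact concepts, toy analogical score)
    (frozenset({"gravity", "object", "mass"}), 0.8),
    (frozenset({"friction", "surface", "motion"}), 0.5),
]
hypotheses: Dict[str, Fact] = {
    "gravity": frozenset({"ball", "fall", "gravity"}),
    "friction": frozenset({"ball", "fall", "friction"}),
}
scores = {answer: hypothesis_score(h, candidates, abstractive) for answer, h in hypotheses.items()}
best = max(scores, key=scores.get)  # answer selection: highest hypothesis score
print(best, scores)
```

Here the "gravity" hypothesis wins because two of its three concepts link to the highest-ranked unification fact, one of them only through the abstraction "ball" to "object".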
2 Empirical Evaluation
|BM25 IR + Plausibility Score (PS)||No||40.58||43.19||34.79||65.28|
|BM25 IR + Abstraction (ABS) + PS||No||43.46||46.57||36.60||67.68|
|BM25 IR + ABS + PS + Relevance Score (RS)||No||50.36||55.30||39.43||72.65|
|BM25 Unification-based IR + ABS + PS + RS + Unification Score (US)||Yes||55.65||61.23||43.30||73.69|
We evaluate the Step-wise Conceptual Unification (SWCU) on multiple-choice question answering. First, we test the efficacy of the framework for unsupervised question answering. Here, we adopt the algorithmic steps described in the previous section, using equation 8 for answer prediction. In addition, we evaluate the model for explanation extraction. In this case, the top $K$ unification facts (Equation 7) are used as supporting evidence for a Transformer model, which is then fine-tuned on answer prediction. We perform the experiments combining SWCU with BERT-base Devlin et al. (2019) and RoBERTa-large Liu et al. (2019). The SWCU model is implemented via BM25 vectors and cosine similarity, which are used for computing the similarity measure $sim$ in equation 2 and the relevance score $rs$ in equation 1. The knowledge bases ($F_{kb}$ and $E_{kb}$) are populated using the Worldtree corpus Jansen et al. (2018), which provides gold explanations for multiple-choice science questions. Here, an explanation is a composition of facts stored in a set of semi-structured tables, each of them representing a specific knowledge type. We extract the row sentences from the tables and use them to build the Facts KB ($F_{kb}$). The sentences in the taxonomic (kind-of), synonymy and antonymy tables are used as abstractive facts, while the remaining sentences are adopted for unification. The questions in the corpus are split into train-set (1,190 questions), dev-set (264 questions) and test-set (1,247 questions). Questions and explanations in the train-set are used to populate the Explanations KB ($E_{kb}$), while the dev-set and the test-set are adopted for evaluation. The concepts in facts and hypotheses are extracted using WordNet Miller (1995). Specifically, given a sentence, we define a concept as a maximal sequence of words that corresponds to a valid synset. This process allows us to capture multi-word expressions (e.g. “living thing”) that typically occur in science questions.
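The concept extraction step can be approximated with a greedy longest-match scan. The sketch below is only illustrative: a small hand-written vocabulary stands in for the WordNet synset lookup, and the function name, the whitespace tokenisation and the `max_len` limit are assumptions rather than details from the paper.

```python
from typing import List, Set


def extract_concepts(sentence: str, vocabulary: Set[str], max_len: int = 3) -> List[str]:
    """Greedy left-to-right scan that keeps the longest word sequence found in the vocabulary."""
    tokens = sentence.lower().split()
    concepts, i = [], 0
    while i < len(tokens):
        # Try the longest window first so that "living thing" beats "living".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in vocabulary:
                concepts.append(candidate)
                i += n
                break
        else:
            i += 1  # no concept starts at this token
    return concepts


# Hypothetical vocabulary standing in for valid WordNet synsets.
vocabulary = {"living thing", "living", "thing", "plant", "kind"}
print(extract_concepts("a plant is a kind of living thing", vocabulary))
# → ['plant', 'kind', 'living thing']
```

Trying the longest window first is what lets the multi-word expression "living thing" be kept as a single concept instead of being split into "living" and "thing".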
2.1 Answer Prediction
In this section, we present the results achieved on the Worldtree corpus (test-set). We report the accuracy for SWCU with different numbers of unification facts ($K$ in equation 7), while the accuracy for SWCU in combination with Transformers is achieved considering the best model ($K = 2$). Overall, we observe that the SWCU model is competitive with SWCU + BERT-base, while SWCU + RoBERTa-large achieves state-of-the-art results, outperforming all the proposed models and baselines. We compare the framework against four categories of approaches: Information Retrieval, Multi-hop Inference, Transformers, and Transformers with Explanation. The results are reported in Table 1.
Information Retrieval (IR).
For the IR category, we employ two baselines similar to the one described in Clark et al. (2018). Given a hypothesis $h_i$, the BM25 IR solver adopts BM25 vectors and cosine similarity to retrieve the sentence in the Facts KB ($F_{kb}$) that is most relevant to $h_i$. The relevance score is then used to determine the final answer. The BM25 Unification-based IR solver adopts the same strategy, complementing the relevance score with the unification score (equation 1). Similarly to the SWCU model, these approaches employ scalable IR techniques and do not require training for answer prediction. However, the results show that the SWCU model significantly outperforms these baselines on both easy and challenge questions.
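A minimal sketch of an IR solver of this kind is given below. It assumes the standard BM25 term weighting with common default parameters (k1 = 1.5, b = 0.75), IDF-weighted term counts for the query side, and whitespace tokenisation; none of these details are specified in the text, and the toy facts and hypotheses are invented for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def bm25_index(docs: List[str], k1: float = 1.5, b: float = 0.75) -> Tuple[List[Dict[str, float]], Dict[str, float]]:
    """Turn each document into a sparse vector of BM25 term weights."""
    tokenised = [d.lower().split() for d in docs]
    n = len(tokenised)
    avgdl = sum(len(t) for t in tokenised) / n
    df = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {t: math.log(1 + (n - c + 0.5) / (c + 0.5)) for t, c in df.items()}
    vectors = []
    for tokens in tokenised:
        length_norm = k1 * (1 - b + b * len(tokens) / avgdl)
        vectors.append({t: idf[t] * f * (k1 + 1) / (f + length_norm)
                        for t, f in Counter(tokens).items()})
    return vectors, idf


def query_vector(text: str, idf: Dict[str, float]) -> Dict[str, float]:
    """ASSUMPTION: queries are weighted by IDF times raw term count."""
    return {t: idf[t] * f for t, f in Counter(text.lower().split()).items() if t in idf}


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def ir_solver(hypotheses: Dict[str, str], facts: List[str]) -> str:
    """Pick the answer whose hypothesis has the most relevant single fact."""
    vectors, idf = bm25_index(facts)

    def relevance(hypothesis: str) -> float:
        q = query_vector(hypothesis, idf)
        return max(cosine(q, v) for v in vectors)

    return max(hypotheses, key=lambda answer: relevance(hypotheses[answer]))


facts = ["friction acts to counter motion", "gravity pulls objects down", "magnets attract iron"]
hypotheses = {"A": "heat melts ice", "B": "friction acts to counter the motion of the sled"}
prediction = ir_solver(hypotheses, facts)
print(prediction)
```

Hypothesis A shares no term with any fact and therefore has zero relevance, so the solver selects B.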
|Question||Prediction||Abstractive facts||Unification fact||Accurate unification|
|What force is needed to help stop a child from slipping on ice? (A) gravity, (B) friction, (C) electric, (D) magnetic||(B) friction||(1) counter means reduce; stop; resist; (2) ice is a kind of object; (3) slipping is a kind of motion; (4) stop means not move.||friction acts to counter the motion of two objects when their surfaces are touching.||Y|
|What causes a change in the speed of a moving object? (A) force, (B) temperature, (C) change in mass (D) change in location||(A) force||–||a force continually acting on an object in the same direction that the object is moving can cause that object’s speed to increase in a forward motion||N|
|Weather patterns sometimes result in drought. Which activity would be most negatively affected during a drought year? (A) boating, (B) farming, (C) hiking, (D) hunting||(B) farming||(1) affected means changed; (2) a drought is a kind of slow environmental change;||farming changes the environment||N|
|Beryl finds a rock and wants to know what kind it is. Which piece of information about the rock will best help her to identify it? (A) The size of the rock, (B) The weight of the rock, (C) The temperature where the rock was found, (D) The minerals the rock contains||(A) The size of the rock||(1) a property is a kind of information; (2) size is a kind of property; (3) knowing the properties of something means knowing information about that something.||the properties of something can be used to identify; used to describe that something.||Y|
|Jeannie put her soccer ball on the ground on the side of a hill. What force acted on the soccer ball to make it roll down the hill? (A) gravity, (B) electricity, (C) friction, (D) magnetism||(C) friction||(1) the ground means Earth’s surface; (2) rolling is a kind of motion; (3) a roll is a kind of movement.||friction acts to counter the motion of two objects when their surfaces are touching.||N|
|Model||Answer Acc.||Ex. Precision||Ex. Recall||Ex. F1 score||UNF Acc.|
|BM25 IR + ABS + PS||44.69||35.72||17.25||26.49||36.28|
|BM25 IR + ABS + PS + RS||60.62||52.75||23.77||38.26||55.75|
|BM25 Unification IR + ABS + PS + RS + US||62.83||54.01||24.21||39.11||61.50|
|Prediction||% Accurate Unification||% Spurious Unification||Ex. Precision||Ex. Recall||Ex. F1 score|
Multi-hop Inference.
We consider PathNet Kundu et al. (2019) as a multi-hop and explainable reasoning baseline. This model constructs paths connecting question and candidate answer, and subsequently scores them through a neural architecture. We reproduce PathNet on the Worldtree corpus using the source code available at the following URL: https://github.com/allenai/PathNet. The best results are obtained considering the top 15 facts selected by the IR models. The differences between PathNet and SWCU are twofold: (1) PathNet assumes that an explanation always has the shape of a single, linear path; (2) PathNet does not leverage unification patterns to guide the construction of multi-hop explanations. Our experiments show that these characteristics play a significant role in the final accuracy of the systems.
Transformers.
We compare our framework against BERT-base Devlin et al. (2019) and RoBERTa-large Liu et al. (2019) fine-tuned on the multiple-choice question answering task. We observe that the SWCU model outperforms both baselines with significantly fewer parameters and without direct supervision. At the same time, the improvement achieved using SWCU as an evidence extractor demonstrates the impact of the constructed explanations on Transformers.
Transformers with explanation.
Finally, we compare our approach against Transformers enhanced with IR baselines (i.e. BM25 IR and BM25 Unification-based IR) Valentino et al. (2020). The best results for these models are obtained considering the top 3 sentences retrieved by the IR models. We observe that the use of SWCU as an explanation extractor improves these baselines on both easy and challenge questions, confirming that the Step-wise Conceptual Unification provides more discriminating evidence for answer prediction.
2.2 ARC Challenge
To evaluate the generalisation of the SWCU model on a larger set of questions requiring multi-hop reasoning, we run additional experiments on the ARC Challenge Clark et al. (2018). Regarding the knowledge bases, we keep the set of unification facts and explanations from the Worldtree corpus Xie et al. (2020) and substitute the set of abstractive facts with hypernyms, hyponyms, and antonyms from WordNet Miller (1995). This process allows us to reuse the core unification facts representing general scientific knowledge (e.g. gravity, friction) and, at the same time, to perform abstraction from novel concepts in the questions. Table 2 reports the results on the test-set (1,172 challenge questions). We compare the SWCU model against a set of state-of-the-art baselines, classifying them according to four dimensions: (1) Explanation: the model produces an explanation for its prediction; (2) Unsupervised: the system does not require training on answer prediction; (3) Pre-trained: the model adopts pre-trained neural components such as Language Models or Word Embeddings; (4) External: the system uses external knowledge bases or is pre-trained on additional datasets (e.g. RACE Lai et al. (2017), SciTail Khot et al. (2018)). The results show that the SWCU model outperforms the existing unsupervised systems based on Integer Linear Programming (ILP) Khot et al. (2017); Khashabi et al. (2016) and pre-trained embeddings Yadav et al. (2019a). At the same time, our model obtains competitive results with most of the supervised approaches, including BERT-large Devlin et al. (2019). The SWCU model is still outperformed by reading strategies that adopt pre-training on external question answering datasets Sun et al. (2019), which, however, do not produce explanations for their predicted answers.
2.3 Ablation Study
We carried out an ablation study to investigate the contribution of the main architectural components. To perform the study, we gradually combine individual features to recreate the best SWCU model ($K = 2$). Table 3 reports the obtained results. The basic model – BM25 IR + Plausibility Score – constructs and scores explanations without the abstraction step and analogical reasoning, considering only unification facts that are connected in one hop to the original hypothesis. The first observation is that the abstraction step has a positive impact on the abductive inference, improving the accuracy of the basic model by 2.88%. In the same way, a consistent improvement is achieved when the plausibility score (PS) is combined with the BM25 relevance score (RS) (+6.9%). In line with our research hypotheses, the use of analogical reasoning to compute the unification score (US) via the Explanations KB is crucial to achieve the final accuracy, leading to a substantial improvement on both easy (+5.83%) and challenge questions (+3.87%).
2.4 Explanatory Inference
In this section, we investigate the relation between explanation and answer prediction. To this end, we correlate the accuracy achieved by different combinations of the SWCU model ($K = 2$) with a set of quantitative metrics for explanation evaluation – i.e. Precision, Recall, F1 score, and unification accuracy. Since the gold explanations in the Worldtree test-set are masked, we perform this analysis on the dev-set, comparing the best explanations generated for the predicted answers against the gold explanations in the corpus. The results reported in Table 5 (top) highlight a positive correlation between accuracy in answer prediction and quality of the explanations. In particular, the performance increases according to the unification accuracy – i.e. the percentage of unifications for the predicted hypotheses that are part of the gold explanations. Therefore, these results confirm that the improvement on answer prediction is a consequence of better explanatory inference. The second part of the analysis focuses on investigating the extent to which accurate unification is also necessary for answer prediction (Tab. 5, bottom). In line with the expectations, the table shows that the majority of correct answers are derived from accurate unification (78.72%), while the majority of wrong predictions are the result of erroneous or spurious unification (67.06%). However, a minor percentage of correct and wrong answers are inferred from spurious and correct unification respectively, suggesting that alternative ways of constructing explanations are exploited by the model, and that, at the same time, accurate unification can on some occasions lead to wrong conclusions. Table 4 shows a set of qualitative examples that help clarify these results. The first example shows the case in which both selected answer and unification are correct. The second row shows an example of correct answer prediction and spurious unification. In this case, however, the selected unification fact represents a plausible alternative way of constructing explanations, which is marked as spurious due to the difference with the corpus annotation. The third example represents the situation in which, despite wrong unification, the system is able to infer the correct answer. On the other hand, the subsequent example shows the case in which the unification is accurate, but the information it contains is not sufficient to discriminate the correct answer from the alternative choices. Finally, the last row describes the case in which spurious unification leads to wrong answer prediction.
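For reference, the explanation metrics used in this analysis can be computed along the following lines. This is a sketch under the assumption that explanations are compared as sets of fact identifiers with the standard definitions of precision, recall and F1, and that unification accuracy is the share of predicted unification facts found in the corresponding gold explanation; how the paper aggregates these values over questions is not stated here. The identifiers are invented.

```python
from typing import List, Set, Tuple


def explanation_metrics(predicted: Set[str], gold: Set[str]) -> Tuple[float, float, float]:
    """Set-based precision, recall and F1 of a predicted explanation against the gold one."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def unification_accuracy(unifications: List[str], gold_explanations: List[Set[str]]) -> float:
    """Share of questions whose selected unification fact appears in the gold explanation."""
    if not unifications:
        return 0.0
    hits = sum(u in gold for u, gold in zip(unifications, gold_explanations))
    return hits / len(unifications)


# Hypothetical fact identifiers for one question.
precision, recall, f1 = explanation_metrics({"f1", "f2"}, {"f1", "f3", "f4"})
print(round(precision, 2), round(recall, 2), round(f1, 2))  # → 0.5 0.33 0.4

accuracy = unification_accuracy(["f1", "f9"], [{"f1", "f3"}, {"f2", "f5"}])
print(accuracy)  # → 0.5
```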
|Choices Concepts Overlap (AVG)||With US||Without US|
|From 0% to 20%||56.54||51.12|
|From 20% to 40%||53.95||46.71|
|From 40% to 60%||54.22||54.22|
|From 60% to 80%||50.00||38.89|
|From 80% to 100%||31.25||31.25|
|Distinct Question Concepts||With US||Without US|
|From 1 to 5||61.46||54.17|
|From 5 to 10||56.79||51.48|
|More than 10||48.11||44.65|
2.5 Error Analysis
In this section, we present an analysis to explore the robustness and limitations of the proposed approach. In this experiment (Table 6), we compute the accuracy of the SWCU model ($K = 2$) with and without the Unification Score (US) on questions with varying degrees of conceptual overlap between the alternative choices. The results show a drop in performance as the number of shared concepts between the candidate answers grows. Since the explanatory score partly depends on the conceptual connections between hypotheses and unifications, the system struggles to discriminate choices that share a large proportion of concepts. A similar behaviour is observed when the accuracy is correlated with the number of distinct concepts in the questions. Long questions, in fact, tend to include distracting concepts that affect the abstraction step, increasing the probability of building spurious explanations. Nevertheless, the results highlight the positive impact of the Unification Score (US) on the robustness of the model, showing that the unification patterns contribute to a better accuracy for questions that are difficult to answer with plausibility and relevance scores alone.
3 Related Work
Explanations for Science Questions.
Explanatory inference for science questions typically requires multi-hop reasoning – i.e. the ability to aggregate multiple facts from heterogeneous knowledge sources to arrive at the correct answer. This process is extremely challenging when dealing with natural language, with both empirical Fried et al. (2015) and theoretical work Khashabi et al. (2019) suggesting an intrinsic limitation in the composition of inference chains longer than 2 hops. This phenomenon, known as semantic drift, often results in the construction of spurious inference chains leading to wrong conclusions. Recent approaches have framed explanatory inference as the problem of building an optimal graph, whose generation is conditioned on a set of local and global semantic constraints Khashabi et al. (2018); Khot et al. (2017); Jansen et al. (2017); Khashabi et al. (2016). A parallel line of research tries to tackle the problem through the construction of explanation-centred corpora, which can facilitate the identification of common explanatory patterns Valentino et al. (2020); Jansen et al. (2018); Jansen (2017); Jansen et al. (2016). Our approach attempts to leverage the best of both worlds by imposing, on one hand, a set of structural and functional constraints that limit the inference process to two macro steps (abductive reasoning), and on the other hand, by identifying common unification patterns in explanations for similar questions (analogical reasoning). The explanatory patterns generated by the unification process, largely discussed in philosophy of science Friedman (1974); Kitcher (1981, 1989), have influenced the development of expert systems based on case-based reasoning Thagard and Litt (2008); Kolodner (2014). Similarly to our approach, case-based reasoning adopts analogy as a core component to retrieve explanations for known cases, and adapt them in the solution of unseen problems.
Explanations for Natural Language Reasoning.
Recent work has highlighted issues related to the interpretability of deep learning models Miller (2019); Biran and Cotton (2017), which, among other things, affect the design of proper benchmarks for assessing natural language reasoning capabilities Schlegel et al. (2020). To deal with the lack of interpretability, an emerging line of research explores the design of datasets including gold explanations that support the construction and evaluation of explainable models in different domains, ranging from open domain question answering Yang et al. (2018); Thayaparan et al. (2019), to textual entailment Camburu et al. (2018) and reasoning with mathematical text Ferreira and Freitas (2020a, b). Other approaches explore the construction of explanations through the use of distributional and similarity-based models applied to external commonsense knowledge bases Silva et al. (2019, 2018); Freitas et al. (2014). In line with this work, we demonstrate that the use of unification patterns for multi-hop explanations can enhance both accuracy and explainability of neural models on a challenging question answering task Rajani et al. (2019); Yadav et al. (2019b).
This paper presented the Step-wise Conceptual Unification, a multi-hop reasoning framework that leverages unification patterns through analogical and abductive reasoning. We empirically demonstrated the efficacy of the model for unsupervised question answering and explanation extraction, highlighting the impact of unification power on sound explanatory inference.
- Explanation and justification in machine learning: a survey. In IJCAI-17 workshop on explainable AI (XAI), Vol. 8. Cited by: §3, Explainable Natural Language Reasoning via Conceptual Unification.
- E-snli: natural language inference with natural language explanations. In Advances in Neural Information Processing Systems, pp. 9539–9549. Cited by: §3.
- Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457. Cited by: Table 1, Table 2, §2.1, §2.2, Explainable Natural Language Reasoning via Conceptual Unification.
- BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Cited by: Table 1, §2.1, §2.2, §2.
- Natural language premise selection: finding supporting statements for mathematical text. In Proceedings of The 12th Language Resources and Evaluation Conference, pp. 2175–2182. Cited by: §3.
- Premise selection in natural language mathematical texts. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7365–7374. Cited by: §3.
- A distributional semantics approach for selective reasoning on commonsense graph knowledge bases. In International Conference on Applications of Natural Language to Data Bases/Information Systems, pp. 21–32. Cited by: §3.
- Higher-order lexical semantic models for non-factoid answer reranking. Transactions of the Association for Computational Linguistics 3, pp. 197–210. Cited by: §3.
- Explanation and scientific understanding. The Journal of Philosophy 71 (1), pp. 5–19. Cited by: §3, Explainable Natural Language Reasoning via Conceptual Unification.
- A study of automatically acquiring explanatory inference patterns from corpora of explanations: lessons from elementary science exams. In 6th Workshop on Automated Knowledge Base Construction (AKBC 2017). Cited by: §3.
- What's in an explanation? characterizing knowledge and inference requirements for elementary science exams. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 2956–2965. Cited by: §3, Explainable Natural Language Reasoning via Conceptual Unification.
- Framing qa as building and ranking intersentence answer justifications. Computational Linguistics 43 (2), pp. 407–449. Cited by: §3, Explainable Natural Language Reasoning via Conceptual Unification.
- TextGraphs 2019 shared task on multi-hop inference for explanation regeneration. In Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13), pp. 63–77. Cited by: Explainable Natural Language Reasoning via Conceptual Unification.
- WorldTree: a corpus of explanation graphs for elementary science questions supporting multi-hop inference. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). Cited by: §2, §3, Explainable Natural Language Reasoning via Conceptual Unification.
- On the capabilities and limitations of reasoning for natural language understanding. arXiv preprint arXiv:1901.02522. Cited by: §3, Explainable Natural Language Reasoning via Conceptual Unification.
- Question answering via integer programming over semi-structured knowledge. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pp. 1145–1152. Cited by: Table 2, §2.2, §3, Explainable Natural Language Reasoning via Conceptual Unification.
- Question answering as global reasoning over semantic abstractions. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §3, Explainable Natural Language Reasoning via Conceptual Unification.
- QASC: a dataset for question answering via sentence composition. arXiv preprint arXiv:1910.11473. Cited by: Explainable Natural Language Reasoning via Conceptual Unification.
- Answering complex questions using open information extraction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 311–316. Cited by: Table 2, §2.2, §3.
- SciTaiL: a textual entailment dataset from science question answering. Cited by: §2.2.
- Explanatory unification. Philosophy of science 48 (4), pp. 507–531. Cited by: §3.
- Explanatory unification and the causal structure of the world. Cited by: §3, Explainable Natural Language Reasoning via Conceptual Unification.
- Case-based reasoning. Morgan Kaufmann. Cited by: §3.
- Exploiting explicit paths for multi-hop reading comprehension. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2737–2747. Cited by: Table 1, §2.1.
- RACE: large-scale reading comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 785–794. Cited by: §2.2.
- Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: Table 1, §2.1, §2.
- Can a suit of armor conduct electricity? a new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2381–2391. Cited by: Table 2, Explainable Natural Language Reasoning via Conceptual Unification.
- WordNet: a lexical database for english. Communications of the ACM 38 (11), pp. 39–41. Cited by: §2.2, §2.
- Explanation in artificial intelligence: insights from the social sciences. Artificial Intelligence 267, pp. 1–38. Cited by: §3, Explainable Natural Language Reasoning via Conceptual Unification.
- Learning to attend on essential terms: an enhanced retriever-reader model for open-domain question answering. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 335–344. Cited by: Table 2.
- Explain yourself! leveraging language models for commonsense reasoning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4932–4942. Cited by: §3.
- A framework for evaluation of machine reading comprehension gold standards. In Proceedings of The 12th Language Resources and Evaluation Conference, pp. 5359–5369. Cited by: §3.
- Recognizing and justifying text entailment through distributional navigation on definition graphs. In AAAI, pp. 4913–4920. Cited by: §3.
- Exploring knowledge graphs in an interpretable composite approach for text entailment. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 7023–7030. Cited by: §3.
- Improving machine reading comprehension with general reading strategies. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 2633–2643. Cited by: Table 2, §2.2.
- Models of scientific explanation. The Cambridge handbook of computational psychology, pp. 549–564. Cited by: §3.
- Identifying supporting facts for multi-hop question answering with document graph networks. In Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13), pp. 42–51. Cited by: §3.
- Unification-based reconstruction of explanations for science questions. arXiv preprint arXiv:2004.00061. Cited by: §1.3, Table 1, §2.1, §3, Explainable Natural Language Reasoning via Conceptual Unification.
- WorldTree v2: a corpus of science-domain structured explanations and inference patterns supporting multi-hop inference. In Proceedings of The 12th Language Resources and Evaluation Conference, pp. 5456–5473. Cited by: §2.2, Explainable Natural Language Reasoning via Conceptual Unification.
- Alignment over heterogeneous embeddings for question answering. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 2681–2691. Cited by: Table 2, §2.2.
- Quick and (not so) dirty: unsupervised selection of justification sentences for multi-hop question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2578–2589. Cited by: Table 2, §3.
- HotpotQA: a dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2369–2380. Cited by: §3.
- KG^2: learning to reason science exam questions with contextual knowledge graph embeddings. arXiv preprint arXiv:1805.12393. Cited by: Table 2.