Learning Invariants through Soft Unification
Abstract
Human reasoning involves recognising common underlying principles across many examples by utilising variables. The byproducts of such reasoning are invariants that capture patterns across examples, such as “if someone went somewhere then they are there”, without mentioning specific people or places. Humans learn what variables are and how to use them at a young age, and the question this paper addresses is whether machines can also learn and use variables solely from examples without requiring human pre-engineering. We propose Unification Networks, which incorporate soft unification into neural networks to learn variables and, by doing so, lift examples into invariants that can then be used to solve a given task. We evaluate our approach on four datasets to demonstrate that learning invariants captures patterns in the data and can improve performance over baselines.
1 Introduction
Humans have the ability to process symbolic knowledge and maintain symbolic thought (Unger and Deacon, 1998). When reasoning, humans do not require a combinatorial enumeration of examples but instead utilise invariant patterns with placeholders replacing specific entities. Symbolic cognitive models (Lewis, 1999) embrace this perspective, with the human mind seen as an information processing system operating on formal symbols, such as reading a stream of tokens in natural language. The language of thought hypothesis (Morton and Fodor, 1978) frames human thought as a structural construct with varying subcomponents such as “X went to Y”. By recognising what varies across examples, humans are capable of lifting examples into invariant principles that account for other instances. This symbolic thought with variables is learned at a young age through symbolic play (Piaget, 2001). For instance, a child learns that a sword can be substituted with a stick (Frost et al., 2004) and engages in pretend play.
Figure 1: An invariant learned from task 16 of the bAbI dataset, with variables written as name:default-symbol pairs:

X:bernhard is a Y:frog
Z:lily is a Y:frog
Z:lily is A:green
what colour is X:bernhard
A:green
Although variables are inherent in models of computation and symbolic formalisms, as in first-order logic (Russell and Norvig, 2018), they are pre-engineered and used to solve specific tasks by means of unification or assignments that bind variables to given values. However, when learning from data only, being able to recognise when and which symbols should take on different values, i.e. which symbols can act as variables, is crucial for lifting examples into general principles that are invariant across multiple instances. Figure 1 shows the invariant learned by our approach: if someone is the same thing as someone else then they have the same colour. With this invariant, our approach can solve all of the training and test examples in task 16 of the bAbI dataset (Weston et al., 2015).
In this paper we address the question of whether a machine can learn and use the notion of a variable, i.e. a symbol that can take on different values. For instance, given an example of the form “bernhard is a frog”, the machine would learn that the token “bernhard” could be someone else and the token “frog” could be something else. If we consider unification to be the selection of the most appropriate value for a variable given a choice of values, we can reframe it as a form of attention. Attention models (Bahdanau et al., 2014; Luong et al., 2015; Chaudhari et al., 2019) allow neural networks to focus on, i.e. attend to, certain parts of the input, often for the purpose of selecting a relevant portion. Since attention mechanisms are also differentiable, they are often jointly learned within a task. This perspective motivates our idea of a unification mechanism that utilises attention and is therefore fully differentiable, which we refer to as soft unification.
Hence, we propose an end-to-end differentiable neural network approach for learning and utilising the notion of a variable, which in turn can lift examples into invariants used by the network to perform reasoning tasks. Specifically, we (i) propose a novel architecture capable of learning and using variables by lifting a given example through soft unification, (ii) present the empirical results of our approach on four datasets and (iii) analyse the learned invariants that capture the underlying patterns present in the tasks. Our implementation using Chainer (Tokui et al., 2015) is publicly available at https://github.com/nuric/softuni with the accompanying data.
2 Soft Unification
Reasoning with variables involves identifying what variables are, the setting in which they are used as well as the process by which they are assigned values. When the varying components, i.e. variables, of an example are identified, the remaining structure can be lifted into an invariant which then accounts for multiple other instances.
Definition 1 (Variable).
Given a set of symbols S, a variable X is defined as a pair (d, x) where d ∈ S is the default symbol of the variable and x is a discrete random variable whose support is S. The representation of a variable is the expected value of the corresponding random variable x given the default symbol d:

φ(X) = E[φ(x) | d] = Σ_{s ∈ S} P(x = s | d) φ(s)    (1)

where φ(s) ∈ R^k is a k-dimensional real-valued feature of a symbol s.
For example, φ could be an embedding, in which case φ(X) becomes a weighted sum of symbol embeddings as in conventional attention models. The default symbol of a variable is intended to capture the variable’s bound meaning, following the idea of referents by Frege (1948). We denote variables using X, Y, A, etc., such as X:bernhard where X is the name of the variable and bernhard the default symbol, as shown in Figure 1.
Definition 2 (Invariant).
Given a structure G (e.g. a list or grid) over S, an invariant is a pair (G, ψ) where G is the invariant example, such as a tokenised story, and ψ : S → [0, 1] is a function representing the degree to which a symbol is considered a variable. Thus, the final representation of a symbol s included in G is:

φ′(s) = (1 − ψ(s)) φ(s) + ψ(s) φ(X_s)    (2)

the linear interpolation between its representation and its variable bound value, where X_s is the variable with s itself as the default symbol.
We adhere to the term invariant and refrain from speaking of rules, unground rules, etc. as used in logic-based formalisms, e.g. Muggleton and de Raedt (1994), since neither does the invariant structure need to be rule-like nor do the variables carry logical semantics. This distinction is clarified in Section 6.
Definition 3 (Unification).
Given an invariant (G, ψ) and an example K, unification binds the variables in G to symbols in K. Defined as a function U, unification binds variables by computing the probability mass functions in Equation 1 and returns the unified representation using Equation 2. The probability mass function of a variable X = (d, x) is:

P(x = s′ | d) = softmax(g(d)ᵀ g(s′))    (3)

where g(s′) is the unifying feature of a symbol s′ and the softmax is applied element-wise over the symbols s′ in K. If U is differentiable, it is referred to as soft unification.
We distinguish g from φ to emphasise that the unifying properties of the symbols might be different from their representations. For example, φ(s) could represent a specific person whereas g(s) the notion of someone.
Overall, soft unification incorporates three learnable components, φ, ψ and g, which denote the base features, variableness and unifying features of a symbol respectively. Given an upstream, potentially task-specific, network f, an invariant I = (G, ψ) and an input example K with a corresponding desired output y, the following holds:

f(U(I, K)) = ŷ ≈ y    (4)

where f now predicts ŷ based on the unified representation produced by U. In this work, we focus on U and the invariants it produces, together with their interaction with f.
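As a concrete sketch, the three components of soft unification (Equations 1 to 3) fit in a few lines of numpy. The function and argument names here are illustrative, not the ones used in our implementation:

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def soft_unify(inv_phi, inv_g, inv_psi, ex_phi, ex_g):
    """One soft unification step (sketch of Equations 1-3).

    inv_phi: (n, d) representations phi of the invariant's symbols
    inv_g:   (n, d) unifying features g of the invariant's symbols
    inv_psi: (n,)   variableness psi of the invariant's symbols, in [0, 1]
    ex_phi:  (m, d) representations of the example's symbols
    ex_g:    (m, d) unifying features of the example's symbols
    """
    # Eq. 3: attention of each invariant symbol over the example's symbols
    p = softmax(inv_g @ ex_g.T, axis=-1)            # (n, m)
    # Eq. 1: the variable's bound value is the expected representation
    bound = p @ ex_phi                              # (n, d)
    # Eq. 2: interpolate between default and bound representation
    psi = inv_psi[:, None]
    return (1 - psi) * inv_phi + psi * bound
```

With ψ = 0 for every symbol the invariant representation is returned unchanged; with ψ = 1 a symbol is fully replaced by its attention-weighted binding.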
3 Unification Networks
Since soft unification is end-to-end differentiable, it can be incorporated into existing task-specific upstream architectures. We present three architectures that model f using multi-layer perceptrons (MLP), convolutional neural networks (CNN) and memory networks (Weston et al., 2014) to demonstrate the flexibility of our approach. In all cases, the representations φ of symbols are randomly initialised learnable embeddings over the one-hot encodings of the symbols. The variableness of a symbol s is a learnable weight w_s with ψ(s) = σ(w_s), where σ is the sigmoid function. We consider every symbol independently as a variable irrespective of its surrounding context and leave further contextualised formulations as future work. The underlying intuition of this configuration is that a symbol useful for a correct prediction might need to take on other values for different inputs. This usefulness can be viewed as the inbound gradient to the corresponding parameter w_s, with ψ(s) acting as a gate. For further model details including the size of the embeddings, please refer to Appendix A.
Unification MLP (UMLP) (f: MLP, g: RNN) We combine soft unification with a multi-layer perceptron to process fixed-length inputs. In this case, the structure is a sequence of symbols of fixed length, e.g. a sequence of digits 4234. Given an embedded input, the upstream MLP f computes the output symbol from the flattened symbol representations of its last layer. However, to compute the unifying features g of Definition 3, U uses a bidirectional GRU (Cho et al., 2014) running over the sequence and takes a learnable projection of the GRU's hidden state at each symbol as its unifying feature. This model emphasises the flexibility around the boundary of f and U, and that the unifying features can be computed in any differentiable manner.
Unification CNN (UCNN) (f: CNN, g: CNN) Given a grid of embedded symbols, we use a convolutional neural network f such that the final prediction is computed by a learnable output layer from the result of global max pooling over the final convolutional features. We also model g using a separate convolutional network with the same architecture as f and take the output of its convolutional layers as the unifying features. The grid is padded with 0s after each convolution such that every symbol has a unifying feature. This model conveys how soft unification can be adapted to the specifics of the domain, for example by using convolutions for a spatially structured input.
Unification Memory Networks (UMN) (f: MemNN, g: RNN) Soft unification does not need to happen prior to f in a pipelined fashion but can also be incorporated at any intermediate stage, multiple times. To demonstrate this ability, we unify the symbols at different memory locations at each iteration of a memory network (Weston et al., 2014). Memory networks can handle a list-of-lists structure such as a tokenised story, as shown in Figure 2. The memory network uses the final hidden state of a bidirectional GRU (outer squares in Figure 2) as the sentence representation to compute a context attention. At each iteration, we unify the words of the attended sentences using the same approach as in UMLP, with another bidirectional GRU (inner diamonds in Figure 2) computing the unifying features. Following Equation 2, the new unified representation of the memory slot is computed and used to perform the next iteration. Concretely, U produces a unification tensor over the sentences and words of the invariant and the sentences of the example such that, after the context attentions are applied over the invariant and the example, we obtain the unified sentence at that iteration. Note that unlike in the UMLP case, the sentences can be of varying length. The prediction is then computed from the hidden state of the invariant after the final iteration. This setup, however, requires pre-training f such that the context attentions match the correct sentences.
A task might contain different questions such as “Where is X?” and “Why did X go to Y?”. To let the models differentiate between questions and potentially learn different invariants, we extend them with a repository of invariants and aggregate the predictions from each invariant. One simple approach, used in UMLP and UCNN, is to sum the predictions of the invariants. Another approach is to use features from the invariants, such as memory representations in the case of UMN. For UMN, we weight the predictions using a bilinear attention based on the hidden states of the invariants at the first iteration. To initially form the repository of invariants, we use the bag-of-words representations of the questions and find the most dissimilar ones based on their cosine similarity as a heuristic to obtain varied examples.
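The repository-selection heuristic can be sketched as a greedy loop; this is an illustrative implementation under the assumption that dissimilarity is measured pairwise on normalised bag-of-words counts:

```python
import numpy as np

def pick_invariants(questions, k=3):
    """Greedily pick k mutually dissimilar questions by cosine
    similarity of bag-of-words vectors (sketch of the heuristic
    in Section 3; tokenisation by whitespace is an assumption)."""
    vocab = sorted({w for q in questions for w in q.split()})
    index = {w: i for i, w in enumerate(vocab)}
    bows = np.zeros((len(questions), len(vocab)))
    for i, q in enumerate(questions):
        for w in q.split():
            bows[i, index[w]] += 1
    bows /= np.linalg.norm(bows, axis=1, keepdims=True)
    chosen = [0]                            # start from the first question
    while len(chosen) < min(k, len(questions)):
        sims = bows @ bows[chosen].T        # cosine similarity to chosen set
        cand = sims.max(axis=1)             # closest chosen neighbour
        cand[chosen] = np.inf               # exclude already chosen
        chosen.append(int(cand.argmin()))   # add the most dissimilar one
    return [questions[i] for i in chosen]
```

For example, given the questions "where is mary ?", "where is john ?" and "why did mary go ?", the heuristic skips the near-duplicate "where is john ?" and picks the two structurally different questions.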
4 Datasets
We use four datasets, each consisting of a context, a query and an answer: fixed-length sequences of symbols, shapes of symbols in a grid, story-based natural language reasoning with the bAbI dataset (Weston et al., 2015) and logical reasoning represented as logic programs; examples are shown in Table 1 with further samples in Appendix B. In each case we use an appropriate model: UMLP for fixed-length sequences, UCNN for grids and UMN for iterative reasoning. We use synthetic datasets whose data-generating distributions are known, to evaluate not only the quantitative performance but also the quality of the invariants learned by our approach.
Table 1: Example context, query and answer triples from each dataset.

| Dataset  | Context           | Query          | Answer  | Training Size |
|----------|-------------------|----------------|---------|---------------|
| Sequence | 8384              | duplicate      | 8       | 1k, 50        |
| Grid     | (symbol grid)     | corner         | 7       | 1k, 50        |
| bAbI     | (tokenised story) | Where is Mary? | kitchen | 1k, 50        |
| Logic    | (logic program)   | p(a).          | True    | 1k, 10k, 50   |
Fixed Length Sequences We generate fixed-length sequences from 8 unique symbols represented as digits and predict (i) a constant, (ii) the head of the sequence, (iii) the tail and (iv) the duplicate symbol. We randomly generate 1000 triples and then keep only the unique ones to ensure the test split contains unseen examples. Training is then performed with 5-fold cross-validation.
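A minimal generator for the duplicate task might look as follows; the sequence length, symbol range and seed are illustrative assumptions:

```python
import random

def gen_duplicate_task(n, length=4, symbols=8, seed=0):
    """Generate n unique (sequence, answer) pairs for the 'duplicate'
    task (sketch of the setup in Section 4): exactly one symbol
    appears twice, the rest appear once."""
    rng = random.Random(seed)
    seen, data = set(), []
    while len(data) < n:
        dup = rng.randint(1, symbols)
        rest = rng.sample([s for s in range(1, symbols + 1) if s != dup],
                          length - 2)
        seq = [dup, dup] + rest
        rng.shuffle(seq)
        t = tuple(seq)
        if t not in seen:          # keep only unique triples
            seen.add(t)
            data.append((seq, dup))
    return data
```

For instance, a generated pair such as ([8, 3, 8, 4], 8) matches the "8384 → 8" example in Table 1.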
Grid To spatially organise symbols, we generate a grid with 8 unique symbols organised into a box of identical symbols, a vertical, diagonal or horizontal sequence of length 3, a cross or plus shape, and a triangle. In each task we predict (i) the identical symbol, (ii) the head of the sequence, (iii) the centre of the cross or plus and (iv) the corner of the triangle respectively. We follow the same procedure as for sequences and randomly generate 1000 triples, discarding duplicates.
bAbI The bAbI dataset has become a standard benchmark for evaluating memory-based networks. It consists of 20 synthetically generated natural language reasoning tasks (refer to Weston et al. (2015) for task details). We take the 1k English set and use 0.1 of the training set as validation. Each token is lower-cased and considered a unique symbol. Following previous works (Seo et al., 2016; Sukhbaatar et al., 2015), we also treat multiple-word answers as single unique symbols.
Logical Reasoning To demonstrate the flexibility of our approach and distinguish our notion of a variable from that used in logic-based formalisms, we generate logical reasoning tasks in the form of logic programs using the procedure by Cingillioglu and Russo (2018). The tasks involve learning over 12 classes of logic programs exhibiting varying paradigms of logical reasoning, including negation by failure (Clark, 1978). We generate 1k and 10k logic programs per task for training with 0.1 as validation and another 1k for testing. We set the arity of literals to 1 or 2, using one random lower-case character from the English alphabet for predicates and constants, e.g. p(a), and an upper-case character for logical variables, e.g. X.
5 Experiments
We probe three aspects of soft unification: the impact of unification on performance over unseen data, the effect of multiple invariants and data efficiency. To that end, we train UMLP and UCNN with and without unification, and UMN with pre-training, using 1 or 3 invariants over either the entire training set or only 50 examples. Every model is trained 3 times via backpropagation using Adam (Kingma and Ba, 2014) on an Intel Core i7-6700 CPU using the following objective function:
L = α nll(f(U(I, K)), y) + β nll(f(K), y) + λ ‖ψ‖₁    (5)

where nll is the negative log-likelihood, the weights α and β select which outputs are trained, and the final term is a sparsity regularisation over ψ with weight λ to discourage the models from utilising a spurious number of variables. For UMLP and UCNN, we set (α, β) = (1, 0) to train just the unified output, and the converse for the non-unifying versions. To pre-train the UMN, we start with (α, β) = (0, 1) for 40 epochs and then set (α, β) = (1, 1) to jointly train the unified output. For iterative tasks, the mean squared error between hidden states at each iteration and, in the strongly supervised cases, the negative log-likelihood for the context attentions are also added to the objective function. Further details such as batch size and total number of epochs are available in Appendix C.
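A simplified form of this objective, keeping only the answer negative log-likelihood and the sparsity term and using an assumed weight lam, can be sketched as:

```python
import numpy as np

def objective(pred_probs, target, psi, lam=0.1):
    """Sketch of the sparsity-regularised loss in Section 5:
    negative log-likelihood of the answer plus an L1 penalty on the
    variableness psi. The weight lam and the epsilon are assumptions."""
    nll = -np.log(pred_probs[target] + 1e-12)
    return nll + lam * np.abs(psi).sum()
```

Pushing ψ towards 0 this way is what discourages the model from turning more symbols into variables than the task requires.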
Figure 3 portrays how soft unification generalises better than the plain models to unseen examples in the test sets (the same sequence or grid never appears in both the training and test sets, as outlined in Section 4). Despite having more trainable parameters than f alone, the models with unification not only maintain higher accuracy at each iteration, solving the tasks in as few as 250 iterations with 1k training examples, but also improve accuracy when trained with only 50 examples per task. We believe soft unification architecturally biases the models towards learning structural patterns, which in return achieves better results on recognising common patterns of symbols across examples. Results with multiple invariants are identical, and the models seem to ignore the extra invariants, due to the fact that the tasks can be solved with a single invariant and the regularisation on ψ zeroing out unnecessary invariants; further results are in Appendix D. The fluctuations in accuracy around iterations 750 to 1000 in UCNN are also caused by penalising ψ, which forces the model to relearn the task with fewer variables halfway through training.
Table 2: Mean error and number of failed tasks (out of 20) on the bAbI dataset.

| Model (supervision, # invariants, training size) | Mean | # Failed |
|--------------------------------------------------|------|----------|
| UMN (weak, 1 inv, 1k)                            | 19.1 | 8        |
| UMN (weak, 3 invs, 1k)                           | 20.5 | 10       |
| UMN (strong, 1 inv, 1k)                          | 6.3  | 4        |
| UMN (strong, 3 invs, 1k)                         | 6.0  | 4        |
| UMN (strong, 3 invs, 50)                         | 27.6 | 17       |
| N2N (weak, 1k)                                   | 13.9 | 11       |
| GN2N (weak, 1k)                                  | 12.7 | 10       |
| EntNet (weak, 1k)                                | 29.6 | 15       |
| QRN (weak, 1k)                                   | 11.3 | 5        |
| MemNN (strong, 1k)                               | 6.7  | 4        |
Following Tables 2 and 3, we observe a trend of better performance with strong supervision, more data per task and using only 1 invariant. We believe strong supervision aids with selecting the correct sentences to unify, whereas in the weak setting the model attempts to unify arbitrary context sentences, often failing to follow the iterative reasoning chain. The increase in performance with more data and strong supervision is consistent with previous work, reflecting how U can be bounded by the efficacy of f modelled as a memory network. As a result, only in the strongly supervised case do we observe a minor improvement over MemNN, by 0.7 in Table 2, and no improvement in the weak case over DMN or IMA in Table 3, with the models failing 17/20 and 12/12 tasks respectively when trained using only 50 examples. The increase in error rate with 3 invariants, we speculate, stems from having more parameters and more pathways in the model, rendering training more difficult.
Table 3: Mean error and number of failed tasks (out of 12) on the logical reasoning tasks.

| Model (supervision, # invariants, arity, training size) | Mean | # Failed |
|---------------------------------------------------------|------|----------|
| UMN (weak, 1 inv, arity 1, 1k)                          | 36.4 | 9        |
| UMN (weak, 3 invs, arity 2, 1k)                         | 39.3 | 11       |
| UMN (strong, 1 inv, arity 1, 1k)                        | 14.3 | 7        |
| UMN (strong, 3 invs, arity 2, 1k)                       | 28.9 | 11       |
| UMN (weak, 1 inv, arity 1, 10k)                         | 21.5 | 7        |
| UMN (weak, 3 invs, arity 2, 10k)                        | 31.8 | 10       |
| UMN (strong, 1 inv, arity 1, 10k)                       | 2.4  | 1        |
| UMN (strong, 1 inv, arity 2, 10k)                       | 12.2 | 5        |
| UMN (strong, 3 invs, arity 2, 10k)                      | 16.0 | 9        |
| UMN (weak, 3 invs, arity 2, 50)                         | 47.1 | 12       |
| DMN (weak, arity 2, 20k)                                | 21.2 | 11       |
| IMA (weak, arity 2, 20k)                                | 9.1  | 5        |
6 Analysis
After training, we can extract the learned invariants by applying a threshold on ψ(s) indicating whether a symbol is used as a variable or not. We use one threshold value for all datasets except bAbI, for which a different value is used. The magnitude of this threshold seems to depend on the amount of regularisation in Equation 5 and the number of training steps along with the batch size, all of which control how much ψ is pushed towards 0. Sample invariants shown in Figure 4 describe the common patterns present in the tasks, with the parts that contribute towards the final answer becoming variables. Extra symbols such as is or travelled do not emerge as variables, as shown in Figure 4(a); we attribute this behaviour to the fact that changing the token travelled to went does not influence the prediction, but changing the action, the value of Z:left, to ‘picked’ does. However, depending on random initialisation, our approach can convert a random symbol into a variable and let the unifying features g compensate for the unifications it produces. For example, the invariant “X:8 5 2 2” could predict the tail of another example by unifying the head with the tail using the unifying features, Equation 3, of those symbols. Pre-training f as done in UMN seems to produce more robust and consistent invariants compared to immediately training everything jointly since, we speculate, a fixed f constrains which unifications satisfy Equation 4.
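The extraction step can be sketched as follows, with an assumed threshold of 0.5 and illustrative variable names drawn from X, Y, Z, A, B, C:

```python
def extract_invariant(tokens, psi, threshold=0.5):
    """Render a learned invariant by thresholding the variableness psi
    (sketch of Section 6): symbols whose psi exceeds the threshold are
    printed as variables with their token as the default symbol, e.g.
    'X:bernhard'. Supports up to six variables in this sketch."""
    names = iter("XYZABC")
    out = []
    for tok, p in zip(tokens, psi):
        # next(names) is only consumed when the symbol is a variable
        out.append(f"{next(names)}:{tok}" if p > threshold else tok)
    return " ".join(out)
```

Applied to the tokens of "bernhard is a frog" with high ψ on "bernhard" and "frog", this reproduces the notation of Figure 1.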
Interpretability versus Ability A desired property of interpretable models is transparency (Lipton, 2018). A novel outcome of the learned invariants in our approach is that they provide an approximation of the underlying general principle present in the data, such as the structure of multi-hop reasoning shown in Figure 4(e). However, certain aspects such as temporal reasoning are still hidden inside f. In Figure 4(b), although we observe Z:morning as a variable, the overall learned invariant captures nothing about how changing the value of Z:morning alters the behaviour of f. The model might look before or after a certain time point at which X:bill went somewhere depending on what Z:morning binds to. Without the regularising term on ψ, we initially noticed the models using what one might call extra symbols as variables and binding them to the same value, occasionally producing unifications such as “bathroom bathroom to the bathroom” and still predicting, perhaps unsurprisingly, the correct answer as bathroom. Hence, regularising by the correct amount in Equation 5 seems critical in extracting not just any invariant but one that represents the common structure.
Soft unification, Equation 3, reveals three main patterns: one-to-one, one-to-many and many-to-one bindings, as shown in Figure 5. Figures 5(a) and 5(d) capture what one might expect unification to look like, where variables unify with their corresponding counterparts. However, occasionally the model can optimise to use fewer variables and squeeze the required information into a single variable, for example by binding Y:bathroom to john and kitchen as in Figure 5(b). We believe this occurs due to the sparsity constraint on ψ encouraging the model to be as conservative as possible. In a similar fashion, the unification can bind a single variable Y:o to both other constants as in Figure 5(e). Finally, if there are more variables than needed, as in Figure 5(f), we observe a many-to-one binding with Y:w and Z:e mapping to the same constant. This behaviour raises the question of how the model differentiates between the two variables. We speculate the model uses the magnitudes of their unifying features to encode the difference despite both variables unifying with the same constant.
7 Related Work
Learning an underlying general principle in the form of an invariant is often the means for arguing generalisation in neural networks. For example, Neural Turing Machines (Graves et al., 2014) are tested on previously unseen sequences to support the view that the model might have captured the underlying pattern or algorithm. In fact, Weston et al. (2014) claim “MemNNs can discover simple linguistic patterns based on verbal forms such as (X, dropped, Y), (X, took, Y) or (X, journeyed to, Y) and can successfully generalise the meaning of their instantiations.” However, this claim is based on the output of the model, and unfortunately it is unknown whether the model has truly learned such a representation or indeed is utilising it. Our approach sheds light on this ambiguity and presents these linguistic patterns explicitly as invariants, ensuring their utility through unification rather than solely analysing the output of f on previously unseen symbols. Although we may associate these invariants with our existing understanding of the task and mistakenly anthropomorphise the machine, for example by thinking it has learned that X:mary means someone, it is important to acknowledge that these are just symbolic patterns. In these cases, our interpretations do not necessarily correspond to any understanding of the machine, relating to the Chinese room argument made by Searle (1980).
Learning invariants by lifting ground examples is related to least common generalisation (Reynolds, 1970), by which inductive inference is performed on facts (Shapiro, 1981), such as generalising went(mary,kitchen) and went(john,garden) to went(X,Y). Unlike in a predicate logic setting, our approach allows for soft alignment and therefore generalisation between varying-length sequences. Existing neuro-symbolic systems (Broda et al., 2002) focus on inducing rules that adhere to given logical semantics of what a variable and a rule are. For example, Evans and Grefenstette (2018) construct a network by rigidly following the given semantics of first-order logic. Similarly, Lifted Relational Neural Networks (Sourek et al., 2015) ground first-order logic rules into a neural network, while Neural Theorem Provers (Rocktäschel and Riedel, 2017) build neural networks using backward chaining (Russell and Norvig, 2018) on a given background knowledge base with templates. This architectural approach for combining logical variables is also observed in TensorLog (Cohen, 2016) and Logic Tensor Networks (Serafini and d’Avila Garcez, 2016), while grounding logical rules can also be used as regularisation (Hu et al., 2016). However, in all of these systems the notion of a variable is pre-engineered rather than learned, with a focus on presenting a practical approach to solving certain problems, whereas our motivation stems from a cognitive perspective.
At first it may seem that the learned invariants, Section 6, make the model more interpretable; however, this transparency is not of the model but of the data. The invariant captures patterns that potentially approximate the data-generating distribution, but we still do not know how the model uses them upstream. Thus, from the perspective of explainable artificial intelligence (XAI) (Adadi and Berrada, 2018), learning invariants or interpreting them does not constitute an explanation of the reasoning model, even though “if someone goes somewhere then they are there” might look like one. Instead, it can be perceived as causal attribution (Miller, 2019) in which someone being somewhere is attributed to them going there. This perspective also relates to gradient-based model explanation methods such as Layer-Wise Relevance Propagation (Bach et al., 2015) and Grad-CAM (Selvaraju et al., 2017; Chattopadhay et al., 2018). Consequently, a possible view of ψ, Equation 2, is as a gradient-based usefulness measure such that a symbol utilised upstream by f to determine the answer becomes a variable, similar to how a group of pixels in an image contributes more to its classification.
Finally, one can argue that our model maintains a form of counterfactual thinking (Roese, 1997) in which soft unification creates counterfactuals of the invariant example to alter the output of f towards the desired answer, Equation 4. Asking where Mary would have been had she gone to the garden instead of the kitchen is the process by which an invariant is learned through multiple examples during training. This view relates to methods of causal inference (Pearl, 2019; Holland, 1986) in which counterfactuals are vital, as demonstrated in structured models by Pearl (1999).
8 Conclusion
We presented a new approach for learning variables and lifting examples into invariants through soft unification. Evaluating on four datasets, we analysed how Unification Networks perform compared to existing similar architectures while having the benefit of lifting examples into invariants that capture the underlying patterns present in the tasks. Since our approach is end-to-end differentiable, we plan to apply this technique to multi-modal tasks in order to yield multi-modal invariants, for example in visual question answering.
Acknowledgments
We would like to thank Murray Shanahan for his helpful comments, critical feedback and insights regarding this work.
References
Peeking inside the black-box: a survey on explainable artificial intelligence (XAI). IEEE Access 6, pp. 52138–52160.
On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLOS ONE 10(7), e0130140.
Neural machine translation by jointly learning to align and translate. arXiv:1409.0473.
Neural-symbolic learning systems. Springer London.
Grad-CAM++: generalized gradient-based visual explanations for deep convolutional networks. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 839–847.
An attentive survey of attention models. arXiv:1904.02874.
On the properties of neural machine translation: encoder–decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation.
DeepLogic: towards end-to-end differentiable logical reasoning. arXiv:1805.07433.
Negation as failure. In Logic and Data Bases, pp. 293–322.
TensorLog: a differentiable deductive database. arXiv:1605.06523.
Learning explanatory rules from noisy data. Journal of Artificial Intelligence Research 61, pp. 1–64.
Sense and reference. The Philosophical Review 57(3), pp. 209–230.
The developmental benefits of playgrounds. Association for Childhood Education International.
Neural Turing machines. arXiv:1410.5401.
Hybrid computing using a neural network with dynamic external memory. Nature 538(7626), pp. 471–476.
Tracking the world state with recurrent entity networks. ICLR. arXiv:1612.03969.
Statistics and causal inference. Journal of the American Statistical Association 81(396), pp. 945–960.
Harnessing deep neural networks with logic rules. ACL. arXiv:1603.06318.
Adam: a method for stochastic optimization. ICLR. arXiv:1412.6980.
Ask me anything: dynamic memory networks for natural language processing. In ICML, pp. 1378–1387.
Cognitive modeling, symbolic. pp. 525–527.
The mythos of model interpretability. Communications of the ACM 61(10), pp. 36–43.
Gated end-to-end memory networks. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pp. 1–10.
Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.
Explanation in artificial intelligence: insights from the social sciences. Artificial Intelligence 267, pp. 1–38.
The language of thought. Vol. 75, Philosophy Documentation Center.
Inductive logic programming: theory and methods. The Journal of Logic Programming 19–20, pp. 629–679.
Probabilities of causation: three counterfactual interpretations and their identification. Synthese 121(1–2), pp. 93–149.
The seven tools of causal inference with reflections on machine learning. Communications of the ACM 62(3), pp. 54–60.
The psychology of intelligence. Routledge.
Transformational systems and algebraic structure of atomic formulas. Machine Intelligence 5, pp. 135–151.
End-to-end differentiable proving. In NIPS, pp. 3791–3803.
Counterfactual thinking. Psychological Bulletin 121(1), pp. 133–148.
Artificial intelligence: a modern approach (3rd edition). Pearson.
Minds, brains, and programs. Behavioral and Brain Sciences 3(3), pp. 417–424.
Grad-CAM: visual explanations from deep networks via gradient-based localization. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 618–626.
Query-reduction networks for question answering. ICLR. arXiv:1606.04582.
Logic tensor networks: deep learning and logical reasoning from data and knowledge. arXiv:1606.04422.
Inductive inference of theories from facts. Yale University, Department of Computer Science.
Lifted relational neural networks. arXiv:1508.05128.
End-to-end memory networks. In NIPS, pp. 2440–2448.
Chainer: a next-generation open source framework for deep learning. In Proceedings of Workshop on Machine Learning Systems (LearningSys) at NIPS 2015.
The symbolic species: the co-evolution of language and the brain. Wiley.
Towards AI-complete question answering: a set of prerequisite toy tasks. arXiv:1502.05698.
Memory networks. arXiv:1410.3916.
Dynamic memory networks for visual and textual question answering. pp. 2397–2406. arXiv:1603.01417.
Appendix A Model Details
A.1 Unification MLP & CNN
Unification MLP (UMLP) To model f as a multi-layer perceptron, we take symbol embeddings of size d and flatten sequences of length l into an input vector of size l × d. The MLP consists of 2 hidden layers with a non-linearity in between. To process the query, we concatenate the one-hot encoding of the task id to the input vector. For the unification features, we use a bidirectional GRU over the sequence; the hidden state at each symbol is passed through a linear transformation to give φ_i = W h_i + b, where h_i is the hidden state of the bi-GRU at position i. The variable assignment is then computed as an attention over these features according to equation 3.
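As an illustration, the attention-based variable assignment can be sketched in a few lines of numpy. This is a schematic, not the actual implementation: the feature sizes, the function name and the single linear scoring vector below are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def soft_variable_assignment(h, embeddings, w):
    """Score every position with a linear map over its sequence feature
    h_i, then bind the variable to the attention-weighted mixture of
    symbol embeddings (in the spirit of the attention in equation 3)."""
    alpha = softmax(h @ w)            # attention over sequence positions
    return alpha, alpha @ embeddings  # soft symbol bound to the variable

# Toy setup (illustrative): 4 positions, feature size 3, embedding size 2.
rng = np.random.default_rng(0)
h = rng.normal(size=(4, 3))       # stand-in for bi-GRU hidden states
emb = rng.normal(size=(4, 2))     # symbol embeddings at each position
alpha, binding = soft_variable_assignment(h, emb, rng.normal(size=3))
```

The attention weights sum to one, so the bound "variable" is always a convex mixture of the symbols in the sequence.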
Unification CNN (UCNN) We take symbol embeddings of size d to obtain an input grid. As with UMLP, for each symbol we append the task id as a one-hot vector. The model then consists of 2 convolutional layers with a non-linearity in between, each with a kernel size of 3 and stride 1. We pad the grid with 2 extra rows and 2 extra columns so that the output of the convolutions has the same spatial shape as the input. For the final hidden output, we take a global max pool over the grid. The unification function is modelled identically but without the max pooling, so that the unification features are a linear transformation of the hidden output of the convolutional layers at each grid cell.
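A minimal single-channel numpy sketch of the shape-preserving convolution and global max pool described above. The real model uses multiple filters and an embedding channel per symbol; here each 3×3 layer pads by one row and column on every side, which has the same overall effect as padding the input by two. All names are illustrative.

```python
import numpy as np

def conv2d_same(grid, kernel):
    """3x3 convolution with zero padding so the output keeps the
    input's spatial shape (single-channel sketch)."""
    k = kernel.shape[0]
    p = k // 2
    padded = np.pad(grid, p)
    h, w = grid.shape
    out = np.empty((h, w), dtype=float)
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + k, j:j + k] * kernel)
    return out

def global_max_pool(feature_map):
    """Collapse the spatial grid to a single scalar per channel."""
    return feature_map.max()

grid = np.arange(9.0).reshape(3, 3)          # toy 3x3 input grid
feat = conv2d_same(grid, np.ones((3, 3)))    # same-shape hidden output
pooled = global_max_pool(feat)               # final pooled feature
```

For the unification features one would skip `global_max_pool` and keep the per-cell outputs, matching the description above.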
A.2 Unification Memory Networks
Unlike the previous architectures, with UMN we interleave the unification network into the model, which we build as an iterative memory network. We take the final hidden state of a bidirectional GRU to represent each sentence of the context as a vector c_i and the query as a vector q. The initial state of the memory network is s^(0) = q. At each iteration t:
(6)  z_i^(t) = [ c_i ; s^(t) ; c_i ⊙ q ; c_i ⊙ s^(t) ]
(7)  α^(t) = softmax( w⊤ g(z^(t)) + b )
where g is another bidirectional GRU, ⊙ is the elementwise multiplication and [ · ; · ] is the concatenation of vectors. Taking α^(t) as the context attention, we obtain the next state of the memory network:
(8)  s^(t+1) = s^(t) + Σ_i α_i^(t) c_i
and iterate a fixed number of times T set in advance. The final prediction is a softmax over the candidate answers computed from the final state s^(T). All weight matrices and bias vectors are independent but are tied across iterations.
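The iteration sketched in equations 6–8 can be approximated in numpy as follows. The feature construction and the bi-GRU g are simplified here to elementwise products and a single linear scoring vector, so treat this as a schematic of the attend-then-update loop rather than the actual architecture; all names and sizes are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def memory_iterations(contexts, query, w, n_iters=3):
    """Iterative memory update: at each step, score every context vector
    against the current state, take a softmax attention over the context,
    and fold the attended context back into the state."""
    state = query.copy()
    for _ in range(n_iters):
        # per-sentence features: raw vectors plus elementwise interactions
        feats = np.concatenate(
            [contexts, contexts * state, contexts * query], axis=1)
        alpha = softmax(feats @ w)        # context attention (eq. 7 analogue)
        state = state + alpha @ contexts  # next memory state (eq. 8 analogue)
    return state

# Toy setup: 5 context sentences, vector size 8.
rng = np.random.default_rng(1)
ctx = rng.normal(size=(5, 8))
q = rng.normal(size=8)
final_state = memory_iterations(ctx, q, rng.normal(size=24))
```

The number of iterations plays the role of T, fixed in advance as in the text.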
Appendix B Generated Dataset Samples
Dataset   Task  Context           Query      Answer
Sequence  i     1488              constant   2
Sequence  ii    6157              head       6
Sequence  iii   1837              tail       7
Sequence  iv    3563              duplicate  3
Grid      i     (grid not shown)  box        2
Grid      ii    (grid not shown)  head       4
Grid      iii   (grid not shown)  centre     7
Grid      iv    (grid not shown)  corner     2
(Table of learned invariants for tasks i–iv on the Sequences and Grid datasets; invariant expressions not shown.)
Appendix C Training Details
C.1 Unification MLP & CNN
Both unification models are trained with 5-fold cross-validation over the generated datasets for 2000 iterations with a batch size of 64. We do not use any weight decay and record the training and test accuracies every 10 iterations, as presented in Figure 3.
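A sketch of how a 5-fold split of the generated data might be produced; the actual data pipeline is not described in the text, and the function name and seed are illustrative.

```python
import numpy as np

def five_fold_indices(n, seed=0):
    """Shuffle n example indices and split them into 5 disjoint folds,
    mirroring the 5-fold cross-validation setup."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, 5)

folds = five_fold_indices(100)
```

Each fold would serve once as the held-out test set while the remaining four are used for training.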
C.2 Unification Memory Networks
We again use a batch size of 64 and pre-train the model for 40 epochs, then train it jointly with the unification network for a further 260 epochs. We use epochs for UMN since the dataset sizes are fixed. To learn the unification network alongside the model, we combine error signals from the unification of the invariant and of the example. Following equation 4, the objective function incorporates not only the negative log-likelihood of the answer but also the mean squared error between the intermediate states of the invariant and of the example at each iteration as an auxiliary loss:
(9)  L = −log p(a) + λ Σ_t ‖ s^(t) − ŝ^(t) ‖²
We pre-train by disabling the auxiliary term (λ = 0) for 40 epochs and then enable it for the remaining epochs. For strong supervision we also compute the negative log-likelihood of the context attention α, described in Appendix A, at each iteration using the supporting facts of the tasks. We apply a dropout of 0.1 to all recurrent neural networks used and, only for the bAbI dataset, weight decay with a coefficient of 0.001.
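The combined objective of equation 9 can be written out as a small helper; λ, the state lists and the function name are illustrative stand-ins for the quantities defined above.

```python
import numpy as np

def combined_loss(answer_probs, answer_idx, states, hat_states, lam):
    """Negative log-likelihood of the correct answer plus a weighted
    mean squared error between the intermediate memory states of the
    example and of the invariant (auxiliary loss, cf. equation 9)."""
    nll = -np.log(answer_probs[answer_idx])
    mse = np.mean([np.mean((s - sh) ** 2)
                   for s, sh in zip(states, hat_states)])
    return nll + lam * mse

# Toy values: 3 candidate answers, 2 iterations, state size 4.
probs = np.array([0.1, 0.7, 0.2])
states = [np.zeros(4), np.ones(4)]
hat_states = [np.zeros(4), np.zeros(4)]
loss = combined_loss(probs, 1, states, hat_states, lam=1.0)
```

Setting `lam=0.0` recovers the plain answer loss used during the pre-training phase.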
Appendix D Further Results
Supervision  Weak  Weak  Strong  Strong  Strong
# Invs  1  3  1  3  3
Training Size  1k  1k  1k  1k  50
Task  (per-task test error %)
1  0.0  0.0  0.0  0.0  1.4 
2  65.6  63.1  0.3  0.7  30.0 
3  67.1  62.6  1.0  2.4  39.8 
4  0.0  0.0  0.0  0.0  37.0 
5  3.4  4.0  0.8  1.1  26.5 
6  0.2  0.6  0.0  0.0  18.4 
7  22.0  22.8  10.7  11.3  22.8 
8  10.3  8.5  7.4  7.6  24.7 
9  0.1  25.7  0.0  0.0  33.8 
10  0.1  2.0  0.0  0.3  32.6 
11  0.0  0.0  0.0  0.0  11.9 
12  0.0  0.1  0.0  0.0  21.3 
13  2.1  3.7  0.0  0.1  5.8 
14  19.7  13.5  0.5  0.1  54.8 
15  0.0  0.7  0.0  0.0  0.0 
16  55.2  56.2  0.0  0.0  39.7 
17  39.2  49.0  51.1  49.3  48.8 
18  4.4  8.0  0.6  0.5  10.4 
19  91.8  89.6  53.9  46.7  90.2 
20  0.0  0.0  0.0  0.0  2.7 
Mean  19.1  20.5  6.3  6.0  27.6 
Std  27.9  27.0  15.6  14.3  21.0 
# Failed (err > 5%)  8  10  4  4  17
Support  Weak  Weak  Weak  Weak  Weak  Weak  Weak  Strong  Strong  Strong
Size  1k  1k  1k  1k  1k  10k  10k  1k  1k  10k
Model  N2N  GN2N  EntNet  QRN  UMN  DMN+  DNC  MemNN  UMN  DMN
1  0.0  0.0  0.7  0.0  0.0  0.0  0.0  0.0  0.0  0.0 
2  8.3  8.1  56.4  0.5  65.6  0.3  0.4  0.0  0.7  1.8 
3  40.3  38.8  69.7  1.2  67.1  1.1  1.8  0.0  2.4  4.8 
4  2.8  0.4  1.4  0.7  0.0  0.0  0.0  0.0  0.0  0.0 
5  13.1  1.0  4.6  1.2  3.4  0.5  0.8  2.0  1.1  0.7 
6  7.6  8.4  30.0  1.2  0.2  0.0  0.0  0.0  0.0  0.0 
7  17.3  17.8  22.3  9.4  22.0  2.4  0.6  15.0  11.3  3.1 
8  10.0  12.5  19.2  3.7  10.3  0.0  0.3  9.0  7.6  3.5 
9  13.2  10.7  31.5  0.0  0.1  0.0  0.2  0.0  0.0  0.0 
10  15.1  16.5  15.6  0.0  0.1  0.0  0.2  2.0  0.3  2.5 
11  0.9  0.0  8.0  0.0  0.0  0.0  0.0  0.0  0.0  0.1 
12  0.2  0.0  0.8  0.0  0.0  0.0  0.0  0.0  0.0  0.0 
13  0.4  0.0  9.0  0.3  2.1  0.0  0.1  0.0  0.1  0.2 
14  1.7  1.2  62.9  3.8  19.7  0.2  0.4  1.0  0.1  0.0 
15  0.0  0.0  57.8  0.0  0.0  0.0  0.0  0.0  0.0  0.0 
16  1.3  0.1  53.2  53.4  55.2  45.3  55.1  0.0  0.0  0.6 
17  51.0  41.7  46.4  51.8  39.2  4.2  12.0  35.0  49.3  40.6 
18  11.1  9.2  8.8  8.8  4.4  2.1  0.8  5.0  0.5  4.7 
19  82.8  88.5  90.4  90.7  91.8  0.0  3.9  64.0  46.7  65.5 
20  0.0  0.0  2.6  0.3  0.0  0.0  0.0  0.0  0.0  0.0 
Mean  13.9  12.7  29.6  11.3  19.1  2.8  3.8  6.7  6.0  6.4 
# Failed (err > 5%)  11  10  15  5  8  1  2  4  4  2
Size  1k  10k  50  20k  
Support  Weak  Strong  Weak  Strong  Weak  
Arity  1  2  1  2  1  2  1  2  2  2  2  2 
# Invs / Model  1  3  1  3  1  3  1  1  3  3  DMN  IMA 
Facts  1.2  0.9  0.0  0.4  0.0  0.0  0.0  0.0  0.0  33.5  0.0  0.0 
Unification  0.0  10.3  0.0  10.8  0.0  0.0  0.0  0.0  0.0  41.3  13.0  10.0 
1 Step  50.3  49.8  4.4  20.0  1.2  27.8  0.1  1.3  5.7  50.2  26.0  2.0 
2 Steps  47.5  50.0  5.7  35.0  37.2  47.8  0.0  29.7  28.7  49.9  33.0  5.0 
3 Steps  47.6  49.2  10.4  38.7  39.6  45.6  0.0  26.0  26.1  48.3  23.0  6.0 
AND  31.3  37.4  10.7  16.4  29.8  29.0  0.2  0.4  1.2  50.0  20.0  5.0 
OR  25.2  38.1  21.0  35.0  20.5  30.2  4.4  20.6  17.4  47.6  13.0  3.0 
Transitivity  50.0  26.6  39.6  5.0  6.0  49.2  50.0  50.0  
1 Step NBF  46.4  38.7  3.8  28.8  1.1  21.6  0.1  1.1  8.0  47.6  21.0  2.0 
2 Steps NBF  48.5  48.9  7.7  39.6  30.4  48.2  0.1  33.4  28.7  50.3  15.0  4.0 
AND NBF  51.0  50.1  43.1  48.6  29.4  44.2  0.1  1.3  40.1  49.5  16.0  8.0 
OR NBF  51.4  48.4  50.8  47.3  47.6  47.8  21.3  27.6  30.5  47.3  25.0  14.0 
Mean  36.4  39.3  14.3  28.9  21.5  31.8  2.4  12.2  16.0  47.1  21.2  9.1 
Std  18.7  15.9  16.4  14.1  17.1  16.7  6.1  13.2  13.6  4.7  12.3  13.4 
# Failed (err > 5%)  9  11  7  11  7  10  1  5  9  12  11  5
X:sandra went back to the Y:bathroom 
is X:sandra in the Y:bathroom 
yes 
X:m ( Y:e )  X:m ( Y:e ) 

X:a ( Y:w , Z:e )  X:a ( Y:w , Z:e ) 
X:m ( T )  X:m ( c ) 
X:x ( A ) not Z:q ( A )  X:x ( Y:z ) 

