Learning Distributed Representations of Symbolic
Structure Using Binding and Unbinding Operations
Abstract
Widely used recurrent units, including Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU), perform well on natural language tasks, but their ability to learn structured representations is still questionable. Exploiting Tensor Product Representations (TPRs) — distributed representations of symbolic structure in which vector-embedded symbols are bound to vector-embedded structural positions — we propose the TPRU, a recurrent unit that, at each time step, explicitly executes structural-role binding and unbinding operations to incorporate structural information into learning. Experiments are conducted on both the Logical Entailment task and the Multi-genre Natural Language Inference (MNLI) task, and our TPR-derived recurrent unit provides strong performance with significantly fewer parameters than LSTM and GRU baselines. Furthermore, our TPRU trained on MNLI demonstrates solid generalisation ability on downstream tasks.
Shuai Tang (Department of Cognitive Science, UC San Diego), Paul Smolensky (Microsoft Research AI, Redmond, and Department of Cognitive Science, Johns Hopkins University), Virginia R. de Sa (Department of Cognitive Science, UC San Diego)
Preprint. Work in progress.
1 Introduction
Recent advances in deep learning benefit largely from neural networks' ability to learn distributed representations of inputs from various domains; even samples from different modalities can be easily compared in a common representation space. In contrast to localist (one-hot) representations, which cannot directly represent the possible componential structure within data [18], distributed representations are potentially capable of inducing implicit structure in the data, or explicit structure that is not presented along with the data. When it comes to statistical inference, distributed representations show considerable power, attesting to a strong ability to encode world knowledge and to use representation space efficiently. However, the interpretability of learnt distributed representations is limited; it is typically quite unclear what specifically has been encoded in them.
Symbolic computing can take advantage of the presented structure of data and denote each substructure with a symbol; throughout computation, the representations derived by symbolic computing maintain the structure of the data explicitly, and each substructure can be retrieved by simple, straightforward computation [8]. In terms of inducing implicit structure from data, symbolic computing aims to decompose each sample into an ensemble of unique symbols that carry the potential substructure of the data. With enough symbols, the underlying structure of the data can be encoded thoroughly. The explicit use of symbols in symbolic computing systems strengthens the induced representations, but it also brings issues including inefficient memory usage and computational expense.
Tensor product representation (TPR) [36] is an instantiation of general neural-symbolic computing in which symbol structures are given a filler-role decomposition: the structure is captured by a set of roles (e.g., left-child-of-root), each of which is bound to a filler (e.g., a symbol). A TPR embedding of a symbol structure derives from vector embeddings of the roles and their fillers via the outer or tensor product: $\mathbf{b} = \sum_i \mathbf{f}_i \otimes \mathbf{r}_i = F R^\top$, where $R$ and $F$ respectively denote matrices having the role or filler vectors as columns. Each $\mathbf{f}_i \otimes \mathbf{r}_i$ is the embedding of a role-filler binding; $\otimes$ is the binding operation. The unbinding operation returns the filler $\mathbf{f}_j$ of a particular role $\mathbf{r}_j$ in $\mathbf{b}$; it is performed by the inner product: $\mathbf{f}_j = \mathbf{b}\,\mathbf{u}_j$, where $\mathbf{u}_j$ is the dual of $\mathbf{r}_j$, satisfying $\mathbf{r}_i^\top \mathbf{u}_j = \delta_{ij}$.^1 Letting $U$ be the matrix with columns $\mathbf{u}_j$, we have $R^\top U = I$.

^1 $\delta_{ij} = 1$ if $i = j$; $\delta_{ij} = 0$ otherwise. $I$ is the identity matrix.
Let $n_{\mathrm{R}}$ denote the number of role vectors. For present purposes it turns out that it suffices to consider one-dimensional filler vectors, i.e. scalars $f_i$, in which case $\mathbf{f}_i \otimes \mathbf{r}_i = f_i \mathbf{r}_i$; we henceforth denote the resulting TPR $\sum_i f_i \mathbf{r}_i$ as $\mathbf{b}$, a binding complex. Let $\mathbf{f}$ be the column vector comprised of the $f_i$. Now the binding and unbinding operations, simultaneously over all roles, become

$\mathbf{b} = R\,\mathbf{f}, \qquad \mathbf{f} = U^\top \mathbf{b}. \qquad (1)$
With binding and unbinding operations and sufficient role vectors, the binding complex is able to represent data from a wide range of domains, with the structure of the data preserved.
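The binding and unbinding operations of Eq. 1 can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the paper's code; the dimensions and variable names are our own, and the dual (unbinding) vectors are obtained from the role matrix by a least-squares inverse, one standard way to satisfy the duality condition.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_roles = 16, 4                      # role dimension and number of roles (illustrative)

R = rng.standard_normal((d, n_roles))   # role vectors as columns
U = R @ np.linalg.inv(R.T @ R)          # unbinding (dual) vectors: satisfies R^T U = I

f = rng.standard_normal(n_roles)        # one scalar filler per role
b = R @ f                               # binding: the complex b stores all role-filler pairs
f_rec = U.T @ b                         # unbinding: recovers every filler simultaneously

assert np.allclose(f, f_rec)            # fillers are recovered exactly
```

With fewer roles than dimensions, as here, the duals exist and unbinding is exact; with more roles than dimensions, unbinding becomes approximate.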
We aim to incorporate symbolic computing into learning distributed representations of data when explicit structure is not presented to neural networks. Specifically, we propose a recurrent unit that executes binding and unbinding operations according to Eq. 1. The proposed recurrent unit leverages both the advantages of distributed representations and the ability to explore and maintain learnt structure from symbolic computing. Our contribution is threefold:
We propose a recurrent unit, named the TPRU, which integrates symbolic computing with learning distributed representations, and has significantly fewer parameters than the widely used Long Short-Term Memory (LSTM) [19] and Gated Recurrent Unit (GRU) [11].
We present experimental results with the TPRU on both the Logical Entailment task [16] and the Multi-genre Natural Language Inference (MNLI) dataset [41], both of which arguably require high-quality structured representations for making good predictions. The proposed unit provides strong performance on both tasks.
The TPRU trained on MNLI with the plain (attention-less) architecture demonstrates solid generalisation ability on downstream natural language tasks.
2 Related Work
Recent efforts on learning structured distributed representations can be roughly categorised into two types: enforcing a strong global geometrical constraint on the representation space, such as hyperbolic embedding [25, 43], and introducing inductive biases into the architecture of networks, including the Relational Memory Core [34] and neuralsymbolic computing methods [8]. The latter category divides into models that insert neural networks into discrete structures [7, 32, 37] and those that insert discrete structures into neural network representations [17]. Our work falls in this last category: learning structured representations by incorporating the inductive biases inherent in TPRs.
Some prior work has incorporated TPRs into RNN representations. Question answering on SQuAD [33] was addressed [26] with a gated recurrent unit in which the hidden state was a single TPR binding; the present work deploys complexes containing multiple bindings. TPR-style unbinding was applied in caption generation [21], but the representations there were deep-learned and not explicitly designed to be TPRs as in the present work. A contracted version of TPRs, Holographic Reduced Representations, was utilised to decompose the input and output spaces with a filler-role decomposition [6]. Our work differs from this prior work in that TPR filler-role binding and unbinding operations are explicitly carried out in our proposed recurrent unit, and moreover these two operations directly determine the calculation of the update on the hidden states.
The logical entailment task was introduced recently, and a model [16] was proposed that used given parse trees of the input propositions and was designed specifically for the task. Our model does not receive parsed input and must learn simultaneously to identify, encode and use the structure necessary to compute logical entailment. NLI (or Recognising Textual Entailment) has assumed a central role in NLP [14]. Neural models have persistently made errors explicable as failures to encode propositional structure [9], and our work targets that particular capability. Our proposed recurrent unit contains essential operations of symbolic computing systems, which encourages the learnt distributed representations to encode more structure-related information. We can thus expect our TPRU to give strong performance on both tasks even with significantly fewer parameters.
3 Proposed Recurrent Unit: The TPRU
As in both the LSTM [19] and GRU [11] designs, a gating mechanism helps the hidden state at the current time step directly copy information from the previous time step, and alleviates vanishing- and exploding-gradient issues. Our proposed recurrent unit keeps the gating mechanism and adopts the design of the input gate in the GRU.
At each time step, the TPRU receives two input vectors: the binding complex $\mathbf{b}_{t-1}$ from the previous time step and the vector representation $\mathbf{x}_t$ of the external input to the network at the current time step. The TPRU produces a binding complex $\mathbf{b}_t$. An input gate $\mathbf{g}_t$ is computed to form a weighted sum of the information $\tilde{\mathbf{b}}_t$ produced at the current time step and the binding complex $\mathbf{b}_{t-1}$ from the previous time step,
$\mathbf{g}_t = \sigma(W_g \mathbf{x}_t + V_g \mathbf{b}_{t-1}), \qquad (2)$

$\mathbf{b}_t = \mathbf{g}_t \odot \mathbf{b}_{t-1} + (1 - \mathbf{g}_t) \odot \tanh(\tilde{\mathbf{b}}_t), \qquad (3)$

where $\sigma$ is the logistic sigmoid function, $\tanh$ is the hyperbolic tangent function, $\odot$ is the Hadamard (elementwise) product, and $W_g$ and $V_g$ are matrices of learnable parameters. As we now explain, the calculation of $\tilde{\mathbf{b}}_t$ is carried out by the unbinding and binding operations of TPR (Eq. 1).
3.1 Unbinding Operation
Consider a set of hypothesised unbinding vectors $\{\mathbf{u}_j\}$, the columns of $U$. At time step $t$, these can be used to unbind fillers $\mathbf{f}^b_t$ from the previous binding complex $\mathbf{b}_{t-1}$ using Eq. 1. We posit a matrix $W_x$ that transforms the current input $\mathbf{x}_t$ into the binding space where it too can be unbound, yielding fillers $\mathbf{f}^x_t$:

$\mathbf{f}^b_t = U^\top \mathbf{b}_{t-1}, \qquad \mathbf{f}^x_t = U^\top W_x \mathbf{x}_t. \qquad (4)$
A strong sparsity constraint is enforced by applying a rectified linear unit (ReLU) to both $\mathbf{f}^b_t$ and $\mathbf{f}^x_t$ [42], and their interaction is calculated by taking the square of the sum of the two sparse vectors. The resulting vector is then normalised to form a distribution $\mathbf{a}_t$ over roles:

$\mathbf{s}_t = \left(\mathrm{ReLU}(\mathbf{f}^b_t) + \mathrm{ReLU}(\mathbf{f}^x_t)\right)^2, \qquad (5)$

$\mathbf{a}_t = \mathrm{softmax}(\gamma\, \mathbf{s}_t + \beta), \qquad (6)$

where the square in Eq. 5 is taken elementwise, and $\gamma$ and $\beta$ are two scalar parameters for stable learning.
3.2 Binding Operation
Given a hypothesised set of binding role vectors $\{\mathbf{r}_j\}$, the columns of $R$, we apply the binding operation in Eq. 1 to the fillers $\mathbf{a}_t$ at time $t$ to get the candidate update $\tilde{\mathbf{b}}_t$ for the binding complex:

$\tilde{\mathbf{b}}_t = R\,\mathbf{a}_t. \qquad (7)$
The gating mechanism controls the weighted sum of the candidate vector $\tilde{\mathbf{b}}_t$ and the previous binding complex $\mathbf{b}_{t-1}$ to produce the binding complex $\mathbf{b}_t$ at the current time step, as given by Eqs. 2 and 3.
3.3 Unbinding and Binding Role Vectors
In the TPRU, there is a matrix $R$ of role vectors used for the binding operation and a matrix $U$ of unbinding vectors used for the unbinding operation. To control the number of parameters in our proposed unit, instead of directly learning the role and unbinding vectors, a fixed set of vectors $V$ is sampled from a standard normal distribution, and two linear transformations $W_R$ and $W_U$ are learnt to transform $V$ to $R = W_R V$ and $U = W_U V$.
Therefore, in total, our proposed TPRU has five learnable matrices: $W_g$, $V_g$, $W_x$, $W_R$ and $W_U$. Compared to the six parameter matrices of the GRU and the eight of the LSTM, the TPRU has significantly fewer parameters.
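Putting the pieces of Section 3 together, one TPRU step can be sketched as follows. This is a minimal NumPy reconstruction under our reading of the text, not the authors' reference implementation: the parameter names, the softmax form of the normalisation (with scalars gamma and beta), and the placement of tanh in the gated update are assumptions.

```python
import numpy as np

def tpru_step(x_t, b_prev, params, V):
    """One TPRU step; an illustrative sketch, not the authors' reference code."""
    W_g, V_g, W_x, W_R, W_U, gamma, beta = params
    R, U = W_R @ V, W_U @ V                     # role / unbinding vectors from fixed V

    # Unbinding: fillers from the previous complex and from the projected input.
    f_b = U.T @ b_prev
    f_x = U.T @ (W_x @ x_t)

    # Sparse interaction, then normalisation to a distribution over roles.
    s = (np.maximum(f_b, 0.0) + np.maximum(f_x, 0.0)) ** 2
    z = gamma * s + beta
    a = np.exp(z - z.max())                     # numerically stable softmax
    a = a / a.sum()

    # Binding gives the candidate update; the input gate mixes it with b_prev.
    b_tilde = R @ a
    g = 1.0 / (1.0 + np.exp(-(W_g @ x_t + V_g @ b_prev)))
    return g * b_prev + (1.0 - g) * np.tanh(b_tilde)

# Toy shapes: d = binding dim, e = input dim, d0 = dim of fixed vectors, n = roles.
rng = np.random.default_rng(1)
d, e, d0, n = 8, 5, 6, 10
params = (rng.standard_normal((d, e)), rng.standard_normal((d, d)),
          rng.standard_normal((d, e)), rng.standard_normal((d, d0)),
          rng.standard_normal((d, d0)), 1.0, 0.0)
V = rng.standard_normal((d0, n))                # fixed, sampled once, never trained
b = np.zeros(d)
for x_t in rng.standard_normal((3, e)):         # unroll three time steps
    b = tpru_step(x_t, b, params, V)
```

Note that only five matrices (plus two scalars) are trainable; V stays fixed, which is how the unit keeps its parameter count below that of the GRU and LSTM.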
4 Tasks
Two entailment tasks, an abstract logical entailment task and a relatively realistic natural language entailment task, along with other downstream natural language tasks, are considered to demonstrate that the TPRU is capable of inducing structured representations through learning. Each of the two entailment tasks provides pairs of samples, and for each pair the model needs to tell whether the first (the premise) entails the second (the hypothesis).
As our goal is to learn structured vector representations, the proposed TPRU serves as an encoding function: it learns to process a proposition or sentence one token at a time and then produces a vector representation. During learning, two vector representations are produced by the same recurrent unit given a pair of samples; a simple feature-engineering method (e.g. concatenation of the two representations) is then applied to form an input vector for a subsequent classifier, which makes the final prediction. In general, with a simple classifier, e.g. a linear classifier or a multi-layer perceptron with a single hidden layer, the learning process forces the encoding function to produce high-quality representations of the samples, whether propositions or sentences, and better vector representations lead to stronger performance.
4.1 Logical Entailment
In propositional logic, for a pair of propositions $A$ and $B$, the value of $A \models B$ is independent of the identities of the variables shared between $A$ and $B$, and depends only on the structure of the expressions and the connectives in each subexpression, because entailment is preserved under consistent substitution for the variables. For example, $a \wedge b \models b$ holds no matter how we replace variable $a$ or $b$ with any other variables or propositions. Thus, logical entailment is naturally a good testbed for evaluating a model's ability to carry out abstract, highly structure-sensitive reasoning [16].
Theoretically, it is possible to construct a truth table that contains as rows/worlds all possible combinations of values of the variables in both propositions $A$ and $B$; the value of $A \models B$ can then be checked by going through every row. An example is given in the supplementary material. As the logical entailment task emphasises reasoning on connectives, it requires the learnt distributed vector representations to encode the structure of any given proposition to excel at the task.
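The brute-force truth-table check described above is easy to state as code. A small sketch (the function and variable names are ours, and propositions are represented as Python predicates purely for illustration) that enumerates every world:

```python
from itertools import product

def entails(A, B, variables):
    """Check A |= B by enumerating every world (row of the truth table).

    A and B are functions from a world (dict of variable -> bool) to bool;
    this representation is illustrative, not taken from the paper.
    """
    for values in product([True, False], repeat=len(variables)):
        world = dict(zip(variables, values))
        if A(world) and not B(world):   # a counterexample world
            return False
    return True

# a AND b entails b, but not the converse:
assert entails(lambda w: w["a"] and w["b"], lambda w: w["b"], ["a", "b"])
assert not entails(lambda w: w["b"], lambda w: w["a"] and w["b"], ["a", "b"])
```

The enumeration is exponential in the number of variables, which is why the "big" and "massive" test splits are far beyond naive tabulation and demand structural generalisation instead.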
The dataset used in our experiments has balanced positive and negative classes, and the task difficulty of the training set is comparable to that of the validation set.^2 Five test sets are generated to evaluate generalisation ability at different difficulty levels: some test sets have significantly more variables and operators than both the training and validation sets (see Table 1).

^2 https://github.com/deepmind/logical-entailment-dataset
4.2 Multigenre Natural Language Inference
Natural language inference (NLI) tasks require inferring word meaning in context as well as the hierarchical relations among constituents in a given sentence (either premise or hypothesis), and then reasoning about whether the premise sentence entails the hypothesis sentence. Compared to logical entailment, the inference and reasoning in NLI rely on the identities of words in the sentences in addition to their structure. More importantly, the ambiguity and polysemy of language make it impossible to create a truth table that lists all cases. NLI is therefore an intrinsically hard task.
The Multi-genre Natural Language Inference (MNLI) dataset [41] collected sentence pairs in ten genres; only five genres are available in the training set, while all ten genres are present in the development set. There are three classes: Entailment, Neutral and Contradiction. The performance of a model on the mismatched genres, which appear in the development set but not the training set, tells us how well the structure encoded in distributed vector representations learnt from the genres seen in training generalises to sentence pairs in unseen genres. As the nature of NLI tasks requires inferring both word meaning and the structure of constituents in a given sentence, supervised training signals from labelled datasets force an encoding function to analyse meaning and structure at the same time during learning, which eventually drives the distributed vector representations produced by the learnt encoding function to be structured. Thus, a suitable inductive bias that enhances the ability to learn the structure of sentences should enhance success on the MNLI task.
4.3 Downstream Natural Language Tasks
Vector representations of sentences learnt from labelled NLI tasks demonstrate strong transferability and generalisation, which indicates that the learnt encoding function can be applied as a general-purpose sentence encoder to other downstream natural language tasks [13]. As our proposed TPRU also maps any given sentence to a distributed vector representation, it is reasonable to evaluate the learnt vector representations on other natural language tasks, where the performance of our proposed recurrent unit indicates the generalisation ability of the learnt representations.
SentEval [12] presents a collection of natural language tasks in various domains, including sentiment analysis (MR [28], SST [38], CR [20], SUBJ [27], MPQA [40]), paraphrase detection (MRPC [15]), semantic relatedness (SICK [24]), question-type classification (TREC [23]) and semantic textual similarity (STS [1, 2, 3, 4, 5]). Except for the STS tasks, in which the cosine similarity of a pair of sentence representations is compared with a human-annotated similarity score, each task requires learning a linear classifier on top of the produced sentence representations to make predictions.
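For the unsupervised STS tasks, evaluation needs no classifier: the cosine similarity of each representation pair is correlated with the human score. A minimal sketch of that protocol (function names are ours, and SentEval's own implementation differs in detail):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two sentence representations."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def sts_pearson(pairs, human_scores):
    """Pearson correlation between cosine similarities and human scores."""
    sims = np.array([cosine(u, v) for u, v in pairs])
    return float(np.corrcoef(sims, np.asarray(human_scores))[0, 1])
```

A high correlation means the geometry of the representation space, not any task-specific classifier, already reflects human similarity judgements.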

Table 1: Accuracy (%) on the Logical Entailment validation set and the five test sets (easy, hard, big, massive, exam); BiDAF results in parentheses. Numbers under "Ours" give the number of role vectors.

| model | valid | easy | hard | big | massive | exam | # params |
| Mean | 75.7 | 81.0 | 184.4 | 3,310.8 | 848,570.0 | 5.8 | |
| Plain (BiDAF) Architecture, dim 64 |
| LSTM | 71.7 (88.5) | 71.8 (88.7) | 64.1 (74.5) | 64.2 (73.8) | 53.7 (66.8) | 68.3 (80.0) | 65.5k (230.0k) |
| GRU | 75.1 (87.9) | 77.1 (88.3) | 63.7 (72.5) | 63.8 (71.3) | 54.4 (66.1) | 73.7 (78.0) | 49.1k (172.4k) |
| Ours 8 | 66.8 (86.2) | 67.2 (87.1) | 59.3 (69.1) | 60.9 (68.2) | 51.9 (62.5) | 67.0 (74.3) | 40.1k (131.3k) |
| Ours 32 | 73.7 (88.4) | 73.7 (88.4) | 62.7 (71.1) | 62.8 (70.1) | 53.0 (64.9) | 76.7 (77.0) | |
| Ours 128 | 75.9 (88.5) | 76.0 (88.6) | 64.9 (71.5) | 64.0 (69.8) | 53.8 (64.1) | 75.7 (80.0) | |
| Ours 512 | 76.8 (88.6) | 76.8 (89.2) | 64.4 (72.6) | 64.6 (71.2) | 54.6 (64.4) | 75.3 (80.0) | |
| Plain (BiDAF) Architecture, dim 128 |
| LSTM | 64.5 (88.6) | 64.2 (89.3) | 59.7 (74.7) | 62.1 (73.5) | 50.9 (67.4) | 65.0 (78.3) | 196.6k (917.5k) |
| GRU | 80.8 (86.2) | 80.3 (85.7) | 65.9 (69.1) | 66.0 (69.1) | 55.0 (63.1) | 77.3 (72.7) | 147.5k (688.1k) |
| Ours 8 | 63.7 (87.1) | 63.4 (87.3) | 57.5 (69.4) | 59.6 (68.1) | 51.3 (62.7) | 65.0 (76.0) | 131.1k (524.3k) |
| Ours 32 | 71.5 (88.2) | 71.7 (88.5) | 62.6 (71.6) | 62.4 (70.3) | 52.0 (64.4) | 78.3 (78.3) | |
| Ours 128 | 72.8 (88.4) | 73.1 (89.0) | 63.8 (72.4) | 62.8 (71.5) | 52.6 (66.3) | 71.3 (80.0) | |
| Ours 512 | 79.6 (88.6) | 79.6 (89.2) | 66.1 (72.7) | 65.9 (70.8) | 55.2 (64.9) | 80.3 (79.7) |
Table 2: Accuracy (%) on the MNLI development sets; BiDAF results in parentheses. Numbers under "Ours" give the number of role vectors.

| model | dev matched | dev mismatched | # params |
| Plain (BiDAF) Architecture, dim 512 |
| LSTM | 72.0 (76.0) | 73.2 (75.5) | 10.5m (29.4m) |
| GRU | 72.1 (74.2) | 72.8 (74.8) | 7.9m (22.0m) |
| Ours 16 | 72.4 (73.9) | 73.5 (75.0) | 5.8m (15.7m) |
| Ours 64 | 73.0 (74.8) | 73.5 (75.5) | |
| Ours 256 | 73.1 (75.9) | 73.9 (76.8) | |
| Ours 1024 | 73.2 (76.2) | 73.8 (76.6) | |
| Plain (BiDAF) Architecture, dim 1024 |
| LSTM | 72.5 (75.5) | 73.9 (76.6) | 25.2m (83.9m) |
| GRU | 72.6 (74.8) | 73.6 (75.9) | 18.9m (62.9m) |
| Ours 16 | 72.9 (73.9) | 73.7 (74.8) | 14.7m (46.1m) |
| Ours 64 | 73.4 (75.2) | 74.4 (76.0) | |
| Ours 256 | 73.7 (75.5) | 74.6 (76.7) | |
| Ours 1024 | 74.2 (76.7) | 74.7 (77.3) |
5 Training Details
Experiments are conducted in PyTorch [30] with the Adam optimiser [22] and gradient clipping [29]. Reported results are averaged over three random initialisations.
5.1 Plain Architecture
For the Logical Entailment task, we train our proposed TPRU as well as LSTM [19] and GRU [11] RNNs for 90 epochs. Only the output at the last time step is regarded as the representation of a given proposition; the two proposition representations are concatenated, as done in previous work [16], and fed into a multi-layer perceptron with a single ReLU hidden layer. The initial learning rate is decayed by a fixed factor at regular epoch intervals. The best model is picked based on performance on the validation set, and then evaluated on all five test sets with different difficulty levels. Symbolic Vocabulary Permutation [16] is applied as data augmentation during learning; it systematically replaces variables with randomly sampled variables, which preserves entailment since only the connectives matter on this task. Detailed results are presented in Table 1.
For the MNLI task, our proposed TPRU as well as LSTM and GRU units are trained for 10 epochs. A global max-pooling over time is applied on top of the binding complexes produced by each recurrent unit at all time steps to generate the vector representation of a given sentence. Given a pair of sentence representations $\mathbf{u}$ and $\mathbf{v}$, a feature vector $[\mathbf{u}; \mathbf{v}; |\mathbf{u}-\mathbf{v}|; \mathbf{u}\odot\mathbf{v}]$ is constructed, where $\odot$ is the Hadamard (elementwise) product and $|\mathbf{u}-\mathbf{v}|$ is the elementwise absolute difference; this vector is fed into a multi-layer perceptron with the same settings as given above. The feature engineering and the choice of classifier follow prior work [35, 39]. The Stanford Natural Language Inference (SNLI) dataset [9] is added as additional training data as recommended [39], and ELMo [31] is used to produce vector representations of words. The initial learning rate is 0.0001 and kept constant during learning. The best model is chosen according to the classification accuracy averaged over the matched (five genres that appear in both the training and dev sets) and mismatched (five genres in the dev set only) sets. LSTM and GRU models use the same settings. The performance of each model is presented in Table 2.
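The pair-feature construction used for the classifier input can be sketched directly; we assume the usual InferSent-style ordering of the four components [13, 39], which the paper does not spell out:

```python
import numpy as np

def pair_features(u, v):
    """Concatenate two sentence vectors with their elementwise
    comparison features; the ordering of the blocks is an assumption."""
    return np.concatenate([u, v, np.abs(u - v), u * v])
```

For d-dimensional sentence vectors, the classifier therefore receives a 4d-dimensional input.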
For downstream natural language tasks, the parameters of the learnt recurrent unit (our proposed TPRU, LSTM or GRU) are fixed and used to extract vector representations of sentences for each task. Linear logistic regression or softmax regression is applied when additional learning is required to make predictions. Details of the hyperparameter settings of the classifiers can be found in the SentEval package.^3 Table 3 presents macro-averaged results for the groups Binary (MR, CR, SUBJ, MPQA and SST), STS (Su., including SICK-R and STS-Benchmark), STS (Un., including STS12–16), TREC, SICK-E and MRPC.

^3 https://github.com/facebookresearch/SentEval
Table 3: Results on downstream tasks in SentEval (plain architecture). Binary, SST-5, TREC and SICK-E report accuracy; STS (Su.) and STS (Un.) report Pearson's r; MRPC reports accuracy/F1. Numbers under "Ours" give the number of role vectors.

| Model | Binary | SST-5 | TREC | SICK-E | STS (Su.) | STS (Un.) | MRPC |
| Plain Architecture, dim 512 |
| LSTM | 87.0 | 47.5 | 89.7 | 84.4 | 81.8 | 62.5 | 77.8 / 83.8 |
| GRU | 87.0 | 47.5 | 91.1 | 84.8 | 80.3 | 62.5 | 76.9 / 83.4 |
| Ours 16 | 86.8 | 47.0 | 89.5 | 84.8 | 80.0 | 60.7 | 76.3 / 82.8 |
| Ours 64 | 87.1 | 46.9 | 89.9 | 85.1 | 80.8 | 62.1 | 76.8 / 83.3 |
| Ours 256 | 87.2 | 47.2 | 90.1 | 85.2 | 81.3 | 62.6 | 77.4 / 84.1 |
| Ours 1024 | 87.4 | 48.1 | 90.5 | 85.4 | 82.4 | 62.8 | 77.1 / 83.9 |
| Plain Architecture, dim 1024 |
| LSTM | 87.6 | 47.3 | 92.7 | 85.0 | 81.7 | 63.3 | 77.0 / 83.6 |
| GRU | 87.5 | 48.9 | 92.6 | 85.8 | 81.2 | 62.8 | 77.6 / 84.0 |
| Ours 16 | 87.4 | 47.5 | 91.3 | 85.6 | 79.6 | 60.9 | 76.2 / 83.2 |
| Ours 64 | 87.8 | 47.8 | 92.0 | 85.6 | 80.7 | 62.3 | 77.5 / 83.8 |
| Ours 256 | 87.8 | 47.9 | 92.5 | 86.0 | 80.6 | 63.3 | 77.6 / 83.9 |
| Ours 1024 | 87.9 | 48.5 | 91.9 | 85.9 | 81.5 | 63.9 | 77.5 / 84.4 |
5.2 BiDAF Architecture
Bidirectional Attention Flow (BiDAF) [35] has been adopted in various natural language tasks, including machine comprehension [35] and question answering [10], and provides strong performance on NLI tasks [39]. The BiDAF architecture can generally be applied to any task that requires modelling relations between pairs of sequences. As both the Logical Entailment and MNLI tasks require classifying whether one sequence entails another, BiDAF is well-suited here.
The BiDAF architecture contains one layer for encoding the two input sequences, and another for encoding the concatenation of the output of the first layer and the context vectors determined by the bidirectional attention mechanism. In our experiments, the dimensions of both layers are set to be the same, and the same type of recurrent unit is applied across both layers. The same settings are used in the experiments for the LSTM, GRU and our TPRU models. Specifically, for the TPRU, the recurrent units in both layers have the same number of role vectors. Other learning details are as in the plain architecture. Tables 1 and 2 respectively present results on the Logical Entailment and MNLI tasks, with BiDAF results in parentheses.
6 Discussion
As presented in Tables 1 and 2, our proposed TPRU provides solid performance. On the Logical Entailment task, the TPRU performs similarly to the LSTM and GRU under both the plain and BiDAF architectures, but with significantly fewer parameters. At larger dimensionalities, our TPRU also appears to be more stable than the LSTM and GRU during learning: we observed that the LSTM with the plain architecture overfitted the training set severely in all three trials, and the GRU with the BiDAF architecture failed in two out of three trials.
On the MNLI task, our proposed TPRU consistently outperforms both the LSTM and GRU under all four combinations of dimension and architecture. Unexpectedly, all models, including the LSTM, GRU and our TPRU, provide better results on the mismatched dev set than on the matched one, possibly because the mismatched set is slightly easier. In Table 3, the TPRU under the plain architecture generalises as well as the LSTM and GRU on the 16 downstream tasks in SentEval.
6.1 Effect of Increasing the Number of Role Vectors
In TPRs [36], the number of role vectors determines the number of unique symbols available in the final representations of the data. Since each symbol is capable of representing a specific substructure of the input data, increasing the number of role vectors eventually leads to more highly structured representations, provided there is no limit on the dimensionality of the role vectors.
Experiments are conducted to show the effect of increasing the number of role vectors on the performance on both tasks. As shown in Tables 1 and 2, adding more role vectors to our proposed TPRU gradually improves performance on the two entailment tasks. Interestingly, on the MNLI task, our proposed TPRU with only 16 role vectors achieves performance similar to that of the LSTM and GRU, which implies that the distributed representations learnt by the LSTM and GRU are highly redundant and could be reduced to 16 or even fewer dimensions; it also suggests that the LSTM and GRU are unable to exploit the representation space extensively. Meanwhile, the symbolic computing executed by the binding and unbinding operations in our proposed unit encourages the model to take advantage of distinct role vectors to learn useful structured representations.
Figure 1 presents the learning curves, including training loss and accuracy, of our proposed TPRU with different numbers of role vectors on the two entailment tasks. As shown in the graphs, incorporating more role vectors leads not only to better performance but also to faster convergence during training. The observation is consistent across both the Logical Entailment and MNLI tasks.
7 Conclusion
We proposed a recurrent unit, the TPRU, that executes the binding and unbinding operations of Tensor Product Representations. Executing these operations explicitly lets the unit leverage the advantages of both distributed representations and neural-symbolic computing, which essentially allows it to learn structured representations. Compared to widely used recurrent units, including the LSTM and GRU, our proposed TPRU has many fewer parameters.
The Logical Entailment and Multi-genre Natural Language Inference tasks are selected for experiments, as both require highly structured representations to make good predictions. Plain and BiDAF architectures are applied to both tasks. Our proposed TPRU outperforms its comparison partners, the LSTM and GRU, on the MNLI task across different dimensions and architectures, and performs comparably on the Logical Entailment task. Analysis shows that adding more role vectors tends to provide stronger results and faster convergence during learning, which parallels the utility of symbols in symbolic computing systems.
We believe that our work pushes research on interpreting RNNs in a new direction by incorporating symbolic computing. Future work should focus on the interpretability of our proposed TPRU, as symbolic computing is explicitly conducted by its binding and unbinding operations.
Acknowledgements
Many thanks to Microsoft Research AI, Redmond for supporting the research, and to Elizabeth Clark and YooJung Choi for helpful clarification of concepts.
References
 [1] E. Agirre, C. Banea, C. Cardie, D. M. Cer, M. T. Diab, A. Gonzalez-Agirre, W. Guo, I. Lopez-Gazpio, M. Maritxalar, R. Mihalcea, G. Rigau, L. Uria, and J. Wiebe. SemEval-2015 task 2: Semantic textual similarity, English, Spanish and pilot on interpretability. In SemEval@NAACL-HLT, 2015.
 [2] E. Agirre, C. Banea, C. Cardie, D. M. Cer, M. T. Diab, A. Gonzalez-Agirre, W. Guo, R. Mihalcea, G. Rigau, and J. Wiebe. SemEval-2014 task 10: Multilingual semantic textual similarity. In SemEval@COLING, 2014.
 [3] E. Agirre, C. Banea, D. M. Cer, M. T. Diab, A. Gonzalez-Agirre, R. Mihalcea, G. Rigau, and J. Wiebe. SemEval-2016 task 1: Semantic textual similarity, monolingual and cross-lingual evaluation. In SemEval@NAACL-HLT, 2016.
 [4] E. Agirre, D. M. Cer, M. T. Diab, and A. Gonzalez-Agirre. SemEval-2012 task 6: A pilot on semantic textual similarity. In SemEval@NAACL-HLT, 2012.
 [5] E. Agirre, D. M. Cer, M. T. Diab, A. Gonzalez-Agirre, and W. Guo. *SEM 2013 shared task: Semantic textual similarity. In *SEM@NAACL-HLT, 2013.
 [6] Anonymous. Towards decomposed linguistic representations with holographic reduced representation. In Under review for ICLR2019, 2019.
 [7] P. W. Battaglia, J. B. Hamrick, V. Bapst, A. SanchezGonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
 [8] T. R. Besold, A. S. d’Avila Garcez, S. Bader, H. Bowman, P. M. Domingos, P. Hitzler, K.U. Kühnberger, L. C. Lamb, D. Lowd, P. M. V. Lima, L. de Penning, G. Pinkas, H. Poon, and G. Zaverucha. Neuralsymbolic learning and reasoning: A survey and interpretation. CoRR, abs/1711.03902, 2017.
 [9] S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning. A large annotated corpus for learning natural language inference. In EMNLP, 2015.
 [10] D. Chen, A. Fisch, J. Weston, and A. Bordes. Reading wikipedia to answer opendomain questions. In ACL, 2017.
 [11] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
 [12] A. Conneau and D. Kiela. Senteval: An evaluation toolkit for universal sentence representations. In LREC, 2018.
 [13] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes. Supervised learning of universal sentence representations from natural language inference data. In EMNLP, 2017.
 [14] I. Dagan, O. Glickman, and B. Magnini. The pascal recognising textual entailment challenge. In MLCW, 2005.
 [15] W. B. Dolan, C. Quirk, and C. Brockett. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In COLING, 2004.
 [16] R. Evans, D. Saxton, D. Amos, P. Kohli, and E. Grefenstette. Can neural networks understand logical entailment? In ICLR, 2018.
 [17] W. Hamilton, P. Bajaj, M. Zitnik, D. Jurafsky, and J. Leskovec. Embedding logical queries on knowledge graphs. arXiv preprint arXiv:1806.01445, 2018.
 [18] G. E. Hinton, J. L. McClelland, and D. E. Rumelhart. Distributed representations. 1984.
 [19] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9:1735–1780, 1997.
 [20] M. Hu and B. Liu. Mining and summarizing customer reviews. In KDD, 2004.
 [21] Q. Huang, P. Smolensky, X. He, L. Deng, and D. O. Wu. Tensor product generation networks for deep NLP modeling. In NAACL-HLT, 2018.
 [22] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [23] X. Li and D. Roth. Learning question classifiers. In COLING, 2002.
 [24] M. Marelli, S. Menini, M. Baroni, L. Bentivogli, R. Bernardi, and R. Zamparelli. A sick cure for the evaluation of compositional distributional semantic models. In LREC, 2014.
 [25] M. Nickel and D. Kiela. Poincaré embeddings for learning hierarchical representations. In NIPS, 2017.
 [26] H. Palangi, P. Smolensky, X. He, and L. Deng. Deep learning of grammatically-interpretable representations through question-answering. CoRR, abs/1705.08432, 2017.
 [27] B. Pang and L. Lee. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In ACL, 2004.
 [28] B. Pang and L. Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In ACL, 2005.
 [29] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In ICML, 2013.
 [30] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. 2017.
 [31] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. S. Zettlemoyer. Deep contextualized word representations. In NAACL-HLT, 2018.
 [32] J. B. Pollack. Recursive distributed representations. Artificial Intelligence, 46(1–2):77–105, 1990.
 [33] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP, 2016.
 [34] A. Santoro, R. Faulkner, D. Raposo, J. W. Rae, M. Chrzanowski, T. Weber, D. Wierstra, O. Vinyals, R. Pascanu, and T. P. Lillicrap. Relational recurrent neural networks. CoRR, abs/1806.01822, 2018.
 [35] M. J. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi. Bidirectional attention flow for machine comprehension. 2017.
 [36] P. Smolensky. Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artif. Intell., 46:159–216, 1990.
 [37] R. Socher, C. D. Manning, and A. Y. Ng. Learning continuous phrase representations and syntactic parsing with recursive neural networks. In Proceedings of the NIPS2010 Deep Learning and Unsupervised Feature Learning Workshop, pages 1–9, 2010.
 [38] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, 2013.
 [39] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman. Glue: A multitask benchmark and analysis platform for natural language understanding. CoRR, abs/1804.07461, 2018.
 [40] J. Wiebe, T. Wilson, and C. Cardie. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39:165–210, 2005.
 [41] A. Williams, N. Nangia, and S. R. Bowman. A broadcoverage challenge corpus for sentence understanding through inference. CoRR, abs/1704.05426, 2017.
 [42] Z. Yang, J. J. Zhao, B. Dhingra, K. He, W. W. Cohen, R. Salakhutdinov, and Y. LeCun. Glomo: Unsupervisedly learned relational graphs as transferable representations. CoRR, abs/1806.05662, 2018.
 [43] Ç. Gülçehre, M. Denil, M. Malinowski, A. Razavi, R. Pascanu, K. M. Hermann, P. W. Battaglia, V. Bapst, D. Raposo, A. Santoro, and N. de Freitas. Hyperbolic attention networks. CoRR, abs/1805.09786, 2018.
Appendix A: A Lookup Table Approach to Logical Entailment
Given proposition $A = a \wedge b$ and proposition $B = b$, a truth table presents all possible combinations of values of $a$ and $b$, and the values of $A$ and $B$ for each combination. $A \models B$ holds iff, as here, in every row/world the value of $A$ is less than or equal to that of $B$.

| $a$ | $b$ | $A = a \wedge b$ | $B = b$ |
| T | T | T (1) | T (1) |
| T | F | F (0) | F (0) |
| F | T | F (0) | T (1) |
| F | F | F (0) | F (0) |