Compositional Vector Space Models for Knowledge Base Completion
Knowledge base (KB) completion adds new facts to a KB by making inferences from existing facts, for example by inferring with high likelihood nationality(X,Y) from bornIn(X,Y). Most previous methods infer simple one-hop relational synonyms like this, or use as evidence a multi-hop relational path treated as an atomic feature, like bornIn(X,Z) containedIn(Z,Y). This paper presents an approach that reasons about conjunctions of multi-hop relations non-atomically, composing the implications of a path using a recurrent neural network (RNN) that takes as inputs vector embeddings of the binary relation in the path. Not only does this allow us to generalize to paths unseen at training time, but also, with a single high-capacity RNN, to predict new relation types not seen when the compositional model was trained (zero-shot learning). We assemble a new dataset of over 52M relational triples, and show that our method improves over a traditional classifier by 11%, and a method leveraging pre-trained embeddings by 7%.
Constructing large knowledge bases (KBs) supports downstream reasoning about resolved entities and their relations, rather than the noisy textual evidence surrounding their natural language mentions. For this reason KBs have been of increasing interest in both industry and academia [Bollacker et al., 2008; Suchanek et al., 2007; Carlson et al., 2010]. Such KBs typically contain many millions of facts, most of them (entity1, relation, entity2) “triples” (also known as binary relations) such as (Barack Obama, presidentOf, USA) and (Brad Pitt, marriedTo, Angelina Jolie).
However, even the largest KBs are woefully incomplete [Min et al., 2013], missing many important facts, and therefore damaging their usefulness in downstream tasks. Ironically, these missing facts can frequently be inferred from other facts already in the KB, thus representing a sort of inconsistency that can be repaired by the application of an automated process. The addition of new triples by leveraging existing triples is typically known as KB completion.
Early work on this problem focused on learning symbolic rules. For example, Schoenmackers et al. (2010) learn Horn clauses predictive of new binary relations by exhaustively exploring relational paths of increasing length, and selecting those surpassing an accuracy threshold. (A “path” is a sequence of triples in which the second entity of each triple matches the first entity of the next triple.) Lao et al. (2011) introduced the Path Ranking Algorithm (PRA), which greatly improves efficiency and robustness by replacing exhaustive search with random walks, and using unique paths as features in a per-target-relation binary classifier. A typical predictive feature learned by PRA is that CountryOfHeadquarters(X, Y) is implied by IsBasedIn(X, A) and StateLocatedIn(A, B) and CountryLocatedIn(B, Y). Given IsBasedIn(Microsoft, Seattle), StateLocatedIn(Seattle, Washington) and CountryLocatedIn(Washington, USA), we can infer the fact CountryOfHeadquarters(Microsoft, USA) using the predictive feature. In later work, Lao et al. (2012) greatly increase the raw material available for paths by augmenting KB-schema relations with relations defined by the text connecting mentions of entities in a large corpus (also known as OpenIE relations [Banko et al., 2007]).
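As a minimal sketch of what applying such a path feature means, the following toy example follows a sequence of relations through a small set of triples. The function names `build_index` and `follow_path` are hypothetical, and the KB holds only the three triples from the example above; this is an illustration of path following, not PRA itself.

```python
from collections import defaultdict

# Toy KB holding only the three triples from the example above.
TRIPLES = [
    ("Microsoft", "IsBasedIn", "Seattle"),
    ("Seattle", "StateLocatedIn", "Washington"),
    ("Washington", "CountryLocatedIn", "USA"),
]


def build_index(triples):
    """Map (head entity, relation) to the set of tail entities."""
    index = defaultdict(set)
    for head, relation, tail in triples:
        index[(head, relation)].add(tail)
    return index


def follow_path(index, source, relation_path):
    """Return all entities reached from `source` by following the relations in order."""
    frontier = {source}
    for relation in relation_path:
        frontier = {
            tail
            for head in frontier
            for tail in index.get((head, relation), ())
        }
    return frontier


index = build_index(TRIPLES)
path = ["IsBasedIn", "StateLocatedIn", "CountryLocatedIn"]
reachable = follow_path(index, "Microsoft", path)
print(reachable)  # {'USA'}
```

Because the path connects Microsoft to USA, a classifier that has learned this path as a feature for CountryOfHeadquarters would fire on the pair (Microsoft, USA).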
However, these symbolic methods can produce many millions of distinct paths, each of which is categorically distinct, treated by PRA as a distinct feature. (See Figure 1.) Even putting aside the OpenIE relations, this limits the applicability of these methods to modern KBs that have thousands of relation types, since the number of distinct paths increases rapidly with the number of relation types. If textually-defined OpenIE relations are included, the problem is obviously far more severe.
Better generalization can be gained by operating on embedded vector representations of relations, in which vector similarity can be interpreted as semantic similarity. For example, Bordes et al. (2013) learn low-dimensional vector representations of entities and KB relations, such that vector differences between two entities should be close to the vectors associated with their relations. This approach can find relation synonyms, and thus perform a kind of one-to-one, non-path-based relation prediction for KB completion. Similarly, Nickel et al. (2011) and Socher et al. (2013a) perform KB completion by learning embeddings of relations, but based on matrices or tensors. Universal schema [Riedel et al., 2013] learns to perform relation prediction cast as matrix completion (likewise using vector embeddings), but predicts textually-defined OpenIE relations as well as KB relations, and embeds entity-pairs in addition to individual entities. Like all of the above, it also reasons about individual relations, not the evidence of a connected path of relations.
This paper proposes an approach combining the advantages of (a) reasoning about conjunctions of relations connected in a path, (b) generalization through vector embeddings, and (c) reasoning non-atomically and compositionally about the elements of the path, for further generalization.
Our method uses recurrent neural networks (RNNs) [Werbos, 1990] to compose the semantics of relations in an arbitrary-length path. At each path-step it consumes both the vector embedding of the next relation and the vector representing the path-so-far, then outputs a composed vector (representing the extended path-so-far), which will be the input to the next step. After consuming a path, the RNN should output a vector in the semantic neighborhood of the relation between the first and last entity of the path. For example, after consuming the relation vectors along the path Melinda Gates → Bill Gates → Microsoft → Seattle, our method produces a vector very close to the relation livesIn.
Our compositional approach allows us at test time to make predictions from paths that were unseen during training, because of the generalization provided by vector neighborhoods, and because they are composed in non-atomic fashion. This allows our model to seamlessly perform inference on many millions of paths in the KB graph. In most of our experiments, we learn a separate RNN for predicting each relation type, but alternatively, by learning a single high-capacity composition function for all relation types, our method can perform zero-shot learning, that is, predicting new relation types for which the composition function was never explicitly trained.
Related to our work, new versions of PRA [Gardner et al., 2013; Gardner et al., 2014] use pre-trained vector representations of relations to alleviate its feature explosion problem, but the core mechanism continues to be a classifier based on atomic-path features. In the 2013 work many paths are collapsed by clustering paths according to their relations’ embeddings, and substituting cluster ids for the original relation types. In the 2014 work unseen paths are mapped to nearby paths seen at training time, where nearness is measured using the embeddings. Neither is able to perform zero-shot learning since there must be a classifier for each predicted relation type. Furthermore, their pre-trained vectors do not have the opportunity to be tuned to the KB completion task because the two sub-tasks are completely disentangled.
An additional contribution of our work is a new large-scale data set of over 52 million triples, and its preprocessing for purposes of path-based KB completion (available for download at http://iesl.cs.umass.edu/downloads/inferencerules/release.tar.gz). The dataset is built from the combination of Freebase [Bollacker et al., 2008] and Google’s entity linking in ClueWeb [Orr et al., 2013]. Rather than Gardner’s 1000 distinct paths per relation type, we have over 2 million. Rather than Gardner’s 200 entity pairs, we use over 10k. All experimental comparisons below are performed on this new data set.
On this challenging large-scale dataset our compositional method outperforms PRA [Lao et al., 2012] and Cluster PRA [Gardner et al., 2013] by 11% and 7% respectively. A further contribution of our work is a new, surprisingly strong baseline method using classifiers of path bigram features, which beats PRA and Cluster PRA, and statistically ties our compositional method. Our analysis shows that our method has substantially different strengths than the new baseline, and the combination of the two yields a 15% improvement over Gardner et al. (2013). We also show that our zero-shot model is indeed capable of predicting new unseen relation types.
We give background on PRA, which we use to obtain a set of paths connecting the entity pairs, and on the RNN model, which we employ to model the composition function.
2.1 Path Ranking Algorithm
Since it is impractical to exhaustively obtain the set of all paths connecting an entity pair in the large KB graph, we use PRA [Lao et al., 2011] to obtain a set of paths connecting the entity pairs. Given a training set of entity pairs for a relation, PRA heuristically finds a set of paths by performing random walks from the source and target nodes, keeping the most common paths. We use PRA to find millions of distinct paths per relation type. We do not use the random walk probabilities given by PRA since using them did not yield improvements in our experiments.
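The idea of collecting frequent path types by random walks can be sketched as follows. This is a deliberately simplified illustration and not the PRA implementation: it walks only forward from the source node, and the `FoundedBy` triple, the function names, and the walk parameters are all hypothetical.

```python
import random
from collections import Counter, defaultdict

TRIPLES = [
    ("Microsoft", "IsBasedIn", "Seattle"),
    ("Microsoft", "FoundedBy", "Bill Gates"),  # hypothetical distractor edge
    ("Seattle", "StateLocatedIn", "Washington"),
    ("Washington", "CountryLocatedIn", "USA"),
]


def build_adjacency(triples):
    """Map each head entity to its outgoing (relation, tail) edges."""
    adjacency = defaultdict(list)
    for head, relation, tail in triples:
        adjacency[head].append((relation, tail))
    return adjacency


def sample_path_types(adjacency, source, target, num_walks=1000, max_length=3, seed=0):
    """Count the relation sequences of random walks that reach `target` from `source`."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(num_walks):
        node, relations = source, []
        for _ in range(max_length):
            edges = adjacency.get(node)
            if not edges:  # dead end: abandon this walk
                break
            relation, node = rng.choice(edges)
            relations.append(relation)
            if node == target:
                counts[tuple(relations)] += 1
                break
    return counts


adjacency = build_adjacency(TRIPLES)
counts = sample_path_types(adjacency, "Microsoft", "USA")
most_common_paths = [path for path, _ in counts.most_common(10)]
```

Keeping only the most frequently sampled relation sequences gives a manageable set of path types per entity pair without enumerating the whole graph.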
2.2 Recurrent Neural Networks
A recurrent neural network (RNN) [Werbos, 1990] is a neural network that constructs vector representations for sequences of any length. For example, an RNN model can be used to construct vector representations for phrases or sentences (of any length) in natural language by applying a composition function [Mikolov et al., 2010; Sutskever et al., 2014; Vinyals et al., 2014]. The vector representation of a phrase $(w_1, w_2)$ consisting of words $w_1$ and $w_2$ is given by $f(W[v(w_1); v(w_2)])$, where $v(w) \in \mathbb{R}^d$ is the vector representation of $w$, $f$ is an element-wise non-linearity function, $[a; b]$ represents the concatenation of two vectors $a$ and $b$ along with a bias term, and $W \in \mathbb{R}^{d \times (2d+1)}$ is the composition matrix. This operation can be repeated to construct vector representations of longer phrases.
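A single composition step can be sketched in a few lines of plain Python. The choice of the sigmoid as the non-linearity and the name `compose` are assumptions made for illustration; the matrix has one row per output dimension and one column per entry of the concatenated input plus the bias.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def compose(W, a, b, f=sigmoid):
    """Return f(W [a; b; 1]) for d-dimensional a and b and a d x (2d + 1) matrix W."""
    x = list(a) + list(b) + [1.0]  # concatenation followed by the bias feature
    return [f(sum(w_ij * x_j for w_ij, x_j in zip(row, x))) for row in W]


# Hypothetical 2-dimensional word vectors and an all-zero 2 x 5 matrix.
zero_W = [[0.0] * 5 for _ in range(2)]
phrase_vector = compose(zero_W, [1.0, 2.0], [3.0, 4.0])
print(phrase_vector)  # [0.5, 0.5]
```

The output has the same dimension as each input, which is what allows the step to be applied again to build representations of longer sequences.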
3 Recurrent Neural Networks for KB Completion
This paper proposes an RNN model for KB completion that reasons on the paths connecting an entity pair to predict missing relation types. The vector representations of the paths (of any length) in the KB graph are computed by applying the composition function recursively as shown in Figure 2. To compute the vector representations for the higher nodes in the tree, the composition function consumes the vector representations of the node’s two child nodes and outputs a new vector of the same dimension. Predictions about missing relation types are made by comparing the vector representation of the path with the vector representation of the relation using the sigmoid function.
We represent each binary relation using a $d$-dimensional real-valued vector. We model composition using recurrent neural networks [Werbos, 1990]. We learn a separate composition matrix for every relation that is predicted.
Let $v_r(\delta) \in \mathbb{R}^d$ be the vector representation of relation $\delta$ and $v_p(\pi) \in \mathbb{R}^d$ be the vector representation of path $\pi$. $v_p(\pi)$ denotes the relation vector if path $\pi$ is of length one. To predict relation $\delta$ = CountryOfHeadquarters, the vector representation of the path $\pi$ = IsBasedIn → StateLocatedIn containing two relations IsBasedIn and StateLocatedIn is computed by (Figure 2),
$$v_p(\pi) = f(W_\delta [v_r(\text{IsBasedIn}); v_r(\text{StateLocatedIn})])$$
where $f$ is the element-wise non-linearity function, $W_\delta \in \mathbb{R}^{d \times (2d+1)}$ is the composition matrix for $\delta$ = CountryOfHeadquarters, and $[a; b]$ represents the concatenation of two vectors $a \in \mathbb{R}^d$ and $b \in \mathbb{R}^d$ along with a bias feature to get a new vector $[a; b] \in \mathbb{R}^{2d+1}$.
The vector representation of the path $\Pi$ = IsBasedIn → StateLocatedIn → CountryLocatedIn in Figure 2 is computed similarly by,
$$v_p(\Pi) = f(W_\delta [v_p(\pi); v_r(\text{CountryLocatedIn})])$$
where $v_p(\pi)$ is the vector representation of path IsBasedIn → StateLocatedIn. While computing the vector representation of a path we always traverse left to right, composing the relation vector on the right with the accumulated path vector on the left (we did not get significant improvements when we tried more sophisticated ordering schemes for computing the path representations). This makes our model a recurrent neural network [Werbos, 1990].
Finally, we make a prediction regarding CountryOfHeadquarters(Microsoft, USA) using the path $\Pi$ = IsBasedIn → StateLocatedIn → CountryLocatedIn by comparing the vector representation of the path ($v_p(\Pi)$) with the vector representation of the relation CountryOfHeadquarters ($v_r(\text{CountryOfHeadquarters})$) using the sigmoid function.
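The left-to-right composition and the final comparison can be sketched as a fold over the relation vectors of the path. All numeric values below are hypothetical, the sigmoid non-linearity is an assumed choice, and the function names are illustrative rather than taken from any released code.

```python
import math
from functools import reduce


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def compose(W, left, right):
    """f(W [left; right; 1]) with f = sigmoid (an assumed choice of non-linearity)."""
    x = list(left) + list(right) + [1.0]
    return [sigmoid(sum(w * v for w, v in zip(row, x))) for row in W]


def path_vector(W, relation_vectors):
    """Fold the relation vectors left to right; a length-one path is its relation vector."""
    return reduce(lambda acc, nxt: compose(W, acc, nxt), relation_vectors)


def fact_score(path_vec, target_relation_vec):
    """Sigmoid of the dot product between the path vector and the target relation vector."""
    return sigmoid(sum(p * r for p, r in zip(path_vec, target_relation_vec)))


# Hypothetical 2-dimensional relation vectors and a 2 x 5 composition matrix.
v_r = {
    "IsBasedIn": [0.2, -0.1],
    "StateLocatedIn": [0.4, 0.3],
    "CountryLocatedIn": [-0.5, 0.6],
    "CountryOfHeadquarters": [1.0, 1.0],
}
W = [[0.1, 0.2, -0.3, 0.4, 0.0],
     [-0.2, 0.1, 0.5, -0.1, 0.0]]

path = ["IsBasedIn", "StateLocatedIn", "CountryLocatedIn"]
v_path = path_vector(W, [v_r[name] for name in path])
probability = fact_score(v_path, v_r["CountryOfHeadquarters"])
```

Because the fold reuses the same matrix at every step, the same parameters handle paths of any length, including paths never observed during training.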
3.1 Model Training
We train the model using the existing facts in a KB as positive examples; negative examples are obtained by treating unobserved instances as negative [Mintz et al., 2009; Lao et al., 2011; Riedel et al., 2013; Bordes et al., 2013]. Unlike in previous work that uses RNNs [Socher et al., 2011; Iyyer et al., 2014; Irsoy and Cardie, 2014], a challenge in our task is that, among the set of paths connecting an entity pair, we do not observe which path(s) are predictive of a relation. We select the path that is closest to the relation type to be predicted in the vector space. This not only allows for faster training (compared to marginalization) but also gives improved performance. This technique has previously been used successfully in models other than RNNs [Weston et al., 2013; Neelakantan et al., 2014].
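We assume that we are given a KB (for example, Freebase enriched with SVO triples) containing a set of entity pairs $\Gamma$, a set of relations $\Delta$, and a set of observed facts $\Lambda^+$, where each $\lambda = (\gamma, \delta) \in \Lambda^+$ ($\gamma \in \Gamma$, $\delta \in \Delta$) indicates a positive fact that entity pair $\gamma$ is in relation $\delta$. Let $\Phi_\delta(\gamma)$ denote the set of paths connecting entity pair $\gamma$ given by PRA for predicting relation $\delta$.
In our task, we only observe the set of paths connecting an entity pair but we do not observe which of the path(s) is predictive of the fact. We treat this as a latent variable ($\phi_\lambda$ for the fact $\lambda$) and we assign $\phi_\lambda$ the path whose vector representation has maximum dot product with the vector representation of the relation to be predicted. For example, $\phi_\lambda$ for the fact $\lambda = (\gamma, \delta) \in \Lambda^+$ is given by,
$$\phi_\lambda = \arg\max_{\pi \in \Phi_\delta(\gamma)} v_p(\pi) \cdot v_r(\delta)$$
where $v_p(\pi)$ is the vector representation of path $\pi$ and $v_r(\delta)$ is the vector representation of relation $\delta$.
During training, we assign $\phi_\lambda$ using the current parameter estimates. We use the same procedure to assign $\phi_\lambda$ for unobserved facts that are used as negative examples during training.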
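The latent path assignment amounts to an argmax over dot products, which can be sketched as below. The path vectors are assumed to have been computed already by the composition function; the second path, the numeric values, and the function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def select_path(path_vectors, relation_vector):
    """Return the path whose vector has the largest dot product with the relation vector."""
    return max(path_vectors, key=lambda name: dot(path_vectors[name], relation_vector))


# Hypothetical path vectors for one entity pair, keyed by a readable path name.
path_vectors = {
    "IsBasedIn -> StateLocatedIn -> CountryLocatedIn": [0.9, 0.8],
    "FoundedBy -> Nationality": [0.1, 0.2],  # hypothetical second path
}
target_relation = [1.0, 1.0]  # hypothetical vector for CountryOfHeadquarters
selected = select_path(path_vectors, target_relation)
```

Since the assignment depends on the current vectors, it would be recomputed as the parameters change during training.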
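We train a separate RNN model for predicting each relation, and the parameters $\Theta$ of the model for predicting relation $\delta$ are the relation vectors $v_r(\omega)$ of all relations $\omega$ together with the composition matrix $W_\delta$. Given a training set consisting of positive ($\Lambda^+_\delta$) and negative ($\Lambda^-_\delta$) instances for relation $\delta$ (we sub-sample a portion of the set of all unobserved instances), the parameters are trained to maximize the log likelihood of the training set with L2 regularization. Omitting the regularization term, the objective is
$$\Theta^* = \arg\max_\Theta \sum_{\lambda \in \Lambda^+_\delta} \log P(y_\lambda = 1; \Theta) + \sum_{\lambda \in \Lambda^-_\delta} \log P(y_\lambda = 0; \Theta)$$
where $y_\lambda$ is a binary random variable which takes the value $1$ if the fact $\lambda$ is true and $0$ otherwise, and the probability of a fact $P(y_\lambda = 1; \Theta)$ is given by,
$$P(y_\lambda = 1; \Theta) = \mathrm{sigmoid}(v_p(\phi_\lambda) \cdot v_r(\delta))$$
where $\phi_\lambda$ is the path selected for the fact $\lambda$ (the one whose vector has maximum dot product with the relation vector), and $P(y_\lambda = 0; \Theta) = 1 - P(y_\lambda = 1; \Theta)$. The relation vectors and the composition matrix are initialized randomly. We train the network using backpropagation through structure [Goller and Küchler, 1996].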
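The training objective can be evaluated directly once a path has been selected for each training fact. The sketch below assumes a penalty of the form (weight / 2) times the squared norm of the parameters, which is one common way to write L2 regularization and not necessarily the exact form used here; function names and numbers are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fact_probability(path_vec, relation_vec):
    """P(y = 1): sigmoid of the dot product of the selected path and the relation."""
    return sigmoid(sum(p * r for p, r in zip(path_vec, relation_vec)))


def regularized_log_likelihood(positive_paths, negative_paths, relation_vec,
                               parameters, l2_weight):
    """Log likelihood of the facts minus an assumed (l2_weight / 2) * ||parameters||^2."""
    log_likelihood = 0.0
    for path_vec in positive_paths:
        log_likelihood += math.log(fact_probability(path_vec, relation_vec))
    for path_vec in negative_paths:
        log_likelihood += math.log(1.0 - fact_probability(path_vec, relation_vec))
    penalty = 0.5 * l2_weight * sum(p * p for p in parameters)
    return log_likelihood - penalty


# Hypothetical selected path vectors for one positive and one negative fact.
relation_vec = [1.0, -1.0]
objective = regularized_log_likelihood(
    positive_paths=[[0.9, 0.1]],
    negative_paths=[[0.1, 0.9]],
    relation_vec=relation_vec,
    parameters=relation_vec,  # flattened parameters; only the relation vector here
    l2_weight=0.01,           # placeholder value
)
```

A gradient-based optimizer would increase this quantity with respect to the relation vectors and the composition matrix.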
4 Zero-shot KB Completion
The KB completion task involves predicting facts for thousands of relation types, and it is highly desirable that a method can infer facts about relation types without directly training for them. Given the vector representations of the relations, we show that the model described in the previous section is capable of predicting relational facts without explicitly training for the target (or test) relation types (zero-shot learning).
In zero-shot or zero-data learning [Larochelle et al., 2008; Palatucci et al., 2009], some labels or classes are not available during training and only a description of those classes is given at prediction time. We make two modifications to the model described in the previous section: (1) learn a general composition matrix, and (2) fix relation vectors with pre-trained vectors, so that we can predict relations that are unseen during training. This ability of the model to generalize to unseen relations is beyond the capabilities of all previous methods for KB inference [Schoenmackers et al., 2010; Lao et al., 2011; Gardner et al., 2013; Gardner et al., 2014].
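We learn a general composition matrix for all relations instead of learning a separate composition matrix for every relation to be predicted. So, for example, the vector representation of the path $\pi$ = IsBasedIn → StateLocatedIn containing two relations IsBasedIn and StateLocatedIn is computed by (Figure 2),
$$v_p(\pi) = f(W [v_r(\text{IsBasedIn}); v_r(\text{StateLocatedIn})])$$
where $v_r(\cdot)$ denotes a relation vector, $f$ is the element-wise non-linearity function, and $W$ is the general composition matrix.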
We initialize the vector representations of the binary relations ($v_r$) using the representations learned in Riedel et al. (2013) and do not update them during training. The relation vectors are not updated because at prediction time we would be predicting relation types which are never seen during training, and hence their vectors would never get updated. We learn only the general composition matrix in this model. We train a single model for a set of relation types by replacing the sigmoid function with a softmax function when computing probabilities. The parameters of the composition matrix are learned using the available training data, which contains instances of only a few relations. The other aspects of the model remain unchanged.
To predict facts whose relation types are unseen during training, we compute the vector representation of the path using the general composition matrix and compute the probability of the fact using the pre-trained relation vector. For example, using the vector representation of the path IsBasedIn → StateLocatedIn → CountryLocatedIn in Figure 2, we can predict any relation, regardless of whether it was seen during training, by comparing the path vector with the pre-trained relation vectors.
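Prediction in the zero-shot setting can be sketched as a softmax over dot products between a path vector and a table of fixed relation vectors. The vectors and the relation name `PlaceOfBirth` are hypothetical; the point of the sketch is that a relation can be scored as long as it has a pre-trained vector, even if no training example for it was ever used to fit the composition matrix.

```python
import math


def softmax_scores(path_vec, relation_vectors):
    """Softmax over dot products between a path vector and each candidate relation vector."""
    logits = {
        name: sum(p * r for p, r in zip(path_vec, vec))
        for name, vec in relation_vectors.items()
    }
    largest = max(logits.values())  # subtract the max for numerical stability
    exps = {name: math.exp(z - largest) for name, z in logits.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


# Hypothetical fixed (pre-trained) relation vectors; the second relation is assumed
# to be absent from the training data of the composition matrix.
pretrained = {
    "CountryOfHeadquarters": [2.0, 0.0],
    "PlaceOfBirth": [0.0, 2.0],
}
path_vec = [0.9, 0.1]  # hypothetical output of the general composition matrix
probabilities = softmax_scores(path_vec, pretrained)
best_relation = max(probabilities, key=probabilities.get)
```

Adding a new relation type at prediction time only requires adding its pre-trained vector to the table.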
The hyperparameters of all the models were tuned on the same held-out development data. All the neural network models are trained for a fixed number of iterations using relation vectors of a fixed dimensionality, with an L2 regularizer and an initial learning rate. We halve the learning rate at regular intervals and use mini-batches. The neural networks and the classifiers were optimized using AdaGrad [Duchi et al., 2011].
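For readers unfamiliar with the optimizer, the AdaGrad update combined with a halving schedule can be sketched as follows. All numeric settings in the example are placeholders and not the values used in these experiments, and the toy quadratic loss is only there to make the sketch runnable. AdaGrad is written here as a minimizer, so it would be applied to the negative of the log-likelihood objective.

```python
import math


def adagrad_step(params, grads, squared_grad_sums, learning_rate, eps=1e-8):
    """In-place AdaGrad update: each coordinate is scaled by its own gradient history."""
    for i, grad in enumerate(grads):
        squared_grad_sums[i] += grad * grad
        params[i] -= learning_rate * grad / (math.sqrt(squared_grad_sums[i]) + eps)


def halved_learning_rate(initial_rate, iteration, halve_every):
    """Step schedule: halve the rate after every `halve_every` iterations."""
    return initial_rate * 0.5 ** (iteration // halve_every)


# Placeholder settings, not the values used in the paper.
params, sums = [0.0], [0.0]
for iteration in range(200):
    rate = halved_learning_rate(1.0, iteration, halve_every=60)
    grads = [2.0 * (params[0] - 3.0)]  # gradient of the toy loss (x - 3)^2
    adagrad_step(params, grads, sums, rate)
```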
Table 1: Statistics of our dataset.
|Statistic|Value|
|Relation types tested|46|
|Avg. training facts/relation|6,638|
|Avg. positive test instances/relation|3,492|
|Avg. negative test instances/relation|43,160|
We ran experiments on Freebase [Bollacker et al., 2008] enriched with information from ClueWeb. We use the publicly available entity links to Freebase in the ClueWeb dataset [Orr et al., 2013]. Hence, we create nodes only for Freebase entities in our KB graph. We remove facts containing /type/object/type as they do not give useful predictive information for our task. We get triples from ClueWeb by considering sentences that contain two entities linked to Freebase. We extract the phrase between the two entities and treat it as the relation type. For phrases that are of length greater than four we keep only the first and last two words. This helps us avoid the time-consuming step of dependency parsing the sentence to get the relation type. These triples are similar to facts obtained by OpenIE [Banko et al., 2007]. To reduce noise, we discard relation types that occur too infrequently. We evaluate on the 46 relation types in Freebase that have the most instances. The methods are evaluated on a subset of facts in Freebase that were hidden during training. Table 1 shows important statistics of our dataset.
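The phrase-shortening heuristic can be sketched as below. Two points are assumptions: "the first and last two words" is read as the first two plus the last two words, and the words are joined with commas only because the textual relations in Table 2 are written that way. The function name and the example sentence fragment are hypothetical.

```python
def text_relation(words_between, max_length=4, keep=2):
    """Build a textual relation type from the words between two linked entity mentions."""
    words = list(words_between)
    if len(words) > max_length:
        # Assumed reading: keep the first two and the last two words.
        words = words[:keep] + words[-keep:]
    return ",".join(words)


short = text_relation(["was", "born", "in"])
long = text_relation("is the well known capital city of".split())
print(short)  # was,born,in
print(long)   # is,the,city,of
```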
5.2 Predictive Paths
Table 2 shows predictive paths for relations learned by the RNN model. The high quality of unseen paths is indicative of the fact that the RNN model is able to generalize to paths that are never seen during training.
|Relation: /book/written_work/original_language/ (book “x” written in language “y”)|
|/book/written_work/previous_in_series /book/written_work/author /people/person/nationality /people/person/nationality|
|/book/written_work/author /people/ethnicity/people /people/ethnicity/languages_spoken|
|”in” - ”writer” /people/person/nationality /people/person/languages|
|/book/written_work/author addresses /people/person/nationality /people/person/languages|
|Relation: /people/person/place_of_birth/ (person “x” born in place “y”)|
|“was,born,in” /location/mailing_address/citytown /location/mailing_address/state_province_region|
|“born,in” /location/location/contains “near”|
|Relation: /geography/river/cities/ (river “x” flows through or borders “y”)|
|“meets,the” /transportation/bridge/body_of_water_spanned /location/location/contains “in”|
|/geography/lake/outflow /location/location/contains “near”|
|Relation: /people/family/members/ (person “y” part of family “x”)|
|/royalty/monarch/royal_line /people/person/children /royalty/monarch/royal_line|
|/royalty/royal_line/monarchs_from_this_line /people/person/parents /people/person/parents /people/person/parents|
|/royalty/monarch/royal_line “leader” “king” “was,married,to”|
|“of,the” “but,also,of” “married” “defended”|
Using our dataset, we compare the performance of the following methods:
PRA Classifier is the method of Lao et al. (2012), which trains a logistic regression classifier by creating a feature for every path type.
Cluster PRA Classifier is the method of Gardner et al. (2013), which replaces relation types from ClueWeb triples with their cluster membership in the KB graph before the path finding step. After this step, their method proceeds in exactly the same manner as Lao et al. (2012), training a logistic regression classifier by creating a feature for every path type. We use pre-trained relation vectors from Riedel et al. (2013) and use k-means clustering to cluster the relation types, as done in Gardner et al. (2013).
Composition-Add uses a simple element-wise addition followed by a sigmoid non-linearity as the composition function, similar to Yang et al. (2014).
RNN-random is the supervised RNN model described in section 3 with the relation vectors initialized randomly.
RNN is the supervised RNN model described in section 3 with the relation vectors initialized using the method of Riedel et al. (2013).
PRA Classifier-b is our simple extension to the method of Lao et al. (2012) which additionally uses bigrams in the path as features. We add a special start and stop symbol to the path before computing the bigram features.
Cluster PRA Classifier-b is our simple extension to the method of Gardner et al. (2013) which additionally uses bigram features computed as previously described.
RNN + PRA Classifier combines the predictions of RNN and PRA Classifier. We combine the predictions by scoring each fact with the sum of its ranks in the two models, after sorting the facts in ascending order.
RNN + PRA Classifier-b combines the predictions of RNN and PRA Classifier-b using the technique described previously.
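The rank-based combination used by the two combined methods can be sketched as follows. The sketch assumes that facts are sorted in ascending order of each model's score, so that a larger rank means a more confident prediction, and that ties are broken by input order; the fact names and scores are hypothetical.

```python
def ascending_ranks(scores):
    """Rank facts from 1 (lowest score) upward; ties keep their input order."""
    ordered = sorted(scores, key=scores.get)
    return {fact: rank for rank, fact in enumerate(ordered, start=1)}


def combine_by_rank(scores_a, scores_b):
    """Score each fact by the sum of its ranks under the two models."""
    ranks_a = ascending_ranks(scores_a)
    ranks_b = ascending_ranks(scores_b)
    return {fact: ranks_a[fact] + ranks_b[fact] for fact in scores_a}


# Hypothetical scores from two models for the same three candidate facts.
rnn_scores = {"fact1": 0.9, "fact2": 0.2, "fact3": 0.5}
classifier_scores = {"fact1": 0.3, "fact2": 0.1, "fact3": 0.8}
combined = combine_by_rank(rnn_scores, classifier_scores)
print(combined)  # {'fact1': 5, 'fact2': 2, 'fact3': 5}
```

Summing ranks rather than raw scores avoids having to calibrate the two models' outputs against each other.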
Table 3 shows the results of our experiments. The method described in Gardner et al. (2014) is not included in the table since the publicly available implementation does not scale to our large dataset. First, we show that it is better to train the models using all the path types instead of using only the top 1000 path types as done in previous work [Gardner et al., 2013; Gardner et al., 2014]. We can see that the RNN model performs significantly better than the baseline methods of Lao et al. (2012) and Gardner et al. (2013). The performance of the RNN model is not affected by initialization, since using random vectors and pre-trained vectors results in similar performance.
A surprising result is the impressive performance of our simple extension to the classifier approach. After the addition of bigram features, the naive PRA method is as effective as the Cluster PRA method. The small difference in performance between RNN and both PRA Classifier-b and Cluster PRA Classifier-b is not statistically significant. We conjecture that our method has substantially different strengths than the new baseline. While the classifier with bigram features has an ability to accurately memorize important local structure, the RNN model generalizes better to unseen paths that are very different from the paths seen in training. Empirically, combining the predictions of RNN and PRA Classifier-b achieves a statistically significant gain over PRA Classifier-b.
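The bigram features of the classifier baselines can be sketched as below. The start and stop symbol strings and the relation `CityLocatedIn` are hypothetical choices for illustration. The example shows how two different paths still share some bigram features, which is the local structure the classifier can memorize.

```python
def path_bigrams(relations, start="<s>", stop="</s>"):
    """Return the bigrams of a relation path padded with start and stop symbols."""
    padded = [start, *relations, stop]
    return list(zip(padded, padded[1:]))


seen = path_bigrams(["IsBasedIn", "StateLocatedIn", "CountryLocatedIn"])
# Hypothetical path that differs from the first one in its middle relation.
unseen = path_bigrams(["IsBasedIn", "CityLocatedIn", "CountryLocatedIn"])
shared = set(seen) & set(unseen)
```

Here the two paths are distinct atomic features, yet they share the bigrams that mark how the path starts and ends.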
Table 3: Results of our experiments.
|Method|train with top 1000 paths|train with all paths|
|Cluster PRA Classifier|46.26|53.23|
|Cluster PRA Classifier-b|48.72|58.02|
|RNN + PRA Classifier|49.92|58.42|
|RNN + PRA Classifier-b|51.94|61.17|
Table 4 shows the results of the zero-shot model described in section 4 compared with the fully supervised RNN model (section 3) and a baseline that produces a random ordering of the test facts. We evaluate on a randomly selected subset of the 46 relation types; for the fully supervised version we therefore train one RNN for each selected relation type. For evaluating the zero-shot model, we randomly split the relations into two sets of equal size, train a zero-shot model on one set, and test on the other set. So, in this case we have two RNNs making predictions on relation types that they have never seen during training. As expected, the fully supervised RNN outperforms the zero-shot model by a large margin, but the zero-shot model, without using any direct supervision, clearly performs much better than a random baseline.
To investigate whether the performance of the RNNs was affected by multiple local optima, we combined the predictions of five different RNNs trained using all the paths. Apart from RNN and RNN-random, we trained three more RNNs with different random initializations. The performance of the ensemble stopped improving after using three RNNs. This indicates that, even though multiple local optima affect the performance, they are likely not the only issue, since the performance of the ensemble is still lower than that of RNN + PRA Classifier-b.
We suspect the RNN model does not capture some of the important local structure as well as the classifier using bigram features. To overcome this drawback, in future work, we plan to explore compositional models that have a longer memory [Hochreiter and Schmidhuber, 1997; Cho et al., 2014; Mikolov et al., 2014]. We also plan to include vector representations for the entities and develop models that address the issue of polysemy in verb phrases [Cheng et al., 2014].
6 Related Work
KB Completion includes methods such as Lin and Pantel (2001), Yates and Etzioni (2007) and Berant et al. (2011) that learn inference rules of length one. Schoenmackers et al. (2010) learn general inference rules by considering the set of all paths in the KB and selecting paths that satisfy a certain precision threshold. Their method does not scale well to modern KBs and also depends on carefully tuned thresholds. Lao et al. (2011) train a simple logistic regression classifier with NELL KB paths as features to perform KB completion, while Gardner et al. (2013) and Gardner et al. (2014) extend it by using pre-trained relation vectors to overcome feature sparsity. Recently, Yang et al. (2014) learn inference rules using simple element-wise addition or multiplication as the composition function.
Compositional Vector Space Models have been developed to represent phrases and sentences in natural language as vectors [Mitchell and Lapata, 2008; Baroni and Zamparelli, 2010; Yessenalina and Cardie, 2011]. Neural networks have been successfully used to learn vector representations of phrases using the vector representations of the words in that phrase. Recurrent neural networks have been used for many tasks such as language modeling [Mikolov et al., 2010], machine translation [Sutskever et al., 2014] and parsing [Vinyals et al., 2014]. Recursive neural networks, a more general version of recurrent neural networks, have been used for many tasks like parsing [Socher et al., 2011], sentiment classification [Socher et al., 2012; Socher et al., 2013c; Irsoy and Cardie, 2014], question answering [Iyyer et al., 2014] and natural language logical semantics [Bowman et al., 2014]. Our overall approach is similar to RNNs with attention [Bahdanau et al., 2014; Graves, 2013] since we select a path among the set of paths connecting the entity pair to make the final prediction.
Zero-shot or zero-data learning was introduced in Larochelle et al. (2008) for character recognition and drug discovery. Palatucci et al. (2009) perform zero-shot learning for neural decoding, while there has been plenty of work in this direction for image recognition [Socher et al., 2013b; Frome et al., 2013; Norouzi et al., 2014].
We develop a compositional vector space model for knowledge base completion using recurrent neural networks. On our challenging large-scale dataset, available at http://iesl.cs.umass.edu/downloads/inferencerules/release.tar.gz, our method outperforms two baseline methods and performs competitively with a modified stronger baseline. The best results are obtained by combining the predictions of our model with the predictions of the modified baseline, which achieves a 15% improvement over Gardner et al. (2013). We also show that our model has the ability to perform zero-shot inference.
We thank Matt Gardner for releasing the PRA code, and for answering numerous questions about the code and data. We also thank the Stanford NLP group for releasing the neural networks code. This work was supported in part by the Center for Intelligent Information Retrieval, in part by DARPA under agreement number FA8750-13-2-0020, in part by an award from Google, and in part by NSF grant #CNS-0958392. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor.
- [Bahdanau et al., 2014] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. In ArXiv.
- [Banko et al., 2007] Michele Banko, Michael J Cafarella, Stephen Soderland, Matt Broadhead, and Oren Etzioni. 2007. Open information extraction from the web. In International Joint Conference on Artificial Intelligence.
- [Baroni and Zamparelli, 2010] Marco Baroni and Roberto Zamparelli. 2010. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In Empirical Methods in Natural Language Processing.
- [Berant et al., 2011] Jonathan Berant, Ido Dagan, and Jacob Goldberger. 2011. Global learning of typed entailment rules. In Association for Computational Linguistics.
- [Bollacker et al., 2008] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the ACM SIGMOD International Conference on Management of Data.
- [Bordes et al., 2013] Antoine Bordes, Nicolas Usunier, Alberto García-Durán, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems.
- [Bowman et al., 2014] Samuel R. Bowman, Christopher Potts, and Christopher D Manning. 2014. Recursive neural networks for learning logical semantics. In CoRR.
- [Carlson et al., 2010] Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R. Hruschka Jr., and Tom M. Mitchell. 2010. Toward an architecture for never-ending language learning. In AAAI.
- [Cheng et al., 2014] Jianpeng Cheng, Dimitri Kartsaklis, and Edward Grefenstette. 2014. Investigating the role of prior disambiguation in deep-learning compositional models of meaning. In Learning Semantics Workshop, NIPS.
- [\citenameCho et al.2014] Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder–decoder approaches. In Workshop on Syntax, Semantics and Structure in Statistical Translation.
- [\citenameDuchi et al.2011] John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. In Journal of Machine Learning Research.
- [\citenameFrome et al.2013] Andrea Frome, Gregory S. Corrado, Jonathon Shlens, Samy Bengio, Jeffrey Dean, Marc’Aurelio Ranzato, and Tomas Mikolov. 2013. Devise: A deep visual-semantic embedding model. In Neural Information Processing Systems.
- [\citenameGardner et al.2013] Matt Gardner, Partha Pratim Talukdar, Bryan Kisiel, and Tom M. Mitchell. 2013. Improving learning and inference in a large knowledge-base using latent syntactic cues. In Empirical Methods in Natural Language Processing.
- [\citenameGardner et al.2014] Matt Gardner, Partha Talukdar, Jayant Krishnamurthy, and Tom Mitchell. 2014. Incorporating vector space similarity in random walk inference over knowledge bases. In Empirical Methods in Natural Language Processing.
- [\citenameGoller and Küchler1996] Christoph Goller and Andreas Küchler. 1996. Learning task-dependent distributed representations by backpropagation through structure. In IEEE Transactions on Neural Networks.
- [\citenameGraves2013] Alex Graves. 2013. Generating sequences with recurrent neural networks. In ArXiv.
- [\citenameHochreiter and Schmidhuber1997] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. In Neural Computation.
- [\citenameIrsoy and Cardie2014] Ozan Irsoy and Claire Cardie. 2014. Deep recursive neural networks for compositionality in language. In Neural Information Processing Systems.
- [\citenameIyyer et al.2014] Mohit Iyyer, Jordan Boyd-Graber, Leonardo Claudino, Richard Socher, and Hal Daumé III. 2014. A neural network for factoid question answering over paragraphs. In Empirical Methods in Natural Language Processing.
- [\citenameLao et al.2011] Ni Lao, Tom Mitchell, and William W. Cohen. 2011. Random walk inference and learning in a large scale knowledge base. In Conference on Empirical Methods in Natural Language Processing.
- [\citenameLao et al.2012] Ni Lao, Amarnag Subramanya, Fernando Pereira, and William W. Cohen. 2012. Reading the web with learned syntactic-semantic inference rules. In Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.
- [\citenameLarochelle et al.2008] Hugo Larochelle, Dumitru Erhan, and Yoshua Bengio. 2008. Zero-data learning of new tasks. In National Conference on Artificial Intelligence.
- [\citenameLin and Pantel2001] Dekang Lin and Patrick Pantel. 2001. Dirt - discovery of inference rules from text. In International Conference on Knowledge Discovery and Data Mining.
- [\citenameMikolov et al.2010] Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Annual Conference of the International Speech Communication Association.
- [\citenameMikolov et al.2014] Tomas Mikolov, Armand Joulin, Sumit Chopra, Michaël Mathieu, and Marc’Aurelio Ranzato. 2014. Learning longer memory in recurrent neural networks. In CoRR.
- [\citenameMin et al.2013] Bonan Min, Ralph Grishman, Li Wan, Chang Wang, and David Gondek. 2013. Distant supervision for relation extraction with an incomplete knowledge base. In HLT-NAACL, pages 777–782.
- [\citenameMintz et al.2009] Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Association for Computational Linguistics and International Joint Conference on Natural Language Processing.
- [\citenameMitchell and Lapata2008] Jeff Mitchell and Mirella Lapata. 2008. Vector-based models of semantic composition. In Association for Computational Linguistics.
- [\citenameNeelakantan et al.2014] Arvind Neelakantan, Jeevan Shankar, Alexandre Passos, and Andrew McCallum. 2014. Efficient non-parametric estimation of multiple embeddings per word in vector space. In Empirical Methods in Natural Language Processing.
- [\citenameNickel et al.2011] Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. 2011. A three-way model for collective learning on multi-relational data. In International Conference on Machine Learning.
- [\citenameNorouzi et al.2014] Mohammad Norouzi, Tomas Mikolov, Samy Bengio, Yoram Singer, Jonathon Shlens, Andrea Frome, Greg Corrado, and Jeffrey Dean. 2014. Zero-shot learning by convex combination of semantic embeddings. In International Conference on Learning Representations.
- [\citenameOrr et al.2013] Dave Orr, Amarnag Subramanya, Evgeniy Gabrilovich, and Michael Ringgaard. 2013. 11 billion clues in 800 million documents: A web research corpus annotated with freebase concepts. http://googleresearch.blogspot.com/2013/07/11-billion-clues-in-800-million.html.
- [\citenamePalatucci et al.2009] Mark Palatucci, Dean Pomerleau, Geoffrey Hinton, and Tom Mitchell. 2009. Zero-shot learning with semantic output codes. In Neural Information Processing Systems.
- [\citenameRiedel et al.2013] Sebastian Riedel, Limin Yao, Andrew McCallum, and Benjamin M. Marlin. 2013. Relation extraction with matrix factorization and universal schemas. In HLT-NAACL.
- [\citenameSchoenmackers et al.2010] Stefan Schoenmackers, Oren Etzioni, Daniel S. Weld, and Jesse Davis. 2010. Learning first-order horn clauses from web text. In Empirical Methods in Natural Language Processing.
- [\citenameSocher et al.2011] Richard Socher, Cliff Chiung-Yu Lin, Christopher D. Manning, and Andrew Y. Ng. 2011. Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the 26th International Conference on Machine Learning (ICML).
- [\citenameSocher et al.2012] Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. 2012. Semantic compositionality through recursive matrix-vector spaces. In Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.
- [\citenameSocher et al.2013a] Richard Socher, Danqi Chen, Christopher D Manning, and Andrew Ng. 2013a. Reasoning with neural tensor networks for knowledge base completion. In Advances in Neural Information Processing Systems.
- [\citenameSocher et al.2013b] Richard Socher, Milind Ganjoo, Christopher D Manning, and Andrew Ng. 2013b. Zero-shot learning through cross-modal transfer. In Neural Information Processing Systems.
- [\citenameSocher et al.2013c] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013c. Recursive deep models for semantic compositionality over a sentiment treebank. In Conference on Empirical Methods in Natural Language Processing.
- [\citenameSuchanek et al.2007] Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. Yago: A core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web.
- [\citenameSutskever et al.2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems.
- [\citenameVinyals et al.2014] Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. 2014. Grammar as a foreign language. In CoRR.
- [\citenameWerbos1990] Paul Werbos. 1990. Backpropagation through time: what it does and how to do it. In Proceedings of the IEEE.
- [\citenameWeston et al.2013] Jason Weston, Ron Weiss, and Hector Yee. 2013. Nonlinear latent factorization by embedding multiple user interests. In ACM International Conference on Recommender Systems.
- [\citenameYang et al.2014] Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. 2014. Embedding entities and relations for learning and inference in knowledge bases. In CoRR.
- [\citenameYates and Etzioni2007] Alexander Yates and Oren Etzioni. 2007. Unsupervised resolution of objects and relations on the web. In North American Chapter of the Association for Computational Linguistics.
- [\citenameYessenalina and Cardie2011] Ainur Yessenalina and Claire Cardie. 2011. Compositional matrix-space models for sentiment analysis. In Empirical Methods in Natural Language Processing.