Path Ranking with Attention to Type Hierarchies
The objective of the knowledge base completion problem is to infer missing information from existing facts in a knowledge base. Prior work has demonstrated the effectiveness of path-ranking based methods, which solve the problem by discovering observable patterns in knowledge graphs, consisting of nodes representing entities and edges representing relations. However, these patterns either lack accuracy because they rely solely on relations or cannot easily generalize due to the direct use of specific entity information. We introduce Attentive Path Ranking, a novel path pattern representation that leverages type hierarchies of entities to both avoid ambiguity and maintain generalization. Then, we present an end-to-end trained attention-based RNN model to discover the new path patterns from data. Experiments conducted on benchmark knowledge base completion datasets WN18RR and FB15k-237 demonstrate that the proposed model outperforms existing methods on the fact prediction task by statistically significant margins of and , respectively. Furthermore, quantitative and qualitative analyses show that the path patterns balance between generalization and discrimination.
Knowledge bases (KBs), such as WordNet [Miller1995] and Freebase [Bollacker et al.2008], have been used to provide background knowledge for tasks such as recommendation [Wang et al.2018], and visual question answering [Aditya, Yang, and Baral2018]. Such KBs typically contain facts stored in the form of triples, such as . Combined, the dataset of triples is often represented as a graph, consisting of nodes representing entities and edges representing relations. Many KBs also contain type information for entities, which can be represented as type hierarchies by ordering each entity’s types based on levels of abstraction.
Despite containing millions of facts, existing KBs still have a large amount of missing information [Min et al.2013]. As a result, robustly reasoning about missing information is important not only for improving the quality of KBs, but also for providing more reliable information for applications relying on the contained data. The objective of the knowledge base completion problem is to infer missing information from existing facts in KBs. More specifically, fact prediction is the problem of predicting whether a missing triple is true.
Prior work on fact prediction has demonstrated the effectiveness of path-ranking based methods [Lao, Mitchell, and Cohen2011, Gardner and Mitchell2015, Neelakantan, Roth, and McCallum2015, Das et al.2016], which solve the knowledge base completion problem by discovering observable patterns in knowledge graphs. A foundational approach in this area is the Path Ranking Algorithm (PRA) [Lao, Mitchell, and Cohen2011], which extracts patterns based on sequences of relations. However, relying on relation-only patterns often leads to overgeneralization, as shown in the example in Figure 1. In this example, the top triple () and its associated information are assumed to be known, and the second () and third () triples must be inferred from available information. PRA is limited to using only relation patterns, and as such uses the relation-only pattern observed in the path of the existing triple to make its predictions (in the pattern notation, a negative sign indicates the reversed direction of a relation). As the example shows, this representation enables PRA to correctly generalize to similar scenarios, such as the third triple, but the lack of context leads to overgeneralization and a failure to discriminate the second triple.
To improve the discriminativeness of PRA, [Das et al.2016] extend the above approach to incorporate entity information in addition to relations within the learned paths. Paths are expanded to include entities directly, or by aggregating all entity types for each entity. As an example, the path pattern of the known triple in Figure 1 becomes . This new approach successfully discriminates the second triple; however, it does so at the cost of generalizability, resulting in misclassification of the third triple.
In our work, we introduce a novel path-ranking approach, Attentive Path Ranking (APR), that seeks to achieve discrimination and generalization simultaneously. Our work is motivated by the distributional informativeness hypothesis [Santus et al.2014], according to which the generality of a term can be inferred from the informativeness of its most typical linguistic contexts. Similar to [Das et al.2016], we utilize a path pattern consisting of both entity types and relations. However, for each entity we use an attention mechanism [Bahdanau, Cho, and Bengio2014] to select a single type representation for that entity at the appropriate level of abstraction, thereby achieving both generalization and discrimination. As shown in Figure 1, the APR path pattern of the known triple becomes , which successfully discriminates from the second triple, and correctly generalizes to the third.
Our work makes the following contributions:
We introduce an end-to-end trained attention-based RNN model for entity type selection. Our approach builds on prior work on attention mechanisms, while contributing a novel approach for learning the appropriate levels of abstraction of entities from data.
Based on the above, we introduce Attentive Path Ranking, a novel path pattern representation that leverages type hierarchies of entities to both avoid ambiguity and maintain generalization.
Applicable to the above, and other path-ranking based methods more broadly, we introduce a novel pooling method based on attention to jointly reason about contextually important information from multiple paths.
We quantitatively validate our approach against five prior methods on two benchmark datasets: WN18RR [Dettmers et al.2018] and FB15k-237 [Toutanova et al.2015]. Our results show statistically significant improvement of on WN18RR and on FB15k-237 over state-of-the-art path-ranking techniques. Additionally, we demonstrate that our attention method is more effective than fixed levels of type abstraction, and that attention-based pooling improves performance over other standard pooling techniques. Finally, we show that APR significantly outperforms knowledge graph embedding methods on this task.
In this section, we present a summary of prior work.
Knowledge Based Reasoning
Multiple approaches to knowledge based reasoning have been proposed. Knowledge graph embedding (KGE) methods map relations and entities to vector representations in continuous spaces [Nickel et al.2015]. Methods based on inductive logic programming discover general rules from examples [Quinlan and Cameron-Jones1993]. Statistical relational learning (SRL) methods combine logics and graphical models to probabilistically reason about entities and relations [Getoor and Taskar2007]. Reinforcement learning based models treat link prediction (predicting target entities given a source entity and a relation) as Markov decision processes [Xiong, Hoang, and Wang2017, Das et al.2018]. Path-ranking based models use supervised learning to discover generalizable path patterns from graphs [Lao, Mitchell, and Cohen2011, Gardner and Mitchell2015, Neelakantan, Roth, and McCallum2015, Das et al.2016].
In the context of knowledge base completion, we focus on path-ranking based models. The Path Ranking Algorithm (PRA) [Lao, Mitchell, and Cohen2011] is the first proposed use of patterns based on sequences of relations. By using patterns as features of entity pairs, the fact prediction problem is solved as a classification problem. However, the algorithm is computationally intensive because it uses random walks to discover patterns and calculate feature weights. Subgraph Feature Extraction (SFE) [Gardner and Mitchell2015] reduces the computational complexity by treating patterns as binary features, as feature weights provide no discernible benefit to the performance, and using more efficient bi-directional breadth-first search to exhaustively search for sequences of relations and additional patterns in graphs. To generalize semantically similar patterns, [Neelakantan, Roth, and McCallum2015] use recurrent neural networks (RNNs) to create vector representations of patterns, which are then used as multidimensional features for prediction. [Das et al.2016] improve the accuracy of the RNN method by making use of the additional information in entities and entity types. We also use entity types but we focus on using types at different levels of abstraction for different entities.
Representing Hierarchical Structures
Learning representations of hierarchical structures in natural data such as text and images has been shown to be effective for tasks such as hypernym classification and textual entailment. [Vendrov et al.2016] order text and images by mapping them to a non-negative space in which entities that are closer to the origin are more general than entities that are further away. [Athiwaratkun and Wilson2018] use density order embeddings where more specific entities have smaller, concentrated probabilistic distributions and are encapsulated in broader distributions of general entities. In this work, we do not explicitly learn the representation of hierarchical types. Instead, we leverage the fact that types in type hierarchies have different levels of abstraction to create path patterns that balance generalization and discrimination.
Attention was first introduced in [Bahdanau, Cho, and Bengio2014] for machine translation, where it was used to enable the encoder-decoder model to condition generation of translated words to different parts of the original sentence. Later, cross-modality attention was shown to be effective at image captioning [Xu et al.2015] and speech recognition [Chan et al.2016]. Our approach uses attention to focus on contextually important information from multiple paths, much like the above methods. More importantly, we use attention in a novel way to efficiently discover the correct levels of abstraction for entities from a large search space.
A KB is formally defined as a set of triples, also called relation instances, , where denotes the entity set, and denotes the relation set. We make the closed world assumption that all triples in the KB are true and any triple that is not in the KB is false. The knowledge graph then can be constructed from , where nodes are entities and edges are relations. A directed edge from to with label exists for each triple in . A path between and in is denoted by , where and . The length of a path is defined as the number of relations in the path, in this case. For all pairs of entities and in the graph , we can discover a set of paths up to a fixed length, .
Our objective is, given an incomplete KB and the path set extracted from the KB, to predict whether the missing triple is true, or equivalently whether the entity pair and can be linked by .
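As a minimal illustration of the definitions above, the following sketch stores a KB as a set of triples, builds the corresponding directed graph, and enumerates the paths between two entities up to a fixed length. The toy entities and relations, and the helper names `build_graph` and `enumerate_paths`, are hypothetical; this is a plain depth-first enumeration for exposition, not the extraction procedure used in the experiments.

```python
from collections import defaultdict

def build_graph(triples):
    """Map each head entity to its outgoing (relation, tail) edges."""
    graph = defaultdict(list)
    for head, relation, tail in triples:
        graph[head].append((relation, tail))
    return graph

def enumerate_paths(graph, source, target, max_length):
    """Return all simple paths from source to target with at most
    max_length relations. A path alternates entities and relations:
    [e_1, r_1, e_2, ..., r_n, e_{n+1}], and its length is n."""
    paths = []

    def extend(path, visited):
        node = path[-1]
        if node == target and len(path) > 1:
            paths.append(list(path))
            return
        if (len(path) - 1) // 2 >= max_length:
            return  # path already contains max_length relations
        for relation, neighbor in graph.get(node, []):
            if neighbor not in visited:
                extend(path + [relation, neighbor], visited | {neighbor})

    extend([source], {source})
    return paths

# Hypothetical toy KB.
triples = [("a", "r1", "b"), ("b", "r2", "c"), ("a", "r3", "c"), ("c", "r4", "d")]
graph = build_graph(triples)
paths_a_to_c = enumerate_paths(graph, "a", "c", max_length=2)
```

Here `paths_a_to_c` contains one path of length 2 (through `b`) and one of length 1 (the direct edge `r3`); the fact prediction task then asks whether a missing triple between `a` and `c` holds given such a path set.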
Attentive Path Ranking
In this section, we present our proposed Attentive Path Ranking model, which takes as input the set of paths between entities and , , and outputs , the probability that connects and . Our model, shown in Figure 2, consists of three components: a relation encoder, an entity type encoder, and an attention-based pooling method. Given a path in , the relation encoder encodes the sequence of relations (e.g., in Figure 2). The entity type encoder both selects a type for each entity with a certain level of abstraction, and encodes the selected types (e.g., ). We then combine the relation and entity type encodings to form the path pattern . The above process is repeated for each path in . Attention-based pooling is then used to combine all path patterns to predict the truth value of . In the following sections we present details of the three core model components.
The relation encoder uses an LSTM [Hochreiter and Schmidhuber1997] to sequentially encode vector embeddings of relations for all relations in the path . Here, the trainable vector embeddings help generalize semantically similar relations by representing them as similar vectors. The last state of the LSTM is used as the vector representation of the sequence of relations, denoted by . We use an LSTM instead of a simple RNN for its ability to model long-term dependencies, which aids in modeling longer paths.
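The relation encoder can be sketched in NumPy as follows. This is an illustrative forward pass only: the relation vocabulary, the dimensions, the random (untrained) weights, and the function name `lstm_last_state` are all assumptions made for the example, and the gate layout follows the standard LSTM formulation rather than any specific implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_last_state(inputs, W, U, b):
    """Run a single-layer LSTM over inputs of shape (T, d_in) and
    return the last hidden state of shape (d_h,).
    W: (4*d_h, d_in), U: (4*d_h, d_h), b: (4*d_h,).
    Gate order in the stacked weights: input, forget, candidate, output."""
    d_h = U.shape[1]
    h = np.zeros(d_h)
    c = np.zeros(d_h)
    for x in inputs:
        z = W @ x + U @ h + b
        i = sigmoid(z[0 * d_h:1 * d_h])   # input gate
        f = sigmoid(z[1 * d_h:2 * d_h])   # forget gate
        g = np.tanh(z[2 * d_h:3 * d_h])   # candidate cell state
        o = sigmoid(z[3 * d_h:4 * d_h])   # output gate
        c = f * c + i * g
        h = o * np.tanh(c)
    return h

rng = np.random.default_rng(0)
relation_vocab = {"r1": 0, "-r1": 1, "r2": 2}  # hypothetical relations
embed_dim, hidden_dim = 8, 16                   # assumed sizes
relation_embeddings = rng.normal(scale=0.1, size=(len(relation_vocab), embed_dim))
W = rng.normal(scale=0.1, size=(4 * hidden_dim, embed_dim))
U = rng.normal(scale=0.1, size=(4 * hidden_dim, hidden_dim))
b = np.zeros(4 * hidden_dim)

# Encode the relation sequence of one path; the last state is the encoding.
path_relations = ["r1", "-r1", "r2"]
inputs = relation_embeddings[[relation_vocab[r] for r in path_relations]]
relation_encoding = lstm_last_state(inputs, W, U, b)
```

In a trained model the embedding table and LSTM weights would be learned jointly with the rest of the network; here they are random so that the example stays self-contained.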
Entity Type Encoder
The entity type encoder consists of attention modules applied to entity types and a second LSTM. Together, these models are responsible for selecting a type for each entity from its type hierarchy , where the lowest level represents the most specific type and the highest level represents the most abstract type in a hierarchy of height .
Following the distributional informativeness hypothesis [Santus et al.2014], selecting from more specific levels of the type hierarchy increases the discriminativeness of the path pattern, while selecting from more abstract levels makes the path pattern easier to generalize. Choosing at the appropriate level helps create a path pattern that is both discriminative and generalizable, leading to greater prediction accuracy. However, the substantial number of combinations when considering possible types for entities in all path patterns makes exhaustively searching across all ’s impossible.
To select , we use the deterministic “soft” attention introduced in [Bahdanau, Cho, and Bengio2014] to create an approximated vector representation of from the set of type vectors , where can be obtained by learning vector embeddings of entity types. We name this approximated vector representation of as the type context vector . For each type vector of entity , a weight is computed by a feed-forward network conditioned on an evolving context . This weight can be interpreted as the probability that is the right level of abstraction or the relative importance for level to combine ’s together. Formally, can be calculated as:
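In the standard soft-attention form of [Bahdanau, Cho, and Bengio2014], and with symbol names assumed here purely for illustration (type vectors $\mathbf{v}_{i,1},\dots,\mathbf{v}_{i,L}$ for entity $e_i$, context $\mathbf{c}_i$, scoring network $f_{att}$, weights $\alpha_{i,l}$, and type context vector $\hat{\mathbf{a}}_i$), this computation can be written as:

```latex
\begin{align}
  s_{i,l} &= f_{att}\left(\mathbf{v}_{i,l}, \mathbf{c}_i\right), \\
  \alpha_{i,l} &= \frac{\exp\left(s_{i,l}\right)}{\sum_{k=1}^{L} \exp\left(s_{i,k}\right)}, \\
  \hat{\mathbf{a}}_i &= \sum_{l=1}^{L} \alpha_{i,l}\, \mathbf{v}_{i,l}.
\end{align}
```

The weights $\alpha_{i,l}$ are non-negative and sum to one over the levels of the hierarchy, which is what permits reading them as a distribution over levels of abstraction.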
We model the context of the current step as the previous hidden state of the LSTM, i.e., . We also use , the last state of the relation encoder, to compute the initial memory state and hidden state of the LSTM:
where and are two separate feed-forward networks. As illustrated in Figure 2, because the LSTM stores information from the relation encoder and previously approximated entity types, both the sequence of relations and the type context vector of the previous entity can affect the approximated type for .
With the type context vector computed for each entity , the LSTM sequentially encodes these vectors. The last hidden state of the LSTM is used as a vector representation of all selected entity types, denoted . We then concatenate and together to get the final representation of our proposed path pattern .
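The type selection step for a single entity can be sketched as follows. The additive scoring network, the dimensions, the random weights, and the function name `type_context_vector` are assumptions for illustration; in the model described above the context would be the previous hidden state of the entity type LSTM, whereas here it is simply passed in as a vector.

```python
import numpy as np

def softmax(x):
    shifted = x - np.max(x)          # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def type_context_vector(type_vectors, context, W_v, W_c, w):
    """Soft attention over one entity's type hierarchy.
    type_vectors: (L, d), one row per level, most specific first.
    context: (d_c,), the evolving context for the current step.
    W_v: (d_a, d), W_c: (d_a, d_c), w: (d_a,) form an assumed
    one-hidden-layer feed-forward scoring network."""
    hidden = np.tanh(type_vectors @ W_v.T + context @ W_c.T)  # (L, d_a)
    scores = hidden @ w                                       # (L,)
    weights = softmax(scores)                                 # sums to 1 over levels
    return weights @ type_vectors, weights                    # (d,), (L,)

rng = np.random.default_rng(0)
num_levels, type_dim, context_dim, attn_dim = 4, 6, 5, 7      # assumed sizes
type_vectors = rng.normal(size=(num_levels, type_dim))
context = rng.normal(size=context_dim)
W_v = rng.normal(scale=0.1, size=(attn_dim, type_dim))
W_c = rng.normal(scale=0.1, size=(attn_dim, context_dim))
w = rng.normal(scale=0.1, size=attn_dim)

context_vec, level_weights = type_context_vector(type_vectors, context, W_v, W_c, w)
```

The returned `context_vec` is a convex combination of the type vectors, so a sharply peaked `level_weights` corresponds to committing to one level of abstraction, while a flat one blends several levels.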
After representations of all paths in are obtained using the above models, we then reason over all of the resulting information, jointly, to make the prediction of whether is true.
Prior neural network models [Neelakantan, Roth, and McCallum2015, Das et al.2016] use a feed-forward network to condense the vector representation for each path to a single value . One of the pooling methods (Max, Average, Top-K, or LogSumExp) is then applied to all values, combining them and then passing the result through a sigmoid function to make the final prediction.
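The four score-pooling variants named above can be sketched as follows. The function name `pool_scores` and the example scores are hypothetical, and Top-K is implemented here as the mean of the k largest scores, which is one common reading rather than a confirmed detail of the cited implementations.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pool_scores(scores, method="logsumexp", k=3):
    """Combine per-path scalar scores into a single score."""
    scores = np.asarray(scores, dtype=float)
    if method == "max":
        return float(scores.max())
    if method == "average":
        return float(scores.mean())
    if method == "topk":
        # Mean of the k largest scores (assumed definition of Top-K).
        return float(np.sort(scores)[-k:].mean())
    if method == "logsumexp":
        # Numerically stable log(sum(exp(scores))), a smooth maximum.
        m = scores.max()
        return float(m + np.log(np.exp(scores - m).sum()))
    raise ValueError(f"unknown pooling method: {method}")

path_scores = [1.0, 2.0, 3.0]  # hypothetical per-path scores
pooled = {m: pool_scores(path_scores, m, k=2)
          for m in ("max", "average", "topk", "logsumexp")}
prediction = sigmoid(pooled["logsumexp"])
```

In every variant each path has already been reduced to one number before pooling, so any interaction between the path vectors themselves is no longer available at this stage.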
Compressing vector representations of paths to single values, as described above, hinders the model’s ability to collectively reason about the rich contextual information in the vectors. As a result, in our approach we introduce the use of an attention mechanism for integrating information from all paths, similar to that used to compute the type context vector. We use a trainable relation vector to represent the relation we are trying to predict. Following similar steps as when computing from , here we compute a vector representation of all path patterns from conditioned on the relation vector :
Since represents all the paths carrying information from both relations and entity types with correct levels of abstraction, the probability that exists between and can be accurately predicted using a feed-forward network along with a sigmoid function :
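A minimal sketch of attention-based pooling over path vectors is given below. The additive scoring form, the single linear output layer standing in for the final feed-forward network, the dimensions, and the function name `attention_pool` are assumptions for illustration, and the weights are random rather than trained.

```python
import numpy as np

def softmax(x):
    exp = np.exp(x - np.max(x))
    return exp / exp.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_pool(path_vectors, relation_vector, W_p, W_r, w, w_out, b_out):
    """Attend over path patterns conditioned on the query relation.
    path_vectors: (N, d), one row per path pattern.
    relation_vector: (d_r,), trainable vector of the relation to predict.
    Returns (probability, path_weights)."""
    hidden = np.tanh(path_vectors @ W_p.T + relation_vector @ W_r.T)  # (N, d_a)
    path_weights = softmax(hidden @ w)                                # (N,)
    pooled = path_weights @ path_vectors                              # (d,)
    probability = sigmoid(pooled @ w_out + b_out)                     # scalar
    return float(probability), path_weights

rng = np.random.default_rng(0)
num_paths, path_dim, rel_dim, attn_dim = 5, 12, 8, 10                 # assumed sizes
path_vectors = rng.normal(size=(num_paths, path_dim))
relation_vector = rng.normal(size=rel_dim)
W_p = rng.normal(scale=0.1, size=(attn_dim, path_dim))
W_r = rng.normal(scale=0.1, size=(attn_dim, rel_dim))
w = rng.normal(scale=0.1, size=attn_dim)
w_out = rng.normal(scale=0.1, size=path_dim)

probability, path_weights = attention_pool(
    path_vectors, relation_vector, W_p, W_r, w, w_out, b_out=0.0)
```

Unlike scalar pooling, the weighted sum is taken over full path vectors, so the final classifier still sees a vector-valued summary of all paths.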
The relation encoder, entity type encoder, and attention pooling can be trained end to end as one complete model. For predicting each relation , we train the model using true triples and false triples in the training set as positive and negative examples, denoted and respectively, requiring all triples having relation . Our training objective is to minimize the negative log-likelihood:
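With notation assumed here for illustration ($\mathcal{D}_r^{+}$ and $\mathcal{D}_r^{-}$ for the positive and negative examples of relation $r$, $N = |\mathcal{D}_r^{+}| + |\mathcal{D}_r^{-}|$, and $P(r \mid e_s, e_t)$ for the model output), a negative log-likelihood objective of this kind takes the usual binary cross-entropy form; the $1/N$ normalization is an assumption:

```latex
\begin{equation}
  \mathcal{L} = -\frac{1}{N} \left(
    \sum_{(e_s, r, e_t) \in \mathcal{D}_r^{+}} \log P(r \mid e_s, e_t)
    + \sum_{(e_s, r, e_t) \in \mathcal{D}_r^{-}} \log \bigl(1 - P(r \mid e_s, e_t)\bigr)
  \right)
\end{equation}
```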
We use backpropagation to update the learnable model parameters, which are the relation embedding (dim=), entity type embedding (dim=), trainable relation vector (dim=), two LSTMs in the relation encoder and entity type encoder (dim=), and feed-forward networks , , , , and . We used Adam [Kingma and Ba2014] for optimization with default parameters (learning rate=, =, =, =). We trained each model for the full 50 epochs and used early stopping on mean average precision as regularization.
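For reference, mean average precision can be computed as in the following sketch, which uses one common definition: average precision over a ranked list of test triples, averaged across groups. Grouping by relation, the function names, and the toy scores and labels are assumptions for illustration and are not results from the experiments.

```python
def average_precision(scores, labels):
    """Average precision of a ranking induced by scores.
    labels are 1 for true triples and 0 for false ones."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank   # precision at this positive
    return precision_sum / hits if hits else 0.0

def mean_average_precision(groups):
    """groups: iterable of (scores, labels) pairs, e.g. one per relation."""
    values = [average_precision(s, l) for s, l in groups]
    return sum(values) / len(values)

# Made-up scores and labels, purely to exercise the arithmetic.
toy_groups = [
    ([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]),   # AP = (1/1 + 2/3) / 2
    ([0.9, 0.1], [1, 0]),                   # AP = 1
]
toy_map = mean_average_precision(toy_groups)
```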
This section describes the datasets and baselines used for evaluating our proposed method on the fact prediction task.
We evaluated our method and baseline methods on two standard datasets for knowledge base completion: FB15k-237, a subset of the commonsense knowledge graph Freebase [Bollacker et al.2008], and WN18RR, a subset of the English lexical database WordNet [Miller1995]. The first section of Table 1 shows the statistics of these two datasets.
From all true triples in each KB , we built up a complete dataset for experiments , where indicates whether a triple is true or false. This dataset contains additional false triples that are sampled from using the method based on personalized page rank (released code from [Gardner and Mitchell2015]). Gardner and Mitchell show that this negative sampling method has no statistically significant effect on algorithms’ performance compared to sampling with PRA, another established but less efficient method used in previous works [Lao, Mitchell, and Cohen2011, Xiong, Hoang, and Wang2017]. As the number of negative examples has a significant effect on algorithms’ performance [Kadlec, Bajgar, and Kleindienst2017] and evaluation, we sampled 10 negative examples for each positive one. We split the dataset into 80% training and 20% testing. Because path-ranking based methods typically model each relation separately, the data was further divided based on relations.
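The dataset construction above can be sketched as follows. Note the substitution: negatives are drawn here by uniformly corrupting the tail entity, which is a simple stand-in and not the personalized page rank sampler used in the experiments. The function names and the toy entities and triples are hypothetical.

```python
import random
from collections import defaultdict

def sample_negatives(triples, entities, negatives_per_positive, rng):
    """Uniform tail corruption (a stand-in for personalized page rank sampling)."""
    true_triples = set(triples)
    negatives = []
    for head, relation, _tail in triples:
        candidates = [e for e in entities
                      if (head, relation, e) not in true_triples]
        k = min(negatives_per_positive, len(candidates))
        negatives.extend((head, relation, e) for e in rng.sample(candidates, k))
    return negatives

def split_and_group(examples, train_fraction, rng):
    """Shuffle, split into train/test, then group each part by relation."""
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = int(train_fraction * len(shuffled))

    def group(part):
        by_relation = defaultdict(list)
        for triple, label in part:
            by_relation[triple[1]].append((triple, label))
        return dict(by_relation)

    return group(shuffled[:cut]), group(shuffled[cut:])

rng = random.Random(0)
entities = [f"e{i}" for i in range(20)]                       # hypothetical
positives = [("e0", "r1", "e1"), ("e2", "r1", "e3"), ("e4", "r2", "e5"),
             ("e6", "r2", "e7"), ("e8", "r2", "e9")]
negatives = sample_negatives(positives, entities, 10, rng)    # 10 per positive
examples = [(t, 1) for t in positives] + [(t, 0) for t in negatives]
train, test = split_and_group(examples, 0.8, rng)             # 80% / 20%
```

Grouping by relation after the split mirrors the per-relation modeling described above: each relation gets its own training and test examples.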
|Statistic||WN18RR||FB15k-237|
|# Relation instances||134,720||254,290|
|# Relations tested||11||10|
|Avg. # train inst/relation||38,039||16,696|
|Avg. # testing inst/relation||11,888||5,219|
|Avg. path length||5.3||3.5|
|Max path length||6||4|
|Avg. # paths per instance||88.6||154.8|
|Max height of type hierarchy||14||7|
|Avg. height of type hierarchy||4.6||6.4|
|Model||Data Types||Pooling Method||WN18RR||FB15k-237|
|Path-RNNC||Relation, Type, Entity||LogSumExp||51.08||52.17|
To extract paths between entities, we first constructed graphs from true triples in the datasets. We augmented the graphs with reverse relations following existing methods [Lao, Mitchell, and Cohen2011]. We extracted paths using bi-directional BFS. We set the maximum length of paths to 6 for WN18RR and 4 for the more densely connected FB15k-237. If a pair had more than 200 paths, we randomly sampled 200 of them. Both the limit on path length and the sub-sampling of paths are due to computational concerns.
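Two of the preprocessing steps above, augmenting the graph with reverse relations and capping the number of paths per entity pair, can be sketched as follows. The helper names are hypothetical, the `-` prefix for reversed relations mirrors the negative-sign notation used for relation patterns, and the bi-directional search itself is omitted.

```python
import random

def add_reverse_relations(triples):
    """For every (head, r, tail), also add (tail, -r, head)."""
    augmented = list(triples)
    augmented.extend((tail, "-" + relation, head)
                     for head, relation, tail in triples)
    return augmented

def subsample_paths(paths, max_paths, rng):
    """Keep all paths if there are at most max_paths, else sample uniformly."""
    if len(paths) <= max_paths:
        return list(paths)
    return rng.sample(paths, max_paths)

rng = random.Random(0)
augmented = add_reverse_relations([("a", "r1", "b"), ("b", "r2", "c")])

# Hypothetical path set with more than 200 entries for one entity pair.
many_paths = [["a", f"r{i}", "c"] for i in range(500)]
kept_paths = subsample_paths(many_paths, 200, rng)
```

With reverse edges in place, a path may traverse a relation against its original direction, which is what allows patterns containing reversed relations to be discovered.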
To create type hierarchies, we extracted inherited hypernyms available in WordNet [Miller1995] for WN18RR entities and used type data released in [Xie, Liu, and Sun2016] for FB15K-237 entities. We used all available types for WN18RR. We ordered Freebase types based on their frequency of occurrence because types for Freebase entities are not strictly hierarchical. We then followed [Das et al.2016] to select up to 7 most frequently occurring types for each entity. For WN18RR, we mapped types to their vector representations using a pre-trained Google News word2vec model. Because we did not find a suitable pre-trained embedding for types of FB15K-237 entities, we trained an embedding with the whole model end-to-end.
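The frequency-based ordering of Freebase types can be sketched as follows. The function name, the toy type assignments, and the convention that more frequent types are treated as more abstract (and therefore placed at higher levels) are assumptions made for illustration.

```python
from collections import Counter

def build_type_hierarchies(entity_types, max_types=7):
    """Order each entity's types by global frequency and keep the
    max_types most frequent ones. Returned lists run from the least
    frequent (assumed most specific) to the most frequent (assumed
    most abstract) type."""
    counts = Counter(t for types in entity_types.values() for t in set(types))
    hierarchies = {}
    for entity, types in entity_types.items():
        # Most frequent first; ties broken alphabetically for determinism.
        ranked = sorted(set(types), key=lambda t: (-counts[t], t))
        hierarchies[entity] = ranked[:max_types][::-1]
    return hierarchies

# Hypothetical type assignments.
entity_types = {
    "e1": ["person", "actor", "award_winner"],
    "e2": ["person", "politician"],
    "e3": ["person", "actor"],
}
hierarchies = build_type_hierarchies(entity_types)
top_two = build_type_hierarchies(entity_types, max_types=2)
```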
We compared the performance of APR to the following methods: PRA from [Lao, Mitchell, and Cohen2011], SFE from [Gardner and Mitchell2015], Path-RNNA from [Neelakantan, Roth, and McCallum2015], and the two models Path-RNNB and Path-RNNC from [Das et al.2016]. The first two methods were tested with code released by Gardner and Mitchell, and the other three with code released by Das et al. The first two baselines implement path extraction themselves; the other methods and our models use the paths we extracted. Path-RNNB and Path-RNNC differ in the data types they use, as shown in Table 2.
In this section, we report results comparing performance to the above prior methods, as well as independent validation of attention-based pooling. We additionally present insights into the APR model and its modeling of abstraction.
Comparison to Existing Methods
We compared the performance of APR, paired with various pooling methods, against the baselines; Table 2 summarizes the results. All APR variants outperformed the prior state of the art on both datasets. Directly comparing all methods that utilize LogSumExp pooling (APRC and all three Path-RNN models), our results show statistically significant () improvement of APRC, as determined by a paired t-test with each relation treated as paired data. This result indicates that adding types, with the right balance between being generalizable and discriminative, helps create path patterns that allow for more accurate prediction. Our best model APRD, which further leverages attention pooling to reason about the rich contextual information in paths, is able to improve state-of-the-art performance by on WN18RR and on FB15k-237 with ().
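A paired t-test over relations, as used above, reduces to a one-sample test on the per-relation score differences. The sketch below computes only the t statistic and the degrees of freedom; the function name is hypothetical and the input numbers are made up solely to exercise the arithmetic, so they are not experimental results.

```python
import math

def paired_t_statistic(scores_a, scores_b):
    """t statistic and degrees of freedom for paired samples, where
    scores_a[i] and scores_b[i] are two methods' scores on relation i."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(variance / n), n - 1

# Made-up per-relation scores for two methods (differences are 1, 2, 3).
t_stat, dof = paired_t_statistic([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
```

The p-value follows from comparing `t_stat` against a Student t distribution with `dof` degrees of freedom, for example with a statistics library.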
One surprising result is that Path-RNNB and Path-RNNC, even using entity and type information, still perform worse than Path-RNNA on WN18RR. We suspect that the extremely large number of entities and types in WN18RR, and the simple feature aggregation method used by these models, cause learning to not generalize even for highly adaptable neural network models. The use of abstraction helps our model generalize information from individual entities and types, and achieve more robust prediction.
We compared our proposed attention pooling to three existing pooling methods. As shown in Table 2, APRD with attention pooling performs the best on both datasets. The superior performance is likely due to the pooling method’s ability to collectively reason about the rich contextual information in paths. The other three methods lose information when they compress path representations to single values.
To gain insight into the behavior of attention pooling, we visualized the path weights computed by the attention module. Figure 3 shows visualizations of four representative relations from the two datasets. Figures 3(a) and 3(c) show that for some relations, attention pooling focuses on small numbers of highly predictive paths. In other cases, as in Figures 3(b) and 3(d), attention pooling incorporates data across a broad collection of paths. This ability to dynamically adjust attention based on context highlights the adaptability of attention pooling.
Correct Level of Abstraction
We further investigated whether our model learns the proposed path patterns that balance between discrimination and generalization. For comparison, we modified the best model APRD by fixing the selection of types to either the most specific level or the most abstract level. Table 3 shows the effect of different levels of abstraction on performance. Using a fixed level of abstraction leads to worse performance compared to using attention. This result not only confirms that balancing between generalization and discrimination makes prediction more accurate, but also that our proposed model achieves this balance by learning the correct levels of abstraction for entities.
|Query Relation: has_profession|
|Interpretation: Actors graduated from the same school are likely to share the same profession.|
|Query Relation: has_genre|
|Interpretation: Films focusing on the same subject are likely to share the same genre. However, films depicting the same profession can have different genres.|
|Query Relation: nominee_work|
|Interpretation: The best actors are likely to perform great in all movies. The same company may not always produce award winning works.|
We also examined the distribution of attention weight at different levels of type hierarchies. Figure 4 shows that the model leveraged all levels for representing various entities. More specifically, levels 1, 2, and 3 are most commonly emphasized for entities in WN18RR; levels 1, 2, and 6 are most often used for entities in FB15k-237. The emphasis on lower levels is expected, since Table 3 shows that using the most specific level is more beneficial than using the most abstract level. However, the diversity of levels used by the model indicates that not all entities should use the most specific level: different entities require different levels of abstraction.
Finally, we visualized examples of paths from prediction along with the top weighted types the model selected for entities. In Table 4, we are able to see examples that correspond strongly with human intuition. This qualitative result again verifies that the model learns a new class of rules (path patterns) that is more specific yet still generalizable.
Comparison with Knowledge Graph Embedding Methods
As a final point of comparison, we validated our best model APRD against two state-of-the-art KGE methods: TuckER [Balazevic, Allen, and Hospedales2019] and ComplEx-N3 [Lacroix, Usunier, and Obozinski2018]. We followed [Xiong, Hoang, and Wang2017] to evaluate KGE methods on the fact prediction task, ranking test triples that share the same relation. As shown in Table 5, our model performs significantly better on both datasets. In fact, all neural-based path ranking methods (APR and Path-RNN) outperform the KGE methods. One possible reason that path ranking methods outperform KGE is that KGE methods require both the source and target entities in the missing triple to be known, thus failing to perform well when test entities are not present in the training set. However, only of FB15k-237’s test data consists of new entities, and only in WN18RR. As a result, new entities do not alone account for the difference in performance. This comparison additionally affirms the improved state-of-the-art performance in fact prediction achieved by our method.
This work addresses the problem of knowledge base completion. We introduced Attentive Path Ranking, a novel class of generalizable path patterns leveraging type hierarchies of entities, and developed an attention-based RNN model to discover the new path patterns from data. Our approach results in statistically significant improvement over state-of-the-art path-ranking based methods and knowledge graph embedding methods on two benchmark datasets WN18RR and FB15k-237. Quantitative and qualitative analyses of the discovered path patterns provided insights into how APR achieves a balance between generalization and discrimination.
- [Aditya, Yang, and Baral2018] Aditya, S.; Yang, Y.; and Baral, C. 2018. Explicit reasoning over end-to-end neural architectures for visual question answering. In AAAI.
- [Athiwaratkun and Wilson2018] Athiwaratkun, B., and Wilson, A. G. 2018. Hierarchical density order embeddings. In ICLR.
- [Bahdanau, Cho, and Bengio2014] Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. In ICLR.
- [Balazevic, Allen, and Hospedales2019] Balazevic, I.; Allen, C.; and Hospedales, T. M. 2019. TuckER: Tensor factorization for knowledge graph completion. In EMNLP.
- [Bollacker et al.2008] Bollacker, K.; Evans, C.; Paritosh, P.; Sturge, T.; and Taylor, J. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In SIGMOD.
- [Chan et al.2016] Chan, W.; Jaitly, N.; Le, Q.; and Vinyals, O. 2016. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In ICASSP.
- [Das et al.2016] Das, R.; Neelakantan, A.; Belanger, D.; and McCallum, A. 2016. Chains of reasoning over entities, relations, and text using recurrent neural networks. In EACL.
- [Das et al.2018] Das, R.; Dhuliawala, S.; Zaheer, M.; Vilnis, L.; Durugkar, I.; Krishnamurthy, A.; Smola, A. J.; and McCallum, A. 2018. Go for a walk and arrive at the answer: Reasoning over paths in knowledge bases using reinforcement learning. In ICLR.
- [Dettmers et al.2018] Dettmers, T.; Minervini, P.; Stenetorp, P.; and Riedel, S. 2018. Convolutional 2d knowledge graph embeddings. In AAAI.
- [Gardner and Mitchell2015] Gardner, M., and Mitchell, T. 2015. Efficient and expressive knowledge base completion using subgraph feature extraction. In EMNLP.
- [Getoor and Taskar2007] Getoor, L., and Taskar, B. 2007. Introduction to Statistical Relational Learning (Adaptive Computation and Machine Learning). The MIT Press.
- [Hochreiter and Schmidhuber1997] Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural computation.
- [Kadlec, Bajgar, and Kleindienst2017] Kadlec, R.; Bajgar, O.; and Kleindienst, J. 2017. Knowledge base completion: Baselines strike back. In Rep4NLP@ACL.
- [Kingma and Ba2014] Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. In ICLR.
- [Lacroix, Usunier, and Obozinski2018] Lacroix, T.; Usunier, N.; and Obozinski, G. 2018. Canonical tensor decomposition for knowledge base completion. In ICML.
- [Lao, Mitchell, and Cohen2011] Lao, N.; Mitchell, T.; and Cohen, W. W. 2011. Random walk inference and learning in a large scale knowledge base. In EMNLP. ACL.
- [Miller1995] Miller, G. A. 1995. Wordnet: a lexical database for english. Communications of the ACM 38(11):39–41.
- [Min et al.2013] Min, B.; Grishman, R.; Wan, L.; Wang, C.; and Gondek, D. 2013. Distant supervision for relation extraction with an incomplete knowledge base. In NAACL, 777–782.
- [Neelakantan, Roth, and McCallum2015] Neelakantan, A.; Roth, B.; and McCallum, A. 2015. Compositional vector space models for knowledge base inference. In 2015 AAAI Spring Symposium Series.
- [Nickel et al.2015] Nickel, M.; Murphy, K.; Tresp, V.; and Gabrilovich, E. 2015. A review of relational machine learning for knowledge graphs. Proceedings of the IEEE 104:11–33.
- [Quinlan and Cameron-Jones1993] Quinlan, J. R., and Cameron-Jones, R. M. 1993. Foil: A midterm report. In European conference on machine learning. Springer.
- [Santus et al.2014] Santus, E.; Lenci, A.; Lu, Q.; and Schulte im Walde, S. 2014. Chasing hypernyms in vector spaces with entropy. In EACL. ACL.
- [Toutanova et al.2015] Toutanova, K.; Chen, D.; Pantel, P.; Poon, H.; Choudhury, P.; and Gamon, M. 2015. Representing text for joint embedding of text and knowledge bases. In EMNLP.
- [Vendrov et al.2016] Vendrov, I.; Kiros, R.; Fidler, S.; and Urtasun, R. 2016. Order-embeddings of images and language. In ICLR.
- [Wang et al.2018] Wang, H.; Zhang, F.; Xie, X.; and Guo, M. 2018. Dkn: Deep knowledge-aware network for news recommendation. In World Wide Web Conference, 1835–1844. International World Wide Web Conferences Steering Committee.
- [Xie, Liu, and Sun2016] Xie, R.; Liu, Z.; and Sun, M. 2016. Representation learning of knowledge graphs with hierarchical types. In IJCAI.
- [Xiong, Hoang, and Wang2017] Xiong, W.; Hoang, T.; and Wang, W. Y. 2017. Deeppath: A reinforcement learning method for knowledge graph reasoning. In EMNLP, 564–573. Association for Computational Linguistics.
- [Xu et al.2015] Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Zemel, R.; and Bengio, Y. 2015. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2048–2057.