Learning Knowledge Graph Embeddings with Type Regularizer
Learning relations based on evidence from knowledge repositories relies on processing the available relation instances. Knowledge repositories are not balanced in terms of relations or entities – some relations have fewer than 10 instances while others have thousands, and some entities are involved in fewer than 10 relations while others are involved in thousands. Many relations, however, have a clear domain and range, which we hypothesize could help learn a better, more generalizing model. We include such information in the RESCAL model in the form of a regularization factor added to the loss function that takes into account the types (categories) of the entities that appear as arguments to relations in the knowledge base. Tested on Freebase, a frequently used benchmarking dataset for link/path prediction tasks, we note increased performance compared to the baseline model in terms of mean reciprocal rank and hits@N, N = 1, 3, 10. Furthermore, we discover scenarios that significantly impact the effectiveness of the type regularizer.
Table 1: Examples of entity type information in Freebase.
|Source Type||Source||Path or Relation||Target||Target Type|
Knowledge – lexical, world and common-sense – is crucial for tasks such as automated text comprehension and summarization, question answering, and natural language dialogue systems. To make such knowledge available for automatic processing, the most common approach is to provide it as a collection of relation triples – entities or concepts connected by a relation: e.g., (concept:city:London, relation:country_capital, concept:country:UK). Globally, such collections can be viewed as knowledge graphs (KGs), for example NELL (Carlson et al., 2010), Freebase (Bollacker et al., 2008) and YAGO (Suchanek et al., 2007). In such graphs, nodes (entities/concepts) may be connected by different types of relations. This results in a multi-graph, i.e., a graph with different types of links where a link type corresponds to a relation type.
KGs are known to be incomplete (Min et al., 2013), i.e., a significant number of relations between entities are missing. Embedding the knowledge graph in a continuous vector space has been successfully used to address this problem (Nickel et al., 2012; Bordes et al., 2013; Socher et al., 2013). Such models represent the components of the graph, i.e., the entities and relations, using real valued latent factors that encode the structure of the knowledge graph. For example the latent factor model should be able to recover Cologne from the latent representations of Moselle and river_flowsThrough_city. Examples include the RESCAL (Nickel et al., 2012) tensor factorization model, the TransE model (Bordes et al., 2013) and their variations (Lin et al., 2015; Nickel et al., 2016). We focus on the RESCAL model, one of the most flexible and widely used models. RESCAL is a bilinear model that represents triples as a pairwise interaction of source and target entity latent factors (embeddings) through a matrix that represents the latent factors of the connecting relation. The entity and relation representations induced can be used to predict additional relations – edges – between known entities. Table 1 lists a few examples of entity type information in Freebase.
Existing knowledge graphs are imbalanced – both relation and entity frequencies vary widely, as evident from the statistics on Freebase 15k shown in Figure 1. Since entity and relation embeddings are based on the connectivity structure of the graph, it is reasonable to ask how well the embeddings learned for underrepresented entities and relations perform on the task of link prediction.
Approaches such as RESCAL take an extensional view of relations – they process the collection of instances without knowledge of higher level rules or information about these relations. We hypothesize that providing the higher level – intensional – view in the form of types or categories of relation arguments, can lead to improved results for the task of link prediction. This may be true particularly for knowledge graphs such as Freebase that have strongly typed relations, and also for low-frequency relations or for relations involving low-frequency entities.
In this article we present experimental results supporting the hypothesis that augmenting single-relation models with entity type information, in the form of a ‘Type’ regularizer, leads to improvements in predicting missing links. The results show that even though the bilinear model induces representations for all entities and relations together – so it implicitly uses the type information we provide as a separate relation – the type regularizer which explicitly includes such information for each relation leads to better results. Furthermore, we note the positive impact of including the type regularizer for relations involving low-frequency entities, whereas low-frequency relations are less affected by this added information. We also analyze the effects of training data size on the usefulness of the type regularizer, and note that its impact grows with the amount of training data.
2. Related Work
A variety of latent factor models (Nickel et al., 2012; Bordes et al., 2013; Socher et al., 2013; Riedel et al., 2013) have been developed to represent entities and relations in a knowledge graph, and have been used to address the knowledge base completion (KBC) problem. Most latent factor models are trained on either knowledge graph triples, or triples extracted from open domain knowledge extraction tools (Riedel et al., 2013). A notable exception is the RNN model proposed by (Neelakantan et al., 2015) that learns path embeddings for knowledge base completion. (Guu et al., 2015) propose a compositional objective function over latent factor models, which is trained on paths as well as triples. For models that are compositional, (Toutanova et al., 2016) shows that incorporating intermediate entity information, in the form of latent factors, improves KBC performance. The source and target types are not explicitly included.
(Chang et al., 2014) make use of the type information and produce a variation of RESCAL they call TRESCAL – Typed RESCAL. The type information is used to improve the efficiency of the model, by reducing the size of the entity matrix in the computation of the loss function to entities belonging to the domain and range of the relation. The entity type as such is only implicitly incorporated, as something shared by the entities singled out for computing the loss function.
(Das et al., 2016) builds on (Neelakantan et al., 2015), and uses an RNN to model paths which incorporate type information for the entities along the path. Entities are represented as a sum of their entity types, which are learned during training. Including this information leads to higher performance.
Compared with these previous approaches, we add the entity types explicitly in the model, and derive a representation for entities and their types concurrently. We analyze the impact of using such representation for link prediction with different amounts of training data, to understand under what conditions the type information has a positive impact.
In this section we describe the RESCAL model and show how the type regularizer was added to include the type information for each relation in the computation of the loss function.
Let $\mathcal{E}$ and $\mathcal{R}$ be the sets of entities and relations in the KG, respectively. A knowledge graph is a set of triples $(s, r, t)$ where $s, t \in \mathcal{E}$ and relation $r \in \mathcal{R}$ connects $s$ to $t$.
The knowledge base completion (KBC) task is the task of classifying whether the triple $(s, r, t)$ is a part of the knowledge graph. This can be described as the query $(s, r, ?)$ or $(?, r, t)$, where the question mark represents the unknown correct target/source entity from a set of candidate entities.
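To make the task definition concrete, the following minimal Python sketch stores a knowledge graph as a set of triples and expands a query with one unknown argument into candidate triples. Only the London/country_capital/UK triple comes from the text; the other triples and the helper names `Triple`, `is_known` and `candidate_triples` are hypothetical and serve illustration only.

```python
from typing import List, NamedTuple, Optional, Set


class Triple(NamedTuple):
    source: str
    relation: str
    target: str


# Toy knowledge graph. Apart from the first triple, the entries are
# made up for illustration.
kg: Set[Triple] = {
    Triple("London", "country_capital", "UK"),
    Triple("Paris", "country_capital", "France"),
    Triple("London", "located_in", "UK"),
}

# The entity set is every node that appears as a source or a target.
entities = {t.source for t in kg} | {t.target for t in kg}


def is_known(source: str, relation: str, target: str) -> bool:
    """Check whether a triple is already part of the knowledge graph."""
    return Triple(source, relation, target) in kg


def candidate_triples(source: Optional[str], relation: str,
                      target: Optional[str]) -> List[Triple]:
    """Expand a query (s, r, ?) or (?, r, t) into one candidate per entity."""
    if (source is None) == (target is None):
        raise ValueError("exactly one of source/target must be unknown")
    if target is None:
        return [Triple(source, relation, e) for e in sorted(entities)]
    return [Triple(e, relation, target) for e in sorted(entities)]


candidates = candidate_triples("London", "country_capital", None)
```

A link prediction model then scores each candidate and ranks the entities; the correct answer should appear near the top of that ranking.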
3.2. RESCAL Model
The RESCAL model (Nickel et al., 2012) weights the interaction of all pairwise latent factors between the source and target entity for predicting a relation. It represents every entity $e$ as a vector ($x_e \in \mathbb{R}^{d}$), and every relation $r$ as a matrix $W_r \in \mathbb{R}^{d \times d}$. This model represents the triple $(s, r, t)$ as a score given by
$$\mathrm{sc}(s, r, t) = x_s^{\top} W_r \, x_t .$$
This is equivalent to tensor factorization where each relation matrix is a slice of the tensor. These vectors and matrices are learned by constructing a loss function that contrasts the score of a correct triple to incorrect ones. Here we use the max-margin loss described in the following equation:
$$\mathcal{L} = \sum_{i=1}^{N} \sum_{t' \in \mathcal{N}(t_i)} \max\bigl(0,\; 1 - p_i + p'_i\bigr) \tag{1}$$
where there are $N$ positive instances, positive and negative instances are scored as $p_i = \sigma(\mathrm{sc}(s_i, r_i, t_i))$ and $p'_i = \sigma(\mathrm{sc}(s_i, r_i, t'))$, respectively. $\mathcal{N}(t_i)$ is the set of incorrect targets and $\sigma$ is the sigmoid function.
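The bilinear score and the max-margin loss can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, and a margin of 1 is assumed.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def score(x_s, W_r, x_t):
    """Bilinear RESCAL score x_s^T W_r x_t for one triple."""
    return float(x_s @ W_r @ x_t)


def max_margin_loss(x_s, W_r, x_t, x_negs, margin=1.0):
    """Hinge loss of one positive triple against corrupted targets.

    x_negs has one row per incorrect target embedding.
    The margin of 1.0 is an assumption of this sketch.
    """
    pos = sigmoid(score(x_s, W_r, x_t))
    # Scores of all negatives at once: row k is x_s^T W_r x_negs[k].
    negs = sigmoid(x_negs @ W_r.T @ x_s)
    return float(np.sum(np.maximum(0.0, margin - pos + negs)))


# Toy example with d = 2 and an identity relation matrix.
W = np.eye(2)
x_s = np.array([1.0, 0.0])
x_t = np.array([1.0, 0.0])
x_negs = np.array([[0.0, 1.0], [-1.0, 0.0]])
loss = max_margin_loss(x_s, W, x_t, x_negs)
```

Because the scores pass through the sigmoid, each term of the hinge lies between 0 and 2 when the margin is 1, so the loss is bounded per negative sample.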
3.3. The Type Regularizer
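We introduce a regularizer term which incorporates type information of source and target entities. Let $c_e$ be the type for entity $e$ and $r_c$ the relation between $e$ and $c_e$. Depending on the knowledge resource, $r_c$ could be an is-a relation (in an ontology, for example), a category relation (in a resource built based on Wikipedia), or other such relations that capture the entity type. A few examples of entity types can be seen in Table 1. Note that entity type information is not used during test time.
If $s$ is the source entity and $t$ the target entity for query $q$, then we define the regularizer as
$$R(s, t) = \sum_{c \in C_s} \sum_{c' \in \mathcal{N}(c_s)} \ell(s, r_c, c, c') \;+\; \sum_{c \in C_t} \sum_{c' \in \mathcal{N}(c_t)} \ell(t, r_c, c, c')$$
where $\mathcal{N}(c_s)$ and $\mathcal{N}(c_t)$ are sets of incorrect categories (negatives) for $s$ and $t$, while $C_s$ and $C_t$ are sets of correct categories for source and target respectively. $\ell(e, r, u, u') = \max\bigl(0,\; 1 - \sigma(\mathrm{sc}(e, r, u)) + \sigma(\mathrm{sc}(e, r, u'))\bigr)$ is the max-margin loss described in equation (1), written for a single positive/negative pair.
The complete objective function to be minimized is
$$J = \sum_{i=1}^{N} \Bigl[ \sum_{t' \in \mathcal{N}(t_i)} \ell(s_i, r_i, t_i, t') \;+\; \alpha \, R(s_i, t_i) \Bigr]$$
where the hyper-parameter $\alpha$ controls the impact of the regularizer terms and $\mathcal{N}(t_i)$ is the set of negative targets for query $q_i$, where $q_i$ corresponds to query $(s_i, r_i, ?)$.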
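The following NumPy sketch illustrates how a type regularizer of this kind can be added to the link prediction loss for a single training triple. It rests on several assumptions that are not confirmed by the text: types are embedded as vectors in the same space as entities, the category relation has its own matrix `W_c`, the margin is 1, and all function and parameter names are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def hinge(x_e, W_r, x_pos, x_negs, margin=1.0):
    """Max-margin loss of (e, r, pos) against the rows of x_negs."""
    pos = sigmoid(x_e @ W_r @ x_pos)
    negs = sigmoid(x_negs @ W_r.T @ x_e)
    return float(np.sum(np.maximum(0.0, margin - pos + negs)))


def type_regularizer(x_e, W_c, x_types, x_neg_types):
    """Sum the hinge loss over every correct type of one entity."""
    return sum(hinge(x_e, W_c, x_c, x_neg_types) for x_c in x_types)


def objective(x_s, W_r, x_t, x_neg_targets, W_c,
              src_types, src_neg_types, tgt_types, tgt_neg_types, alpha):
    """Link loss plus alpha times the type regularizer for one triple."""
    link = hinge(x_s, W_r, x_t, x_neg_targets)
    reg = (type_regularizer(x_s, W_c, src_types, src_neg_types)
           + type_regularizer(x_t, W_c, tgt_types, tgt_neg_types))
    return link + alpha * reg


# Random toy embeddings (d = 4); values carry no meaning.
rng = np.random.default_rng(0)
d = 4
x_s, x_t = rng.normal(size=d), rng.normal(size=d)
W_r, W_c = rng.normal(size=(d, d)), rng.normal(size=(d, d))
neg_targets = rng.normal(size=(3, d))
src_types, tgt_types = rng.normal(size=(1, d)), rng.normal(size=(2, d))
src_neg_types, tgt_neg_types = rng.normal(size=(3, d)), rng.normal(size=(3, d))

args = (x_s, W_r, x_t, neg_targets, W_c,
        src_types, src_neg_types, tgt_types, tgt_neg_types)
link_only = objective(*args, alpha=0.0)
full = objective(*args, alpha=1.0)
```

With `alpha=0.0` the objective reduces to the plain link prediction loss, and with `alpha=1.0` the category triples are weighted exactly like ordinary triples, which matches the observation that the regularizer then behaves like one more relation slice.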
We carry out experiments on FB15K, a subset of the Freebase knowledge graph provided by (Bordes et al., 2013). This dataset is a standard benchmark dataset used for evaluating link prediction algorithms (Bordes et al., 2013; Nickel et al., 2016; Trouillon et al., 2017). The FB15K dataset consists of 1,345 relations and 14,951 entities. The training, validation and test sets consist of 483,142, 50,000 and 59,071 triples respectively. The Freebase relations do not include the category relation, thus there is no overlap between the category triples and FB15K triples.
We obtain Freebase category data from (Gardner and Mitchell, 2015), and then the entity type by mapping the Freebase entity identifier to the Freebase category. This results in 101,353 instances of the category relation which is used in the training stage. It is not used during test time.
We use the Adam (Kingma and Ba, 2014) SGD optimizer for training because it addresses the problem of decreasing learning rate in AdaGrad. We use median gradient clipping to prevent exploding gradients and we also ensure that entity embeddings have unit norm. We performed an exhaustive grid search over the L2 regularizer as well as the type regularizer weight $\alpha$ on the validation set, and we tuned the training duration using early stopping. We use 100-dimensional entity vectors in all experiments. Code is available at https://github.com/bhushank/kge.
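Two of the training details above, unit-norm entity embeddings and gradient clipping, can be sketched as follows. The text does not define "median gradient clipping", so this sketch assumes one possible reading: a gradient is rescaled when its norm exceeds the median of previously observed gradient norms. All function names are hypothetical and this is not taken from the released code.

```python
import numpy as np


def renormalize_rows(E, eps=1e-12):
    """Project every entity embedding (row of E) back onto the unit sphere."""
    norms = np.linalg.norm(E, axis=1, keepdims=True)
    return E / np.maximum(norms, eps)


def clip_by_norm(grad, threshold):
    """Rescale grad so that its norm does not exceed threshold."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        return grad * (threshold / norm)
    return grad


def median_threshold(norm_history):
    """Assumed clipping threshold: median of past gradient norms."""
    return float(np.median(norm_history))


E = renormalize_rows(np.array([[3.0, 4.0], [0.0, 2.0]]))
threshold = median_threshold([1.0, 2.0, 3.0])
clipped = clip_by_norm(np.array([6.0, 8.0]), threshold)
```

Renormalization would typically be applied after each parameter update, and clipping before the update; both steps are independent of the optimizer used.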
4.3. Evaluation Procedure
For evaluation we follow the procedure described in (Socher et al., 2013). For every test triple we predict either the source or the target, and negative instances for training and testing are produced by corrupting positive ones: we replace $s$ (or $t$) in a triple $(s, r, t)$ with an $s'$ (or $t'$) that has the same type as $s$ (or $t$) but does not appear in a positive instance $(s', r, t)$ (or $(s, r, t')$). For meaningful comparison, the negative triples that occur in training or validation datasets as positive triples are filtered out. For faster evaluation, instead of using all negative triples, we produce 1000 by randomly sampling entities from the entire set. We report results in terms of hits at 1, 3, 10 (HITS@1,3,10) and mean reciprocal rank (MRR) metrics. Hits at $N$ is the proportion of correct answers (hits) in the first $N$ ranked predictions, while MRR is the mean of the reciprocal of the rank of the correct answers.
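The filtered ranking and the two metrics can be illustrated with a short, dependency-free Python sketch. The function names and the toy scores are hypothetical; ties are broken optimistically here, which is an assumption rather than something the evaluation procedure specifies.

```python
def filtered_rank(scores, correct, known_positives):
    """Rank (1 = best) of `correct`, ignoring other known positive answers.

    scores maps each candidate entity to its model score.
    Ties are resolved in favor of the correct answer (assumption).
    """
    target_score = scores[correct]
    rank = 1
    for candidate, s in scores.items():
        if candidate == correct or candidate in known_positives:
            continue  # filtered setting: skip other true answers
        if s > target_score:
            rank += 1
    return rank


def mean_reciprocal_rank(ranks):
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at(ranks, n):
    """Proportion of queries whose correct answer is ranked in the top n."""
    return sum(1 for r in ranks if r <= n) / len(ranks)


# Candidate "a" outranks the answer "c" but is itself a known positive,
# so only "b" counts against "c".
toy_scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
rank_c = filtered_rank(toy_scores, correct="c", known_positives={"a"})

ranks = [1, 2, 4, 20]
mrr = mean_reciprocal_rank(ranks)
```

For the rank list `[1, 2, 4, 20]`, MRR is (1 + 1/2 + 1/4 + 1/20) / 4 = 0.45, while hits@1, hits@3 and hits@10 are 0.25, 0.5 and 0.75 respectively.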
We use the bilinear (RESCAL) model as a baseline. As evidenced by the results in Table 2, adding the type regularizer improves performance. It may be tempting to think that the performance improvement is natural since we are providing additional information through the type regularizer. We test this in further experiments.
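Table 2: Link prediction results of the bilinear baseline and the bilinear model with type regularizer (TR).
|Metrics||Bilinear||Bilinear + TR|
Table 3: MRR of the bilinear model with type regularizer (TR) for different amounts of training data.
|% training data||Model||MRR||% Improvement|
|||Bilinear + TR||0.3862||+12.59|
|||Bilinear + TR||0.3409||-1.3|
|||Bilinear + TR||0.3198||-3.67|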
We test the impact of the type regularizer by analyzing its performance on different sizes of training data. We first generate multiple training datasets by randomly sampling 25%, 50% and 75% of the triples. As illustrated in Table 3, when using only 25% to 50% of the training data, the performance drops. The type regularizer uses category information; under certain circumstances ($\alpha = 1$), adding it is equivalent to adding approximately 100,000 new triples with the category relation to the training set. Thus, simply augmenting the model with additional information does not always improve performance.
The reason behind the performance drop with less training data is not obvious, because adding external information should help the model learn better embeddings. We hypothesize that the drop in performance occurs because, when fewer training instances are available, the type regularizer leads the system to learn relations that over-generalize. The model is biased towards learning categories very well in order to reduce the training loss. This results in embeddings that are biased towards predicting relations at the level of categories rather than individual relations, which lowers performance on the relation prediction task.
We investigate this hypothesis by varying the value of $\alpha$, which weighs the importance of the type regularizer (cf. the complete objective function in Section 3.3). We plot the Mean Reciprocal Rank vs. the strength of the type regularizer for a model trained on only 25% of the training data in Fig. 2. The higher the strength of the type regularizer, the higher the cost incurred for mis-predicting the category. As Fig. 2 shows, MRR falls sharply as $\alpha$ increases. This effect is not observed in the 100% training data scenario. This suggests that adding category information may lead to improved performance only when the added information does not severely bias the training data.
Table 4: Relations used to analyze the effect of the number of training instances.
|Relation Name||Instances (train)||Instances (test)|
To investigate the impact of training data size on the type regularizer performance, we analyze in detail the performance of the system for relations with a different number of training instances. Table 4 lists four relations we used to look into this phenomenon.
Fig. 3 shows the performance in terms of MRR (using the Type Regularizer) for link prediction on these four relations. The orange and blue lines denote relations with 11,636 and 5,952 training instances respectively, while the red and green curves denote relations with 2,407 and 1,010 training instances respectively. The red and green curves (the relations with fewer instances) show a larger change in MRR compared to the orange and blue curves. This supports our hypothesis that the Type Regularizer is more sensitive for relations with a smaller number of training instances, and indicates that the embeddings learned for relations with a larger number of instances are less biased towards predicting categories.
We note that the type regularizer equation in Section 3.3 has the same max-margin structure as the loss function, equation (1). Therefore, when the hyperparameter $\alpha$ is 1, using this particular formula for the type regularizer is equivalent to adding the category relation as an additional slice of the tensor factorized by RESCAL; more specifically, it is equivalent to adding 101,353 unique instances of the category relation. Experiments have shown, though, that fine-tuning $\alpha$ – and thus fine-tuning the usage of type information – can lead to better results.
We also performed an overall relation and entity analysis based on their occurrence frequency. Looking at relations grouped by the order of magnitude (oom) of their occurrence frequency, presented in Figure 4, we note that low-frequency relations seem not to be affected by the type regularizer, and are modeled better using only the instances themselves. The reason for this is that very low-frequency relations actually connect high-frequency entities, e.g. relation /award/hall_of_fame/discipline. On the other hand, high-frequency relations have overall lower results than other relations. The reason for this is that in numerous cases, one of the arguments of these relations is a low-frequency entity. For example, the lives_in relation that connects a person with the city they live in has as its "City" argument an entity that does not appear in many other relations.
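Grouping relations by the order of magnitude of their frequency can be sketched as below. This is a minimal illustration with hypothetical function names and made-up toy triples; it assumes the order of magnitude is the number of decimal digits of the count minus one.

```python
from collections import Counter, defaultdict


def order_of_magnitude(count):
    """Order of magnitude of a positive integer count (e.g. 2407 -> 3)."""
    if count < 1:
        raise ValueError("count must be a positive integer")
    return len(str(count)) - 1


def group_relations_by_oom(triples):
    """Map each order of magnitude to the relations with that frequency."""
    freq = Counter(relation for _, relation, _ in triples)
    groups = defaultdict(list)
    for relation, n in freq.items():
        groups[order_of_magnitude(n)].append(relation)
    return dict(groups)


# Toy data: relation "r1" occurs 12 times, relation "r2" occurs 3 times.
toy_triples = ([("s%d" % i, "r1", "t") for i in range(12)]
               + [("s%d" % i, "r2", "t") for i in range(3)])
groups = group_relations_by_oom(toy_triples)
```

The same grouping applies to entities by counting how often each entity appears as a source or target; evaluation metrics can then be averaged within each group.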
To further clarify the reasons for variation in performance of relations, we analyze the link prediction results based on the order of magnitude of entity frequency, presented in Figure 5. The results in this case are more in line with the expected outcome – links that involve lower-frequency entities have lower prediction results. The type information generally has a positive impact throughout, except for medium-range entities, where it seems that type information leads to over-generalization.
Using the type regularizer as an additional term whose weight can be calibrated using the parameter $\alpha$ makes it easier to adjust the influence of the type information based on node degrees and relation frequencies. Furthermore, incorporating the type information in the loss function for every relation, as opposed to having it as a separate relation in the knowledge graph, allows the incorporation of the range and domain information for each relation, as opposed to modelling the entity type outside of a particular environment.
It is interesting to note that the best results for medium to high frequency entities and relations are obtained when using the full training data and the type regularizer. This indicates that the type regularizer can mitigate the overfitting tendency of RESCAL, and produce a more robust model.
We proposed a type regularizer that leverages entity type information for state-of-the-art latent factor models like RESCAL. Experiments on the Freebase FB15K dataset suggest that adding the type regularizer improves performance on the knowledge base completion task. However, adding category information may not improve results for all relations, particularly those with fewer positive instances, where introducing category information may lead to embeddings that are biased towards capturing/predicting categories rather than fine-grained instances. We plan to study the impact of the added type information for datasets where the relations are not as strongly typed as in Freebase – for grammatical collocation information, for example, and inducing selectional preferences – and for more complex path prediction tasks.
- Bollacker et al. (2008) Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: A Collaboratively Created Graph Database for Structuring Human Knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data (SIGMOD ’08). ACM, New York, NY, USA, 1247–1250. https://doi.org/10.1145/1376616.1376746
- Bordes et al. (2013) Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating Embeddings for Modeling Multi-relational Data. In Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 2787–2795. http://papers.nips.cc/paper/5071-translating-embeddings-for-modeling-multi-relational-data.pdf
- Carlson et al. (2010) Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R. Hruschka, and Tom M. Mitchell. 2010. Toward an Architecture for Never-Ending Language Learning. In AAAI.
- Chang et al. (2014) Kai-Wei Chang, Wen-tau Yih, Bishan Yang, and Christopher Meek. 2014. Typed Tensor Decomposition of Knowledge Bases for Relation Extraction. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 1568–1579. https://doi.org/10.3115/v1/D14-1165
- Das et al. (2016) Rajarshi Das, Arvind Neelakantan, David Belanger, and Andrew McCallum. 2016. Chains of Reasoning over Entities, Relations, and Text using Recurrent Neural Networks. arXiv preprint arXiv:1607.01426 (2016).
- Gardner and Mitchell (2015) Matt Gardner and Tom Mitchell. 2015. Efficient and Expressive Knowledge Base Completion Using Subgraph Feature Extraction. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 1488–1498. https://doi.org/10.18653/v1/D15-1173
- Guu et al. (2015) Kelvin Guu, John Miller, and Percy Liang. 2015. Traversing Knowledge Graphs in Vector Space. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 318–327. https://doi.org/10.18653/v1/D15-1038
- Kingma and Ba (2014) Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
- Lin et al. (2015) Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. 2015. Learning Entity and Relation Embeddings for Knowledge Graph Completion. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI’15). AAAI Press, 2181–2187. http://dl.acm.org/citation.cfm?id=2886521.2886624
- Min et al. (2013) Bonan Min, Ralph Grishman, Li Wan, Chang Wang, and David Gondek. 2013. Distant Supervision for Relation Extraction with an Incomplete Knowledge Base. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 777–782. http://aclweb.org/anthology/N13-1095
- Neelakantan et al. (2015) Arvind Neelakantan, Benjamin Roth, and Andrew McCallum. 2015. Compositional Vector Space Models for Knowledge Base Completion. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, 156–166. https://doi.org/10.3115/v1/P15-1016
- Nickel et al. (2016) Maximilian Nickel, Lorenzo Rosasco, and Tomaso Poggio. 2016. Holographic Embeddings of Knowledge Graphs. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI’16). AAAI Press, 1955–1961. http://dl.acm.org/citation.cfm?id=3016100.3016172
- Nickel et al. (2012) Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. 2012. Factorizing YAGO: Scalable Machine Learning for Linked Data. In Proceedings of the 21st International Conference on World Wide Web (WWW ’12). ACM, New York, NY, USA, 271–280. https://doi.org/10.1145/2187836.2187874
- Riedel et al. (2013) Sebastian Riedel, Limin Yao, Andrew McCallum, and M. Benjamin Marlin. 2013. Relation Extraction with Matrix Factorization and Universal Schemas. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 74–84. http://aclweb.org/anthology/N13-1008
- Socher et al. (2013) Richard Socher, Danqi Chen, Christopher D Manning, and Andrew Ng. 2013. Reasoning With Neural Tensor Networks for Knowledge Base Completion. In Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 926–934. http://papers.nips.cc/paper/5028-reasoning-with-neural-tensor-networks-for-knowledge-base-completion.pdf
- Suchanek et al. (2007) Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. Yago: A Core of Semantic Knowledge. In Proceedings of the 16th International Conference on World Wide Web (WWW ’07). ACM, New York, NY, USA, 697–706. https://doi.org/10.1145/1242572.1242667
- Toutanova et al. (2016) Kristina Toutanova, Victoria Lin, Wen-tau Yih, Hoifung Poon, and Chris Quirk. 2016. Compositional Learning of Embeddings for Relation Paths in Knowledge Base and Text. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 1434–1444. https://doi.org/10.18653/v1/P16-1136
- Trouillon et al. (2017) Théo Trouillon, Christopher R Dance, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. 2017. Knowledge Graph Completion via Complex Tensor Factorization. arXiv preprint arXiv:1702.06879 (2017).