kbgan: Adversarial Learning for Knowledge Graph Embeddings
Abstract
We introduce kbgan, an adversarial learning framework that improves the performance of a wide range of existing knowledge graph embedding models. Because knowledge graph datasets typically contain only positive facts, sampling useful negative training examples is a nontrivial task. Replacing the head or tail entity of a fact with a uniformly randomly selected entity is the conventional way of generating negative facts in many previous works, but most negative facts generated this way are easily discriminated from positive facts and contribute little to training. Inspired by generative adversarial networks (gans), we use one knowledge graph embedding model as a negative sample generator to assist the training of our desired model, which acts as the discriminator in gans. The objective of the generator is to generate difficult negative samples that maximize their likeliness as determined by the discriminator, while the discriminator minimizes its training loss. This framework is independent of the concrete form of the generator and the discriminator, and can therefore use a wide variety of knowledge graph embedding models as its building blocks. In experiments, we adversarially train two translation-based models, TransE and TransD, each with assistance from one of two probability-based models, DistMult and ComplEx. We evaluate kbgan on the link prediction task using three knowledge base completion datasets: FB15k-237, WN18 and WN18RR. Experimental results show that adversarial training substantially improves the performance of the target embedding models under various settings.
Liwei Cai Department of Electronic Engineering Tsinghua University Beijing 100084 China cai.lw123@gmail.com William Yang Wang Department of Computer Science University of California, Santa Barbara Santa Barbara, CA 93106 USA william@cs.ucsb.edu
1 Introduction
A knowledge graph (Dong et al., 2014) is a powerful graph structure that can give users direct access to knowledge through applications such as structured search, question answering, and intelligent virtual assistants. A common representation of knowledge graph beliefs is a discrete relational triple such as LocatedIn(NewOrleans,Louisiana).
A main challenge in using a discrete representation of a knowledge graph is that it offers no way to assess the similarities among different entities and relations. Knowledge graph embedding (KGE) techniques (e.g., rescal (Nickel et al., 2011), TransE (Bordes et al., 2013), DistMult (Yang et al., 2015), and ComplEx (Trouillon et al., 2016)) have been proposed in recent years to deal with this issue. The main idea is to represent the entities and relations in a vector space, and to use machine learning techniques to learn a continuous representation of the knowledge graph in the latent space.
However, even though steady progress has been made in developing novel algorithms for knowledge graph embedding, there is still a common challenge in this line of research. For space efficiency, common knowledge graphs such as Freebase (Bollacker et al., 2008), Yago (Suchanek et al., 2007), and NELL (Mitchell et al., 2015) by default store only beliefs, rather than disbeliefs. Therefore, when training the embedding models, only positive examples are naturally present. To obtain negative examples, a common method is to remove the correct tail entity and sample a replacement from a uniform distribution (Bordes et al., 2013). Unfortunately, this approach is not ideal, because the sampled entity could be completely unrelated to the head and the target relation, and thus the quality of randomly generated negative examples is often poor (e.g., LocatedIn(NewOrleans,BarackObama)). Other approaches leverage external ontological constraints such as entity types (Krompaß et al., 2015) to generate negative examples, but such resources do not always exist or are not always accessible.
Table 1: Score functions and numbers of parameters of the knowledge graph embedding models considered.
Model  Score function  Number of parameters
TransE
TransD
DistMult
ComplEx
TransH
TransR
ManifoldE (hyperplane)
rescal
HolE  (uses circular correlation)
ConvE
In this work, we provide a generic solution to improve the training of a wide range of knowledge graph embedding models. Inspired by recent advances in generative adversarial deep models (Goodfellow et al., 2014), we propose a novel adversarial learning framework, namely kbgan, for generating better negative examples to train knowledge graph embedding models. More specifically, we use probability-based, log-loss embedding models as the generator to supply better-quality negative examples, and distance-based, margin-loss embedding models as the discriminator to produce the final knowledge graph embeddings. Since the generator has a discrete generation step, we cannot directly use a gradient-based approach to backpropagate the errors. We therefore consider a one-step reinforcement learning setting, and use a variance-reduced reinforce method to achieve this goal. Empirically, we perform experiments on three common KGE datasets (FB15k-237, WN18 and WN18RR), and verify the adversarial learning approach with a set of distance-based and probability-based KGE models. Our experiments show that, across various settings, this adversarial learning mechanism can significantly improve the performance of some of the most commonly used translation-based KGE methods. Our contributions are threefold:

We are the first to consider adversarial learning to generate useful negative training examples to improve knowledge graph embedding.

This adversarial learning framework applies to a wide range of probability-based and distance-based KGE models, without the need for external ontological constraints.

Our method shows consistent performance gains on three commonly used KGE datasets.
In the next section, we outline related work on knowledge base embedding and generative adversarial models. Then, we focus on introducing the technical details and algorithms of our adversarial learning approach. Next, we will describe our experimental settings, and show the results. Finally, we will discuss the results and conclude.
2 Related Work
2.1 Knowledge Graph Embeddings
A large number of knowledge graph embedding models, which represent entities and relations in a knowledge graph with vectors or matrices, have been proposed in recent years. rescal (Nickel et al., 2011) is one of the earliest studies on matrix-factorization-based knowledge graph embedding models, using a bilinear form as its score function. TransE (Bordes et al., 2013) is the first model to introduce translation-based embedding. Later variants, such as TransH (Wang et al., 2014), TransR (Lin et al., 2015) and TransD (Ji et al., 2015), extend TransE by projecting the embedding vectors of entities into various spaces. DistMult (Yang et al., 2015) simplifies rescal by using only a diagonal matrix, and ComplEx (Trouillon et al., 2016) extends DistMult into the complex number field. Nickel et al. (2015) provide a comprehensive survey of these models.
Some of the more recent models utilize more advanced mathematical operations, and generally produce better results in experiments. ManifoldE (Xiao et al., 2016) embeds a triple as a manifold rather than a point. HolE (Nickel et al., 2016) employs circular correlation to combine the two entities in a triple. ConvE (Dettmers et al., 2017) uses a convolutional neural network as the score function. However, most of these studies use uniform sampling to generate negative training examples (Bordes et al., 2013). Because our framework is independent of the concrete form of models, all these models can be potentially incorporated into our framework, regardless of the complexity. As a proof of principle, our work focuses on simpler models. Table 1 summarizes the score functions and dimensions of all models mentioned above.
2.2 Generative Adversarial Networks and their Variants
Generative Adversarial Networks (gans) (Goodfellow et al., 2014) were originally proposed for training generative models that generate samples in a continuous space, such as images. A gan consists of two parts, the generator and the discriminator. The generator accepts a noise input and outputs an image. The discriminator is a classifier which classifies images as “true” (from the ground truth set) or “fake” (generated by the generator). When training a gan, the generator and the discriminator play a minimax game, in which the generator tries to generate “real” images to deceive the discriminator, and the discriminator tries to tell them apart from ground truth images. With some minor modification, gans are also capable of generating samples that satisfy certain requirements, as in conditional gan (Mirza and Osindero, 2014).
It is not possible to use gans in their original form for generating discrete samples like natural language sentences or knowledge graph triples, because the discrete sampling step prevents gradients from propagating back to the generator. SeqGan (Yu et al., 2017) is one of the first successful solutions to this problem: it uses reinforcement learning, training the generator with policy gradient and other tricks. irgan (Wang et al., 2017) is a recent work which combines two categories of information retrieval models into a discrete gan framework. Likewise, our framework relies on policy gradient to train the generator, which provides discrete negative triples.
The discriminator in a gan is not necessarily a classifier. Wasserstein gan or wgan (Arjovsky et al., 2017) uses a regressor with clipped parameters as its discriminator, based on solid analysis of the mathematical nature of gans. GoGan (Juefei-Xu et al., 2017) further replaces the loss function in wgan with a marginal loss. Although it originates from a very different field, the form of the loss function in our framework turns out to be more closely related to the one in GoGan.
3 Our Approaches
In this section, we first define two types of knowledge graph embedding models to show how kbgan can be applied. Then, we demonstrate a long-overlooked problem with negative sampling, which motivates us to propose kbgan. Finally, we dive into the mathematical and algorithmic details of kbgan.
3.1 Types of Knowledge Graph Embedding Models
For a given knowledge graph, let $\mathcal{E}$ be the set of entities, $\mathcal{R}$ be the set of relations, and $\mathcal{T}$ be the set of ground truth triples. In general, a knowledge graph embedding (KGE) model can be formulated as a score function $f(h,r,t)$, with $h,t\in\mathcal{E}$ and $r\in\mathcal{R}$, which assigns a score to every possible triple in the knowledge graph. The estimated likelihood that a triple is true depends only on its score given by the score function.
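To make the notion of a score function concrete, the following minimal sketch implements two of the models named in this paper in their commonly stated forms: the TransE L1 distance and the DistMult trilinear product. The embedding values are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
from typing import Sequence

Vector = Sequence[float]

def transe_score(h: Vector, r: Vector, t: Vector) -> float:
    """L1 distance ||h + r - t||_1; a smaller value means a more plausible triple."""
    return sum(abs(hi + ri - ti) for hi, ri, ti in zip(h, r, t))

def distmult_score(h: Vector, r: Vector, t: Vector) -> float:
    """Trilinear product sum_i h_i * r_i * t_i; a larger value means more plausible."""
    return sum(hi * ri * ti for hi, ri, ti in zip(h, r, t))

# Hypothetical 2-dimensional embeddings (assumed values, not learned ones).
h, r, t = [1.0, 0.0], [0.5, 1.0], [1.5, 1.0]
print(transe_score(h, r, t))    # 0.0
print(distmult_score(h, r, t))  # 0.75
```

Note the opposite orientation of the two scores: a distance is better when small, a probability-based score is better when large.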
Different models formulate their score functions based on different designs, and therefore interpret scores differently, which in turn leads to various training objectives. Based on the underlying mathematical models, we can roughly divide KGE models into two groups:

Distance-based models include a large group of models called translation-based models, such as TransE, TransH, TransR and TransD, as well as some other models such as ManifoldE. Models in this category formulate the score function $f(h,r,t)$ as a distance in the embedding space. When comparing multiple triples, the one with the smaller distance is considered more likely to be true. However, the distance alone does not have a specific meaning. The objective of training a distance-based model is typically to minimize the following marginal loss:
$$L_D = \sum_{(h,r,t)\in\mathcal{T}} [f(h,r,t) - f(h',r,t') + \gamma]_+ \tag{1}$$
where $\gamma$ is the margin, $[\cdot]_+ = \max(0,\cdot)$ is the hinge function, and $(h',r,t')$ is a negative triple. The negative triple is generated by replacing the head entity or the tail entity of a positive triple with a random entity in the knowledge graph, or formally $(h',r,t') \in \{(h',r,t) \mid h'\in\mathcal{E}\} \cup \{(h,r,t') \mid t'\in\mathcal{E}\}$.

Probability-based models include most of the remaining models. Some notable examples are rescal, DistMult, ComplEx, HolE, and ConvE. These models do not have a unified mathematical formulation for their score functions, but they all link the score of a triple to the probability that the triple is true. There are two common formulations:

Pointwise probability indicates the probability that a triple is the best one among a group of triples. Applying the softmax function to the scores of these triples gives the probability: $p(h,r,t) = \frac{\exp f(h,r,t)}{\sum_{(h',r,t')} \exp f(h',r,t')}$, where the sum runs over the triples in the group. Most models using this formulation are trained by minimizing the following negative log-likelihood loss:
$$L_p = \sum_{(h,r,t)\in\mathcal{T}} -\log \frac{\exp f(h,r,t)}{\sum_{(h',r,t')\in\{(h,r,t)\}\cup \mathrm{Neg}(h,r,t)} \exp f(h',r,t')} \tag{2}$$
where $\mathrm{Neg}(h,r,t) \subset \{(h',r,t) \mid h'\in\mathcal{E}\} \cup \{(h,r,t') \mid t'\in\mathcal{E}\}$ is a set of sampled corrupted triples.

Triplewise probability indicates the probability that a triple is true, independent of other triples. Applying the sigmoid function to the score of this triple gives the probability: $p(h,r,t) = \sigma(f(h,r,t))$. Most models using this formulation are trained by minimizing the following binary cross-entropy loss:
$$L_t = \sum_{(h,r,t)\in\mathcal{T}} -\log \sigma(f(h,r,t)) - \sum_{(h',r,t')\in \mathrm{Neg}(\mathcal{T})} \log\bigl(1 - \sigma(f(h',r,t'))\bigr) \tag{3}$$
where $\mathrm{Neg}(\mathcal{T})$ is the collection of sampled corrupted triples for every positive triple.
Most probability-based models can use either of these two formulations without modifying the score function. However, in this paper we use only the first one, because the pointwise form yields an explicit probability distribution over a set of candidate negative triples, which is necessary for the generator of a discrete gan.
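The three training objectives above can be sketched for a single positive triple as follows. This is an illustrative sketch, not the authors' code: the function names are hypothetical, and each function takes already-computed scores (distances for the margin loss, probability-based scores for the other two).

```python
import math
from typing import Sequence

def margin_loss(pos_dist: float, neg_dist: float, gamma: float) -> float:
    """Equation (1) for one pair: [f(pos) - f(neg) + gamma]_+ on distances."""
    return max(0.0, pos_dist - neg_dist + gamma)

def softmax_nll(pos_score: float, neg_scores: Sequence[float]) -> float:
    """Equation (2) for one positive: -log softmax over the positive and its negatives."""
    scores = [pos_score, *neg_scores]
    top = max(scores)  # subtract the maximum for numerical stability
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_z - pos_score

def sigmoid_bce(pos_score: float, neg_scores: Sequence[float]) -> float:
    """Equation (3) for one positive triple and its sampled negatives."""
    def log_sigmoid(x: float) -> float:
        # Numerically stable log(sigmoid(x)).
        return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))
    # log(1 - sigmoid(x)) equals log(sigmoid(-x)).
    return -log_sigmoid(pos_score) - sum(log_sigmoid(-s) for s in neg_scores)

print(margin_loss(1.0, 2.0, 3.0))  # 2.0: the negative is not yet far enough away
print(margin_loss(1.0, 5.0, 3.0))  # 0.0: the margin is already satisfied
```

The second call shows why easy negatives are uninformative for a margin-based model: once the negative is farther than the margin, the loss and its gradient vanish.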

3.2 Weakness of Uniform Negative Sampling
Most previous KGE models use uniform negative sampling to generate negative triples, that is, they replace the head or tail entity of a positive triple with any of the entities in $\mathcal{E}$, all with equal probability. Most of the negative triples generated in this way contribute little to learning an effective embedding, because they are too obviously false.
To demonstrate this issue, let us consider the following example. Suppose we have a ground truth triple LocatedIn(NewOrleans,Louisiana), and corrupt it by replacing its tail entity. First, we remove the tail entity, leaving LocatedIn(NewOrleans,?). Because the relation LocatedIn constrains the types of its entities, “?” must be a geographical region. If we fill “?” with a random entity $e \in \mathcal{E}$, the probability of it having the wrong type is very high, resulting in ridiculous triples like LocatedIn(NewOrleans,BarackObama) or LocatedIn(NewOrleans,StarTrek). Such triples are considered “too easy”, because they can be eliminated solely by types. In contrast, LocatedIn(NewOrleans,Florida) is a very useful negative triple, because it satisfies the type constraints, but it cannot be proved wrong without detailed knowledge of American geography. If a KGE model is fed mostly “too easy” negative examples, it will probably only learn to represent types, not the underlying semantics.
The problem is less severe for probability-based models, because they typically sample tens or hundreds of negative triples for each positive triple, and a few useful negatives are likely to be among them. For instance, Trouillon et al. (2016) found that a 100:1 negative-to-positive ratio results in the best performance for ComplEx. However, for distance-based models, which can only use one negative for each positive, the low quality of uniformly sampled negatives can seriously damage their performance.
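The effect can be illustrated on a toy example. The entity inventory and the type labels below are hypothetical and hand-assigned (they come from no dataset); the sketch enumerates every tail corruption that uniform sampling could produce and counts how many survive a type check.

```python
# Hypothetical toy entity inventory with hand-assigned types (illustration only).
ENTITY_TYPES = {
    "Louisiana": "region", "Florida": "region", "Texas": "region",
    "NewOrleans": "region",
    "BarackObama": "person", "Einstein": "person",
    "StarTrek": "work", "Tsinghua": "organization",
    "Python": "language", "Jazz": "genre",
}

def corrupt_tails(triple, entities):
    """All tail corruptions of a triple; uniform sampling picks each with equal probability."""
    head, rel, tail = triple
    return [(head, rel, e) for e in entities if e != tail]

positive = ("NewOrleans", "LocatedIn", "Louisiana")
negatives = corrupt_tails(positive, ENTITY_TYPES)
# "Hard" negatives satisfy the type constraint of LocatedIn (the tail is a region).
hard = [n for n in negatives if ENTITY_TYPES[n[2]] == "region"]
easy_fraction = 1 - len(hard) / len(negatives)
print(len(negatives), len(hard), round(easy_fraction, 2))  # 9 3 0.67
```

Even in this small inventory two thirds of the uniform negatives can be rejected by type alone; in a real knowledge graph with many entity types the proportion would typically be far higher.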
3.3 Generative Adversarial Training for Knowledge Graph Embedding Models
Inspired by gans, we propose an adversarial training framework named kbgan, which uses a pointwise probability-based KGE model to provide high-quality negative samples for the training of a distance-based KGE model. This framework is independent of the score functions of these two models, and is therefore fairly general. Figure 1 illustrates the overall structure of kbgan.
In parallel to the terminology used in the gan literature, we will simply call these two models the generator and the discriminator respectively in the rest of this paper. We use pointwise probability-based models as the generator because they can adequately model the “sampling from a probability distribution” process of discrete gans, and we aim at improving distance-based discriminators because they can benefit more from high-quality negative samples. Note that a major difference between gans and our work is that the ultimate goal of our framework is to produce a good discriminator, whereas gans are aimed at training a good generator. In addition, the discriminator here is not a classifier as it would be in most gans.
Intuitively, the distance-based discriminator should assign a relatively small distance to a high-quality negative sample. In order to encourage the generator to generate useful negative samples, the objective of the generator is to minimize the distance given by the discriminator for its generated triples. And just as when training a distance-based model normally, the objective of the discriminator is to minimize the marginal loss between the positive triple and the generated negative triple. In an adversarial training setting, the generator and the discriminator are alternately trained towards their respective objectives.
Suppose that the generator produces a probability distribution on negative triples $p_G(h',r,t' \mid h,r,t)$ given a positive triple $(h,r,t)$, and generates negative triples $(h',r,t')$ by sampling from this distribution. Let $f_D(h,r,t)$ be the score function of the discriminator. The objective of the discriminator can be formulated as minimizing the following marginal loss function:
$$L_D = \sum_{(h,r,t)\in\mathcal{T}} [f_D(h,r,t) - f_D(h',r,t') + \gamma]_+, \quad (h',r,t') \sim p_G(h',r,t' \mid h,r,t) \tag{4}$$
Except for drawing negative samples from the generator, this objective is identical to that of a conventional distance-based model, and it can be optimized by any gradient-based algorithm.
Table 2: Hyperparameters, constraints and regularizations used in the experiments.
Model  Hyperparameters  Constraints or Regularizations
TransE  L1 distance
TransD  L1 distance
DistMult    L2 regularization
ComplEx    L2 regularization
Table 3: Statistics of the datasets used in the experiments.
Dataset  #rel.  #ent.  #train  #valid  #test
FB15k-237  237  14,541  272,115  17,535  20,466
WN18  18  40,943  141,442  5,000  5,000
WN18RR  11  40,943  86,835  3,034  3,134
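The objective of the generator can be formulated as maximizing the following expectation of negative distances:
$$R_G = \sum_{(h,r,t)\in\mathcal{T}} \mathbb{E}\,[-f_D(h',r,t')], \quad (h',r,t') \sim p_G(h',r,t' \mid h,r,t) \tag{5}$$
$R_G$ involves a discrete sampling step, so it cannot be directly optimized by gradient-based algorithms. However, we notice that the specific form of this objective, maximizing the expectation of a given function of samples from a parametrized probability distribution, is identical to the objective of a one-step reinforcement learning problem. Using the terminology of reinforcement learning to paraphrase the problem, $(h,r,t)$ is the state, $p_G(h',r,t' \mid h,r,t)$ is the policy, $(h',r,t')$ is the action, and $-f_D(h',r,t')$ is the reward. By applying the policy gradient theorem (Sutton et al., 2000), we obtain the gradient of $R_G$ with respect to the parameters of the generator:
$$\nabla_G R_G = \sum_{(h,r,t)\in\mathcal{T}} \mathbb{E}_{(h',r,t')\sim p_G(h',r,t' \mid h,r,t)} \bigl[-f_D(h',r,t')\, \nabla_G \log p_G(h',r,t' \mid h,r,t)\bigr] \simeq \sum_{(h,r,t)\in\mathcal{T}} \frac{1}{N} \sum_{i=1}^{N} \bigl[-f_D(h'_i,r,t'_i)\, \nabla_G \log p_G(h'_i,r,t'_i \mid h,r,t)\bigr], \quad (h'_i,r,t'_i) \sim p_G(h',r,t' \mid h,r,t) \tag{6}$$
where the second, approximate equality means that in practice we approximate the expectation with $N$ samples. Now we can calculate the gradient of $R_G$ and optimize it with gradient-based algorithms. This is known as the reinforce algorithm (Williams, 1992) in reinforcement learning.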
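The sampled gradient estimate can be checked numerically on a small example. The sketch below makes a simplifying assumption: the generator's parameters are taken to be the candidate scores themselves, so that the gradient of log-softmax has a closed form. The score and reward values are hypothetical.

```python
import math
import random

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def exact_gradient(gen_scores, rewards):
    """Exact d E[reward] / d score_j = p_j * (reward_j - E[reward])."""
    p = softmax(gen_scores)
    mean_r = sum(pi * ri for pi, ri in zip(p, rewards))
    return [pi * (ri - mean_r) for pi, ri in zip(p, rewards)]

def reinforce_gradient(gen_scores, rewards, n_samples, rng):
    """Monte Carlo estimate: average of reward_i * d log p_i / d score over samples."""
    p = softmax(gen_scores)
    grad = [0.0] * len(p)
    for _ in range(n_samples):
        i = rng.choices(range(len(p)), weights=p)[0]
        for j in range(len(p)):
            # d log softmax_i / d score_j = 1[i == j] - p_j
            grad[j] += rewards[i] * ((1.0 if i == j else 0.0) - p[j]) / n_samples
    return grad

gen_scores = [0.0, 1.0, 2.0]   # hypothetical generator scores f_G of three candidates
rewards = [-3.0, -1.0, -0.5]   # hypothetical rewards -f_D of the same candidates
print(exact_gradient(gen_scores, rewards))
print(reinforce_gradient(gen_scores, rewards, 20000, random.Random(0)))
```

With enough samples the two printed vectors should agree closely, which is the sense in which the sampled sum in Equation (6) approximates the expectation.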
To reduce the variance of the reinforce algorithm, we can subtract a baseline from the reward. The baseline is an arbitrary function that depends only on the state, so subtracting it does not affect the expectation of the gradients. In our case, we replace $-f_D(h',r,t')$ with $-f_D(h',r,t') - b(h,r,t)$ in the equation above to introduce the baseline. To avoid introducing new parameters, we simply let $b$ be a constant, the average reward over the whole training set: $b = \frac{1}{|\mathcal{T}|}\sum_{(h,r,t)\in\mathcal{T}} \mathbb{E}_{(h',r,t')\sim p_G(h',r,t' \mid h,r,t)}[-f_D(h',r,t')]$. In practice, $b$ is approximated by the mean of the rewards of recently generated negative triples.
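A two-candidate example shows both properties of the baseline exactly: the expected gradient is unchanged, while the variance of the single-sample estimate drops. As a simplifying assumption, the gradient is taken with respect to one candidate score; the probabilities, rewards, and the window size of `RunningBaseline` are hypothetical choices.

```python
def estimator_stats(probs, rewards, baseline, j=0):
    """Exact mean and variance of the single-sample reinforce gradient w.r.t. score j."""
    # A sampled candidate i contributes (reward_i - baseline) * (1[i == j] - p_j).
    terms = [(r - baseline) * ((1.0 if i == j else 0.0) - probs[j])
             for i, r in enumerate(rewards)]
    mean = sum(p * g for p, g in zip(probs, terms))
    var = sum(p * g * g for p, g in zip(probs, terms)) - mean * mean
    return mean, var

class RunningBaseline:
    """Mean of the most recent rewards; the window size is an assumed choice."""
    def __init__(self, window=1000):
        self.window = window
        self.rewards = []

    def update(self, reward):
        self.rewards.append(reward)
        self.rewards = self.rewards[-self.window:]

    @property
    def value(self):
        return sum(self.rewards) / len(self.rewards) if self.rewards else 0.0

probs = [0.5, 0.5]       # hypothetical p_G over two candidates
rewards = [-1.0, -3.0]   # hypothetical rewards -f_D
print(estimator_stats(probs, rewards, baseline=0.0))   # (0.5, 1.0)
print(estimator_stats(probs, rewards, baseline=-2.0))  # (0.5, 0.0)
```

Here the baseline equals the mean reward of -2, and the variance vanishes entirely; in general a constant baseline reduces the variance without necessarily eliminating it.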
Recall that the generator is a pointwise probability-based KGE model. Let its score function be $f_G(h,r,t)$. Given a set of candidate negative triples $\mathrm{Neg}(h,r,t) \subset \{(h',r,t) \mid h'\in\mathcal{E}\} \cup \{(h,r,t') \mid t'\in\mathcal{E}\}$, the probability distribution $p_G$ is modeled as:
$$p_G(h',r,t' \mid h,r,t) = \frac{\exp f_G(h',r,t')}{\sum_{(h^*,r,t^*)\in \mathrm{Neg}(h,r,t)} \exp f_G(h^*,r,t^*)}, \quad (h',r,t') \in \mathrm{Neg}(h,r,t) \tag{7}$$
Ideally, $\mathrm{Neg}(h,r,t)$ should contain all possible negatives. However, knowledge graphs are usually highly incomplete, so the “hardest” negative triples are very likely to be false negatives (true facts). To address this issue, we instead generate $\mathrm{Neg}(h,r,t)$ by uniformly sampling $N_s$ entities (a small number compared to the number of all possible negatives) from $\mathcal{E}$ to replace $h$ or $t$. Because true negatives far outnumber false negatives, such a set is unlikely to contain any false negative, and the negative selected by the generator is likely to be a true negative. Using a small $N_s$ also significantly reduces computational complexity.
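The two-stage sampling described above (a uniform draw of a small candidate set, then a softmax draw from the generator) can be sketched as follows. All function names are hypothetical, `toy_score` merely stands in for a trained generator score function, and the sketch does not filter out the positive triple itself if it happens to be drawn.

```python
import math
import random

def sample_candidates(triple, entities, n_s, replace_head, rng):
    """Build the candidate set: n_s uniform corruptions of the head or the tail."""
    h, r, t = triple
    picked = [rng.choice(entities) for _ in range(n_s)]
    return [(e, r, t) if replace_head else (h, r, e) for e in picked]

def generator_distribution(candidates, gen_score):
    """Equation (7): softmax of the generator scores over the candidate set."""
    scores = [gen_score(*c) for c in candidates]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def sample_negative(triple, entities, n_s, gen_score, rng, replace_head=False):
    """Draw one negative triple from the generator, with its probability."""
    candidates = sample_candidates(triple, entities, n_s, replace_head, rng)
    probs = generator_distribution(candidates, gen_score)
    index = rng.choices(range(n_s), weights=probs)[0]
    return candidates[index], probs[index]

# Hypothetical entities and a stand-in score function (not a trained model).
entities = [f"e{i}" for i in range(50)]
toy_score = lambda h, r, t: float(int(t[1:]) % 5)
negative, prob = sample_negative(("e0", "rel", "e1"), entities, 20, toy_score,
                                 random.Random(1))
print(negative, prob)
```

Because only `n_s` scores are evaluated per positive triple, the cost of one generator step is independent of the total number of entities.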
Besides, we adopt the “bern” sampling technique (Wang et al., 2014), which replaces the “1” side in “1-to-N” and “N-to-1” relations with higher probability, to further reduce false negatives.
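As a sketch of how “bern” sampling chooses which side to corrupt, the following follows the description in Wang et al. (2014): for each relation, the head is replaced with probability tph / (tph + hpt), where tph is the average number of tails per head and hpt the average number of heads per tail. The function name and the toy triples are hypothetical.

```python
from collections import defaultdict

def bern_head_probability(triples):
    """Per relation, the probability of corrupting the head: tph / (tph + hpt)."""
    tails_of = defaultdict(set)   # (relation, head) -> set of tails
    heads_of = defaultdict(set)   # (relation, tail) -> set of heads
    for h, r, t in triples:
        tails_of[(r, h)].add(t)
        heads_of[(r, t)].add(h)
    probs = {}
    for r in {rel for _, rel, _ in triples}:
        tail_counts = [len(v) for (rel, _), v in tails_of.items() if rel == r]
        head_counts = [len(v) for (rel, _), v in heads_of.items() if rel == r]
        tph = sum(tail_counts) / len(tail_counts)   # average tails per head
        hpt = sum(head_counts) / len(head_counts)   # average heads per tail
        probs[r] = tph / (tph + hpt)
    return probs

# Toy N-to-1 relation: three cities located in the same state.
triples = [
    ("NewOrleans", "LocatedIn", "Louisiana"),
    ("BatonRouge", "LocatedIn", "Louisiana"),
    ("Shreveport", "LocatedIn", "Louisiana"),
]
print(bern_head_probability(triples))  # {'LocatedIn': 0.25}
```

For this N-to-1 relation the head is corrupted with probability 0.25, so the tail (the “1” side) is corrupted with probability 0.75. A random city swapped into the head slot may well be another true fact, whereas a random replacement for the single correct state rarely is.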
Algorithm 1 summarizes the whole adversarial training process. Both the generator and the discriminator require pretraining, which is the same as conventionally training a single KGE model with uniform negative sampling. Formally speaking, one can pretrain the discriminator by minimizing the loss function defined in Equation (1), and pretrain the generator by minimizing the loss function defined in Equation (2). Line 14 in the algorithm assumes vanilla gradient descent as the optimization method, but one can substitute any gradient-based optimization algorithm.
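The adversarial loop can be sketched end to end in plain Python. This is a minimal illustration under stated assumptions, not a reproduction of Algorithm 1: it uses a TransE-style discriminator and a DistMult-style generator, corrupts tails only, updates on one triple at a time with vanilla gradient steps, starts from random rather than pretrained embeddings, and uses assumed values for the margin, learning rate, and baseline window. All names are hypothetical.

```python
import math
import random

def transe(E, R, h, r, t):
    """Discriminator distance ||h + r - t||_1 (smaller = more plausible)."""
    return sum(abs(a + b - c) for a, b, c in zip(E[h], R[r], E[t]))

def distmult(E, R, h, r, t):
    """Generator score sum_i h_i * r_i * t_i (larger = more plausible)."""
    return sum(a * b * c for a, b, c in zip(E[h], R[r], E[t]))

def sign(x):
    return (x > 0) - (x < 0)

def kbgan_step(pos, n_ent, D, G, baseline, gamma=3.0, n_s=20, lr=0.01, rng=random):
    """One adversarial update for one positive triple (tail corruption only)."""
    (DE, DR), (GE, GR) = D, G
    h, r, t = pos
    dim = len(DE[h])
    # Generator proposes a negative: softmax over n_s uniformly drawn tails.
    cands = [rng.randrange(n_ent) for _ in range(n_s)]
    scores = [distmult(GE, GR, h, r, c) for c in cands]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    probs = [e / z for e in exps]
    k = rng.choices(range(n_s), weights=probs)[0]
    neg_t = cands[k]
    d_pos, d_neg = transe(DE, DR, h, r, t), transe(DE, DR, h, r, neg_t)
    reward = -d_neg
    # Generator update (reinforce with baseline): ascend
    # (reward - baseline) * grad log p_G(chosen), where
    # grad log p_k = grad f_G(cand_k) - sum_j p_j * grad f_G(cand_j).
    adv = reward - baseline
    ent_grad, rel_grad = {}, [0.0] * dim
    for j, c in enumerate(cands):
        w = (1.0 if j == k else 0.0) - probs[j]
        gh = ent_grad.setdefault(h, [0.0] * dim)
        gc = ent_grad.setdefault(c, [0.0] * dim)
        for i in range(dim):
            gh[i] += w * GR[r][i] * GE[c][i]
            gc[i] += w * GE[h][i] * GR[r][i]
            rel_grad[i] += w * GE[h][i] * GE[c][i]
    for e, g in ent_grad.items():
        for i in range(dim):
            GE[e][i] += lr * adv * g[i]
    for i in range(dim):
        GR[r][i] += lr * adv * rel_grad[i]
    # Discriminator update: one gradient step on [d_pos - d_neg + gamma]_+.
    loss = max(0.0, d_pos - d_neg + gamma)
    if loss > 0:
        s_pos = [sign(a + b - c) for a, b, c in zip(DE[h], DR[r], DE[t])]
        s_neg = [sign(a + b - c) for a, b, c in zip(DE[h], DR[r], DE[neg_t])]
        for i in range(dim):
            DE[h][i] -= lr * (s_pos[i] - s_neg[i])
            DR[r][i] -= lr * (s_pos[i] - s_neg[i])
            DE[t][i] += lr * s_pos[i]
            DE[neg_t][i] -= lr * s_neg[i]
    return reward, loss

rng = random.Random(0)
n_ent, n_rel, dim = 30, 2, 4

def init(n):
    return [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(n)]

D = (init(n_ent), init(n_rel))  # stands in for a pretrained TransE
G = (init(n_ent), init(n_rel))  # stands in for a pretrained DistMult
triples = [(rng.randrange(n_ent), rng.randrange(n_rel), rng.randrange(n_ent))
           for _ in range(50)]
baseline, history = 0.0, []
for epoch in range(5):
    for pos in triples:
        reward, loss = kbgan_step(pos, n_ent, D, G, baseline, rng=rng)
        history.append(reward)
        recent = history[-100:]
        baseline = sum(recent) / len(recent)  # mean of recent rewards
print(round(baseline, 3), loss >= 0.0)
```

A practical implementation would batch these updates, apply the norm constraints and regularizers of the chosen models, and use an adaptive optimizer instead of plain gradient steps.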
Table 4: Results of our experiments and baselines on the link prediction task (a dash marks a value that is not given).
Method  FB15k-237 MRR  FB15k-237 H@10  WN18 MRR  WN18 H@10  WN18RR MRR  WN18RR H@10
TransE  -  42.8  -  89.2  -  43.2
TransD  -  45.3  -  92.2  -  42.8
DistMult  24.1  41.9  82.2  93.6  42.5  49.1
ComplEx  24.0  41.9  94.1  94.7  44.4  50.7
TransE (pretrained)  24.2  42.2  43.3  91.5  18.6  45.9
kbgan (TransE + DistMult)  27.4  45.0  71.0  94.9  21.3  48.1
kbgan (TransE + ComplEx)  27.8  45.3  70.5  94.9  21.0  47.9
TransD (pretrained)  24.5  42.7  49.4  92.8  19.2  46.5
kbgan (TransD + DistMult)  27.8  45.8  77.2  94.8  21.4  47.2
kbgan (TransD + ComplEx)  27.7  45.8  77.9  94.8  21.5  46.9
4 Experiments
To evaluate our proposed framework, we test its performance on the link prediction task with different generators and discriminators. For the generator, we choose two classical probability-based KGE models, DistMult and ComplEx, and for the discriminator, we choose two classical distance-based KGE models, TransE and TransD, resulting in four possible combinations of generator and discriminator in total. See Table 1 for a brief summary of these models.
4.1 Experimental Settings
4.1.1 Datasets
We use three common knowledge base completion datasets for our experiments: FB15k-237, WN18 and WN18RR. FB15k-237 is a subset of FB15k introduced by Toutanova and Chen (2015), while FB15k itself is a subset of Freebase, which consists of a huge number of real-life facts. FB15k-237 removes redundant relations in FB15k and thereby greatly reduces the number of relations. Likewise, WN18RR is a subset of WN18 introduced by Dettmers et al. (2017) that removes reverse relations and dramatically increases the difficulty of reasoning, and WN18 itself is a subset of WordNet, which stores semantic relations between English words. Both FB15k and WN18 were first introduced by Bordes et al. (2013). Statistics of the datasets we used are shown in Table 3.
4.1.2 Evaluation Protocols
Following previous works such as Yang et al. (2015) and Trouillon et al. (2016), for each run we report two common metrics, mean reciprocal rank (MRR) and hits at 10 (H@10). We only report scores under the filtered setting (Bordes et al., 2013), which removes all triples that appear in the training, validation, and test sets from the candidate triples before obtaining the rank of the ground truth triple.
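The filtered metrics can be sketched as follows for tail prediction on a single query. The function names, entity names, and scores are hypothetical; `known_entities` stands for the other entities that form a true triple with the same head and relation in the training, validation, or test sets.

```python
def filtered_rank(scores, true_entity, known_entities, higher_is_better=True):
    """Rank of the true entity after removing other candidates known to be true."""
    true_score = scores[true_entity]
    rank = 1
    for entity, score in scores.items():
        if entity == true_entity or entity in known_entities:
            continue  # filtered setting: skip other triples known to be true
        better = score > true_score if higher_is_better else score < true_score
        rank += better
    return rank

def mrr(ranks):
    """Mean reciprocal rank over a list of ranks."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at(ranks, k=10):
    """Fraction of ranks that are at most k."""
    return sum(r <= k for r in ranks) / len(ranks)

# Hypothetical scores for the query Borders(Louisiana, ?), true test tail: Texas.
scores = {"Texas": 0.8, "Mississippi": 0.9, "Arkansas": 0.85, "Florida": 0.82, "Ohio": 0.1}
print(filtered_rank(scores, "Texas", set()))                        # 4 (raw rank)
print(filtered_rank(scores, "Texas", {"Mississippi", "Arkansas"}))  # 2 (filtered rank)
```

For distance-based models, where a smaller score is better, the same function can be called with `higher_is_better=False`.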
4.1.3 Implementation Details
We will make the kbgan source code publicly available. In the pretraining stage, we train every model for 1000 epochs, and divide every epoch into 100 mini-batches. To avoid overfitting, we adopt early stopping by evaluating MRR on the validation set every 50 epochs. We also manually select hyperparameters for the models according to their performance after pretraining. Due to limited computational resources, we deliberately limit the dimensions of the embeddings to save time, so our hyperparameters are likely suboptimal. We also apply certain constraints or regularizations to these models, which are mostly the same as those described in their original publications. Table 2 lists all hyperparameters, constraints and regularizations we used in the experiments.
In the adversarial training stage, we keep all the hyperparameters determined in the pretraining stage unchanged. The number of candidate negative triples, $N_s$, is set to 20 in all cases. We train for 5000 epochs, with 100 mini-batches per epoch. We also use early stopping in adversarial training by evaluating MRR on the validation set every 50 epochs.
We use the self-adaptive optimization method Adam (Kingma and Ba, 2015) for all training, and always use the recommended default setting $\alpha = 10^{-3}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.
4.2 Results
Results of our experiments as well as baselines are shown in Table 4. All settings of adversarial training bring a pronounced improvement to the model, which indicates that our method is consistently effective in various cases. TransE performs slightly worse than TransD on FB15k-237 and WN18, but better on WN18RR. Using DistMult or ComplEx as the generator does not affect performance greatly.
TransE and TransD enhanced by kbgan significantly beat their corresponding baseline implementations, and outperform stronger baselines in some cases. As this is a prototypical, proof-of-principle experiment, we never expected state-of-the-art results. Being simple models proposed several years ago, TransE and TransD have limitations in expressiveness that are unlikely to be fully compensated by a better training technique. Future research may incorporate more advanced models into kbgan, and we believe the framework has the potential to reach state-of-the-art performance.
To illustrate the training progress, we plot the performance of the discriminator on the validation set over epochs in Figure 2. As these graphs show, performance steadily increases and converges to its maximum as training proceeds, which indicates that kbgan is a robust gan that converges to good results in various settings, although gans are well known for being difficult to converge. Note that in some cases the curve is still rising after 5000 epochs. We do not have sufficient computational resources to train for more epochs, but we believe these runs would also eventually converge.
5 Conclusions
In this paper, we propose a novel adversarial learning method for improving a wide range of knowledge graph embedding models. More specifically, we designed a generator-discriminator framework with dual KGE components. The intuition is that, unlike random uniform sampling, the probability-based KGE model generates higher-quality negative examples, which allow the discriminator, a distance-based KGE model, to learn better. To enable backpropagation of error, we introduced a one-step reinforce method to seamlessly integrate the two modules. We tested the proposed ideas with four commonly used KGE models on three datasets, and the results show that the adversarial learning framework brings consistent improvements to various KGE models under different settings.
References
 Arjovsky et al. (2017) Martin Arjovsky, Soumith Chintala, and Leon Bottou. 2017. Wasserstein gan. In International Conference on Machine Learning.
 Bollacker et al. (2008) Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data. ACM, pages 1247–1250.
 Bordes et al. (2013) Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems. pages 2787–2795.
 Dettmers et al. (2017) Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. 2017. Convolutional 2d knowledge graph embeddings. arXiv preprint arXiv:1707.01476 .
 Dong et al. (2014) Xin Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao, Kevin Murphy, Thomas Strohmann, Shaohua Sun, and Wei Zhang. 2014. Knowledge vault: A web-scale approach to probabilistic knowledge fusion. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pages 601–610.
 Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems. pages 2672–2680.
 Ji et al. (2015) Guoliang Ji, Shizhu He, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Knowledge graph embedding via dynamic mapping matrix. In The 53rd Annual Meeting of the Association for Computational Linguistics.
 Juefei-Xu et al. (2017) Felix Juefei-Xu, Vishnu Naresh Boddeti, and Marios Savvides. 2017. Gang of gans: Generative adversarial networks with maximum margin ranking. arXiv preprint arXiv:1704.04865 .
 Kingma and Ba (2015) Diederik P. Kingma and Jimmy Lei Ba. 2015. Adam: A method for stochastic optimization. In The 3rd International Conference on Learning Representations.
 Krompaß et al. (2015) Denis Krompaß, Stephan Baier, and Volker Tresp. 2015. Type-constrained representation learning in knowledge graphs. In International Semantic Web Conference. Springer, pages 640–655.
 Lin et al. (2015) Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. 2015. Learning entity and relation embeddings for knowledge graph completion. In The Twenty-Ninth AAAI Conference on Artificial Intelligence. pages 2181–2187.
 Mirza and Osindero (2014) Mehdi Mirza and Simon Osindero. 2014. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784 .
 Mitchell et al. (2015) Tom M Mitchell, William Cohen, Estevam Hruschka, Partha Talukdar, Justin Betteridge, Andrew Carlson, Bhavana Dalvi Mishra, Matthew Gardner, Bryan Kisiel, Jayant Krishnamurthy, et al. 2015. Never-ending learning. In The Twenty-Ninth AAAI Conference on Artificial Intelligence.
 Nickel et al. (2015) Maximilian Nickel, Kevin Murphy, Volker Tresp, and Evgeniy Gabrilovich. 2015. A review of relational machine learning for knowledge graphs. arXiv preprint arXiv:1503.00759 .
 Nickel et al. (2016) Maximilian Nickel, Lorenzo Rosasco, and Tomaso Poggio. 2016. Holographic embeddings of knowledge graphs. In The Thirtieth AAAI Conference on Artificial Intelligence. pages 1955–1961.
 Nickel et al. (2011) Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. 2011. A three-way model for collective learning on multi-relational data. In Proceedings of the 28th International Conference on Machine Learning. pages 809–816.
 Suchanek et al. (2007) Fabian M Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. Yago: a core of semantic knowledge. In Proceedings of the 16th international conference on World Wide Web. ACM, pages 697–706.
 Sutton et al. (2000) Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. 2000. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems. pages 1057–1063.
 Toutanova and Chen (2015) Kristina Toutanova and Danqi Chen. 2015. Observed versus latent features for knowledge base and text inference. In Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality. pages 57–66.
 Trouillon et al. (2016) Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. 2016. Complex embeddings for simple link prediction. In International Conference on Machine Learning. pages 2071–2080.
 Wang et al. (2017) Jun Wang, Lantao Yu, Weinan Zhang, Yu Gong, Yinghui Xu, Benyou Wang, Peng Zhang, and Dell Zhang. 2017. Irgan: A minimax game for unifying generative and discriminative information retrieval models. In The 40th International ACM SIGIR Conference.
 Wang et al. (2014) Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. 2014. Knowledge graph embedding by translating on hyperplanes. In The Twenty-Eighth AAAI Conference on Artificial Intelligence. pages 1112–1119.
 Williams (1992) Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8(3-4):229–256.
 Xiao et al. (2016) Han Xiao, Minlie Huang, and Xiaoyan Zhu. 2016. From one point to a manifold: Knowledge graph embedding for precise link prediction. In The Twenty-Fifth International Joint Conference on Artificial Intelligence.
 Yang et al. (2015) Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. 2015. Embedding entities and relations for learning and inference in knowledge bases. The 3rd International Conference on Learning Representations .
 Yu et al. (2017) Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. 2017. Seqgan: Sequence generative adversarial nets with policy gradient. In The Thirty-First AAAI Conference on Artificial Intelligence. pages 2852–2858.