Knowledge Graph Convolutional Networks for Recommender Systems with Label Smoothness Regularization
Abstract.
Knowledge graphs capture interlinked information between entities and represent an attractive source of structured information that can be harnessed for recommender systems. However, existing recommender engines use knowledge graphs by manually designing features, do not allow for end-to-end training, or provide poor scalability. Here we propose Knowledge Graph Convolutional Networks (KGCN), an end-to-end trainable framework that harnesses item relationships captured by the knowledge graph to provide better recommendations. Conceptually, KGCN computes user-specific item embeddings by first applying a trainable function that identifies important knowledge graph relations for a given user and then transforming the knowledge graph into a user-specific weighted graph. Then, KGCN applies a graph convolutional neural network that computes an embedding of an item node by propagating and aggregating knowledge graph neighborhood information. Moreover, to provide better inductive bias, KGCN uses label smoothness (LS), which provides regularization over edge weights, and we prove that it is equivalent to a label propagation scheme on a graph. Finally, we unify KGCN and LS regularization, and present a scalable minibatch implementation for the KGCN-LS model. Experiments show that KGCN-LS outperforms strong baselines on four datasets. KGCN-LS also achieves strong performance in sparse scenarios and is highly scalable with respect to the knowledge graph size.
1. Introduction
Recommender systems are widely used in Internet applications and services to meet users’ personalized interests and alleviate the issue of information overload. Traditional collaborative filtering recommender algorithms (Koren et al., 2009; Wang et al., 2017b) usually suffer from the sparsity issue of user-item interactions and the cold start problem, which could be addressed by introducing additional information such as user/item attributes (Wang et al., 2018a) and social networks (Yang et al., 2017). Knowledge graphs (KGs) provide an attractive source of relational information about items, which can be harnessed to improve recommendations (Zhang et al., 2016; Wang et al., 2018c; Huang et al., 2018; Yu et al., 2014; Zhao et al., 2017; Hu et al., 2018; Wang et al., 2018b; Sun et al., 2018; Wang et al., 2019b, c; Wang et al., 2019a). A KG is a heterogeneous graph in which nodes correspond to entities and edges correspond to relations. In many recommendation scenarios, an item (i.e., a product) may correspond to an entity in a KG. KGs then provide connectivity information between items via different types of relations and allow for revealing semantic relatedness of items. In addition, KGs can also provide diversity and explainability for the recommended results (Wang et al., 2018b).
The core challenge in utilizing KGs to improve recommender systems is in learning how to express user-specific item relatedness via the KG. Existing KG-aware recommender systems can be classified into path-based methods (Yu et al., 2014; Zhao et al., 2017; Hu et al., 2018), embedding-based methods (Zhang et al., 2016; Wang et al., 2018c; Huang et al., 2018; Wang et al., 2019b), and hybrid methods (Wang et al., 2018b; Sun et al., 2018; Wang et al., 2019c). However, these approaches cannot make full use of KGs due to manually designed features, inability to perform end-to-end training, or poor scalability (see Section 2.3 for more discussion). Graph Convolutional Neural Networks (GCNs), which aggregate features from local neighbors on a graph using neural networks, represent a promising advancement in graph-based representation learning (Bruna et al., 2014; Defferrard et al., 2016; Kipf and Welling, 2017; Duvenaud et al., 2015; Niepert et al., 2016; Hamilton et al., 2017). Recently, several works aimed to use GCNs in recommender systems (Ying et al., 2018; Monti et al., 2017; van den Berg et al., 2017; Wu et al., 2018) (see Section 2.1 for more discussion), but these approaches are all designed for homogeneous bipartite user-item interaction graphs or user/item-similarity graphs, where GCNs can be deployed directly. It remains an open question how to extend the GCN architecture to KG-aware recommender engines, because (1) GCNs are proposed for general weighted graphs, while KG edges are heterogeneous without explicit weights, and (2) it is not clear how to combine entity features learned by GCNs with the recommender system.
In this paper, we rethink the problem of KG-aware recommendations. Our design objectives are to automatically capture both structural and semantic information in the KG, while maintaining scalability with respect to the KG size. Therefore, we present a Knowledge Graph Convolutional Neural Network (KGCN) approach for recommender systems. To account for the relational heterogeneity of KGs, we propose using a trainable and personalized relation scoring function to transform the KG into a user-specific weighted graph, which characterizes both the semantic information of the KG and the users’ personalized interests. For example, in the movie recommendation setting the relation scoring function could learn that a given user cares mostly about the “director” relation between movies and persons, while somebody else may care more about the “lead actor” relation. Using this personalized weighted graph, KGCN then aggregates neighborhood information with bias when calculating the embedding of a given item node. The embedding vector of each item captures the local KG structure around an item node in a user-personalized way. Note that the size of a KG and the number of an entity’s neighbors may be prohibitively large in practice. Therefore, we implement KGCN in a minibatch fashion, and perform efficient localized convolutions by sampling a fixed-size neighborhood for each node, which guarantees tractable computational cost and greatly improves scalability.
It is worth noting a significant difference between our approach and traditional GCNs. In our case the edge weights in the graph are not given as input; instead, setting them requires supervised training of the relation scoring function. This added flexibility makes the optimization process prone to overfitting, since the only source of supervised signal is the user-item interactions. Therefore, additional regularization on edge weights is needed to guide the learning process and achieve better generalization. We propose taking label smoothness (LS) (Zhu et al., 2003; Zhang and Lee, 2007) as the additional regularization, which assumes that adjacent entities in the KG are likely to have similar labels. In our context this means users tend to engage similarly with nearby items in the KG. We prove that the LS regularization is equivalent to label propagation, and we therefore design a leave-one-out loss function for label propagation to provide extra supervised signal for learning the edge scoring function. We show that KGCN (feature propagation) and LS regularization (label propagation) can be unified in a minibatch implementation, thus LS can be seen as a natural choice of regularization on KGCN.
Empirically, we apply the proposed Knowledge Graph Convolutional Networks with Label Smoothness regularization (KGCN-LS) to four real-world scenarios of movie, book, music, and restaurant recommendations, in which the first three are public datasets and the last is from Meituan-Dianping Group. Experimental results show that KGCN-LS achieves significant gains over state-of-the-art baselines in recommendation accuracy. We also show that KGCN-LS maintains strong recommendation performance even when user-item interactions are sparse and is highly scalable with respect to KG size. We release the code and datasets to researchers for validating the reported results and conducting further research. The code and data are available at https://github.com/hwwang55/KGCN.
2. Related Work
2.1. Graph Convolutional Neural Networks
GCNs aim to generalize convolutional neural networks to non-Euclidean domains (such as graphs) for robust feature learning. Bruna et al. (Bruna et al., 2014) define the convolution in the Fourier domain and calculate the eigendecomposition of the graph Laplacian, Defferrard et al. (Defferrard et al., 2016) approximate the convolutional filters by Chebyshev expansion of the graph Laplacian, and Kipf et al. (Kipf and Welling, 2017) propose a convolutional architecture via a first-order approximation. In contrast to these spectral GCNs, non-spectral GCNs operate on the graph directly and apply “convolution” (i.e., weighted average) to local neighbors of a node (Duvenaud et al., 2015; Niepert et al., 2016; Hamilton et al., 2017).
Recently, researchers also deployed GCNs in recommender systems: PinSage (Ying et al., 2018) applies GCNs to the pin-board bipartite graph in Pinterest. Monti et al. (Monti et al., 2017) and Berg et al. (van den Berg et al., 2017) model recommender systems as matrix completion and design GCNs for representation learning on the user-item bipartite graph. Wu et al. (Wu et al., 2018) use GCNs on user/item intrinsic structure graphs to learn user/item representations. The difference between these works and ours is that they are all designed for homogeneous bipartite graphs or user/item-similarity graphs where GCNs can be used directly, while here we investigate GCNs for heterogeneous KGs. Researchers also propose using GCNs to model KGs (Schlichtkrull et al., 2018), but not for the purpose of recommendation.
2.2. Semi-supervised Learning on Graphs
The goal of graph-based semi-supervised learning is to correctly label all nodes in a graph given that only a few nodes are labeled. Prior work often makes assumptions on the distribution of labels over the graph, and one common assumption is smoothness. Based on different settings of edge weights in the input graph, these methods are classified as: (1) Edge weights are assumed to be given as input and therefore fixed (Zhu et al., 2003; Zhou et al., 2004; Baluja et al., 2008); (2) Edge weights are parameterized and therefore learnable (Zhang and Lee, 2007; Wang and Zhang, 2008; Karasuyama and Mamitsuka, 2013). Inspired by these methods, we design a module of label smoothness regularization in our proposed model. The major distinction of our work is that the label smoothness constraint is not used for semi-supervised learning on graphs, but serves as regularization to assist the learning of edge weights and achieves better generalization for recommender systems.
2.3. Recommendations with Knowledge Graphs
In general, existing KG-aware recommender systems can be classified into three categories: (1) Embedding-based methods (Zhang et al., 2016; Wang et al., 2018c; Huang et al., 2018) preprocess a KG with knowledge graph embedding (KGE) (Wang et al., 2017a) algorithms, then incorporate learned entity embeddings into recommendation. Embedding-based methods are highly flexible in utilizing KGs to assist recommender systems, but the KGE algorithms focus more on modeling rigorous semantic relatedness (e.g., TransE (Bordes et al., 2013) assumes $\mathbf{h} + \mathbf{r} \approx \mathbf{t}$), which is more suitable for in-graph applications such as link prediction than for recommendation. In addition, embedding-based methods usually lack an end-to-end way of training. (2) Path-based methods (Yu et al., 2014; Zhao et al., 2017; Hu et al., 2018) explore various patterns of connections among items in a KG (a.k.a. meta-path or meta-graph) to provide additional guidance for recommendations. Path-based methods make use of KGs in a more intuitive way, but they rely heavily on manually designed meta-paths/meta-graphs, which are hard to tune in practice. (3) Hybrid methods (Wang et al., 2018b; Sun et al., 2018) combine the above two categories and learn user/item embeddings by exploiting the structure of KGs. Our proposed KGCN-LS can be seen as an instance of hybrid methods. But in contrast to hybrid methods such as RKGE (Sun et al., 2018) or RippleNet (Wang et al., 2018b), the computational complexity of KGCN-LS scales well with the increasing size of the KG.
3. Problem Formulation
We begin by describing the KG-aware recommendation problem and introducing notation. In a typical recommendation scenario, we have a set of users $\mathcal{U}$ and a set of items $\mathcal{V}$. The user-item interaction matrix $\mathbf{Y}$ is defined according to users’ implicit feedback, where $y_{uv} = 1$ indicates that user $u$ has engaged with item $v$, such as clicking, watching, or purchasing. We also have a knowledge graph $\mathcal{G} = \{(h, r, t)\}$ available, which is comprised of entity-relation-entity triples $(h, r, t)$. Here $h \in \mathcal{E}$, $r \in \mathcal{R}$, and $t \in \mathcal{E}$ denote the head, relation, and tail of a knowledge triple, and $\mathcal{E}$ and $\mathcal{R}$ are the sets of entities and relations in the knowledge graph, respectively. For example, the triple (A Song of Ice and Fire, book.book.author, George Martin) states the fact that George Martin writes the book “A Song of Ice and Fire”. In many recommendation scenarios, an item $v \in \mathcal{V}$ corresponds to an entity $e \in \mathcal{E}$. For example, in book recommendation, the item “A Song of Ice and Fire” also appears in the knowledge graph as an entity with the same name. The set of entities $\mathcal{E}$ can therefore be partitioned as $\mathcal{E} = \mathcal{V} \cup (\mathcal{E} \setminus \mathcal{V})$, where $\mathcal{E} \setminus \mathcal{V}$ is the set of non-item entities. Given the user-item interaction matrix $\mathbf{Y}$ as well as the knowledge graph $\mathcal{G}$, our task is to predict whether user $u$ has potential interest in item $v$ with which he/she has not engaged before. Specifically, we aim to learn a prediction function $\hat{y}_{uv} = \mathcal{F}(u, v \mid \Theta, \mathbf{Y}, \mathcal{G})$, where $\hat{y}_{uv}$ denotes the probability that user $u$ will engage with item $v$, and $\Theta$ are the model parameters of function $\mathcal{F}$.
4. Our Approach: KGCN-LS
In this section, we first introduce knowledge graph convolutional networks and label smoothness regularization, respectively, then we present the minibatch implementation of KGCN-LS.
4.1. Knowledge Graph Convolutional Networks
The key idea of our approach to KG-aware recommendations is to transform a heterogeneous KG into a user-personalized weighted graph, which characterizes the user’s preferences. To this end, we introduce a user-specific relation scoring function $s_u(r)$ that provides the importance of relation $r$ for user $u$:
$$s_u(r) = g(\mathbf{u}, \mathbf{r}), \tag{1}$$
where $\mathbf{u}$ and $\mathbf{r}$ are feature vectors of user $u$ and relation type $r$, respectively, and $g$ is a differentiable function such as inner product. In general, $s_u(r)$ characterizes the importance of relation $r$ to user $u$. For example, a user may be more interested in movies that share the same “lead actor” relation with his/her historically liked movies, while another user may be more concerned about the “genre”.
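As a minimal sketch of Eq. (1), assuming $g$ is the inner product (one of the choices mentioned above), the score of each relation for one user could be computed as follows. The function name `relation_score` and the toy feature vectors are hypothetical and only serve as illustration.

```python
def relation_score(user_vec, relation_vec):
    """s_u(r) = g(u, r), with g assumed to be the inner product."""
    if len(user_vec) != len(relation_vec):
        raise ValueError("user and relation vectors must share a dimension")
    return sum(a * b for a, b in zip(user_vec, relation_vec))


# Hypothetical 3-dimensional feature vectors for one user and two relations.
user_vec = [1.0, 0.0, 2.0]
relation_vecs = {
    "director": [0.5, 1.0, 1.0],
    "lead_actor": [0.1, 1.0, 0.2],
}

# One score per relation type; a larger score means the relation matters more to this user.
scores = {name: relation_score(user_vec, vec) for name, vec in relation_vecs.items()}
```

In a trained model both vectors would be learned parameters, so the ranking of relations differs from user to user.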
Given the user-specific relation scoring function $s_u(\cdot)$ of user $u$, knowledge graph $\mathcal{G}$ can therefore be transformed into a user-specific adjacency matrix $\mathbf{A}_u \in \mathbb{R}^{|\mathcal{E}| \times |\mathcal{E}|}$, in which entry $A_u^{ij} = s_u(r_{e_i, e_j})$, and $r_{e_i, e_j}$ is the relation between entities $e_i$ and $e_j$ in $\mathcal{G}$; $A_u^{ij} = 0$ if there is no relation between $e_i$ and $e_j$. (Footnote 1: In this work we treat $\mathcal{G}$ as an undirected graph, so $\mathbf{A}_u$ is a symmetric matrix. If both triples $(h, r_1, t)$ and $(t, r_2, h)$ exist, we only consider one of $r_1$ and $r_2$. This is due to the fact that: (1) $r_1$ and $r_2$ are the inverse of each other and semantically related; (2) treating $\mathbf{A}_u$ as symmetric will greatly increase the matrix density.) See the left two subfigures in Figure 1 for illustration. We also denote the raw feature matrix of entities as $\mathbf{E} \in \mathbb{R}^{|\mathcal{E}| \times d_0}$, where $d_0$ is the dimension of raw entity features. In KGCN-LS, we use multiple feed-forward layers to update the entity representation matrix by aggregating representations of neighboring entities. Specifically, the layer-wise forward propagation can be expressed as
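$$\mathbf{H}_{l+1} = \sigma\left(\tilde{\mathbf{D}}_u^{-1/2}\, \tilde{\mathbf{A}}_u\, \tilde{\mathbf{D}}_u^{-1/2}\, \mathbf{H}_l \mathbf{W}_l\right), \quad l = 0, 1, \cdots, L-1. \tag{2}$$
In Eq. (2), $\mathbf{H}_l$ is the matrix of hidden representations of entities in layer $l$, and $\mathbf{H}_0 = \mathbf{E}$. $\tilde{\mathbf{A}}_u = \mathbf{A}_u + \mathbf{I}$ is designed to aggregate representation vectors of neighboring entities, where $\mathbf{I}$ is the identity matrix. We use $\tilde{\mathbf{A}}_u$ instead of $\mathbf{A}_u$, i.e., adding a self-connection to each entity, to ensure that the old representation vector of the entity itself is taken into consideration when updating entity representations. $\tilde{\mathbf{D}}_u$ is a diagonal degree matrix with entries $\tilde{D}_u^{ii} = \sum_j \tilde{A}_u^{ij}$; therefore, $\tilde{\mathbf{D}}_u^{-1/2}$ is used to normalize $\tilde{\mathbf{A}}_u$ and keep the entity representation matrix $\mathbf{H}_l$ stable. $\mathbf{W}_l \in \mathbb{R}^{d_l \times d_{l+1}}$ is the layer-specific trainable weight matrix, and $\sigma$ is a nonlinear activation function.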
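The layer-wise propagation of Eq. (2) can be sketched in plain Python on a toy graph. This is an illustrative sketch under several assumptions: relation scores are positive (so every degree is nonzero), the activation is ReLU, the degree matrix is computed from the adjacency matrix with self-connections, and the helper names `user_adjacency` and `gcn_layer` are hypothetical.

```python
import math


def user_adjacency(num_entities, triples, score):
    """Symmetric user-specific adjacency matrix built from (head, relation, tail) triples."""
    A = [[0.0] * num_entities for _ in range(num_entities)]
    for h, r, t in triples:
        A[h][t] = A[t][h] = score[r]
    return A


def gcn_layer(A, H, W):
    """One layer: ReLU(D^{-1/2} (A + I) D^{-1/2} H W), with D the degree matrix of A + I."""
    n = len(A)
    A_hat = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    d_inv_sqrt = [1.0 / math.sqrt(sum(row)) for row in A_hat]
    # Normalized aggregation of each entity's own and neighboring representations.
    agg = [
        [
            sum(d_inv_sqrt[i] * A_hat[i][j] * d_inv_sqrt[j] * H[j][c] for j in range(n))
            for c in range(len(H[0]))
        ]
        for i in range(n)
    ]
    # Linear transform by W followed by ReLU.
    return [
        [max(0.0, sum(agg[i][c] * W[c][k] for c in range(len(W)))) for k in range(len(W[0]))]
        for i in range(n)
    ]


# Toy KG with three entities on a path: 0 -director- 1 -genre- 2.
triples = [(0, "director", 1), (1, "genre", 2)]
scores = {"director": 2.0, "genre": 1.0}  # hypothetical s_u(r) values for one user
A_u = user_adjacency(3, triples, scores)

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
H1 = gcn_layer(A_u, identity, identity)  # raw features E = I, weights W = I
H2 = gcn_layer(A_u, H1, identity)
```

With one-hot input features, `H1[0]` has no mass on entity 2 (two hops away), while `H2[0]` does, which illustrates how stacking layers widens the neighborhood that influences each representation.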
A single KGCN-LS layer computes the representation of an entity via a transformed mixture of itself and its immediate neighbors in the KG. We can therefore naturally extend KGCN-LS to multiple layers to explore users’ potential interests in a broader and deeper way. The final output of KGCN-LS is $\mathbf{H}_L \in \mathbb{R}^{|\mathcal{E}| \times d_L}$, which contains the entity representations that mix the initial features of themselves and their neighbors up to $L$ hops away. Finally, the predicted engagement probability of user $u$ with item $v$ is calculated by
$$\hat{y}_{uv} = f(\mathbf{u}, \mathbf{v}_u), \tag{3}$$
where $\mathbf{v}_u$ is the final representation vector of item $v$, and $f$ is a differentiable prediction function, for example, inner product or a multilayer perceptron. Note that $\mathbf{v}_u$ is user-specific since the adjacency matrix $\mathbf{A}_u$ is user-specific. Furthermore, note that the system is end-to-end trainable where the gradients flow from $f$ via the GCN (parameter matrix $\mathbf{W}$) to $g$ and eventually to representations of users $\mathbf{u}$ and items $\mathbf{v}$.
4.2. Discussion
How can the knowledge graph help find users’ interests? To intuitively understand the role of the KG, we make an analogy with a physical equilibrium model as shown in Figure 2. Each entity is seen as a particle, while the supervised positive signal acts as the force pulling the observed positive samples up from the decision boundary and the negative sampling signal acts as the force pushing the unobserved samples down. Without the KG (Figure 2(a)), these samples are only loosely connected with each other through the collaborative filtering effect (which is not drawn here for clarity). In contrast, edges in the KG serve as rubber bands that impose explicit constraints on connected entities. When the number of layers is $L = 1$ (Figure 2(b)), the representation of each entity is a mixture of itself and its immediate neighbors; therefore, optimizing on the positive samples will simultaneously pull their immediate neighbors up together. The upward force goes deeper in the KG with the increase of $L$ (Figure 2(c)), which helps explore users’ long-distance interests and pull up more positive samples. It is also interesting to note that the proximity constraint exerted by the KG is personalized since the strength of the rubber band (i.e., $s_u(r)$ in Eq. (1)) is user-specific and relation-specific: one user may prefer relation $r_1$ (Figure 2(b)) while another user (with the same observed samples but different unobserved samples) may prefer relation $r_2$ (Figure 2(d)).
4.3. Label Smoothness Regularization
Despite the force exerted by edges in the KG, edge weights may be set inappropriately, for example, too small to pull up the unobserved samples (i.e., rubber bands are too weak). This is exactly the difference between our work and traditional GCN: in traditional GCN, edge weights of the input graph are fixed; but in KGCN, edge weights in Eq. (2) are learnable (including possible parameters of function $g$ and feature vectors of users and relations) and also require supervised training like $\mathbf{W}$. Though enhancing the fitting ability of the model, this will inevitably make the optimization process prone to overfitting, since the only source of supervised signal is from user-item interactions outside KGCN-LS layers. Moreover, edge weights do play an essential role in learning tasks on graphs, as highlighted by a large amount of prior work (Zhu et al., 2003; Zhang and Lee, 2007; Wang and Zhang, 2008; Karasuyama and Mamitsuka, 2013; Velickovic et al., 2018). Therefore, more regularization on edge weights is needed to assist the learning of entity representations and to help generalize to unobserved interactions more efficiently.
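Let us see what an ideal set of edge weights should look like. Consider a real-valued label function $l_u: \mathcal{E} \rightarrow \mathbb{R}$ on $\mathcal{G}$, which is constrained to take a specific value $l_u(v) = y_{uv}$ at node $v \in \mathcal{V} \subseteq \mathcal{E}$. To be more specific, $l_u(v) = 1$ if user $u$ has engaged with item $v$, otherwise $l_u(v) = 0$. Intuitively, we hope that adjacent entities in the KG are likely to have similar labels, which is known as the label smoothness assumption. This motivates our choice of energy function $E$:
$$E(l_u, \mathbf{A}_u) = \frac{1}{2} \sum_{e_i \in \mathcal{E},\, e_j \in \mathcal{E}} A_u^{ij} \left( l_u(e_i) - l_u(e_j) \right)^2. \tag{4}$$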
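We show that the minimum-energy label function is harmonic by the following theorem (proofs of all theorems can be found in the Appendix):
Theorem 1.
The minimum-energy label function
$$l_u^* = \arg\min_{l_u:\, l_u(v) = y_{uv},\ \forall v \in \mathcal{V}} E(l_u, \mathbf{A}_u) \tag{5}$$
w.r.t. Eq. (4) is harmonic, i.e., $l_u^*$ satisfies $\Delta l_u^* = 0$ on the non-item entities $\mathcal{E} \setminus \mathcal{V}$, where $\Delta = \mathbf{D}_u - \mathbf{A}_u$ is the graph Laplacian of $\mathcal{G}$, and $\mathbf{D}_u$ is a diagonal degree matrix with entries $D_u^{ii} = \sum_j A_u^{ij}$.
The harmonic property indicates that the value of $l_u^*$ at each non-item entity $e_i \in \mathcal{E} \setminus \mathcal{V}$ is the average of its neighboring entities $e_j$:
$$l_u^*(e_i) = \frac{1}{D_u^{ii}} \sum_{e_j \in \mathcal{E}} A_u^{ij}\, l_u^*(e_j), \tag{6}$$
which leads to the following iterative label propagation scheme: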
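Theorem 2.
Repeating the following two steps:
(1) Propagating labels: $l_u(\mathcal{E}) \leftarrow \mathbf{D}_u^{-1} \mathbf{A}_u\, l_u(\mathcal{E})$;
(2) Resetting labels on the set of items: $l_u(\mathcal{V}) \leftarrow \mathbf{Y}[u, \mathcal{V}]^\top$;
will lead to $l_u \rightarrow l_u^*$.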
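The two-step scheme of Theorem 2 can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation: the function name `propagate_labels`, the dense list-of-lists adjacency, the fixed iteration count, and the initial value for non-item entities are all assumptions made here.

```python
def propagate_labels(A, item_labels, num_iters=200, init=0.5):
    """Iterate l <- D^{-1} A l, then reset item entries to their true labels.

    A: symmetric non-negative adjacency matrix (list of lists).
    item_labels: dict {entity index: label} for the labeled item entities.
    """
    n = len(A)
    degree = [sum(row) for row in A]
    labels = [item_labels.get(i, init) for i in range(n)]
    for _ in range(num_iters):
        # Step 1: every entity takes the weighted average of its neighbors' labels.
        labels = [
            sum(A[i][j] * labels[j] for j in range(n)) / degree[i] if degree[i] > 0 else labels[i]
            for i in range(n)
        ]
        # Step 2: clamp the labeled items back to their ground truth.
        for i, y in item_labels.items():
            labels[i] = y
    return labels


# Path graph 0 - 1 - 2 - 3 with unit weights; entities 0 and 3 are items, 1 and 2 are non-items.
A = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
labels = propagate_labels(A, {0: 1.0, 3: 0.0})
```

On this toy path the non-item labels converge to 2/3 and 1/3, each being the average of its two neighbors as Eq. (6) requires, and the result does not depend on the initial value chosen for the non-item entities.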
Theorem 2 provides a way of reaching the minimum of the energy function $E$. However, $l_u^*$ does not provide any signal for updating the edge weight matrix $\mathbf{A}_u$, since the labeled part of $l_u^*$, i.e., $l_u^*(\mathcal{V})$, equals the true labels $\mathbf{Y}[u, \mathcal{V}]$; moreover, we do not know the true labels for the unlabeled part $l_u^*(\mathcal{E} \setminus \mathcal{V})$.
To solve the issue, we propose minimizing the leave-one-out loss (Zhang and Lee, 2007). Suppose we hold out a single item $v$ and treat it as unlabeled. Then we predict its label by using the rest of the (labeled) items and the (unlabeled) non-item entities. The prediction process is identical to label propagation in Theorem 2, except that the label of item $v$ is hidden and needs to be calculated. This way, the difference between the true label of $v$ (i.e., $y_{uv}$) and the predicted label $\hat{l}_u(v)$ serves as a supervised signal for regularizing edge weights:
$$R(\mathbf{A}) = \sum_u R(\mathbf{A}_u) = \sum_u \sum_v J\left(y_{uv}, \hat{l}_u(v)\right), \tag{7}$$
where $J$ is the cross-entropy loss function. Given the regularization in Eq. (7), an ideal edge weight matrix $\mathbf{A}$ should reproduce the true label of each held-out item while also satisfying the label smoothness.
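A minimal sketch of the leave-one-out regularizer for a single user follows, assuming a dense adjacency matrix, a fixed number of propagation steps, and probability clipping inside the cross-entropy. The helper names (`propagate`, `cross_entropy`, `leave_one_out_loss`, `star`) and the toy graph are hypothetical.

```python
import math


def propagate(A, clamped, num_iters=100, init=0.5):
    """Label propagation l <- D^{-1} A l with the entries in `clamped` reset after each step."""
    n = len(A)
    deg = [sum(row) for row in A]
    l = [clamped.get(i, init) for i in range(n)]
    for _ in range(num_iters):
        l = [sum(A[i][j] * l[j] for j in range(n)) / deg[i] if deg[i] > 0 else l[i] for i in range(n)]
        for i, y in clamped.items():
            l[i] = y
    return l


def cross_entropy(y, p, eps=1e-7):
    """Binary cross-entropy with the prediction clipped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


def leave_one_out_loss(A, item_labels):
    """Hold out each item in turn, predict its label from the rest, and sum the losses."""
    total = 0.0
    for v, y in item_labels.items():
        rest = {i: lab for i, lab in item_labels.items() if i != v}
        total += cross_entropy(y, propagate(A, rest)[v])
    return total


def star(a, b, c):
    """Items 0, 1, 2 are each linked only to non-item entity 3, with weights a, b, c."""
    return [
        [0.0, 0.0, 0.0, a],
        [0.0, 0.0, 0.0, b],
        [0.0, 0.0, 0.0, c],
        [a, b, c, 0.0],
    ]


item_labels = {0: 1.0, 1: 1.0, 2: 0.0}
loss_uniform = leave_one_out_loss(star(1.0, 1.0, 1.0), item_labels)
loss_weak_link = leave_one_out_loss(star(1.0, 1.0, 0.1), item_labels)
```

In this toy graph, weakening the edge that leads to the negative item lowers the loss, because the two positive items then predict each other's labels more accurately. This is the kind of signal that guides the edge weights during training.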
Combining KGCN and LS regularization, we reach the following complete loss function for KGCN-LS:
$$\min_{\mathbf{W}, \mathbf{A}} \mathcal{L} = \min_{\mathbf{W}, \mathbf{A}} \sum_{u, v} J(y_{uv}, \hat{y}_{uv}) + \lambda R(\mathbf{A}) + \gamma \|\mathcal{F}\|_2^2, \tag{8}$$
where $\|\mathcal{F}\|_2^2$ is the L2-regularizer for the whole model, and $\lambda$ and $\gamma$ are balancing hyperparameters. In Eq. (8), the first loss term corresponds to KGCN, which can be seen as feature propagation on the KG, while the second loss term corresponds to LS regularization, which can be seen as label propagation on the KG. A recommender algorithm is actually a mapping from features to labels: $\mathcal{F}: \mathbf{E} \rightarrow \mathbf{Y}$. Therefore, Eq. (8) utilizes the structural information of the KG on both the domain and the range of the mapping $\mathcal{F}$ to capture users’ higher-order preferences and to facilitate better generalization. Another benefit of our loss function in Eq. (8) is that the propagation of features and the propagation of labels can be unified naturally on graphs, which will be demonstrated more clearly in the next subsection.
Lastly, we show in Figure 2(e) how the label smoothness assumption helps regularize the learning of edge weights, using the physical model introduced in Section 4.2. Suppose we hold out the positive sample in the upper left and we intend to reproduce its label from the rest of the samples. Since the true label of the held-out sample is 1 and the upper right sample has the largest label value, the LS regularization term would enforce the edges with arrows to be large so that the label can “flow” from the blue one to the striped one as much as possible. As a result, this will tighten the arrowed rubber bands and encourage the model to pull up the two upper pink samples to a greater extent.
4.4. Minibatch Implementation
Directly calculating and optimizing the complete loss in Eq. (8) requires operating on the entire KG, which may be infeasible in practice due to limited memory resources. In this subsection, we present the KGCN-LS algorithm with a minibatch implementation.
In a real-world KG, the number of neighbors of an entity may vary significantly over the KG. To keep the computational pattern of each minibatch fixed and more efficient, we uniformly sample a fixed-size set of neighbors for each entity instead of using its full set of neighbors. Specifically, we define a neighborhood sampling mapping $\mathcal{S}: e \rightarrow \{e' \mid e' \in \mathcal{N}(e)\}$ with $|\mathcal{S}(e)| = k$, where $\mathcal{N}(e)$ is the set of neighbors of entity $e$ and $k$ is a configurable constant. (Footnote 2: Technically, $\mathcal{S}(e)$ may contain duplicates if $|\mathcal{N}(e)| < k$.) In KGCN-LS, $\mathcal{S}(e)$ is also called the (single-layer) support set of entity $e$, as the final representation of $e$ is dependent on these entities. Figure 1 gives an illustrative example of a two-layer support set (green nodes) for a given entity (blue node), where $k$ is set as 2. Another thing to notice is that the number of unobserved negative interactions in $\mathbf{Y}$ by far exceeds the positive ones. To make computation more efficient, we use a negative sampling strategy during training (see Appendix C for details).
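The fixed-size neighborhood sampling described above can be sketched as follows. This is a minimal sketch assuming the KG is stored as a dictionary of neighbor lists; the function name `sample_neighbors` and the toy entities are hypothetical.

```python
import random


def sample_neighbors(neighbors, k, rng):
    """Return exactly k uniformly sampled neighbors for every entity that has any.

    Sampling is without replacement when an entity has at least k neighbors,
    and with replacement (so duplicates may appear) otherwise.
    """
    sampled = {}
    for entity, nbrs in neighbors.items():
        if not nbrs:
            sampled[entity] = []  # isolated entity: nothing to sample
        elif len(nbrs) >= k:
            sampled[entity] = rng.sample(nbrs, k)
        else:
            sampled[entity] = rng.choices(nbrs, k=k)
    return sampled


# Toy neighbor lists (hypothetical entity names).
neighbors = {
    "movie_a": ["director_x", "actor_y", "genre_z", "studio_w"],
    "movie_b": ["director_x"],
    "director_x": ["movie_a", "movie_b"],
}
sampled = sample_neighbors(neighbors, k=2, rng=random.Random(0))
```

Because every support set has the same size, each minibatch can be processed with fixed-shape arrays regardless of how skewed the degree distribution of the KG is.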
The formal description of the minibatch implementation of KGCN-LS is presented in Algorithm 1. For a given user-item pair $(u, v)$ (line 2), we first calculate the support set for $v$ in an iterative layer-by-layer manner (lines 3, 18-24). In the initialization step, representations of all entities are initialized with input features (line 4), and labels are initialized with ground truth for items or a random value within $[0, 1]$ for non-item entities (line 5). (Footnote 3: The initialization of $l_u$ on non-item entities does not affect the final converged result. See Appendix B for the proof. But in our algorithm, the label propagation is only repeated $L$ times and $L$ is usually not too large in practice; therefore, we carefully choose the initial values within $[0, 1]$ to avoid instability.) Note that item $v$ is held out and treated as unlabeled. Then the propagation is repeated $L$ times (line 6), in which lines 7-8 describe feature propagation and lines 9-12 describe label propagation. (Footnote 4: In feature propagation (line 8), the edge weight is normalized by the inverse of the degree for ease of implementation. This is slightly different from Eq. (2), where the edge weight is normalized symmetrically by the inverse square root of the degree.) The final representation of the item is denoted as $\mathbf{v}_u$ (line 13), which is fed into $f$ together with the user representation $\mathbf{u}$ to predict the engaging probability $\hat{y}_{uv}$ (line 14). For the purpose of training, we calculate the complete loss (line 15) and update all parameters by gradient descent (line 16).
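The layer-by-layer construction of the support set can be sketched as below. Since Algorithm 1 itself is not reproduced in this section, this is only an assumed reading of that step: the function name `support_sets`, the list-of-layers return value, and the toy sampled neighborhoods are hypothetical.

```python
def support_sets(item, sampled, num_layers):
    """Expand outward from `item`: layer 0 is the item, layer l+1 holds the sampled
    neighbors of every entity in layer l (duplicates are kept)."""
    layers = [[item]]
    for _ in range(num_layers):
        layers.append([nbr for entity in layers[-1] for nbr in sampled[entity]])
    return layers


# Hypothetical sampled neighborhoods with k = 2 neighbors per entity.
sampled = {
    "v": ["a", "b"],
    "a": ["v", "c"],
    "b": ["v", "d"],
    "c": ["a", "a"],
    "d": ["b", "b"],
}
layers = support_sets("v", sampled, num_layers=2)
```

With $k$ sampled neighbors per entity, layer $l$ of this structure holds $k^l$ entries, so the work per user-item pair is governed by $k$ and the number of layers rather than by the size of the KG.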
Computational Complexity. Suppose the number of sampled user-item interactions and the dimension of all embeddings are and , respectively. In each training iteration of Algorithm 1, the complexity of constructing support set for one item is . The complexity of initialization is ; The complexity of feature propagation and label propagation in layer is and , respectively. Therefore, the overall complexity of each iteration of KGCN-LS is . This time complexity mainly depends on and , and fortunately we show that their optimums are typically not too large ( for and for ) by later experiments.
The overall architecture of KGCN-LS is also illustrated in Figure 1. It is worth noting, from both Algorithm 1 and the right two subfigures in Figure 1, that the scheme of feature propagation is very similar to the scheme of label propagation in the minibatch implementation of KGCN-LS. This demonstrates that the propagation of features and labels can be unified naturally, and that label smoothness is a natural choice of regularization on knowledge graph convolutional networks.
5. Experiments
In this section, we evaluate KGCN-LS and present its performance on four real-world scenarios: movie, book, music, and restaurant recommendations.
5.1. Datasets and Baselines
We utilize the following four datasets in our experiments for movie, book, music, and restaurant recommendation, respectively, in which the first three are public datasets and the last one is from Meituan-Dianping Group. We use Satori (https://searchengineland.com/library/bing/bingsatori), a commercial KG built by Microsoft, to construct sub-KGs for the MovieLens-20M, Book-Crossing, and Last.FM datasets. The KG for the Dianping-Food dataset is constructed by the internal toolkit of Meituan-Dianping Group. Further details of the datasets are provided in Appendix C.

MovieLens-20M (https://grouplens.org/datasets/movielens/) is a widely used benchmark dataset in movie recommendations, which consists of approximately 20 million explicit ratings (ranging from 1 to 5) on the MovieLens website. The corresponding KG contains 102,569 entities, 499,474 edges and 32 relation-types.

Book-Crossing (http://www2.informatik.unifreiburg.de/~cziegler/BX/) contains 1 million ratings (ranging from 0 to 10) of books in the Book-Crossing community. The corresponding KG contains 25,787 entities, 60,787 edges and 18 relation-types.

Last.FM (https://grouplens.org/datasets/hetrec2011/) contains musician listening information from a set of 2 thousand users from the Last.fm online music system. The corresponding KG contains 9,366 entities, 15,518 edges and 60 relation-types.

Dianping-Food is provided by Dianping.com (https://www.dianping.com/), which contains over 10 million interactions (including clicking, buying, and adding to favorites) between approximately 2 million users and 1 thousand restaurants. The corresponding KG contains 28,115 entities, 160,519 edges and 7 relation-types.
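Table: Recall@K (R@K) in top-K recommendation.
| Model | MovieLens-20M R@2 | MovieLens-20M R@10 | MovieLens-20M R@50 | MovieLens-20M R@100 | Book-Crossing R@2 | Book-Crossing R@10 | Book-Crossing R@50 | Book-Crossing R@100 | Last.FM R@2 | Last.FM R@10 | Last.FM R@50 | Last.FM R@100 | Dianping-Food R@2 | Dianping-Food R@10 | Dianping-Food R@50 | Dianping-Food R@100 |
| SVD | 0.036 | 0.124 | 0.277 | 0.401 | 0.027 | 0.046 | 0.077 | 0.109 | 0.029 | 0.098 | 0.240 | 0.332 | 0.039 | 0.152 | 0.329 | 0.451 |
| LibFM | 0.039 | 0.121 | 0.271 | 0.388 | 0.033 | 0.062 | 0.092 | 0.124 | 0.030 | 0.103 | 0.263 | 0.330 | 0.043 | 0.156 | 0.332 | 0.448 |
| LibFM + TransE | 0.041 | 0.125 | 0.280 | 0.396 | 0.037 | 0.064 | 0.097 | 0.130 | 0.032 | 0.102 | 0.259 | 0.326 | 0.044 | 0.161 | 0.343 | 0.455 |
| PER | 0.022 | 0.077 | 0.160 | 0.243 | 0.022 | 0.041 | 0.064 | 0.070 | 0.014 | 0.052 | 0.116 | 0.176 | 0.023 | 0.102 | 0.256 | 0.354 |
| CKE | 0.034 | 0.107 | 0.244 | 0.322 | 0.028 | 0.051 | 0.079 | 0.112 | 0.023 | 0.070 | 0.180 | 0.296 | 0.034 | 0.138 | 0.305 | 0.437 |
| RippleNet | 0.045 | 0.130 | 0.278 | 0.447 | 0.036 | 0.074 | 0.107 | 0.127 | 0.032 | 0.101 | 0.242 | 0.336 | 0.040 | 0.155 | 0.328 | 0.440 |
| KGCN-LS | 0.043 | 0.155* | 0.321* | 0.458 | 0.045* | 0.082 | 0.117 | 0.149* | 0.044* | 0.122 | 0.277 | 0.370* | 0.047 | 0.170 | 0.340 | 0.487* |
| KGCN-avg | 0.040 | 0.152 | 0.325 | 0.448 | 0.029 | 0.078 | 0.108 | 0.136 | 0.032 | 0.112 | 0.265 | 0.364 | 0.039 | 0.157 | 0.324 | 0.475 |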
Table: AUC in CTR prediction.
| Model | Movie | Book | Music | Restaurant |
| SVD | 0.963 | 0.672 | 0.769 | 0.838 |
| LibFM | 0.959 | 0.691 | 0.778 | 0.837 |
| LibFM + TransE | 0.966 | 0.698 | 0.777 | 0.839 |
| PER | 0.832 | 0.617 | 0.633 | 0.746 |
| CKE | 0.924 | 0.677 | 0.744 | 0.802 |
| RippleNet | 0.960 | 0.727 | 0.770 | 0.833 |
| KGCN-LS | 0.979 | 0.744* | 0.803* | 0.850 |
| KGCN-avg | 0.975 | 0.722 | 0.774 | 0.844 |
We compare the proposed KGCN-LS model with the following baselines for recommender systems, in which the first two baselines are KG-free while the rest are all KG-aware methods. Hyperparameter settings for baselines and our method are introduced in Appendix D and E, respectively.

SVD (Koren, 2008) is a classic CF-based model using inner product to model user-item interactions.

LibFM (Rendle, 2012) is a widely used feature-based factorization model for CTR prediction. We concatenate user ID and item ID as input for LibFM.

LibFM + TransE extends LibFM by attaching an entity representation learned by TransE (Bordes et al., 2013) to each user-item pair.

PER (Yu et al., 2014) is a representative of path-based methods, which treats the KG as a heterogeneous information network and extracts meta-path based features to represent the connectivity between users and items.

CKE (Zhang et al., 2016) is a representative of embedding-based methods, which combines CF with structural, textual, and visual knowledge in a unified framework. We implement CKE as CF plus a structural knowledge module in this paper.

RippleNet (Wang et al., 2018b) is a representative of hybrid methods, which is a memory-network-like approach that propagates users’ preferences on the KG for recommendation.
5.2. Validating the Connection between $\mathcal{G}$ and $\mathbf{Y}$
To validate the connection between the knowledge graph $\mathcal{G}$ and the user-item interaction matrix $\mathbf{Y}$, we conduct an empirical study where we investigate the correlation between the shortest distance of two randomly sampled items in the KG and whether they have common rater(s) in the dataset. For MovieLens-20M and Last.FM, we randomly sample ten thousand item pairs that have no common raters and ten thousand that have at least one common rater, then count the distribution of their shortest distances in the KG. The results are presented in Figure 3, which clearly show that if two items have common rater(s) in the dataset, they are likely to be closer in the KG. For example, if two movies have common rater(s) in MovieLens-20M, there is a probability of that they will be within 2 hops in the KG, while the probability is if they have no common rater. This finding empirically demonstrates that exploiting the proximity structure of the KG can assist in making recommendations. This also justifies our motivation to use support sets and label smoothness to help learn entity representations.
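The shortest distance between two items in the KG, as used in this study, can be obtained with a breadth-first search over an undirected adjacency structure. The sketch below is illustrative only; the function name `shortest_distance` and the toy graph are hypothetical, and the paper does not state how the distances were actually computed.

```python
from collections import deque


def shortest_distance(adjacency, source, target):
    """Number of hops between two entities via breadth-first search; None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nbr in adjacency.get(node, ()):
            if nbr == target:
                return dist + 1
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, dist + 1))
    return None


# Toy undirected KG: two movies share a director, a third movie is disconnected.
adjacency = {
    "movie_a": ["director_x"],
    "movie_b": ["director_x"],
    "director_x": ["movie_a", "movie_b"],
    "movie_c": [],
}
distance_ab = shortest_distance(adjacency, "movie_a", "movie_b")
distance_ac = shortest_distance(adjacency, "movie_a", "movie_c")
```

Running this over sampled item pairs and grouping the distances by whether the pair has a common rater would give histograms of the kind described for Figure 3.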
5.3. Results
5.3.1. Comparison with Baselines
We evaluate our method in two experiment scenarios: (1) In top-K recommendation, we use the trained model to select K items with the highest predicted click probability for each user in the test set, and choose to evaluate the recommended sets. (2) In click-through rate (CTR) prediction, we apply the trained model to predict each interaction in the test set. We use as the evaluation metric in CTR prediction.
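To make the two scenarios concrete, the sketch below scores one user's ranked list and a set of CTR predictions. It assumes Recall@K for the ranking scenario and AUC for CTR prediction. Both are common choices and are assumptions here, as are the function names and the toy scores.

```python
def recall_at_k(scores, positives, k):
    """Fraction of a user's positive test items found among the top-k scored items.

    scores: dict mapping item -> predicted click probability for one user.
    positives: set of items the user actually interacted with in the test set.
    """
    if not positives:
        return 0.0
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    return len(set(top_k) & positives) / len(positives)


def auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy predictions (hypothetical values).
user_scores = {"a": 0.9, "b": 0.2, "c": 0.7, "d": 0.4}
user_recall = recall_at_k(user_scores, positives={"a", "d"}, k=2)
ctr_auc = auc(labels=[1, 0, 1, 0], scores=[0.9, 0.8, 0.3, 0.1])
```

The pairwise AUC above is quadratic in the number of examples. It is meant for illustration, and a rank-based implementation would be preferable on a full test set.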
The results of top recommendation and CTR prediction are presented in Table LABEL:table:topk and Table LABEL:table:ctr, respectively. We have the following observations:

In general, we find that the improvements of KGCNLS over the baselines are higher on BookCrossing and Last.FM than on MovieLens20M and DianpingFood. This is probably because MovieLens20M and DianpingFood are much denser and therefore easier to model.

The KG-free baselines, SVD and LibFM, actually perform better than the two KG-aware baselines PER and CKE, which indicates that PER and CKE cannot make full use of the KG with manually designed meta-paths and TransR-like regularization.

LibFM + TransE is better than LibFM in most cases, which demonstrates that the introduction of the KG is helpful for recommendation in general.

PER performs worst among all baselines, since it is hard to define optimal meta-paths in reality.

RippleNet shows strong performance compared with other baselines. Note that RippleNet also uses multi-hop neighborhood structure, which indicates that capturing proximity information in the KG is essential for recommendation.
The last two rows in Table LABEL:table:topk and Table LABEL:table:ctr summarize the performance of KGCNLS and its variant KGCNavg, in which neighborhood representations are directly averaged without relation scores (i.e., treating the KG as unweighted). From the results we find that:

KGCNLS outperforms baselines by a significant margin. For example, the of KGCNLS surpasses baselines by , , , and on average in MovieLens20M, BookCrossing, Last.FM, and DianpingFood datasets, respectively.

KGCNavg performs worse than KGCNLS, especially in BookCrossing and Last.FM, where interactions are sparse. This demonstrates that capturing users’ personalized preferences and the semantics of the KG does benefit recommendation.
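The difference between the two variants compared above can be illustrated with a small aggregation routine. This is a hedged sketch, not the paper's implementation: the user-specific relation score is taken as an inner product between the user and relation embeddings, the softmax normalization is an assumption, and all names and toy vectors are hypothetical.

```python
import numpy as np


def aggregate_neighbors(user, neighbor_embs, relation_embs, weighted=True):
    """Combine K neighbor embeddings into one neighborhood vector.

    user: (d,) user embedding.
    neighbor_embs: (K, d) embeddings of the sampled neighbor entities.
    relation_embs: (K, d) embeddings of the relations linking to each neighbor.
    """
    if weighted:
        scores = relation_embs @ user            # user-specific relation scores, shape (K,)
        scores = np.exp(scores - scores.max())   # softmax normalization (assumed)
        weights = scores / scores.sum()
    else:
        # Unweighted variant: every neighbor contributes equally.
        weights = np.full(len(neighbor_embs), 1.0 / len(neighbor_embs))
    return weights @ neighbor_embs


user = np.array([1.0, 0.0])
neighbor_embs = np.array([[1.0, 0.0], [0.0, 1.0]])
relation_embs = np.array([[10.0, 0.0], [0.0, 0.0]])  # the user cares about relation 0

personalized = aggregate_neighbors(user, neighbor_embs, relation_embs, weighted=True)
averaged = aggregate_neighbors(user, neighbor_embs, relation_embs, weighted=False)
```

In the toy example the weighted variant is dominated by the neighbor reached through the relation the user scores highly, while the averaged variant ignores that preference.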
We also show the daily performance of KGCNLS and baselines on DianpingFood to investigate model stability. Figure 6 shows their daily scores from November 1, 2018 to November 30, 2018. The curve of KGCNLS is consistently above the baselines over the test period. Moreover, the performance of KGCNLS has low variance, which suggests that KGCNLS is also robust and stable in practice.
5.3.2. Effectiveness of LS Regularization
Is the proposed LS regularization helpful in improving the performance of KGCN? To study the effectiveness of LS regularization, we fix the dimension of embeddings as , , and , then vary from to to see how performance changes. The results of in Last.FM dataset are plotted in Figure 6. It is clear that the performance of KGCNLS with a non-zero LS regularization weight is superior to the case where the weight is zero, which justifies our claim that LS regularization can assist in learning the edge weights in a KG and achieve better generalization in recommender systems. Note, however, that an overly large weight is less attractive, since it overwhelms the overall loss and misleads the direction of gradients. According to the experiment results, we find that a between and is preferable in most practical cases.
5.3.3. Results in sparse scenarios
Table 3. Results of all methods on MovieLens20M with different training-set sizes.
SVD  0.882  0.913  0.938  0.955  0.963
LibFM  0.902  0.923  0.938  0.950  0.959
LibFM+TransE  0.914  0.935  0.949  0.960  0.966
PER  0.802  0.814  0.821  0.828  0.832
CKE  0.898  0.910  0.916  0.921  0.924
RippleNet  0.921  0.937  0.947  0.955  0.960
KGCNLS  0.961  0.970  0.974  0.977  0.979
One major goal of using KGs in recommender systems is to alleviate the sparsity and cold-start problems. To investigate the performance of KGCNLS in sparse scenarios, we vary the size of the training set of MovieLens20M from to (while the validation and test sets are kept fixed), and report the results of in Table 3. When , decreases by , , , , , and for the six baselines compared to the model trained on full training data (), but the decrease in performance for KGCNLS is only . This demonstrates that KGCNLS still maintains predictive performance even when user-item interactions are sparse.
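The comparison in this paragraph can be reproduced from the values in Table 3. The sketch below assumes that the first column of Table 3 is the smallest training set, that the last column is the full training set, and that the decrease is measured relative to the full-data score. These are assumptions about the table layout, not statements from the text.

```python
# (score with the smallest training set, score with the full training set),
# copied from the first and last columns of Table 3.
table3 = {
    "SVD": (0.882, 0.963),
    "LibFM": (0.902, 0.959),
    "LibFM+TransE": (0.914, 0.966),
    "PER": (0.802, 0.832),
    "CKE": (0.898, 0.924),
    "RippleNet": (0.921, 0.960),
    "KGCNLS": (0.961, 0.979),
}


def relative_drop(sparse, full):
    """Relative decrease of the sparse-data score with respect to the full-data score."""
    return (full - sparse) / full


# Relative decrease in percent, rounded to one decimal place.
drops = {name: round(100 * relative_drop(s, f), 1) for name, (s, f) in table3.items()}
most_robust = min(drops, key=drops.get)
```

Under these assumptions, KGCNLS has the smallest relative drop (about 1.8%), while the largest drop among the baselines is that of SVD (about 8.4%).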
5.3.4. Hyperparameters Sensitivity
We first vary the number of sampled neighbors to investigate how effectively the KG is used. The results in the four datasets are presented in Table 4, from which we find that KGCNLS achieves the best performance with a moderate sample size (between 4 and 16, depending on the dataset). This is because a sample size that is too small does not have enough capacity to encode neighborhood information, while one that is too large is prone to introducing noise and incurs heavier computation overhead.
Table 4. Results of KGCNLS with different numbers of sampled neighbors.
Sampled neighbors  2  4  8  16  32
MovieLens20M  0.142  0.138  0.151  0.155  0.154
BookCrossing  0.046  0.077  0.082  0.079  0.077
Last.FM  0.105  0.102  0.122  0.122  0.120
DianpingFood  0.166  0.170  0.160  0.158  0.155
We also investigate the influence of the depth of the support set in KGCNLS by varying it from 1 to 4. The results are shown in Table 5, which demonstrates that KGCNLS is more sensitive to the depth than to the number of sampled neighbors. We observe serious model collapse when the depth is 3 or 4, as a larger depth mixes too many entity embedding vectors into the computation of a given entity, which over-smooths the representation learning on graphs. This also accords with our intuition, since an overly long relation chain makes little sense when inferring inter-item similarities. A depth of 1 or 2 is enough for most cases according to the experiment results.
Table 5. Results of KGCNLS with different depths of the support set.
Depth  1  2  3  4
MovieLens20M  0.155  0.146  0.122  0.011
BookCrossing  0.077  0.082  0.043  0.008
Last.FM  0.122  0.106  0.105  0.057
DianpingFood  0.165  0.170  0.061  0.036
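The interaction between the number of sampled neighbors and the depth can be seen in a small sketch of fixed-size neighbor sampling. It is an illustration under stated assumptions: the function names, the ring-shaped toy graph, and the choice to sample with replacement when an entity has too few neighbors are all hypothetical.

```python
import random


def sample_neighbors(adj, entity, k, rng):
    """Return exactly k neighbors of an entity (assumes it has at least one)."""
    nbrs = adj[entity]
    if len(nbrs) >= k:
        return rng.sample(nbrs, k)
    # Too few neighbors: sample with replacement so the size stays fixed.
    return [rng.choice(nbrs) for _ in range(k)]


def support_set(adj, item, k, depth, rng):
    """Layers of sampled entities around an item; layer h holds k**h entries."""
    layers = [[item]]
    for _ in range(depth):
        layers.append([n for e in layers[-1] for n in sample_neighbors(adj, e, k, rng)])
    return layers


# Toy KG (hypothetical): a ring of 10 entities, each linked to its two neighbors.
adj = {i: [(i - 1) % 10, (i + 1) % 10] for i in range(10)}
layers = support_set(adj, item=0, k=4, depth=3, rng=random.Random(0))
layer_sizes = [len(layer) for layer in layers]
```

The size of each layer depends only on the sample size and the depth, not on the size of the KG, and it grows geometrically with depth. This is consistent with the stronger sensitivity to depth reported in Table 5.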
5.4. Run Time Analysis
We also investigate the scalability of our method with respect to the size of the KG. We run experiments on a Microsoft Azure virtual machine with 1 NVIDIA Tesla M60 GPU, 12 Intel Xeon CPUs (E5-2690 v3 @ 2.60GHz), and 128GB of RAM. The size of the KG is increased by up to five times the original one by extracting more triples from Satori, and the running times of all methods on MovieLens20M are reported in Figure 6. Note that the trend of a curve matters more than the actual values, since the values largely depend on the mini-batch size and the number of epochs (although we did try to align the configurations of all methods). We find that the running time of LibFM+TransE and CKE grows linearly with respect to KG size, while PER and RippleNet show an approximately exponential pattern. By contrast, KGCNLS exhibits strong scalability even when the KG is large. This is because the support set of each entity is a fixed-size sampling set, which is resistant to the expansion of the KG.
6. Conclusion and Future Work
This paper proposes knowledge graph convolutional networks with label smoothness regularization for recommender systems. KGCNLS extends GCNs to KGs by biasedly aggregating neighborhood information, which allows it to learn both the structure and semantic information of the KG as well as users’ personalized interests. The proposed LS regularization and leave-one-out loss provide strong additional guidance for the learning process. We also implement KGCNLS in a scalable mini-batch fashion. Through extensive experiments, KGCNLS is shown to consistently outperform state-of-the-art baselines in four recommendation scenarios, and to achieve desirable scalability with respect to KG size.
In this paper, LS regularization is proposed for the recommendation task on KGs. It would be interesting to examine the LS assumption on other graph tasks such as link prediction and node classification. Investigating the theoretical relationship between feature propagation and label propagation is also a promising direction.
References
 Baluja et al. (2008) Shumeet Baluja, Rohan Seth, D Sivakumar, Yushi Jing, Jay Yagnik, Shankar Kumar, Deepak Ravichandran, and Mohamed Aly. 2008. Video suggestion and discovery for youtube: taking random walks through the view graph. In Proceedings of the 17th international conference on World Wide Web. ACM, 895–904.
 Bordes et al. (2013) Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems. 2787–2795.
 Bruna et al. (2014) Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann Lecun. 2014. Spectral networks and locally connected networks on graphs. In the 2nd International Conference on Learning Representations.
 Defferrard et al. (2016) Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems. 3844–3852.
 Duvenaud et al. (2015) David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. 2015. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems. 2224–2232.
 Hamilton et al. (2017) Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems. 1024–1034.
 Hu et al. (2018) Binbin Hu, Chuan Shi, Wayne Xin Zhao, and Philip S Yu. 2018. Leveraging meta-path based context for top-n recommendation with a neural co-attention model. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 1531–1540.
 Huang et al. (2018) Jin Huang, Wayne Xin Zhao, Hongjian Dou, Ji-Rong Wen, and Edward Y Chang. 2018. Improving Sequential Recommendation with Knowledge-Enhanced Memory Networks. In the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. ACM, 505–514.
 Karasuyama and Mamitsuka (2013) Masayuki Karasuyama and Hiroshi Mamitsuka. 2013. Manifold-based similarity adaptation for label propagation. In Advances in neural information processing systems. 1547–1555.
 Kipf and Welling (2017) Thomas N Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In the 5th International Conference on Learning Representations.
 Koren (2008) Yehuda Koren. 2008. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 426–434.
 Koren et al. (2009) Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 8 (2009), 30–37.
 Monti et al. (2017) Federico Monti, Michael Bronstein, and Xavier Bresson. 2017. Geometric matrix completion with recurrent multigraph neural networks. In Advances in Neural Information Processing Systems. 3697–3707.
 Niepert et al. (2016) Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. 2016. Learning convolutional neural networks for graphs. In International Conference on Machine Learning. 2014–2023.
 Rendle (2012) Steffen Rendle. 2012. Factorization machines with libfm. ACM Transactions on Intelligent Systems and Technology (TIST) 3, 3 (2012), 57.
 Schlichtkrull et al. (2018) Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. 2018. Modeling relational data with graph convolutional networks. In European Semantic Web Conference. Springer, 593–607.
 Sun et al. (2018) Zhu Sun, Jie Yang, Jie Zhang, Alessandro Bozzon, Long-Kai Huang, and Chi Xu. 2018. Recurrent knowledge graph embedding for effective recommendation. In Proceedings of the 12th ACM Conference on Recommender Systems. ACM, 297–305.
 van den Berg et al. (2017) Rianne van den Berg, Thomas N Kipf, and Max Welling. 2017. Graph Convolutional Matrix Completion. stat 1050 (2017), 7.
 Velickovic et al. (2018) Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2018. Graph attention networks. In Proceedings of the 6th International Conferences on Learning Representations.
 Wang and Zhang (2008) Fei Wang and Changshui Zhang. 2008. Label propagation through linear neighborhoods. IEEE Transactions on Knowledge and Data Engineering 20, 1 (2008), 55–67.
 Wang et al. (2017b) Hongwei Wang, Jia Wang, Miao Zhao, Jiannong Cao, and Minyi Guo. 2017b. Joint topic-semantic-aware social recommendation for online voting. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. ACM, 347–356.
 Wang et al. (2018a) Hongwei Wang, Fuzheng Zhang, Min Hou, Xing Xie, Minyi Guo, and Qi Liu. 2018a. Shine: Signed heterogeneous information network embedding for sentiment link prediction. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. ACM, 592–600.
 Wang et al. (2018b) Hongwei Wang, Fuzheng Zhang, Jialin Wang, Miao Zhao, Wenjie Li, Xing Xie, and Minyi Guo. 2018b. RippleNet: Propagating User Preferences on the Knowledge Graph for Recommender Systems. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management. ACM, 417–426.
 Wang et al. (2019a) Hongwei Wang, Fuzheng Zhang, Jialin Wang, Miao Zhao, Wenjie Li, Xing Xie, and Minyi Guo. 2019a. Exploring High-Order User Preference on the Knowledge Graph for Recommender Systems. ACM Transactions on Information Systems (TOIS) 37, 3 (2019), 32.
 Wang et al. (2018c) Hongwei Wang, Fuzheng Zhang, Xing Xie, and Minyi Guo. 2018c. DKN: Deep Knowledge-Aware Network for News Recommendation. In Proceedings of the 2018 World Wide Web Conference on World Wide Web. 1835–1844.
 Wang et al. (2019b) Hongwei Wang, Fuzheng Zhang, Miao Zhao, Wenjie Li, Xing Xie, and Minyi Guo. 2019b. Multi-Task Feature Learning for Knowledge Graph Enhanced Recommendation. In Proceedings of the 2019 World Wide Web Conference on World Wide Web.
 Wang et al. (2019c) Hongwei Wang, Miao Zhao, Xing Xie, Wenjie Li, and Minyi Guo. 2019c. Knowledge graph convolutional networks for recommender systems. In Proceedings of the 2019 World Wide Web Conference on World Wide Web.
 Wang et al. (2017a) Quan Wang, Zhendong Mao, Bin Wang, and Li Guo. 2017a. Knowledge graph embedding: A survey of approaches and applications. IEEE Transactions on Knowledge and Data Engineering 29, 12 (2017), 2724–2743.
 Wu et al. (2018) Yuexin Wu, Hanxiao Liu, and Yiming Yang. 2018. Graph Convolutional Matrix Completion for Bipartite Edge Prediction. In the 10th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management.
 Yang et al. (2017) Xiwang Yang, Chao Liang, Miao Zhao, Hongwei Wang, Hao Ding, Yong Liu, Yang Li, and Junlin Zhang. 2017. Collaborative filtering-based recommendation of online social voting. IEEE Transactions on Computational Social Systems 4, 1 (2017), 1–13.
 Ying et al. (2018) Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L Hamilton, and Jure Leskovec. 2018. Graph Convolutional Neural Networks for Web-Scale Recommender Systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 974–983.
 Yu et al. (2014) Xiao Yu, Xiang Ren, Yizhou Sun, Quanquan Gu, Bradley Sturt, Urvashi Khandelwal, Brandon Norick, and Jiawei Han. 2014. Personalized entity recommendation: A heterogeneous information network approach. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining. ACM, 283–292.
 Zhang et al. (2016) Fuzheng Zhang, Nicholas Jing Yuan, Defu Lian, Xing Xie, and Wei-Ying Ma. 2016. Collaborative knowledge base embedding for recommender systems. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 353–362.
 Zhang and Lee (2007) Xinhua Zhang and Wee S Lee. 2007. Hyperparameter learning for graph based semi-supervised learning algorithms. In Advances in neural information processing systems. 1585–1592.
 Zhao et al. (2017) Huan Zhao, Quanming Yao, Jianda Li, Yangqiu Song, and Dik Lun Lee. 2017. Meta-graph based recommendation fusion over heterogeneous information networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 635–644.
 Zhou et al. (2004) Dengyong Zhou, Olivier Bousquet, Thomas N Lal, Jason Weston, and Bernhard Schölkopf. 2004. Learning with local and global consistency. In Advances in neural information processing systems. 321–328.
 Zhu et al. (2003) Xiaojin Zhu, Zoubin Ghahramani, and John D Lafferty. 2003. Semi-supervised learning using gaussian fields and harmonic functions. In Proceedings of the 20th International conference on Machine learning. 912–919.
Appendix
A Proof of Theorem 1
Proof.
Denote the minimumenergy label function as . Note that the value of is fixed on . Therefore, the value of on should satisfy , which implies that . ∎
B Proof of Theorem 2
Proof.
Let . Since is fixed on , we are solely interested in . We denote (the subscript is omitted from for ease of notation), and partition matrix into submatrices according to the partition of :
Then the label propagation scheme is equivalent to
(9) 
Repeating the above procedure, we have
(10) 
where is the initial value for . Now we show that . Since is rownormalized and is a submatrix of , we have
for all possible row index . Therefore,
As goes to infinity, the row sum of converges to zero, which implies that . It is clear that the choice of initial value does not affect the convergence.
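The convergence argument can be checked numerically on a small graph. The sketch below is an illustration under stated assumptions, not part of the proof: the 4-node adjacency matrix, the variable names, and the split into two labeled and two unlabeled nodes are hypothetical, and the closed form follows the standard harmonic-function solution of Zhu et al. (2003).

```python
import numpy as np


def propagate(P, labels_L, n_labeled, init_U, iters=200):
    """Iterate label propagation while keeping the labeled entries fixed.

    P: row-normalized transition matrix, labeled nodes first.
    labels_L: fixed labels of the first n_labeled nodes.
    init_U: initial labels of the remaining (unlabeled) nodes.
    """
    P_UL = P[n_labeled:, :n_labeled]  # unlabeled -> labeled block
    P_UU = P[n_labeled:, n_labeled:]  # unlabeled -> unlabeled block
    l_U = init_U.copy()
    for _ in range(iters):
        l_U = P_UL @ labels_L + P_UU @ l_U
    return l_U


# Toy graph (hypothetical): nodes 0 and 1 are labeled, nodes 2 and 3 are not.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
P = A / A.sum(axis=1, keepdims=True)  # row normalization
labels_L = np.array([1.0, 0.0])

# Closed-form fixed point: (I - P_UU)^{-1} P_UL l_L.
closed_form = np.linalg.solve(np.eye(2) - P[2:, 2:], P[2:, :2] @ labels_L)
from_zeros = propagate(P, labels_L, 2, np.zeros(2))
from_ones = propagate(P, labels_L, 2, np.ones(2))
```

Both initializations reach the same fixed point because every row of the unlabeled-to-unlabeled block sums to less than one, so its powers vanish and the contribution of the initial value disappears.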
C Additional Details on Datasets
The MovieLens20M, BookCrossing, and Last.FM datasets contain explicit feedback data (Last.FM provides the listening count as the weight of each user-item interaction). Therefore, we transform them into implicit feedback, where each entry is marked with 1 to indicate that the user has rated the item positively. The threshold of positive rating is 4 for MovieLens20M, while no threshold is set for BookCrossing and Last.FM due to their sparsity. Additionally, for each user we randomly sample a set of unwatched items and mark them as 0, equal in number to the user's positively rated items.
We use Microsoft Satori to construct the KGs for the MovieLens20M, BookCrossing, and Last.FM datasets. In each triple of the Satori KG, the head and tail are either IDs or textual content, and the relation has the form “domain.head_category.tail_category” (e.g., “book.book.author”). We first select a subset of triples from the whole Satori KG with a confidence level greater than 0.9. Given this sub-KG, we collect the Satori IDs of all valid movies/books/musicians by matching their names with the tail of triples (head, film.film.name, tail), (head, book.book.title, tail), or (head, type.object.name, tail), for the three datasets respectively. Items with multiple matched entities or no matched entity are excluded for simplicity. After obtaining the set of item IDs, we match these item IDs with the head of all triples in the Satori sub-KG, and select all well-matched triples as the final KG for each dataset.
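The construction steps above can be sketched on a generic list of triples. Satori is not publicly available, so everything below is a hypothetical stand-in: the 4-tuple layout with a confidence field, the function name, and the toy triples (including the relation `film.film.director`) are assumptions made for illustration.

```python
def build_item_kg(triples, item_names, name_relation, min_confidence=0.9):
    """Select the KG triples whose head is an unambiguously matched item.

    triples: iterable of (head, relation, tail, confidence).
    item_names: names of the items in the recommendation dataset.
    name_relation: relation whose tail holds an entity name.
    """
    # Step 1: keep only confident triples.
    sub_kg = [(h, r, t) for h, r, t, c in triples if c > min_confidence]

    # Step 2: map each name to the entity IDs that carry it.
    name_to_ids = {}
    for h, r, t in sub_kg:
        if r == name_relation:
            name_to_ids.setdefault(t, set()).add(h)

    # Step 3: drop items with no match or with multiple matches.
    item_to_entity = {}
    for name in item_names:
        ids = name_to_ids.get(name, set())
        if len(ids) == 1:
            item_to_entity[name] = next(iter(ids))

    # Step 4: keep the triples whose head is a matched item.
    matched = set(item_to_entity.values())
    kg = [(h, r, t) for h, r, t in sub_kg if h in matched]
    return item_to_entity, kg


# Toy triples (hypothetical).
triples = [
    ("m1", "film.film.name", "Alpha", 0.95),
    ("m2", "film.film.name", "Beta", 0.95),
    ("m3", "film.film.name", "Beta", 0.99),   # "Beta" is ambiguous
    ("m1", "film.film.director", "p1", 0.97),
    ("m1", "film.film.genre", "g1", 0.50),    # low confidence, dropped
    ("m4", "film.film.name", "Gamma", 0.80),  # low confidence, dropped
]
item_to_entity, kg = build_item_kg(triples, ["Alpha", "Beta", "Gamma"], "film.film.name")
```

Only "Alpha" survives in the toy example: "Beta" matches two entities and "Gamma" has no confident match, so both are excluded.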
The DianpingFood dataset is collected from Dianping.com, a Chinese group buying website hosting consumer reviews of restaurants similar to Yelp. We select approximately 10 million interactions between users and restaurants in Dianping.com from May 1, 2015 to December 12, 2018. The types of positive interactions include clicking, buying, and adding to favorites, and we sample negative interactions for each user. The KG for DianpingFood is collected from Meituan Brain, an internal knowledge graph built for dining and entertainment by Meituan-Dianping Group. The types of entities include POI (restaurant), city, first-level and second-level category, star, business area, dish, and tag; the types of relations correspond to the types of entities (e.g., “organization.POI.has_dish”).
The basic statistics for the four datasets are shown in Table 6.
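Table 6. Statistics of the four datasets: MovieLens20M (Movie), BookCrossing (Book), Last.FM (Music), and DianpingFood (Restaurant).
Movie  Book  Music  Restaurant
# users  138,159  19,676  1,872  2,298,698
# items  16,954  20,003  3,846  1,362
# interactions  13,501,622  172,576  42,346  23,416,418
# entities  102,569  25,787  9,366  28,115
# relations  32  18  60  7
# KG triples  499,474  60,787  15,518  160,519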
D Additional Details on Baselines
The hyperparameter settings for baselines are as follows:

For SVD, we use the unbiased version (i.e., the predicted rating is modeled as ). The dimension and learning rate for the four datasets are set as: , for MovieLens20M, BookCrossing; , for Last.FM; , for DianpingFood.

For LibFM, we use the code from http://www.libfm.org/. The dimension is set as and the number of training epochs is for all datasets. The code of TransE is from https://github.com/thunlp/FastTransX. The dimension of TransE is for all datasets.

For PER, we use manually designed “user-item-attribute-item” meta-paths, i.e., “user-movie-director-movie”, “user-movie-genre-movie”, and “user-movie-star-movie” for MovieLens20M; “user-book-author-book” and “user-book-genre-book” for BookCrossing; “user-musician-date_of_birth-musician” (date of birth is discretized), “user-musician-country-musician”, and “user-musician-genre-musician” for Last.FM; and “user-restaurant-dish-restaurant”, “user-restaurant-business_area-restaurant”, and “user-restaurant-tag-restaurant” for DianpingFood. The settings of dimension and learning rate are the same as for SVD.

For CKE, we only introduce the important settings here since it has nearly 20 hyperparameters. The dimension of embedding for the four datasets are , , , . The training weight for KG part is for all datasets. The learning rates are the same as in SVD.

For RippleNet, , , , , for MovieLens20M; , , , , for Last.FM; , , , , for DianpingFood. The code of RippleNet is from https://github.com/hwwang55/RippleNet.
Other hyperparameters that are not mentioned are the same as reported in the original papers or as the defaults in their code. We implement SVD, PER, and CKE in TensorFlow since the authors do not open-source their code.
E Additional Details on Experiment Setup
Movie  Book  Music  Restaurant  
16  8  8  4  
32  64  16  8  
1  2  1  2  
1.0  0.5  0.1  0.5  
batch size  65,536  256  128  65,536 
In KGCNLS, we set functions and as inner product, as ReLU for non-last layers and for the last layer (because we do not want the final embeddings to have all non-negative entries). Other hyperparameter settings are given in Table 7, which are determined by optimizing on a validation set. The search spaces for hyperparameters are as follows:

;

;

;

;

;

;

Batch size for MovieLens20M and DianpingFood; batch size for BookCrossing and Last.FM.
For each dataset, the ratio of training, evaluation, and test set is . Each experiment is repeated times, and the average performance is reported. All trainable parameters are optimized by the Adam algorithm. The code of KGCNLS is implemented with Python 3.6, TensorFlow 1.12.0, and NumPy 1.14.3.