Knowledge Graph Convolutional Networks for Recommender Systems with Label Smoothness Regularization
Knowledge graphs capture interlinked information between entities and they represent an attractive source of structured information that can be harnessed for recommender systems. However, existing recommender engines use knowledge graphs by manually designing features, do not allow for end-to-end training, or provide poor scalability. Here we propose Knowledge Graph Convolutional Networks (KGCN), an end-to-end trainable framework that harnesses item relationships captured by the knowledge graph to provide better recommendations. Conceptually, KGCN computes user-specific item embeddings by first applying a trainable function that identifies important knowledge graph relations for a given user and then transforming the knowledge graph into a user-specific weighted graph. Then, KGCN applies a graph convolutional neural network that computes an embedding of an item node by propagating and aggregating knowledge graph neighborhood information. Moreover, to provide a better inductive bias, KGCN uses label smoothness (LS), which provides regularization over edge weights, and we prove that it is equivalent to a label propagation scheme on a graph. Finally, we unify KGCN and LS regularization, and present a scalable minibatch implementation for the KGCN-LS model. Experiments show that KGCN-LS outperforms strong baselines on four datasets. KGCN-LS also achieves strong performance in sparse scenarios and is highly scalable with respect to the knowledge graph size.
Recommender systems are widely used in Internet applications and services to meet users’ personalized interests and alleviate the issue of information overload. Traditional collaborative filtering recommender algorithms (Koren et al., 2009; Wang et al., 2017b) usually suffer from the sparsity issue of user-item interactions and the cold start problem, which could be addressed by introducing additional information such as user/item attributes (Wang et al., 2018a) and social networks (Yang et al., 2017). Knowledge graphs (KGs) provide an attractive source of relational information about items, which can be harnessed to improve recommendations (Zhang et al., 2016; Wang et al., 2018c; Huang et al., 2018; Yu et al., 2014; Zhao et al., 2017; Hu et al., 2018; Wang et al., 2018b; Sun et al., 2018; Wang et al., 2019b, c; Wang et al., 2019a). A KG is a heterogeneous graph in which nodes correspond to entities and edges correspond to relations. In many recommendation scenarios, an item (i.e., a product) may correspond to an entity in a KG. KGs then provide connectivity information between items via different types of relations and allow for revealing semantic relatedness of items. In addition, KGs can also provide diversity and explainability for the recommended results (Wang et al., 2018b).
The core challenge in utilizing KGs to improve recommender systems is in learning how to express user-specific item relatedness via the KG. Existing KG-aware recommender systems can be classified into path-based methods (Yu et al., 2014; Zhao et al., 2017; Hu et al., 2018), embedding-based methods (Zhang et al., 2016; Wang et al., 2018c; Huang et al., 2018; Wang et al., 2019b), and hybrid methods (Wang et al., 2018b; Sun et al., 2018; Wang et al., 2019c). However, these approaches cannot make full use of KGs due to manually designed features, inability to perform end-to-end training, or poor scalability (see Section 2.3 for more discussion). Graph Convolutional Neural Networks (GCNs), which aggregate features from local neighbors on a graph using neural networks, represent a promising advancement in graph-based representation learning (Bruna et al., 2014; Defferrard et al., 2016; Kipf and Welling, 2017; Duvenaud et al., 2015; Niepert et al., 2016; Hamilton et al., 2017). Recently, several works aimed to use GCNs in recommender systems (Ying et al., 2018; Monti et al., 2017; van den Berg et al., 2017; Wu et al., 2018) (see Section 2.1 for more discussion), but these approaches are all designed for homogeneous bipartite user-item interaction graphs or user/item-similarity graphs, where GCNs can be deployed directly. It remains an open question how to extend the GCN architecture to KG-aware recommender engines, because (1) GCNs are proposed for general weighted graphs, while KG edges are heterogeneous without explicit weights, and (2) it is not clear how to combine entity features learned by GCNs with the recommender system.
In this paper, we rethink the problem of KG-aware recommendations. Our design objectives are to automatically capture both structural and semantic information in the KG, while maintaining scalability with respect to the KG size. Therefore, we present a Knowledge Graph Convolutional Neural Network (KGCN) approach for recommender systems. To account for the relational heterogeneity of KGs, we propose using a trainable and personalized relation scoring function to transform the KG into a user-specific weighted graph, which characterizes both the semantic information of the KG and the users’ personalized interests. For example, in the movie recommendation setting the relation scoring function could learn that a given user really cares about “director” relation between movies and persons, while somebody else may care more about “lead actor” relation. Using this personalized weighted graph, KGCN then aggregates neighborhood information with bias when calculating the embedding of a given item node. The embedding vector of each item captures the local KG structure around an item node in a user-personalized way. Note that the size of a KG and the number of an entity’s neighbors may be prohibitively large in practice. Therefore, we implement KGCN in a minibatch fashion, and perform efficient localized convolutions by sampling a fixed-size neighborhood for each node, which guarantees tractable computational cost and greatly improves scalability.
It is worth noting a significant difference between our approach and traditional GCNs. In our case the edge weights in the graph are not given as input but setting them also requires supervised training of relation scoring function. This added flexibility makes the optimization process prone to overfitting, since the only source of supervised signal is coming from user-item interactions. Therefore, additional regularization on edge weights is needed to guide the learning process and achieve better generalization. We propose taking label smoothness (LS) (Zhu et al., 2003; Zhang and Lee, 2007) as the additional regularization, which assumes that adjacent entities in the KG are likely to have similar labels. In our context this means users tend to engage similarly with nearby items in the KG. We prove that the LS regularization is equivalent to label propagation and we therefore design a leave-one-out loss function for label propagation to provide extra supervised signal for learning the edge scoring function. We show that KGCN (feature propagation) and LS regularization (label propagation) can be unified in a minibatch implementation, thus LS can be seen as a natural choice of regularization on KGCN.
Empirically, we apply the proposed Knowledge Graph Convolutional Networks with Label Smoothness regularization (KGCN-LS) to four real-world scenarios of movie, book, music, and restaurant recommendations, in which the first three are public datasets and the last is from Meituan-Dianping Group. Experimental results show that KGCN-LS achieves significant gains over state-of-the-art baselines in recommendation accuracy. We also show that KGCN-LS maintains its recommendation performance even when user-item interactions are sparse and is highly scalable with respect to KG size. We release the code and datasets to researchers for validating the reported results and conducting further research. The code and data are available at https://github.com/hwwang55/KGCN.
2. Related Work
2.1. Graph Convolutional Neural Networks
GCNs aim to generalize convolutional neural networks to non-Euclidean domains (such as graphs) for robust feature learning. Bruna et al. (Bruna et al., 2014) define the convolution in Fourier domain and calculate the eigendecomposition of the graph Laplacian, Defferrard et al. (Defferrard et al., 2016) approximate the convolutional filters by Chebyshev expansion of the graph Laplacian, and Kipf et al. (Kipf and Welling, 2017) propose a convolutional architecture via a first-order approximation. In contrast to these spectral GCNs, non-spectral GCNs operate on the graph directly and apply “convolution” (i.e., weighted average) to local neighbors of a node (Duvenaud et al., 2015; Niepert et al., 2016; Hamilton et al., 2017).
Recently, researchers also deployed GCNs in recommender systems: PinSage (Ying et al., 2018) applies GCNs to the pin-board bipartite graph in Pinterest. Monti et al. (Monti et al., 2017) and Berg et al. (van den Berg et al., 2017) model recommender systems as matrix completion and design GCNs for representation learning on user-item bipartite graph. Wu et al. (Wu et al., 2018) use GCNs on user/item intrinsic structure graphs to learn user/item representations. The difference between these works and ours is that they are all designed for homogeneous bipartite graphs or user/item-similarity graphs where GCNs can be used directly, while here we investigate GCNs for heterogeneous KGs. Researchers also propose using GCNs to model KGs (Schlichtkrull et al., 2018), but not for the purpose of recommendation.
2.2. Semi-supervised Learning on Graphs
The goal of graph-based semi-supervised learning is to correctly label all nodes in a graph given that only a few nodes are labeled. Prior work often makes assumptions on the distribution of labels over the graph, and one common assumption is smoothness. Based on different settings of edge weights in the input graph, these methods are classified as: (1) Edge weights are assumed to be given as input and therefore fixed (Zhu et al., 2003; Zhou et al., 2004; Baluja et al., 2008); (2) Edge weights are parameterized and therefore learnable (Zhang and Lee, 2007; Wang and Zhang, 2008; Karasuyama and Mamitsuka, 2013). Inspired by these methods, we design a module of label smoothness regularization in our proposed model. The major distinction of our work is that the label smoothness constraint is not used for semi-supervised learning on graphs, but serves as regularization to assist the learning of edge weights and achieves better generalization for recommender systems.
2.3. Recommendations with Knowledge Graphs
In general, existing KG-aware recommender systems can be classified into three categories: (1) Embedding-based methods (Zhang et al., 2016; Wang et al., 2018c; Huang et al., 2018) pre-process a KG with knowledge graph embedding (KGE) (Wang et al., 2017a) algorithms, then incorporate learned entity embeddings into recommendation. Embedding-based methods are highly flexible in utilizing KGs to assist recommender systems, but the KGE algorithms focus more on modeling rigorous semantic relatedness (e.g., TransE (Bordes et al., 2013) assumes $\text{head} + \text{relation} = \text{tail}$), which is more suitable for in-graph applications such as link prediction than for recommendation. In addition, embedding-based methods usually lack an end-to-end way of training. (2) Path-based methods (Yu et al., 2014; Zhao et al., 2017; Hu et al., 2018) explore various patterns of connections among items in a KG (a.k.a. meta-paths or meta-graphs) to provide additional guidance for recommendations. Path-based methods make use of KGs in a more intuitive way, but they rely heavily on manually designed meta-paths/meta-graphs, which are hard to tune in practice. (3) Hybrid methods (Wang et al., 2018b; Sun et al., 2018) combine the above two categories and learn user/item embeddings by exploiting the structure of KGs. Our proposed KGCN-LS can be seen as an instance of hybrid methods. But in contrast to hybrid methods such as RKGE (Sun et al., 2018) or RippleNet (Wang et al., 2018b), the computational complexity of KGCN-LS scales well with the increasing size of the KG.
3. Problem Formulation
We begin by describing the KG-aware recommendation problem and introducing notation. In a typical recommendation scenario, we have a set of users $\mathcal{U}$ and a set of items $\mathcal{V}$. The user-item interaction matrix $\mathbf{Y}$ is defined according to users’ implicit feedback, where $y_{uv} = 1$ indicates that user $u$ has engaged with item $v$, such as clicking, watching, or purchasing. We also have a knowledge graph $\mathcal{G} = \{(h, r, t)\}$ available, which is comprised of entity-relation-entity triples $(h, r, t)$. Here $h \in \mathcal{E}$, $r \in \mathcal{R}$, and $t \in \mathcal{E}$ denote the head, relation, and tail of a knowledge triple, and $\mathcal{E}$ and $\mathcal{R}$ are the sets of entities and relations in the knowledge graph, respectively. For example, the triple (A Song of Ice and Fire, book.book.author, George Martin) states the fact that George Martin writes the book “A Song of Ice and Fire”. In many recommendation scenarios, an item $v \in \mathcal{V}$ corresponds to an entity $e \in \mathcal{E}$. For example, in book recommendation, the item “A Song of Ice and Fire” also appears in the knowledge graph as an entity with the same name. The set of entities $\mathcal{E}$ can therefore be partitioned as $\mathcal{E} = \mathcal{V} \cup (\mathcal{E} \setminus \mathcal{V})$, where $\mathcal{E} \setminus \mathcal{V}$ is the set of non-item entities. Given the user-item interaction matrix $\mathbf{Y}$ as well as the knowledge graph $\mathcal{G}$, our task is to predict whether user $u$ has potential interest in item $v$ with which he/she has not engaged before. Specifically, we aim to learn a prediction function $\hat{y}_{uv} = \mathcal{F}(u, v \mid \Theta, \mathbf{Y}, \mathcal{G})$, where $\hat{y}_{uv}$ denotes the probability that user $u$ will engage with item $v$, and $\Theta$ are the model parameters of function $\mathcal{F}$.
4. Our Approach: KGCN-LS
In this section, we first introduce knowledge graph convolutional networks and label smoothness regularization, respectively, then we present the minibatch implementation of KGCN-LS.
4.1. Knowledge Graph Convolutional Networks
The key idea of our approach to KG-aware recommendations is to transform a heterogeneous KG into a user-personalized weighted graph, which characterizes the user’s preferences. To this end, we introduce a user-specific relation scoring function $s_u(r)$ that provides the importance of relation $r$ for user $u$:

$$s_u(r) = g(\mathbf{u}, \mathbf{r}), \tag{1}$$

where $\mathbf{u}$ and $\mathbf{r}$ are feature vectors of user $u$ and relation type $r$, respectively, and $g$ is a differentiable function such as inner product. In general, $s_u(r)$ characterizes the importance of relation $r$ to user $u$. For example, a user may be more interested in movies that share the same “lead actor” relation with his/her historically liked movies, while another user may be more concerned about the “genre”.
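As a minimal sketch of the relation scoring function in Eq. (1), assuming the inner product is chosen as the differentiable function: the user names, relation names, and two-dimensional vectors below are hypothetical placeholders for embeddings that the model would learn.

```python
def relation_score(user_vec, relation_vec):
    """Relation importance for one user, with the inner product as the scoring function."""
    return sum(a * b for a, b in zip(user_vec, relation_vec))


# Hypothetical 2-d embeddings; in the model these would be learned parameters.
relations = {"director": [1.0, 0.0], "lead_actor": [0.0, 1.0]}
users = {"alice": [0.9, 0.1], "bob": [0.2, 0.8]}

# One score per (user, relation) pair: the same KG edge type receives a
# different weight depending on who the recommendation is for.
scores = {
    name: {rel: relation_score(u, r) for rel, r in relations.items()}
    for name, u in users.items()
}

for name, by_relation in scores.items():
    print(name, by_relation)
```

With these toy vectors the first user weights “director” edges more heavily while the second weights “lead_actor” edges more heavily, which is the kind of personalization the scoring function is meant to express.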
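Given the user-specific relation scoring function $s_u(\cdot)$ of user $u$, the knowledge graph $\mathcal{G}$ can therefore be transformed into a user-specific adjacency matrix $\mathbf{A}_u \in \mathbb{R}^{|\mathcal{E}| \times |\mathcal{E}|}$, in which the $(i, j)$-entry is $A_u^{ij} = s_u(r_{e_i, e_j})$, and $r_{e_i, e_j}$ is the relation between entities $e_i$ and $e_j$ in $\mathcal{G}$; $A_u^{ij} = 0$ if there is no relation between $e_i$ and $e_j$. (In this work we treat $\mathcal{G}$ as an undirected graph, so $\mathbf{A}_u$ is a symmetric matrix. If both triples $(h, r_1, t)$ and $(t, r_2, h)$ exist, we only consider one of $r_1$ and $r_2$. This is because (1) $r_1$ and $r_2$ are the inverse of each other and semantically related; (2) treating $\mathbf{A}_u$ as symmetric greatly increases the matrix density.) See the left two subfigures in Figure 1 for illustration. We also denote the raw feature matrix of entities as $\mathbf{E} \in \mathbb{R}^{|\mathcal{E}| \times d_0}$, where $d_0$ is the dimension of raw entity features. In KGCN-LS, we use multiple feed-forward layers to update the entity representation matrix by aggregating representations of neighboring entities. Specifically, the layer-wise forward propagation can be expressed as

$$\mathbf{H}_{l+1} = \sigma\left(\tilde{\mathbf{D}}_u^{-1/2} \left(\mathbf{A}_u + \mathbf{I}\right) \tilde{\mathbf{D}}_u^{-1/2} \mathbf{H}_l \mathbf{W}_l\right), \quad l = 0, 1, \dots, L-1. \tag{2}$$

In Eq. (2), $\mathbf{H}_l$ is the matrix of hidden representations of entities in layer $l$, and $\mathbf{H}_0 = \mathbf{E}$. $\mathbf{A}_u + \mathbf{I}$ is designed to aggregate representation vectors of neighboring entities, where $\mathbf{I}$ is the identity matrix. We use $\mathbf{A}_u + \mathbf{I}$ instead of $\mathbf{A}_u$, i.e., we add a self-connection to each entity, to ensure that the old representation vector of the entity itself is taken into consideration when updating entity representations. $\tilde{\mathbf{D}}_u$ is a diagonal degree matrix with entries $\tilde{D}_u^{ii} = \sum_j (\mathbf{A}_u + \mathbf{I})^{ij}$; therefore, $\tilde{\mathbf{D}}_u^{-1/2}$ is used to normalize $\mathbf{A}_u + \mathbf{I}$ and keep the entity representation matrix stable. $\mathbf{W}_l \in \mathbb{R}^{d_l \times d_{l+1}}$ is the layer-specific trainable weight matrix, and $\sigma$ is a non-linear activation function.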
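The layer-wise propagation described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: edge weights are taken to be non-negative so that all degrees are positive, `tanh` stands in for the unspecified activation, and the function names and toy matrices are hypothetical.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def gcn_layer(adj, hidden, weight, activation=math.tanh):
    """One propagation step with self-connections and symmetric degree normalization.

    Assumes non-negative edge weights so that every degree is positive.
    """
    n = len(adj)
    # Add self-connections so each entity keeps its own representation.
    a_self = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in a_self]
    # Symmetric normalization keeps the scale of the representations stable.
    a_norm = [
        [a_self[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
        for i in range(n)
    ]
    pre_activation = matmul(matmul(a_norm, hidden), weight)
    return [[activation(x) for x in row] for row in pre_activation]


# Toy user-specific graph on 3 entities: 0 -- 1 -- 2 (hypothetical weights).
adj_u = [
    [0.0, 0.5, 0.0],
    [0.5, 0.0, 2.0],
    [0.0, 2.0, 0.0],
]
h0 = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # raw entity features
w0 = [[1.0, 0.0], [0.0, 1.0]]              # identity weights for readability

h1 = gcn_layer(adj_u, h0, w0)
for row in h1:
    print(row)
```

After one layer, entity 2 (which starts with all-zero features) picks up the feature of its neighbor, entity 1, but nothing from entity 0, which is two hops away; stacking a second layer would let that information arrive as well.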
A single KGCN-LS layer computes the representation of an entity via a transformed mixture of itself and its immediate neighbors in the KG. We can therefore naturally extend KGCN-LS to multiple layers to explore users’ potential interests in a broader and deeper way. The final output of KGCN-LS is $\mathbf{H}_L \in \mathbb{R}^{|\mathcal{E}| \times d_L}$, which contains the entity representations that mix the initial features of the entities themselves and their neighbors up to $L$ hops away. Finally, the predicted engagement probability of user $u$ with item $v$ is calculated by

$$\hat{y}_{uv} = f(\mathbf{u}, \mathbf{v}_u), \tag{3}$$

where $\mathbf{v}_u$ is the final representation vector of item $v$ (the row of $\mathbf{H}_L$ corresponding to $v$), and $f$ is a differentiable prediction function, for example, inner product or a multilayer perceptron. Note that $\mathbf{v}_u$ is user-specific since the adjacency matrix $\mathbf{A}_u$ is user-specific. Furthermore, note that the system is end-to-end trainable: the gradients flow from $f(\cdot)$ via the GCN (parameter matrices $\mathbf{W}_l$) to $g(\cdot)$ and eventually to the representations of users $u$ and items $v$.
How can the knowledge graph help find users’ interests? To intuitively understand the role of the KG, we make an analogy with a physical equilibrium model as shown in Figure 2. Each entity is seen as a particle, while the supervised positive signal acts as the force pulling the observed positive samples up from the decision boundary and the negative sampling signal acts as the force pushing the unobserved samples down. Without the KG (Figure 2(a)), these samples are only loosely connected with each other through the collaborative filtering effect (which is not drawn here for clarity). In contrast, edges in the KG serve as rubber bands that impose explicit constraints on connected entities. When the number of layers is $L = 1$ (Figure 2(b)), the representation of each entity is a mixture of itself and its immediate neighbors; therefore, optimizing on the positive samples will simultaneously pull their immediate neighbors up together. The upward force goes deeper in the KG with the increase of $L$ (Figure 2(c)), which helps explore users’ long-distance interests and pull up more positive samples. It is also interesting to note that the proximity constraint exerted by the KG is personalized, since the strength of the rubber band (i.e., $s_u(r)$ in Eq. (1)) is user-specific and relation-specific: one user may prefer relation $r_1$ (Figure 2(b)) while another user (with the same observed samples but different unobserved samples) may prefer relation $r_2$ (Figure 2(d)).
4.3. Label Smoothness Regularization
Despite the force exerted by edges in the KG, edge weights may be set inappropriately, for example, too small to pull up the unobserved samples (i.e., the rubber bands are too weak). This is exactly the difference between our work and traditional GCNs: in a traditional GCN, edge weights of the input graph are fixed; but in KGCN, the edge weights $\mathbf{A}_u$ in Eq. (2) are learnable (including possible parameters of function $g$ and feature vectors of users and relations) and also require supervised training, like $\mathbf{W}_l$. Though this enhances the fitting ability of the model, it will inevitably make the optimization process prone to overfitting, since the only source of supervised signal is the user-item interactions outside the KGCN-LS layers. Moreover, edge weights do play an essential role in learning tasks on graphs, as highlighted by a large body of prior work (Zhu et al., 2003; Zhang and Lee, 2007; Wang and Zhang, 2008; Karasuyama and Mamitsuka, 2013; Velickovic et al., 2018). Therefore, more regularization on edge weights is needed to assist the learning of entity representations and to help generalize to unobserved interactions more efficiently.
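Let us consider what an ideal set of edge weights should look like. Consider a real-valued label function $l_u: \mathcal{E} \rightarrow \mathbb{R}$ on $\mathcal{G}$, which is constrained to take a specific value $l_u(v) = y_{uv}$ at node $v \in \mathcal{V} \subseteq \mathcal{E}$. To be more specific, $l_u(v) = 1$ if user $u$ has engaged with item $v$, otherwise $l_u(v) = 0$. Intuitively, we hope that adjacent entities in the KG are likely to have similar labels, which is known as the label smoothness assumption. This motivates our choice of energy function $E$:

$$E(l_u, \mathbf{A}_u) = \frac{1}{2} \sum_{e_i \in \mathcal{E},\, e_j \in \mathcal{E}} A_u^{ij} \left(l_u(e_i) - l_u(e_j)\right)^2. \tag{4}$$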
We show that the minimum-energy label function is harmonic by the following theorem (proofs of all theorems can be found in Appendix):
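Theorem 1.
The minimum-energy label function
$$l_u^* = \mathop{\arg\min}_{l_u:\, l_u(v) = y_{uv},\, \forall v \in \mathcal{V}} E(l_u, \mathbf{A}_u)$$
w.r.t. Eq. (4) is harmonic, i.e., $l_u^*$ satisfies $\Delta l_u^* = 0$ on the non-item entities $\mathcal{E} \setminus \mathcal{V}$, where $\Delta = \mathbf{D}_u - \mathbf{A}_u$ is the graph Laplacian of $\mathbf{A}_u$, and $\mathbf{D}_u$ is a diagonal degree matrix with entries $D_u^{ii} = \sum_j A_u^{ij}$.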
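The harmonic property indicates that the value of $l_u^*$ at each non-item entity $e_i \in \mathcal{E} \setminus \mathcal{V}$ is the weighted average of its neighboring entities:

$$l_u^*(e_i) = \frac{1}{D_u^{ii}} \sum_{e_j \in \mathcal{E}} A_u^{ij}\, l_u^*(e_j), \tag{5}$$

which leads to the following iterative label propagation scheme: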
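Theorem 2.
Repeating the following two steps:
Propagating labels: $l_u(\mathcal{E}) \leftarrow \mathbf{D}_u^{-1} \mathbf{A}_u\, l_u(\mathcal{E})$;
Reset labels on the set of items: $l_u(\mathcal{V}) \leftarrow \mathbf{Y}[u, \mathcal{V}]^\top$;
will lead to $l_u \rightarrow l_u^*$.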
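The two-step scheme in Theorem 2 can be illustrated with a short sketch. It assumes a small dense, symmetric, non-negative weight matrix and a fixed number of iterations; the function name, the initial value 0.5 for unlabeled entities, and the toy path graph are hypothetical choices made for illustration only.

```python
def propagate_labels(adj, item_labels, num_iters=200, init=0.5):
    """Iterate 'average over neighbors, then reset item labels' on a small dense graph.

    adj: symmetric list-of-lists of non-negative edge weights.
    item_labels: dict mapping item index -> observed label (0 or 1).
    """
    n = len(adj)
    labels = [float(item_labels.get(i, init)) for i in range(n)]
    for _ in range(num_iters):
        propagated = []
        for i in range(n):
            degree = sum(adj[i])
            if degree > 0:
                # Degree-normalized weighted average of the neighbors' labels.
                propagated.append(sum(w * l for w, l in zip(adj[i], labels)) / degree)
            else:
                propagated.append(labels[i])
        labels = propagated
        # Reset the labels of items to their observed values.
        for i, y in item_labels.items():
            labels[i] = float(y)
    return labels


# Path graph 0 -- 1 -- 2 -- 3 with unit weights; nodes 0 and 3 are items.
adj = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
labels = propagate_labels(adj, {0: 1, 3: 0})
print(labels)
```

On this path graph the two unlabeled entities settle at values that interpolate between the positive item and the negative item, and each one equals the average of its two neighbors, which is the harmonic property stated in Theorem 1.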
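Theorem 2 provides a way of reaching the minimum of the energy function $E$ over label functions. However, $l_u^*$ does not provide any signal for updating the edge weight matrix $\mathbf{A}_u$, since the labeled part of $l_u^*$, i.e., $l_u^*(\mathcal{V})$, equals the true labels $\mathbf{Y}[u, \mathcal{V}]$; moreover, we do not know the true labels for the unlabeled part $l_u^*(\mathcal{E} \setminus \mathcal{V})$.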
To solve this issue, we propose minimizing the leave-one-out loss (Zhang and Lee, 2007). Suppose we hold out a single item $v$ and treat it as unlabeled. Then we predict its label by using the rest of the (labeled) items and the (unlabeled) non-item entities. The prediction process is identical to the label propagation in Theorem 2, except that the label of item $v$ is hidden and needs to be calculated. This way, the difference between the true label of $v$ (i.e., $y_{uv}$) and the predicted label $\hat{l}_u(v)$ serves as a supervised signal for regularizing edge weights:

$$R(\mathbf{A}) = \sum_u R(\mathbf{A}_u) = \sum_u \sum_v J\left(y_{uv}, \hat{l}_u(v)\right), \tag{7}$$

where $J$ is the cross-entropy loss function. Given the regularization in Eq. (7), an ideal edge weight matrix should reproduce the true label of each held-out item while also satisfying label smoothness.
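To make the leave-one-out idea concrete, here is a hedged sketch for a single user: each item is hidden in turn, its label is predicted by label propagation from the remaining items, and the cross-entropy between the true and predicted label is accumulated. The function names, the iteration count, the clipping constant, and the seven-node toy graph are all assumptions introduced for illustration, not the authors’ implementation.

```python
import math


def predict_held_out(adj, item_labels, held_out, num_iters=100, init=0.5):
    """Label propagation with one item's label hidden; returns its predicted label."""
    n = len(adj)
    known = {i: float(y) for i, y in item_labels.items() if i != held_out}
    labels = [known.get(i, init) for i in range(n)]
    for _ in range(num_iters):
        propagated = []
        for i in range(n):
            degree = sum(adj[i])
            avg = sum(w * l for w, l in zip(adj[i], labels)) / degree if degree > 0 else labels[i]
            propagated.append(avg)
        labels = propagated
        for i, y in known.items():
            labels[i] = y  # reset every item except the held-out one
    return labels[held_out]


def cross_entropy(y_true, y_pred, eps=1e-12):
    y_pred = min(max(y_pred, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y_true * math.log(y_pred) + (1.0 - y_true) * math.log(1.0 - y_pred))


def ls_regularizer(adj, item_labels, num_iters=100):
    """Sum of leave-one-out cross-entropy losses over all items of one user."""
    return sum(
        cross_entropy(float(y), predict_held_out(adj, item_labels, v, num_iters))
        for v, y in item_labels.items()
    )


def toy_graph(w_same, w_cross):
    """Items 0, 1 (liked) share entity 4; items 2, 3 (not liked) share entity 5;
    entity 6 links a liked item (1) to a non-liked item (2)."""
    edges = [(0, 4, w_same), (1, 4, w_same), (2, 5, w_same), (3, 5, w_same),
             (1, 6, w_cross), (2, 6, w_cross)]
    adj = [[0.0] * 7 for _ in range(7)]
    for i, j, w in edges:
        adj[i][j] = adj[j][i] = w
    return adj


labels_u = {0: 1, 1: 1, 2: 0, 3: 0}
loss_smooth = ls_regularizer(toy_graph(2.0, 0.2), labels_u)  # strong same-label edges
loss_rough = ls_regularizer(toy_graph(0.2, 2.0), labels_u)   # strong cross-label edges
print(loss_smooth < loss_rough)
```

In this toy setting, edge weights that emphasize connections between items with the same label yield a lower regularization value than weights that emphasize the connection between a liked and a non-liked item, which is the direction in which the regularizer is meant to push the learned edge weights.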
Combining KGCN and LS regularization, we reach the following complete loss function for KGCN-LS:

$$\min_{\mathbf{W}, \mathbf{A}} \mathcal{L} = \min_{\mathbf{W}, \mathbf{A}} \sum_{u, v} J(y_{uv}, \hat{y}_{uv}) + \lambda R(\mathbf{A}) + \gamma \|\mathcal{F}\|_2^2, \tag{8}$$

where $\|\mathcal{F}\|_2^2$ is the L2-regularizer for the whole model, and $\lambda$ and $\gamma$ are balancing hyper-parameters. In Eq. (8), the first loss term corresponds to KGCN, which can be seen as feature propagation on the KG, while the second loss term corresponds to LS regularization, which can be seen as label propagation on the KG. A recommender algorithm is actually a mapping from features to labels: $\mathcal{F}: \mathbf{E} \rightarrow \mathbf{Y}$. Therefore, Eq. (8) utilizes the structural information of the KG on both the domain and the range of the mapping $\mathcal{F}$ to capture users’ higher-order preferences and to facilitate better generalization. Another benefit of our loss function in Eq. (8) is that the propagation of features and the propagation of labels can be unified on graphs, which will be demonstrated more clearly in the next subsection.
Lastly, we use Figure 2(e) to show how the label smoothness assumption helps regularize the learning of edge weights, using the physical model introduced in Section 4.2. Suppose we hold out the positive sample in the upper left and intend to reproduce its label from the rest of the samples. Since the true label of the held-out sample is 1 and the upper right sample has the largest label value, the LS regularization term would enforce the edges with arrows to be large so that the label can “flow” from the blue one to the striped one as much as possible. As a result, this will tighten the arrowed rubber bands and encourage the model to pull up the two upper pink samples to a greater extent.
4.4. Minibatch Implementation
Directly calculating and optimizing the complete loss in Eq. (8) requires operating on the entire KG, which may be infeasible in practice due to limited memory resources. In this subsection, we present a minibatch implementation of the KGCN-LS algorithm.
In a real-world KG, the number of neighbors of an entity may vary significantly over the KG. To keep the computational pattern of each minibatch fixed and more efficient, we uniformly sample a fixed-size set of neighbors for each entity instead of using its full set of neighbors. Specifically, we define a neighborhood sampling mapping $\mathcal{S}: e \rightarrow \{e' \mid e' \in \mathcal{N}(e)\}$ with $|\mathcal{S}(e)| = S$, where $\mathcal{N}(e)$ is the set of neighbors of entity $e$ and $S$ is a configurable constant. (Technically, $\mathcal{S}(e)$ may contain duplicates if $|\mathcal{N}(e)| < S$.) In KGCN-LS, $\mathcal{S}(e)$ is also called the (single-layer) support set of entity $e$, as the final representation of $e$ is dependent on these entities. Figure 1 gives an illustrative example of a two-layer support set (green nodes) for a given entity (blue node), where $S$ is set as 2. Another thing to notice is that the number of unobserved negative interactions in $\mathbf{Y}$ far exceeds the number of positive ones. To make computation more efficient, we use a negative sampling strategy during training (see Appendix C for details).
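The fixed-size neighborhood sampling and the layer-by-layer support set can be sketched as follows. This is an illustration under assumptions: the KG is given as a hypothetical adjacency dictionary, sampling falls back to drawing with replacement when an entity has fewer neighbors than requested (which is where duplicates arise), and all names are placeholders.

```python
import random


def sample_neighbors(neighbors, num_samples, rng):
    """Uniformly draw exactly num_samples neighbors.

    Draws without replacement when enough neighbors exist; otherwise draws
    with replacement, so the result may contain duplicates.
    """
    if not neighbors:
        return []
    if len(neighbors) >= num_samples:
        return rng.sample(neighbors, num_samples)
    return [rng.choice(neighbors) for _ in range(num_samples)]


def support_set(kg, item, num_samples, num_layers, rng):
    """Layer-by-layer support set: layers[0] = [item], layers[h] = sampled
    neighbors of every entity in layers[h - 1]."""
    layers = [[item]]
    for _ in range(num_layers):
        next_layer = []
        for entity in layers[-1]:
            next_layer.extend(sample_neighbors(kg.get(entity, []), num_samples, rng))
        layers.append(next_layer)
    return layers


# Hypothetical undirected KG stored as adjacency lists.
kg = {
    "v": ["a", "b", "c"],
    "a": ["v"],
    "b": ["v", "c"],
    "c": ["v", "b"],
}
rng = random.Random(0)
layers = support_set(kg, "v", num_samples=2, num_layers=2, rng=rng)
print([len(layer) for layer in layers])
```

Because every entity contributes the same number of sampled neighbors, the size of each layer is known in advance, which keeps the shape of each minibatch fixed regardless of how skewed the true degree distribution is.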
The formal description of the minibatch implementation of KGCN-LS is presented in Algorithm 1. For a given user-item pair $(u, v)$ (line 2), we first calculate the support set for $v$ in an iterative layer-by-layer manner (lines 3, 18-24). In the initialization step, representations of all entities are initialized with input features (line 4), and labels are initialized with ground truth for items or a random value within $[0, 1]$ for non-item entities (line 5). (The initialization of labels on non-item entities does not affect the final converged result; see Appendix B for the proof. But in our algorithm, the label propagation is only repeated $L$ times and $L$ is usually not too large in practice; therefore, we carefully choose the initial values within $[0, 1]$ to avoid instability.) Note that item $v$ is held out and treated as unlabeled. Then the propagation is repeated $L$ times (line 6), in which lines 7-8 describe feature propagation and lines 9-12 describe label propagation. (In feature propagation (line 8), the edge weights are normalized slightly differently from Eq. (2), for ease of implementation.) The final representation of the item is denoted as $\mathbf{v}_u$ (line 13), which is fed into $f$ together with the user representation $\mathbf{u}$ to predict the engagement probability (line 14). For the purpose of training, we calculate the complete loss (line 15) and update all parameters by gradient descent (line 16).
Computational Complexity. Suppose the number of sampled user-item interactions and the dimension of all embeddings are and , respectively. In each training iteration of Algorithm 1, the complexity of constructing support set for one item is . The complexity of initialization is ; The complexity of feature propagation and label propagation in layer is and , respectively. Therefore, the overall complexity of each iteration of KGCN-LS is . This time complexity mainly depends on and , and fortunately we show that their optimums are typically not too large ( for and for ) by later experiments.
The overall architecture of KGCN-LS is also illustrated in Figure 1. It is worth noting, from both Algorithm 1 and the right two subfigures in Figure 1, that the scheme of feature propagation is very similar to the scheme of label propagation in minibatch implementation of KGCN-LS. This interestingly demonstrates that propagation of features and labels can be unified perfectly, and label smoothness is a natural choice of regularization on knowledge graph convolutional networks.
In this section, we evaluate KGCN-LS and present its performance on four real-world scenarios: movie, book, music, and restaurant recommendations.
5.1. Datasets and Baselines
We utilize the following four datasets in our experiments for movie, book, music, and restaurant recommendation, respectively, in which the first three are public datasets and the last one is from Meituan-Dianping Group. We use Satori (https://searchengineland.com/library/bing/bing-satori), a commercial KG built by Microsoft, to construct sub-KGs for the MovieLens-20M, Book-Crossing, and Last.FM datasets. The KG for the Dianping-Food dataset is constructed by the internal toolkit of Meituan-Dianping Group. Further details of the datasets are provided in Appendix C.
MovieLens-20M (https://grouplens.org/datasets/movielens/) is a widely used benchmark dataset in movie recommendations, which consists of approximately 20 million explicit ratings (ranging from 1 to 5) on the MovieLens website. The corresponding KG contains 102,569 entities, 499,474 edges and 32 relation-types.
Book-Crossing (http://www2.informatik.uni-freiburg.de/~cziegler/BX/) contains 1 million ratings (ranging from 0 to 10) of books in the Book-Crossing community. The corresponding KG contains 25,787 entities, 60,787 edges and 18 relation-types.
Last.FM (https://grouplens.org/datasets/hetrec-2011/) contains musician listening information from a set of 2 thousand users of the Last.fm online music system. The corresponding KG contains 9,366 entities, 15,518 edges and 60 relation-types.
Dianping-Food is provided by Dianping.com (https://www.dianping.com/) and contains over 10 million interactions (including clicking, buying, and adding to favorites) between approximately 2 million users and 1 thousand restaurants. The corresponding KG contains 28,115 entities, 160,519 edges and 7 relation-types.
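Table: top-$K$ recommendation (four values per dataset, in the original column order).
| Model | MovieLens-20M | Book-Crossing | Last.FM | Dianping-Food |
| LibFM + TransE | 0.041, 0.125, 0.280, 0.396 | 0.037, 0.064, 0.097, 0.130 | 0.032, 0.102, 0.259, 0.326 | 0.044, 0.161, 0.343, 0.455 |
Table: CTR prediction.
| Model | MovieLens-20M | Book-Crossing | Last.FM | Dianping-Food |
| LibFM + TransE | 0.966 | 0.698 | 0.777 | 0.839 |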
We compare the proposed KGCN-LS model with the following baselines for recommender systems, in which the first two baselines are KG-free while the rest are all KG-aware methods. Hyper-parameter settings for baselines and our method are introduced in Appendix D and E, respectively.
SVD (Koren, 2008) is a classic CF-based model using inner product to model user-item interactions.
LibFM (Rendle, 2012) is a widely used feature-based factorization model for CTR prediction. We concatenate user ID and item ID as input for LibFM.
LibFM + TransE extends LibFM by attaching an entity representation learned by TransE (Bordes et al., 2013) to each user-item pair.
PER (Yu et al., 2014) is a representative of path-based methods, which treats the KG as heterogeneous information networks and extracts meta-path based features to represent the connectivity between users and items.
CKE (Zhang et al., 2016) is a representative of embedding-based methods, which combines CF with structural, textual, and visual knowledge in a unified framework. We implement CKE as CF plus a structural knowledge module in this paper.
RippleNet (Wang et al., 2018b) is a representative of hybrid methods, which is a memory-network-like approach that propagates users’ preferences on the KG for recommendation.
5.2. Validating the Connection between $\mathcal{G}$ and $\mathbf{Y}$
To validate the connection between the knowledge graph $\mathcal{G}$ and the user-item interaction matrix $\mathbf{Y}$, we conduct an empirical study where we investigate the correlation between the shortest distance of two randomly sampled items in the KG and whether they have common rater(s) in the dataset. For MovieLens-20M and Last.FM, we randomly sample ten thousand item pairs that have no common raters and ten thousand that have at least one common rater, then compute the distribution of their shortest distances in the KG. The results are presented in Figure 3, which clearly show that if two items have common rater(s) in the dataset, they are likely to be closer in the KG. For example, if two movies have common rater(s) in MovieLens-20M, there is a probability of that they will be within 2 hops in the KG, while the probability is if they have no common rater. This finding empirically demonstrates that exploiting the proximity structure of the KG can assist in making recommendations. This also justifies our motivation to use the support set and label smoothness to help learn entity representations.
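The measurement behind this study can be sketched with a breadth-first search over the KG. The sketch below assumes the KG is available as a hypothetical undirected adjacency dictionary and that raters are stored as a set of user IDs per item; the entity names and data are invented for illustration and do not come from the datasets.

```python
from collections import deque


def shortest_distance(kg, source, target):
    """Number of hops between two entities in an undirected KG, or None if disconnected."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in kg.get(node, ()):
            if neighbor == target:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def have_common_rater(raters, item_a, item_b):
    """True if at least one user has interacted with both items."""
    return bool(raters.get(item_a, set()) & raters.get(item_b, set()))


# Hypothetical KG: movies m1, m2 share a director d1; m3 is further away.
kg = {
    "m1": ["d1"],
    "d1": ["m1", "m2", "a1"],
    "m2": ["d1"],
    "a1": ["d1", "g1"],
    "g1": ["a1", "m3"],
    "m3": ["g1"],
}
raters = {"m1": {"u1", "u2"}, "m2": {"u2"}, "m3": {"u3"}}

for a, b in [("m1", "m2"), ("m1", "m3")]:
    print(a, b, have_common_rater(raters, a, b), shortest_distance(kg, a, b))
```

Repeating this over many sampled item pairs and grouping the distances by whether the pair has a common rater would give the two distributions that the study compares.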
5.3.1. Comparison with Baselines
We evaluate our method in two experiment scenarios: (1) In top-$K$ recommendation, we use the trained model to select $K$ items with highest predicted click probability for each user in the test set, and choose to evaluate the recommended sets. (2) In click-through rate (CTR) prediction, we apply the trained model to predict each interaction in the test set. We use as the evaluation metric in CTR prediction.
The results of top-$K$ recommendation and CTR prediction are presented in the two results tables, respectively. We have the following observations:
In general, we find that the improvements of KGCN-LS over the baselines are larger on Book-Crossing and Last.FM than on MovieLens-20M and Dianping-Food. This is probably because MovieLens-20M and Dianping-Food are much denser and therefore easier to model.
The performance of the KG-free baselines, SVD and LibFM, is actually better than that of the two KG-aware baselines PER and CKE, which indicates that PER and CKE cannot make full use of the KG with manually designed meta-paths and TransR-like regularization.
LibFM + TransE is better than LibFM in most cases, which demonstrates that the introduction of KG is helpful for recommendation in general.
PER performs worst among all baselines, since it is hard to define optimal meta-paths in reality.
RippleNet shows strong performance compared with other baselines. Note that RippleNet also uses multi-hop neighborhood structure, which indicates that capturing proximity information in the KG is essential for recommendation.
The last two rows of the top-$K$ and CTR results tables summarize the performance of KGCN-LS and its variant KGCN-avg, in which neighborhood representations are directly averaged without relation scores (i.e., treating the KG as unweighted). From the results we find that:
- KGCN-LS outperforms baselines by a significant margin on average in each of the MovieLens-20M, Book-Crossing, Last.FM, and Dianping-Food datasets.
- KGCN-avg performs worse than KGCN-LS, especially in Book-Crossing and Last.FM, where interactions are sparse. This demonstrates that capturing users’ personalized preferences and the semantics of the KG does benefit the recommendation.
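The difference between plain averaging (as in KGCN-avg) and user-specific relation scoring can be illustrated with a small sketch. It assumes the relation score is the inner product of a user embedding and a relation embedding, normalized with a softmax over the neighbors; the softmax, the vectors, and the function names are illustrative assumptions, not the released implementation.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def average_neighbors(neighbor_vecs):
    """KGCN-avg style: every neighbor contributes equally (unweighted KG)."""
    n = len(neighbor_vecs)
    return [sum(col) / n for col in zip(*neighbor_vecs)]


def weighted_neighbors(user_vec, relation_vecs, neighbor_vecs):
    """Weight each neighbor by a softmax over user-relation inner products."""
    scores = [dot(user_vec, r) for r in relation_vecs]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(neighbor_vecs[0])
    agg = [sum(w * v[i] for w, v in zip(weights, neighbor_vecs)) for i in range(dim)]
    return weights, agg


# Hypothetical embeddings: one user, two relations, two neighbor entities.
user = [1.0, 0.0]
relations = [[2.0, 0.0], [0.0, 2.0]]
neighbors = [[1.0, 0.0], [0.0, 1.0]]

avg = average_neighbors(neighbors)
weights, agg = weighted_neighbors(user, relations, neighbors)
print(avg, weights, agg)
```

For this user the first relation receives the larger weight, so the aggregated vector leans toward the first neighbor, whereas averaging treats both neighbors identically for every user.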
We also show the daily performance of KGCN-LS and baselines on Dianping-Food to investigate model stability. Figure 6 shows their scores from November 1, 2018 to November 30, 2018. The curve of KGCN-LS is consistently above those of the baselines over the test period; moreover, the performance of KGCN-LS has low variance, which suggests that KGCN-LS is also robust and stable in practice.
5.3.2. Effectiveness of LS Regularization
Is the proposed LS regularization helpful in improving the performance of KGCN? To study the effectiveness of LS regularization, we fix the dimension of embeddings at three different values, then vary the weight of the LS regularization term to see how performance changes. The results on the Last.FM dataset are plotted in Figure 6. It is clear that the performance of KGCN-LS with a non-zero LS weight is superior to the case where the weight is zero, which justifies our claim that LS regularization can assist in learning the edge weights in a KG and achieve better generalization in recommender systems. Note, however, that too large a weight is less attractive, since the LS term then overwhelms the overall loss and misleads the direction of gradients. According to the experiment results, a moderate non-zero weight is preferable in most practical cases.
5.3.3. Results in Sparse Scenarios
One major goal of using KGs in recommender systems is to alleviate the sparsity issue and the cold start problem. To investigate the performance of KGCN-LS in sparse scenarios, we vary the fraction of the MovieLens-20M training set that is used (while the validation and test sets are kept fixed), and report the results in Table 3. With a reduced training set, performance decreases for all six baselines compared to the models trained on the full training data, but the decrease for KGCN-LS is smaller. This demonstrates that KGCN-LS still maintains predictive performance even when user-item interactions are sparse.
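The sparsity protocol amounts to subsampling only the training interactions while leaving the other splits untouched. A minimal sketch follows; the function name, the seed handling, and the toy data are hypothetical.

```python
import random


def subsample_training_set(train, ratio, seed=0):
    """Keep a random fraction of the training interactions; other splits are untouched."""
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    rng = random.Random(seed)
    keep = max(1, round(len(train) * ratio))
    return rng.sample(train, keep)


# Hypothetical interactions: (user, item, label) triples.
train = [("u%d" % (i % 10), "i%d" % i, i % 2) for i in range(100)]
validation = [("u1", "i200", 1)]  # kept fixed across all ratios
test = [("u2", "i300", 0)]        # kept fixed across all ratios

sparse_train = subsample_training_set(train, ratio=0.2)
print(len(sparse_train), len(validation), len(test))
```

Fixing the seed keeps the subsample reproducible, so that all methods are compared on the same reduced training set.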
5.3.4. Hyper-parameter Sensitivity
We first vary the number of sampled neighbors per entity to investigate how effectively the KG is used. The results on the four datasets are presented in Table 4, from which we find that KGCN-LS achieves the best performance with an intermediate number of sampled neighbors. This is because too small a sample does not have enough capacity to encode neighborhood information, while too large a sample is prone to introducing noise and incurs heavier computation overhead.
We also investigate the influence of the depth of the support set in KGCN-LS by varying it from 1 to 4. The results are shown in Table 5, which demonstrates that KGCN-LS is more sensitive to the depth than to the number of sampled neighbors. We observe serious model collapse at larger depths, since a larger depth mixes too many entity embeddings into the computation of a given entity, which over-smooths the representation learning on graphs. This is also in accordance with our intuition, since an overly long relation chain makes little sense when inferring inter-item similarities. A small depth is enough for most cases according to the experiment results.
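The interaction between the neighbor sample size and the depth can be seen in a small sketch of how a fixed-size support set might be built. The sampling rule (without replacement when an entity has enough neighbors, with replacement otherwise) and all names are assumptions made for illustration.

```python
import random


def sample_neighbors(adj, entity, size, rng):
    """Return exactly `size` neighbors of `entity`."""
    neighbors = adj[entity]
    if len(neighbors) >= size:
        return rng.sample(neighbors, size)
    # Assumed fallback: sample with replacement when there are too few neighbors.
    return [rng.choice(neighbors) for _ in range(size)]


def support_set(adj, item, size, depth, seed=0):
    """Layer h holds the entities sampled h hops away from the item."""
    rng = random.Random(seed)
    layers = [[item]]
    for _ in range(depth):
        nxt = []
        for entity in layers[-1]:
            nxt.extend(sample_neighbors(adj, entity, size, rng))
        layers.append(nxt)
    return layers


# Hypothetical toy KG as adjacency lists.
adj = {"i": ["a", "b", "c"], "a": ["i"], "b": ["i", "c"], "c": ["i", "b"]}

layers = support_set(adj, "i", size=2, depth=3)
print([len(layer) for layer in layers])
```

The layer sizes grow as size to the power of the hop index, which is why the depth has the stronger effect on how many entities are mixed into one representation. The sizes depend only on the two hyper-parameters, not on how large the KG is.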
5.4. Run Time Analysis
We also investigate the scalability of our method with respect to the size of the KG. We run experiments on a Microsoft Azure virtual machine with 1 NVIDIA Tesla M60 GPU, 12 Intel Xeon CPUs (E5-2690 v3 @2.60GHz), and 128GB of RAM. The size of the KG is increased by up to five times the original one by extracting more triples from Satori, and the running times of all methods on MovieLens-20M are reported in Figure 6. Note that the trend of a curve matters more than the absolute values, since the values largely depend on the minibatch size and the number of epochs (although we did try to align the configurations of all methods). We find that the running times of LibFM + TransE and CKE grow linearly with respect to KG size, while PER and RippleNet show an approximately exponential pattern. By contrast, KGCN-LS exhibits strong scalability even when the KG is large. This is because the support set of each entity is a fixed-size sample, which is resistant to the expansion of the KG.
6. Conclusion and Future Work
This paper proposes knowledge graph convolutional networks with label smoothness regularization for recommender systems. KGCN-LS extends GCNs to KGs by biasedly aggregating neighborhood information, which is able to learn both structure and semantic information of the KG as well as users’ personalized interests. The proposed LS regularization and leave-one-out loss provide strong additional guidance for the learning process. We also implement KGCN-LS in a scalable minibatch fashion. Through extensive experiments, KGCN-LS is shown to consistently outperform state-of-the-art baselines in four recommendation scenarios, and achieve desirable scalability with respect to KG size.
In this paper, LS regularization is proposed for the recommendation task on KGs. It would be interesting to examine the LS assumption on other graph tasks such as link prediction and node classification. Investigating the theoretical relationship between feature propagation and label propagation is also a promising direction.
References
- Baluja et al. (2008) Shumeet Baluja, Rohan Seth, D Sivakumar, Yushi Jing, Jay Yagnik, Shankar Kumar, Deepak Ravichandran, and Mohamed Aly. 2008. Video suggestion and discovery for YouTube: taking random walks through the view graph. In Proceedings of the 17th International Conference on World Wide Web. ACM, 895–904.
- Bordes et al. (2013) Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems. 2787–2795.
- Bruna et al. (2014) Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann Lecun. 2014. Spectral networks and locally connected networks on graphs. In the 2nd International Conference on Learning Representations.
- Defferrard et al. (2016) Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems. 3844–3852.
- Duvenaud et al. (2015) David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. 2015. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems. 2224–2232.
- Hamilton et al. (2017) Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems. 1024–1034.
- Hu et al. (2018) Binbin Hu, Chuan Shi, Wayne Xin Zhao, and Philip S Yu. 2018. Leveraging meta-path based context for top-n recommendation with a neural co-attention model. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 1531–1540.
- Huang et al. (2018) Jin Huang, Wayne Xin Zhao, Hongjian Dou, Ji-Rong Wen, and Edward Y Chang. 2018. Improving Sequential Recommendation with Knowledge-Enhanced Memory Networks. In the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. ACM, 505–514.
- Karasuyama and Mamitsuka (2013) Masayuki Karasuyama and Hiroshi Mamitsuka. 2013. Manifold-based similarity adaptation for label propagation. In Advances in Neural Information Processing Systems. 1547–1555.
- Kipf and Welling (2017) Thomas N Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In the 5th International Conference on Learning Representations.
- Koren (2008) Yehuda Koren. 2008. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 426–434.
- Koren et al. (2009) Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 8 (2009), 30–37.
- Monti et al. (2017) Federico Monti, Michael Bronstein, and Xavier Bresson. 2017. Geometric matrix completion with recurrent multi-graph neural networks. In Advances in Neural Information Processing Systems. 3697–3707.
- Niepert et al. (2016) Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. 2016. Learning convolutional neural networks for graphs. In International Conference on Machine Learning. 2014–2023.
- Rendle (2012) Steffen Rendle. 2012. Factorization machines with libFM. ACM Transactions on Intelligent Systems and Technology (TIST) 3, 3 (2012), 57.
- Schlichtkrull et al. (2018) Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. 2018. Modeling relational data with graph convolutional networks. In European Semantic Web Conference. Springer, 593–607.
- Sun et al. (2018) Zhu Sun, Jie Yang, Jie Zhang, Alessandro Bozzon, Long-Kai Huang, and Chi Xu. 2018. Recurrent knowledge graph embedding for effective recommendation. In Proceedings of the 12th ACM Conference on Recommender Systems. ACM, 297–305.
- van den Berg et al. (2017) Rianne van den Berg, Thomas N Kipf, and Max Welling. 2017. Graph Convolutional Matrix Completion. arXiv preprint arXiv:1706.02263 (2017).
- Velickovic et al. (2018) Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2018. Graph attention networks. In Proceedings of the 6th International Conference on Learning Representations.
- Wang and Zhang (2008) Fei Wang and Changshui Zhang. 2008. Label propagation through linear neighborhoods. IEEE Transactions on Knowledge and Data Engineering 20, 1 (2008), 55–67.
- Wang et al. (2017b) Hongwei Wang, Jia Wang, Miao Zhao, Jiannong Cao, and Minyi Guo. 2017b. Joint topic-semantic-aware social recommendation for online voting. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. ACM, 347–356.
- Wang et al. (2018a) Hongwei Wang, Fuzheng Zhang, Min Hou, Xing Xie, Minyi Guo, and Qi Liu. 2018a. Shine: Signed heterogeneous information network embedding for sentiment link prediction. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. ACM, 592–600.
- Wang et al. (2018b) Hongwei Wang, Fuzheng Zhang, Jialin Wang, Miao Zhao, Wenjie Li, Xing Xie, and Minyi Guo. 2018b. RippleNet: Propagating User Preferences on the Knowledge Graph for Recommender Systems. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management. ACM, 417–426.
- Wang et al. (2019a) Hongwei Wang, Fuzheng Zhang, Jialin Wang, Miao Zhao, Wenjie Li, Xing Xie, and Minyi Guo. 2019a. Exploring High-Order User Preference on the Knowledge Graph for Recommender Systems. ACM Transactions on Information Systems (TOIS) 37, 3 (2019), 32.
- Wang et al. (2018c) Hongwei Wang, Fuzheng Zhang, Xing Xie, and Minyi Guo. 2018c. DKN: Deep Knowledge-Aware Network for News Recommendation. In Proceedings of the 2018 World Wide Web Conference on World Wide Web. 1835–1844.
- Wang et al. (2019b) Hongwei Wang, Fuzheng Zhang, Miao Zhao, Wenjie Li, Xing Xie, and Minyi Guo. 2019b. Multi-Task Feature Learning for Knowledge Graph Enhanced Recommendation. In Proceedings of the 2019 World Wide Web Conference on World Wide Web.
- Wang et al. (2019c) Hongwei Wang, Miao Zhao, Xing Xie, Wenjie Li, and Minyi Guo. 2019c. Knowledge graph convolutional networks for recommender systems. In Proceedings of the 2019 World Wide Web Conference on World Wide Web.
- Wang et al. (2017a) Quan Wang, Zhendong Mao, Bin Wang, and Li Guo. 2017a. Knowledge graph embedding: A survey of approaches and applications. IEEE Transactions on Knowledge and Data Engineering 29, 12 (2017), 2724–2743.
- Wu et al. (2018) Yuexin Wu, Hanxiao Liu, and Yiming Yang. 2018. Graph Convolutional Matrix Completion for Bipartite Edge Prediction. In the 10th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management.
- Yang et al. (2017) Xiwang Yang, Chao Liang, Miao Zhao, Hongwei Wang, Hao Ding, Yong Liu, Yang Li, and Junlin Zhang. 2017. Collaborative filtering-based recommendation of online social voting. IEEE Transactions on Computational Social Systems 4, 1 (2017), 1–13.
- Ying et al. (2018) Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L Hamilton, and Jure Leskovec. 2018. Graph Convolutional Neural Networks for Web-Scale Recommender Systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 974–983.
- Yu et al. (2014) Xiao Yu, Xiang Ren, Yizhou Sun, Quanquan Gu, Bradley Sturt, Urvashi Khandelwal, Brandon Norick, and Jiawei Han. 2014. Personalized entity recommendation: A heterogeneous information network approach. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining. ACM, 283–292.
- Zhang et al. (2016) Fuzheng Zhang, Nicholas Jing Yuan, Defu Lian, Xing Xie, and Wei-Ying Ma. 2016. Collaborative knowledge base embedding for recommender systems. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 353–362.
- Zhang and Lee (2007) Xinhua Zhang and Wee S Lee. 2007. Hyperparameter learning for graph based semi-supervised learning algorithms. In Advances in Neural Information Processing Systems. 1585–1592.
- Zhao et al. (2017) Huan Zhao, Quanming Yao, Jianda Li, Yangqiu Song, and Dik Lun Lee. 2017. Meta-graph based recommendation fusion over heterogeneous information networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 635–644.
- Zhou et al. (2004) Dengyong Zhou, Olivier Bousquet, Thomas N Lal, Jason Weston, and Bernhard Schölkopf. 2004. Learning with local and global consistency. In Advances in Neural Information Processing Systems. 321–328.
- Zhu et al. (2003) Xiaojin Zhu, Zoubin Ghahramani, and John D Lafferty. 2003. Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the 20th International Conference on Machine Learning. 912–919.
A Proof of Theorem 1
Denote the minimum-energy label function as . Note that the value of is fixed on . Therefore, the value of on should satisfy , which implies that . ∎
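The step from minimum energy to the harmonic property can be written out explicitly. The following is a sketch under assumed notation, since the symbols are not reproduced here: $l_u$ is the label function for user $u$, $A_u$ is a symmetric user-specific adjacency matrix over entities, $D_u$ is its diagonal degree matrix, and the energy is taken to be the usual quadratic smoothness energy.

```latex
% Assumed quadratic energy over all ordered entity pairs (A_u symmetric).
E(l_u, A_u) = \frac{1}{2} \sum_{i \neq j} A_u[i,j] \, \bigl( l_u(e_i) - l_u(e_j) \bigr)^2

% For an unlabeled entity e_i, stationarity at the minimizer l_u^* gives
\frac{\partial E}{\partial l_u(e_i)} \bigg|_{l_u = l_u^*}
  = 2 \sum_{j \neq i} A_u[i,j] \, \bigl( l_u^*(e_i) - l_u^*(e_j) \bigr) = 0

% Rearranging, with D_u[i,i] = \sum_{j \neq i} A_u[i,j]:
l_u^*(e_i) = \frac{1}{D_u[i,i]} \sum_{j \neq i} A_u[i,j] \, l_u^*(e_j)
```

Under these assumptions, the label of each unlabeled entity equals the weighted average of its neighbors' labels, which is the harmonic property.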
B Proof of Theorem 2
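Let $l_u = \begin{bmatrix} l_u(\mathcal{V}) \\ l_u(\mathcal{E} \setminus \mathcal{V}) \end{bmatrix}$. Since $l_u(\mathcal{V})$ is fixed on $\mathbf{Y}[u, \mathcal{V}]$, we are solely interested in $l_u(\mathcal{E} \setminus \mathcal{V})$. We denote $P = D_u^{-1} A_u$ (the subscript $u$ is omitted from $P$ for ease of notation), and partition matrix $P$ into sub-matrices according to the partition of $l_u$:
$$P = \begin{bmatrix} P_{VV} & P_{VE} \\ P_{EV} & P_{EE} \end{bmatrix}.$$
Then the label propagation scheme is equivalent to
$$l_u(\mathcal{E} \setminus \mathcal{V}) \leftarrow P_{EV} \, \mathbf{Y}[u, \mathcal{V}]^\top + P_{EE} \, l_u(\mathcal{E} \setminus \mathcal{V}).$$
Repeating the above procedure, we have
$$l_u(\mathcal{E} \setminus \mathcal{V}) = \lim_{n \to \infty} \left[ (P_{EE})^n \, l_u^{(0)}(\mathcal{E} \setminus \mathcal{V}) + \Bigl( \sum_{i=1}^{n} (P_{EE})^{i-1} \Bigr) P_{EV} \, \mathbf{Y}[u, \mathcal{V}]^\top \right],$$
where $l_u^{(0)}(\mathcal{E} \setminus \mathcal{V})$ is the initial value for $l_u(\mathcal{E} \setminus \mathcal{V})$. Now we show that $\lim_{n \to \infty} (P_{EE})^n \, l_u^{(0)}(\mathcal{E} \setminus \mathcal{V}) = \mathbf{0}$. Since $P$ is row-normalized and $P_{EE}$ is a sub-matrix of $P$, we have
$$\exists \, \epsilon < 1, \quad \sum_j P_{EE}[i,j] \le \epsilon$$
for all possible row indices $i$. Therefore,
$$\sum_j (P_{EE})^n[i,j] = \sum_j \sum_k (P_{EE})^{n-1}[i,k] \, P_{EE}[k,j] = \sum_k (P_{EE})^{n-1}[i,k] \sum_j P_{EE}[k,j] \le \epsilon \sum_k (P_{EE})^{n-1}[i,k] \le \cdots \le \epsilon^n.$$
As $n$ goes to infinity, the row sum of $(P_{EE})^n$ converges to zero, which implies that $(P_{EE})^n \, l_u^{(0)}(\mathcal{E} \setminus \mathcal{V}) \to \mathbf{0}$. It is clear that the choice of the initial value $l_u^{(0)}(\mathcal{E} \setminus \mathcal{V})$ does not affect the convergence.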
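The convergence argument can be checked numerically on a toy graph. The sketch below is illustrative only: the chain graph, the edge weights, and the function name are hypothetical. It repeatedly replaces each unlabeled value with the weighted average of its neighbors while keeping the labeled values fixed.

```python
def propagate(adj, labels, unlabeled, init, steps=200):
    """Label propagation: labeled nodes stay fixed, unlabeled nodes take the
    degree-normalized weighted average of their neighbors at every step."""
    values = dict(labels)
    values.update({e: init for e in unlabeled})
    for _ in range(steps):
        new = {}
        for e in unlabeled:
            degree = sum(adj[e].values())
            new[e] = sum(w * values[j] for j, w in adj[e].items()) / degree
        values.update(new)  # synchronous update of all unlabeled nodes
    return values


# Hypothetical chain v1 - e1 - e2 - v2 with unit edge weights.
adj = {
    "e1": {"v1": 1.0, "e2": 1.0},
    "e2": {"e1": 1.0, "v2": 1.0},
}
labels = {"v1": 1.0, "v2": 0.0}  # labeled (fixed) nodes
unlabeled = ["e1", "e2"]

from_zero = propagate(adj, labels, unlabeled, init=0.0)
from_five = propagate(adj, labels, unlabeled, init=5.0)
print(from_zero["e1"], from_zero["e2"])
print(from_five["e1"], from_five["e2"])
```

Both runs settle on the harmonic solution of this chain, 2/3 for `e1` and 1/3 for `e2`, regardless of the initial value.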
C Additional Details on Datasets
The MovieLens-20M, Book-Crossing, and Last.FM datasets contain explicit feedback data (Last.FM provides the listening count as a weight for each user-item interaction). Therefore, we transform them into implicit feedback, where each entry is marked with 1 to indicate that the user has rated the item positively. The threshold for a positive rating is 4 for MovieLens-20M, while no threshold is set for Book-Crossing and Last.FM due to their sparsity. Additionally, for each user we randomly sample a set of unwatched items, equal in number to his/her positively rated ones, and mark them as 0.
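The conversion to implicit feedback can be sketched as follows. This is an illustration under stated assumptions rather than the released preprocessing script: a rating counts as positive when it is greater than or equal to the threshold, "unwatched" is taken to mean items the user has not rated at all, and the data layout and function name are hypothetical.

```python
import random


def to_implicit(ratings, all_items, threshold=None, seed=0):
    """ratings: user -> {item: rating}. Returns (user, item, label) samples."""
    rng = random.Random(seed)
    samples = []
    for user, rated in ratings.items():
        # With no threshold, every recorded interaction counts as positive.
        positives = [i for i, r in rated.items() if threshold is None or r >= threshold]
        # Negatives come only from items the user never interacted with.
        candidates = sorted(set(all_items) - set(rated))
        negatives = rng.sample(candidates, min(len(positives), len(candidates)))
        samples += [(user, i, 1) for i in positives]
        samples += [(user, i, 0) for i in negatives]
    return samples


# Hypothetical explicit ratings on a 1-5 scale.
ratings = {"u1": {"a": 5, "b": 3, "c": 4}}
all_items = ["a", "b", "c", "d", "e", "f", "g"]

samples = to_implicit(ratings, all_items, threshold=4)
print(samples)
```

Passing `threshold=None` corresponds to the datasets for which no threshold is set.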
We use Microsoft Satori to construct the KGs for the MovieLens-20M, Book-Crossing, and Last.FM datasets. In a triple in the Satori KG, the head and tail are either IDs or textual content, and the relation has the form “domain.head_category.tail_category” (e.g., “book.book.author”). We first select the subset of triples from the whole Satori KG with a confidence level greater than 0.9. Given this sub-KG, we collect the Satori IDs of all valid movies/books/musicians by matching their names with the tail of triples (head, film.film.name, tail), (head, book.book.title, tail), or (head, type.object.name, tail) for the three datasets, respectively. Items with multiple matched entities or no matched entity are excluded for simplicity. After obtaining the set of item IDs, we match these item IDs with the heads of all triples in the Satori sub-KG, and select all well-matched triples as the final KG for each dataset.
The Dianping-Food dataset is collected from Dianping.com, a Chinese group-buying website that hosts consumer reviews of restaurants, similar to Yelp. We select approximately 10 million interactions between users and restaurants on Dianping.com from May 1, 2015 to December 12, 2018. The types of positive interactions include clicking, buying, and adding to favorites, and we sample negative interactions for each user. The KG for Dianping-Food is collected from Meituan Brain, an internal knowledge graph built for dining and entertainment by Meituan-Dianping Group. The types of entities include POI (restaurant), city, first-level and second-level category, star, business area, dish, and tag; the types of relations correspond to the types of entities (e.g., “organization.POI.has_dish”).
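The KG construction steps for the Satori-based datasets can be sketched on toy data. The tuple layout `(head, relation, tail, confidence)`, the function name, and the sample triples are hypothetical; Satori's actual data format is not described here.

```python
def build_item_kg(triples, item_names, name_relation, min_confidence=0.9):
    """Filter by confidence, link item names to entity IDs, keep item-headed triples."""
    confident = [t for t in triples if t[3] > min_confidence]

    # Match item names against the tails of name triples.
    matches = {}
    for head, relation, tail, _ in confident:
        if relation == name_relation and tail in item_names:
            matches.setdefault(tail, set()).add(head)

    # Exclude items with multiple matched entities (unmatched ones never appear).
    item_ids = {next(iter(ids)) for ids in matches.values() if len(ids) == 1}

    # Keep every confident triple whose head is a matched item.
    kg = [(h, r, t) for h, r, t, _ in confident if h in item_ids]
    return item_ids, kg


# Hypothetical triples in the style of the book domain.
triples = [
    ("e1", "book.book.title", "Dune", 0.95),
    ("e1", "book.book.author", "Frank Herbert", 0.99),
    ("e1", "book.book.genre", "sci-fi", 0.80),      # dropped: low confidence
    ("e2", "book.book.title", "Emma", 0.95),
    ("e3", "book.book.title", "Emma", 0.97),        # ambiguous name match
    ("e2", "book.book.author", "Jane Austen", 0.99),
    ("e4", "book.book.title", "Ulysses", 0.50),     # dropped: low confidence
]
item_names = {"Dune", "Emma", "Ulysses"}

item_ids, kg = build_item_kg(triples, item_names, "book.book.title")
print(item_ids)
print(kg)
```

In this toy run only "Dune" survives: "Emma" matches two entities and "Ulysses" has no confident match, so both are excluded.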
The basic statistics for the four datasets are shown in Table 6.
| | MovieLens-20M | Book-Crossing | Last.FM | Dianping-Food |
|---|---|---|---|---|
| # KG triples | 499,474 | 60,787 | 15,518 | 160,519 |
D Additional Details on Baselines
The hyper-parameter settings for baselines are as follows:
For SVD, we use the unbiased version (i.e., the predicted rating is modeled as ). The dimension and learning rate for the four datasets are set as: , for MovieLens-20M, Book-Crossing; , for Last.FM; , for Dianping-Food.
For PER, we use manually designed “user-item-attribute-item” meta-paths, i.e., “user-movie-director-movie”, “user-movie-genre-movie”, and “user-movie-star-movie” for MovieLens-20M; “user-book-author-book” and “user-book-genre-book” for Book-Crossing; “user-musician-date_of_birth-musician” (date of birth is discretized), “user-musician-country-musician”, and “user-musician-genre-musician” for Last.FM; and “user-restaurant-dish-restaurant”, “user-restaurant-business_area-restaurant”, and “user-restaurant-tag-restaurant” for Dianping-Food. The settings of dimension and learning rate are the same as for SVD.
For CKE, we only introduce the important settings here since it has nearly 20 hyper-parameters. The dimension of embedding for the four datasets are , , , . The training weight for KG part is for all datasets. The learning rates are the same as in SVD.
For RippleNet, , , , , for MovieLens-20M; , , , , for Last.FM; , , , , for Dianping-Food. The code of RippleNet is from https://github.com/hwwang55/RippleNet.
Other hyper-parameters that are not mentioned are the same as reported in the original papers or as the defaults in their code. We implement SVD, PER, and CKE in TensorFlow since the authors have not open-sourced their code.
E Additional Details on Experiment Setup
In KGCN-LS, we set functions and as inner product, as ReLU for non-last-layers and for the last-layer (because we do not want the final embeddings to have only non-negative entries). Other hyper-parameter settings are given in Table 7, which are determined by optimizing on a validation set. The search spaces for hyper-parameters are as follows:
Batch size for MovieLens-20M and Dianping-Food; batch size for Book-Crossing and Last.FM.
For each dataset, the data is split into training, evaluation, and test sets at a fixed ratio. Each experiment is repeated multiple times, and the average performance is reported. All trainable parameters are optimized by the Adam algorithm. The code of KGCN-LS is implemented with Python 3.6, TensorFlow 1.12.0, and NumPy 1.14.3.