Estimating Node Importance in Knowledge Graphs Using
Graph Neural Networks
Abstract.
How can we estimate the importance of nodes in a knowledge graph (KG)? A KG is a multi-relational graph that has proven valuable for many tasks including question answering and semantic search. In this paper, we present GENI, a method for tackling the problem of estimating node importance in KGs, which enables several downstream applications such as item recommendation and resource allocation. While a number of approaches have been developed to address this problem for general graphs, they do not fully utilize information available in KGs, or lack the flexibility needed to model complex relationships between entities and their importance. To address these limitations, we explore supervised machine learning algorithms. In particular, building upon recent advances in graph neural networks (GNNs), we develop GENI, a GNN-based method designed to deal with distinctive challenges involved with predicting node importance in KGs. Our method performs an aggregation of importance scores, instead of aggregating node embeddings, via a predicate-aware attention mechanism and flexible centrality adjustment. In our evaluation of GENI and existing methods on predicting node importance in real-world KGs with different characteristics, GENI achieves 5–17% higher NDCG@100 than the state of the art.
1. Introduction
Knowledge graphs (KGs) such as Freebase (Bollacker et al., 2008), YAGO (Suchanek et al., 2007), and DBpedia (Lehmann et al., 2015) have proven highly valuable resources for many applications including question answering (Dong et al., 2015), recommendation (Zhang et al., 2016), semantic search (Barbosa et al., 2013), and knowledge completion (West et al., 2014). A KG is a multi-relational graph where nodes correspond to entities, and edges correspond to relations between the two connected entities. An edge in a KG represents a fact stored in the form of “subject predicate object” (e.g., “Tim Robbins starred-in The Shawshank Redemption”). KGs are different from traditional graphs that have only a single relation; KGs normally consist of multiple, different relations that encode heterogeneous information, as illustrated by the example movie KG in Figure 1.
Given a KG, estimating the importance of each node is a crucial task that enables a number of applications such as recommendation, query disambiguation, and resource allocation optimization. For example, consider a situation where a customer issues a voice query “Tell me what Genie is” to a voice assistant backed by a KG. If the KG contains several entities with such a name, the assistant could use their estimated importance to figure out which one to describe. Furthermore, many KGs are large-scale, often containing millions to billions of entities for which the knowledge needs to be enriched or updated to reflect the current state. As validating information in KGs requires a lot of resources due to their size and complexity, node importance can be used to guide the system to allocate limited resources to entities of high importance.
How can we estimate the importance of nodes in a KG? In this paper, we focus on the setting where we are given importance scores of some nodes in a KG. An importance score is a value that represents the significance or popularity of a node in the KG. For example, the number of pageviews of a Wikipedia page can be used as an importance score of the corresponding entity in a KG since important nodes tend to attract a lot of attention and search traffic. Then given a KG, how can we predict node importance by making use of importance scores known for some nodes along with auxiliary information in KGs such as edge types (predicates)?
In the past, several approaches have been developed for node importance estimation. PageRank (PR) (Page et al., 1999) is an early work on this problem that revolutionized the field of Web search. However, PR scores are based only on the graph structure, and are unaware of importance scores available for some nodes. Personalized PageRank (PPR) (Haveliwala, 2002) dealt with this limitation by letting users provide their own notion of node importance in a graph. PPR, however, does not take edge types into account. HAR (Li et al., 2012) extends ideas used by PR and PPR to distinguish between different predicates in KGs while being aware of importance scores and graph topology. Still, we observe that there is much room for improvement, as evidenced by the performance of existing methods on real-world KGs in Figure 2. So far, existing techniques have approached this problem in a non-trainable framework that is based on a fixed model structure determined by their prior assumptions on the propagation of node importance, and involve no learnable parameters that are optimized based on the ground truth.
In this paper, we explore a new family of solutions for the task of predicting node importance in KGs, namely, regularized supervised machine learning algorithms. Our goal is to develop a more flexible supervised approach that learns from ground truth, and makes use of additional information in KGs. Among several supervised algorithms we explore, we focus on graph neural networks (GNNs). Recently, GNNs have received increasing interest, and achieved state-of-the-art performance on node and graph classification tasks across data drawn from several domains (Kipf and Welling, 2016; Defferrard et al., 2016; Hamilton et al., 2017; Ying et al., 2018; Velickovic et al., 2018). Designed to learn from graph-structured data, and based on a neighborhood aggregation framework, GNNs have the potential to make further improvements over earlier approaches. However, existing GNNs have focused on graph representation learning via embedding aggregation, and have not been designed to tackle the challenges that arise with supervised estimation of node importance in KGs. Challenges include modeling the relationship between the importance of neighboring nodes, accurate estimation that generalizes across different types of entities, and incorporating prior assumptions on node importance that aid model prediction, which are not addressed at the same time by existing supervised techniques.
We present GENI, a GNN for Estimating Node Importance in KGs. GENI applies an attentive GNN with predicate-aware score aggregation to capture relations between the importance of nodes and their neighbors. GENI also allows flexible score adjustment according to node centrality, which captures the connectivity of a node in terms of graph topology. Our main contributions are as follows.

- We explore regularized supervised machine learning algorithms for estimating node importance in KGs, as opposed to the non-trainable solutions to which existing approaches belong.

- We present GENI, a GNN-based method designed to address the challenges involved with supervised estimation of node importance in KGs.

- We provide empirical evidence and an analysis of GENI using real-world KGs. Figure 2 shows that GENI outperforms the state of the art, achieving 5–17% higher NDCG@100 on real KGs.
2. Preliminaries
2.1. Problem Definition
A knowledge graph (KG) $G = (V, E)$ is a graph that represents multi-relational data, where nodes $V$ and edges $E$ correspond to entities and their relationships, respectively; the edge set is partitioned into $E = \{E_1, \dots, E_P\}$, where $P$ is the number of types of edges (predicates), and $E_p$ denotes the set of edges of type $p$. In KGs, there are often many types of predicates (i.e., $P \gg 1$) between nodes of possibly different types (e.g., movie, actor, and director nodes), whereas in traditional graphs, nodes are connected by just one type of edge (i.e., $P = 1$).
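As an illustration, a KG of this form can be stored as a list of (subject, predicate, object) triples and partitioned by predicate; the entities and predicates below are illustrative examples in the spirit of Figure 1, sketched in Python:

```python
from collections import defaultdict

# Illustrative (subject, predicate, object) triples, loosely following Figure 1.
triples = [
    ("Tim Robbins", "starred-in", "The Shawshank Redemption"),
    ("Frank Darabont", "directed", "The Shawshank Redemption"),
    ("The Shawshank Redemption", "has-genre", "Drama"),
]

# Partition the edge set E into {E_1, ..., E_P}, one edge set per predicate.
edges_by_predicate = defaultdict(list)
for subj, pred, obj in triples:
    edges_by_predicate[pred].append((subj, obj))

P = len(edges_by_predicate)  # number of predicates (P = 1 for a traditional graph)
```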
An importance score is a non-negative real number that represents the significance or popularity of a node. For example, the total gross of a movie can be used as an importance score for a movie KG, and the number of pageviews of an entity can be used in a more generic KG such as Freebase (Bollacker et al., 2008). We assume a single set of importance scores, so that the scores are comparable with each other in reflecting importance.
We now define the node importance estimation problem.
Definition 2.1 (Node Importance Estimation).
Given a KG and importance scores for a subset of its nodes, learn a function that estimates the importance score of every node in the KG.
Figure 1 shows an example KG on movies and related entities with importance scores given in advance for some movies. We approach the importance estimation problem by developing a supervised framework learning a function that maps any node in KG to its score, such that the estimation reflects its true importance as closely as possible.
Note that even when importance scores are provided for only one type of nodes (e.g., movies), we aim to do estimation for all types of nodes (e.g., directors, actors, etc.).
Definition 2.2 (InDomain and OutOfDomain Estimation).
Given importance scores for some nodes of type $T$ (e.g., movies), predicting the importance of nodes of type $T$ is called an “in-domain” estimation, and importance estimation for those nodes whose type is not $T$ is called an “out-of-domain” estimation.
As available importance scores are often limited in terms of numbers and types, developing a method that generalizes well for both classes of estimation is an important challenge for supervised node importance estimation.
| | GENI | HAR (Li et al., 2012) | PPR (Haveliwala, 2002) | PR (Page et al., 1999) |
| --- | --- | --- | --- | --- |
| Neighborhood | ✓ | ✓ | ✓ | ✓ |
| Predicate | ✓ | ✓ | | |
| Centrality | ✓ | ✓ | ✓ | ✓ |
| Input Score | ✓ | ✓ | ✓ | |
| Flexibility | ✓ | | | |
2.2. Desiderata for Modeling Node Importance in KGs
Based on our discussion of prior approaches (PR, PPR, and HAR), we present the desiderata that have guided the development of our method for tackling the node importance estimation problem. Table 1 summarizes GENI and existing methods in terms of these desiderata.
Neighborhood Awareness. In a graph, a node is connected to other nodes, except for the special case of isolated nodes. As neighboring entities interact with each other, and they tend to share common characteristics (network homophily), neighborhoods should be taken into account when node importance is modeled.
Making Use of Predicates. KGs consist of multiple types of predicates. Under the assumption that different predicates could play a different role in determining node importance, models should make predictions using information from predicates.
Centrality Awareness. Without any other information, it is reasonable to assume that highly central nodes are more important than less central ones. Therefore, scores need to be estimated in consideration of node centrality, capturing connectivity of a node.
Utilizing Input Importance Scores. In addition to graph topology, input importance scores provide valuable information to infer relationships between nodes and their importance. Thus, models should tap into both the graph structure and input scores for more accurate prediction.
Flexible Adaptation. Our assumptions regarding node importance, such as the one on centrality, may not conform to the real distribution of input scores over a KG. Also, we do not limit models to a specific type of input scores; models may be provided with input scores that possess different characteristics. It is thus critical that a model can flexibly adapt to the importance that the input scores reflect.
2.3. Graph Neural Networks
In this section, we present a generic definition of graph neural networks (GNNs). GNNs are mainly based on a neighborhood aggregation architecture (Kipf and Welling, 2016; Hamilton et al., 2017; Gilmer et al., 2017; Ying et al., 2018; Velickovic et al., 2018). In a GNN with $L$ layers, its $\ell$-th layer ($\ell = 1, \dots, L$) receives a feature vector $h_j^{\ell-1}$ for each node $j$ from the $(\ell-1)$-th layer (where $h_j^{0}$ is an input node feature $z_j$), and updates it by aggregating the feature vectors from the neighborhood $N(i)$ of node $i$, possibly using a different weight for each neighbor $j$. As updated feature vectors become the input to the $(\ell+1)$-th layer, the repeated aggregation procedure through $L$ layers in principle captures $L$-th order neighbors in learning a node’s representation. This process of learning the representation $h_i^{\ell}$ of node $i$ by the $\ell$-th layer is commonly expressed as (Hamilton et al., 2017; Ying et al., 2018; Xu et al., 2018):

(1) $h_{N(i)}^{\ell} = \text{Aggregate}^{\ell}\left(\left\{ h_j^{\ell-1} \mid j \in N(i) \right\}\right)$

(2) $h_i^{\ell} = \text{Transform}^{\ell}\left(\text{Combine}^{\ell}\left(h_{N(i)}^{\ell},\, h_i^{\ell-1}\right)\right)$

where Aggregate is an aggregation function defined by the model (e.g., an averaging or max-pooling operation); Transform is a model-specific function that performs a (non-linear) transformation of node embeddings via parameters in the $\ell$-th layer shared by all nodes (e.g., multiplication with a shared weight matrix followed by some nonlinearity $\sigma$); Combine is a function that merges the aggregated neighborhood representation with the node’s representation (e.g., concatenation).
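As a concrete (if simplified) sketch, the three functions can be instantiated as mean aggregation, concatenation, and a shared linear map followed by a ReLU; the function below illustrates the generic layer under these assumptions, and is not any specific published model:

```python
import numpy as np

def gnn_layer(H, neighbors, W):
    """One generic GNN layer: Aggregate = mean over N(i),
    Combine = concatenation, Transform = ReLU of a shared linear map."""
    n, d = H.shape
    out = np.zeros((n, W.shape[1]))
    for i in range(n):
        nbrs = neighbors.get(i, [])
        # Aggregate: average the neighbors' previous-layer embeddings
        h_nbr = H[nbrs].mean(axis=0) if nbrs else np.zeros(d)
        # Combine: merge the neighborhood summary with the node's own embedding
        combined = np.concatenate([h_nbr, H[i]])
        # Transform: shared weight matrix followed by a nonlinearity
        out[i] = np.maximum(0.0, combined @ W)
    return out
```

Stacking $L$ such layers lets node $i$'s representation depend on its $L$-th order neighborhood.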
3. Method
Effective estimation of node importance in KGs involves addressing the requirements presented in Section 2.2. As a supervised learning method, the GNN framework naturally allows us to utilize input importance scores to train a model with flexible adaptation. Its propagation mechanism also allows us to be neighborhood aware. In this section, we present GENI, which further enhances the model in three ways.

- Neighborhood Importance Awareness: A GNN normally propagates information between neighbors through node embeddings. This models the assumption that an entity and its neighbors affect each other, so the representation of an entity can be better expressed in terms of the representations of its neighbors. In the context of node importance estimation, neighboring importance scores play a major role in the importance of a node, whereas other neighboring features may have little effect, if any. We thus directly aggregate importance scores from neighbors (Section 3.1), and show empirically that this outperforms embedding propagation (Section 4.4).

- Making Use of Predicates: We design a predicate-aware attention mechanism that models how predicates affect the importance of connected entities (Section 3.2).

- Centrality Awareness: We apply centrality adjustment to incorporate node centrality into the estimation (Section 3.3).
An overview of GENI is provided in Figure 3. In Sections 3.1, 3.2, and 3.3, we describe the three main enhancements using the basic building blocks of GENI shown in Figure 3(a). Then we discuss an extension to a general architecture in Section 3.4. Table 2 provides the definition of the symbols used in this paper.
3.1. Score Aggregation
To directly model the relationship between the importance of neighboring nodes, we propose a score aggregation framework, rather than embedding aggregation. Specifically, in Equations 1 and 2, we replace the hidden embedding $h_i^{\ell}$ of node $i$ with its score estimation $s^{\ell}(i)$ and combine them as follows:

(3) $s^{\ell}(i) = \sum_{j \in N(i) \cup \{i\}} \alpha_{ij}^{\ell} \, s^{\ell-1}(j)$

where $N(i)$ denotes the neighbors of node $i$, which will be the set of first-order neighbors of node $i$ in our experiments. Here, $\alpha_{ij}^{\ell}$ is a learnable weight between nodes $i$ and $j$ for the $\ell$-th layer ($\ell = 1, \dots, L$). We train it via a shared attention mechanism, which is computed by a predefined model with shared parameters and predicate embeddings, as we explain soon. In other words, GENI computes the aggregated score by performing a weighted aggregation of intermediate scores from node $i$ and its neighbors. Note that GENI does not apply the Transform function after aggregation as in Equation 2, since GENI aggregates scores. Propagating scores instead of node embeddings has the additional benefit of reducing the number of model parameters.
To compute the initial estimation $s^{0}(i)$, GENI uses input node features. In the simplest case, they can be one-hot vectors that represent each node. More generally, they are real-valued vectors representing the nodes, which are extracted manually based on domain knowledge, or generated with methods for learning node embeddings. Let $z_i$ be the input feature vector of node $i$. Then GENI computes the initial score of $i$ as

(4) $s^{0}(i) = \text{ScoringNetwork}(z_i)$

where ScoringNetwork can be any neural network that takes in a node feature vector and returns an estimation of its importance. We used a simple fully-connected neural network for our experiments.
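To make the two pieces concrete, here is a minimal NumPy sketch of Equations 3 and 4; the two-layer scoring network and the dictionary-based attention weights are illustrative placeholders, not the trained components of GENI:

```python
import numpy as np

def scoring_network(z, W1, b1, w2, b2):
    """Initial estimation s^0(i) from a node feature vector z_i (Eq. 4):
    a small fully-connected network with one ReLU hidden layer."""
    h = np.maximum(0.0, z @ W1 + b1)
    return float(h @ w2 + b2)

def aggregate_scores(i, s_prev, alpha, neighbors):
    """Score aggregation over N(i) and i itself (Eq. 3):
    a weighted sum of previous-layer scores, with no Transform step."""
    return sum(alpha[(i, j)] * s_prev[j] for j in neighbors[i] + [i])
```

In GENI, the weights `alpha[(i, j)]` come from the predicate-aware attention mechanism of Section 3.2; here they are simply given.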
3.2. Predicate-Aware Attention Mechanism
Inspired by recent work that showcased successful applications of attention mechanisms, we employ a predicate-aware attention mechanism that attends over the neighbors’ intermediate scores.
Our attention considers two factors. First, we consider the predicate between the nodes, because different predicates can play different roles in score propagation. For example, even though a movie may be released in a popular (i.e., important) country, the movie itself may not be popular; on the other hand, a movie directed by a famous (i.e., important) director is more likely to be popular. Second, we consider the neighboring score itself in deciding the attention. A director who directed a few famous (i.e., important) movies is likely to be important; the fact that he also directed some not-so-famous movies in his life is less likely to make him unimportant.
GENI incorporates predicates into attention computation by using shared predicate embeddings; i.e., each predicate is represented by a feature vector of predefined length, and this representation is shared by nodes across all layers. Further, predicate embeddings are learned so as to maximize the predictive performance of the model in a flexible fashion. Note that in KGs, there could be multiple edges of different types between two nodes (e.g., see Figure 1). We use $p_{ij}^{m}$ to denote the predicate of the $m$-th edge between nodes $i$ and $j$, and $\phi(\cdot)$ to denote a mapping from a predicate to its embedding.
In GENI, we use a simple, shared self-attention mechanism, which is a single-layer feedforward neural network parameterized by a weight vector $a^{\ell}$. The relation between the intermediate scores of two nodes $i$ and $j$, and the role an in-between predicate plays, are captured by the attentional layer, which takes in the concatenation of all relevant information. Outputs from the attentional layer are first transformed by a nonlinearity $\sigma_a$, and then normalized via the softmax function. Formally, GENI computes the attention $\alpha_{ij}^{\ell}$ of node $i$ on node $j$ for the $\ell$-th layer as:

(5) $\alpha_{ij}^{\ell} = \dfrac{\exp\left( \sum_{m} \sigma_a\left( {a^{\ell}}^{\top} \left[ s^{\ell-1}(i) \,\|\, \phi(p_{ij}^{m}) \,\|\, s^{\ell-1}(j) \right] \right) \right)}{\sum_{k \in N(i) \cup \{i\}} \exp\left( \sum_{m} \sigma_a\left( {a^{\ell}}^{\top} \left[ s^{\ell-1}(i) \,\|\, \phi(p_{ik}^{m}) \,\|\, s^{\ell-1}(k) \right] \right) \right)}$

where $\sigma_a$ is a nonlinearity, $a^{\ell}$ is a weight vector for the $\ell$-th layer, and $\|$ is a concatenation operator.
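A small sketch of this computation, with `tanh` standing in for the nonlinearity $\sigma_a$ and a single edge per neighbor pair (the multi-edge sum over $m$ is omitted); all tensors here are toy placeholders:

```python
import numpy as np

def attention_weights(i, candidates, s_prev, pred_emb, a):
    """Predicate-aware attention of node i over N(i) plus i itself (cf. Eq. 5).

    candidates: list of (j, predicate) pairs; pred_emb: predicate -> embedding;
    a: shared weight vector of the single-layer attentional network.
    """
    logits = []
    for j, pred in candidates:
        # Concatenate [s(i) || phi(p_ij) || s(j)] and score it with a
        feat = np.concatenate([[s_prev[i]], pred_emb[pred], [s_prev[j]]])
        logits.append(np.tanh(a @ feat))
    logits = np.array(logits)
    e = np.exp(logits - logits.max())  # softmax over the candidate set
    return e / e.sum()
```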
3.3. Centrality Adjustment
Existing methods such as PR, PPR, and HAR make a common assumption that the importance of a node positively correlates with its centrality in the graph. In the context of KGs, it is also natural to assume that more central nodes would be more important than less central ones, unless the given importance scores present contradictory evidence. Making use of this prior knowledge becomes especially beneficial in cases where we are given a small number of importance scores compared to the total number of entities, and in cases where the importance scores are given for entities of a specific type out of the many types in KG.
Given that the in-degree $d(i)$ of node $i$ is a common proxy for its centrality and popularity, we define the initial centrality $c(i)$ of node $i$ to be

(6) $c(i) = \log\left(d(i) + \epsilon\right)$

where $\epsilon$ is a small positive constant.
While node centrality provides useful information on the importance of a node, strictly adhering to the node centrality could have a detrimental effect on model prediction. We need flexibility to account for the possible discrepancy between the node’s centrality in a given KG and the provided input importance score of the node. To this end, we use a scaled and shifted centrality $c^{*}(i)$ as our notion of node centrality:

(7) $c^{*}(i) = \gamma \cdot c(i) + \beta$

where $\gamma$ and $\beta$ are learnable parameters for scaling and shifting. As we show in Section 4.5, this flexibility allows better performance when in-degree is not the best proxy of centrality.
To compute the final score $s^{*}(i)$, we apply centrality adjustment to the score estimation $s^{L}(i)$ from the last layer, and apply a nonlinearity $\sigma$ as follows:

(8) $s^{*}(i) = \sigma\left( c^{*}(i) \cdot s^{L}(i) \right)$
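Equations 6 to 8 can be sketched in a few lines; `gamma` and `beta` below are stand-ins for the learned parameters, and a sigmoid is used as one possible choice of $\sigma$:

```python
import numpy as np

def centrality_adjusted_score(in_degree, s_last, gamma, beta, eps=1e-3):
    """Apply flexible centrality adjustment to the last-layer score (Eqs. 6-8)."""
    c = np.log(in_degree + eps)                    # initial centrality c(i), Eq. 6
    c_star = gamma * c + beta                      # scaled and shifted centrality, Eq. 7
    return 1.0 / (1.0 + np.exp(-c_star * s_last))  # s*(i), with sigmoid as sigma
```

With `gamma = 1` and `beta = 0` this reduces to the fixed centrality adjustment examined in Section 4.5.2.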
3.4. Model Architecture
The simple architecture depicted in Figure 3(a) consists of a scoring network and a single score aggregation (SA) layer (i.e., $L = 1$), followed by a centrality adjustment component. Figure 3(b) extends it to a more general architecture in two ways. First, we extend the framework to contain multiple SA layers; that is, $L \geq 1$. As a single SA layer aggregates the scores of direct neighbors, stacking multiple SA layers enables aggregating scores from a larger neighborhood. Second, we design each SA layer to contain a variable number of SA heads, which perform score aggregation and attention computation independently of each other. Empirically, we find using multiple SA heads to be helpful for the model performance and the stability of the optimization procedure (Section 4.5).
Let $h$ be an index of an SA head, and $H^{\ell}$ be the number of SA heads in the $\ell$-th layer. We define $s^{\ell-1}(i)$ to be node $i$’s score that is estimated by the $(\ell-1)$-th layer, and fed into each SA head $h$ in the $\ell$-th (i.e., the next) layer, which in turn produces an aggregation $s_h^{\ell}(i)$ of these scores:

(9) $s_h^{\ell}(i) = \sum_{j \in N(i) \cup \{i\}} \alpha_{h,ij}^{\ell} \, s^{\ell-1}(j)$

where $\alpha_{h,ij}^{\ell}$ is the attention coefficient between nodes $i$ and $j$ computed by SA head $h$ in layer $\ell$.
In the first SA layer, each SA head $h$ receives input scores from a separate scoring network $\text{ScoringNetwork}_h$, which provides the initial estimation of node importance. For the following layers, output from the previous SA layer becomes the input estimation. Since in the $\ell$-th SA layer, $H^{\ell}$ SA heads independently produce $H^{\ell}$ score estimations in total, we perform an aggregation of these scores by averaging, which is provided to the next layer. That is,

(10) $s^{\ell}(i) = \frac{1}{H^{\ell}} \sum_{h=1}^{H^{\ell}} s_h^{\ell}(i)$
Multiple SA heads in the $\ell$-th layer compute attention between neighboring nodes in the same way as in Equation 5, yet independently of each other using their own parameters $a_h^{\ell}$:

(11) $\alpha_{h,ij}^{\ell} = \dfrac{\exp\left( \sum_{m} \sigma_a\left( {a_h^{\ell}}^{\top} \left[ s^{\ell-1}(i) \,\|\, \phi(p_{ij}^{m}) \,\|\, s^{\ell-1}(j) \right] \right) \right)}{\sum_{k \in N(i) \cup \{i\}} \exp\left( \sum_{m} \sigma_a\left( {a_h^{\ell}}^{\top} \left[ s^{\ell-1}(i) \,\|\, \phi(p_{ik}^{m}) \,\|\, s^{\ell-1}(k) \right] \right) \right)}$
Centrality adjustment is applied to the output from the final SA layer. In order to enable independent scaling and shifting by each SA head, separate parameters $\gamma_h$ and $\beta_h$ are used for each head $h$. Then centrality adjustment by the $h$-th SA head in the final layer is:

(12) $c_h^{*}(i) = \gamma_h \cdot c(i) + \beta_h$
With $H^{L}$ SA heads in the final ($L$-th) layer, we perform an additional aggregation of the centrality-adjusted scores by averaging, and apply a nonlinearity $\sigma$, obtaining the final estimation $s^{*}(i)$:

(13) $s^{*}(i) = \sigma\left( \frac{1}{H^{L}} \sum_{h=1}^{H^{L}} c_h^{*}(i) \cdot s_h^{L}(i) \right)$
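Putting Equations 12 and 13 together for a single node, with a sigmoid again standing in for $\sigma$ and all inputs as toy values:

```python
import numpy as np

def final_score(head_scores, gammas, betas, c):
    """Average the centrality-adjusted per-head scores of the final layer
    (Eqs. 12-13), then apply a nonlinearity (sigmoid here)."""
    head_scores = np.asarray(head_scores, dtype=float)
    c_star = np.asarray(gammas) * c + np.asarray(betas)  # per-head c*_h(i), Eq. 12
    return 1.0 / (1.0 + np.exp(-(c_star * head_scores).mean()))
```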
3.5. Model Training
In order to predict node importance with input importance scores known for a subset $V_s \subseteq V$ of nodes, we train GENI using the mean squared error between the given importance score $s(i)$ and the model estimation $s^{*}(i)$ for each node $i \in V_s$; thus, the loss function is

(14) $\mathcal{L} = \frac{1}{|V_s|} \sum_{i \in V_s} \left( s(i) - s^{*}(i) \right)^2$
Note that ScoringNetwork is trained jointly with the rest of GENI. To avoid overfitting, we apply weight decay with an early stopping criterion based on the model performance on validation entities.
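A sketch of the loss and the early-stopping criterion described above; the patience constant is an illustrative assumption, not a value from the paper:

```python
import numpy as np

def mse_loss(known_scores, predictions):
    """Mean squared error over the nodes with known importance scores (Eq. 14)."""
    diffs = np.array([known_scores[i] - predictions[i] for i in known_scores])
    return float((diffs ** 2).mean())

def should_stop(val_losses, patience=10):
    """Early stopping: halt when the validation loss has not improved
    over the last `patience` epochs (patience is an illustrative choice)."""
    if len(val_losses) <= patience:
        return False
    return min(val_losses[-patience:]) >= min(val_losses[:-patience])
```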
4. Experiments
In this section, we aim to answer the following questions.

- How do GENI and baselines perform on real-world KGs with different characteristics? In particular, how well do methods perform in- and out-of-domain estimation (Definition 2.2)?

- How do the components of GENI, such as centrality adjustment, and different parameter values affect its estimation?
We describe datasets, baselines, and evaluation plans in Sections 4.1, 4.2, and 4.3, and answer the above questions in Sections 4.4 and 4.5.
4.1. Datasets
| Name | # Nodes | # Edges | # Predicates | # SCCs | Input Score Type | # Nodes w/ Scores | Data for OOD Evaluation |
| --- | --- | --- | --- | --- | --- | --- | --- |
| fb15k | 14,951 | 592,213 | 1,345 | 9 | # Pageviews | 14,108 (94%) | N/A |
| music10k | 24,830 | 71,846 | 10 | 130 | Song hotttnesss | 4,214 (17%) | Artist hotttnesss |
| tmdb5k | 123,906 | 532,058 | 22 | 15 | Movie popularity | 4,803 (4%) | Director ranking |
| imdb | 1,567,045 | 14,067,776 | 28 | 1 | # Votes for movies | 215,769 (14%) | Director ranking |
In our experiments, we use four real-world KGs with different characteristics. Here we introduce these KGs along with the importance scores used for in-domain and out-of-domain (OOD) evaluations (see Definition 2.2). Summaries of the datasets (such as the number of nodes, edges, and predicates) are given in Table 3. More details, such as data sources and how they are constructed, can be found in Appendix A.
fb15k is a subset of Freebase, which is a large collaborative knowledge base containing general facts, and has been widely used for research and practical applications (Bordes et al., 2013; Bollacker et al., 2008). fb15k has a much larger number of predicates and a higher density than other KGs we evaluated. For each entity, we use the number of pageviews for the corresponding Wikipedia page as its score. Note that we do not perform OOD evaluation for fb15k since importance scores for fb15k apply to all types of entities.
music10k is a music KG sampled from the Million Song Dataset (https://labrosa.ee.columbia.edu/millionsong/), which includes information about songs such as the primary artist and the album the song belongs to. The dataset provides two types of popularity scores called “song hotttnesss” and “artist hotttnesss”, computed by the Echo Nest platform by considering data from many sources such as mentions on the web, play counts, etc. (https://musicmachinery.com/tag/hotttnesss/). We use “song hotttnesss” as input importance scores, and “artist hotttnesss” for OOD performance evaluation.
tmdb5k is a movie KG derived from the TMDb 5000 movie dataset (https://www.kaggle.com/tmdb/tmdb-movie-metadata). It contains movies and related entities such as movie genres, companies, countries, crews, and casts. We use the “popularity” information for movies as importance scores, which is provided by the original dataset. For OOD evaluation, we use a ranking of the top-200 highest-grossing directors (https://www.the-numbers.com/box-office-star-records/worldwide/lifetime-specific-technical-role/director). Worldwide box office grosses given in the ranking are used as importance scores for directors.
imdb is a movie KG created from the public IMDb dataset, which includes information such as movies, genres, directors, casts, and crews. imdb is the largest KG among those we evaluate, with about 12.6 times as many nodes as tmdb5k (Table 3). The IMDb dataset provides the number of votes a movie received, which we use as importance scores. For OOD evaluation, we use the same director ranking used for tmdb5k.
4.2. Baselines
Methods for node importance estimation in KGs can be classified into two families of algorithms.
Non-Trainable Approaches. Previously developed methods mostly belong to this category. We evaluate the following methods: PageRank (PR) (Page et al., 1999), Personalized PageRank (PPR) (Haveliwala, 2002), and HAR (Li et al., 2012).
Supervised Approaches. We explore the performance of representative supervised algorithms on node importance estimation:

- Linear regression (LR): an ordinary least squares algorithm.

- Random forests (RF): a random forest regression model.

- Neural networks (NN): a fully-connected neural network.

- Graph attention networks (GAT) (Velickovic et al., 2018): a GNN model of the family reviewed in Section 2.3. We add a final layer that takes the node embedding and outputs the importance score of a node.
All these methods and GENI use the same data (node features and input importance scores). In our experiments, node features are generated using node2vec (Grover and Leskovec, 2016). Depending on the type of KG, other types of node features, such as a bag-of-words representation, can also be used. Note that the graph structure is explicitly used only by GAT, although other supervised baselines make an implicit use of it when node features encode graph structural information.
We will denote each method by the name in parentheses. Experimental settings for baselines and GENI are provided in Appendix B.
4.3. Performance Evaluation
We evaluate methods based on their in-domain and out-of-domain (OOD) performance. We performed 5-fold cross validation, and report the average and standard deviation of the following metrics on ranking quality and correlation: normalized discounted cumulative gain and Spearman correlation coefficient. Higher values are better for both metrics. We now provide their formal definitions.
Normalized discounted cumulative gain (NDCG) is a measure of ranking quality. Given a list of nodes ranked by predicted scores, and their graded relevance values (which are non-negative, real-valued ground truth scores in our setting), discounted cumulative gain at position $k$ (DCG@$k$) is defined as:

(15) $\text{DCG@}k = \sum_{p=1}^{k} \frac{rel_p}{\log_2(p+1)}$

where $rel_p$ denotes the graded relevance of the node at position $p$. Note that due to the logarithmic reduction factor, the gain of each node is penalized at lower ranks. Consider an ideal DCG at rank position $k$ (IDCG@$k$), which is obtained by an ideal ordering of nodes based on their relevance scores. Normalized DCG at position $k$ (NDCG@$k$) is then computed as:

(16) $\text{NDCG@}k = \frac{\text{DCG@}k}{\text{IDCG@}k}$
Our motivation for using NDCG@$k$ is to test the quality of the ranking for the top $k$ entities.
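For reference, NDCG@$k$ as used here can be computed as follows; this is a straightforward illustration of Equations 15 and 16, not the authors' evaluation code:

```python
import numpy as np

def ndcg_at_k(relevance, pred_scores, k):
    """NDCG@k: DCG of the predicted top-k ordering over the ideal DCG (Eqs. 15-16)."""
    rel = np.asarray(relevance, dtype=float)
    top = np.argsort(pred_scores)[::-1][:k]          # nodes ranked by predicted score
    discounts = np.log2(np.arange(2, len(top) + 2))  # log2(p + 1) for p = 1..k
    dcg = (rel[top] / discounts).sum()
    ideal = np.sort(rel)[::-1][:len(top)]            # ideal ordering by true relevance
    idcg = (ideal / discounts).sum()
    return dcg / idcg if idcg > 0 else 0.0
```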
Spearman correlation coefficient (Spearman) measures the rank correlation between the ground truth scores $x$ and predicted scores $y$; that is, the strength and direction of the monotonic relationship between the rank values of $x$ and $y$. Converting $x$ and $y$ into ranks $rg_x$ and $rg_y$, respectively, the Spearman correlation coefficient is computed as:

(17) $\rho = \dfrac{\sum_i \left( rg_{x_i} - \overline{rg_x} \right)\left( rg_{y_i} - \overline{rg_y} \right)}{\sqrt{\sum_i \left( rg_{x_i} - \overline{rg_x} \right)^2} \sqrt{\sum_i \left( rg_{y_i} - \overline{rg_y} \right)^2}}$

where $\overline{rg_x}$ and $\overline{rg_y}$ are the means of $rg_x$ and $rg_y$.
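Equivalently, Spearman is the Pearson correlation of the rank vectors; a minimal version (ignoring tie handling, which standard implementations resolve by averaging ranks) is:

```python
import numpy as np

def spearman(x, y):
    """Spearman correlation: Pearson correlation of rank values (Eq. 17).
    Ties are not averaged here, unlike full implementations."""
    def to_ranks(v):
        order = np.argsort(v)
        ranks = np.empty(len(v))
        ranks[order] = np.arange(1, len(v) + 1)
        return ranks
    rx = to_ranks(np.asarray(x))
    ry = to_ranks(np.asarray(y))
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx * ry).sum() / np.sqrt((rx ** 2).sum() * (ry ** 2).sum()))
```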
For in-domain evaluation, we use NDCG@100 and Spearman as they complement each other: NDCG@100 looks at the top-100 predictions, and Spearman considers the ranking of all entities with known scores. For NDCG, we also tried different cutoff thresholds and observed similar results. Note that we often have a small volume of data for OOD evaluation. For example, for tmdb5k and imdb, we used a ranking of 200 directors with known scores, while tmdb5k and imdb have 2,578 and 287,739 directors, respectively. Thus Spearman is not suitable for OOD evaluation, as it considers only the small number of entities in the ranking, and ignores all others, even if they are predicted to be highly important. For OOD evaluation, we therefore report NDCG@100 and NDCG@2000.
Additionally, we report regression performance in Section C.2.
4.4. Importance Estimation on Real-World Data
We evaluate GENI and baselines in terms of in-domain and out-of-domain (OOD) predictive performance.
4.4.1. In-Domain Prediction
Table 4 summarizes in-domain prediction performance. GENI outperforms all baselines on all four datasets in terms of both NDCG@100 and Spearman. It is noteworthy that supervised approaches generally perform in-domain prediction better than non-trainable ones, especially on fb15k and imdb, which are more complex and larger than the other two. This demonstrates the applicability of supervised models to our problem. On all KGs except music10k, GAT outperforms other supervised baselines, which use the same node features but do not explicitly take the graph network structure into account. This shows the benefit of directly utilizing network connectivity. By modeling the relation between the scores of neighboring entities, GENI achieves further performance improvement over GAT. Among non-trainable baselines, HAR often performs worse than PR and PPR, which suggests that considering predicates could hurt performance if predicate weight adjustment is not done properly.
4.4.2. Out-Of-Domain Prediction
Table 5 summarizes OOD prediction results. GENI achieves the best results for all KGs in terms of both NDCG@100 and NDCG@2000. In contrast to in-domain prediction, where supervised baselines generally outperform non-trainable ones, we observe that non-trainable methods achieve higher OOD results than supervised baselines on music10k and tmdb5k. In these KGs, only about 4,000 entities have known scores. Given scarce ground truth, non-trainable baselines could perform better by relying on a prior assumption on the propagation of node importance. Further, note that the difference between non-trainable and supervised baselines is more drastic on tmdb5k, where the proportion of nodes with scores is the smallest (4%). On the other hand, on imdb, which is our largest KG with the greatest number of ground truth scores, supervised baselines mostly outperform non-trainable methods. In particular, none of the top-100 directors in imdb predicted by PR and PPR belong to the ground truth director ranking. With 14% of nodes in imdb associated with known scores, supervised methods learn to generalize better for OOD prediction. Although neighborhood aware, GAT is not better than other supervised baselines. By applying centrality adjustment, GENI achieves superior performance to both classes of baselines regardless of the number of available known scores.
Table 5. Out-of-domain prediction performance (NDCG@100 and NDCG@2000) of PR, PPR, HAR, LR, RF, NN, GAT, and GENI on music10k, tmdb5k, and imdb.
4.5. Analysis of GENI
4.5.1. Effect of Considering Predicates
To see how the consideration of predicates affects model performance, we run GENI on fb15k, which has the largest number of predicates, and report NDCG@100 and Spearman when a single embedding is used for all predicates (denoted by “shared embedding”) vs. when each predicate uses its own embedding (denoted by “distinct embedding”). Note that using “shared embedding”, GENI loses the ability to distinguish between different predicates. In the results given in Table 6, we observe that NDCG@100 and Spearman are increased by 3.6% and 12.7%, respectively, when a dedicated embedding is used for each predicate. This shows that GENI successfully makes use of predicates for modeling the relation between node importance; this is especially crucial in KGs such as fb15k that consist of a large number of predicates.
Table 6. Performance of GENI on fb15k with a shared predicate embedding vs. distinct predicate embeddings (NDCG@100 and Spearman).
4.5.2. Flexibility for Centrality Adjustment
In Equation 7, we scale and shift the centrality term for flexible centrality adjustment (CA). Here we evaluate the model with fixed CA, in which the final estimation applies the centrality term without scaling and shifting. In Table 7, we report the performance of GENI on fb15k and tmdb5k obtained with fixed and flexible CA, with all other parameters identical. When node centrality strongly correlates with input scores, fixed CA obtains results similar to flexible CA. This is reflected in the results on tmdb5k, where PR and the log in-degree baseline (LID), which estimates node importance as the log of a node's in-degree, both produce estimates close to the input scores. On the other hand, when node centrality is not in good agreement with input scores, as demonstrated by the poor performance of PR and LID on fb15k, flexible CA performs much better than fixed CA (8% higher NDCG@100 and 27% higher Spearman on fb15k).
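Since Equation 7 is not reproduced here, the following sketch only illustrates the fixed-vs-flexible CA idea: centrality is approximated as log in-degree, and flexible CA introduces a learnable scale (gamma) and shift (beta). The exact functional form in GENI may differ (e.g., an additional nonlinearity).

```python
import math

def centrality(in_degree):
    # Node centrality as log in-degree (+1 offset to handle degree 0).
    return math.log(in_degree + 1)

def fixed_ca(score, in_degree):
    # Fixed CA: raw centrality directly modulates the estimated score.
    return score * centrality(in_degree)

def flexible_ca(score, in_degree, gamma, beta):
    """Flexible CA: centrality is scaled by gamma and shifted by beta
    before modulating the score, so the model can weaken (or invert)
    the centrality prior when it disagrees with the input scores."""
    return score * (gamma * centrality(in_degree) + beta)
```

With gamma = 1 and beta = 0, flexible CA degenerates to fixed CA; training moves gamma and beta away from this when the centrality prior is a poor fit, as on fb15k.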
4.5.3. Parameter Sensitivity
We evaluate the parameter sensitivity of GENI by measuring performance on fb15k, varying one of the following parameters while fixing the others to their default values (shown in parentheses): number of score aggregation (SA) layers (1), number of SA heads in each SA layer (1), dimension of the predicate embedding (10), and number of hidden layers in the scoring networks (1 layer with 48 units). Results presented in Figure 4 show that model performance tends to improve as we use more SA layers and SA heads. For example, Spearman increases from 0.72 to 0.77 as the number of SA heads is increased from 1 to 5. Using more hidden layers for the scoring networks also tends to boost performance, although exceptions are observed. Increasing the dimension of the predicate embedding beyond an appropriate value negatively affects model performance, although GENI still achieves high Spearman compared to the baselines.
5. Related Work
Node Importance Estimation. Many approaches have been developed for node importance estimation (Page et al., 1999; Haveliwala, 2002; Tong et al., 2008; Jung et al., 2017; Li et al., 2012; Kleinberg, 1999). PageRank (PR) (Page et al., 1999) is based on the random surfer model, where an imaginary surfer randomly moves to a neighboring node with probability d (the damping factor), or teleports to any other node uniformly at random with probability 1−d. PR predicts the node importance to be the limiting probability of the random surfer being at each node. Accordingly, PR scores are determined only by the graph structure, and are unaware of input importance scores. Personalized PageRank (PPR) (Haveliwala, 2002) deals with this limitation by biasing the random walk to teleport to a set of nodes relevant to some specific topic, or alternatively, to nodes with known importance scores. Random walk with restart (RWR) (Tong et al., 2008; Jung et al., 2017) is a closely related method that addresses a special case of PPR where teleporting is restricted to a single node. PPR and RWR, however, are not well suited for KGs since they do not consider edge types. To make better use of the rich information in KGs, HAR (Li et al., 2012) extends the random walk idea of PR and PPR to compute limiting probabilities over multi-relational data, distinguishing between different predicates in KGs while being aware of importance scores. All of these methods are non-trainable approaches with a fixed model structure that involve no model parameter optimization. In this paper, we explore supervised machine learning algorithms with a focus on graph neural networks.
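As a concrete illustration of PR and PPR as described above, here is a minimal power-iteration sketch; the adjacency-dict representation (every node must appear as a key) and the dangling-node handling are simplifications, not a reference implementation.

```python
def pagerank(adj, d=0.85, personalization=None, iters=100):
    """Power iteration for (personalized) PageRank on an adjacency dict
    {node: [out-neighbors]}. With `personalization` (a node->weight
    dict), teleports land only on the given nodes (PPR); otherwise
    uniformly on all nodes (PR)."""
    nodes = list(adj)
    n = len(nodes)
    teleport = personalization or {v: 1.0 / n for v in nodes}
    total = sum(teleport.values())
    teleport = {v: w / total for v, w in teleport.items()}
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        # Each node keeps (1-d) of the teleport mass...
        nxt = {v: (1 - d) * teleport.get(v, 0.0) for v in nodes}
        for v in nodes:
            out = adj[v]
            if out:
                # ...and spreads d of its rank evenly over out-edges.
                share = d * rank[v] / len(out)
                for u in out:
                    nxt[u] += share
            else:
                # Dangling node: redistribute its mass via teleport.
                for u in nodes:
                    nxt[u] += d * rank[v] * teleport.get(u, 0.0)
        rank = nxt
    return rank
```

Passing a personalization vector concentrated on nodes with known importance turns this into the PPR variant discussed above; edge types are ignored either way, which is exactly the limitation HAR addresses.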
Graph Neural Networks (GNNs). GNNs are a class of neural networks that learn from arbitrarily structured graph data. Many GNN formulations have been based on the notion of graph convolutions. The pioneering work of Bruna et al. (Bruna et al., 2014) defined the convolution operator in the Fourier domain, which involved performing the eigendecomposition of the graph Laplacian; as a result, its filters were not spatially localized and were computationally costly. A number of works followed to address these limitations. Henaff et al. (Henaff et al., 2015) introduced a localization of spectral filters via a spline parameterization. Defferrard et al. (Defferrard et al., 2016) designed more efficient, strictly localized convolutional filters. Kipf and Welling (Kipf and Welling, 2016) further simplified localized spectral convolutions via a first-order approximation. To reduce the computational footprint and improve performance, recent works explored different ways of neighborhood aggregation. One direction has been to restrict neighborhoods via sampling techniques such as uniform neighbor sampling (Hamilton et al., 2017), vertex importance sampling (Chen et al., 2018), and random walk-based neighbor importance sampling (Ying et al., 2018). Graph attention networks (GAT) (Velickovic et al., 2018), the method most closely related to ours, explores an orthogonal direction: assigning different importance to different neighbors by employing self-attention over neighbors (Vaswani et al., 2017). While GAT exhibited state-of-the-art results, it was applied only to node classification, and is unaware of predicates. Building upon these developments in GNNs, GENI tackles the challenges of node importance estimation in KGs, which have not been addressed by existing GNNs.
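The neighbor attention in GAT described above reduces to a softmax over per-neighbor logits; a minimal sketch (using a LeakyReLU activation on the logits, as GAT does) follows. The function names are illustrative, and how the logits themselves are produced from node features is omitted.

```python
import math

def leaky_relu(x, slope=0.2):
    # Negative slope of 0.2, the value GAT uses for attention logits.
    return x if x > 0 else slope * x

def attention_weights(raw_scores):
    """Numerically stable softmax over a node's neighbors of
    LeakyReLU-activated raw attention logits, yielding one weight
    per neighbor that sums to 1."""
    acts = [leaky_relu(s) for s in raw_scores]
    m = max(acts)
    exps = [math.exp(a - m) for a in acts]
    z = sum(exps)
    return [e / z for e in exps]
```

GAT aggregates neighbor *embeddings* with these weights; GENI instead aggregates importance *scores*, with logits that additionally depend on predicate embeddings.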
6. Conclusion
Estimating node importance in KGs is an important problem with many applications such as item recommendation and resource allocation. In this paper, we present GENI, a method that addresses this problem by utilizing the rich information available in KGs in a flexible manner, which is required to model the complex relations between entities and their importance. Our main ideas can be summarized as score aggregation via a predicate-aware attention mechanism and flexible centrality adjustment. Experimental results on predicting node importance in real-world KGs show that GENI outperforms existing approaches, achieving 5–17% higher NDCG@100 than the state of the art. For future work, we will consider multiple independent input sources for node importance.
References
 Barbosa et al. (2013) Denilson Barbosa, Haixun Wang, and Cong Yu. 2013. Shallow Information Extraction for the Knowledge Web. In ICDE. 1264–1267.
 Bollacker et al. (2008) Kurt D. Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In SIGMOD. 1247–1250.
 Bordes et al. (2013) Antoine Bordes, Nicolas Usunier, Alberto García-Durán, Jason Weston, and Oksana Yakhnenko. 2013. Translating Embeddings for Modeling Multi-relational Data. In NIPS. 2787–2795.
 Bruna et al. (2014) Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. 2014. Spectral Networks and Locally Connected Networks on Graphs. In ICLR.
 Chen et al. (2018) Jie Chen, Tengfei Ma, and Cao Xiao. 2018. FastGCN: Fast Learning with Graph Convolutional Networks via Importance Sampling. In ICLR.
 Defferrard et al. (2016) Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering. In NIPS. 3837–3845.
 Dong et al. (2015) Li Dong, Furu Wei, Ming Zhou, and Ke Xu. 2015. Question Answering over Freebase with Multi-Column Convolutional Neural Networks. In ACL. 260–269.
 Gilmer et al. (2017) Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. 2017. Neural Message Passing for Quantum Chemistry. In ICML. 1263–1272.
 Grover and Leskovec (2016) Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable Feature Learning for Networks. In KDD. 855–864.
 Hamilton et al. (2017) William L. Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive Representation Learning on Large Graphs. In NIPS. 1025–1035.
 Haveliwala (2002) Taher H. Haveliwala. 2002. Topic-sensitive PageRank. In WWW. 517–526.
 Henaff et al. (2015) Mikael Henaff, Joan Bruna, and Yann LeCun. 2015. Deep Convolutional Networks on Graph-Structured Data. CoRR abs/1506.05163 (2015).
 Jung et al. (2017) Jinhong Jung, Namyong Park, Lee Sael, and U. Kang. 2017. BePI: Fast and Memory-Efficient Method for Billion-Scale Random Walk with Restart. In SIGMOD.
 Kipf and Welling (2016) Thomas N. Kipf and Max Welling. 2016. Semi-Supervised Classification with Graph Convolutional Networks. CoRR abs/1609.02907 (2016).
 Kleinberg (1999) Jon M. Kleinberg. 1999. Authoritative Sources in a Hyperlinked Environment. J. ACM 46, 5 (1999), 604–632.
 Lehmann et al. (2015) Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N. Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick van Kleef, Sören Auer, and Christian Bizer. 2015. DBpedia: A large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web 6, 2 (2015), 167–195.
 Li et al. (2012) Xutao Li, Michael K. Ng, and Yunming Ye. 2012. HAR: Hub, Authority and Relevance Scores in Multi-Relational Data for Query Search. In SDM. 141–152.
 Page et al. (1999) Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. 1999. The PageRank citation ranking: Bringing order to the web. Technical Report. Stanford InfoLab.
 Suchanek et al. (2007) Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. Yago: A Core of Semantic Knowledge. In WWW. 697–706.
 Tong et al. (2008) Hanghang Tong, Christos Faloutsos, and Jia-Yu Pan. 2008. Random walk with restart: fast solutions and applications. Knowl. Inf. Syst. 14, 3 (2008), 327–346.
 Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In NIPS. 6000–6010.
 Velickovic et al. (2018) Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph Attention Networks. In ICLR.
 West et al. (2014) Robert West, Evgeniy Gabrilovich, Kevin Murphy, Shaohua Sun, Rahul Gupta, and Dekang Lin. 2014. Knowledge base completion via search-based question answering. In WWW. 515–526.
 Xu et al. (2018) Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and Stefanie Jegelka. 2018. Representation Learning on Graphs with Jumping Knowledge Networks. In ICML. 5449–5458.
 Ying et al. (2018) Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L. Hamilton, and Jure Leskovec. 2018. Graph Convolutional Neural Networks for Web-Scale Recommender Systems. In KDD. 974–983.
 Zhang et al. (2016) Fuzheng Zhang, Nicholas Jing Yuan, Defu Lian, Xing Xie, and Wei-Ying Ma. 2016. Collaborative Knowledge Base Embedding for Recommender Systems. In KDD.
In the appendix, we provide details on datasets, experimental settings, and additional experimental results, such as a case study on tmdb5k and a regression performance evaluation for in-domain predictions.
Appendix A Datasets
We perform evaluation using four real-world KGs with different characteristics. All KGs are constructed from public data sources, which we specify for each dataset. Summaries of these datasets (such as the number of nodes, edges, and predicates) are given in Table 3. Below, we provide details on the construction of each KG.
fb15k. We use a sample of Freebase (https://everest.hds.utc.fr/doku.php?id=en:smemlj12) used by (Bordes et al., 2013). The original dataset is divided into training, validation, and test sets; we combined them into a single dataset, and later divided it randomly into three sets according to our proportions for training, validation, and test data. To find the number of pageviews of a Wikipedia page, which is the importance score used for fb15k, we used the Freebase/Wikidata mapping (https://developers.google.com/freebase/). Most entities in fb15k can be mapped to a corresponding Wikidata page, from which we found the link to the item's English Wikipedia page, which provides information including the number of pageviews in the past 30 days.
music10k. We build music10k from a sample (https://think.cs.vt.edu/corgis/csv/music/music.html) of the Million Song Dataset (https://labrosa.ee.columbia.edu/millionsong/). This dataset is a collection of audio features and metadata for one million popular songs. Among other fields, it includes information about songs such as the primary artist and the album a song belongs to. We constructed music10k by adding nodes for these three entity types (songs, artists, and albums), and edges of the corresponding types between them as appropriate. Note that music10k is much more fragmented than the other datasets.
tmdb5k. We constructed tmdb5k from the TMDb 5000 movie dataset (https://www.kaggle.com/tmdb/tmdb-movie-metadata). This dataset contains movies and relevant information such as movie genres, companies, countries, crews, and casts in tabular form. We added nodes for each of these entities, and added edges between related entities with appropriate types. For instance, given that “Steven Spielberg” directed “Schindler’s List”, we added the corresponding director and movie nodes, and added an edge of type “directed” between them.
imdb. We created imdb from the public IMDb datasets (https://www.imdb.com/interfaces/). The IMDb datasets consist of several tables containing information such as titles, genres, directors, writers, and principal casts and crews. As with tmdb5k, we added nodes for these entities and connected them with edges of the corresponding types. In creating imdb, we focused on entities related to movies, and excluded entities that have no relation to movies. In addition, the IMDb datasets include the titles each person is known for; we added edges between a person and these titles to represent this special relationship.
Scores. For fb15k, tmdb5k, and imdb, we added 1 to the importance scores as an offset, and log-transformed them, as the scores were highly skewed. For music10k, the two types of provided scores were both between 0 and 1, and we used them without log transformation.
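The preprocessing described above amounts to a log1p transform:

```python
import math

def transform(score):
    # Offset by 1, then log-transform to reduce skew (applied to
    # fb15k, tmdb5k, and imdb; music10k scores are used as-is).
    return math.log(score + 1)
```

The +1 offset keeps a score of 0 at 0 and makes the transform well-defined for all non-negative scores.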
Appendix B Experimental Settings
B.1. Cross Validation and Early Stopping
We performed 5fold cross validation; i.e., for each fold, 80% of the ground truth scores were used for training, and the other 20% were used for testing. For methods based on neural networks, we applied early stopping by using 15% of the original training data for validation and the remaining 85% for training, with a patience of 50. That is, the training was stopped if the validation loss did not decrease for 50 consecutive epochs, and the model with the best validation performance was used for testing.
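The early-stopping rule above (patience of 50 on validation loss, keeping the best model) can be sketched framework-agnostically; `train_step` and `val_loss_fn` are hypothetical callables standing in for the actual training and validation code.

```python
def train_with_early_stopping(train_step, val_loss_fn, max_epochs=1000,
                              patience=50):
    """Stop when validation loss has not decreased for `patience`
    consecutive epochs; return the epoch (and loss) of the best
    validation performance, i.e., the model that would be kept."""
    best_loss, best_epoch, waited = float("inf"), -1, 0
    for epoch in range(max_epochs):
        train_step(epoch)
        loss = val_loss_fn(epoch)
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch, best_loss
```

In the actual pipeline, the model parameters at `best_epoch` (not the final ones) are restored for testing.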
B.2. Software
We used several open-source libraries, and used Python 3.6 for our implementation.
Graph Library. We used NetworkX 2.1 for graphs and graph algorithms: the MultiDiGraph class was used for all KGs, as there can be multiple edges of different types between two entities; NetworkX's pagerank_scipy function was used for PR and PPR.
Machine Learning Library. We chose TensorFlow 1.12 as our deep learning framework. We used scikit-learn 0.20.0 for other machine learning algorithms such as random forest and linear regression.
Other Libraries and Algorithms. For GAT, we used the reference TensorFlow implementation provided by the authors (https://github.com/PetarV-/GAT). We implemented HAR in Python 3.6 based on the algorithm description presented in (Li et al., 2012). For node2vec, we used the implementation available from the project page (https://snap.stanford.edu/node2vec/). NumPy 1.15 and SciPy 1.1.0 were used for data manipulation.
B.3. Hyperparameters and Configurations
PageRank (PR) and Personalized PageRank (PPR)
We used the default values for NetworkX’s pagerank_scipy function with 0.85 as a damping factor.
HAR (Li et al., 2012). We used 0.85 as the damping factor, as in PR and PPR. While the maximum number of iterations was set to 30, HAR usually converged in fewer than 30 iterations. HAR is designed to compute two types of importance scores: hub and authority. For the music10k, tmdb5k, and imdb KGs, these scores are identical since each edge in these graphs has a matching edge with an inverse predicate going in the opposite direction. Thus for these KGs, we only report authority scores. For fb15k, we compute both types of scores and report authority scores, as hub scores are slightly worse overall.
Linear Regression (LR) and Random Forests (RF). For both methods, we used default parameter values defined by scikitlearn.
Neural Networks (NN). We describe a 3-layer neural network by the number of neurons in its input, first hidden, second hidden, and output layers. For NN, we used such an architecture whose input dimension equals the dimension of node features. We applied a rectified linear unit (ReLU) nonlinearity at each layer, and used the Adam optimizer with a weight decay of 0.0005.
Graph Attention Networks (GAT) (Velickovic et al., 2018). We used a GAT model with two attentional layers, each consisting of four attention heads, followed by a fully connected NN (FCNN). Following the settings in (Velickovic et al., 2018), we used a Leaky ReLU with a negative slope of 0.2 for attention coefficient computation, and applied an exponential linear unit (ELU) nonlinearity to the output of each attention head. All attention heads in the layers before the last used the same output dimension. For the FCNN after the attentional layers, we used a two-layer architecture with ReLU as the nonlinearity. The Adam optimizer was applied with a weight decay of 0.0005.
GENI. We used an architecture in which each score aggregation (SA) layer contains four SA heads. For fb15k, we used a model with three SA layers; for the other KGs, a model with one SA layer. For the ScoringNetwork, a two-layer FCNN was used. GENI was trained with the Adam optimizer using a weight decay of 0.0005. The dimension of the predicate embedding was set to 10 for all KGs. We used a Leaky ReLU with a negative slope of 0.2 for attention coefficient computation, and a ReLU for the final score estimation. We defined the neighborhood of a node as its outgoing neighbors; similar results were observed when the neighborhood included both outgoing and incoming neighbors. Since the initial values of the scale and shift parameters for centrality adjustment affect model performance, we determined these initial values for each dataset based on validation performance.
node2vec (Grover and Leskovec, 2016). We set the number of output dimensions to 64 for fb15k, music10k, and tmdb5k, and to 128 for imdb. Other parameters were left at their default values. Note that node2vec was used in our experiments to generate node features for the supervised methods.
Appendix C Additional Evaluation
C.1. Case Study
We take a look at the predictions made by GENI, HAR, and GAT on tmdb5k. Given popularity scores for some movies, each method estimates the importance score of all other entities in tmdb5k. Table 8 reports the top-10 movies and directors estimated to have the highest importance scores by each method, with the ground truth rank and estimated rank shown for each entity.
In-domain estimation is presented in Table 8(a). A ground truth rank is computed from the known importance scores of movies reserved for testing. The quality of the top-10 movies predicted by GENI and GAT is comparable, and the two lists contain the same movie titles such as “The Dark Knight Rises.” On the other hand, the top-10 movies predicted by HAR are qualitatively worse than the other two: among the ten predictions, the ground truth ranks of three movies are outside the top 100.
Out-of-domain estimation is presented in Table 8(b). As importance scores for directors are unknown, we use the director ranking introduced in Section 4.1. A ground truth rank denotes the rank in the director ranking, and “N/A” indicates that the director is not included in the director ranking. The quality of the top-10 directors estimated by GENI and HAR is similar, with five directors appearing in both rankings (e.g., Steven Spielberg). Although GAT is on par with GENI for in-domain estimation, its out-of-domain estimation is significantly worse than the others': nine of its ten predictions are not even included in the list of the top-200 highest-earning directors. By respecting node centrality, GENI yields a much better ranking, consistent with the ground truth.
C.2. Regression Performance Evaluation for In-Domain Predictions
In order to see how accurately supervised approaches recover the importance of nodes, we measure the regression performance of their in-domain predictions. In particular, we report the RMSE (root-mean-squared error) of supervised methods in Table 9. Non-trainable methods are excluded since their output is not on the same scale as the input scores. GENI performs better than the other supervised methods on all four real-world datasets. Overall, the regression performance of supervised approaches follows a trend similar to their performance in terms of the ranking measures reported in Table 4.
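For reference, the RMSE reported above is computed over the nodes with known scores as:

```python
import math

def rmse(y_true, y_pred):
    # Root-mean-squared error between known and predicted scores;
    # both sequences must have the same length.
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
                     / len(y_true))
```

Since the in-domain scores are log-transformed (Appendix A), this RMSE is measured in log space.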