Notable Characteristics Search through Knowledge Graphs

Davide Mottin
davide.mottin@hpi.de
   Bastian Grasnick
bastian.grasnick@student.hpi.de
   Axel Kroschk
axel.kroschk@student.hpi.de
   Patrick Siegler
patrick.siegler@student.hpi.de
   Emmanuel Müller
emmanuel.mueller@hpi.de
Abstract

Search engines employ complex data structures to assist the user in the search process. Among these structures, knowledge graphs are widely used for various search tasks. Given a knowledge graph that represents entities and relationships among them, one aims at complementing the search with intuitive but effective mechanisms. In particular, we focus on the comparison of two or more entities and the detection of unexpected properties, called notable characteristics. These notable characteristics find broad applicability in many domains, since they provide non-trivial insights into the entities under consideration in an intuitive and domain-independent fashion. To this end, we propose a novel formulation of the problem of searching and retrieving notable characteristics given an initial set of query nodes. While the traditional comparison of nodes by means of node similarity provides only a score with no explanation, we go one step further. We propose a solid probabilistic approach that first retrieves nodes that are similar to the query nodes provided by the user, and then exploits distributional properties to understand whether a particular attribute is interesting or not. We experimentally evaluate the effectiveness of our approach and show that we are able to discover notable characteristics that are indeed interesting and relevant for the user.


1 Introduction

Search engines have greatly evolved from simple indexes of pages to complex systems that are able to predict user intention, show personalized content, and answer queries over a large variety of data sources. One way to improve search quality is to use a knowledge graph representation of the data, including relationships among entities. A knowledge graph represents entities (e.g., Barack Obama, USA) as nodes and relationships between them (e.g., leaderOf) as edges in a graph. With this representation, knowledge graphs empower search capabilities by exploiting the relations among entities [18]. They have been successfully employed for text understanding [9], keyword search expansion [3], and semantic reasoning [28].

Figure 1: An example knowledge graph, the query (Merkel and Obama), and the discovered context nodes (Putin, Renzi, and Hollande). The fact that Angela Merkel does not have a child is a notable characteristic.

The great expressiveness of knowledge graphs can complement the search with more flexible search paradigms. Assume for instance a scholar who wants to know some non-trivial facts about Angela Merkel and Barack Obama with respect to other country leaders. It would be interesting to discover, for instance, that Angela Merkel has a PhD, as opposed to most of the other leaders, and that she has no children. We call such a fact a notable characteristic, to remark the unexpected and non-trivial aspect of the discovery. In this work we propose a novel type of search, called notable characteristics search, that allows the retrieval of such facts from a set of input query entities. Discovering notable characteristics also constitutes a basis for targeted analyses of products in electronic commerce or of microorganisms in biological networks. Imagine a user who compares two cameras and wants to know the special features of these two with respect to all the others. In general, the method suits large graphs that are hard to explore manually, where users need assistance in discovering notable characteristics as a means of comparison between entities. As a consequence, whenever a knowledge graph is available, the discovery of notable characteristics becomes an expressive and powerful search type for any user, from experts and practitioners to novice users. Moreover, the use of graphs allows for the definition of domain-independent graph techniques that easily adapt to different networks.

In our setting, we assume the user provides a set of query nodes to be compared, and the algorithm finds a set of notable characteristics of these nodes. We note that having nodes as input does not restrict the generality of the method, since there exist a number of techniques that correctly map keywords to nodes in any knowledge graph [12, 24]. Given a node, a property is a relationship with other nodes (e.g., leaderOf). A characteristic, or property, is notable if it deviates from what one would expect for the nodes under consideration. To the best of our knowledge, this is the first study of the automatic discovery of notable characteristics (or properties).

The discovery of notable characteristics entails two challenges. First, given the set of query nodes, we need to compare them only to those nodes that are similar to some extent. Second, we need to select only those properties that are significantly different from the ones expressed in the query. Note that tackling the first challenge is very important, as the comparison of the query nodes has to be performed against a set of similar nodes, which we call the context of the query. Consider the naïve approach that returns notable characteristics simply by comparing the query nodes, and assume that the user provides “Angela Merkel” and “Dilma Rousseff” as the query. This is a counterexample for the naïve direct comparison, as it will not return gender as a notable characteristic: both query nodes are female, but only in comparison with other heads of state does this become an interesting fact. At the other extreme, selecting all the nodes in the graph as context will mislead the analysis towards non-relevant nodes. Take our example of “Angela Merkel” and “Barack Obama”. A naïve selection of all humans will not work as context: since gender is roughly equally distributed in the world as well as among the input nodes, the fact that “Angela Merkel is a woman” is not notable.

Hence, it is crucial to provide a thorough context selection to prevent the above cases. Therefore, we introduce the discovery of context nodes, i.e., nodes similar to the query nodes. An example of the proposed approach is depicted in Figure 1. To this end, we devise a method that exploits metapaths [27] and random walks for context discovery. We also propose a generic framework that efficiently discovers notable characteristics through a novel probabilistic approach based on distribution comparison.

Our contributions are summarized as follows:

  • We formalize the problem of notable characteristics search given a set of query nodes as input.

  • We show how to effectively compute metapaths to find the context nodes in knowledge graphs.

  • We introduce a probabilistic approach to discover notable characteristics given a query node set.

  • We experimentally evaluate our context selection approach through a user study, and show evidence of our discovered notable characteristics and the real time performance of the proposed algorithms.

The paper is structured as follows. Section 2 introduces the problem of notable characteristics search given a set of nodes. In Section 3 we present our solution, based on random walks constrained by mined metapaths, and the probabilistic framework to identify notable characteristics. The solution is empirically evaluated in Section 4. We present the related work in Section 5, and finally conclude in Section 6 with remarks and future work.

2 Problem statement

In this section, we introduce the problem of notable characteristics search in a knowledge graph, given a set of input query nodes. A knowledge graph is a directed graph in which nodes and edges have labels or types; such graphs are also known as information networks [20, 17] or simply labeled graphs. We are given a set $\mathcal{L}_V$ of node labels and a set $\mathcal{L}_E$ of edge labels. The terms label and type are used interchangeably.

Definition 1 (Knowledge graph).

A knowledge graph is a quadruple $G = (V, E, \ell_V, \ell_E)$, where $V$ is a set of nodes, $E \subseteq V \times V$ is a set of edges, and $\ell_V : V \to \mathcal{L}_V$, $\ell_E : E \to \mathcal{L}_E$ are node and edge labeling functions, respectively.

For simplicity, we assume that everything is modeled as relationships and nodes. This is the case for attributes such as birth date: we assume that the date itself is a node connected with a “birthdate” relationship. Additionally, we assume that for every edge $(u, v) \in E$ with type $l$ there exists a reverse edge $(v, u)$ with type $l^{-1}$, to model cases such as “presidentOf” and “hasPresident”. The above assumptions do not change the nature and the generality of the methods, but simplify the notation and the analysis.
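To make this representation concrete, the following minimal sketch (our illustration, not the paper's code; all identifiers are hypothetical) stores the graph as labeled adjacency lists, materializes the reverse edge of every relationship, and models attributes such as birth dates as ordinary nodes:

```python
from collections import defaultdict

class KnowledgeGraph:
    """Toy labeled directed graph: nodes and edges both carry type labels."""

    def __init__(self):
        self.out_edges = defaultdict(list)  # node -> [(edge_label, target)]
        self.node_label = {}                # node -> node label/type

    def add_node(self, node, label):
        self.node_label[node] = label

    def add_edge(self, u, label, v):
        # Materialize the reverse edge so every relationship can be
        # traversed in both directions (presidentOf / hasPresident style).
        self.out_edges[u].append((label, v))
        self.out_edges[v].append((label + "^-1", u))

g = KnowledgeGraph()
g.add_node("Angela_Merkel", "person")
g.add_node("Germany", "country")
g.add_node("1954-07-17", "date")
g.add_edge("Angela_Merkel", "leaderOf", "Germany")
g.add_edge("Angela_Merkel", "birthdate", "1954-07-17")  # attribute as a node
```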

Recall that we are interested in discovering notable characteristics of the entities mentioned in a set of input query nodes in relation to their similars. This intuitive definition entails two questions: (1) what is the set of similars? (2) what are the notable characteristics?

The set of input nodes is referred to as the query set (query in short). Formally, given a knowledge graph $G$, the query is any set $Q \subseteq V$. The query set is manually provided by the user and is therefore considered reasonably small. The first question concerns the definition of a set of similars, referred to in this work as context nodes. We assume the existence of a similarity function $\sigma : V \times 2^V \to \mathbb{R}^+$ that assigns a high score to nodes that are similar to those in the query set and a low score otherwise. Then, the context is the set of the top-$k$ most similar nodes.

Definition 2 (Context set).

Given a knowledge graph $G$, a query set $Q$, a similarity function $\sigma$, and a parameter $k$, the context set (or simply context) is a set $C \subseteq V \setminus Q$ such that $|C| = k$ and $\sigma(u, Q) \geq \sigma(v, Q)$ for each $u \in C$, $v \in V \setminus (C \cup Q)$.

The second question concerns the notable characteristics. The characteristics are attributes or relationships of a specific node, since they implicitly represent a signature of the node itself. As before, we assume the existence of a generic discrimination function, whose role is to return a score indicating whether a specific characteristic is discriminative or unexpected when comparing two sets of nodes. Formally, in the knowledge graph $G$, a discrimination function $\delta : \mathcal{L}_E \times 2^V \times 2^V \to \mathbb{R}^+$ assigns a positive discrimination value to a characteristic, or 0 if the characteristic is not discriminative. We are now ready to define a notable characteristic.

Definition 3 (Notable characteristic).

Given a knowledge graph $G$, a query $Q$, a context $C$, and a discrimination function $\delta$, a notable characteristic is a relationship $l \in \mathcal{L}_E(Q)$ such that $\delta(l, Q, C) > 0$.

The notation $\mathcal{L}_E(Q)$ denotes the set of edge labels restricted to those that are found in the edges directly connected to $Q$, i.e., $\mathcal{L}_E(Q) = \{\ell_E(u, v) \mid (u, v) \in E \wedge (u \in Q \vee v \in Q)\}$.
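The definitions translate directly into code. The sketch below (ours, with hypothetical function names) treats sigma and delta as generic callables over the toy graph structure from the earlier snippet; Section 3 instantiates both functions:

```python
def top_k_context(graph, query, sigma, k):
    """Definition 2: the k nodes most similar to the query (excluding it)."""
    candidates = [v for v in graph.node_label if v not in query]
    candidates.sort(key=lambda v: sigma(v, query), reverse=True)
    return candidates[:k]

def edge_labels_of(graph, nodes):
    """L_E(Q): edge labels on edges directly connected to the given nodes."""
    return {label for u in nodes for (label, _) in graph.out_edges[u]}

def notable_characteristics(graph, query, context, delta):
    """Definition 3 / Problem 1: labels with positive discrimination score."""
    return {l for l in edge_labels_of(graph, query)
            if delta(l, query, context) > 0}
```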

The general problem we aim to solve is to efficiently return the notable characteristics, given a query, a similarity function, and a discrimination function.

Problem 1 (Notable characteristics search).

Given a knowledge graph $G$, a query $Q \subseteq V$, a similarity function $\sigma$, and a discrimination function $\delta$, find the set of notable characteristics.

The problem entails the definition of appropriate $\sigma$ (similarity) and $\delta$ (discrimination) functions that retrieve and compare nodes in the knowledge graph. In Section 3 we provide an elegant instance by means of a probabilistic framework that is able to discover meaningful results. We also provide the motivation of our choices by considering several variants of the above functions.

3 Notable characteristics search

In this section we describe methods to automatically discover notable characteristics given a set of query nodes. Recall that the problem requires the definition of a similarity function $\sigma$ and a discrimination function $\delta$. We model the discrimination function in probabilistic terms, in order to better deal with noisy settings and uncertainty. Therefore, we assume that a characteristic is interesting if its distribution in the query nodes deviates from the one in the context set. In other words, the context represents the expected behavior of the population, while the query is the hypothesis to be tested.

Section 3.1 shows how to effectively find the context nodes, while Section 3.2 describes the comparison of distributions to effectively discover notable characteristics.

3.1 Finding the context

Given the query $Q$, we define a similarity function to retrieve a set of context nodes. Although many notions of similarity have been developed, such as structural equivalence [19] and SimRank [11], none seems suitable for our case. Existing similarity measures are either based on restricted neighborhoods of the nodes [19], or they disregard edge and node labels [11]. We devise an algorithm that takes into account edge labels and combines the advantages of random walk and metapath approaches.

In the traditional random walk model, a random walker chooses one of the outgoing edges from a node with uniform probability. Instead of uniform probability, we favor choices that are more informative in terms of edge label frequency: the lower the frequency, the more informative the label. This intuition is supported by information theoretic notions, such as tf-idf, and has been successfully used in graphs as well [21]. As a shorthand notation, we define $E_r$ as the set of edges having label $r$, i.e., $E_r = \{e \in E \mid \ell_E(e) = r\}$. The frequency $f(r) = |E_r|/|E|$ of a label $r$ is the fraction of $r$-labeled edges with respect to the total number of edges. We then define the weighted adjacency matrix $W$ as a square $|V| \times |V|$ matrix, where the value between node $i$ and node $j$ is defined as

$$W_{ij} = \begin{cases} \dfrac{1/f(\ell_E(i,j))}{\sum_{(i,m)\in E} 1/f(\ell_E(i,m))} & \text{if } (i,j) \in E\\ 0 & \text{otherwise} \end{cases} \qquad (1)$$
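A possible reading of Eq. (1) in code (our sketch, using the same toy adjacency-list format as above): each outgoing edge is weighted by the inverse frequency of its label, and the weights are normalized per node.

```python
from collections import Counter

def transition_probs(out_edges):
    """Per-node transition distribution favoring rare (informative) labels."""
    label_count = Counter(l for edges in out_edges.values() for l, _ in edges)
    total = sum(label_count.values())
    freq = {l: c / total for l, c in label_count.items()}  # f(r) = |E_r|/|E|
    probs = {}
    for u, edges in out_edges.items():
        weights = [1.0 / freq[l] for l, _ in edges]
        z = sum(weights)  # row normalization of W
        probs[u] = [(v, w / z) for (l, v), w in zip(edges, weights)]
    return probs
```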

The Personalized PageRank is defined as the vector

$$\mathbf{p} = (1 - d)\,\mathbf{v} + d\,W^{\top}\mathbf{p} \qquad (2)$$

where $d \in [0,1]$ is the damping factor, and $\mathbf{v}$ is a vector called the personalization vector. In our experiments the damping factor is set in line with previous works. We compute the PageRank starting from each node in the query to retrieve the nodes with the highest score. This is done by setting all the probability mass of $\mathbf{v}$ on $q$, for each $q \in Q$ individually. We refer to this baseline as RandomWalk.
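The baseline can be sketched as a standard power iteration over these transition probabilities (our illustration, not the paper's implementation; 0.85 is a conventional damping value used only as a placeholder, while the iteration count follows Section 4):

```python
def personalized_pagerank(probs, source, damping=0.85, iters=10):
    """Power iteration for Eq. (2) with all restart mass on `source`."""
    scores = {source: 1.0}
    for _ in range(iters):
        nxt = {source: 0.0}
        for u, mass in scores.items():
            nxt[source] += (1.0 - damping) * mass  # teleport back to source
            for v, p in probs.get(u, []):          # follow weighted edges
                nxt[v] = nxt.get(v, 0.0) + damping * mass * p
        scores = nxt                               # dangling mass is dropped
    return scores  # run once per query node, then rank nodes by score
```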

However, the RandomWalk baseline disregards common connections between the query nodes. This is important information, since the frequency-based approach does not consider the user's similarity notion implicitly contained in the query. To this end, we adopt the notion of metapath from [27, 17], which generalizes the concept of path. A metapath for a path is the sequence of node and edge labels encountered along the path.

We mine metapaths as follows. We sample a node in $Q$ with uniform probability and run a random walk until a query node is reached. The sequence of edge labels encountered during the random walk is added to the set of metapaths, along with the number of times the same metapath has been found so far. Random walks have been shown to be effective in mining metapaths [16].
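A minimal version of this mining step could look as follows (our sketch; the walk-length cap is our assumption, in line with the maximum metapath length of 5 discussed in Section 4):

```python
import random
from collections import Counter

def mine_metapaths(out_edges, query, runs=1000, max_len=5):
    """Count edge-label sequences of random walks that end in a query node."""
    counts = Counter()
    query = set(query)
    for _ in range(runs):
        u, path = random.choice(sorted(query)), []
        for _ in range(max_len):
            edges = out_edges.get(u, [])
            if not edges:
                break                      # dead end: discard this walk
            label, u = random.choice(edges)
            path.append(label)
            if u in query:                 # reached a query node
                counts[tuple(path)] += 1
                break
    return counts
```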

Once the metapaths are retrieved, we compute a score for each node based on the probability that some metapath starting from a query node ends in this node. Given the set of metapaths $\mathcal{M}$, we denote as $P_M(q, v)$ the set of paths from node $q$ to $v$ matching metapath $M$. The score of a node $v$ with respect to the query nodes is

$$s(v) = \sum_{q \in Q} \sum_{M \in \mathcal{M}} \Pr(M)\,|P_M(q, v)|$$

where $\Pr(M)$ is the probability of choosing metapath $M$, which is the relative count $c(M)$ computed previously divided by the sum of the counts of all metapaths, i.e., $\Pr(M) = c(M) / \sum_{M' \in \mathcal{M}} c(M')$. Intuitively, $s$ gives a higher score to nodes that are reachable through frequent metapaths connecting the query nodes, or that are connected through many of these metapaths. Hence, nodes that are reached from infrequent metapaths will have a low score. Once we have computed the score for each node, we return the $k$ nodes with the highest score as our context.
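Computing $|P_M(q, v)|$ exactly can be expensive; a Monte Carlo variant of the score (our sketch, not necessarily the paper's implementation) draws metapaths proportionally to $\Pr(M)$, replays them from the query nodes choosing uniformly among label-matching edges, and counts where the walks end:

```python
import random
from collections import Counter

def context_scores(out_edges, query, metapath_counts, samples=10000):
    """Approximate s(v): endpoints of Pr(M)-weighted metapath replays."""
    metapaths = list(metapath_counts)
    weights = list(metapath_counts.values())  # Pr(M) up to normalization
    scores = Counter()
    for _ in range(samples):
        mp = random.choices(metapaths, weights=weights, k=1)[0]
        u = random.choice(sorted(query))
        for wanted in mp:                     # follow only matching labels
            nexts = [v for (l, v) in out_edges.get(u, []) if l == wanted]
            if not nexts:
                u = None                      # metapath cannot be continued
                break
            u = random.choice(nexts)
        if u is not None and u not in query:
            scores[u] += 1
    return scores  # the k highest-count nodes form the context
```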

3.2 Comparing the distributions

We revise the definition of notable characteristics in probabilistic terms. Assume we have computed the distribution of values for each characteristic (i.e., edge label) for both the query and the context nodes, the latter found with the method in Section 3.1. Intuitively, for each characteristic, the distribution of the context represents the expected, or normal, behavior. Therefore, the query set becomes the hypothesis to be evaluated against the “true” distribution of the context.

Formally, for each characteristic $l$, we consider two distributions in order to evaluate its notability. The first represents the frequency of the node labels (e.g., California) connected to a specific edge label (e.g., bornIn). This expresses information about the actual values in the nodes and can be used to identify cases where different attribute values are relevant. For instance, the people in the query in Figure 1 are half American and half European, while those in the context are all Europeans. We refer to these distributions as instance distributions:

$$Q_l^{inst}(n) = \frac{q_l(n)}{\sum_{n'} q_l(n')} \qquad C_l^{inst}(n) = \frac{c_l(n)}{\sum_{n'} c_l(n')}$$

where $q_l(n)$ and $c_l(n)$ are the number of occurrences of node $n$ at the end of an edge labeled $l$ from a node in $Q$ and $C$, respectively. In the example in Figure 1, the query and the context induce two-position vectors in which the first position indicates Physics studies and the second Law. Note that both vectors have the same size, so $Q_l^{inst}(n)$ is zero if $n$ appears only in the context.

The second distribution computes aggregates over the number of occurrences of a specific edge label in the context. This expresses information about the existence and cardinality of an attribute and can be used to identify cases where attribute cardinality is relevant. For instance, “Angela Merkel” in the query in Figure 1 has no child, while in the context all other leaders have at least one. Such cases evidently cannot be modeled as instance distributions, which take into account distinct values (e.g., the child name). We refer to these distributions as cardinality distributions:

$$Q_l^{card}(i) = \frac{q_l^{card}(i)}{\sum_{j} q_l^{card}(j)} \qquad C_l^{card}(i) = \frac{c_l^{card}(i)}{\sum_{j} c_l^{card}(j)}$$

where $q_l^{card}(i)$ and $c_l^{card}(i)$ are the number of times a node in $Q$ and $C$, respectively, has $i$ edges labeled $l$.
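Both count vectors can be collected in one pass per node set (our sketch; the explicit "None" bucket mirrors the missing-edge case shown later in Figure 7):

```python
from collections import Counter

def instance_and_cardinality(out_edges, nodes, label):
    """Raw counts behind Q_l / C_l: target identities and edge multiplicity."""
    instance, cardinality = Counter(), Counter()
    for u in nodes:
        targets = [v for (l, v) in out_edges.get(u, []) if l == label]
        if not targets:
            instance["None"] += 1       # absence of the edge is itself a value
        for v in targets:
            instance[v] += 1
        cardinality[len(targets)] += 1  # how many label-edges u has
    return instance, cardinality
```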

Both distributions can be built by iterating through the nodes in each set and counting the respective occurrences. For a given $l$, this results in two scores $s_l^{inst}$ and $s_l^{card}$. The final score is the maximum of the two:

$$s_l = \max\left(s_l^{inst},\, s_l^{card}\right) \qquad (3)$$

Many measures have been proposed in statistics to compare two distributions. However, most of them make specific assumptions, such as a minimum number of samples or non-zero probabilities, that are not fulfilled in our case. In particular, the node labels have no natural ordering and no distance function between the values. Additionally, the context has a much larger variety of node labels than the query. This leads to many zero values in the query distribution. Therefore, the commonly used Kullback-Leibler divergence (KL) [15] cannot be used. Classical statistical tests, such as the z-test and the $\chi^2$ test, require either a Gaussian distribution or a minimum sample size. On the other hand, the Earth Mover's Distance (EMD) requires the definition of a distance between values, which is not defined for categorical node labels.

In conclusion, we resorted to a more natural multinomial test that better expresses the relationship between our distributions. The multinomial test assumes that a set of observations is drawn from a known multinomial distribution. Therefore, assuming the context to be multinomially distributed, the observations are the values found in the query. If the values observed in the query are likely to be drawn from the multinomial, then the hypothesis cannot be rejected and the characteristic is marked as non-notable. On the other hand, if the test rejects the hypothesis, the two distributions are significantly different and the characteristic is notable.

Assume we have a random variable $X$ distributed as a multinomial $\mathrm{Mult}(N, \pi)$, with parameters $N$ (the number of observations) and $\pi$ (the event probabilities). We normalize the context counts to express the probability distribution $\pi$. For a given outcome $x = (x_1, \ldots, x_m)$, the probability, under the hypothesis of equality between context and sample, is

$$\Pr(x) = N! \prod_{i=1}^{m} \frac{\pi_i^{x_i}}{x_i!}$$

where $N = \sum_{i=1}^{m} x_i$. In an exact multinomial test, the significance probability is

$$p_{sig} = \sum_{y\,:\,\Pr(y) \leq \Pr(x_0)} \Pr(y)$$

which is the probability of the observed outcome $x_0$, or any equally or less likely outcome, being drawn from the probability distribution (in case of a large $N$, the exact test is impractical, and Monte Carlo sampling is performed to approximate the final result). A difference in distributions is considered significant if the hypothesis is rejected at the chosen significance level $\alpha$.

Finally, $\delta$ can be defined as

$$\delta(l, Q, C) = \begin{cases} 1 - p_{sig}(l) & \text{if } p_{sig}(l) < \alpha\\ 0 & \text{otherwise} \end{cases}$$
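The test itself can be sketched with a Monte Carlo approximation (our illustration using numpy/scipy, not necessarily the R routine used in Section 4; it assumes both count vectors are aligned over the same category order): the normalized context counts define the multinomial null, and the p-value is the probability mass of outcomes no more likely than the observed query counts.

```python
import numpy as np
from scipy.stats import multinomial

def multinomial_test(query_counts, context_counts, trials=100_000, seed=0):
    """Monte Carlo estimate of the exact multinomial significance p_sig."""
    pi = np.asarray(context_counts, dtype=float)
    pi = pi / pi.sum()                    # null distribution from the context
    x = np.asarray(query_counts, dtype=int)
    n = int(x.sum())                      # number of query observations N
    p_obs = multinomial.pmf(x, n, pi)     # likelihood of the observed outcome
    rng = np.random.default_rng(seed)
    sims = rng.multinomial(n, pi, size=trials)
    p_sims = multinomial.pmf(sims, n, pi)
    # share of simulated outcomes at most as likely as the observed one
    return float(np.mean(p_sims <= p_obs * (1.0 + 1e-9)))

# delta(l, Q, C) > 0 iff multinomial_test(...) falls below the level alpha.
```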

4 Experimental Evaluation

(a) ContextRW algorithm
(b) RandomWalk algorithm
Figure 2: Quality ($F_1$) varying the context size ($k$), comparing ContextRW and RandomWalk in the actors domain on the YAGO dataset.

In this section, we experimentally evaluate our approach on different datasets and show the impact of the parameters on the final results. Since there has been no other study on notable characteristics search so far, we had to generate a ground truth for evaluation. We did so by hiring users from a crowdsourcing platform and asking them to manually provide context nodes as Wikipedia entities. We then mapped the entities to the corresponding nodes in one of the largest knowledge graphs available.

Datasets: We perform experiments on two datasets: YAGO and LinkedMDB.

YAGO [2] is a large knowledge graph based on Wikipedia, Wordnet and Geonames. We downloaded the YAGO 2.5 core facts (http://resources.mpi-inf.mpg.de/yago-naga/yago2.5/yago2s_tsv.7z) in April 2016. It consists of 3.3M nodes and 27M edges, with 366K node types and 38 edge labels, including a type hierarchy for node types. As described in Section 2, we represented each node attribute as an edge, having the attribute value as node label.

LinkedMDB is a knowledge graph for the movie domain, extracted from the Internet Movie Database (IMDB). We downloaded a snapshot of LinkedMDB (https://datahub.io/dataset/linkedmdb) in June 2016. It consists of 739K nodes and 1.6M edges of 18 types.

Experimental Setup: We implemented our solution in Java 1.8, and ran the experiments on a machine with a quad-core Intel i5-4210U CPU 1.7 GHz and 12GB RAM. All the datasets are loaded into an Apache Jena triple store to perform quick traversals on the graph without loading it into main memory. The implemented algorithms are the following.

RandomWalk: A baseline algorithm for context selection based on Personalized PageRank, as described in Section 3.1. Instead of the matrix multiplication we used the more scalable power iteration method. We set the number of iterations to 10 and the damping factor as in Section 3.1.

ContextRW: This is our algorithm described in Section 3.1, that includes PathMining to mine the metapaths, the weighted random walk constrained to the metapaths found by PathMining, and the final score. We ran PathMining 1M times to retrieve the relevant metapaths.

FindNC: This is our algorithm that incorporates ContextRW and our method described in Section 3.2. For the multinomial test, we used a statistics package written in R.

Summary of the experiments: We evaluate our algorithms' effectiveness by comparing the retrieved context nodes to the ground truth obtained through our user survey. Our context selection returns a better context than the baseline, in less time. Moreover, our algorithm performs better as the query size increases. The returned notable characteristics indeed represent interesting undisclosed facts about the query nodes.

Figure 3: Average quality ($F_1$) vs. context set size ($k$) in the YAGO dataset.
politicians | actors | movie contributors
Angela Merkel | Brad Pitt | Steven Spielberg
Barack Obama | George Clooney | Robert Downey Jr.
Vladimir Putin | Leonardo DiCaprio | Hans Zimmer
David Cameron | Scarlett Johansson | Quentin Tarantino
François Hollande | Johnny Depp | Ellen Page
Xi Jinping | Angelina Jolie | Celine Dion

Table 1: Entities in the three domains used in the evaluation.

4.1 Evaluating Context selection

We compare the effectiveness of ContextRW with the baseline RandomWalk within different topics. Unfortunately, to the best of our knowledge, there was no existing ground truth for this evaluation, i.e., finding context nodes given a query. This is crucial in our case, since the results with a single query node can be dramatically different from those obtained with multiple query nodes. For instance, if the query only contains US presidents, we expect to find a context of US presidents. On the contrary, if the query comprises both US and German politicians, we expect to find politicians from other countries as well.

Therefore, we generated the first ground truth for this evaluation by crowdsourcing contexts for given query nodes. We selected 15 query sets from three domains, namely politicians, actors, and movie contributors, to evaluate the algorithms. For each domain we manually determined 6 query nodes belonging to the domain, such as Angela Merkel and Barack Obama for politicians. The set of nodes (or entities) for each domain is shown in Table 1. We generated a ground truth with increasing query size by asking real users to provide a ranked list of entities related to those provided in the query. We hired 34 workers for each test set, asking them to provide 15 entities each. For this experiment we used the CrowdFlower (https://www.crowdflower.com/) platform. This resulted in 510 entities for each of the 15 test sets (starting from 2 entities for each domain and adding one each time), for a total of 7,650 entities. After performing the manual labeling, we removed the entities mentioned only once, resulting in 36 to 76 entities for each query. Furthermore, we noted that for the politician scenario our version of YAGO misses some recent facts; e.g., François Hollande is not mentioned as a head of state. Hence, we manually substituted the name of the current head of state with the one found in YAGO. This substitution does not substantially change the final result, since it preserves the structural properties of the graph, but it allows us to evaluate the results with respect to the user knowledge.

We evaluated the effectiveness of both ContextRW and RandomWalk in terms of the $F_1$ score, which is defined as

$$F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$

where precision is the fraction of retrieved context nodes that appear in the ground truth, and recall is the fraction of ground truth entities that are retrieved.

Figure 4: Average quality ($F_1$) vs. query size ($|Q|$) in the YAGO dataset.

$|Q|$ | Dataset | max $F_1$ | $k$
2 | YAGO | 0.23 | 23
2 | LinkedMDB | 0.30 | 101
3 | YAGO | 0.20 | 107
3 | LinkedMDB | 0.25 | 122
4 | YAGO | 0.19 | 130
4 | LinkedMDB | 0.24 | 124
5 | YAGO | 0.25 | 162
5 | LinkedMDB | 0.26 | 198
6 | YAGO | 0.22 | 285
6 | LinkedMDB | 0.25 | 139

Table 2: Comparing the performance of ContextRW on YAGO and LinkedMDB in the actors domain.

Context size ($k$). The context size affects the quality of the results, since more context nodes potentially yield better recall but worse precision. While we report the quality in terms of the $F_1$ score, we do not report time performance, since the number of context nodes generated is always the same for every run of the algorithms. Therefore, the results report the $F_1$ score at different cutoffs in the ranked context set. Figure 2 shows the $F_1$ score for varying sizes of the context set ($k$) for the queries in the actors domain, using the YAGO dataset. We compare the performance of our ContextRW (Fig. 2(a)) with respect to the baseline RandomWalk (Fig. 2(b)). In all cases, ContextRW performs 2 times better than the baseline, indicating that the metapath-constrained random walk actually improves the overall quality. This is because many close neighbors of the query nodes are irrelevant considering the similarity notion between the query nodes, and this information is ignored by the simple RandomWalk. After an initial increase in quality, we observe a non-increasing trend when the context is bigger than 100 nodes. This is explained by an increasing recall as more context nodes are considered, but a drop in precision. We also note that the RandomWalk algorithm shows a higher variance, while ContextRW is more stable. The result is not surprising, since ContextRW includes only nodes within metapaths, while the baseline explores the space randomly, increasing the overall noise. In Figure 3 we compare the average quality of the two algorithms using the entire query. The results show that ContextRW is on average better than RandomWalk in terms of quality, performing up to four times better at some context sizes.

Figure 5: Average time (s) vs. query size ($|Q|$) in the YAGO dataset.
Figure 6: Time (s) vs. metapath length for different query sizes ($|Q|$).

Query size ($|Q|$). The query size affects both time and quality. We analyze the performance of the algorithms varying the query size. Figure 4 shows that ContextRW improves in result quality when more query nodes are considered, supporting the claim that our method can capture semantic relationships between the nodes. On the contrary, RandomWalk is not affected by the size of the query. This is a reasonable finding given that RandomWalk does not consider metapaths.

Additionally, we compare the total runtime of each method varying the query size ($|Q|$). Figure 5 shows the time to compute the context for ContextRW and the baseline RandomWalk. We note that the RandomWalk algorithm is on average up to two orders of magnitude slower than ContextRW. Moreover, while ContextRW is faster with larger queries, a random walk approach tends to become slower. This is expected behavior for ContextRW, since the chance of ending the exploration in a query node grows as the query size increases. Furthermore, we are able to return results in less than 20s.

Table 2 reports the maximum $F_1$ score at increasing $|Q|$, comparing the YAGO and LinkedMDB datasets within the actors domain using the ContextRW algorithm. While we could not evaluate the politicians domain, because that knowledge is not included in the LinkedMDB dataset, the results for movie contributors are mostly comparable and omitted for brevity. Unsurprisingly, ContextRW performs better in LinkedMDB due to the specificity of the dataset. However, the overall maximum increase in $F_1$ is small. This supports the claim that ContextRW is able to capture domain-specific knowledge even in more general datasets, exploiting the characteristics of the graph and the metapaths.

Number of paths. The ContextRW algorithm depends on the number of mined paths. Table 3 shows the $F_1$ score in relation to the context size and the number of paths. The number of paths does not noticeably affect the $F_1$ score; however, as shown in Figure 6, the time increases as the length of the metapaths (and also their number, not reported) increases. Therefore, a reasonable choice for both the number of metapaths and the maximum length is 5.

$k$ \ Number of paths | 5 | 10 | 15 | 20
50 | 0.15 | 0.16 | 0.13 | 0.15
100 | 0.22 | 0.21 | 0.21 | 0.21
150 | 0.22 | 0.23 | 0.23 | 0.23
200 | 0.22 | 0.22 | 0.22 | 0.22

Table 3: $F_1$ score as a function of the number of paths and the size of the context ($k$) for the ContextRW algorithm.

4.2 Distribution Comparison

We evaluate the performance of the FindNC algorithm in terms of quality.

Metrics comparison. We first evaluate the results by comparing the characteristics found by FindNC with those found by the KL-divergence and EMD, which allow distribution comparison. We asked three human experts to score the characteristics of a small set of examples. We then aggregated the individual judgments and compared the resulting ranking with the one obtained by each of the three methods. The minimum number of switches needed to transform one ranking into the other was used as a metric. We found that FindNC required 2 changes, while KL-divergence and EMD required 4 and 5, respectively, supporting the choice of the multinomial test as a measure of quality.

Test cases. In practice, FindNC detects results that are more interesting than those retrieved by the baseline RandomWalk when the latter is equipped with the multinomial test. We refer to RandomWalk with the multinomial test as RWMult.

One test case covers the scenario with the best $F_1$ score for the context construction, which has {George Clooney, Brad Pitt, Leonardo DiCaprio, Scarlett Johansson, Johnny Depp} as query. We selected the top 100 nodes as the context. The distribution comparison with the multinomial test identified multiple edge labels, for which we provide a visual analysis of the findings.

Figure 7: Distribution for the edge label created with query {Clooney, Pitt, DiCaprio, Johansson, Depp} and $k = 100$. The first label is None, indicating that no matching edge was found.

Figure 7 shows the instance distribution of the context for the created edge label. The created edge label is absent in 43% of the cases (represented as the None instance), whereas all the other values are equally likely with 0.66% probability each. The query presents a different distribution, with one actor without created edges and all the others with a different value each. This marks a clear deviation from the context and is therefore identified as a notable characteristic by the multinomial test. On the other hand, the hasWonPrize edge, whose distributions are depicted in Figure 8, is not marked as a notable characteristic. Looking at the distributions for the context and the query, it is easy to see that they are quite similar. The multinomial test cannot reject the null hypothesis of equality of the two distributions, and therefore the hasWonPrize edge label is not notable.

Figure 8: Distribution for the edge label hasWonPrize with query {Clooney, Pitt, DiCaprio, Johansson, Depp} and $k = 100$.

In the second test case we use two authors as query and set the top 30 nodes as context. Our solution identified the edge influences as a notable characteristic. This is because the two authors in the query influenced an actor who was influenced by only 3 people in total, and this result is definitely unexpected. On the other hand, the edge created was not found to be relevant. All authors together created 834 works in total, with only 3 of those created by multiple authors. As the query nodes also only created their own works and never collaborated, this is an expected result and thus not notable.

Figure 9: Comparison of significance probabilities for the actors scenario with 5 query nodes. The “C” after the edge label denotes cardinality distributions.

Algorithm comparison. Figure 9 compares FindNC with RWMult on the query {George Clooney, Brad Pitt, Leonardo DiCaprio, Scarlett Johansson, Johnny Depp}. All items above the threshold, depicted as a dashed line, are considered not interesting. The random walk selects mostly famous people in the movie business; therefore the actedIn relation, which connects actors with movies, is very rare in the context but common in the query, resulting in a score of 0.0086. However, this is clearly not correct and, in fact, it is deemed uninteresting by our FindNC algorithm with a score of 0.96. Similarly, hasWonPrize shows a significant difference between the two algorithms, as winning a prize is common for actors (75%) but not so in the rather mixed random walk context (only 25%). The chart also shows that the significance level of the multinomial test can be used as a parameter to obtain the desired “interestingness” level. Choosing 0.1 would include the owns relationship as a notable characteristic, revealing that Brad Pitt is (according to the dataset) the only relevant actor to own a company (Plan B Entertainment). This is specific to Brad Pitt, but not necessarily an interesting characteristic of the entire query, as is reflected in the context.

5 Related Work

Previous work on graphs mostly concerns the discovery of similar nodes or groups of individuals sharing common interests (graph clustering). In this section, we survey the most related works in these areas.

Node comparison measures. Node comparison has a long history in graph analytics. Being able to compare pairs of nodes, returning a similarity or distance score, is a fundamental activity for clustering, ranking and classification. One of the earliest methods to compare nodes is graph edit distance [4] (GED), which is the minimum number of operations to transform a graph into another. Single nodes are compared in terms of the surrounding nodes and edges. Structural equivalence [19] defines two nodes as similar if they have similar neighbors. The first algorithm for structural equivalence is CONCOR [4]. A similar approach is SimRank [11], which returns a self-similarity matrix between all pairs of nodes in the graph. Random walk approaches, such as Personalized PageRank [5] and HITS [13], can also be used to find nodes similar to the input nodes. Role discovery [8] elaborates on this idea and, instead of returning a score, returns multiple roles in terms of structural properties or global graph measures. Node comparison measures can only state whether one node is similar to or different from another; they cannot readily be adapted to the discovery of notable characteristics, since the score provides no insight into the discovery process. Additionally, these methods do not consider whether similarities or differences are meaningful with respect to a “normal” state other than total equivalence.

Seed set expansion. Seed set expansion refers to methods that ask the user to provide an initial set of entities or structures and retrieve similar nodes. These methods, also known as example-based methods, can discover the tuples of an unspecified result set in a relational database [6, 26]. In graphs, the seed set can be composed of either structures (i.e., subgraphs) or nodes. The exemplar queries paradigm [21, 22] assumes that the user input is an example of the intended results. Similarly, GQBE [10] uses entity tuples to find other similar tuples in a knowledge graph. These works are orthogonal to the discovery of notable characteristics, for they merely return answers similar to the input.

Seed nodes are used to discover groups of nodes with similar characteristics [14, 23]. These seed-based clustering algorithms exploit the specificity of each node in the seed set to return ad-hoc communities. Likewise, seed-based approaches are used to discover dense regions in the graph [7, 25]. Although these methods provide multiple groups of nodes, they cannot properly explain the characteristics of and the differences among them; in general, they do not directly compare the query nodes with the others.

Relevant path summarization. Our problem is reminiscent of the discovery of path templates (or metapaths) between nodes. A metapath is a sequence of node and edge labels that abstracts connections between nodes. As such, metapaths express connection patterns that have been shown to be effective in capturing non-trivial relationships and user preference patterns, improving the quality of recommendation results [17, 27]. Methods have been proposed to automatically discover metapaths from a given seed set [1, 20]. Metapaths can express notable connections between seed nodes, but they are insufficient for the given problem. They cannot express the lack of an edge (e.g., Angela Merkel has no children), nor can they detect characteristics related to instances (e.g., Angela Merkel is female while most leaders are male). Discovery of metapaths also ignores difference-based characteristics, such as two people born in different places when the majority of similar people were born in the same place. Most importantly, metapath discovery lacks an evaluation of notability. Notability does not correlate with frequency per se: being born in the same place is only notable if most similar people are born in other places.

6 Conclusions

In this paper, we study the problem of notable characteristics search given a set of query nodes in a knowledge graph. A notable characteristic is a special property of the query nodes that makes them different from their similars. Our problem is twofold: we first find a context set that represents the nodes similar to the query nodes; we then identify the notable characteristics with a novel probabilistic framework. We devise an algorithm for context selection based on random walks and metapath discovery, and prove its effectiveness and efficiency with real data and a user-generated ground truth. In order to find the notable characteristics, we propose a probabilistic notion that first computes distributions for each edge label and subsequently performs a multinomial test to mark the characteristics that deviate from the expected behavior. We show different test cases to demonstrate the applicability and the effectiveness of our approach on real datasets.

As future work we plan to expand the notion of notable characteristics to incorporate more complex patterns. We also intend to explore correlations between attributes as well as graph structures and incorporate results into the model.

References

  • [1] L. Akoglu, D. H. Chau, C. Faloutsos, N. Tatti, H. Tong, J. Vreeken, and L. Tong. Mining connection pathways for marked nodes in large graphs. In SDM, pages 37–45, 2013.
  • [2] J. Biega, E. Kuzey, and F. M. Suchanek. Inside yago2s: A transparent information extraction architecture. In WWW, pages 325–328, 2013.
  • [3] I. Bordino, G. De Francisci Morales, I. Weber, and F. Bonchi. From machu_picchu to rafting the urubamba river: anticipating information needs via the entity-query graph. In WSDM, pages 275–284. ACM, 2013.
  • [4] R. L. Breiger, S. A. Boorman, and P. Arabie. An algorithm for clustering relational data with applications to social network analysis and comparison with multidimensional scaling. Journal of mathematical psychology, 12(3):328–383, 1975.
  • [5] S. Chakrabarti. Dynamic personalized pagerank in entity-relation graphs. In WWW, pages 571–580, 2007.
  • [6] K. Dimitriadou, O. Papaemmanouil, and Y. Diao. Explore-by-example: An automatic query steering framework for interactive data exploration. In SIGMOD, pages 517–528, 2014.
  • [7] A. Gionis, M. Mathioudakis, and A. Ukkonen. Bump hunting in the dark: Local discrepancy maximization on graphs. In ICDE, pages 1155–1166, 2015.
  • [8] K. Henderson, B. Gallagher, T. Eliassi-Rad, H. Tong, S. Basu, L. Akoglu, D. Koutra, C. Faloutsos, and L. Li. Rolx: structural role extraction & mining in large graphs. In KDD, pages 1231–1239, 2012.
  • [9] W. Hua, Z. Wang, H. Wang, K. Zheng, and X. Zhou. Short text understanding through lexical-semantic analysis. In ICDE, pages 495–506, 2015.
  • [10] N. Jayaram, A. Khan, C. Li, X. Yan, and R. Elmasri. Querying knowledge graphs by example entity tuples. TKDE, 27(10):2797–2811, 2015.
  • [11] G. Jeh and J. Widom. Simrank: a measure of structural-context similarity. In KDD, pages 538–543, 2002.
  • [12] G. Kasneci, S. Elbassuoni, and G. Weikum. Ming: Mining informative entity relationship subgraphs. In CIKM, pages 1653–1656, New York, NY, USA, 2009.
  • [13] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. JACM, 46(5):604–632, 1999.
  • [14] I. M. Kloumann and J. M. Kleinberg. Community membership identification from small seed sets. In KDD, pages 1366–1375, 2014.
  • [15] S. Kullback and R. A. Leibler. On information and sufficiency. The annals of mathematical statistics, 22(1):79–86, 1951.
  • [16] S. Lee, S. Lee, and B.-H. Park. Pathmining: A path-based user profiling algorithm for heterogeneous graph-based recommender systems. In FLAIRS Conference, pages 519–523, 2015.
  • [17] S. Lee, S. Park, M. Kahng, and S.-g. Lee. Pathrank: a novel node ranking measure on a heterogeneous graph for recommender systems. In CIKM, pages 1637–1641, 2012.
  • [18] M. Lissandrini, D. Mottin, T. Palpanas, D. Papadimitriou, and Y. Velegrakis. Unleashing the power of information graphs. SIGMOD Record, 43(4):21–26, 2015.
  • [19] F. Lorrain and H. C. White. Structural equivalence of individuals in social networks. The Journal of mathematical sociology, 1(1):49–80, 1971.
  • [20] C. Meng, R. Cheng, S. Maniu, P. Senellart, and W. Zhang. Discovering meta-paths in large heterogeneous information networks. In WWW, pages 754–764, 2015.
  • [21] D. Mottin, M. Lissandrini, Y. Velegrakis, and T. Palpanas. Exemplar queries: Give me an example of what you need. PVLDB, 7(5):365–376, 2014.
  • [22] D. Mottin, A. Marascu, S. B. Roy, G. Das, T. Palpanas, and Y. Velegrakis. A holistic and principled approach for the empty-answer problem. VLDB J., pages 1–26, 2016.
  • [23] B. Perozzi, L. Akoglu, P. Iglesias Sánchez, and E. Müller. Focused clustering and outlier detection in large attributed graphs. In KDD, pages 1346–1355, 2014.
  • [24] J. Pound, A. K. Hudek, I. F. Ilyas, and G. Weddell. Interpreting keyword queries over web knowledge bases. In CIKM, pages 305–314, 2012.
  • [25] N. Ruchansky, F. Bonchi, D. García-Soriano, F. Gullo, and N. Kourtellis. The minimum wiener connector problem. In SIGMOD, pages 1587–1602, 2015.
  • [26] Y. Shen, K. Chakrabarti, S. Chaudhuri, B. Ding, and L. Novik. Discovering queries based on example tuples. In SIGMOD, pages 493–504, 2014.
  • [27] Y. Sun, J. Han, X. Yan, P. S. Yu, and T. Wu. Pathsim: Meta path-based top-k similarity search in heterogeneous information networks. Proceedings of the VLDB Endowment, 4(11):992–1003, 2011.
  • [28] X. H. Wang, D. Q. Zhang, T. Gu, and H. K. Pung. Ontology based context modeling and reasoning using owl. In PerCom, pages 18–22, 2004.