Unveiling Scholarly Communities over Knowledge Graphs
Knowledge graphs represent the meaning of properties of real-world entities and the relationships among them in a natural way. Exploiting the semantics encoded in knowledge graphs enables the implementation of knowledge-driven tasks such as semantic retrieval, query processing, and question answering, as well as solutions to knowledge discovery tasks including pattern discovery and link prediction. In this paper, we tackle the problem of knowledge discovery in scholarly knowledge graphs, i.e., graphs that integrate scholarly data, and present Korona, a knowledge-driven framework able to unveil scholarly communities for the prediction of scholarly networks. Korona implements a graph partition approach and relies on semantic similarity measures to determine relatedness between scholarly entities. As a proof of concept, we built a scholarly knowledge graph with data from researchers, conferences, and papers of the Semantic Web area, and applied Korona to uncover co-authorship networks. Results observed from our empirical evaluation suggest that exploiting semantics in scholarly knowledge graphs enables the identification of previously unknown relations between researchers. By extending the ontology, these observations can be generalized to other scholarly entities, e.g., articles or institutions, for the prediction of other scholarly patterns, e.g., co-citations or academic collaboration.
Knowledge semantically represented in knowledge graphs can be exploited to solve a broad range of problems in the respective domain. For example, in scientific domains such as bio-medicine and scholarly communication, or even in industry, knowledge graphs enable not only the description of the meaning of data, but also the integration of data from heterogeneous sources and the discovery of previously unknown patterns. With the rapid growth in the number of publications, scientific groups, and research topics, the availability of scholarly datasets has considerably increased. This poses a great challenge for researchers, who must keep track of newly published scientific results and potential future co-authors. To alleviate the impact of the explosion of scholarly data, knowledge graphs provide a formal framework where scholarly datasets can be integrated and diverse knowledge-driven tasks can be addressed. Nevertheless, to exploit the semantics encoded in such knowledge graphs, a deep analysis of the graph structure as well as of the semantics of the represented relations is required. There have been several attempts considering both of these aspects. However, the majority of previous approaches rely on the topology of the graphs and usually omit the encoded meaning of the data. Most such approaches are also mainly applied to special graph topologies, e.g., ego networks, rather than general knowledge graphs. To provide an effective solution to the problem of representing scholarly data in knowledge graphs, and exploiting them to effectively support knowledge-driven tasks such as pattern discovery, we propose Korona, a knowledge-driven framework for scholarly knowledge graphs. Korona enables both the creation of scholarly knowledge graphs and knowledge discovery. Specifically, Korona resorts to community detection methods and semantic similarity measures to discover hidden relations in scholarly knowledge graphs.
We have empirically evaluated the performance of Korona in a knowledge graph of publications and researchers from the Semantic Web area. As a proof of concept, we studied the accuracy of identifying co-author networks. Further, the predictive capacity of Korona has been analyzed by members of the Semantic Web area. Experimental outcomes suggest the following conclusions: i) Korona identifies co-author networks that include researchers who work on similar topics, and who attend and publish in the same scientific venues. ii) Korona allows for uncovering scientific relations among researchers of the Semantic Web area. The contributions of this paper are as follows:
A scholarly knowledge graph integrating data from DBLP datasets;
Collaboration suggestions based on co-author networks; and
An empirical evaluation of the quality of Korona using semEP and METIS.
2 Motivating Example
In this section, we motivate the problem of knowledge discovery tackled in this paper. We present an example of co-authorship relation discovery between researchers working on data-centric problems in the Semantic Web area. We checked the Google Scholar profiles of three researchers between 2015 and 2017, and compared their networks of co-authorship. By 2016, Sören Auer and Christoph Lange were part of the same research group and wrote a large number of joint publications. Similarly, Maria-Esther Vidal, also working on data management topics, was part of a co-authorship community. Figure 1b illustrates the two co-authorship communities, which were confirmed by the three researchers. After 2016, these three researchers started to work in the same research lab, and a large number of scientific results, e.g., papers and projects, were produced. An approach able to discover such potential collaborations automatically would allow for the identification of the best collaborators and, thus, for maximizing the chances of success of scholars and researchers working on similar scientific problems. In this paper, we rely on the natural intuition that successful researchers working on similar problems and producing similar solutions can collaborate successfully, and propose Korona, a framework able to discover unknown relations between scholarly entities in a knowledge graph. Korona implements graph partitioning methods able to exploit the semantics encoded in a scholarly knowledge graph and to identify communities of scholarly entities that should be connected or related.
3 Our Approach: Korona
The definitions required to understand our approach are presented in this section. First, we define a scholarly knowledge graph as a knowledge graph where nodes represent scholarly entities of different types, e.g., publications, researchers, publication venues, or scientific institutions, and edges correspond to an association between these entities, e.g., co-authors or citations.
Scholarly Knowledge Graph. Let $U$ be a set of RDF URI references and $L$ a set of RDF literals. Given sets $V_e$ and $V_t$ of scholarly entities and types, respectively, and given a set $P$ of properties representing scholarly relations, a scholarly knowledge graph is defined as $SKG=(V_e \cup V_t, E, P)$, where:
Scholarly entities and types are represented as RDF URIs, i.e., $V_e \cup V_t \subseteq U$;
Relations between scholarly entities and types are represented as RDF properties, i.e., $P \subseteq U$ and $E \subseteq (V_e \cup V_t) \times P \times (V_e \cup V_t \cup L)$.
Figure 2 shows a portion of a scholarly knowledge graph describing scholarly entities, e.g., papers, publication venues, researchers, and different relations among them, e.g., co-authorship, citation, and collaboration.
Co-author Network. A co-author network $CN=(V_c, E_c, P_c)$ corresponds to a subgraph of $SKG=(V_e \cup V_t, E, P)$, where
Nodes are scholarly entities of type researcher;
Researchers are related according to co-authorship of scientific publications.
Figure 3 shows scholarly networks that can be generated by Korona. Some of these networks are among the recommended applications for scholarly data analytics. However, the focus of this work is on co-author networks.
3.2 Problem Statement
Let $SKG'=(V_e \cup V_t, E', P)$ and $SKG=(V_e \cup V_t, E, P)$ be two scholarly knowledge graphs, such that $SKG'$ is an ideal scholarly knowledge graph that contains all the existing and successful relations between scholarly entities in $V_e$, i.e., an oracle that knows whether two scholarly entities should be related or not. $SKG=(V_e \cup V_t, E, P)$ is the actual scholarly knowledge graph, which only contains a portion of the relations represented in $SKG'$, i.e., $E \subseteq E'$; it represents those relations that are known and is not necessarily complete. Let $\Delta(E', E) = E' \setminus E$ be the set of relations existing in the ideal scholarly knowledge graph $SKG'$ that are not represented in the actual scholarly knowledge graph $SKG$. Let $SKG_{comp}=(V_e \cup V_t, E_{comp}, P)$ be a complete knowledge graph, which includes a relation for each possible combination of scholarly entities in $V_e$ and properties in $P$, i.e., $E \subseteq E' \subseteq E_{comp}$. Given a relation $e \in \Delta(E_{comp}, E)$, the problem of discovering scholarly relations consists in determining whether $e \in E'$, i.e., whether a relation $r=(e_i \; p \; e_j)$ corresponds to an existing relation in the ideal scholarly knowledge graph $SKG'$.
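In this paper, we specifically focus on the problem of discovering successful co-authorship relations between researchers in a scholarly knowledge graph $SKG=(V_e \cup V_t, E, P)$. Thus, we are interested in finding the co-author network $CN=(V_c, E_c, P_c)$ composed of the maximal set of relationships or edges that belong to the ideal scholarly knowledge graph, i.e., the set $E_c$ in $CN$ that corresponds to a solution of the following optimization problem, where $E_{comp}$ and $E'$ denote the edge sets of the complete and ideal scholarly knowledge graphs, respectively:
$$\underset{E_c \subseteq E_{comp}}{\arg\max} \; |E_c \cap E'| \qquad (1)$$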
3.3 Proposed Solution
We propose Korona to solve the problem of discovering meaningful co-authorship relations between researchers in scholarly knowledge graphs. Korona relies on information about relatedness between researchers to identify communities composed of researchers that work on similar problems and publish in similar scientific events. Korona is implemented as an unsupervised machine learning method able to partition a scholarly knowledge graph into subgraphs or communities of co-author networks. Moreover, Korona applies the homophily prediction principle over the communities of co-author networks to identify successful co-author relations between researchers in the knowledge graph. The homophily prediction principle states that similar entities tend to be related to similar entities. Intuitively, the application of the homophily prediction principle enables Korona to relate two researchers $r_i$ and $r_j$ whenever they work on similar research topics or publish in similar scientific venues. The relatedness or similarity between two scholarly entities, e.g., researchers, research topics, or scientific venues, is represented as RDF properties in the scholarly knowledge graph. Semantic similarity measures, e.g., GADES or Doc2Vec, are utilized to quantify the degree of relatedness between two scholarly entities. The identified degree indicates how relevant the entities are to each other and allows the most related ones to be returned.
Figure 4 depicts the Korona architecture; it implements a knowledge-driven approach able to transform scholarly data ingested from publicly available data sources into patterns that represent discovered relationships between researchers. Thus, Korona receives scholarly data sources and outputs co-author networks; it works in two stages: (a) Knowledge graph creation and (b) Knowledge graph discovery. During the knowledge graph creation stage, a semantic integration pipeline is followed in order to create a scholarly knowledge graph from data ingested from heterogeneous scholarly data sources. It utilizes mapping rules between the Korona ontology and the input data sources to create the scholarly knowledge graph. Additionally, semantic similarity measures are used to compute the relatedness between scholarly entities; the results are explicitly represented in the knowledge graph as scores between 0.0 and 1.0. The knowledge graph creation stage is executed offline and enables the integration of new entities in the knowledge graph whenever the input data sources change. On the other hand, the knowledge graph discovery stage is executed on the fly over an existing scholarly knowledge graph. During this stage, Korona executes three main tasks: (i) Intra-type Relatedness solver (IRs); (ii) Intra-type Scholarly Community solver (IRSCs); and (iii) Scholarly Pattern generator (SPg).
Intra-type Relatedness solver (IRs).
This module quantifies relatedness between the scholarly entities of the same type in a scholarly knowledge graph $SKG=(V_e \cup V_t, E, P)$. IRs receives as input $SKG$ and a scholarly type $t$ in $V_t$; it outputs a set $R$ of triples $(e_i, e_j, score)$, where $e_i$ and $e_j$ belong to $V_e$ and $score$ quantifies the relatedness between $e_i$ and $e_j$. The relatedness can be computed directly from the values of similarity represented in the knowledge graph, e.g., the values of semantic similarity according to GADES or Doc2Vec. Alternatively, the values of relatedness can be computed based on the number of paths in the scholarly knowledge graph that connect the scholarly entities $e_i$ and $e_j$. Figure 5 depicts two representations of the relatedness of scholarly entities. As shown in Figure 5a, IRs generates the set $R$ according to the GADES values of semantic similarity; thus, IRs includes two triples in $R$. On the other hand, if paths between scholarly entities are considered (Figure 5b), the values of relatedness can differ, e.g., in this case, Sören Auer and Christoph Lange are as similar to each other as Maria-Esther Vidal and Louiqa Raschid are.
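The path-based alternative can be illustrated with a minimal sketch. It assumes, for illustration only, that relatedness is the number of length-2 paths between two researchers, i.e., the number of intermediate nodes (venues, topics) they share; the graph contents and the function name `path_relatedness` are hypothetical and are not taken from Korona.

```python
from itertools import combinations

# Hypothetical toy graph: each researcher maps to the set of nodes
# (venues, topics) it is directly linked to. All names are illustrative.
links = {
    "researcher_a": {"venue_1", "topic_x", "topic_y"},
    "researcher_b": {"venue_1", "topic_x"},
    "researcher_c": {"venue_2"},
}


def path_relatedness(links):
    """Return triples (e_i, e_j, score), where score counts length-2 paths,
    i.e., the intermediate nodes shared by e_i and e_j (an assumption)."""
    triples = []
    for e_i, e_j in combinations(sorted(links), 2):
        shared = links[e_i] & links[e_j]
        triples.append((e_i, e_j, len(shared)))
    return triples


triples = path_relatedness(links)
for triple in triples:
    print(triple)
```

Longer paths or normalized counts could be used instead; the sketch only shows the shape of the triples that a relatedness solver of this kind would emit.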
Intra-type Scholarly Community solver (IRSCs).
Once the relatedness between the scholarly entities has been computed, communities of highly related scholarly entities are determined. IRSCs resorts to unsupervised methods such as METIS or semEP, and to the relatedness values stored in $R$, to compute the scholarly communities. Figure 6 depicts scholarly communities computed by IRSCs based on similarity values; as observed, each community includes researchers that are highly related; for readability, $R$ is shown as a heatmap where lower and higher values of similarity are represented by lighter and darker colors, respectively. For example, in Figure 6a, Sören Auer, Christoph Lange, and Maria-Esther Vidal are quite similar, and they are in the same community.
Scholarly Pattern generator (SPg).
SPg receives communities of scholarly entities and produces a network, e.g., a co-author network. SPg applies the homophily prediction principle on the input communities, and connects the scholarly entities of one community in a network. Figure 7 shows a co-author network computed based on a scholarly knowledge graph created from DBLP; as observed, Sören Auer, Christoph Lange, and Maria-Esther Vidal are included in the same co-author network. In addition to computing the scholarly networks, SPg scores the relations in a network and computes the weight of connectivity of a relation between two entities. For example, in Figure 7, thicker lines represent strongly connected researchers in the network. SPg can also filter from a network the relations labeled with higher values of weight of connectivity. All the relations in a network correspond to solutions to the problem of discovering successful co-authorship relations defined in Equation 1. To compute the weights of connectivity, SPg considers the values of similarity of the scholarly entities in a community $C$; weights are computed as aggregated values using an aggregation function $f$, e.g., average or triangular norm. For each pair of scholarly entities $e_i$, $e_j$ in $C$, the weight of connectivity between $e_i$ and $e_j$ is the aggregated value computed by $f$.
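The following minimal sketch illustrates how the homophily principle could be turned into predicted relations inside one community. It is not the SPg implementation: the function names, the data layout, and the choice to aggregate only the similarity values recorded for the pair itself are assumptions made for illustration.

```python
from itertools import combinations
from statistics import mean


def predict_relations(community, scores, known, aggregate=mean):
    """Connect every pair in a community and weight each new relation.

    community: iterable of entity names
    scores:    dict mapping frozenset({a, b}) to a list of similarity values
    known:     set of frozenset pairs that are already co-authors
    aggregate: aggregation function applied to the similarity values
               (average here; an assumption, other functions are possible)
    """
    predictions = {}
    for a, b in combinations(sorted(community), 2):
        pair = frozenset((a, b))
        if pair in known or pair not in scores:
            continue  # skip known relations and pairs without similarities
        predictions[(a, b)] = aggregate(scores[pair])
    return predictions


def filter_by_weight(predictions, threshold):
    """Keep only the predicted relations whose weight reaches the threshold."""
    return {pair: w for pair, w in predictions.items() if w >= threshold}


# Hypothetical community with illustrative similarity values.
community = {"r1", "r2", "r3"}
scores = {
    frozenset(("r1", "r2")): [0.5],
    frozenset(("r1", "r3")): [0.25],
    frozenset(("r2", "r3")): [1.0, 0.5],
}
known = {frozenset(("r1", "r2"))}

predictions = predict_relations(community, scores, known)
strong = filter_by_weight(predictions, 0.5)
print(predictions)
print(strong)
```

Passing a different `aggregate` callable, e.g., `min` as a simple triangular norm, changes the weighting without touching the rest of the sketch.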
4 Empirical Evaluation
4.1 Knowledge Graph Creation
A scholarly knowledge graph has been crafted using the DBLP collection (7.83 GB in April 2017; http://dblp2.uni-trier.de/e55477e3eda3bfd402faefd37c7a8d62/); it includes researchers, papers, and publication year from the International Semantic Web Conference (ISWC) 2001–2016. The knowledge graph also includes similarity values between researchers who have published at ISWC (2001–2017). Let and be the number of papers published by researchers and together (as co-authors), respectively at ISWC (2001–2017). Let and be the total number of papers that and have in all conferences of the scholarly knowledge graph, respectively. The similarity measure is defined as: . The similarities between ISWC (2002–2016) are represented as well. Let and the number of the authors with papers published in conferences and respectively. The similarity measure corresponds to . Thus, the scholarly knowledge graph includes both scholarly entities enriched with their values of similarity.
[Figure 8: results of the studied metrics; panels (a) Percentile 85, (b) Percentile 90, (c) Percentile 95, (d) Percentile 98]
4.2 Experimental Study
The effectiveness of Korona has been evaluated in terms of the quality of both the generated communities of researchers and the predicted co-author networks.
We assess the following research questions: RQ1) Do the semantics encoded in scholarly knowledge graphs impact the quality of scholarly patterns? RQ2) Do the semantics encoded in scholarly knowledge graphs allow for improving the quality of the predicted co-author relations?
Korona is implemented in Python 2.7. The experiments were executed on an Apple MacBook Air running macOS High Sierra 10.13 (64 bits) with an Intel Core i5 1.6 GHz CPU and 8 GB RAM. METIS 5.1 (http://glaros.dtc.umn.edu/gkhome/metis/metis/download) and semEP (https://github.com/gpalma/semEP) are part of Korona and are used to obtain the scholarly patterns.
The following metrics are computed over the set of communities obtained by Korona:
Conductance: measures relatedness of entities in a community, and how different they are from entities outside the community. The inverse of the conductance is reported.
Coverage: compares the fraction of intra-community similarities among entities to the sum of all similarities among entities.
Modularity: is the value of the intra-community similarities among the entities divided by the sum of all the similarities among the entities, minus the sum of the similarities among the entities in different communities, in the case they were randomly distributed in the communities. The value of the modularity lies in the range $[-0.5, 1]$, which can be scaled to $[0, 1]$ by computing $(\mathit{modularity} + 0.5) / 1.5$.
Performance: sums the number of intra-community relationships, plus the number of non-existent relationships between communities.
Total Cut: sums all similarities among entities in different communities. Values of total cut are normalized by dividing by the sum of the similarities among the entities; inverse values are reported.
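As a rough illustration of two of these metrics, the sketch below computes coverage and the normalized total cut from pairwise similarities and a community assignment. The function name, the data layout, and the values are hypothetical, and the definitions are the simplified readings given in the text rather than the exact formulas used in the evaluation.

```python
def coverage_and_cut(similarity, community_of):
    """Compute coverage and normalized total cut.

    similarity:   dict mapping (a, b) pairs to similarity values
    community_of: dict mapping each entity to its community identifier
    """
    total = sum(similarity.values())
    if total == 0:
        raise ValueError("the sum of all similarities must be positive")
    # Similarities between entities assigned to the same community.
    intra = sum(
        s for (a, b), s in similarity.items()
        if community_of[a] == community_of[b]
    )
    inter = total - intra  # similarities crossing community borders
    return intra / total, inter / total


# Hypothetical similarities and community assignment.
similarity = {("a", "b"): 0.5, ("a", "c"): 0.25, ("b", "c"): 0.25}
community_of = {"a": 0, "b": 0, "c": 1}

coverage, normalized_cut = coverage_and_cut(similarity, community_of)
print(coverage, normalized_cut)
```

Under this simplified reading the two quantities sum to one, so a partition with high coverage necessarily has a low normalized total cut.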
Experiment 1: Evaluation of the Quality of Collaboration Patterns.
Prediction metrics are used to evaluate the quality of the communities generated by Korona using METIS and semEP; relatedness of the researchers is measured in terms of the two similarity measures defined in Section 4.1. Communities are built according to different similarity criteria; percentiles of 85, 90, 95, and 98 of the values of similarity are analyzed. For example, at percentile 85, 85% of all similarity values among entities are lower than the similarity value at that percentile. Figure 8 presents the results of the studied metrics. In general, in all percentiles, the communities include closely related researchers. However, both implementations of Korona exhibit their best performance at percentile 95, and allow for grouping together researchers that are highly related in terms of the research topics on which they work, and the events where their papers are published. On the contrary, Korona creates many communities of unrelated authors for percentiles 85 and 90, thus exhibiting low values of coverage and conductance.
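The percentile criterion can be sketched as follows. The sketch assumes that a percentile acts as a cut-off, so that only relatedness triples at or above the percentile value are kept before communities are computed; this reading, the nearest-rank definition of the percentile, and the function names are assumptions made for illustration.

```python
import math


def nearest_rank_percentile(values, p):
    """Smallest value such that at least p percent of the values are <= it."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def keep_above_percentile(triples, p):
    """Keep the (e_i, e_j, score) triples whose score reaches percentile p."""
    threshold = nearest_rank_percentile([score for _, _, score in triples], p)
    return [t for t in triples if t[2] >= threshold]


# Hypothetical triples with scores 0.05, 0.10, ..., 1.00.
triples = [("e%d" % i, "e%d" % (i + 1), i / 20) for i in range(1, 21)]

kept_85 = keep_above_percentile(triples, 85)
kept_95 = keep_above_percentile(triples, 95)
print(len(kept_85), len(kept_95))
```

A higher percentile keeps fewer, more strongly related pairs, which is one way to read why the choice of percentile changes the communities that are found.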
Experiment 2: Survey of the Quality of the Prediction of Collaborations among Researchers.
Table 1: Survey questions.
Q1. Do you know this person? Have you co-authored before? To avoid confusion, the meaning of “knowing” was kept simple and general. The participants were asked to only consider if they were aware of the existence of the recommended person in their research community.
Q2. Have you co-authored “before” with this person at any event of the ISWC series? With the same intent of keeping the survey simple, all types of collaboration on papers in any edition of this event series were considered as “having co-authored before”.
Q3. Have you co-authored with this person after May 2016? Our study considered scholarly metadata of publications until May 2016. The objective of this question was to find out whether a prediction had actually come true, and the researchers had collaborated.
Q4. Have you ever planned to write a paper with the recommended person and you never made it and why? The aim is to know whether two researchers who had been predicted to work together actually wanted to but then did not and the reason, e.g., geographical distance.
Q5. On a scale from 1–5 (5 being most likely), how do you score the relevance of your research with this person? The aim is to discover how close and relevant the collaboration recommendations are to the survey participant.
Results of an online survey (https://bit.ly/2ENEg2G) among 10 researchers are reported; half of the researchers are from the same research area, while the other half was chosen randomly. Knowledge subgraphs of each of the participants are part of the Korona research knowledge graph; predictions are computed from these subgraphs. The predictions for each participant were laid out in an online spreadsheet along with 5 questions and a comment section. Table 1 lists the five questions that the survey participants were asked in order to validate the predictions, while Table 2 reports on the results of the study. The analysis of results suggests that Korona predictions represent potentially successful co-authorship relations; thus, they provide a solution to the problem tackled in this paper.
5 Related Work
Xia et al. provide a comprehensive survey of tools and technologies for scholarly data management, as well as a review of data analysis techniques, e.g., social networks and statistical analysis. However, all the proposals have been made over raw data, and knowledge-driven methods were not considered. Wang et al. present a comprehensive survey of link prediction in social networks, while Paulheim presents a survey of methodologies used for knowledge graph refinement; both works show the importance of the problem of knowledge discovery. Traverso-Ribón et al. introduce a relation discovery approach able to identify hidden links in TED talks; it relies on heterogeneous bipartite graphs and on the link discovery approach proposed by Palma et al. In that work, Palma et al. present semEP, a semantic-based graph partitioning approach, which was used in the implementation of Korona-semEP. Graph partitioning of semEP is similar to with the difference of only considering isolated entities, whereas is desired for ego networks. However, it is only applied to ego networks, whereas Korona is mainly designed for knowledge graphs. Sachan and Ichise propose a syntactic approach considering dense subgraphs of a co-author network created from DBLP. They discover relations between authors and propose pairs of researchers belonging to the same community. A link discovery tool is developed for the biomedical domain by Kastrin et al. Albeit effective, these approaches focus on the graph structure and ignore the meaning of the data.
6 Conclusions and Future Work
Korona is presented for unveiling unknown relations; it relies on semantic similarity measures to discover hidden relations in scholarly knowledge graphs. Reported and validated experimental results show that Korona retrieves valuable information that can impact the research direction of a researcher. In the future, we plan to extend Korona to detect other networks, e.g., affiliation networks, co-citation networks, and research development networks. We plan to extend our evaluation to big scholarly datasets and study the scalability of Korona; further, the impact of several semantic similarity measures will be included in the study. Finally, Korona will be offered as an online service that will enable researchers to explore and analyze the underlying scholarly knowledge graph.
This work has been partially funded by the EU H2020 programme for the project iASiS (grant agreement No. 727658).
- Buluç, A., Meyerhenke, H., Safro, I., Sanders, P., Schulz, C.: Recent Advances in Graph Partitioning. Springer, Cham (2016)
- Gaertler, M.: Clustering. In: Network Analysis: Method. Found.
- Karypis, G., Kumar, V.: A fast and high quality multilevel scheme for partitioning irregular graphs. Scientific Computing (1998)
- Kastrin, A., Rindflesch, T.C., Hristovski, D.: Link prediction on the semantic MEDLINE network - an approach to literature-based discovery. In: The Discovery Science Conference (2014)
- Le, Q.V., Mikolov, T.: Distributed representations of sentences and documents. CoRR abs/1405.4053 (2014)
- Liben-Nowell, D., Kleinberg, J.: The link-prediction problem for social networks. JASIST 58(7) (2007)
- Newman, M.E.: Modularity and community structure in networks. Proceedings of the National Academy of Sciences 103(23) (2006)
- Palma, G., Vidal, M., Raschid, L.: Drug-target interaction prediction using semantic similarity and edge partitioning. In: ISWC (2014)
- Paulheim, H.: Knowledge graph refinement: A survey of approaches and evaluation methods. Semantic Web Journal 8(3) (2017)
- Ribón, I.T., Vidal, M., Kämpgen, B., Sure-Vetter, Y.: GADES: A graph-based semantic similarity measure. In: SEMANTICS (2016)
- Sachan, M., Ichise, R.: Using semantic information to improve link prediction results in network datasets. IJET 2(4) (2010)
- Traverso-Ribón, I., Palma, G., Flores, A., Vidal, M.E.: Considering semantics on the discovery of relations in knowledge graphs. In: EKAW (2016)
- Wang, P., Xu, B., Wu, Y., Zhou, X.: Link prediction in social networks: the state-of-the-art. Science China Information Sciences (SCIS) 58(1) (2015)
- Xia, F., Wang, W., Bekele, T.M., Liu, H.: Big scholarly data: a survey. IEEE Big Data (2017)