Graph Node Embeddings using Domain-Aware Biased Random Walks
Abstract
The recent proliferation of publicly available graph-structured data has sparked an interest in machine learning algorithms for graph data. Since most traditional machine learning algorithms assume data to be tabular, algorithms for embedding graph data into real-valued vector spaces have become an active area of research. Existing graph embedding approaches are based purely on structural information and ignore any semantic information from the underlying domain. In this paper, we demonstrate that semantic information can play a useful role in computing graph embeddings. Specifically, we present a framework for devising embedding strategies aware of domain-specific interpretations of graph nodes and edges, and use knowledge of downstream machine learning tasks to identify relevant graph substructures. Using two real-life domains, we show that our framework yields embeddings that are simple to implement and yet achieve equal or greater accuracy in machine learning tasks compared to domain-independent approaches.
1 Introduction
In recent years, we have witnessed a dramatic increase in the volume of graph-structured data being generated and made publicly available. For example, the Stanford Large Network Dataset Collection hosts a wide variety of such datasets.
Traditional machine learning methods frequently rely on a tabular data representation. Specifically, the existence of real-valued vector representations of instances, and of a real-valued distance metric between pairs of instances, is assumed. Consequently, such approaches are not readily applicable to graph data. However, it is possible to overcome this limitation by defining a mapping from entities in graph-structured data to points in a real-valued vector space. Discovering such mappings, also known as embeddings, has emerged as an active area of research. A desirable embedding should: (a) be easy to compute, (b) map to a space of low dimensionality, and (c) induce a distance metric consistent with the notion of similarity in the original domain. Vector representations resulting from such an embedding may then be used to perform a variety of downstream machine learning tasks such as supervised learning (classification and regression), entity and document modeling, recommendation systems, etc., as demonstrated by Ristoski and Paulheim (2016).
A common approach to graph embedding draws inspiration from a language modeling technique, namely word2vec (Mikolov et al., 2013), that maps words in natural language to real-valued vectors based on their occurrences in sentences of a text corpus. Many graph embedding approaches perform random walks on graphs to generate linear sequences, which are treated as sentences and fed as input to word2vec to produce vector representations (Perozzi et al., 2014; Grover and Leskovec, 2016; Ristoski and Paulheim, 2016).
The random walk strategies used in the above approaches rely only on graph structure. Any semantic information from the underlying graph domain is ignored. Further, no knowledge of intended downstream applications is assumed. In this paper, we argue that semantic information from underlying domains and knowledge of downstream tasks can help improve embeddings. As an illustrative example, consider the problem of targeted advertising using a social network graph. When the nature of the advertised product is known, we are able to identify some types of graph substructures as more relevant than others. If, on the other hand, the goal is to identify terrorist networks, then a different set of substructures may deserve greater attention. Therefore, we present biased random walk strategies informed by domain knowledge such as domain-specific interpretations of graph nodes and edges, and knowledge of downstream machine learning tasks. We show that such strategies are simple to define and implement, and can yield embeddings that achieve equal or higher accuracy in machine learning tasks compared to domain-independent approaches.
2 Related Work
The terminology used in machine learning for graphs is sometimes unclear in conveying whether the instances in a dataset are nodes of the same graph or whether each instance is a graph itself. For example, the term “graph clustering” can refer to the task of clustering the nodes of a graph based on their interconnections, as well as to the problem of clustering a set of objects where each object is itself a graph. Deep learning terminology for graph data can be similarly ambiguous, as can be seen, for example, in the application of convolutional networks to graph data. Henaff et al. (2015) apply such networks to data represented as a single graph whose nodes represent instances and edges represent a non-Euclidean distance metric. On the other hand, Niepert et al. (2016) use convolutional networks on datasets where every instance is a graph. Another example is the unsupervised learning problem of generating embeddings, where the goal is to map objects to dense vector representations in a low-dimensional Euclidean space. In the work by Perozzi et al. (2014), the instances being embedded are nodes of the same graph (e.g., users in an online network), while Yanardag and Vishwanathan (2015) consider datasets where each instance is itself a graph (e.g., protein structures).
To distinguish between these two types of scenarios, we use the following terminology in relation to the task of learning a function from a set of training instances (labeled or otherwise).

We use graph mining to refer to scenarios where the instances are nodes of the same graph. As observed by Perozzi et al. (2014), the training instances in graph mining are not independent and identically distributed (i.i.d.), since the graph edges denote relationships between the instances. Therefore, graph mining amounts to relational learning. Real-world examples of graph mining include:

Community detection in social networks, where the users (nodes) in a social network graph are partitioned based on links between them.

Classifying research publications by subjects or keywords from a citation network of papers.


We use learning from graph databases to refer to scenarios where each instance is itself a graph, and the training set consists of i.i.d. samples from an underlying distribution of graphs. Real examples include:

Classifying chemical compounds as carcinogenic or non-carcinogenic.

Classifying proteins as enzymes or non-enzymes.

In the past five years, both graph mining and learning from graph databases have been explored by deep learning researchers; however, graph mining has been studied much more extensively. Our present work falls under the category of graph mining.
In unsupervised graph mining, the problem of generating node embeddings (also known as graph embeddings) has gained a lot of attention. A node embedding is a mapping of the nodes in a graph to points in a low-dimensional Euclidean space. Intuitively, the goal is to have “similar” nodes in the graph mapped to nearby points in the Euclidean space. Once generated, the same embeddings may be used in a wide range of machine learning tasks such as supervised learning (classification and regression), entity and document modeling, recommendation systems, etc., as demonstrated by Ristoski and Paulheim (2016).
A seminal contribution in this field is the DeepWalk algorithm introduced by Perozzi et al. (2014), which applies language modeling techniques to graph mining. DeepWalk starts by generating truncated random walks on the graph. These random walks provide an approximate representation of neighborhood structures in the graph. The generated walks are then treated as sentences on which word embedding techniques such as Skip-gram word2vec (Mikolov et al., 2013) are trained to generate dense vector representations of nodes. The authors validated DeepWalk using multi-label classification in BlogCatalog, Flickr, and YouTube user networks.
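As a concrete illustration of this idea, the following sketch generates truncated uniform random walks over a toy adjacency-list graph and collects them as “sentences” of node identifiers. The graph, function names, and parameter values are illustrative, not the authors' implementation:

```python
import random

def truncated_random_walk(graph, start, max_len, rng=random):
    """Uniform truncated random walk: at each step, pick a neighbor at random."""
    walk = [start]
    for _ in range(max_len - 1):
        neighbors = graph.get(walk[-1], [])
        if not neighbors:
            break
        walk.append(rng.choice(neighbors))
    return walk

def generate_corpus(graph, walks_per_node, max_len):
    """Treat each walk as a 'sentence' of node identifiers."""
    corpus = []
    for node in graph:
        for _ in range(walks_per_node):
            corpus.append(truncated_random_walk(graph, node, max_len))
    return corpus

# Toy graph as adjacency lists.
graph = {"a": ["b", "c"], "b": ["a"], "c": ["a", "b"]}
corpus = generate_corpus(graph, walks_per_node=5, max_len=4)

# The corpus could then be handed to a word2vec implementation, e.g.:
# from gensim.models import Word2Vec
# model = Word2Vec(corpus, vector_size=64, window=2, min_count=0)
```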
Subsequently, this approach has been extended in various ways. Most extensions of DeepWalk redefine how neighborhood structures are represented. For instance, Grover and Leskovec (2016) introduce the concept of 2nd-order random walks in their algorithm, node2vec. In a 2nd-order random walk, two parameters, $p$ and $q$, determine the likelihood of returning to the node visited just prior to the current node, and of moving further away from that previous node, respectively. Various settings of these parameters lead to walks biased in various ways. This approach was validated using multi-label classification in BlogCatalog, Protein-Protein Interaction (PPI), and Wikipedia graph datasets.
Another extension of DeepWalk is the RDF2Vec algorithm by Ristoski and Paulheim (2016), which adapts the random-walk-based approach to RDF graphs.
RDF2Vec may also be viewed as an extension of DeepWalk from the standpoint of node heterogeneity. The graphs on which DeepWalk was validated are characterized by homogeneous node types (e.g., every node in the YouTube user network represents a human user). RDF graphs such as DBPedia, on the other hand, inherently allow nodes of different types to be present in the same graph. Thus, RDF2Vec embeds nodes of various types into the same Euclidean space. Another approach to embedding nodes of different types (such as image and text) into the same space is the Heterogeneous Network Embedding (HNE) architecture by Chang et al. (2015). While DeepWalk, node2vec, and RDF2Vec rely on word2vec (which uses a shallow network with a single hidden layer), HNE uses a multi-layer neural network. Conceptually, the input to the network is a pair of nodes, and the output is a predicted similarity metric. To accommodate heterogeneous nodes, every possible pair of node types (e.g., image-image, image-text, text-text, etc.) has a corresponding hierarchical feature extractor module whose inputs are pairs of nodes. The outputs of these modules, i.e., the extracted features, are fed into a common prediction layer that predicts similarity between the nodes, which is then used for backpropagation. The backpropagation, in turn, leads to learning the weights that define the embedding function.
All embedding methods described above assume the underlying graph (whether directed or undirected) to be unweighted. Cao et al. (2016) extend node embeddings to weighted graphs; in their method, structural information is represented using a random surfing model as opposed to random walks.
Our work is most closely related to the Biased RDF2Vec approach by Cochez et al. (2017). While Cochez et al. have shown that biasing random walks can improve the quality of the resulting node embeddings (and the accuracy of downstream machine learning tasks) compared to uniform random walks, the biasing strategies they use are purely based on structural information from the graph; all semantic information is ignored. We propose an alternate approach that utilizes this semantic information. Specifically, we make the following contributions:

We use knowledge of the underlying graph domain (e.g., “What do the nodes and edges of this graph represent?”) to formulate random walk biasing strategies that are domain-specific.

We show that the domain-aware strategies presented in this paper do not require the graph-wide computations that domain-independent strategies depend on, such as the frequency of each edge label in the graph, or the in-degrees of every node with an inbound edge. In terms of computational cost, this implies that our methods do not require graph-wide queries, and rely solely on local neighborhood information, which is amenable to caching due to locality of reference.

We use real-world data to show that appropriately selected domain-specific biasing strategies produce embeddings with either comparable or substantially improved accuracy (depending on the underlying graph) in downstream machine learning tasks such as classification.
The next section describes our contributions in detail.
3 Methodology
We begin this section by reviewing existing methods that serve as building blocks in our approach. We then present our novel approach of incorporating semantics from underlying graph domains to generate high-quality node embeddings.
3.1 Preliminaries
Word2vec
Word2vec, by Mikolov et al. (2013), is a natural language processing (NLP) technique that scans a corpus of text and produces real-valued vector representations (also known as word embeddings) of all the unique words in that corpus.
Mikolov et al. give two alternative neural architectures for computing word embeddings, namely, Continuous Bag of Words (CBOW) and Skip-gram. Both architectures are shallow, in that each has only one hidden layer. The two architectures differ in how the relationship between a word and its contexts is viewed. In CBOW, the network accepts a context of words as input, and computes, for every word in the vocabulary, the probability of that word occurring at the center of that context. In Skip-gram, the input to the network is a single word, and the output is a probability mass function over the vocabulary that gives the probability of finding each word as a result of randomly selecting a word within a certain window around that input (McCormick, 2016).
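The difference between the two framings can be made concrete by extracting the training examples each architecture sees from a context window. This is a simplified illustration; actual word2vec implementations add refinements such as subsampling and dynamic window sizes:

```python
def skipgram_pairs(sentence, window):
    """Skip-gram: each (center word -> one context word) is a training example."""
    pairs = []
    for i, center in enumerate(sentence):
        lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs

def cbow_examples(sentence, window):
    """CBOW: each (full context -> center word) is a training example."""
    examples = []
    for i, center in enumerate(sentence):
        lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
        context = [sentence[j] for j in range(lo, hi) if j != i]
        if context:
            examples.append((tuple(context), center))
    return examples

sent = ["the", "cat", "sat", "on", "mat"]
sg = skipgram_pairs(sent, window=1)   # pairs such as ("cat", "the"), ("cat", "sat")
cb = cbow_examples(sent, window=1)    # examples such as (("the", "sat"), "cat")
```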
These word embedding architectures require data to be sequential (e.g., sentences of words). However, they have been applied to higher-dimensional data, namely graphs, by extending the notion of context from a one-dimensional window around a word to a neighborhood structure around a graph node. A set of walks rooted at any node may be considered an approximate representation of the node’s neighborhood. The advantage of such a representation is that it is sequential, and therefore may be ingested by a word embedding architecture to produce embeddings of graph nodes. Although this approach has been utilized in a number of algorithms (e.g., DeepWalk by Perozzi et al. (2014), node2vec by Grover and Leskovec (2016), and RDF2Vec by Ristoski and Paulheim (2016)), here we focus on the RDF2Vec algorithm due to its relevance to the present work.
RDF2Vec
RDF2Vec uses several techniques to represent neighborhood structures as linear sequences: (a) enumerating all walks of a given depth rooted at every node, generated using breadth-first search, (b) Weisfeiler-Lehman Subtree RDF Graph Kernels (de Vries and de Rooij, 2015), and (c) a set of random walks rooted at every node, where the maximum number of walks rooted at any node and the maximum length of any walk are both specified. For web-scale datasets, only random walks are used because the other neighborhood representations are prohibitively expensive to compute. The set of graph walks thus generated is sequential data, and is used to train a word2vec model to compute node embeddings. We note that the random walks used by Ristoski and Paulheim (2016) are uniform random walks, i.e., the edge to traverse next is selected uniformly at random from the set of all outbound edges originating from the current node. Subsequently, Cochez, Ristoski, Ponzetto, and Paulheim (2017) have demonstrated that the quality of embeddings and the accuracy of downstream tasks (such as classification and regression) may be improved by using random walks that are biased as opposed to uniform. We review this approach next.
Biased RDF2Vec
Biased RDF2Vec formulates a biasing strategy as an edge weighting function, $w : E \to \mathbb{R}^{+}$, that assigns a positive real-valued weight to every edge in the graph. Here, $E$ is the set of graph edges and $\mathbb{R}^{+}$ is the set of positive real numbers. At any node, an outgoing edge is selected with probability proportional to its weight assigned by the weighting function. In other words, if the current node has outgoing edges $e_1, e_2, \ldots, e_n$, then the probability of selecting edge $e_i$ for the next traversal is given by:
$$P(e_i) = \frac{w(e_i)}{\sum_{j=1}^{n} w(e_j)} \qquad (1)$$
Various edge weighting functions lead to various biasing strategies, such as predicate frequency, where the probability of traversing a (labeled) edge is proportional to the frequency of occurrence of its edge label in the graph, object frequency, where the probability is proportional to the in-degree of the node to which the edge leads, and so on. We note that the biasing strategies used in Biased RDF2Vec are derived purely from structural information, and do not utilize any knowledge of what the graph really represents.
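A minimal sketch of edge selection under an arbitrary weighting function, using predicate frequency as the example bias. The `(label, target)` edge representation and the graph-wide label counts are hypothetical:

```python
import random

def weighted_choice(edges, weight_fn, rng=random):
    """Select the next edge per Equation 1: P(e_i) = w(e_i) / sum_j w(e_j)."""
    weights = [weight_fn(e) for e in edges]
    return rng.choices(edges, weights=weights, k=1)[0]

# A structural bias such as predicate frequency weights each edge by how often
# its label occurs in the whole graph (hypothetical counts shown).
label_counts = {"member": 120, "author": 950}

def predicate_frequency(edge):
    label, _target = edge
    return label_counts[label]

edges = [("member", "group1"), ("author", "paper7")]
chosen = weighted_choice(edges, predicate_frequency)
```

Note that computing `label_counts` requires a graph-wide pass, which is precisely the kind of global computation the domain-aware strategies in this paper avoid.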
In contrast, our approach is to formulate biasing strategies that are aware of the semantics of the underlying graph domains. Such biasing strategies are domain-specific. Our empirical results suggest that in real-world scenarios, embeddings produced using domain-specific strategies may outperform embeddings produced based on structure alone, in terms of accuracy of downstream tasks such as classification.
3.2 Domain-Aware Biased RDF2Vec
As mentioned in Section 2, we propose an approach to generate node embeddings that uses graph walks biased by an understanding of the underlying semantics. Specifically, we present a framework consisting of the following steps:

Data Exploration. We begin by identifying all the entity types (node types) and relationship types (edge types), and for every relationship type we identify all pairs of entity types that it connects.
Attribute Selection. We examine the end goal (e.g., inferring the value of a property of an entity such as a class label) and identify aspects of the graph (e.g., node labels, edge types) that may be relevant to that goal.

Strategy Definition. We use the selected attributes to define a family of biasing strategies that increase (or decrease) the likelihood of selecting an edge during any random walk.

Evaluation. We generate walks using these biasing strategies and, treating these walks as sentences, apply word2vec to compute node embeddings. The quality of the embeddings is measured using a supervised learning task such as classification of nodes based on their corresponding embeddings.
We now illustrate this approach using a realworld example.
The AIFB Dataset
The AIFB Dataset is an RDF dataset pertaining to the Institute for Applied Informatics and Formal Description Methods (AIFB) at the Karlsruhe Institute of Technology. The classification task associated with this dataset is to predict the research group affiliation of the institute's people.
Strategy 1
Data Exploration.
We begin with the observation that the RDF graph for AIFB, downloaded from http://data.dws.informatik.uni-mannheim.de/rmlod/LOD_ML_Datasets/data/datasets/RDF_Datasets/AIFB/, explicitly includes nodes (i.e., entities) of type “research group”, as well as edges (i.e., relationships) labeled “affiliation” from persons to research groups and edges labeled “member” from research groups to persons for the training and test instances. Ristoski and Paulheim (2016) show that RDF2Vec can learn the equivalence between the target of the classification problem, namely, research group affiliation, and the concept modeled using the above node and edge types. We note that this approach is different from that of Bloehdorn and Sure (2007), who treat research group affiliation as a latent variable to be inferred from the graph. We extend Ristoski and Paulheim’s approach.
Attribute Selection.
The random walks used by RDF2Vec do not exploit a priori knowledge of the correspondence between the notion of affiliation and edges labeled “member” and “affiliation” in the graph; specifically, the walks are agnostic to what these labels mean. On the other hand, if a human is shown a (small) subgraph of this data, they will probably utilize this correspondence and focus on these edges more than on other edges. Our first strategy formalizes this intuition, using walks that preferentially select edges with these labels for traversal.
Strategy Definition.
Recall from Equation 1 that every assignment of weights to edges defines a biased random walk strategy. Based on the above discussion, we assign weight $w_h$ to every edge labeled “member” or “affiliation”, and weight $w_l$ to every other edge, where hyperparameters $w_h$ and $w_l$ satisfy $w_h > w_l$. The resulting strategy is shown in Algorithm 1. We note that normalization of weights inside the weighting function is not necessary because the weighted random walk algorithm uses Equation 1 to normalize the weights before choosing the next edge to traverse. We also note that since this strategy relies on hyperparameters, the quality of embeddings and the accuracy of downstream tasks depend on hyperparameter tuning. Hyperparameter tuning is necessary for all algorithms presented in this paper.
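A weighting function implementing this strategy needs only a few lines. The sketch below is illustrative: the hyperparameter values and the `(label, target)` edge representation are assumptions, not the paper's exact implementation:

```python
# Edge labels that directly encode research-group affiliation in AIFB.
FAVORED_LABELS = {"member", "affiliation"}

def aifb_weight_fn_1(edge, w_high=10.0, w_low=1.0):
    """Strategy 1: favor edges whose labels encode affiliation (w_high > w_low)."""
    label, _target = edge
    return w_high if label in FAVORED_LABELS else w_low
```

As noted above, no normalization is needed here: Equation 1 normalizes the weights at each step of the walk.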
While this strategy relies solely on relationship types (i.e., edge labels), a strategy may also rely solely on entity types (i.e., node types), or on both relationship and entity types. The next two strategies use entity types.
Strategy 2
Data Exploration.
From the AIFB RDF graph, we observe that in addition to persons and research groups, the graph also has nodes representing projects and research topics. Moreover, there are “is about” relationships connecting projects and topics, “works at project” relationships connecting persons and projects, and “is worked on by” relationships connecting research topics and persons.
Attribute Selection.
Based on the intuition that people affiliated with the same research group have a higher likelihood of working on the same project or research topic, we explore whether graph walks that are biased towards nodes of type “person”, “project”, “research topic” and “research group” are helpful in predicting affiliation.
Strategy Definition.
A strategy based on the above intuition is shown in Algorithm 2, where any edge that leads to a node of type “person”, “project”, “research group” or “research topic” is assigned weight $w_h$, whereas all other edges are assigned weight $w_l$. Here, hyperparameters $w_h$ and $w_l$ satisfy $w_h > w_l$. Like Algorithm 1, this algorithm requires parameter tuning. However, unlike Algorithm 1, which selects edges based on edge labels, this strategy selects edges based on the types of nodes to which those edges lead.
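A sketch of this node-type-based weighting; the node-type lookup table and the hyperparameter values are illustrative assumptions:

```python
# Entity types whose nodes the walks should gravitate toward.
FAVORED_TYPES = {"person", "project", "research topic", "research group"}

def aifb_weight_fn_2(edge, node_types, w_high=10.0, w_low=1.0):
    """Strategy 2: favor edges leading to nodes of selected entity types."""
    _label, target = edge
    return w_high if node_types.get(target) in FAVORED_TYPES else w_low

# Hypothetical node-to-type lookup, e.g. materialized from the RDF graph.
node_types = {"n1": "person", "n2": "publication"}
```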
Strategy 3
Data Exploration.
We observe that the AIFB RDF graph has nodes representing publications, and edges connecting persons and publications to indicate authorship.
Attribute Selection.
We begin with the hypothesis that coauthorship in a publication is a relationship that tends to connect members of the same research group. However, when we experiment with strategies that bias graph walks towards publication nodes, we find that these strategies have a negative impact on classification accuracy, which suggests that there may be a substantial number of publications whose authors do not belong to the same research group. In fact, in Section 4.2 of their paper, Bloehdorn and Sure (2007) mention the existence of several authors in the dataset who are external to the AIFB department. As a result, it is difficult to infer affiliation based on coauthorship. This observation leads to the interesting question of whether ignoring coauthorship relationships makes the problem of inferring affiliation simpler.
Strategy Definition.
To explore this possibility, we devise a strategy to decrease the likelihood of selecting edges that lead to publication nodes. In the strategy shown in Algorithm 3, any edge leading to a node of type “publication” is assigned weight $w_l$ while all other edges are assigned weight $w_h$, with $w_l < w_h$.
We now demonstrate that a strategy may depend both on entity types as well as relationship types, and further that simple biases may be combined to form more complex ones.
Strategy 4 We combine the approaches of Algorithm 1 (i.e., favoring affiliation and member edges) and Algorithm 3 (i.e., avoiding publication nodes) to get Algorithm 4, which does both. This algorithm has three hyperparameters, $w_l$, $w_m$, and $w_h$, with $w_l < w_m < w_h$. Any edge that leads to a node of type “publication” is assigned weight $w_l$, any edge whose label is “affiliation” or “member” is assigned weight $w_h$, and all other edges are assigned weight $w_m$.
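The combined strategy can be sketched as follows; the hyperparameter values are illustrative, and edges are again assumed to carry a label and a target node:

```python
def aifb_weight_fn_4(edge, node_types, w_low=0.1, w_mid=1.0, w_high=10.0):
    """Strategy 4: penalize edges into publications, favor affiliation/member
    edges, and treat everything else neutrally (w_low < w_mid < w_high)."""
    label, target = edge
    if node_types.get(target) == "publication":
        return w_low                      # avoid publication nodes
    if label in {"affiliation", "member"}:
        return w_high                     # favor affiliation-encoding edges
    return w_mid
```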
As demonstrated by our experimental results in the next section, these simple domain-aware biasing strategies achieve substantially higher classification accuracy compared to several domain-independent strategies, namely, predicate frequency, inverse predicate frequency, object frequency, inverse object frequency, and uniform random walks. At the same time, the domain-aware strategies presented in this section do not require the graph-wide calculations that many domain-independent strategies rely on, such as computing frequency distributions of edge labels in the graph, or computing in-degrees of all nodes with at least one incoming edge in the graph.
In the next section, we demonstrate similar results for a British Geological Survey (BGS) dataset publicly available in RDF format.
The BGS Dataset
As described by Ristoski et al. (2016), the British Geological Survey (BGS) RDF dataset documents observed geological properties of rock types in Great Britain. This dataset was used in machine learning by de Vries (2013) to predict lithogenesis (method of formation) types of named rock units. We now apply our methodology to derive domain-specific biased random walk strategies for the BGS dataset.
Strategy 1
Data Exploration.
As with AIFB, we note that the RDF graph for BGS, available from http://data.dws.informatik.uni-mannheim.de/rmlod/LOD_ML_Datasets/data/datasets/RDF_Datasets/BGS/, contains nodes (i.e., entities) representing lithogenesis types, and edges (i.e., relationships) labeled “hasLithogenesis” associating rock types with lithogenesis types. This suggests that when trained on the BGS graph, RDF2Vec learns the equivalence between the target attribute of the classification problem and the concept represented by nodes of type “lithogenesis” together with edges labeled “hasLithogenesis”.
Attribute Selection.
However, the random walks used by RDF2Vec do not exploit any prior knowledge of this equivalence; in particular, such walks are agnostic to the meaning of the node type “lithogenesis” and edge label “hasLithogenesis”. To exploit this semantic information, we explore walks where edges labeled “hasLithogenesis” have a higher likelihood of being traversed compared to other edges.
Strategy Definition.
Our strategy resulting from the above discussion is shown in Algorithm 5, where any edge labeled “hasLithogenesis” is assigned weight $w_h$ whereas other edges are assigned weight $w_l$, with $w_h > w_l$.
Strategy 2 In Section 4, we show that biasing walks in favor of “hasLithogenesis” edges substantially improves classification accuracy. However, it is also interesting to explore the impact of biasing walks against these edges, to gain insight into the usefulness of semantic information contained in the rest of the graph. One such biasing strategy is shown in Algorithm 6.
In Algorithm 6, we note that setting $w_l$ to be several orders of magnitude less than $w_h$ effectively amounts to removing the edges labeled “hasLithogenesis” and performing uniform random walks on the remainder of the graph. We show in Section 4 that under the above conditions Algorithm 6 achieves approximately the same accuracy as uniform random walks.
An interesting aspect of the BGS dataset is the hierarchical organization of categories. In the next section, we discuss a biasing strategy that utilizes this hierarchy.
Strategy 3
Data Exploration.
In the BGS dataset, concepts are organized in a hierarchy using edges labeled “broader” and “narrower”. Intuitively, we expect rock types that are similar in the concept hierarchy to have similar modes of origin, which suggests that the distance between their vector representations should be small.
Attribute Selection.
To enable graph walks to discover concept hierarchies, we use a strategy that heavily biases walks towards edges labeled “broader” and “narrower”, while rarely traversing edges labeled “hasLithogenesis”.
Strategy Definition.
One implementation of this idea is shown in Algorithm 7, where the hyperparameters satisfy $w_l < w_m < w_h$.
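One possible reading of this three-tier weighting, sketched under the assumption that edges carry their labels; the tier assignment and hyperparameter values are illustrative, not Algorithm 7 verbatim:

```python
def bgs_weight_fn_3(edge, w_low=0.01, w_mid=1.0, w_high=10.0):
    """Follow the concept hierarchy aggressively; rarely take
    hasLithogenesis edges; treat all other edges neutrally."""
    label, _target = edge
    if label in {"broader", "narrower"}:
        return w_high     # walk up and down the concept hierarchy
    if label == "hasLithogenesis":
        return w_low      # only rarely jump to a lithogenesis type
    return w_mid
```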
In Section 3.2.1, we observed that walks on the AIFB graph that include edges irrelevant to the downstream task can reduce accuracy. This is true for the BGS dataset as well, as we show next.
Strategy 4
We empirically found that edges labeled “inScheme” contribute negatively to the accuracy of the classification task. This observation leads to Algorithm 8 that tends to avoid these edges.
As with AIFB, it is possible to combine simple biasing strategies for walks on the BGS graph into more sophisticated strategies, as shown next.
Strategy 5 Similar to Algorithm 7, the goal here is to enable graph walks to exploit concept hierarchies. However, once a walk has arrived at the most general concept reachable, we want to use the “hasLithogenesis” edge (if available) to access the lithogenesis type for that general concept. Intuitively, this strategy is expected to succeed if specialized rock types inherit lithogenesis types from more general rock categories. The strategy is shown in Algorithm 9, where the hyperparameters satisfy $w_l < w_m < w_h$.
In the next section, we experimentally evaluate our embedding methods and compare them against techniques that use domainindependent biased walks and uniform random walks.
4 Evaluation
For our empirical study, we have implemented a framework for extracting random walks from any RDF graph and subsequently deriving node embeddings from those walks. This framework permits a broad range of biasing strategies including domain-specific, domain-independent, and uniform random walks, making it suitable for experimental evaluation of all three kinds of strategies. This section begins with a description of the framework, followed by experimental results and discussion.
4.1 Experimental Setup
Our node embedding framework for RDF graphs is depicted in Figure 1, and is composed of the following components:

RDF Importer. We use the neosemantics package, specifically, the semantics.importRDF stored procedure in that package, to read the RDF file into a graph database.
Graph Database. We use Neo4j as our database management system for storing and querying graphs.
Weighted Random Walk Generator. This is a program we have written in Python that accepts an edge-weighting function as input, and performs weighted random walks on the graph stored in Neo4j, using Equation 1 for edge selection. The program uses the Py2Neo connectivity library to query the Neo4j database, and generates a maximum of $N$ walks rooted at every node of the graph, where $N$ is an input parameter held fixed across our experiments. Every walk can include one, two, three, or four edge traversals (i.e., hops).
Edge-Weighting Function. Edge-weighting functions are Python implementations of biasing strategies. Since the Weighted Random Walk Generator accepts this function as input, various types of biased random walks can be easily generated by implementing the corresponding edge-weighting functions, which typically require very few lines of code.

Corpus of Graph-Walk “Sentences”. The random walks, printed out by the Weighted Random Walk Generator one walk at a time, are typically redirected for storage into a plain text file. This file plays the role of a corpus of sentences.

Word2vec. We use the Gensim implementation of word2vec, written by Řehůřek and Sojka (2010). Word2vec scans the sentences generated in the previous step to compute embeddings for each word in the sentences (i.e., each node in the graph). Our word2vec parameter settings closely follow those of Ristoski and Paulheim (2016), including the window size, the number of epochs (i.e., the number of passes through the entire corpus), the number of negative samples, and the dimensionality of the embedding space, for which we try two settings. However, since we have not observed a substantial improvement in performance at the higher dimensionality, only results for the lower dimensionality are reported below.
Embeddings. Finally, the output of the system is a set of embeddings, i.e., dense vector representations of every node in the graph. The embeddings may be stored as a Python pickle file, or directly used by downstream components such as classifiers.
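The components above can be condensed into a self-contained sketch, with an in-memory adjacency structure standing in for the Neo4j database queried via Py2Neo; the toy graph, names, and parameter values are all illustrative:

```python
import random

def biased_walk(graph, start, max_hops, weight_fn, rng=random):
    """One weighted random walk (Equation 1), recording nodes and edge labels,
    in the spirit of RDF walks that include predicates as tokens."""
    walk = [start]
    node = start
    for _ in range(max_hops):
        edges = graph.get(node, [])
        if not edges:
            break
        weights = [weight_fn(e) for e in edges]
        label, target = rng.choices(edges, weights=weights, k=1)[0]
        walk.extend([label, target])
        node = target
    return walk

def build_corpus(graph, weight_fn, walks_per_node=10, max_hops=4):
    """Up to walks_per_node walks rooted at every node; each walk is a 'sentence'."""
    return [biased_walk(graph, n, max_hops, weight_fn)
            for n in graph for _ in range(walks_per_node)]

# Toy RDF-like graph: node -> list of (edge label, target node).
graph = {
    "personA": [("member", "group1"), ("author", "paper1")],
    "group1":  [("affiliation", "personA")],
    "paper1":  [],
}
corpus = build_corpus(
    graph, lambda e: 10.0 if e[0] in {"member", "affiliation"} else 1.0)

# The corpus can then be handed to Gensim, e.g.:
# from gensim.models import Word2Vec
# model = Word2Vec(corpus, vector_size=100, window=5)
# embedding_of_personA = model.wv["personA"]
```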
We evaluate the quality of the resulting embeddings using $k$-Nearest Neighbor ($k$-NN) classification on the AIFB and BGS datasets. While Ristoski and Paulheim (2016) use a single value of $k$, we choose $k$ separately for AIFB and for BGS so that our implementation can reproduce the baseline results for uniform (i.e., unbiased) random walks reported in their paper. The training and test datasets, available from the same URIs as the RDF files, consist only of node identifiers and class labels (as well as a row index with respect to the training/test files), where the node identifiers are the same identifiers as those used in the RDF graphs (and therefore in the Neo4j database and in the extracted graph walks). Notably, no additional tabular attributes are used to guide the classification. We measure classification accuracy as the percentage of correctly classified test instances. We next present results obtained from our experiments.
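A minimal $k$-NN classifier over node embeddings, of the kind used in this evaluation (a pure-Python sketch; the 2-D embeddings and research-group labels shown are hypothetical):

```python
import math
from collections import Counter

def knn_predict(train, query_vec, k):
    """Classify by majority label among the k nearest training embeddings."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query_vec))[:k]
    labels = [label for _vec, label in nearest]
    return Counter(labels).most_common(1)[0][0]

# Hypothetical 2-D embeddings with research-group labels.
train = [((0.0, 0.0), "group1"),
         ((0.1, 0.1), "group1"),
         ((1.0, 1.0), "group2")]
pred = knn_predict(train, (0.05, 0.0), k=3)
```

Accuracy is then simply the fraction of test-set predictions matching the known labels.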
4.2 Experimental Results
Classification results for the AIFB dataset are presented in Table 1. In this table, all numbers correspond to Continuous Bag of Words (CBOW) word2vec embedding into a dimensional space. Increasing the dimensionality to does not improve the accuracy appreciably. Skip-gram word2vec reduces our measured accuracy for AIFB.
Biasing Strategy  CBOW  
Uniform  
Domain-independent  Predicate Frequency  
Inverse Predicate Frequency  
Object Frequency  
Inverse Object Frequency  
Domain-specific  AIFBWeightFunction1  
AIFBWeightFunction2  
AIFBWeightFunction3  
AIFBWeightFunction4 
As mentioned in the previous section, hyperparameter settings affect the quality of embeddings. In this study, we have tuned hyperparameters manually; the use of automated hyperparameter tuning is a possible extension of this work. Hyperparameter values used in the domain-specific biasing strategies in Table 1 are shown in Table 2.
Algorithm  Hyperparameters 
AIFBWeightFunction1  
AIFBWeightFunction2  
AIFBWeightFunction3  
AIFBWeightFunction4 
Classification results for the BGS dataset are presented in Table 3. In this table, we include results for Continuous Bag of Words (CBOW) as well as Skip-gram word2vec embeddings into a dimensional space. For the BGS dataset, we include Skip-gram results since, in some cases, Skip-gram provides slightly better results than CBOW. Increasing the dimensionality to does not improve the accuracy appreciably.
Biasing Strategy  CBOW  Skip-gram  
Uniform  
Domain-independent  Predicate Frequency  
Inverse Predicate Frequency  
Object Frequency  
Inverse Object Frequency  
Domain-specific  BGSWeightFunction1  
BGSWeightFunction2  
BGSWeightFunction3  
BGSWeightFunction4  
BGSWeightFunction5 
Hyperparameter values used in the domain-specific biasing strategies in Table 3 are shown in Table 4.
Algorithm  Hyperparameters 
BGSWeightFunction1  
BGSWeightFunction2  
BGSWeightFunction3  
BGSWeightFunction4 
4.3 Remarks
In the case of AIFB, we observe that domain-specific biases provide a substantial improvement in classification accuracy compared to domain-independent ones. In particular, the simple strategy of avoiding publication nodes (AIFBWeightFunction3) leads to a classification accuracy of . When we avoid publication nodes and also preferentially select affiliation and member edges for traversal (AIFBWeightFunction4), the classification accuracy increases to . We further note that the domain-specific algorithms do not require the graph-wide calculations that the domain-independent biased walks rely on, such as the frequency distribution of edge labels (i.e., predicate frequencies). In the domain-specific walks, all decisions are made on locally available data which, due to locality of reference, is also highly cacheable.
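To illustrate the locality of such decisions, the following sketch shows the general shape of a rule that down-weights edges into publication nodes and up-weights affiliation/member edges. The function names, weight values, and edge representation are hypothetical and do not reproduce the actual AIFBWeightFunction definitions:

```python
import random

def edge_weight(edge, avoid_label="publication",
                prefer_predicates=("affiliation", "member"),
                avoid_w=0.1, prefer_w=5.0):
    """Schematic local biasing rule: down-weight edges that lead to
    publication nodes, up-weight affiliation/member edges. All weight
    values here are hypothetical."""
    if edge["target_label"] == avoid_label:
        return avoid_w
    if edge["predicate"] in prefer_predicates:
        return prefer_w
    return 1.0

def pick_next_edge(outgoing_edges):
    """Weighted random choice among a node's outgoing edges; only locally
    available data (predicate and target label) is inspected."""
    weights = [edge_weight(e) for e in outgoing_edges]
    return random.choices(outgoing_edges, weights=weights, k=1)[0]

# Two toy outgoing edges of the current node.
edges = [
    {"predicate": "publishes", "target_label": "publication", "target": "p1"},
    {"predicate": "affiliation", "target_label": "group", "target": "g1"},
]
```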
In the case of BGS, we observe that the most successful domain-specific biasing strategy (BGSWeightFunction1) performs in the ballpark of the most successful domain-independent biasing strategy (inverse predicate frequency). Even under these circumstances, the domain-specific BGSWeightFunction1 provides the advantage of not requiring graph-wide computation of predicate (i.e., edge-label) frequencies. We have not yet been able to define a domain-specific strategy that outperforms the best domain-independent strategy.
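For contrast, inverse predicate frequency requires a pass over the entire graph before any walk can start. A sketch on a toy triple list (the predicates are illustrative, not taken from the BGS data):

```python
from collections import Counter

# Graph-wide pass: count every predicate (edge label) across all triples.
triples = [
    ("n1", "hasLithology", "n2"),
    ("n1", "hasLithology", "n3"),
    ("n2", "hasTheme", "n4"),
]
freq = Counter(predicate for _, predicate, _ in triples)

# Inverse predicate frequency: rarer edge labels get larger walk weights.
ipf_weight = {p: 1.0 / count for p, count in freq.items()}
```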
Of the two real-world domains studied here, the AIFB domain (involving researchers, research groups, publications, etc.) is closer to our own field of work than the BGS domain (involving rock formations), which pertains to geology. As a result, it was easier for us to effectively exploit AIFB semantic information to devise biasing strategies that outperform domain-independent walks, whereas in BGS, our strategies could match the performance of domain-independent walks but not outperform them. It is possible that a domain expert in the processes of rock formation might be able to invent biasing strategies for BGS that outperform domain-independent methods.
5 Conclusion
To summarize, we have demonstrated that the semantics of the underlying graph domain provide valuable information for computing graph embeddings. We have presented a framework for devising biased random walk strategies that are informed by what the nodes and edges really mean, and that use knowledge of downstream machine learning tasks to identify which graph substructures are more relevant than others, and should therefore be included more often in graph walks. We have applied this framework in two real-life domains, and shown that the resulting embeddings are simple to implement and achieve equal or higher accuracy in machine learning tasks compared to domain-independent approaches. Our results also suggest that when this approach is applied to other real-life domains in the future, domain expertise may be a key ingredient in developing biasing algorithms that outperform domain-independent methods.
Footnotes
 https://snap.stanford.edu/data/
 https://lod-cloud.net/
 https://www.w3.org/RDF/
 with the exception of “stop words” such as “a”, “an”, “the” etc.
 Entity sets and relationship sets in ER terminology
 http://www.aifb.kit.edu/web/Hauptseite
 However, the “subtopic” relationship amongst topics, as depicted in Figure 1 by Bloehdorn and Sure (2007), is not present in the RDF graph.
 https://github.com/jbarrasa/neosemantics
 https://neo4j.com/
 http://py2neo.org
 https://radimrehurek.com/gensim/
References
 Christian Bizer, Tom Heath, and Tim Berners-Lee. Linked data - the story so far. Int. J. Semantic Web Inf. Syst., 5:1–22, 2009.
 Stephan Bloehdorn and York Sure. Kernel methods for mining instance data in ontologies. In Karl Aberer, Key-Sun Choi, Natasha Fridman Noy, Dean Allemang, Kyung-Il Lee, Lyndon J. B. Nixon, Jennifer Golbeck, Peter Mika, Diana Maynard, Riichiro Mizoguchi, Guus Schreiber, and Philippe Cudré-Mauroux, editors, The Semantic Web, 6th International Semantic Web Conference, 2nd Asian Semantic Web Conference, ISWC 2007 + ASWC 2007, Busan, Korea, November 11-15, 2007, volume 4825 of Lecture Notes in Computer Science, pages 58–71. Springer, 2007. ISBN 9783540762973. doi: 10.1007/9783540762980_5. URL https://doi.org/10.1007/9783540762980_5.
 Longbing Cao, Chengqi Zhang, Thorsten Joachims, Geoffrey I. Webb, Dragos D. Margineantu, and Graham Williams, editors. Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, August 10-13, 2015. ACM, 2015. ISBN 9781450336642. URL http://dl.acm.org/citation.cfm?id=2783258.
 Shaosheng Cao, Wei Lu, and Qiongkai Xu. Deep neural networks for learning graph representations. In Dale Schuurmans and Michael P. Wellman, editors, Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, pages 1145–1152. AAAI Press, 2016. ISBN 9781577357605. URL http://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/view/12423.
 Shiyu Chang, Wei Han, Jiliang Tang, Guo-Jun Qi, Charu C. Aggarwal, and Thomas S. Huang. Heterogeneous network embedding via deep architectures. In Cao et al. (2015), pages 119–128. ISBN 9781450336642. doi: 10.1145/2783258.2783296. URL http://doi.acm.org/10.1145/2783258.2783296.
 Michael Cochez, Petar Ristoski, Simone Paolo Ponzetto, and Heiko Paulheim. Biased graph walks for RDF graph embeddings. In Rajendra Akerkar, Alfredo Cuzzocrea, Jiannong Cao, and Mohand-Said Hacid, editors, Proceedings of the 7th International Conference on Web Intelligence, Mining and Semantics, WIMS 2017, Amantea, Italy, June 19-22, 2017, pages 21:1–21:12. ACM, 2017. ISBN 9781450352253. doi: 10.1145/3102254.3102279. URL http://doi.acm.org/10.1145/3102254.3102279.
 Gerben Klaas Dirk de Vries. A fast approximation of the Weisfeiler-Lehman graph kernel for RDF data. In Hendrik Blockeel, Kristian Kersting, Siegfried Nijssen, and Filip Zelezný, editors, Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2013, Prague, Czech Republic, September 23-27, 2013, Proceedings, Part I, volume 8188 of Lecture Notes in Computer Science, pages 606–621. Springer, 2013. ISBN 9783642409875. doi: 10.1007/9783642409882_39. URL https://doi.org/10.1007/9783642409882_39.
 Gerben Klaas Dirk de Vries and Steven de Rooij. Substructure counting graph kernels for machine learning from RDF data. J. Web Sem., 35:71–84, 2015. doi: 10.1016/j.websem.2015.08.002. URL https://doi.org/10.1016/j.websem.2015.08.002.
 Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Balaji Krishnapuram, Mohak Shah, Alexander J. Smola, Charu C. Aggarwal, Dou Shen, and Rajeev Rastogi, editors, Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, pages 855–864. ACM, 2016. ISBN 9781450342322. doi: 10.1145/2939672.2939754. URL http://doi.acm.org/10.1145/2939672.2939754.
 Mikael Henaff, Joan Bruna, and Yann LeCun. Deep convolutional networks on graphstructured data. CoRR, abs/1506.05163, 2015. URL http://arxiv.org/abs/1506.05163.
 Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014.
 C. McCormick. Word2vec tutorial - the skip-gram model. Retrieved from http://www.mccormickml.com, April 2016.
 Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013. URL http://arxiv.org/abs/1301.3781.
 Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs. In Maria-Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, volume 48 of JMLR Workshop and Conference Proceedings, pages 2014–2023. JMLR.org, 2016. URL http://jmlr.org/proceedings/papers/v48/niepert16.html.
 Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: online learning of social representations. In Sofus A. Macskassy, Claudia Perlich, Jure Leskovec, Wei Wang, and Rayid Ghani, editors, The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, New York, NY, USA, August 24-27, 2014, pages 701–710. ACM, 2014. ISBN 9781450329569. doi: 10.1145/2623330.2623732. URL http://doi.acm.org/10.1145/2623330.2623732.
 Radim Řehůřek and Petr Sojka. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta, May 2010. ELRA. http://is.muni.cz/publication/884893/en.
 Petar Ristoski and Heiko Paulheim. Rdf2vec: RDF graph embeddings for data mining. In Paul T. Groth, Elena Simperl, Alasdair J. G. Gray, Marta Sabou, Markus Krötzsch, Freddy Lécué, Fabian Flöck, and Yolanda Gil, editors, The Semantic Web - ISWC 2016 - 15th International Semantic Web Conference, Kobe, Japan, October 17-21, 2016, Proceedings, Part I, volume 9981 of Lecture Notes in Computer Science, pages 498–514, 2016. ISBN 9783319465227. doi: 10.1007/9783319465234_30. URL https://doi.org/10.1007/9783319465234_30.
 Petar Ristoski, Gerben Klaas Dirk de Vries, and Heiko Paulheim. A collection of benchmark datasets for systematic evaluations of machine learning on the semantic web. In Paul T. Groth, Elena Simperl, Alasdair J. G. Gray, Marta Sabou, Markus Krötzsch, Freddy Lécué, Fabian Flöck, and Yolanda Gil, editors, The Semantic Web - ISWC 2016 - 15th International Semantic Web Conference, Kobe, Japan, October 17-21, 2016, Proceedings, Part II, volume 9982 of Lecture Notes in Computer Science, pages 186–194, 2016. ISBN 9783319465463. doi: 10.1007/9783319465470_20. URL https://doi.org/10.1007/9783319465470_20.
 Max Schmachtenberg, Christian Bizer, and Heiko Paulheim. Adoption of the linked data best practices in different topical domains. In Peter Mika, Tania Tudorache, Abraham Bernstein, Chris Welty, Craig A. Knoblock, Denny Vrandecic, Paul T. Groth, Natasha F. Noy, Krzysztof Janowicz, and Carole A. Goble, editors, The Semantic Web - ISWC 2014 - 13th International Semantic Web Conference, Riva del Garda, Italy, October 19-23, 2014, Proceedings, Part I, volume 8796 of Lecture Notes in Computer Science, pages 245–260. Springer, 2014. ISBN 9783319119632. doi: 10.1007/9783319119649_16. URL https://doi.org/10.1007/9783319119649_16.
 Pinar Yanardag and S. V. N. Vishwanathan. Deep graph kernels. In Cao et al. (2015), pages 1365–1374. ISBN 9781450336642. doi: 10.1145/2783258.2783417. URL http://doi.acm.org/10.1145/2783258.2783417.
 Marinka Zitnik, Rok Sosič, Sagar Maheshwari, and Jure Leskovec. BioSNAP Datasets: Stanford biomedical network dataset collection. http://snap.stanford.edu/biodata, August 2018.