motif2vec: Motif Aware Node Representation Learning for Heterogeneous Networks

Manoj Reddy Dareddy* (*Work done as an intern at Visa Research)
University of California, Los Angeles
Los Angeles, CA, USA
mdareddy@g.ucla.edu
   Mahashweta Das Visa Research
Palo Alto, CA, USA
mahdas@visa.com
   Hao Yang Visa Research
Palo Alto, CA, USA
haoyang@visa.com
Abstract

Recent years have witnessed a surge of interest in machine learning on graphs and networks, with applications ranging from vehicular network design to IoT traffic management to social network recommendations. Supervised machine learning tasks in networks, such as node classification and link prediction, require us to perform feature engineering, which is widely acknowledged to be the key to success in applied machine learning. Research efforts dedicated to representation learning, especially representation learning using deep learning, have shown us ways to automatically learn relevant features from vast amounts of potentially noisy, raw data. However, most of these methods are not adequate for heterogeneous information networks, which represent most real-world data today: they cannot preserve the structure and semantics of multiple types of nodes and links well enough, capture higher-order heterogeneous connectivity patterns, or ensure coverage of the nodes for which representations are generated. In this paper, we propose motif2vec, a novel efficient algorithm that learns node representations or embeddings for heterogeneous networks. Specifically, we leverage higher-order, recurring, and statistically significant network connectivity patterns in the form of motifs to transform the original graph to motif graph(s), conduct biased random walks to efficiently explore higher-order neighborhoods, and then employ a heterogeneous skip-gram model to generate the embeddings. Unlike previous efforts that use different graph meta-structures to guide the random walk, we use graph motifs to transform the original network and preserve the heterogeneity. We evaluate the proposed algorithm on multiple real-world networks from diverse domains and against existing state-of-the-art methods on multi-class node classification and link prediction tasks, and demonstrate its consistent superiority over prior work.

Index Terms: heterogeneous information networks, network embedding, network representation learning, feature learning, motifs

I Introduction

Recent years have witnessed a surge of interest in machine learning on graphs and networks with applications ranging from vehicular network design to IoT traffic management to drug discovery to social network recommendations. Graph-based data representation enables us to understand objects with respect to the neighboring world instead of just observing them in isolation. Thus, there is an increasing trend of representing data that is not naturally connected as graphs. Examples include an item graph constructed from users' behavior history that is originally sequential in nature [35], a product review graph constructed from reviews written by users for stores [34], a credit card fraud network constructed from fraudulent and non-fraudulent transaction activity data [33], etc.

Supervised machine learning tasks over nodes and links in networks (we use the terms network (nodes, links) and graph (vertices, edges) interchangeably throughout the paper), such as node classification and link prediction, require us to perform feature engineering, which is widely acknowledged to be the key to success in applied machine learning. However, feature engineering is challenging and tedious, since the traditional process relies on domain knowledge, intuition, data manipulation, and manual intervention. Research efforts dedicated to representation learning, i.e., learning representations of the data that make it easier to extract useful information when training classifiers or other predictors, have shown us ways to automatically learn relevant features from vast amounts of potentially noisy, raw data. Of particular interest to the academic and industry research community has been representation learning using deep learning [2], where representations are formed by the composition of multiple non-linear transformations with the goal of yielding more useful representations. There has been a series of works over the past half-decade focusing on graph node representation or graph embedding algorithms [7]. The common goal of these works is to obtain a low-dimensional feature representation of each node of the graph such that the method is scalable and the vector representation preserves some structure and connectivity pattern between individual nodes in the graph. Graph embedding methods are broadly classified into three categories, namely factorization based, random walk based, and deep learning based, with applications in network compression, visualization, clustering, link prediction, and node classification [7]. Among the three categories, random walk based graph embedding techniques have emerged to be the most popular, since they help approximate many network properties, are useful when the network is too large to measure in its entirety, and can work with partially observable networks. Popular random-walk based methods include DeepWalk [18], node2vec [8], LINE [29], HARP [5], etc.

However, most of these methods are designed for homogeneous networks and are inadequate for heterogeneous information networks, i.e., networks with multiple types of nodes and links, which represent most real-world data today. Contemporary information networks like Facebook, DBLP, Yelp, Flickr, etc. contain multi-type interacting components. For example, the social network Facebook has different types of objects (nodes), such as users, posts, and photos, as well as different kinds of associations (links), such as user-user friendship, person-photo tagging relationships, post-post replying relationships, etc. Researchers today acknowledge that heterogeneous networks fuse more information and support richer semantic representations of the real world [22][24]. They also emphasize that data mining approaches designed for homogeneous graphs are not well-suited to handle heterogeneous graphs. For example, classification in homogeneous networks is traditionally done on objects of the same entity type, makes strong assumptions on the network structure, and assumes that data is independently and identically distributed (i.i.d.). Contrarily, classification in heterogeneous networks needs to simultaneously classify multiple types of objects, which may be organized arbitrarily and may violate the i.i.d. assumption. Thus, there is an innate need to develop graph embedding methods for heterogeneous networks.

Dong et al. formally introduced the problem and proposed a novel algorithmic solution, metapath2vec [6], that leverages the metapath, the most popular graph meta-structure for heterogeneous network mining [24]. A more recent work proposed metagraph2vec [37], which leverages the metagraph in order to capture richer structural contexts and semantics between distant nodes. Other heterogeneous network embedding methods include PTE [28], a semi-supervised representation learning method for text data; HNE [4], which learns a representation for each modality of the network separately and then unifies them into a common space using linear transformations; LANE [11], which generates embeddings for attributed networks; and ASPEM [23], which captures the incompatibility in heterogeneous networks by decomposing the input graph into multiple aspects and learning embeddings independently for each aspect. None of PTE, HNE, LANE, or ASPEM addresses the generic task of task-independent heterogeneous network embedding learning. The heterogeneity in PTE stems from links in a text network while the raw input belongs to the same object type; HNE works on a heterogeneous graph with image and text where the simultaneous interactions among multi-typed objects are decomposed into several scattered pairwise interactions in a single-typed network; LANE defines heterogeneity as diverse information sources (namely, network topology and node label information) that need to be jointly learnt; while ASPEM models heterogeneous network incompatibility and learns embeddings for each aspect independently. Among the related art, metapath2vec and metagraph2vec consider the general problem of learning node representations for heterogeneous networks. However, these methods cannot preserve the structure and semantics of multi-type nodes and links well enough, capture higher-order heterogeneous connectivity patterns, or ensure coverage of the nodes for which representations are generated, as demonstrated in Section IV.

In this paper, we propose a novel efficient algorithm, motif2vec, that learns node representations or embeddings for heterogeneous information networks. Specifically, we leverage higher-order, recurring, and statistically significant network connectivity patterns in the form of motifs to learn higher quality embeddings. Motifs are one of the most common higher-order structures for understanding complex networks and have been popularly recognized as fundamental units of networks [3]. They have been successfully used in many network mining tasks such as clustering [32][36], anomaly detection [31], and convolution [21]. However, no prior work has investigated the scope and impact of motifs in learning node embeddings for heterogeneous networks. Rossi et al. introduced the problem of higher-order network representation learning using motifs for homogeneous networks [20], but the method cannot be extended to handle heterogeneous networks. HONE [20] does not combine the best of both worlds: a random walk based method that accounts for local neighborhood structure and a motif-aware method that accounts for higher-order global network connectivity patterns, as we do. In addition, HONE (as well as other existing methods) does not include the original network in the learning process, as we do. The latter ensures higher coverage of connected nodes.

Our algorithm motif2vec transforms the original graph to motif graph(s), conducts biased random walks to efficiently explore higher-order neighborhoods, and then employs a heterogeneous skip-gram model to generate the embeddings. Related efforts in heterogeneous network node embedding, namely metapath2vec [6] and metagraph2vec [37], are limited to exploring only the neighborhoods, nodes, and links participating in the meta-structure of interest. motif2vec leverages motifs to transform the original graph to a motif representation and conducts regular random walks on the entire transformed graph. We evaluate our algorithm on multiple real-world networks from diverse domains and against existing state-of-the-art techniques on multiple machine learning tasks, and demonstrate its consistent superiority over prior work. To summarize, we make the following contributions:

  • We propose motif2vec, a novel, efficient, and effective algorithm for representation learning in heterogeneous information networks. Specifically, we leverage higher-order, recurring, and statistically significant network connectivity patterns in the form of motifs to learn higher quality embeddings.

  • Our method preserves both local and global higher-order structural relationships as well as semantic correlations in a heterogeneous network. Unlike existing efforts, our method does not focus on refining the random walk to achieve this goal. Instead, we present a graph transformation method that enables us to capture subgraph pattern significance.

  • We empirically evaluate our algorithm for multiple heterogeneous network mining tasks, namely multi-class classification and link prediction on multiple real-world datasets from different domains and demonstrate its consistent superiority over state-of-the-art baselines.

II Preliminaries

We introduce our problem definition and related concepts and notations before presenting our framework in Section III.

Definition II.1

Heterogeneous Information Network: A heterogeneous information network is defined as a directed graph $G = (V, E, \mathcal{O}, \mathcal{R})$ in which each node $v \in V$ is associated with a node type mapping function $\phi(v): V \rightarrow \mathcal{O}$ and each link $e \in E$ is associated with a link type mapping function $\psi(e): E \rightarrow \mathcal{R}$. $\mathcal{O}$ and $\mathcal{R}$ denote the sets of node types and link types in $G$, with $|\mathcal{O}| > 1$ and $|\mathcal{R}| > 1$.

(a) DBLP
(b) Yelp
Fig. 1: Examples of heterogeneous network schema

Examples of heterogeneous information networks include the popular DBLP bibliographic network and the Yelp social information network. Figure 1 presents the network schema, i.e., the meta-template for an information network, for each of the example instances. In Figure 1(a), multiple types of objects such as authors, papers, conference venues, author organizations, and paper keywords are connected by multiple types of relationships such as authorship (author → paper), affiliation (author → organization), etc. In Figure 1(b), multiple types of objects such as users, businesses, business locations, user reviews, and review terms are connected by multiple types of relationships such as check-in (user → business), etc.

We define our representation learning task on such a heterogeneous network.

Definition II.2

Heterogeneous Network Representation Learning: Given a heterogeneous network $G$, the goal of representation learning is to learn a function $f: V \rightarrow \mathbb{R}^d$ that maps nodes in $G$ to $d$-dimensional features in vector space $\mathbb{R}^d$, $d \ll |V|$, such that network structure and semantic heterogeneity are preserved.

We leverage motifs to design our heterogeneous network representation learning method.

Definition II.3

Heterogeneous Network Motif: A network motif $M = (V_M, E_M, \mathcal{O}_M, \mathcal{R}_M)$ is an isomorphic induced directed subgraph consisting of a subset of $k$ nodes from directed heterogeneous network $G$, with $V_M \subseteq V$, $E_M \subseteq E$, $\mathcal{O}_M \subseteq \mathcal{O}$, $\mathcal{R}_M \subseteq \mathcal{R}$, such that:

(i) $|V_M| = k$,

(ii) $E_M$ consists of all of the edges in $E$ that have both endpoints in $V_M$,

(iii) $\phi(v) \in \mathcal{O}_M$ iff $v \in V_M$ for mapping function $\phi: V \rightarrow \mathcal{O}$, and

(iv) the frequency of appearance of $M$ in $G$ is above a predefined threshold (i.e., $M$ is statistically significant).

A recurring pattern is considered statistically significant if the frequency of its appearance in a graph is significantly higher than the frequency of its appearance in any randomized network.
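To make this criterion concrete, a rough sketch of such a significance test is shown below; this is our illustration, not the paper's procedure. It compares a motif's observed count in G against degree-preserving random rewirings; count_motif_instances is a hypothetical counting routine, and the rewiring is shown on an undirected copy since NetworkX's double_edge_swap requires an undirected graph.

```python
import networkx as nx

def is_significant(G, count_motif_instances, num_random=100, z_threshold=2.0):
    """Rough significance test: the motif is significant if its observed
    count exceeds the randomized mean by z_threshold standard deviations."""
    observed = count_motif_instances(G)
    counts = []
    for _ in range(num_random):
        R = nx.Graph(G)                          # undirected copy to rewire
        nswap = 2 * R.number_of_edges()
        nx.double_edge_swap(R, nswap=nswap, max_tries=10 * nswap)  # keeps degrees
        counts.append(count_motif_instances(R))
    mean = sum(counts) / num_random
    std = (sum((c - mean) ** 2 for c in counts) / num_random) ** 0.5
    return observed - mean > z_threshold * (std + 1e-9)
```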

Motifs are one of the most common higher-order structures for understanding complex networks and have been popularly recognized as fundamental units of networks [3]. They have been successfully used in many network mining tasks such as clustering [32], anomaly detection (densest subgraph sparsifiers) [31], and convolution [21]. In this work, we focus on directed motifs and directed heterogeneous networks since they offer greater scope for representing rich semantics. Figure 2(a) presents all possible 3-node network motifs. Figure 2(b) is a toy example showing how to find motif instance(s) in a graph [3].

In the toy example of Figure 2(b), Figure 2(b)(right) depicts the graph and Figure 2(b)(left) depicts the motif. We observe that there are two instances of the motif in the graph: (i) ({a, b, c}, {a, b}) and (ii) ({a, b, e}, {a, b}). The candidate ({a, b, d}, {a, b}) is not included as an instance because the induced subgraph on the nodes a, b, and d is not isomorphic to the motif.

Motifs are distinctly different from some of the other popular graph meta-structures such as metapath and metagraph. We discuss this in detail in Section IV.

(a) All 3-node network motifs
(b) Toy example
Fig. 2: Introduction to motifs

Problem 1
Given a directed unweighted heterogeneous information network and a set of network motifs, learn low-dimensional latent representations for each node in the network such that the higher-order heterogeneous network neighborhood structure and semantics are preserved.

III The motif2vec Framework

We present our general motif2vec framework that learns high quality node embeddings for heterogeneous networks. Our approach returns representations that help maximize the likelihood of preserving network neighborhoods of multi-type nodes and links.

Skip-gram Model: First, we introduce word2vec [15][16] and discuss its application to network embedding generation tasks. Mikolov et al. introduced the word2vec group of models that learn distributed representations of words in a corpus. Specifically, the skip-gram model learns high-quality vector representations of words from large amounts of unstructured text data. The algorithm scans the words in a document and aims to embed every word such that the word's features can predict nearby context words. DeepWalk [18] and node2vec [8] generalized the idea to a homogeneous network by converting the network into ordered sequences of nodes. For this, both methods sample sequences of nodes from the original network by random walk strategies.

Random Walk: A walk in a graph or directed graph is a sequence of nodes $(v_0, v_1, \ldots, v_l)$, not necessarily distinct, such that $(v_{i-1}, v_i) \in E$ for $i = 1, \ldots, l$. When the consecutive nodes in the sequence are selected at random, we generate a random sequence of nodes known as a random walk on the graph. A random walk on a graph is a special case of a Markov chain that is time-reversible. The probability of transition from node $u$ to node $v$ is a function of the out-degree of node $u$. We explore the neighborhood of a node in a graph or a directed graph using random walks. Specifically, we employ a biased random walk procedure that efficiently explores nodes' diverse neighborhoods in both breadth-first and depth-first search fashion [8].
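As a minimal sketch of such a walk, the following implements the second-order p/q biasing of [8] on a NetworkX-style graph; the unnormalized weighting scheme and the "weight" attribute name follow the usual node2vec conventions and are assumptions here.

```python
import random

def biased_walk(G, start, length, p=1.0, q=1.0):
    """node2vec-style biased walk: p (return) and q (in-out) trade off
    BFS-like and DFS-like exploration of the neighborhood."""
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbors = list(G.neighbors(cur))
        if not neighbors:
            break
        if len(walk) == 1:
            walk.append(random.choice(neighbors))
            continue
        prev = walk[-2]
        weights = []
        for nbr in neighbors:
            w = G[cur][nbr].get("weight", 1.0)
            if nbr == prev:                  # distance 0: return to source
                weights.append(w / p)
            elif G.has_edge(prev, nbr):      # distance 1: stay local (BFS-like)
                weights.append(w)
            else:                            # distance 2: move outward (DFS-like)
                weights.append(w / q)
        walk.append(random.choices(neighbors, weights=weights, k=1)[0])
    return walk
```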

Such a random walk combined with a skip-gram based embedding method learns feature representations $f(u)$ for node $u$ in a homogeneous graph that predict node $u$'s context neighborhood $N(u)$:

$$\max_f \sum_{u \in V} \log \Pr(N(u) \mid f(u)) \qquad (1)$$

Unlike all previous efforts belonging to this family of node embedding algorithms, which employ random walks on the original graph [6][8][18][29][37], we conduct random walks on a transformed graph, known as the motif graph.

Motif Graph: The network motif literature has defined several graph features and concepts for motifs, such as motif cut, motif volume, motif conductance, etc. [3]. We present one of them, namely the motif graph or motif adjacency matrix, which is used in our algorithmic framework. Given a directed heterogeneous network $G = (V, E, \mathcal{O}, \mathcal{R})$ and a motif set $\mathcal{M} = \{M_1, \ldots, M_T\}$, we compute the motif adjacency matrices $\{W_{M_1}, \ldots, W_{M_T}\}$. The weighted motif adjacency matrix $W_{M_t}$ for motif $M_t$ is defined as:

$$(W_{M_t})_{ij} = \text{number of instances of motif } M_t \text{ that contain both nodes } i \text{ and } j \qquad (2)$$

The motif adjacency matrix, also known as the motif co-occurrence matrix, differs from the original graph structurally. The motif graph captures pairwise relationships between nodes in the original graph with respect to a motif. The larger the value $(W_{M_t})_{ij}$ is, the more significant the relationship between nodes $i$ and $j$ is with respect to the motif $M_t$. The motif adjacency matrix can be either weighted or binary. In the latter case, $(W_{M_t})_{ij}$ is either 1 or 0, indicating the existence of a relationship between nodes $i$ and $j$ for motif $M_t$. The motif adjacency matrix is symmetric, and the motif graph is thereby undirected. All edges in the original graph may not exist in the motif graph, since a motif may not appear for a given edge. Conversely, the motif graph may contain edges absent from the original graph, since two nodes can co-occur in a motif instance without being directly connected. The edges in a motif graph are likely to have different weights than in the original graph, since one motif may appear at a different frequency than another for a given edge. Thus, the number of edges in a weighted motif graph is usually greater than the number of edges in the original graph.
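A sketch of how Equation 2 can be materialized from discovered motif instances, with each instance represented simply as a set of node ids; this illustrates the construction rather than the paper's exact implementation.

```python
from itertools import combinations
import networkx as nx

def build_motif_graph(nodes, motif_instances, weighted=True):
    """Motif graph: the weight of edge (i, j) counts the motif
    instances that contain both i and j; symmetric, hence undirected."""
    W = nx.Graph()
    W.add_nodes_from(nodes)
    for instance in motif_instances:             # instance: a set of nodes
        for u, v in combinations(instance, 2):
            if W.has_edge(u, v):
                if weighted:
                    W[u][v]["weight"] += 1.0
            else:
                W.add_edge(u, v, weight=1.0)
    return W
```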

We transform the original graph to a motif graph in order to simultaneously encode the heterogeneity in structure and semantics, and conduct random walks on the motif graph itself. Additionally, we conduct random walks on the original graph to ensure greater coverage of higher-order connected nodes, which may otherwise be missed due to their non-participation in popular motifs. This strategy enables our random walk to be independent of the type of node or link, unlike prior art for heterogeneous networks [6][37]. Note that meta-structure (metapath, metagraph, etc.) driven random walks limit the scope of a walk to explore higher-order diverse neighborhoods. The generated walk sequences are aggregated and shuffled before being fed to skip-gram. Our graph transformation, followed by graph meta-structure independent biased random walks, enables the sequences to carry both higher-order heterogeneous network structural patterns as well as heterogeneous semantic relationships. We demonstrate the superiority of our novel idea empirically in Section IV.

Fig. 3: The motif2vec framework

Figure 3 illustrates our motif2vec framework. Given a heterogeneous information network $G$ and a set of motifs $\mathcal{M}$, the goal of the framework is to output $d$-dimensional embedding vectors for each node in $G$. The steps of the framework presented in Figure 3 include: motif instance discovery, random walk sequence generation, aggregation and shuffling of the generated walk sequences, skip-gram neural net training, and finally embedding generation. We present the three phases of our framework next.

INPUT: Heterogeneous information network $G = (V, E)$, motif set $\mathcal{M} = \{M_1, \ldots, M_T\}$, embedding dimension $d$, walks per node $r$, walk length $l$, neighborhood size $w$, return parameter $p$, in-out parameter $q$

OUTPUT: Latent node representations $f: V \rightarrow \mathbb{R}^d$

1: Initialize motif instance set $\mathcal{I} \leftarrow \emptyset$
2: Initialize graph set $\mathcal{G} \leftarrow \{G\}$
3: $\mathcal{I} \leftarrow$ Discover-Motif-Instances($G$, $\mathcal{M}$)
4: Initialize motif graphs $G_{M_1}, \ldots, G_{M_T}$
5: for $t \leftarrow 1$ to $T$ do
6:      $G_{M_t} \leftarrow$ Create-Weighted-Motif-Graph($G$, $\mathcal{I}$, $M_t$); add $G_{M_t}$ to $\mathcal{G}$
7: Initialize walks $W \leftarrow \emptyset$
8: for each graph $G' \in \mathcal{G}$ and iteration $i \leftarrow 1$ to $r$ do
9:      for all nodes $v \in V$ do // current iteration
10:          $walk \leftarrow$ Generate-Random-Walk($G'$, $v$, $l$, $p$, $q$)
11:          Append $walk$ to $W$
12: Initialize sequences $S \leftarrow \emptyset$
13: $S \leftarrow$ Shuffle($W$)
14: $f \leftarrow$ Skip-Gram-Model($S$, $d$, $w$)
15: Return $f$
Algorithm 1 The motif2vec algorithm

Network Transformation: First, we find instances of the motif(s) under consideration in the original network. This is referred to as the motif discovery task in the literature and is a computationally expensive operation. Many motif discovery algorithms have been proposed over the years, each with the intent of improving the computational aspects of the state-of-the-art [12]. We use the method presented in [9] for motif discovery. Once the motif instances are obtained, we compute the weighted motif adjacency matrix. Thus, we transform the original network to motif graph(s) that encode the heterogeneity in network structure and semantics.
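A sketch of this step with NetworkX's subgraph isomorphism matcher [9]; we assume node types are stored in a "type" node attribute (the attribute name is our choice).

```python
from networkx.algorithms import isomorphism

def discover_motif_instances(G, motif):
    """Enumerate induced subgraphs of directed graph G that are
    isomorphic to the directed motif, matching heterogeneous node types."""
    matcher = isomorphism.DiGraphMatcher(
        G, motif,
        node_match=isomorphism.categorical_node_match("type", None))
    instances = set()
    for mapping in matcher.subgraph_isomorphisms_iter():
        instances.add(frozenset(mapping.keys()))   # keep each node set once
    return instances
```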

Sequence Generation: Next, we generate random walk sequences on both the motif graph(s) and the original graph. We use the method in [8] for generating sequences. We aggregate and shuffle the sequences generated from the original and the motif graph(s) before feeding them to the neural net. Thus, our generated walk sequences encompass both local and global heterogeneous network connectivity.
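A sketch of the aggregation and shuffling, reusing the biased_walk helper sketched earlier; walks are stringified since skip-gram implementations expect token sequences.

```python
import random

def generate_corpus(graphs, walks_per_node, walk_length, p, q):
    """Walk on the original graph plus every motif graph, then shuffle
    so local and higher-order contexts are interleaved for skip-gram."""
    corpus = []
    for g in graphs:                      # [original graph] + motif graph(s)
        for _ in range(walks_per_node):
            for node in g.nodes():
                walk = biased_walk(g, node, walk_length, p, q)
                corpus.append([str(n) for n in walk])
    random.shuffle(corpus)
    return corpus
```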

Embedding Generation: Finally, we input the walk sequences from the previous step and output node embeddings. We use the skip-gram neural net model architecture in [19] for learning the latent feature representations. We minimize our optimization function using SGD with negative sampling, which is known to learn accurate representations efficiently [16]. Following Equation 1, we optimize the embedding $f(u)$ for node $u$ in graph $G$ for random walk co-occurrences according to:

$$\max_f \sum_{u \in V} \sum_{v \in N(u)} \log \frac{\exp(f(v) \cdot f(u))}{\sum_{n \in V} \exp(f(n) \cdot f(u))} \qquad (3)$$

where $N(u)$ is the neighborhood of node $u$ and node $v$ is seen on a random walk starting from node $u$.
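Since [19] is the gensim toolkit, this step reduces to a skip-gram Word2Vec call with negative sampling; a sketch using gensim 4.x argument names and the parameter values from Section IV (negative=5 is the gensim default, an assumption here):

```python
from gensim.models import Word2Vec

def learn_embeddings(corpus, dim=128, window=10):
    """Skip-gram (sg=1) with negative sampling on the shuffled walks;
    min_count=0 keeps every node so coverage is not lost."""
    model = Word2Vec(sentences=corpus, vector_size=dim, window=window,
                     sg=1, negative=5, min_count=0, workers=24, epochs=1)
    return model.wv          # model.wv[str(node)] is the node's embedding
```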

The pseudo-code for motif2vec is presented in Algorithm 1.
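Stringing the hypothetical helpers sketched above together gives a compact rendering of Algorithm 1:

```python
def motif2vec(G, motifs, dim=128, r=10, l=80, w=10, p=1.0, q=1.0):
    """End-to-end sketch: transform, walk, shuffle, embed."""
    graphs = [G]                           # keep the original graph for coverage
    for motif in motifs:
        instances = discover_motif_instances(G, motif)
        graphs.append(build_motif_graph(G.nodes(), instances))
    corpus = generate_corpus(graphs, walks_per_node=r, walk_length=l, p=p, q=q)
    return learn_embeddings(corpus, dim=dim, window=w)
```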

IV Experiments

We evaluate the heterogeneous network node embeddings obtained through motif2vec on two standard supervised machine learning tasks: multi-class node classification and link prediction.

IV-A Experimental Setup

We compare motif2vec with several recent network representation learning algorithms on multiple datasets.

IV-A1 Datasets

We use four popular publicly available heterogeneous network datasets from the literature:

DBLP-P Dataset: It is a bibliographic network composed of three types of nodes: author (A), paper (P), and venue (V), connected by three types of links: A → P, P → V, and P → P. We use a subset of the DBLP dataset made available by [21][26] for the paper classification task. The papers are labeled as belonging to 10 classes, such as information retrieval, databases, networking, artificial intelligence, operating systems, etc., extracted from Cora [14]. There are 17,411 authors (A), 18,059 papers (P), and 300 conferences, i.e., venues (V).

AMiner-CS Dataset: It is another bibliographic network composed of three types of nodes: author (A), paper (P), and venue (V), connected by three types of links: A → P, P → V, and P → P. We use a version of the AMiner Computer Science (CS) dataset made available by [6] for author classification. It comprises 1,693,531 authors (A), 3,194,405 papers (P), and 3,883 venues (V). Author research categories are labeled as belonging to 8 classes, such as theoretical computer science, computer graphics, human computer interaction, computer vision and pattern recognition, etc., based on the categories in Google Scholar [6]. There are 246,678 labeled authors in this dataset. We use it for the author node classification task.

Fig. 4: Heterogeneous information network schemas

Yelp-Restaurant Dataset: We consider the data obtained from the 12th round of the Yelp Dataset Challenge. We build a heterogeneous network composed of four types of nodes: users (U), businesses, i.e., restaurants (R), locations (L), and categories (C), connected by three types of links: U → R, R → C, and C → L. The Yelp dataset includes users who have very few reviews; in fact, about 49% of the users have only one review [38], making the dataset very sparse and hence difficult for evaluation purposes. Following the common practice of other works (e.g., [38]), we filter out users with fewer than twenty business reviews over fourteen years (2004 - 2018). There are 36,432 users (U), 18,256 restaurants (R), 5,514 locations (L), and 419 categories (C). We use this data for the U → R link prediction task.

Amazon-Electronics Dataset: We consider the Amazon-200k dataset [10][39], which contains ratings provided by users on electronics items in Amazon. Similar to the Yelp-Restaurant dataset, we build a heterogeneous network composed of four types of nodes: users (U), items (I), brands (B), and categories (C), connected by three types of links: U → I, I → B, and I → C. There are 59,297 users (U), 21,000 items (I), 2,059 brands (B), and 683 categories (C). We use this dataset for U → I link prediction.

Dataset #Nodes #Links
DBLP-P 35,770 131,636
AMiner-CS 4,891,819 12,506,615
Yelp-Restaurant 60,621 189,423
Amazon-Electronics 83,039 284,650
TABLE I: Heterogeneous information network statistics
Method             DBLP-P    AMiner-CS    Yelp-Restaurant    Amazon-Electronics
motif2vec           78.80        91.68              58.38                 58.90
metapath2vec        60.08        73.90              43.30                 50.89
metapath2vec++      49.40        72.31              29.21                 57.02
metagraph2vec       64.48        82.09              29.24                 55.53
metagraph2vec++     53.24        35.58              39.60                 60.02
TABLE II: Quantitative results (accuracy in %) for different machine learning tasks, datasets, and embedding methods under the same experimental settings. DBLP-P and AMiner-CS report multi-class node classification; Yelp-Restaurant and Amazon-Electronics report link prediction.

We build heterogeneous information networks out of each of the datasets. The network schema and statistics are presented in Figure 4 and Table I respectively.

IV-A2 Baseline Methods

We compare motif2vec with recent network representation learning methods focused on heterogeneous networks. Specifically, we focus on the family of node embedding methods to which motif2vec belongs.

metapath2vec, metapath2vec++ [6]: Dong et al. study the problem of representation learning in heterogeneous networks. They propose two models, metapath2vec and metapath2vec++, which first leverage meta-path based random walks to construct the heterogeneous neighborhood of a node and then leverage a heterogeneous skip-gram model to generate the embeddings.

metagraph2vec, metagraph2vec++ [37]: Zhang et al. proposed a network embedding learning method for heterogeneous networks that leverages metagraph to capture richer structural contexts and semantics between distant nodes. The method uses metagraph to guide the generation of random walks and then skip-gram model to learn latent embeddings of multi-typed heterogeneous network nodes. metagraph2vec uses homogeneous skip-gram model while metagraph2vec++ uses heterogeneous skip-gram model.

The authors in [6] demonstrate how their proposed method beats some of the popular state-of-the-art methods at that time, namely DeepWalk [18], node2vec [8], LINE [29], Spectral Clustering [30], and Graph Factorization [1]. Thus, we exclude them from our experiments. Note that each of these works considers homogeneous networks.

IV-A3 Machine Learning Tasks

We consider two standard supervised machine learning tasks to evaluate our embeddings.

Node Classification: Node classification is a downstream machine learning task that classifies nodes in a heterogeneous network into a pre-defined set of classes. We follow a standard classification setup and use the generated node embeddings as features for the classifier. We conduct paper node multi-class classification for the DBLP-P data and author node classification for the AMiner-CS data. The classifier, parameter values, and train/test data are fixed across the various embedding approaches to avoid any confounding factors. We use a traditional SVM classifier for both the DBLP-P and AMiner-CS datasets, without any parameter tuning.
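A sketch of this evaluation protocol with scikit-learn (default, untuned SVC and a fixed 70:30 split, as stated):

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def evaluate_node_classification(embeddings, nodes, labels):
    """Node embeddings as features for an untuned SVM classifier."""
    X = [embeddings[str(n)] for n in nodes]
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, train_size=0.7, random_state=0)
    clf = SVC()                        # default parameters, no tuning
    clf.fit(X_train, y_train)
    return accuracy_score(y_test, clf.predict(X_test))
```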

Link Prediction: Link prediction is a widely popular machine learning task in heterogeneous networks that predicts links likely to be added to the network in the near future. We leverage the node embedding features to predict links. We partition the links in a network into train and test instances in order to hide a fraction of the existing links during embedding learning. The probability of a link appearing between two nodes is calculated by computing the cosine similarity between the respective feature vector embeddings. In our experiments, if the embedding-based similarity score between a pair of nodes is higher than a threshold, we infer that an edge could exist between the two nodes. In order to penalize embeddings that generate a high similarity value for any random pair of nodes, we generate an equal number of fake links in the test set; these correspond to links that do not exist in the original network. The intuition is that the embeddings are expected to return a similarity score below the threshold for these fake links. We evaluate our embeddings for link prediction on the Yelp-Restaurant and Amazon-Electronics datasets.
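A sketch of this scoring protocol; the similarity threshold of 0.5 is an illustrative assumption, all_nodes is assumed to be a list, and existing is the set of links in the original network.

```python
import random
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def evaluate_link_prediction(emb, test_links, all_nodes, existing, threshold=0.5):
    """Predict a link when cosine similarity exceeds the threshold;
    sampled fake links penalize embeddings that score any pair highly."""
    fake_links = set()
    while len(fake_links) < len(test_links):
        u, v = random.sample(all_nodes, 2)
        if (u, v) not in existing:
            fake_links.add((u, v))
    correct = sum(cosine(emb[str(u)], emb[str(v)]) > threshold
                  for u, v in test_links)
    correct += sum(cosine(emb[str(u)], emb[str(v)]) <= threshold
                   for u, v in fake_links)
    return correct / (len(test_links) + len(fake_links))
```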

IV-A4 Evaluation Metric

Accuracy is a popular evaluation metric in traditional classification tasks, and we adopt it here. We perform a standard 70:30 train/test split and measure the percentage of correct predictions on the test instances during multi-class classification. For link prediction, we split the links 70:30 such that the links present in the test set are removed from the original network on which the embeddings are learnt. We measure the percentage of correct predictions, i.e., correctly inferred presence or absence of links, for the test instances. We refer to our link prediction evaluation metric as accuracy as well.

IV-A5 Settings

For all embedding methods, we use the exact same parameters listed below. The parameter settings used are in line with typical values used in prior art [8].

The embedding vector dimension $d$: 128

The walk length $l$: 80

The number of walks per node $r$: 10

The context size for optimization $w$: 10

Random walk return parameter $p$: 1

Random walk in-out parameter $q$: 1

The optimization is run for a single epoch. Each of our reported numbers is an average of five runs. All code is implemented in Python. All experiments are conducted on a Linux machine with a 2.60GHz Intel processor, 28 CPU cores, and 800GB RAM.

Fig. 5: Example 3-node, 4-node, and 5-node motifs for DBLP-P/AMiner-CS datasets for the heterogeneous network schema in Figure 4(a)
Fig. 6: metapath and metagraph for DBLP-P/AMiner-CS datasets for the heterogeneous network schema in Figure 4(a)

IV-B Results: Accuracy

Table II presents our experimental results for the different machine learning tasks (multi-class node classification and link prediction), datasets (DBLP-P, AMiner-CS, Yelp-Restaurant, and Amazon-Electronics), and embedding methods (motif2vec, metapath2vec, metapath2vec++, metagraph2vec, and metagraph2vec++) under the same configuration parameter settings, as detailed in Section IV-A. We observe that our algorithm motif2vec consistently and significantly outperforms the baseline methods for both tasks and across all four datasets.

For paper node classification task on DBLP-P data, motif2vec beats the best baseline by 22%. For author research category classification task on AMiner-CS data, motif2vec beats the best baseline by 24%. For link prediction task on Yelp-Restaurant data, motif2vec achieves 34% improvement, while for link prediction on Amazon-Electronics data, motif2vec achieves 3% improvement over the second best baseline method. We observe that metagraph2vec++ is the best algorithm for Amazon-Electronics dataset. However, metagraph2vec and metagraph2vec++ are fairly inconsistent, as is evident from their accuracy numbers for the remaining datasets in the table. The authors in [6] and [37] introduce the “++” version of metapath2vec and metagraph2vec since the heterogeneous skip-gram model is expected to accommodate the heterogeneity in network better. However, the results presented by the authors in both works fail to showcase the steady benefits of heterogeneous skip-gram. We do not propose an extended version of motif2vec in this paper.

Motif                 Accuracy (in %)
A 3-node motif                  78.50
A 4-node motif                  78.80
A 5-node motif                  78.43
All 3-node motifs               78.00
All 4-node motifs               77.75
TABLE III: Multi-class paper node classification for DBLP-P

In summary, our method consistently learns significantly better heterogeneous network node embeddings than existing state-of-the-art methods, achieving relative improvements as high as 24% and 34% over benchmarks for classification and prediction, respectively. This is primarily because transforming the original complex graph to a motif graph helps accommodate heterogeneous network structure and semantic heterogeneity effectively.

(a) Yelp-Restaurant
(b) Amazon-Electronics
Fig. 7: motif, metapath, and metagraph for Yelp-Restaurant and Amazon-Electronics datasets for the heterogeneous network schemas in Figure 4(b) and Figure 4(c) respectively

IV-C motif vs. metapath vs. metagraph

Figure 5 illustrates example 3-node, 4-node, and 5-node motifs for the machine learning tasks on the DBLP-P and AMiner-CS datasets. Figure 6 presents the metapath and metagraph for the node classification task on the DBLP-P and AMiner-CS datasets.

Motifs are crucial to understanding the structure, semantics, and functions of meaningful patterns in complex networks. Thus, interesting motifs in a heterogeneous bibliographic network (see Figure 5) may include: (i) authors collaborating on the same paper, (ii) papers published at the same venue, (iii) authors collaborating on a paper that gets published at a venue, (iv) authors collaborating on papers, (v) an author publishing papers at the same venue, (vi) authors collaborating on papers that get published at a venue, and (vii) authors publishing papers at the same venue.

Table III presents the effectiveness of each motif in generating higher quality embeddings useful for classification. In our framework, we can combine multiple motifs for learning more effective node representations. Some combinations of motifs are more useful than others. In Table III, all 3-node motifs and all 4-node motifs do not return the highest classification accuracy, since at least one non-useful motif in each set pulls the classification accuracy down. Determining the best combination of motifs for higher quality embedding learning is combinatorially expensive and not the focus of this paper. In our experiments, we consider only one motif in order to ensure a fair comparison with the baseline methods, which consider one metapath and one metagraph with the same semantics, respectively.

Authors in [6] surveyed metapath related efforts and found that the most popularly used meta-path schemes in bibliographic networks are "APA" and "APVPA". "APA" denotes the co-authorship relationship, while "APVPA" represents authors publishing papers at the same venue. We consider "APVPA" as the metapath for experiments involving the DBLP-P and AMiner-CS datasets, since it can be generalized to diverse tasks in a heterogeneous bibliographic network [6]. Authors in [37] extend metapaths to metagraphs in order to better capture rich contexts and semantic relations between nodes. The augmentation of an additional path to the directed metapath helps the meta-structure encode semantic relations between distant nodes. Thus, we choose the same metagraph as the one shown in [37] for heterogeneous bibliographic network mining.

Figure 7(a) and Figure 7(b) present the motif, metapath, and metagraph for link prediction on the Yelp-Restaurant and Amazon-Electronics datasets, respectively. Our choice of metapath and metagraph for the Yelp-Restaurant and Amazon-Electronics datasets is inspired by [39]. Similar to our set-up for the DBLP-P and AMiner-CS datasets, we choose one motif from the set of possible motifs for motif2vec in order to ensure a fair comparison with metapath2vec and metagraph2vec. Note that each of our motif, metapath, and metagraph in Figure 7(a) and Figure 7(b) consists of three node types, though the original schema has four node types. This is because metapath and metagraph cannot handle four types of nodes, as discussed next. Experimental results replacing node type location (L) with node type category (C) and node type brand (B) with node type category (C) for the Yelp-Restaurant and Amazon-Electronics datasets, respectively, are similar to the results in Table II.

Advantages of motif over metapath and metagraph: Motifs are capable of capturing greater context and leveraging richer semantics than both metapaths and metagraphs. This is because both metapaths and metagraphs are commonly used in a symmetric way, thereby facilitating recursive guidance for random walkers [6][22][24][37]. Thus, they cannot build meaningful meta-structures for heterogeneous network schemas like Yelp-Restaurant and Amazon-Electronics that have four node types. Figure 5 reveals how many more interesting heterogeneous patterns can be captured by motifs than by metapaths and metagraphs. Figure 8 showcases two example bifan motifs, known to occur frequently in complex networks, for the Yelp-Restaurant dataset that a metapath or a metagraph cannot capture. In addition, metapath2vec and metagraph2vec are designed to operate only on a single metapath and a single metagraph, respectively, unlike motif2vec.

Fig. 8: Example bifan motifs in Yelp-Restaurant dataset for the heterogeneous network schema in Figure 4(b)

IV-D Efficiency

It is important to investigate the efficiency of motif2vec in today's age of big graph data. The computationally expensive steps in the algorithm are: motif instance discovery, sequence generation, and embedding learning. The motif instance discovery literature is about two decades old and boasts many efficient algorithms [12]. In this work, we consider the widely adopted formalization of the motif discovery task, namely subgraph isomorphism, and use the fast method presented in [9] (NetworkX library) for motif discovery. For sequence generation and embedding learning using skip-gram, we employ the parallelization tricks suggested by prior art [8][16].

Figure 9 illustrates the time taken by each of the individual steps: motif instance extraction, weighted graph creation, random walk simulation, and skip-gram neural net training, for two datasets, one from each machine learning task under consideration. As expected, the time to extract the motif instances dominates motif2vec's end-to-end execution time, followed by the time taken to generate random walk sequences. Thus, our algorithm is limited by the scalability of motif instance extraction in the worst case.

Fig. 9: motif2vec execution time analysis

motif2vec efficiency for a 4.9M node, 12.5M edge graph: We conduct experiments on the AMiner-CS heterogeneous network, consisting of 4.9M nodes and 12.5M edges, to highlight the effectiveness of our method for big graphs, in spite of the computational expense associated with motif instance extraction and random walk simulation. Given a heterogeneous network with well-defined semantics and relations (see Figure 4) and a motif of interest (see Figure 5), we implement our own heuristic motif instance extraction method that is guided by the pattern in the motif to identify the matching sub-graphs in the overall network. While NetworkX's module for motif instance extraction returns all possible matching sub-graphs, which are far more in number than what is relevant to our task, our heuristic method prunes the candidate space and speeds up the extraction process. We skip the details of our heuristic due to lack of space.

For the AMiner-CS dataset, we run experiments with 24 threads, each utilizing a CPU core, under the same parameter settings (Section IV-A5). The random walk sequence generation step uses OpenMP to automatically decide the number of cores, as in the high-performance graph analytics SNAP repository [8]. motif2vec took 8 hours 55 minutes for end-to-end execution, of which:

26 minutes is taken by our heuristic motif instance extraction method,

43 minutes is taken by random walk simulation method,

7 hours 42 minutes is taken by word2vec/skip-gram neural net model training.

Our algorithm involves a number of parameters, as evident in Algorithm 1. We do not conduct parameter sensitivity analysis experiments. Instead, we learn from the authors’ efforts in [6][8] and make our choices.

V Related Work

Network representation learning: Network representation learning is a well-studied research problem owing to the ubiquitous nature of networks in the real world and applications such as node classification, link prediction, visualization, clustering, etc. Various approaches have been studied in the literature to address this problem [7]. The early approaches aimed to learn node representations by factorizing the graph adjacency matrix, as performed in recommender systems [1][17], and are computationally expensive. The random-walk based methods have emerged to be the most popular and include DeepWalk [18], node2vec [8], LINE [29], HARP [5], etc. Most of these efforts are designed for homogeneous networks and are inadequate for heterogeneous networks. Heterogeneous network representation learning methods include metapath2vec [6], metagraph2vec [37], PTE [28], HNE [4], LANE [11], and ASPEM [23]. Except for metapath2vec and metagraph2vec, none of these methods is aligned to our problem of generic unsupervised task-independent network embedding learning preserving the heterogeneity in structure and semantics, as discussed in Section I. Of late, deep learning based approaches have become popular for node representation learning. Further research focused on interpreting the embeddings learned by these models can be useful.

Heterogeneous information network: Heterogeneous information networks are graphs that have various types of nodes and links, fusing more information and containing richer semantics. Heterogeneous information networks are used to model most real-world networks today. In the literature, researchers have studied various tasks related to heterogeneous networks such as similarity search [26], clustering [27], prediction [25], classification [13], etc. Each method is designed for a specific heterogeneous network mining application. In our work, we learn node representations that are effective for both classification and prediction. We also demonstrate how motif2vec outperforms existing heterogeneous network representation learning methods [6][37] that cannot capture higher-order heterogeneous connectivity patterns or preserve the structure and semantics of multiple types of nodes/links.

Network motifs: Network motifs are simple basic building blocks of complex networks. Motifs originated in domains such as biochemistry and ecology, where they are used for studying networks such as gene regulation, neuron synaptic connection, etc. They have been successfully used in many computer science network mining tasks such as clustering [32], anomaly detection [31], and convolution [21]. Rossi et al. addressed the problem of higher-order network representation learning using motifs for homogeneous networks [20], but the method cannot be extended to handle heterogeneous networks. HONE [20] also does not combine the advantages of a random-walk based method and a motif-aware method, as we do.

VI Conclusion

In this paper, we study the problem of node representation learning for heterogeneous information networks. We propose a novel efficient algorithm, motif2vec, that leverages higher-order, recurring, and statistically significant network connectivity patterns in the form of network motifs to learn latent representations preserving heterogeneity in network structure and semantics. Unlike existing graph embedding methods for heterogeneous networks that employ some form of graph meta-structure to guide heterogeneous semantics aware random walks through the network, we employ motifs to transform the graph to a motif graph, which, in turn, encodes the heterogeneity. Our method preserves both local and global structural relationships in addition to rich semantic correlations in a network. We empirically demonstrate how the proposed algorithm consistently and significantly outperforms state-of-the-art baselines on diverse real-world datasets. An important input to our algorithm is the choice of motif(s) from the set of all possible motifs. In the future, we intend to explore the possibility of automatically learning the motif weights for a network or for a task. It will also be interesting to study how our algorithm extends to handle the dynamics of evolving heterogeneous networks.

References

  • [1] Amr Ahmed, Nino Shervashidze, Shravan M. Narayanamurthy, Vanja Josifovski, and Alexander J. Smola. Distributed large-scale natural graph factorization. In Proceedings of the 22nd WWW International Conference on World Wide Web, 2013.
  • [2] Yoshua Bengio, Aaron C. Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.
  • [3] Austin R. Benson, David F. Gleich, and Jure Leskovec. Higher-order organization of complex networks. Science Magazine, 353(6295):163–166, 2016.
  • [4] Shiyu Chang, Wei Han, Jiliang Tang, Guo-Jun Qi, Charu C. Aggarwal, and Thomas S. Huang. Heterogeneous network embedding via deep architectures. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
  • [5] Haochen Chen, Bryan Perozzi, Yifan Hu, and Steven Skiena. HARP: hierarchical representation learning for networks. In Proceedings of the 32nd AAAI International Conference on Artificial Intelligence.
  • [6] Yuxiao Dong, Nitesh V. Chawla, and Ananthram Swami. metapath2vec: Scalable representation learning for heterogeneous networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
  • [7] Palash Goyal and Emilio Ferrara. Graph embedding techniques, applications, and performance: A survey. Knowledge-Based Systems, 151:78–94, 2018.
  • [8] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
  • [9] Aric A. Hagberg, Daniel A. Schult, and Pieter J. Swart. Exploring network structure, dynamics, and function using networkx. In Proceedings of the 7th Python in Science Conference.
  • [10] Ruining He and Julian McAuley. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th WWW International Conference on World Wide Web.
  • [11] Xiao Huang, Jundong Li, and Xia Hu. Label informed attributed network embedding. In Proceedings of the 10th ACM WSDM International Conference on Web Search and Data Mining.
  • [12] Yusuf Kavurucu. A comparative study on network motif discovery algorithms. International Journal of Data Mining and Bioinformatics, 11(2), 2015.
  • [13] Xiangnan Kong, Bokai Cao, Philip S. Yu, Ying Ding, and David J. Wild. Meta path-based collective classification in heterogeneous information networks. In Proceedings of the 25th ACM CIKM International Conference on Information and Knowledge Management.
  • [14] Andrew McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore. Automating the construction of internet portals with machine learning. Information Retrieval, 3(2):127–163, 2000.
  • [15] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In ICLR Workshop, 2013.
  • [16] Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. In Proceedings of NIPS International Conference on Neural Information Processing Systems.
  • [17] Mingdong Ou, Peng Cui, Jian Pei, Ziwei Zhang, and Wenwu Zhu. Asymmetric transitivity preserving graph embedding. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016.
  • [18] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
  • [19] Radim Řehůřek and Petr Sojka. Software framework for topic modelling with large corpora. In Proceedings of the LREC Workshop on New Challenges for NLP Frameworks.
  • [20] Ryan A. Rossi, Nesreen K. Ahmed, and Eunyee Koh. Higher-order network representation learning. In Companion of the WWW International Conference on World Wide Web.
  • [21] Aravind Sankar, Xinyang Zhang, and Kevin Chen-Chuan Chang. Motif-based convolutional neural network on graphs. CoRR, abs/1711.05697, 2017.
  • [22] Chuan Shi, Yitong Li, Jiawei Zhang, Yizhou Sun, and Philip S. Yu. A survey of heterogeneous information network analysis. IEEE Transactions on Knowledge and Data Engineering, 29(1):17–37, 2017.
  • [23] Yu Shi, Huan Gui, Qi Zhu, Lance M. Kaplan, and Jiawei Han. Aspem: Embedding learning by aspects in heterogeneous information networks. In Proceedings of the SIAM SDM International Conference on Data Mining.
  • [24] Yizhou Sun and Jiawei Han. Mining Heterogeneous Information Networks: Principles and Methodologies. Synthesis Lectures on Data Mining and Knowledge Discovery. Morgan & Claypool Publishers, 2012.
  • [25] Yizhou Sun, Jiawei Han, Charu C. Aggarwal, and Nitesh V. Chawla. When will it happen?: relationship prediction in heterogeneous information networks. In Proceedings of the 15th ACM WSDM International Conference on Web Search and Web Data Mining.
  • [26] Yizhou Sun, Jiawei Han, Xifeng Yan, Philip S. Yu, and Tianyi Wu. Pathsim: Meta path-based top-k similarity search in heterogeneous information networks. PVLDB, 4(11):992–1003, 2011.
  • [27] Yizhou Sun, Brandon Norick, Jiawei Han, Xifeng Yan, Philip S. Yu, and Xiao Yu. Integrating meta-path selection with user-guided object clustering in heterogeneous information networks. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
  • [28] Jian Tang, Meng Qu, and Qiaozhu Mei. PTE: predictive text embedding through large-scale heterogeneous text networks. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
  • [29] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. LINE: large-scale information network embedding. In Proceedings of the 24th WWW International Conference on World Wide Web.
  • [30] Lei Tang and Huan Liu. Leveraging social media networks for classification. Journal of Data Mining and Knowledge Discovery, 23(3):447–478, 2011.
  • [31] Charalampos E. Tsourakakis. Motif-driven graph analysis. In Proceedings of 54th Annual Allerton Conference on Communication, Control, and Computing, 2016.
  • [32] Charalampos E. Tsourakakis, Jakub Pachocki, and Michael Mitzenmacher. Scalable motif-aware graph clustering. In Proceedings of the 26th WWW International Conference on World Wide Web, pages 1451–1460, 2017.
  • [33] Véronique Van Vlasselaer, Cristián Bravo, Olivier Caelen, Tina Eliassi-Rad, Leman Akoglu, Monique Snoeck, and Bart Baesens. APATE: A novel approach for automated credit card transaction fraud detection using network-based extensions. Decision Support Systems, 75:38–48, 2015.
  • [34] Guan Wang, Sihong Xie, Bing Liu, and Philip S. Yu. Review graph based online store review spammer detection. In Proceedings of the 11th IEEE ICDM International Conference on Data Mining.
  • [35] Jizhe Wang, Pipei Huang, Huan Zhao, Zhibo Zhang, Binqiang Zhao, and Dik Lun Lee. Billion-scale commodity embedding for e-commerce recommendation in alibaba. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.
  • [36] Hao Yin, Austin R. Benson, Jure Leskovec, and David F. Gleich. Local higher-order graph clustering. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
  • [37] Daokun Zhang, Jie Yin, Xingquan Zhu, and Chengqi Zhang. Metagraph2vec: Complex semantic path augmented heterogeneous network embedding. In Advances in Knowledge Discovery and Data Mining - Proceedings of the 22nd PAKDD International Pacific-Asia Conference.
  • [38] Yongfeng Zhang, Guokun Lai, Min Zhang, Yi Zhang, Yiqun Liu, and Shaoping Ma. Explicit factor models for explainable recommendation based on phrase-level sentiment analysis. In Proceedings of the 37th ACM SIGIR International Conference on Research and Development in Information Retrieval.
  • [39] Huan Zhao, Quanming Yao, Jianda Li, Yangqiu Song, and Dik Lun Lee. Meta-graph based recommendation fusion over heterogeneous information networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.