FI-GRL: Fast Inductive Graph Representation Learning via Projection-Cost Preservation

Fei Jiang¹, Lei Zheng², Jin Xu¹, Philip S. Yu²,³
¹Department of Computer Science, Peking University, Beijing, China
²Department of Computer Science, University of Illinois at Chicago, IL, US
³Institute for Data Science, Tsinghua University, Beijing, China
allen.feijiang@gmail.com, {lzheng21, psyu}@uic.edu, jxu@pku.edu.cn
Abstract

Graph representation learning aims at transforming graph data into meaningful low-dimensional vectors to facilitate the use of machine learning and data mining algorithms designed for general data. Most current graph representation learning approaches are transductive: they require all nodes in the graph to be known when learning representations, and they cannot naturally generalize to unseen nodes. In this paper, we present a Fast Inductive Graph Representation Learning framework (FI-GRL) to learn nodes’ low-dimensional representations. Our approach obtains accurate representations for seen nodes with provable theoretical guarantees and generalizes easily to unseen nodes. Specifically, in order to explicitly decouple nodes’ relations expressed by the graph, we transform nodes into a randomized subspace spanned by a random projection matrix. This stage is guaranteed to preserve the projection-cost of the normalized random walk matrix, which is closely related to the normalized cut of the graph. Feature extraction is then achieved by conducting singular value decomposition on the obtained matrix sketch. By leveraging the projection-cost preservation property of the matrix sketch, the obtained representation is nearly optimal. To deal with unseen nodes, we use a folding-in technique to learn their representations. Empirically, when the number of seen nodes is larger than that of unseen nodes, FI-GRL always achieves excellent results. Our algorithm is fast, simple to implement and theoretically guaranteed. Extensive experiments on real datasets demonstrate the superiority of our algorithm in both efficacy and efficiency on both macroscopic level (clustering) and microscopic level (structural hole detection) applications.

Graph Representation Learning, Inductive Learning, Graph Mining

I Introduction

Graphs with nodes representing entities and edges representing relationships between entities are ubiquitous in various research fields. However, since a graph is naturally expressed in a node-interrelated way, it is exhausting to design a separate, complicated graph algorithm for every mining and analytic task on graph data. Graph representation learning (also known as graph embedding or network embedding) aims at learning low-dimensional vectors that represent nodes without losing much of the information contained in the original graph. Afterwards, one can apply a wide range of off-the-shelf machine learning and data mining algorithms designed for general data to various important applications (e.g., clustering [1], structural hole detection [2], link prediction [3], visualization [4], etc.).

Due to its ability to facilitate graph analysis, graph representation learning has drawn attention from researchers in the machine learning and data mining fields [5, 6, 7]. Most of these works focus on static networks, for example by explicitly preserving local or high-order proximity [8, 9, 10], learning representations using truncated walks [11, 12], using matrix factorization techniques to obtain latent vectors [13, 14, 6], or incorporating heterogeneous information [7, 5]. Most of these methods lack theoretical support for producing reliable representations. They are also inherently transductive: they act as black boxes that only learn representations and have no internal mechanism to naturally generalize to unseen nodes.

Fig. 1: A simple illustration of the intuition of our framework. Blue and red points can be assigned to two clusters respectively and the green node is an unseen node. Our approach assigns the unseen node to an appropriate location with preserving clustering membership and local proximity.

Our goal is to design a fast and flexible graph representation learning framework that preserves important graph topological information (e.g., clustering membership, node similarity, etc.) with provable theoretical guarantees and naturally generalizes to unseen nodes. To achieve this goal, we propose the Fast Inductive Graph Representation Learning (FI-GRL) framework. FI-GRL consists of two stages: decoupling and feature extraction. The intuition of this architecture is illustrated in Figure 1. The first stage decouples nodes’ relations by utilizing an oblivious algorithm, Johnson-Lindenstrauss random projection. For a graph $\mathcal{G}$ with normalized random walk matrix $\mathbf{M}$, this stage generates a matrix sketch $\tilde{\mathbf{M}} \in \mathbb{R}^{n \times d}$, where $n$ is the number of nodes of graph $\mathcal{G}$ and $d$ is the sketch size, a parameter that can be automatically determined by our approach and is much smaller than $n$. The matrix sketch $\tilde{\mathbf{M}}$ approximates $\mathbf{M}$ well with theoretical guarantees. The second stage extracts meaningful features contained in $\tilde{\mathbf{M}}$ by low rank approximation. Dimension reduction is achieved collaboratively in both stages. The resulting representations are theoretically guaranteed to perform well on constrained low rank approximation tasks (e.g., $k$-means clustering). The main contributions of this paper are summarized as follows:

  • Architecture and Randomization: The proposed framework FI-GRL is fast and flexible enough to handle large graphs. Since the decoupling stage adopts an oblivious randomized algorithm, nodes can be processed sequentially in a single pass without storing the entire graph. The matrix sketch is much smaller than the original matrix, so the feature extraction stage can process it much faster. Moreover, the first stage is projection-cost preserving, which ensures that the resulting representations extracted by the second stage are optimal up to an approximation ratio $\epsilon$. As far as we know, this is the first time that randomized algorithms are introduced to deal with the graph representation learning problem.

  • Theoretical Analysis: We analyze our algorithm theoretically. In Theorem 1, we prove the optimality of our algorithm in terms of the absolute difference of projection-cost between the learned representations and the desired representations, and also in terms of the absolute distance difference between their corresponding $k$-means centroids. For the choice of the parameter (the sketch size $d$), we give theoretical guidance in Section III-E and an empirical analysis in Section IV-E.

  • Inductive Learning: Our two-stage framework can naturally generalize to learn representations of unseen nodes. We adopt an incremental singular value decomposition with a folding-in technique on the matrix sketch to learn representations of unseen nodes. The empirical results reported in Section IV-D demonstrate the effectiveness of our method on unseen nodes.

  • Empirical Study: FI-GRL can produce graph representations of different accuracy for different levels of applications. In Section IV, extensive experiments conducted on both macroscopic level (clustering) and microscopic level (structural hole detection) applications show the superiority of our framework in efficacy and efficiency.

II Preliminary and Problem Formulation

A graph $\mathcal{G} = (V, E, \mathbf{W})$ is a tuple with three elements: $V$ denotes the node set, $E$ the edge set and $\mathbf{W}$ the weighted adjacency matrix. Without ambiguity, we use the terms graph and network interchangeably. In this paper, to facilitate the distinction, we use lowercase letters (e.g., $n$) to denote scalars, bold lowercase letters (e.g., $\mathbf{x}$) to denote vectors, bold uppercase letters (e.g., $\mathbf{W}$) to denote matrices and calligraphic letters to denote graphs. The symbol table is shown in Table I.

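Symbol Definition
$\mathcal{G}$ Graph
$V$, $E$, $\mathrm{vol}(\cdot)$ Node set, edge set and the corresponding volume
$N(v)$ Neighbor set of node $v$
$\mathbf{W}$, $\mathbf{D}$ Weighted adjacency and diagonal degree matrices
$\mathbf{M}$ Normalized random walk matrix
$\|\cdot\|_F$, $\|\cdot\|_2$ Frobenius norm and spectral norm
TABLE I: List of basic symbols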

We first define the problem of graph representation learning as follows:

Definition 1.

(Graph Representation Learning) Given a graph $\mathcal{G} = (V, E, \mathbf{W})$, for a fixed dimension number $k$, graph representation learning aims at learning a map $f: v \mapsto \mathbf{y}_v \in \mathbb{R}^k$ for every $v \in V$.

Graph representation learning will generate a low-dimensional vector for every node in the graph. The obtained vectors should preserve important information (e.g., clustering membership, node similarity, etc) hidden in the graph. The scenario is even more challenging when considering unseen nodes. We generally define the graph representation learning task for unseen nodes.

Definition 2.

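(Graph Representation Learning for Unseen Nodes) A graph $\mathcal{G}' = (V', E', \mathbf{W}')$ is extended from graph $\mathcal{G} = (V, E, \mathbf{W})$ by adding nodes and associated edges after the graph representations of $\mathcal{G}$ have been obtained. For each node $v \in V' \setminus V$, we learn a map $f: v \mapsto \mathbf{y}_v \in \mathbb{R}^k$ without recomputing the representations obtained thus far.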

III Methodology

In this section, we first give some observations to show 1) why we choose the normalized random walk matrix as the initial input matrix of the graph, and 2) how important tasks (e.g., $k$-means clustering) can be reduced to the constrained low rank approximation problem, on which our framework is theoretically guaranteed to work well. Then, we present our two-stage framework in detail and generalize it to unseen nodes. Finally, we provide guidance on choosing parameters and present the complexity analysis. Our FI-GRL framework is illustrated in Figure 2.

Fig. 2: An illustration of the FI-GRL framework. This framework consists of two stages. The first stage decouples the nodes’ relations by using a random projection algorithm. The second stage performs feature extraction on the matrix sketch. This two-stage framework provides a natural way of obtaining near-optimal representations for seen nodes and of generating approximate representations for unseen nodes.

III-A Observations

III-A1 Normalized Cut

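In [15], image segmentation is treated as a graph partition problem based on three metrics: average association, normalized cut and average cut. It is demonstrated that normalized cut is a desirable choice, which seeks a balance between finding clumps and finding split nodes. For a graph $\mathcal{G} = (V, E, \mathbf{W})$ and two disjoint node sets $A$, $B \subseteq V$, the normalized cut is defined as:

$$\mathrm{Ncut}(A, B) = \frac{\mathrm{cut}(A, B)}{\mathrm{vol}(A)} + \frac{\mathrm{cut}(A, B)}{\mathrm{vol}(B)}, \tag{1}$$

where $\mathrm{cut}(A, B) = \sum_{u \in A, v \in B} \mathbf{W}_{uv}$ is the total weight of the edges between $A$ and $B$, and $\mathrm{vol}(A) = \sum_{u \in A} \sum_{v \in V} \mathbf{W}_{uv}$ is the volume of $A$.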

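The problem of finding a graph partition with optimal normalized cut can be reduced to the following generalized eigenvector problem:

$$(\mathbf{D} - \mathbf{W})\,\mathbf{y} = \lambda\,\mathbf{D}\,\mathbf{y}, \tag{2}$$

where $\mathbf{W}$ is the weighted adjacency matrix and $\mathbf{D}$ is the diagonal degree matrix.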

Then, applying k-means clustering on several smallest non-trivial eigenvectors will achieve the optimal graph partitions in terms of normalized cuts. Thus, the matrix related to Equation 2 is a good choice for representing a graph.
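As a small numerical illustration of the normalized cut, the following sketch evaluates it on a toy graph. It assumes the standard definition from [15], i.e., the cut weight divided by the volume of each side, with the second node set taken as the complement of the first. The function name `normalized_cut` and the two-triangle graph are hypothetical choices made for this example only.

```python
import numpy as np


def normalized_cut(W, in_A):
    """Ncut(A, B) = cut(A, B) / vol(A) + cut(A, B) / vol(B), B = complement of A."""
    W = np.asarray(W, dtype=float)
    in_A = np.asarray(in_A, dtype=bool)
    cut = W[in_A][:, ~in_A].sum()      # total weight of edges crossing the partition
    degrees = W.sum(axis=1)            # weighted degree of every node
    return cut / degrees[in_A].sum() + cut / degrees[~in_A].sum()


# Toy graph (an assumption for illustration): triangles {0, 1, 2} and {3, 4, 5}
# joined by the single bridge edge (2, 3).
W = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[u, v] = W[v, u] = 1.0

balanced = normalized_cut(W, [True, True, True, False, False, False])
lopsided = normalized_cut(W, [True, False, False, False, False, False])
print(balanced, lopsided)  # the balanced split has the smaller normalized cut
```

Splitting along the bridge cuts one edge between two sides of equal volume, whereas isolating a single node cuts two edges and leaves one side with very small volume, which is exactly the imbalance the normalized cut penalizes.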

III-A2 Constrained Low Rank Approximation

To design a graph representation learning algorithm that preserves graph information such as clustering membership, we consider the constrained low rank approximation problem, to which many important tasks, including $k$-means clustering, can be reduced. The constrained low rank approximation problem is defined as follows.

Definition 3.

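(Constrained Low Rank Approximation) For a matrix $\mathbf{A} \in \mathbb{R}^{n \times d}$ and any set $S$ of rank $k$ orthogonal projection matrices in $\mathbb{R}^{n \times n}$, constrained rank $k$ approximation tries to find

$$\mathbf{P}^* = \arg\min_{\mathbf{P} \in S} \|\mathbf{A} - \mathbf{P}\mathbf{A}\|_F^2,$$

where $\|\mathbf{A} - \mathbf{P}\mathbf{A}\|_F^2$ is called the projection-cost of $\mathbf{P}$.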

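$k$-means clustering and approximate singular value decomposition (SVD) are both constrained low rank approximation problems. More precisely, $k$-means clustering aims at dividing the $n$ row vectors $\mathbf{a}_1, \dots, \mathbf{a}_n$ of $\mathbf{A}$, where $\mathbf{A} \in \mathbb{R}^{n \times d}$, into $k$ clusters $C_1, \dots, C_k$. We denote the centroid of cluster $C_i$ as $\boldsymbol{\mu}_i$. The goal of $k$-means clustering is to minimize the following objective function:

$$\sum_{i=1}^{k} \sum_{\mathbf{a}_j \in C_i} \|\mathbf{a}_j - \boldsymbol{\mu}_i\|_2^2.$$

We transform this objective function into matrix form. We denote the cluster indicator matrix as $\mathbf{X} \in \mathbb{R}^{n \times k}$, where $\mathbf{X}_{ji} = 1/\sqrt{|C_i|}$ if $\mathbf{a}_j$ is assigned to cluster $C_i$ and $\mathbf{X}_{ji} = 0$ otherwise. In matrix form, $k$-means clustering minimizes

$$\|\mathbf{A} - \mathbf{X}\mathbf{X}^\top \mathbf{A}\|_F^2.$$

Clearly, $\mathbf{X}\mathbf{X}^\top$ is a projection matrix that projects the points’ vectors onto their cluster centroids. Therefore, by Definition 3, $k$-means clustering is a constrained low rank approximation problem with the set $S$ being all possible projection matrices $\mathbf{X}\mathbf{X}^\top$, where $\mathbf{X}$ is a cluster indicator matrix. SVD is also a constrained low rank approximation problem: it finds the optimal rank $k$ approximation of $\mathbf{A}$ with $S$ being the unconstrained set of all rank $k$ projection matrices. The optimal solution is given by the top $k$ left singular vectors $\mathbf{U}_k$ of $\mathbf{A}$, i.e., $\mathbf{P}^* = \mathbf{U}_k \mathbf{U}_k^\top$.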
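The equivalence between the $k$-means objective and a projection-cost can be checked numerically. The sketch below assumes the usual normalized cluster indicator matrix (entry $1/\sqrt{|C_i|}$ for members of cluster $C_i$, zero otherwise); the helper names `kmeans_cost` and `indicator_matrix` and the random data are hypothetical and serve only as an illustration.

```python
import numpy as np


def kmeans_cost(A, labels, k):
    """Sum of squared distances of every row of A to its cluster centroid."""
    cost = 0.0
    for c in range(k):
        points = A[labels == c]
        cost += ((points - points.mean(axis=0)) ** 2).sum()
    return cost


def indicator_matrix(labels, k):
    """Normalized cluster indicator X with X[j, c] = 1 / sqrt(|C_c|) for j in C_c."""
    X = np.zeros((len(labels), k))
    for c in range(k):
        members = labels == c
        X[members, c] = 1.0 / np.sqrt(members.sum())
    return X


rng = np.random.default_rng(0)
A = rng.normal(size=(12, 5))          # 12 points in 5 dimensions
labels = np.arange(12) % 3            # an arbitrary assignment into 3 non-empty clusters

X = indicator_matrix(labels, 3)
P = X @ X.T                           # replaces every point by its cluster centroid
projection_cost = np.linalg.norm(A - P @ A, "fro") ** 2
print(projection_cost, kmeans_cost(A, labels, 3))
```

Because the columns of the indicator matrix are orthonormal, `P` is an orthogonal projection of rank 3, and the two printed values agree up to floating-point error for any assignment with non-empty clusters.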

III-B Decoupling with Projection-Cost Preservation

If we can find a small matrix whose learned graph representations achieve nearly the same results as those learned from the original matrix, we gain substantial speed and space benefits. In this subsection, we show how to obtain such a small matrix (which we call a matrix sketch) and demonstrate its optimality. Formally, for a matrix $\mathbf{A} \in \mathbb{R}^{n \times d}$, we want to find a matrix sketch $\tilde{\mathbf{A}} \in \mathbb{R}^{n \times d'}$ (where $d' \ll d$) that approximates $\mathbf{A}$ well, in the sense that for the constrained low rank approximation problem of Definition 3 we can optimize the projection matrix $\mathbf{P}$ over the sketch $\tilde{\mathbf{A}}$ instead of over $\mathbf{A}$. First, we define the projection-cost preserving sketch as in [16].

Definition 4.

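(Rank $k$ Projection-Cost Preserving Sketch) $\tilde{\mathbf{A}} \in \mathbb{R}^{n \times d'}$ is a rank $k$ projection-cost preserving sketch of $\mathbf{A} \in \mathbb{R}^{n \times d}$ with error $0 \le \epsilon < 1$ if, for all rank $k$ orthogonal projection matrices $\mathbf{P} \in \mathbb{R}^{n \times n}$,

$$(1 - \epsilon)\,\|\mathbf{A} - \mathbf{P}\mathbf{A}\|_F^2 \le \|\tilde{\mathbf{A}} - \mathbf{P}\tilde{\mathbf{A}}\|_F^2 + c \le (1 + \epsilon)\,\|\mathbf{A} - \mathbf{P}\mathbf{A}\|_F^2,$$

for some fixed non-negative constant $c$ that may depend on $\mathbf{A}$ and $\tilde{\mathbf{A}}$ but is independent of $\mathbf{P}$.

This definition implies that the projection cost of any projection $\mathbf{P}$ on $\tilde{\mathbf{A}}$ is a good estimate of the projection cost of the same projection on $\mathbf{A}$. The following lemma indicates that if $\tilde{\mathbf{A}}$ is a rank $k$ projection-cost preserving sketch of $\mathbf{A}$, one can optimize over $\tilde{\mathbf{A}}$ to obtain a near-optimal projection matrix for the constrained low rank approximation problem.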

Lemma 1.

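Suppose $\tilde{\mathbf{A}}$ is a projection-cost preserving sketch of $\mathbf{A}$ with approximation ratio $\epsilon$ over the set $S$ of all rank $k$ projection matrices. Let $\mathbf{P}^* = \arg\min_{\mathbf{P} \in S} \|\mathbf{A} - \mathbf{P}\mathbf{A}\|_F^2$ and $\tilde{\mathbf{P}}^* = \arg\min_{\mathbf{P} \in S} \|\tilde{\mathbf{A}} - \mathbf{P}\tilde{\mathbf{A}}\|_F^2$. Then,

$$\|\mathbf{A} - \tilde{\mathbf{P}}^*\mathbf{A}\|_F^2 \le \frac{1 + \epsilon}{1 - \epsilon}\,\|\mathbf{A} - \mathbf{P}^*\mathbf{A}\|_F^2.$$

Proof.

Since $\tilde{\mathbf{P}}^*$ is the optimal solution for $\tilde{\mathbf{A}}$, then

$$\|\tilde{\mathbf{A}} - \tilde{\mathbf{P}}^*\tilde{\mathbf{A}}\|_F^2 \le \|\tilde{\mathbf{A}} - \mathbf{P}^*\tilde{\mathbf{A}}\|_F^2. \tag{3}$$

$\tilde{\mathbf{A}}$ is a projection-cost preserving sketch of $\mathbf{A}$, so for the projection matrix $\tilde{\mathbf{P}}^*$, the following inequality holds:

$$(1 - \epsilon)\,\|\mathbf{A} - \tilde{\mathbf{P}}^*\mathbf{A}\|_F^2 \le \|\tilde{\mathbf{A}} - \tilde{\mathbf{P}}^*\tilde{\mathbf{A}}\|_F^2 + c. \tag{4}$$

Again, considering the projection matrix $\mathbf{P}^*$, we have

$$\|\tilde{\mathbf{A}} - \mathbf{P}^*\tilde{\mathbf{A}}\|_F^2 + c \le (1 + \epsilon)\,\|\mathbf{A} - \mathbf{P}^*\mathbf{A}\|_F^2. \tag{5}$$

Combining Equations 3 and 5, we get

$$\|\tilde{\mathbf{A}} - \tilde{\mathbf{P}}^*\tilde{\mathbf{A}}\|_F^2 + c \le (1 + \epsilon)\,\|\mathbf{A} - \mathbf{P}^*\mathbf{A}\|_F^2. \tag{6}$$

Finally, combining Equations 4 and 6 yields

$$\|\mathbf{A} - \tilde{\mathbf{P}}^*\mathbf{A}\|_F^2 \le \frac{1 + \epsilon}{1 - \epsilon}\,\|\mathbf{A} - \mathbf{P}^*\mathbf{A}\|_F^2.$$

∎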

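The above lemma provides a theoretical guarantee that meaningful information can be obtained from a matrix sketch, which is computationally efficient. To capture the graph information related to normalized cut, we consider Equation 2. Solving this generalized eigenvector problem directly is not convenient, since it is not easy to compute incrementally and, more importantly, it is not a constrained low rank approximation problem. Therefore, we transform it into an SVD problem by setting $\mathbf{z} = \mathbf{D}^{1/2}\mathbf{y}$ in Equation 2, where $\mathbf{W}$ is the weighted adjacency matrix, $\mathbf{D}$ the diagonal degree matrix and $\mathbf{y}$ the generalized eigenvector, which gives

$$\left(\mathbf{I} - \mathbf{D}^{-1/2}\mathbf{W}\mathbf{D}^{-1/2}\right)\mathbf{z} = \lambda\,\mathbf{z}.$$

Removing non-relevant terms, we actually want to find the top eigenvectors of $\mathbf{D}^{-1/2}\mathbf{W}\mathbf{D}^{-1/2}$, which we denote as $\mathbf{M}$. $\mathbf{M}$ is called the normalized random walk matrix. If we have a matrix sketch $\tilde{\mathbf{M}}$ of $\mathbf{M}$, one can use the top singular vectors of $\tilde{\mathbf{M}}$ to approximate the top eigenvectors of $\mathbf{M}$, since $\mathbf{M}$ is a symmetric matrix.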
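The change of variables can be verified on a small example. The sketch below assumes the generalized eigenproblem takes the standard form from [15], namely the graph Laplacian applied to a vector equals an eigenvalue times the degree matrix applied to the same vector, and uses a hypothetical two-triangle toy graph. It computes the second-largest eigenvector of the normalized matrix, maps it back, and checks that the result solves the generalized problem.

```python
import numpy as np

# Toy graph (an assumption for illustration): triangles {0, 1, 2} and {3, 4, 5}
# joined by the single bridge edge (2, 3).
W = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[u, v] = W[v, u] = 1.0

degrees = W.sum(axis=1)
inv_sqrt = 1.0 / np.sqrt(degrees)
M = inv_sqrt[:, None] * W * inv_sqrt[None, :]    # D^{-1/2} W D^{-1/2}

eigenvalues, eigenvectors = np.linalg.eigh(M)    # ascending order, M is symmetric
mu, z = eigenvalues[-2], eigenvectors[:, -2]     # second-largest eigenpair of M
y = inv_sqrt * z                                 # undo the substitution z = D^{1/2} y
lam = 1.0 - mu                                   # matching generalized eigenvalue

# Residual of (D - W) y = lam * D y; it should vanish up to rounding.
residual = (np.diag(degrees) - W) @ y - lam * degrees * y
side = y > 0                                     # sign pattern separates the triangles
print(lam, side)
```

The largest eigenvalue of the normalized matrix is always 1 and corresponds to the trivial solution, so the second-largest eigenvector is the first informative one; on this toy graph its sign pattern recovers the two triangles.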

Next, we decouple the nodes’ relations in the graph in terms of the normalized random walk matrix $\mathbf{M}$ while preserving the projection-cost. One can regard each node as a row vector of $\mathbf{M}$, which is generally sparse. To remove the connection between nodes, we randomly project their vectors onto orthogonal directions $d$ times, so that the nodes’ vectors are mapped into that orthogonal subspace. More precisely, for a node vector $\mathbf{m} \in \mathbb{R}^{n}$, we choose a map $\mathbf{m} \mapsto \mathbf{m}\mathbf{R}$, where $d \ll n$ and $\mathbf{R} \in \mathbb{R}^{n \times d}$ is an orthogonal projection that maps vectors into a uniformly random subspace of dimension $d$. This strategy would work well if we had such a random matrix $\mathbf{R}$. However, ensuring the orthogonality of the projection matrix takes unnecessary time. We can achieve the same goal without explicitly orthonormalizing the projection matrix. We choose $\mathbf{R}$ to be a Johnson-Lindenstrauss matrix, that is, the entries of $\mathbf{R}$ are drawn independently from the Gaussian distribution $\mathcal{N}(0, 1/d)$. In this way, although the eigenvalues of $\mathbf{R}\mathbf{R}^\top$ are not confined to $\{0, 1\}$, the range of $\mathbf{R}$ is indeed a uniformly random subspace. In matrix form, this means that

$$\tilde{\mathbf{M}} = \mathbf{M}\mathbf{R}. \tag{7}$$

Now we have obtained a matrix sketch $\tilde{\mathbf{M}}$ by Johnson-Lindenstrauss random projection. The following lemma indicates that the matrix sketch generated by this procedure is indeed a projection-cost preserving sketch.

Lemma 2.

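For a matrix $\mathbf{M} \in \mathbb{R}^{n \times n}$, let $\mathbf{R} \in \mathbb{R}^{n \times d}$ be a Johnson-Lindenstrauss matrix with each entry drawn independently from the Gaussian distribution $\mathcal{N}(0, 1/d)$. For $\tilde{\mathbf{M}} = \mathbf{M}\mathbf{R}$ and any $\delta \in (0, 1)$, with probability at least $1 - \delta$, $\tilde{\mathbf{M}}$ is a rank $k$ projection-cost preserving sketch of $\mathbf{M}$ with approximation ratio $\epsilon$, when $d = O\big((k + \log(1/\delta))/\epsilon^2\big)$.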

By Lemma 2, one can achieve an accurate sketch with a satisfactory approximation ratio $\epsilon$ by increasing the sketch size $d$. We discuss how to choose this parameter in Section III-E.
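The projection-cost preservation property can be observed empirically. The sketch below is only an illustration under stated assumptions: it uses a dense Gaussian matrix as a stand-in for the input matrix, draws the projection entries from a zero-mean Gaussian with variance equal to the inverse of the sketch size, and compares projection costs on the sketch and on the original matrix for a few random rank-3 projections. The sizes and helper names are hypothetical and are not taken from the paper's experiments.

```python
import numpy as np


def projection_cost(A, P):
    """Squared Frobenius norm of the part of A that P fails to capture."""
    return np.linalg.norm(A - P @ A, "fro") ** 2


def random_rank_k_projection(n, k, rng):
    """Orthogonal projection onto a random k-dimensional subspace of R^n."""
    Q, _ = np.linalg.qr(rng.normal(size=(n, k)))
    return Q @ Q.T


rng = np.random.default_rng(0)
n, d, k = 200, 100, 3
A = rng.normal(size=(n, n))                           # stand-in input matrix (assumption)
R = rng.normal(scale=1.0 / np.sqrt(d), size=(n, d))   # entries ~ N(0, 1/d)
sketch = A @ R                                        # n x d matrix sketch

ratios = [
    projection_cost(sketch, P) / projection_cost(A, P)
    for P in (random_rank_k_projection(n, k, rng) for _ in range(20))
]
print(min(ratios), max(ratios))   # both stay close to 1
```

The scaling of the Gaussian entries matters: it makes the squared norm of every sketched row match the original in expectation, which is why the ratios concentrate around 1 even though the sketch has half as many columns.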

III-C Feature Extraction by Low Rank Approximation

After obtaining the projection-cost preserving sketch $\tilde{\mathbf{M}}$ of the normalized random walk matrix $\mathbf{M}$, we want to extract meaningful information from the sketch and further reduce its dimension. Singular value decomposition is a good choice: it is a constrained low rank approximation problem, which makes it well suited to further factorizing a projection-cost preserving sketch, and it is easy to adapt to unseen nodes. A partial (rank $k$) singular value decomposition of $\tilde{\mathbf{M}}$ gives

$$\tilde{\mathbf{M}} \approx \mathbf{U}_k \mathbf{\Sigma}_k \mathbf{V}_k^\top. \tag{8}$$

Each row of is the learned graph representation for the corresponding node in the graph. Further, to demonstrate the effectiveness of this framework to learn accurate low-dimensional vectors and to facilitate important tasks (e.g., -means clustering), we present the following theorem.

Theorem 1.

Let . And has a singular value decomposition as with distinct singular values. Let and . If , then has the singular value decomposition , such that where and . Furthermore, -means clustering over will give a good approximate to that over in terms of -means centroids.

Proof.

For the first part of the proof, since and are symmetric matrices, we can view as a symmetric perturbation matrix. Deriving the upper bound of is similar to the derivation of the absolute perturbation bound for eigenvector decomposition in perturbation theory [17].

The key of the second part is to regard -means clustering as a constrained low rank approximation problem [18]. As defined in Section III-A2, is a projection matrix projecting the points vector into its cluster centroid and is the cluster indicator matrix. The discrepancy between the corresponding cluster position assigned to each node of two matrix and is . Applying the spectral submultiplicativity property, it yields

Since is a symmetric projection matrix, the spectral norm of is not greater than 1. Then, it becomes

It means that approximates nodes’ representations in well and the assigned clusters for nodes in and are well-matched. ∎

Generally, this theorem states that the representations obtained by our algorithm come with a strong guarantee when $k$-means clustering is performed on them. In the experiments, we demonstrate that FI-GRL also achieves excellent results on the clustering task using the agglomerative method (AM) and on the structural hole detection task.

III-D Inductive Learning and Entire Framework of FI-GRL

So far, we’ve presented the graph representation learning framework for static graphs. When considering inductive learning on an unseen node , we first get of the corresponding column vector of the normalized random walk matrix of the extended graph after added. The valid dimension of is at most since self-join is prohibited. Then, applying the random projection matrix on , we get a compressed vector , where is sketch size as above. Since the partial SVD on matrix is , we can regard row vectors of as vectors in the span of . So we can project onto the span of . Specifically,

(9)

Then, degree-normalized , namely , is the obtained representation of node . This method is fast and powerful when the graph is stable and gradually changes. We testify the effectiveness of our method at different proportions of unseen nodes in the experiment. Overall, our FI-GRL framework is summarized in Algorithm 1.

0:   Graph with totally nodes; Unseen node set ;Dimension , approximation ratio
0:  Low-dimensional vectors
1:  Construct matrix for
2:  Construct a matrix , whose entries are independently drawn from , where is
3:  Compute each row of the matrix sketch , , where denotes th row of
4:  Compute -singular value decomposition
5:  Compute
6:  for all unseen nodes  do
7:     Compute
8:     Compute
9:     Append as a new row of
10:  end for
11:  return  
Algorithm 1 FI-GRL: Fast Inductive Graph Representation Learning
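The overall flow of Algorithm 1 can be sketched in a few lines of NumPy. This is a minimal illustration under several assumptions, not the authors' implementation (which is available at the repository cited in Section IV): the input matrix is taken to be the symmetrically degree-normalized adjacency matrix, the projection entries are Gaussian with variance equal to the inverse of the sketch size, folding-in uses the standard formula (sketched row times right singular vectors, divided by the singular values), and the final degree normalization of the representations is omitted. All function names, the toy graph, and the way the unseen node's row is normalized are hypothetical.

```python
import numpy as np


def normalized_random_walk_matrix(W):
    """M = D^{-1/2} W D^{-1/2}; assumes every node has positive degree."""
    inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    return inv_sqrt[:, None] * W * inv_sqrt[None, :]


def fit_seen_nodes(W, k, d, rng):
    """Sketch M with a Gaussian matrix, then take a rank-k SVD of the sketch."""
    M = normalized_random_walk_matrix(W)
    R = rng.normal(scale=1.0 / np.sqrt(d), size=(W.shape[0], d))
    U, s, Vt = np.linalg.svd(M @ R, full_matrices=False)
    return M, R, U[:, :k], s[:k], Vt[:k].T


def fold_in(m_row, R, s_k, V_k):
    """Project one sketched row onto span(V_k): (m R) V_k Sigma_k^{-1}."""
    return (m_row @ R) @ V_k / s_k


# Toy graph (an assumption): two triangles joined by the bridge edge (2, 3).
W = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[u, v] = W[v, u] = 1.0

rng = np.random.default_rng(0)
M, R, U_k, s_k, V_k = fit_seen_nodes(W, k=2, d=4, rng=rng)

# Consistency check: folding in a seen node reproduces its row of U_k.
refolded = fold_in(M[0], R, s_k, V_k)

# Hypothetical unseen node linked to nodes 0 and 1; normalizing with the
# seen nodes' old degrees is an assumption of this sketch.
w_new = np.array([1.0, 1.0, 0.0, 0.0, 0.0, 0.0])
m_new = w_new / np.sqrt(w_new.sum() * W.sum(axis=1))
embedding_new = fold_in(m_new, R, s_k, V_k)
print(refolded, U_k[0], embedding_new)
```

Folding-in never touches the decomposition computed for the seen nodes: the random matrix, the singular values and the right singular vectors are reused as they are, which is what makes the per-node cost small compared with recomputing the SVD.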

III-E Parameter Analysis and Complexity Analysis

The sketch size $d$ can be determined by the approximation ratio $\epsilon$, which can be chosen according to the requirements of different tasks. will be sufficient for most tasks focusing on mining the macroscopic structure of the graph (e.g., clustering). One can always decrease $\epsilon$ to achieve better accuracy if computational power allows. The Johnson-Lindenstrauss lemma provides another perspective for determining the sketch size $d$.

Lemma 3.

Let be arbitrary. Pick any and matrix as a Johnson-Lindenstrauss random projection matrix whose entries are independently and uniformly drawn from Gaussian distribution . Then, for , define for , then for any the following equations hold with probability

(10)

Taking each row of our matrix as , Lemma 3 proves that the norm of vectors and the distance between nodes are preserved in the low-dimensional subspace when . In fact, [19], where . When approximation ratio is known, we choose the as the sketch size .
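The norm and distance preservation described by the Johnson-Lindenstrauss lemma can be observed numerically. The following sketch makes no attempt to reproduce the constants of Lemma 3; the number of points, the original dimension and the sketch size are arbitrary values chosen for illustration, and the projection entries are assumed to be Gaussian with variance equal to the inverse of the sketch size.

```python
import numpy as np

rng = np.random.default_rng(1)
n_points, dim, d = 50, 1000, 400                        # arbitrary illustrative sizes
X = rng.normal(size=(n_points, dim))                    # original row vectors
R = rng.normal(scale=1.0 / np.sqrt(d), size=(dim, d))   # entries ~ N(0, 1/d)
Y = X @ R                                               # projected row vectors

# Ratio of projected to original norms, one value per point.
norm_ratio = np.linalg.norm(Y, axis=1) / np.linalg.norm(X, axis=1)

# Ratio of projected to original distances, one value per unordered pair.
i, j = np.triu_indices(n_points, k=1)
distance_ratio = (
    np.linalg.norm(Y[i] - Y[j], axis=1) / np.linalg.norm(X[i] - X[j], axis=1)
)
print(norm_ratio.min(), norm_ratio.max())
print(distance_ratio.min(), distance_ratio.max())
```

Increasing the sketch size tightens both ranges around 1, which mirrors the trade-off between the approximation ratio and the sketch size discussed above.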

To analyze the computational complexity of our algorithm, we first note that our algorithm is especially efficient since the normalized random walk matrix is very sparse when the graph is large, so the matrix-vector product can be evaluated rapidly. The computational cost of our algorithm in the static setting is in total, where is the total degree of the graph. is the cost of generating the Johnson-Lindenstrauss random projection matrix. is the cost of computing the projection-cost preserving sketch. is for computing the partial singular value decomposition. For unseen nodes, to learn a node representation of an unseen node , we need time where is the degree of . is the time of projecting the node into -dimensional space and is the cost of folding-in to the span of right singular vectors.

III-F Discussion

Since matrix is symmetric, we are able to use double-sided random projection with projection-cost preserved and use eigenvector decomposition as a generalized constrained low rank approximation. To see this, we compute the projection-cost of a projection on double sides of the symmetric matrix , then it yields

(11)

The lemmas and theorems in this paper can be devised correspondingly. However, this approach needs an additional computation of the product of the top k eigenvectors of the matrix sketch and the random projection matrix. Therefore, double-sided projection-cost preservation is not necessary and regular projection-cost preservation is sufficient for graph representation learning purpose.

IV Experimental Results

To quantitatively evaluate our FI-GRL framework, we perform various experiments using the learned graph representations. The implementation of our algorithm is publicly available at https://github.com/Jafree/FastInductiveGraphEmbedding.

IV-A Datasets Description and Comparison Methods

All datasets used in this paper are undirected graphs available from the SNAP Datasets platform [20]. These networks vary widely in network type, network scale, edge density, connecting patterns and cluster profiles. They comprise three social networks: karate (real), youtube (online), enron-email (communication); three collaboration networks: ca-hepth, dblp, ca-condmat; and three entity networks: dolphins (animals), us-football (organizations), polblogs (hyperlinks). To show the characteristics of these datasets, we use a community detection algorithm designed for graphs, RankCom [21], to reveal the cluster profiles (number of clusters and maximum cluster size). The detailed information is summarized in Table II.

Datasets #Nodes #Edges #Clusters (RankCom) #Max members (RankCom)
karate 34 78 2 18
dolphins 62 159 3 29
us-football 115 613 11 17
polblogs 1,224 19,090 7 675
ca-hepth 9,877 25,998 995 446
ca-condmat 23,133 93,497 2,456 797
email-enron 36,692 183,831 3,888 3,914
youtube 334,863 925,872 15,863 37,255
dblp 317,080 1,049,866 25,633 1,099
TABLE II: Summary of datasets and their cluster profiles.

We compare our framework against state-of-the-art algorithms. The first five are graph representation learning methods; the others are structural hole detection methods. We summarize them as follows:

  • FI-GRL: Our inductive representation learning approach.

  • GraphSAGE [22]: Sampling and aggregating strategy is applied to integrate neighbors’ information.

  • Deepwalk [11]: Truncated random walk and language modeling techniques are adopted to learn representations.

  • node2vec [23]: Skip-gram framework is extended to networks.

  • LINE [8]: The version of combining first-order and second-order proximity is used here.

  • HAM [2]: A harmonic modularity function is presented to tackle the structural hole detection problem.

  • Constraint [24]: Constructing a constraint to prune nodes without certain connectivity.

  • Pagerank [25]: Nodes with high PageRank scores are assumed to be structural holes.

  • Betweenness Centrality (BC) [26]: Nodes with highest BC will be selected as structural holes.

  • HIS [27]: Optimizes the provided objective function with a two-stage information flow model.

  • AP_BICC [28]: This method is designed by exploiting the approximate inverse closeness centralities.

IV-B Clustering

Modularity Permanence
Datasets Clustering Methods FI-GRL node2vec GraphSAGE LINE Deepwalk FI-GRL node2vec GraphSAGE LINE Deepwalk
karate k-means 0.410(1) 0.335(5) 0.381(4) 0.403(2) 0.396(3) 0.474(1) 0.335(3) 0.322(4) 0.182(5) 0.350(2)
AM 0.410(2) 0.335(4) 0.401(3) 0.239(5) 0.430(1) 0.474(1) 0.205(5) 0.339(2) 0.232(4) 0.311(3)
dolphins k-means 0.489(1) 0.460(2) 0.370(4) 0.187(5) 0.401(3) 0.235(1) 0.196(2) 0.158(4) -0.166(5) 0.187(3)
AM 0.462(1) 0.458(2) 0.355(4) 0.271(5) 0.393(3) 0.215(1) 0.132(3) 0.121(4) -0.189(5) 0.189(2)
us-football k-means 0.607(1) 0.605(2) 0.485(4) 0.562(3) 0.464(5) 0.323(1) 0.304(3) 0.124(4) 0.311(2) 0.039(5)
AM 0.611(1) 0.589(2) 0.470(4) 0.492(3) 0.464(5) 0.315(1) 0.279(3) 0.116(4) 0.307(2) 0.039(5)
ca-hepth k-means 0.611(1) 0.597(2) 0.399(4) 0.01(5) 0.424(3) 0.393(1) 0.379(2) 0.287(3) -0.948(5) 0.261(4)
AM 0.623(1) 0.606(2) 0.423(4) 0.05(5) 0.453(3) 0.427(1) 0.406(2) 0.327(4) -0.949(5) 0.338(3)
condmat k-means 0.527(1) 0.515(2) 0.409(3) 0(5) 0.357(4) 0.371(1) 0.330(2) 0.206(3) -0.984(5) 0.197(4)
AM 0.544(1) 0.520(2) 0.427(3) 0(5) 0.370(4) 0.392(1) 0.388(2) 0.213(4) -0.994(5) 0.249(3)
enron-email k-means 0.322(1) 0.213(3) 0.231(2) 0(5) 0.178(4) 0.175(1) 0.080(2) 0.067(3) -0.985(5) 0.049(4)
AM 0.327(1) 0.218(2) 0.211(3) 0(5) 0.207(4) 0.187(1) 0.180(2) 0.058(4) -0.996(5) 0.108(3)
polblogs k-means 0.427(1) 0.357(2) 0.278(3) 0.200(4) 0.084(5) 0.130(1) -0.066(2) -0.106(3) -0.569(5) -0.187(4)
AM 0.425(1) 0.376(2) 0.291(3) 0.266(4) 0.065(5) 0.131(1) -0.096(2) -0.123(3) -0.509(5) -0.176(4)
TABLE III: Performance on Clustering evaluated by Modularity and Permanence

We first test our algorithm on the clustering task. We set the approximation ratio , and set the dimension of representations at most for all graphs. First, we run our FI-GRL algorithm on the graphs to learn low-dimensional representations. Then, two different types of clustering algorithms, i.e., $k$-means clustering and the agglomerative clustering method (AM), are applied. We use two different metrics to evaluate the clustering results, i.e., modularity [29] and permanence [30].

  • Modularity [29]: This is the most widely used metric for evaluating clustering results on graphs. Modularity measures the benefit of nodes joining a cluster under the null model. Specifically, modularity is defined as: $Q = \frac{1}{2m} \sum_{i,j} \left( \mathbf{W}_{ij} - \frac{d_i d_j}{2m} \right) \delta(c_i, c_j)$, where $m$ is the total edge weight, $d_i$ is the degree of node $i$, $\delta$ is the indicator function and $c_i$ indicates the cluster node $i$ belongs to. Generally, a modularity score greater than means a good clustering result. To punish clearly wrong cluster membership assignments, we add a penalty which is proportional to the inverse of the node’s degree.

  • Permanence [30]: Permanence is a node-based metric that explicitly evaluates the cluster membership of each node. It is stricter, since it also considers the configuration of the clusters a node connects to. For a node $v$ in cluster $c$, the permanence is defined as follows: $\mathrm{Perm}(v) = \frac{I(v)}{E_{\max}(v)} \cdot \frac{1}{D(v)} - \left(1 - c_{in}(v)\right)$, where $I(v)$ is the internal degree, $E_{\max}(v)$ is the maximum number of links from $v$ to any single other cluster, $D(v)$ is the degree of $v$, and $c_{in}(v)$ is the internal clustering coefficient. The total permanence score of the graph is the sum of the permanence scores of all nodes. Empirically, a positive permanence score indicates a good clustering result.
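For reference, modularity can be computed directly from the adjacency matrix and a cluster assignment. The sketch below implements the standard definition from [29] (observed edge weight minus the weight expected under the null model, summed over node pairs in the same cluster and divided by twice the total edge weight). It does not include the degree-based penalty used in our experiments, and the function name `modularity` and the toy graph are hypothetical.

```python
import numpy as np


def modularity(W, labels):
    """Standard modularity Q of a cluster assignment on a weighted undirected graph."""
    W = np.asarray(W, dtype=float)
    labels = np.asarray(labels)
    degrees = W.sum(axis=1)
    two_m = degrees.sum()                                # twice the total edge weight
    same_cluster = labels[:, None] == labels[None, :]    # indicator delta(c_i, c_j)
    expected = np.outer(degrees, degrees) / two_m        # null-model edge weight
    return ((W - expected) * same_cluster).sum() / two_m


# Toy graph (an assumption): two triangles joined by the bridge edge (2, 3).
W = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[u, v] = W[v, u] = 1.0

q_triangles = modularity(W, [0, 0, 0, 1, 1, 1])   # one cluster per triangle
q_single = modularity(W, [0, 0, 0, 0, 0, 0])      # everything in one cluster
print(q_triangles, q_single)
```

Placing all nodes in a single cluster always gives a modularity of zero, so the positive score of the two-triangle assignment reflects structure beyond what the null model explains.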

The results are listed in Table III. FI-GRL outperforms the other graph representation learning algorithms, i.e., GraphSAGE, node2vec, LINE and Deepwalk, on almost all datasets using $k$-means and AM in terms of both modularity and permanence (the exception is the karate network, where Deepwalk outperforms FI-GRL in terms of modularity under AM). node2vec achieves the second best results overall, but it performs poorly on small networks, i.e., karate in terms of modularity and karate, dolphins and us-football in terms of permanence. LINE fails to capture the macroscopic structure of the graph since it only preserves local information. The zero modularity results of LINE mean that LINE assigns many nodes to a clearly wrong cluster. GraphSAGE and Deepwalk give mediocre results. In terms of permanence, none of the other methods preserves the cluster information on polblogs, which is a tough case. The overall performance is reported in Figure 3. More precisely, combining the results using $k$-means and AM, in terms of modularity our algorithm FI-GRL improves node2vec by , GraphSAGE by , LINE by and Deepwalk by . FI-GRL gets an improvement of over node2vec, over GraphSAGE and over Deepwalk in terms of permanence.

Fig. 3: The overall performance on Clustering in terms of Modularity and Permanence.

IV-C Structural Hole Detection

We consider another task, which is focusing on the microscopic level, called structural hole detection. Structural holes are the important nodes that locate at key topological positions. Once they are removed, the network will fall apart. Finding structural holes in graphs is a critical task for graph theory and information diffusion. To achieve this task, we first transform graph into a low-dimensional subspace using our algorithm, and then find structural holes in that space. We devise a metric for ranking nodes in the low-dimensional subspace:

  • Relative Deviation Score (RDS): Let be the low-dimensional representation for node . -means will give a clustering result with cluster set . RDS estimates the deviation of a node from its own cluster attracted by other clusters in terms of relative radius. More precisely, , where is the cluster that belongs to. represents the radius of cluster . And is the center of cluster .

Nodes with highest RDS scores are regarded as the structural holes since they strongly connect at least two clusters. We use an evaluation metric called Structural Hole Influence Index (SHII) [2] to evaluate the selected structural holes. SHII is computed via a process of information diffusion. For each selected structural hole, we run the information diffusion under linear threshold model (LT) and independent cascade model (IC) 10000 times to get average SHII score. The SHII score is defined as follows:

  • Structural Hole Influence Index [2]: Note that generally a node cannot activate the influence maximization process by itself. For a selected structural hole , we want to do the following procedure several times: combining with some randomly chosen node set in cluster as a seed set to engage a influence maximization process in the network. SHII evaluate the ratio of activated nodes that are in other clusters , where is the set of communities and is the indicator function indicating whether node is influenced. And in our experiment, we set the size of the sampled activation set as .

The results are shown in Table IV. According to the characteristics of each network, we tune all algorithms to select a fixed number of structural holes, since selecting too many would activate the entire network. For the karate network, three structural holes are selected. The topological structure of karate shown in [31] confirms that the structural holes selected by our algorithm occupy critical positions bridging the two clusters. On every dataset and under both diffusion models, our results are superior to those of the other structural hole selection methods, including the state-of-the-art algorithm, HAM [2]. This demonstrates the efficacy of our algorithm in preserving microscopic structure.

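| Dataset | #SH | Influence model | FI-GRL | HAM | Constraint | PageRank | BC | HIS | AP_BICC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| karate | 3 | LT | 0.595 | 0.343 | 0.295 | 0.159 | 0.159 | 0.132 | 0.295 |
| karate | 3 | IC | 0.003 | 0.002 | 0.002 | 0.001 | 0.001 | 0.001 | 0.002 |
| karate | 3 | Structural holes | [3 14 20] | [3 20 9] | [1 34 3] | [34 1 33] | [1 34 33] | [32 9 14] | [1 3 34] |
| youtube | 78 | LT | 4.129 | 3.951 | 2.447 | 1.236 | 1.226 | 3.198 | 1.630 |
| youtube | 78 | IC | 3.024 | 2.452 | 1.254 | 0.662 | 0.791 | 2.148 | 0.799 |
| dblp | 42 | LT | 6.873 | 5.384 | 0.404 | 0.357 | 0.958 | 0.718 | 0.550 |
| dblp | 42 | IC | 5.251 | 3.578 | 0.229 | 0.190 | 0.821 | 0.304 | 0.495 |

TABLE IV: Performance on Structural Hole Detection under LT and IC Models. Columns FI-GRL through AP_BICC are the compared methods.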

IV-D Performance on Unseen Nodes

To evaluate the performance of our algorithm under the inductive learning scenario, we artificially simulate the process of generating unseen nodes. Specifically, for a static graph, we randomly extract proportion of nodes as an original graph for graph representation learning. The other nodes are treated as unseen nodes. In this experiment, we set the approximation ratio as , which is accurate enough for most applications. FI-GRL uses a folding-in technique to learn meaningful representations for unseen nodes. If the learned representations are accurate, k-means clustering over the representations of the entire graph will give satisfactory results.
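
As an illustration of the folding-in idea, the sketch below uses the classical formulation from latent semantic indexing: after a truncated SVD of a matrix `M` whose rows describe the seen nodes, a new row is projected onto the existing right singular vectors and rescaled by the singular values. This section does not state which matrix FI-GRL factorizes or how the row of an unseen node is built, so `M`, `fit_embedding` and `fold_in` are assumptions made for the example.

```python
import numpy as np

def fit_embedding(M, k):
    """Truncated SVD of M (rows = seen nodes); returns U_k, s_k, Vt_k."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

def fold_in(rows, s, Vt):
    """Embed new rows without recomputing the SVD.

    Classical folding-in: u_new = row @ V_k @ diag(1 / s_k).
    `rows` may be a single vector or an (m, n_cols) array.
    """
    rows = np.atleast_2d(np.asarray(rows, dtype=float))
    return rows @ Vt.T / s
```

A useful sanity check is that folding in a row that was already part of `M` reproduces its row of `U_k`, because `M @ V_k` equals `U_k @ diag(s_k)`. The cost per unseen node is a single matrix-vector product, which is why folding-in is cheap compared with refactorizing.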

Figure 4 shows the clustering performance, measured by modularity, as the proportion of unseen nodes (i.e., ) varies. When the proportion of unseen nodes is not greater than , the clustering quality remains stable and good. Once the proportion increases to , the many added unseen nodes dramatically change the main skeleton of the network. Since we add a penalty for clearly wrong cluster assignments, the clustering performance degrades sharply. For a small network like dolphins, the results fluctuate to a certain extent, e.g., at and . In fact, polblog and football give the most stable performance, as they retain almost the same results from to . We conjecture that the representations of unseen nodes are more accurate if the nodes are well-connected (polblog has a relatively high edge density) or the network is well-structured and invulnerable (football has 11 clusters of nearly equal size). Overall, FI-GRL is flexible enough to give satisfactory representations for inductive learning even when the proportion of unseen nodes is large (up to ).
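
For reference, the standard modularity of [29] for an undirected graph can be computed as in the sketch below. It assumes a dense symmetric adjacency matrix and does not include the penalty for wrong cluster assignments mentioned above, whose form is not given in this section; the function name `modularity` is illustrative.

```python
import numpy as np

def modularity(A, labels):
    """Newman modularity Q = (1 / 2m) * sum_ij (A_ij - k_i k_j / 2m) * [c_i == c_j].

    A      : (n, n) symmetric adjacency matrix of an undirected graph.
    labels : (n,) cluster id of each node.
    """
    A = np.asarray(A, dtype=float)
    labels = np.asarray(labels)
    degrees = A.sum(axis=1)
    two_m = degrees.sum()                         # equals twice the edge count
    same_cluster = labels[:, None] == labels[None, :]
    B = A - np.outer(degrees, degrees) / two_m    # modularity matrix
    return float((B * same_cluster).sum() / two_m)
```

For two disconnected triangles, assigning each triangle to its own cluster gives a modularity of 0.5, while putting all six nodes in one cluster gives 0.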

Fig. 4: Performance on unseen nodes evaluated on clustering

IV-E Parameter Analysis: Approximation Ratio and Sketch Size

To quantitatively measure the ability of our framework to capture crucial information of graphs and preserve the projection-cost, we evaluate the performance of our algorithm while varying the approximation ratio and the sketch size. To give a visual sense, Figure 5 plots three-dimensional and two-dimensional representations of the karate network [31], which has two ground-truth clusters and several structural holes, with the approximation ratio set to and , respectively. At , nodes of the two clusters are mixed with each other, so at a low resolution the cluster information is not well maintained in the learned subspace. At , by contrast, nodes are located in clusters exactly as in the ground truth. Moreover, structural holes bridging the two clusters can be easily identified from the two-dimensional view, where nodes in different clusters form nearly orthogonal subspaces and are linearly separable.

Fig. 5: Visualization at different approximation ratios

To demonstrate the ability of our algorithm in preserving projection-cost, we compute the relative projection-cost with the variation of the sketch size. More precisely, we calculate

where is the residual of optimal rank approximation on . The dimension is set to , which is sufficient for the applications we consider. We run our algorithm times at each sketch size, and the results are shown in Figure 6. The relative projection-cost decreases rapidly even at very small sketch sizes. At a sketch size of , FI-GRL already achieves excellent results. Towards a sketch size of , the result is nearly optimal for graph representation learning purposes. For large networks, the approximation is even more accurate: since the network is usually very sparse, nodes lie in a subspace that is small compared to the size of the network. Although our algorithm is randomized, the variance of at each sketch size is rather small. This implies that FI-GRL can be treated as a deterministic algorithm in practice, since it rarely fails to preserve the projection-cost, especially when the sketch size is large.
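
The displayed formula is missing from this text, so the following is only a sketch of how such a measurement could be set up. It assumes a Gaussian random projection applied to the columns of a matrix `M`, takes the top `k` left singular vectors of the resulting sketch, and divides the cost of projecting `M` onto them by the residual of the optimal rank-`k` approximation. The function name, the choice of a Gaussian projection, and the exact ratio are assumptions rather than the authors' definition.

```python
import numpy as np

def relative_projection_cost(M, sketch_size, k, seed=0):
    """Cost of the rank-k projection learned from a sketch, relative to optimal.

    ASSUMPTIONS: Gaussian column sketch M @ R, and the ratio
        ||M - U_k U_k^T M||_F^2 / ||M - M_k||_F^2,
    where U_k holds the top-k left singular vectors of the sketch and
    M_k is the best rank-k approximation of M.
    Requires sketch_size >= k and rank(M) > k.
    """
    rng = np.random.default_rng(seed)
    n_cols = M.shape[1]
    R = rng.standard_normal((n_cols, sketch_size)) / np.sqrt(sketch_size)
    sketch = M @ R                                   # shape (n_rows, sketch_size)

    U, _, _ = np.linalg.svd(sketch, full_matrices=False)
    Uk = U[:, :k]
    cost = np.linalg.norm(M - Uk @ (Uk.T @ M), "fro") ** 2

    singular_values = np.linalg.svd(M, compute_uv=False)
    optimal = np.sum(singular_values[k:] ** 2)       # Eckart-Young residual
    return cost / optimal
```

By the Eckart-Young theorem this ratio can never drop below 1, so values close to 1 indicate that the sketch has preserved the subspace that matters for rank-`k` approximation.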

Fig. 6: Relative Projection-Cost with the variation of Sketch Size

IV-F Running Time

Finally, Table V lists the computational time of our algorithm in the static scenario against other competing graph representation algorithms. All other algorithms learn the representations of dimension. FI-GRL is outstanding in terms of computational cost. At sketch size of where FI-GRL can give nearly optimal results for the tested datasets, it takes only minutes to learn the graph representations on dblp, whereas GraphSAGE, node2vec and Deepwalk take more than 10 hours to accomplish the same task.

| Methods (Sketch Size) | karate | dolphins | us-football | polblog | ca-hepth | ca-condmat | email-enron | youtube | dblp |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FI-GRL(100) | 0.005s | 0.007s | 0.042s | 0.032s | 0.181s | 0.424s | 0.672s | 7.14s | 8.33s |
| FI-GRL(200) | 0.006s | 0.009s | 0.051s | 0.064s | 0.388s | 0.913s | 1.508s | 15.85s | 18.09s |
| FI-GRL(500) | 0.019s | 0.015s | 0.037s | 0.081s | 0.976s | 2.804s | 6.776s | 45.73s | 57.30s |
| FI-GRL(1000) | 0.040s | 0.035s | 0.047s | 0.159s | 2.576s | 7.262s | 11.48s | 1m41s | 2m14s |
| node2vec | 0.807s | 3.110s | 1.442s | 33.34s | 74.83s | 2m57s | 48m17s | 10h | 10h |
| Deepwalk | 4.123s | 10.876s | 10.92s | 2m10s | 15m59s | 43m9s | 1h18m | 10h | 10h |
| GraphSAGE | 18.348s | 43.791s | 37.252s | 6m3s | 49m20s | 3h51m | 4h39m | 10h | 10h |

TABLE V: Running time

V Related Work

V-A Graph Representation Learning

Graph representation learning is an important problem because it enables classic machine learning and data mining algorithms to be applied to graphs. Some methods explicitly preserve proximity between nodes: [8] introduces an edge-sampling method, [9] develops a semi-supervised deep model, and [32] enhances communities and structural holes via non-backtracking random walks. Some methods exploit matrix factorization: [33] factorizes asymmetric-transitivity-related matrices on directed graphs, and [13] proposes an update algorithm on matrix forms. Other algorithms cast the problem as a traditional machine learning task: Deepwalk [11] learns latent representations by treating truncated walks as sentences, and [12] optimizes a max-margin classifier. Several methods focus on the heterogeneous scenario: [6] models multi-view graph data as tensors, [10] learns representations of clusters, [7] learns context-aware representations, [34] creates a multi-resolution deep architecture, [14] formulates a Deepwalk-based matrix factorization that incorporates text features, and [5] introduces metapath-based random walks for representation learning.

One line of work similar to our approach is graph representation learning on dynamic networks. [35] investigates the role of closed triads at different time steps. [36] integrates node attributes by utilizing spectral decomposition and matrix perturbation theory in a dynamic setting. [37] aims at preserving high-order proximity by using nonparametric probabilistic modeling and deep learning, and can be generalized to unseen nodes. [22] presents several types of aggregators for aggregating features from nodes’ local neighborhoods. In contrast to these works, by introducing randomization and approximation strategies, our approach focuses on building a graph representation learning framework that is fast, theoretically guaranteed, and able to generalize to unseen nodes.

V-B Randomized Dimension Reduction

Randomized algorithms are often adopted in dimension reduction because of their speed and the solid theory supporting them. [38] surveys randomized algorithms for low rank approximation and presents several algorithms for addressing different situations. The algorithm proposed there is more robust than Krylov subspace methods for sparse input matrices. [39] adapts a well-known streaming algorithm for approximating item frequencies to find the matrix sketch. Combined with SVD and a special update strategy, the proposed algorithm becomes deterministic and computationally competitive. [16] devises a theoretical framework by deriving a series of bounds on the dimensions required for applying random row projection, column selection, and approximate SVD, which can be used to better solve k-means clustering and low rank approximation problems. [40] uses random projection in a cluster ensemble approach to achieve better and more robust clustering performance. [41] applies random projection to text and image data and empirically demonstrates that it yields results comparable to conventional deterministic methods (e.g., PCA) at a significantly lower computational cost. [42] presents the first provably accurate feature selection method for k-means clustering. It also proposes two feature extraction methods using random projection and fast approximate SVD, which improve upon the existing results in terms of time complexity. Our approach applies state-of-the-art randomization strategies to graph representation learning, and we prove several bounds and theorems that guarantee the performance of the learned representations.

VI Conclusion

In this paper, we propose a fast inductive graph representation learning framework, namely FI-GRL, to transform the topological structure of graphs into a low-dimensional space. It explicitly decouples relational information in graphs into a randomized subspace spanned by a random projection matrix. The sketch obtained is much smaller and yet inherits the property associated with the normalized cut by preserving the projection-cost. By exploiting constrained low rank approximation, the dimension of the sketch is further reduced and the compact hidden pattern is finally extracted. The connection between randomized algorithms and graph representation learning is established through thorough theoretical analysis. FI-GRL is flexible enough to deal with massive-scale graphs and graphs with unseen nodes. Overall, our algorithm is fast, easy to implement and theoretically guaranteed. The empirical study demonstrates the superiority of our algorithm in both efficacy and efficiency.

Acknowledgement

This work is supported in part by National Key R&D Program of China through grants 2016YFB0800700, NSF through grants IIS-1526499, IIS-1763325, CNS-1626432, and NSFC 61672313, 61672051, 61872101.

References

  • [1] H. Qiu and E. R. Hancock, “Clustering and embedding using commute times,” TPAMI, 2007.
  • [2] L. He, C.-T. Lu, J. Ma, J. Cao, L. Shen, and P. S. Yu, “Joint community and structural hole spanner detection via harmonic modularity,” in KDD, 2016.
  • [3] Y. Lin, Z. Liu, M. Sun, Y. Liu, and X. Zhu, “Learning entity and relation embeddings for knowledge graph completion.” in AAAI, vol. 15, 2015, pp. 2181–2187.
  • [4] L. v. d. Maaten and G. Hinton, “Visualizing data using t-sne,” Journal of machine learning research, vol. 9, no. Nov, pp. 2579–2605, 2008.
  • [5] Y. Dong, N. V. Chawla, and A. Swami, “metapath2vec: Scalable representation learning for heterogeneous networks,” in KDD.   ACM, 2017.
  • [6] G. Ma, L. He, C.-T. Lu, W. Shao, P. S. Yu, A. D. Leow, and A. B. Ragin, “Multi-view clustering with graph embedding for connectome analysis,” in CIKM.   ACM, 2017.
  • [7] C. Tu, H. Liu, Z. Liu, and M. Sun, “Cane: Context-aware network embedding for relation modeling,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, 2017, pp. 1722–1731.
  • [8] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei, “Line: Large-scale information network embedding,” in Proceedings of the 24th International Conference on World Wide Web.   ACM, 2015, pp. 1067–1077.
  • [9] D. Wang, P. Cui, and W. Zhu, “Structural deep network embedding,” in KDD.   ACM, 2016.
  • [10] S. Cavallari, V. W. Zheng, H. Cai, K. C.-C. Chang, and E. Cambria, “Learning community embedding with community detection and node embedding on graphs,” in CIKM.   ACM, 2017.
  • [11] B. Perozzi, R. Al-Rfou, and S. Skiena, “Deepwalk: Online learning of social representations,” in KDD.   ACM, 2014.
  • [12] C. Tu, W. Zhang, Z. Liu, and M. Sun, “Max-margin deepwalk: Discriminative learning of network representation.” in IJCAI, 2016, pp. 3889–3895.
  • [13] C. Yang, M. Sun, Z. Liu, and C. Tu, “Fast network embedding enhancement via high order proximity approximation,” in Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI, 2017, pp. 19–25.
  • [14] C. Yang, Z. Liu, D. Zhao, M. Sun, and E. Y. Chang, “Network representation learning with rich text information.” in IJCAI, 2015, pp. 2111–2117.
  • [15] J. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEE Transactions on pattern analysis and machine intelligence, vol. 22, no. 8, pp. 888–905, 2000.
  • [16] M. B. Cohen, S. Elder, C. Musco, C. Musco, and M. Persu, “Dimensionality reduction for k-means clustering and low rank approximation,” in Proceedings of the forty-seventh annual ACM symposium on Theory of computing.   ACM, 2015, pp. 163–172.
  • [17] X. S. Chen, W. Li, W. W. Xu et al., “Perturbation analysis of the eigenvector matrix and singular vector matrices,” Taiwanese Journal of Mathematics, vol. 16, no. 1, pp. 179–194, 2012.
  • [18] C. Boutsidis, P. Drineas, and M. W. Mahoney, “Unsupervised feature selection for the k-means clustering problem,” in Advances in Neural Information Processing Systems, 2009, pp. 153–161.
  • [19] S. Dasgupta, “Experiments with random projection,” in Proceedings of the Sixteenth conference on Uncertainty in artificial intelligence.   Morgan Kaufmann Publishers Inc., 2000, pp. 143–151.
  • [20] J. Leskovec and A. Krevl, “SNAP Datasets: Stanford large network dataset collection,” http://snap.stanford.edu/data, 2014.
  • [21] F. Jiang, Y. Yang, S. Jin, and J. Xu, “Fast search to detect communities by truncated inverse page rank in social networks,” in Mobile Services (MS), 2015 IEEE International Conference on.   IEEE, 2015, pp. 239–246.
  • [22] W. Hamilton, Z. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” in Advances in Neural Information Processing Systems, 2017, pp. 1025–1035.
  • [23] A. Grover and J. Leskovec, “node2vec: Scalable feature learning for networks,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.   ACM, 2016, pp. 855–864.
  • [24] R. S. Burt, Structural holes: The social structure of competition.   Harvard university press, 2009.
  • [25] L. Page, S. Brin, R. Motwani, and T. Winograd, “The pagerank citation ranking: Bringing order to the web.” Stanford InfoLab, Tech. Rep., 1999.
  • [26] U. Brandes, “A faster algorithm for betweenness centrality,” Journal of mathematical sociology, 2001.
  • [27] T. Lou and J. Tang, “Mining structural hole spanners through information diffusion in social networks,” in WWW, 2013.
  • [28] M. Rezvani, W. Liang, W. Xu, and C. Liu, “Identifying top-k structural hole spanners in large-scale social networks,” in CIKM, 2015.
  • [29] M. E. Newman, “Modularity and community structure in networks,” PNAS, 2006.
  • [30] T. Chakraborty, S. Srinivasan, N. Ganguly, and S. Bhowmick, “On the permanence of vertices in network communities,” in KDD, 2014.
  • [31] W. W. Zachary, “An information flow model for conflict and fission in small groups,” Journal of anthropological research, vol. 33, no. 4, pp. 452–473, 1977.
  • [32] F. Jiang, L. He, Y. Zheng, E. Zhu, J. Xu, and P. S. Yu, “On spectral graph embedding: A non-backtracking perspective and graph approximation,” in Proceedings of the 2018 SIAM International Conference on Data Mining.   SIAM, 2018, pp. 324–332.
  • [33] M. Ou, P. Cui, J. Pei, Z. Zhang, and W. Zhu, “Asymmetric transitivity preserving graph embedding,” in KDD, 2016.
  • [34] S. Chang, W. Han, J. Tang, G.-J. Qi, C. C. Aggarwal, and T. S. Huang, “Heterogeneous network embedding via deep architectures,” in KDD.   ACM, 2015.
  • [35] L. Zhou, Y. Yang, X. Ren, F. Wu, and Y. Zhuang, “Dynamic network embedding by modeling triadic closure process,” in AAAI, 2018.
  • [36] J. Li, H. Dani, X. Hu, J. Tang, Y. Chang, and H. Liu, “Attributed network embedding for learning in a dynamic environment,” in Proceedings of CIKM.   ACM, 2017.
  • [37] J. Ma, P. Cui, and W. Zhu, “Depthlgp: Learning embeddings of out-of-sample nodes in dynamic networks,” in AAAI, 2018.
  • [38] N. Halko, P.-G. Martinsson, and J. A. Tropp, “Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions,” SIAM review, vol. 53, no. 2, pp. 217–288, 2011.
  • [39] E. Liberty, “Simple and deterministic matrix sketching,” in KDD.   ACM, 2013, pp. 581–588.
  • [40] X. Z. Fern and C. E. Brodley, “Random projection for high dimensional data clustering: A cluster ensemble approach,” in Proceedings of the 20th International Conference on Machine Learning (ICML-03), 2003, pp. 186–193.
  • [41] E. Bingham and H. Mannila, “Random projection in dimensionality reduction: applications to image and text data,” in KDD.   ACM, 2001.
  • [42] C. Boutsidis, A. Zouzias, M. W. Mahoney, and P. Drineas, “Randomized dimensionality reduction for k-means clustering,” IEEE Transactions on Information Theory, vol. 61, no. 2, pp. 1045–1062, 2015.