FIGRL: Fast Inductive Graph Representation Learning via Projection-Cost Preservation
Abstract
Graph representation learning aims at transforming graph data into meaningful low-dimensional vectors so that machine learning and data mining algorithms designed for general data can be employed. Most current graph representation learning approaches are transductive: they require that all nodes in the graph be known when learning the representations, and they cannot naturally generalize to unseen nodes. In this paper, we present a Fast Inductive Graph Representation Learning framework (FIGRL) to learn nodes’ low-dimensional representations. Our approach obtains accurate representations for seen nodes with provable theoretical guarantees and easily generalizes to unseen nodes. Specifically, in order to explicitly decouple the nodes’ relations expressed by the graph, we transform nodes into a randomized subspace spanned by a random projection matrix. This stage is guaranteed to preserve the projection-cost of the normalized random walk matrix, which is closely related to the normalized cut of the graph. Feature extraction is then achieved by conducting singular value decomposition on the obtained matrix sketch. By leveraging the projection-cost preservation property of the matrix sketch, the obtained representations are nearly optimal. To deal with unseen nodes, we utilize a folding-in technique to learn their meaningful representations. Empirically, when the number of seen nodes is larger than the number of unseen nodes, FIGRL consistently achieves excellent results. Our algorithm is fast, simple to implement and theoretically guaranteed. Extensive experiments on real datasets demonstrate the superiority of our algorithm, in both efficacy and efficiency, on macroscopic level (clustering) and microscopic level (structural hole detection) applications.
I. Introduction
Graphs, with nodes representing entities and edges representing relationships between entities, are ubiquitous in various research fields. However, since a graph is naturally expressed in a node-interrelated way, it is laborious to directly design complicated graph algorithms for the various kinds of mining and analytic purposes on graph data. Graph representation learning (also known as graph embedding or network embedding) aims at learning low-dimensional vectors that represent the nodes without losing much of the information contained in the original graph. Afterwards, one can apply a wealth of off-the-shelf machine learning and data mining algorithms designed for general data to various important applications (e.g., clustering [1], structural hole detection [2], link prediction [3], visualization [4], etc.).
Due to its ability to facilitate graph analysis, graph representation learning has drawn attention from the machine learning and data mining communities [5, 6, 7]. Most of these works focus on static networks, for example explicitly preserving local or high-order proximity [8, 9, 10], learning representations using truncated random walks [11, 12], using matrix factorization techniques to obtain latent vectors [13, 14, 6], or incorporating heterogeneous information [7, 5]. Most of these methods lack theoretical support for producing reliable representation results. They are also inherently transductive: they act as black boxes that only learn representations and have no internal mechanism to naturally generalize to unseen nodes.
Our goal is to design a fast and flexible graph representation learning framework that preserves important graph topological information (e.g., clustering membership, node similarity, etc.) with provable theoretical guarantees and that naturally generalizes to unseen nodes. In this paper, the Fast Inductive Graph Representation Learning (FIGRL) framework is proposed to achieve this goal. FIGRL consists of two stages: decoupling and feature extraction. The intuition of this architecture is illustrated in Figure 1. The first stage decouples the nodes’ relations by utilizing an oblivious algorithm, the Johnson-Lindenstrauss random projection. For a graph $\mathcal{G}$, this stage generates a matrix sketch $\tilde{\mathbf{P}} \in \mathbb{R}^{n \times s}$, where $n$ is the number of nodes of graph $\mathcal{G}$ and $s$ is the sketch size, a parameter that can be automatically determined by our approach and that is much smaller than $n$. The matrix sketch approximates the original matrix well, with theoretical guarantees. The second stage extracts the meaningful features contained in $\tilde{\mathbf{P}}$ by low rank approximation. Dimension reduction is achieved collaboratively by these two stages. The representations produced by this framework are theoretically guaranteed to perform well on constrained low rank approximation tasks (e.g., $k$-means clustering). The main contributions of this paper are summarized as follows:

Architecture and Randomization: The proposed framework FIGRL is fast and flexible enough to handle large graphs. Since the decoupling stage adopts an oblivious randomized algorithm, nodes can be processed sequentially, in a single pass and without storing the entire graph. The matrix sketch is much smaller than the original matrix and can therefore be processed much faster by the feature extraction stage. Moreover, the first stage is projection-cost preserving, which ensures that the representations extracted by the second stage are optimal up to a small approximation ratio. As far as we know, this is the first time that randomized algorithms have been introduced to the graph representation learning problem.

Theoretical Analysis: We analyze our algorithm theoretically. In Theorem 1, we prove the optimality of our algorithm in terms of the absolute difference of projection-cost between the learned representations and the desired representations, and also in terms of the absolute distance difference between their corresponding mean centroids. For the choice of the parameter (the sketch size $s$), we give theoretical guidance in Section III-E and an empirical analysis in Section IV-E.

Inductive Learning: Our two-stage framework naturally generalizes to learn representations of unseen nodes. We adopt an incremental singular value decomposition with a folding-in technique on the matrix sketch to learn representations of unseen nodes. The empirical results reported in Section IV-D demonstrate the effectiveness of our method on unseen nodes.

Empirical Study: FIGRL can produce graph representations of different accuracy for different levels of applications. In Section IV, extensive experiments conducted on both macroscopic level (clustering) and microscopic level (structural hole detection) applications show the superiority of our framework in efficacy and efficiency.
II. Preliminary and Problem Formulation
A graph $\mathcal{G} = (V, E, \mathbf{W})$ is a tuple with three elements: $V$ denotes the node set, $E$ denotes the edge set and $\mathbf{W}$ denotes the weighted adjacency matrix. Without ambiguity, we use the terms graph and network interchangeably. In this paper, to facilitate the distinction, we use lowercase letters (e.g., $a$) to denote scalars, bold lowercase letters (e.g., $\mathbf{a}$) to denote vectors, bold uppercase letters (e.g., $\mathbf{A}$) to denote matrices and calligraphic letters (e.g., $\mathcal{G}$) to denote graphs. The symbol table is shown in Table I.
Symbol  Definition 

$\mathcal{G} = (V, E, \mathbf{W})$  Graph 
$V$, $E$, $\mathrm{vol}(\cdot)$  Node, edge set and its corresponding volume 
$N(v)$  Neighbor set of node $v$ 
$\mathbf{W}$, $\mathbf{D}$  Weighted adjacency, diagonal degree matrices 
$\mathbf{P}$  Normalized random walk matrix 
$\|\cdot\|_F$, $\|\cdot\|_2$  Frobenius norm and spectral norm 
We first define the problem of graph representation learning as follows:
Definition 1.
(Graph Representation Learning) Given a graph $\mathcal{G}$, for a fixed dimension number $d$, graph representation learning aims at learning a map $f: v \mapsto \mathbf{y}_v \in \mathbb{R}^d$ for every $v \in V$.
Graph representation learning generates a low-dimensional vector for every node in the graph. The obtained vectors should preserve important information (e.g., clustering membership, node similarity, etc.) hidden in the graph. The scenario is even more challenging when considering unseen nodes. We define the graph representation learning task for unseen nodes in general terms.
Definition 2.
(Graph Representation Learning for Unseen Nodes) A graph $\mathcal{G}'$ is extended from graph $\mathcal{G}$ by adding nodes and their associated edges after the graph representations of $\mathcal{G}$ have been obtained. For each unseen node $v \in V' \setminus V$, we learn a map $f': v \mapsto \mathbf{y}_v \in \mathbb{R}^d$ without recomputing the representations obtained thus far.
III. Methodology
In this section, we first give some observations that show 1) why we choose the normalized random walk matrix as the initial input matrix of the graph, and 2) how important tasks (e.g., $k$-means clustering) can be reduced to the constrained low rank approximation problem, on which our framework is theoretically guaranteed to work well. Then, we present our two-stage framework in detail and generalize it to unseen nodes. Finally, we provide guidance on choosing parameters and give a complexity analysis. Our FIGRL framework is illustrated in Figure 2.
III-A. Observations
III-A1. Normalized Cut
In the literature [15], image segmentation is treated as a graph partition problem based on three metrics: average association, normalized cut and average cut. It has been demonstrated that normalized cut is a desirable choice, as it seeks a balance between finding clumps and finding split nodes. For a graph $\mathcal{G}$ and two disjoint node sets $A, B \subset V$, the normalized cut is defined as:
$$\mathrm{Ncut}(A, B) = \frac{\mathrm{cut}(A, B)}{\mathrm{vol}(A)} + \frac{\mathrm{cut}(A, B)}{\mathrm{vol}(B)}, \quad \mathrm{cut}(A, B) = \sum_{i \in A,\, j \in B} W_{ij} \qquad (1)$$
The problem of finding a graph partition with optimal normalized cut can be reduced to the generalized eigenvalue problem of the following equation:
$$(\mathbf{D} - \mathbf{W})\,\mathbf{y} = \lambda\,\mathbf{D}\,\mathbf{y} \qquad (2)$$
Then, applying $k$-means clustering to several of the smallest nontrivial eigenvectors achieves the optimal graph partitions in terms of normalized cuts. Thus, the matrix related to Equation 2 is a good choice for representing a graph.
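The observation above can be sketched numerically. The snippet below is a minimal illustration, assuming the matrix related to Equation 2 is taken in its symmetrically normalized form $\mathbf{D}^{-1/2}\mathbf{W}\mathbf{D}^{-1/2}$; the toy graph and all names are ours, not the paper's.

```python
import numpy as np

def spectral_embedding(W, k):
    """Top-k eigenvectors (largest eigenvalues first) of D^{-1/2} W D^{-1/2},
    whose spectrum mirrors the normalized-cut relaxation."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    P = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(P)      # eigh returns ascending eigenvalues
    return vecs[:, ::-1][:, :k]

# Toy graph: two triangles joined by a single bridge edge (2, 3).
W = np.zeros((6, 6))
for a, b in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[a, b] = W[b, a] = 1.0
Y = spectral_embedding(W, 2)
# Thresholding the first nontrivial eigenvector recovers the two triangles.
labels = (Y[:, 1] > 0).astype(int)
```

Clustering the rows of `Y` (here a simple sign threshold) separates the two triangles, which is exactly the normalized-cut partition of this graph.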
III-A2. Constrained Low Rank Approximation
To design a graph representation learning algorithm that preserves graph information such as clustering membership, we consider the constrained low rank approximation problem, to which many important tasks, including $k$-means clustering, can be reduced. Constrained low rank approximation can be defined as follows.
Definition 3.
(Constrained Low Rank Approximation) For a matrix $\mathbf{A} \in \mathbb{R}^{n \times m}$ and any set $S$ of rank-$k$ orthogonal projection matrices in $\mathbb{R}^{n \times n}$, constrained rank-$k$ approximation tries to find
$$\mathbf{X}^{*} = \operatorname*{arg\,min}_{\mathbf{X} \in S} \|\mathbf{A} - \mathbf{X}\mathbf{A}\|_F^2,$$
where $\|\mathbf{A} - \mathbf{X}\mathbf{A}\|_F^2$ is called the projection-cost of $\mathbf{X}$.
$k$-means clustering and the approximate singular value decomposition (SVD) problem are both constrained low rank approximation problems. More precisely, $k$-means clustering aims at dividing the $n$ row vectors $\mathbf{a}_1, \dots, \mathbf{a}_n$ of $\mathbf{A}$ into $k$ clusters $C_1, \dots, C_k$. We denote the centroid of cluster $C_j$ as $\boldsymbol{\mu}_j$. The goal of $k$-means clustering is to minimize the following objective function:
$$\sum_{j=1}^{k} \sum_{\mathbf{a}_i \in C_j} \|\mathbf{a}_i - \boldsymbol{\mu}_j\|_2^2$$
We transform this objective function into matrix form. We denote the cluster indicator matrix as $\mathbf{X} \in \mathbb{R}^{n \times k}$, where $X_{ij} = 1/\sqrt{|C_j|}$ if $\mathbf{a}_i$ is assigned to cluster $C_j$ and $X_{ij} = 0$ otherwise. So in matrix form, $k$-means clustering minimizes the following equation:
$$\|\mathbf{A} - \mathbf{X}\mathbf{X}^T\mathbf{A}\|_F^2$$
Clearly, $\mathbf{X}\mathbf{X}^T$ is a projection matrix that projects each point’s vector onto its cluster centroid. Therefore, by Definition 3, $k$-means clustering is a constrained low rank approximation problem with the set $S$ being all possible projection matrices $\mathbf{X}\mathbf{X}^T$, where $\mathbf{X}$ is a cluster indicator matrix. SVD is also a constrained low rank approximation problem: it tries to find the optimal rank-$k$ approximation of $\mathbf{A}$ over the unconstrained set of all possible rank-$k$ projection matrices, and the optimal solution is given by the top-$k$ left singular vectors of $\mathbf{A}$.
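To make the reduction concrete, the following snippet checks numerically that, for a fixed partition of an assumed toy matrix, the classical $k$-means objective coincides with the projection-cost of the normalized cluster indicator matrix:

```python
import numpy as np

A = np.array([[0., 0.], [0., 1.], [10., 10.], [10., 11.]])
clusters = [[0, 1], [2, 3]]                 # a fixed toy partition

# Cluster indicator matrix X with X[i, j] = 1/sqrt(|C_j|) if point i is in C_j.
X = np.zeros((4, 2))
for j, members in enumerate(clusters):
    for i in members:
        X[i, j] = 1.0 / np.sqrt(len(members))

# Projection-cost of the projection X X^T ...
proj_cost = np.linalg.norm(A - X @ X.T @ A, 'fro') ** 2

# ... equals the classical k-means objective for the same partition.
kmeans_cost = sum(
    np.sum((A[members] - A[members].mean(axis=0)) ** 2)
    for members in clusters
)
```

Here `X @ X.T @ A` replaces each row of `A` by its cluster centroid, so the residual Frobenius norm is exactly the within-cluster sum of squared distances.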
III-B. Decoupling with Projection-Cost Preservation
If we can find a small matrix whose learned graph representations achieve nearly the same results as those learned on the original matrix, we gain substantial speed and space benefits. In this subsection, we show how to obtain this small matrix (which we call the matrix sketch) and demonstrate its optimality. Formally, for a matrix $\mathbf{A} \in \mathbb{R}^{n \times m}$, we want to find a matrix sketch $\tilde{\mathbf{A}} \in \mathbb{R}^{n \times s}$ (where $s \ll m$) that approximates $\mathbf{A}$ well, in the sense that for the constrained low rank approximation problem of Definition 3 we can optimize the projection matrix over the sketch $\tilde{\mathbf{A}}$ instead of over $\mathbf{A}$. Firstly, we define the projection-cost preserving sketch as in [16].
Definition 4.
(Rank-$k$ Projection-Cost Preserving Sketch) $\tilde{\mathbf{A}} \in \mathbb{R}^{n \times s}$ is a rank-$k$ projection-cost preserving sketch of $\mathbf{A} \in \mathbb{R}^{n \times m}$ with error $\epsilon \in (0, 1)$ if, for all rank-$k$ orthogonal projection matrices $\mathbf{X}$,
$$(1 - \epsilon)\,\|\mathbf{A} - \mathbf{X}\mathbf{A}\|_F^2 \le \|\tilde{\mathbf{A}} - \mathbf{X}\tilde{\mathbf{A}}\|_F^2 + c \le (1 + \epsilon)\,\|\mathbf{A} - \mathbf{X}\mathbf{A}\|_F^2,$$
for some fixed nonnegative constant $c$ that may depend on $\mathbf{A}$ and $\tilde{\mathbf{A}}$ but is independent of $\mathbf{X}$.
This definition implies that the projection-cost of any projection on $\tilde{\mathbf{A}}$ is a good estimate of the projection-cost of the same projection on $\mathbf{A}$. The following lemma indicates that if $\tilde{\mathbf{A}}$ is a rank-$k$ projection-cost preserving sketch of $\mathbf{A}$, one can optimize over $\tilde{\mathbf{A}}$ to get a near-optimal projection matrix for the constrained low rank approximation problem.
Lemma 1.
Suppose $\tilde{\mathbf{A}}$ is a projection-cost preserving sketch of $\mathbf{A}$ with approximation ratio $(1 \pm \epsilon)$ over the set $S$ of all rank-$k$ projection matrices. Let $\tilde{\mathbf{X}}^{*} = \operatorname*{arg\,min}_{\mathbf{X} \in S} \|\tilde{\mathbf{A}} - \mathbf{X}\tilde{\mathbf{A}}\|_F^2$ and $\mathbf{X}^{*} = \operatorname*{arg\,min}_{\mathbf{X} \in S} \|\mathbf{A} - \mathbf{X}\mathbf{A}\|_F^2$. Then,
$$\|\mathbf{A} - \tilde{\mathbf{X}}^{*}\mathbf{A}\|_F^2 \le \frac{1 + \epsilon}{1 - \epsilon}\,\|\mathbf{A} - \mathbf{X}^{*}\mathbf{A}\|_F^2.$$
Proof. By Definition 4, $(1 - \epsilon)\|\mathbf{A} - \tilde{\mathbf{X}}^{*}\mathbf{A}\|_F^2 \le \|\tilde{\mathbf{A}} - \tilde{\mathbf{X}}^{*}\tilde{\mathbf{A}}\|_F^2 + c \le \|\tilde{\mathbf{A}} - \mathbf{X}^{*}\tilde{\mathbf{A}}\|_F^2 + c \le (1 + \epsilon)\|\mathbf{A} - \mathbf{X}^{*}\mathbf{A}\|_F^2$, where the middle inequality uses the optimality of $\tilde{\mathbf{X}}^{*}$ on $\tilde{\mathbf{A}}$. ∎
The above lemma provides a theoretical guarantee that we can extract meaningful information from a matrix sketch, which is computationally efficient. To capture the graph information related to the normalized cut, we consider Equation 2. Solving this generalized eigenvalue problem directly is not convenient: it is not easy to compute incrementally and, more importantly, it is not a constrained low rank approximation problem. Therefore, we transform it into an SVD problem by substituting $\mathbf{y} = \mathbf{D}^{-1/2}\mathbf{z}$ in Equation 2, which gives
$$\mathbf{D}^{-1/2}(\mathbf{D} - \mathbf{W})\mathbf{D}^{-1/2}\,\mathbf{z} = \bigl(\mathbf{I} - \mathbf{D}^{-1/2}\mathbf{W}\mathbf{D}^{-1/2}\bigr)\,\mathbf{z} = \lambda\,\mathbf{z}.$$
Removing the non-relevant terms, we actually want to find the top-$k$ eigenvectors of $\mathbf{P} = \mathbf{D}^{-1/2}\mathbf{W}\mathbf{D}^{-1/2}$, which we denote as $\mathbf{U}_k$; $\mathbf{P}$ is called the normalized random walk matrix. If we have a matrix sketch of $\mathbf{P}$, we can use the top-$k$ left singular vectors of the sketch to approximate the top-$k$ eigenvectors of $\mathbf{P}$, since $\mathbf{P}$ is a symmetric matrix.
Next, we decouple the nodes’ relations in the graph, in terms of $\mathbf{P}$, while preserving the projection-cost. One can regard each node as a row vector of $\mathbf{P}$, which is generally sparse. To remove the connections between nodes, we randomly project their vectors onto $s$ directions so that the nodes’ vectors are mapped into a random subspace. More precisely, for a node vector $\mathbf{p} \in \mathbb{R}^n$, we choose a map $\mathbf{p} \mapsto \boldsymbol{\Pi}^T\mathbf{p}$, where $\boldsymbol{\Pi} \in \mathbb{R}^{n \times s}$ acts as an orthogonal projection that maps vectors into a uniformly random subspace of dimension $s$. This strategy works if we have such a random matrix $\boldsymbol{\Pi}$. However, ensuring the orthogonality of the projection matrix takes unnecessary time, and we can achieve the same goal without explicitly orthonormalizing it. We choose $\boldsymbol{\Pi}$ to be a Johnson-Lindenstrauss matrix, that is, the entries of $\boldsymbol{\Pi}$ are drawn independently from the Gaussian distribution $\mathcal{N}(0, 1/s)$. In this way, although the singular values of $\boldsymbol{\Pi}$ are not confined to $\{0, 1\}$, the range of $\boldsymbol{\Pi}$ is indeed a uniformly random subspace. In matrix form, this means that
$$\tilde{\mathbf{P}} = \mathbf{P}\boldsymbol{\Pi}, \qquad \boldsymbol{\Pi} \in \mathbb{R}^{n \times s} \qquad (7)$$
Now we have obtained a matrix sketch by Johnson-Lindenstrauss random projection. The following lemma indicates that the matrix sketch generated by this procedure is indeed a projection-cost preserving sketch.
Lemma 2.
For the matrix $\mathbf{P} \in \mathbb{R}^{n \times n}$, let $\boldsymbol{\Pi} \in \mathbb{R}^{n \times s}$ be a Johnson-Lindenstrauss matrix with each entry drawn independently from the Gaussian distribution $\mathcal{N}(0, 1/s)$. For $\epsilon, \delta \in (0, 1)$, with probability at least $1 - \delta$, $\mathbf{P}\boldsymbol{\Pi}$ is a rank-$k$ projection-cost preserving sketch of $\mathbf{P}$ with approximation ratio $(1 \pm \epsilon)$ when $s = O\bigl((k + \log(1/\delta))/\epsilon^2\bigr)$.
By Lemma 2, one can achieve an accurate sketch with a satisfactory approximation ratio by increasing the sketch size $s$. We discuss how to choose this parameter later.
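Lemma 2 can be sanity-checked empirically. The snippet below uses an assumed random test matrix standing in for $\mathbf{P}$ (and the $\mathcal{N}(0, 1/s)$ entry scaling described above), then compares the projection-cost of one fixed rank-$k$ projection before and after sketching:

```python
import numpy as np

rng = np.random.default_rng(0)
n, s, k = 200, 80, 2

# A nearly rank-k test matrix standing in for P (not the paper's actual data).
P = rng.normal(size=(n, k)) @ rng.normal(size=(k, n)) \
    + 0.01 * rng.normal(size=(n, n))

# Johnson-Lindenstrauss sketch: i.i.d. N(0, 1/s) entries.
Pi = rng.normal(scale=1.0 / np.sqrt(s), size=(n, s))
P_sketch = P @ Pi

# Projection onto the top-k left singular subspace of P.
U, _, _ = np.linalg.svd(P, full_matrices=False)
X = U[:, :k] @ U[:, :k].T

cost_full = np.linalg.norm(P - X @ P, 'fro') ** 2
cost_sketch = np.linalg.norm(P_sketch - X @ P_sketch, 'fro') ** 2
ratio = cost_sketch / cost_full     # close to 1 for a good sketch
```

With $s = 80$ columns the two costs agree closely for this projection, illustrating the preservation property (the lemma's guarantee is stronger: it holds uniformly over all rank-$k$ projections, up to the additive constant $c$).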
III-C. Feature Extraction by Low Rank Approximation
After obtaining the projection-cost preserving sketch of $\mathbf{P}$, we want to extract the meaningful information from the sketch and further reduce its dimension. Singular value decomposition is a good choice: it is itself a constrained low rank approximation problem, which makes it suitable for further factorizing a projection-cost preserving sketch, and it is easy to adapt to unseen nodes. A partial singular value decomposition of the matrix $\tilde{\mathbf{P}}$ gives
$$\tilde{\mathbf{P}} \approx \mathbf{U}_d \boldsymbol{\Sigma}_d \mathbf{V}_d^T \qquad (8)$$
Each row of $\mathbf{U}_d$ is the learned graph representation of the corresponding node in the graph. Further, to demonstrate the effectiveness of this framework in learning accurate low-dimensional vectors and facilitating important tasks (e.g., $k$-means clustering), we present the following theorem.
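As a minimal sketch of this second stage (random data standing in for the actual matrix sketch; taking the rows of $\mathbf{U}_d$ rather than $\mathbf{U}_d\boldsymbol{\Sigma}_d$ as final representations is our assumption here):

```python
import numpy as np

rng = np.random.default_rng(1)
n, s, d = 50, 20, 4
P_sketch = rng.normal(size=(n, s))      # stands in for P @ Pi

# Rank-d partial SVD of the sketch (full SVD truncated, for simplicity;
# a sparse/iterative solver would be used at scale).
U, S, Vt = np.linalg.svd(P_sketch, full_matrices=False)
U_d, S_d, V_d = U[:, :d], S[:d], Vt[:d].T

reps = U_d                              # one d-dimensional vector per node
```

Each row of `reps` is a node representation; the factors `S_d` and `V_d` are kept for the folding-in step described in Section III-D.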
Theorem 1.
Let $\mathbf{C} = \mathbf{P}\mathbf{P}^T$ and $\tilde{\mathbf{C}} = \tilde{\mathbf{P}}\tilde{\mathbf{P}}^T$, and suppose $\mathbf{C}$ has a singular value decomposition $\mathbf{C} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{U}^T$ with distinct singular values. Let $\mathbf{E} = \tilde{\mathbf{C}} - \mathbf{C}$ and $\epsilon = \|\mathbf{E}\|_2$. If $\epsilon$ is sufficiently small, then $\tilde{\mathbf{C}}$ has a singular value decomposition $\tilde{\mathbf{C}} = \tilde{\mathbf{U}}\tilde{\boldsymbol{\Sigma}}\tilde{\mathbf{U}}^T$ such that $\|\tilde{\mathbf{U}}_k - \mathbf{U}_k\|_F = O(\epsilon)$, where $\mathbf{U}_k$ and $\tilde{\mathbf{U}}_k$ denote the top-$k$ columns of $\mathbf{U}$ and $\tilde{\mathbf{U}}$. Furthermore, $k$-means clustering over $\tilde{\mathbf{U}}_k$ gives a good approximation to that over $\mathbf{U}_k$ in terms of the $k$-means centroids.
Proof.
For the first part of the proof, since $\mathbf{P}\mathbf{P}^T$ and $\tilde{\mathbf{P}}\tilde{\mathbf{P}}^T$ are symmetric matrices, we can view their difference as a symmetric perturbation matrix. Deriving the upper bound on $\|\tilde{\mathbf{U}}_k - \mathbf{U}_k\|_F$ is then similar to deriving the absolute perturbation bound for eigenvector decomposition in perturbation theory [17].
The key to the second part is to regard $k$-means clustering as a constrained low rank approximation problem [18]. As defined in Section III-A2, $\mathbf{X}\mathbf{X}^T$ is a projection matrix projecting each point’s vector onto its cluster centroid, where $\mathbf{X}$ is the cluster indicator matrix. The discrepancy between the corresponding cluster positions assigned to each node by the two matrices $\tilde{\mathbf{U}}_k$ and $\mathbf{U}_k$ is $\|\mathbf{X}\mathbf{X}^T\tilde{\mathbf{U}}_k - \mathbf{X}\mathbf{X}^T\mathbf{U}_k\|_F$. Applying the spectral submultiplicativity property yields
$$\|\mathbf{X}\mathbf{X}^T(\tilde{\mathbf{U}}_k - \mathbf{U}_k)\|_F \le \|\mathbf{X}\mathbf{X}^T\|_2\,\|\tilde{\mathbf{U}}_k - \mathbf{U}_k\|_F.$$
Since $\mathbf{X}\mathbf{X}^T$ is a symmetric projection matrix, its spectral norm is not greater than 1. Then the bound becomes
$$\|\mathbf{X}\mathbf{X}^T(\tilde{\mathbf{U}}_k - \mathbf{U}_k)\|_F \le \|\tilde{\mathbf{U}}_k - \mathbf{U}_k\|_F.$$
This means that $\tilde{\mathbf{U}}_k$ approximates the nodes’ representations in $\mathbf{U}_k$ well, and that the clusters assigned to the nodes under $\tilde{\mathbf{U}}_k$ and $\mathbf{U}_k$ are well matched. ∎
Generally, this theorem states that the representations obtained by our algorithm come with a strong guarantee when $k$-means clustering is performed on them. In the experiments, we demonstrate that FIGRL also achieves excellent results on the clustering task using the agglomerative method (AM) and on the structural hole detection task.
III-D. Inductive Learning and Entire Framework of FIGRL
So far, we have presented the graph representation learning framework for static graphs. For inductive learning on an unseen node $v$, we first obtain $\mathbf{p}_v$, the column vector of the normalized random walk matrix of the extended graph corresponding to $v$ after it is added. The valid dimension of $\mathbf{p}_v$ is at most $n$ since self-loops are prohibited. Then, applying the random projection matrix to $\mathbf{p}_v$, we get a compressed vector $\tilde{\mathbf{p}}_v = \boldsymbol{\Pi}^T\mathbf{p}_v \in \mathbb{R}^s$, where $s$ is the sketch size as above. Since the partial SVD of the matrix $\tilde{\mathbf{P}}$ is $\mathbf{U}_d\boldsymbol{\Sigma}_d\mathbf{V}_d^T$, we can regard the row vectors of $\tilde{\mathbf{P}}$ as vectors in the span of $\mathbf{V}_d$. So we project $\tilde{\mathbf{p}}_v$ onto the span of $\mathbf{V}_d$. Specifically,
$$\mathbf{u}_v = \tilde{\mathbf{p}}_v^T\,\mathbf{V}_d\,\boldsymbol{\Sigma}_d^{-1} \qquad (9)$$
Then the degree-normalized $\mathbf{u}_v$ is the obtained representation of node $v$. This method is fast and effective when the graph is stable and changes gradually. We verify the effectiveness of our method at different proportions of unseen nodes in the experiments. Overall, our FIGRL framework is summarized in Algorithm 1.
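The pipeline plus the folding-in step of Equation 9 can be sketched as follows (random stand-in data and assumed variable names; a consistency check at the end verifies that folding in a seen node's own sketch row reproduces its row of $\mathbf{U}_d$, which is what makes folding-in sensible for unseen nodes):

```python
import numpy as np

rng = np.random.default_rng(2)
n, s, d = 100, 30, 5

# Stage 1: sketch a symmetric stand-in for the normalized matrix P.
M = rng.normal(size=(n, n))
P = (M + M.T) / 2
Pi = rng.normal(scale=1.0 / np.sqrt(s), size=(n, s))   # JL matrix, N(0, 1/s)
P_sketch = P @ Pi

# Stage 2: partial SVD of the sketch.
U, S, Vt = np.linalg.svd(P_sketch, full_matrices=False)
U_d, S_d, V_d = U[:, :d], S[:d], Vt[:d].T

# Folding-in (Eq. 9): compress the unseen node's vector, project onto span(V_d).
p_new = rng.normal(size=n)        # stand-in for the unseen node's column of P
p_compressed = p_new @ Pi         # compressed vector in R^s
u_new = p_compressed @ V_d / S_d  # d-dimensional representation of the new node
```

Because `P_sketch @ V_d == U_d @ diag(S_d)`, folding in a row that the SVD has already seen returns exactly that row's representation, so the incremental step is consistent with the batch computation.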
III-E. Parameter Analysis and Complexity Analysis
The sketch size $s$ can actually be determined by the approximation ratio $\epsilon$, which can be designated according to the requirements of different tasks. A moderate approximation ratio is sufficient for most tasks focusing on mining the macroscopic structure of the graph (e.g., clustering). One can always decrease $\epsilon$ to achieve better accuracy if the computational budget allows. The Johnson-Lindenstrauss lemma provides another perspective for determining the sketch size $s$.
Lemma 3.
Let $\mathbf{v}_1, \dots, \mathbf{v}_n \in \mathbb{R}^m$ be arbitrary. Pick any $\epsilon \in (0, 1)$ and let $\boldsymbol{\Pi} \in \mathbb{R}^{m \times s}$ be a Johnson-Lindenstrauss random projection matrix whose entries are drawn independently from the Gaussian distribution $\mathcal{N}(0, 1/s)$. For $s = \Omega(\log n / \epsilon^2)$, define $\tilde{\mathbf{v}}_i = \boldsymbol{\Pi}^T\mathbf{v}_i$ for $i = 1, \dots, n$. Then for any $i, j$ the following hold with high probability:
$$(1 - \epsilon)\|\mathbf{v}_i\|_2^2 \le \|\tilde{\mathbf{v}}_i\|_2^2 \le (1 + \epsilon)\|\mathbf{v}_i\|_2^2, \qquad (10)$$
$$(1 - \epsilon)\|\mathbf{v}_i - \mathbf{v}_j\|_2^2 \le \|\tilde{\mathbf{v}}_i - \tilde{\mathbf{v}}_j\|_2^2 \le (1 + \epsilon)\|\mathbf{v}_i - \mathbf{v}_j\|_2^2.$$
Taking each row of our matrix $\mathbf{P}$ as one of the $\mathbf{v}_i$, Lemma 3 shows that the norms of the node vectors and the distances between nodes are preserved in the low-dimensional subspace when $s = \Omega(\log n / \epsilon^2)$ [19]. When the approximation ratio $\epsilon$ is known, we choose the corresponding $s$ as the sketch size.
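As a hedged illustration of how a sketch size could be set from this bound (the multiplicative constant `c` below is a hypothetical choice, not a value from the paper):

```python
import math

def sketch_size(n, eps, c=4.0):
    """Sketch size s on the order of log(n) / eps^2, as Lemma 3 suggests.
    The constant c is an assumed, tunable choice."""
    return max(1, math.ceil(c * math.log(n) / eps ** 2))
```

Note the trade-off this encodes: halving the approximation ratio roughly quadruples the sketch size, while the dependence on the number of nodes is only logarithmic.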
To analyze the computational complexity of our algorithm, we first note that it is especially efficient because the matrix $\mathbf{P}$ is very sparse when the graph is large, so matrix-vector products can be evaluated rapidly. The computational cost of our algorithm in the static setting is $O(ns + \mathrm{vol}(\mathcal{G})\,s + nsd)$ in total, where $\mathrm{vol}(\mathcal{G})$ is the total degree of the graph: $O(ns)$ is the cost of generating the Johnson-Lindenstrauss random projection matrix, $O(\mathrm{vol}(\mathcal{G})\,s)$ is the cost of computing the projection-cost preserving sketch, and $O(nsd)$ is the cost of computing the partial singular value decomposition. For unseen nodes, learning the representation of an unseen node $v$ takes $O(d_v s + sd)$ time, where $d_v$ is the degree of $v$: $O(d_v s)$ is the time for projecting the node into the $s$-dimensional space and $O(sd)$ is the cost of folding it into the span of the right singular vectors.
III-F. Discussion
Since the matrix $\mathbf{P}$ is symmetric, we could also use a double-sided random projection that preserves the projection-cost and use eigenvector decomposition as a generalized constrained low rank approximation. To see this, we compute the projection-cost of a projection $\mathbf{X}$ applied to both sides of the symmetric matrix $\mathbf{P}$, which yields
$$\|\mathbf{P} - \mathbf{X}\mathbf{P}\mathbf{X}\|_F^2. \qquad (11)$$
The lemmas and theorems in this paper can be adapted correspondingly. However, this approach requires an additional computation: the product of the top-$k$ eigenvectors of the matrix sketch with the random projection matrix. Therefore, double-sided projection-cost preservation is not necessary, and regular projection-cost preservation is sufficient for the graph representation learning purpose.
IV. Experimental Results
To quantitatively evaluate our FIGRL framework, we perform various experiments using the learned graph representations. The implementation of our algorithm is publicly available at https://github.com/Jafree/FastInductiveGraphEmbedding.
IV-A. Datasets Description and Comparison Methods
All datasets used in this paper are undirected graphs, available on the SNAP Datasets platform [20]. These networks vary widely in network type, scale, edge density, connecting patterns and cluster profiles. They comprise three social networks: karate (real), youtube (online), enronemail (communication); three collaboration networks: cahepth, dblp, cacondmat; and three entity networks: dolphins (animals), usfootball (organizations), polblogs (hyperlinks). To show the characteristics of these datasets, we use a community detection algorithm designed for graphs, RankCom [21], to reveal the cluster profiles (number of clusters and maximum cluster size). The detailed information is summarized in Table II.
Characteristics  #Cluster  #Max members  

Datasets  # Node  # Edge  RankCom  RankCom 
karate  34  78  2  18 
dolphins  62  159  3  29 
usfootball  115  613  11  17 
polblogs  1,224  19,090  7  675 
cahepth  9,877  25,998  995  446 
cacondmat  23,133  93,497  2,456  797 
emailenron  36,692  183,831  3,888  3,914 
youtube  334,863  925,872  15,863  37,255 
dblp  317,080  1,049,866  25,633  1,099 
Our framework is evaluated against state-of-the-art algorithms. The first five are graph representation learning methods; the others are structural hole detection methods. We summarize them as follows:

FIGRL: Our inductive representation learning approach.

GraphSAGE [22]: A sampling-and-aggregating strategy is applied to integrate neighbors’ information.

Deepwalk [11]: Truncated random walks and language modeling techniques are adopted to learn representations.

node2vec [23]: The Skip-gram framework is extended to networks.

LINE [8]: The version combining first-order and second-order proximity is used here.

HAM [2]: A harmonic modularity function is presented to tackle the structural hole detection problem.

Constraint [24]: Constructs a constraint to prune nodes without certain connectivity.

Pagerank [25]: Adopts the assumption that structural holes are nodes with high PageRank scores.

Betweenness Centrality (BC) [26]: Nodes with highest BC will be selected as structural holes.

HIS [27]: Optimizes the provided objective function via a two-stage information flow model.

AP_BICC [28]: This method exploits approximate inverse closeness centralities.
IV-B. Clustering
Modularity  Permanence  

Datasets  Clustering Methods  FIGRL  node2vec  GraphSAGE  LINE  Deepwalk  FIGRL  node2vec  GraphSAGE  LINE  Deepwalk 
karate  kmeans  0.410(1)  0.335(5)  0.381(4)  0.403(2)  0.396(3)  0.474(1)  0.335(3)  0.322(4)  0.182(5)  0.350(2) 
AM  0.410(2)  0.335(4)  0.401(3)  0.239(5)  0.430(1)  0.474(1)  0.205(5)  0.339(2)  0.232(4)  0.311(3)  
dolphins  kmeans  0.489(1)  0.460(2)  0.370(4)  0.187(5)  0.401(3)  0.235(1)  0.196(2)  0.158(4)  0.166(5)  0.187(3) 
AM  0.462(1)  0.458(2)  0.355(4)  0.271(5)  0.393(3)  0.215(1)  0.132(3)  0.121(4)  0.189(5)  0.189(2)  
usfootball  kmeans  0.607(1)  0.605(2)  0.485(4)  0.562(3)  0.464(5)  0.323(1)  0.304(3)  0.124(4)  0.311(2)  0.039(5) 
AM  0.611(1)  0.589(2)  0.470(4)  0.492(3)  0.464(5)  0.315(1)  0.279(3)  0.116(4)  0.307(2)  0.039(5)  
cahepth  kmeans  0.611(1)  0.597(2)  0.399(4)  0.01(5)  0.424(3)  0.393(1)  0.379(2)  0.287(3)  0.948(5)  0.261(4) 
AM  0.623(1)  0.606(2)  0.423(4)  0.05(5)  0.453(3)  0.427(1)  0.406(2)  0.327(4)  0.949(5)  0.338(3)  
condmat  kmeans  0.527(1)  0.515(2)  0.409(3)  0(5)  0.357(4)  0.371(1)  0.330(2)  0.206(3)  0.984(5)  0.197(4) 
AM  0.544(1)  0.520(2)  0.427(3)  0(5)  0.370(4)  0.392(1)  0.388(2)  0.213(4)  0.994(5)  0.249(3)  
enronemail  kmeans  0.322(1)  0.213(3)  0.231(2)  0(5)  0.178(4)  0.175(1)  0.080(2)  0.067(3)  0.985(5)  0.049(4) 
AM  0.327(1)  0.218(2)  0.211(3)  0(5)  0.207(4)  0.187(1)  0.180(2)  0.058(4)  0.996(5)  0.108(3)  
polblogs  kmeans  0.427(1)  0.357(2)  0.278(3)  0.200(4)  0.084(5)  0.130(1)  0.066(2)  0.106(3)  0.569(5)  0.187(4) 
AM  0.425(1)  0.376(2)  0.291(3)  0.266(4)  0.065(5)  0.131(1)  0.096(2)  0.123(3)  0.509(5)  0.176(4)  
We first test our algorithm on the clustering task. We fix the approximation ratio and cap the dimension of the representations identically for all graphs. Firstly, we run our FIGRL algorithm on the graphs to learn low-dimensional representations. Then, two different types of clustering algorithms, i.e., $k$-means clustering and the agglomerative clustering method (AM), are applied. We use two different metrics to evaluate the clustering results, i.e., modularity [29] and permanence [30].

Modularity [29]: This is the most widely used metric for evaluating clustering results on graphs. Modularity measures the benefit of nodes joining a cluster relative to the null model. Specifically, modularity is defined as
$$Q = \frac{1}{2|E|}\sum_{i,j}\left(W_{ij} - \frac{d_i d_j}{2|E|}\right)\delta(c_i, c_j),$$
where $\delta$ is the indicator function and $c_i$ indicates the cluster node $i$ belongs to. Generally, a high modularity score indicates a good clustering result. To penalize clearly wrong cluster membership assignments, we add a penalty proportional to the inverse of the node’s degree.

Permanence [30]: Permanence is a node-based metric that explicitly evaluates the cluster membership affiliation of each node. It is stricter, since it considers the configuration of the clusters a node connects to. For a node $v$ in cluster $C$, the permanence is defined as
$$\mathrm{Perm}(v) = \frac{I(v)}{E_{\max}(v)} \cdot \frac{1}{d(v)} - \bigl(1 - c_{\mathrm{in}}(v)\bigr),$$
where $I(v)$ is the internal degree of $v$, $E_{\max}(v)$ is the maximum number of links from $v$ to any single other cluster, $d(v)$ is the degree of $v$, and $c_{\mathrm{in}}(v)$ is the internal clustering coefficient. The total permanence score of the graph is the sum of the permanence scores of all nodes. Empirically, a positive permanence score indicates a good clustering result.
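For reference, the standard modularity formula can be computed as follows. This sketch omits the paper's degree-based penalty, and the toy graph is ours, for illustration only:

```python
import numpy as np

def modularity(W, labels):
    """Newman modularity: Q = (1/2m) * sum_ij [W_ij - d_i d_j / 2m] * [c_i == c_j]."""
    m2 = W.sum()                              # equals 2m for an undirected graph
    d = W.sum(axis=1)
    same = (labels[:, None] == labels[None, :])
    return ((W - np.outer(d, d) / m2) * same).sum() / m2

# Two triangles joined by one edge, clustered into the two triangles.
W = np.zeros((6, 6))
for a, b in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[a, b] = W[b, a] = 1.0
labels = np.array([0, 0, 0, 1, 1, 1])
Q = modularity(W, labels)   # 5/14 for this partition
```

For this graph, 6 of the 7 edges are intra-cluster and each community holds half of the total degree, which yields $Q = 2\,(3/7 - (1/2)^2) = 5/14 \approx 0.357$.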
The results are listed in Table III. Our algorithm FIGRL outperforms the other graph representation learning algorithms, i.e., GraphSAGE, node2vec, LINE and Deepwalk, over almost all datasets, using both $k$-means and AM, in terms of both modularity and permanence (the exception is the karate network, where Deepwalk outperforms FIGRL in terms of modularity under AM). Specifically, node2vec achieves the second best results, but it performs badly on small networks, i.e., on karate in terms of modularity and on karate, dolphins and usfootball in terms of permanence. LINE fails to capture the macroscopic structure of the graph since it only preserves local information; its zero modularity results mean that LINE assigns a bunch of nodes to clearly wrong clusters. GraphSAGE and Deepwalk give mediocre results. In terms of permanence, none of the other methods can preserve the cluster information on polblogs, which is a tough case. The overall performance is reported in Figure 3. More precisely, combining the results of $k$-means and AM, FIGRL improves over node2vec, GraphSAGE, LINE and Deepwalk in terms of modularity, and over node2vec, GraphSAGE and Deepwalk in terms of permanence.
IV-C. Structural Hole Detection
We consider another task, focusing on the microscopic level, called structural hole detection. Structural holes are important nodes located at key topological positions: once they are removed, the network falls apart. Finding structural holes in graphs is a critical task for graph theory and information diffusion. To achieve this task, we first transform the graph into a low-dimensional subspace using our algorithm, and then find structural holes in that space. We devise a metric for ranking nodes in the low-dimensional subspace:

Relative Deviation Score (RDS): Let $\mathbf{y}_v$ be the low-dimensional representation of node $v$, and suppose $k$-means clustering over the representations gives a cluster set $\{C_1, \dots, C_k\}$. RDS estimates the deviation of a node from its own cluster, as attracted by the other clusters, in terms of relative radius. More precisely, a node’s distance to a cluster center $\boldsymbol{\mu}_C$ is measured in units of that cluster’s radius $r_C$, and the score compares the deviation from the node’s own cluster with the attraction of the other clusters.
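Since the paper's exact RDS formula is not reproduced above, the following is one plausible instantiation of the idea (deviation from the own centroid normalized by cluster radii); every detail of `rds` beyond that idea is an assumption, as is the toy data:

```python
import numpy as np

def rds(Y, labels):
    """Assumed RDS variant: distance to the node's own centroid (in units of
    its cluster radius) divided by the distance to the nearest other centroid
    (in that cluster's radius). Higher scores suggest bridge nodes."""
    ks = np.unique(labels)
    mu = {c: Y[labels == c].mean(axis=0) for c in ks}
    rad = {c: np.linalg.norm(Y[labels == c] - mu[c], axis=1).mean() + 1e-12
           for c in ks}
    scores = np.empty(len(Y))
    for i, y in enumerate(Y):
        c = labels[i]
        own = np.linalg.norm(y - mu[c]) / rad[c]
        other = min(np.linalg.norm(y - mu[c2]) / rad[c2] for c2 in ks if c2 != c)
        scores[i] = own / (other + 1e-12)
    return scores

# Two tight square clusters plus one "bridge" point midway between them.
Y = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [10, 10], [10, 11], [11, 10], [11, 11],
              [5, 5]], dtype=float)
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1, 0])
scores = rds(Y, labels)     # the bridge point gets the largest score
```

On this toy layout the midpoint, which sits far from its own centroid yet relatively close to the other cluster, receives the highest score, matching the intuition that structural holes sit between clusters in the learned subspace.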
Nodes with the highest RDS scores are regarded as structural holes, since they strongly connect at least two clusters. We use an evaluation metric called the Structural Hole Influence Index (SHII) [2] to evaluate the selected structural holes. SHII is computed via a process of information diffusion. For each selected structural hole, we run information diffusion under the linear threshold (LT) model and the independent cascade (IC) model 10,000 times to get the average SHII score. The SHII score is defined as follows:

Structural Hole Influence Index [2]: Note that a node generally cannot activate the influence maximization process by itself. For a selected structural hole $v$, we therefore repeat the following procedure several times: combine $v$ with a randomly chosen node set from its cluster as a seed set to initiate an influence maximization process in the network. SHII evaluates the ratio of activated nodes that lie in clusters other than $v$’s, where an indicator function marks whether a node is influenced. In our experiments, we fix the size of the randomly sampled activation set.
The results are shown in Table IV. According to the characteristics of the different networks, we tune all algorithms to select a certain number of structural holes; selecting too many of them would result in the activation of the entire network. For the karate network, three structural holes are selected. The topological structure of karate shown in [31] demonstrates that the structural holes selected by our algorithm occupy critical positions bridging two clusters. More precisely, our results are superior to those of the other structural hole selection methods, including the state-of-the-art algorithm HAM [2]. This demonstrates the efficacy of our algorithm in preserving microscopic structure.
Comparative Methods  
Datasets  #SH  Influence Model  FIGRL  HAM  Constraint  PageRank  BC  HIS  AP_BICC 
karate  3  LT  0.595  0.343  0.295  0.159  0.159  0.132  0.295 
IC  0.003  0.002  0.002  0.001  0.001  0.001  0.002  
Structural Holes  [3 14 20]  [3 20 9]  [1 34 3]  [34 1 33]  [1 34 33]  [32 9 14]  [1 3 34]  
youtube  78  LT  4.129  3.951  2.447  1.236  1.226  3.198  1.630 
IC  3.024  2.452  1.254  0.662  0.791  2.148  0.799  
dblp  42  LT  6.873  5.384  0.404  0.357  0.958  0.718  0.550 
IC  5.251  3.578  0.229  0.190  0.821  0.304  0.495 
IV-D. Performance on Unseen Nodes
To evaluate the performance of our algorithm under the inductive learning scenario, we artificially simulate the process of generating unseen nodes. Specifically, for a static graph, we randomly extract a proportion of the nodes as the original graph for graph representation learning; the remaining nodes are treated as unseen nodes. In this experiment, we set an approximation ratio that is accurate enough for most applications. FIGRL uses the folding-in technique to learn meaningful representations for the unseen nodes. If the learned representations are accurate, the $k$-means clustering results over the representations of the entire graph will be satisfactory.
The clustering performance in terms of modularity under varying proportions of unseen nodes is illustrated in Figure 4. As we can see, when the proportion of unseen nodes is moderate, the clustering performance remains stable at a good quality. Once the proportion grows too large, the many added unseen nodes dramatically change the main skeleton of the network; since we add a penalty for clearly wrong cluster assignments, the clustering performance then degenerates sharply. For a small network like dolphins, the results fluctuate to a certain extent at some proportions. Polblog and football give the most stable performance, retaining almost the same results over the whole tested range. We conjecture that the representations of unseen nodes are more accurate if the nodes are well-connected (polblog has a relatively high edge density) or the network is well-structured and invulnerable (football has 11 clusters of nearly equal size). Overall, FIGRL is flexible enough to give satisfactory representation learning results for inductive learning even when the proportion of unseen nodes is large.
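For reference, the modularity score used in this evaluation can be computed with a short pure-Python routine; the adjacency-list graph encoding and cluster mapping below are illustrative choices, not part of our framework.

```python
def modularity(adj, cluster_of):
    """Newman modularity Q = sum over communities c of (e_c - a_c^2), where
    e_c is the fraction of edge endpoints joining two nodes of c and a_c is
    the fraction of all edge endpoints attached to c (undirected graph,
    adjacency given as a dict of neighbor lists)."""
    m2 = sum(len(nbrs) for nbrs in adj.values())   # 2m: total endpoint count
    inside, degree = {}, {}
    for u, nbrs in adj.items():
        cu = cluster_of[u]
        degree[cu] = degree.get(cu, 0) + len(nbrs)
        for v in nbrs:
            if cluster_of[v] == cu:                # intra-community endpoint
                inside[cu] = inside.get(cu, 0) + 1
    return sum(inside.get(c, 0) / m2 - (degree[c] / m2) ** 2 for c in degree)
```

On two triangles joined by a single bridge edge, splitting along the bridge gives Q = 5/14, matching the textbook value for that toy graph.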
IV-E Parameter Analysis: Approximation Ratio and Sketch Size
To quantitatively measure the ability of our framework to capture crucial information of graphs and to preserve the projection-cost, we evaluate the performance of our algorithm while varying the approximation ratio and the sketch size. To get a visual sense, Figure 5 plots three-dimensional and two-dimensional representations of the karate network [31], which has two ground-truth clusters and several structural holes, under a coarse and a fine approximation ratio, respectively. Under the coarse ratio, nodes between the two clusters are mixed up with each other, so at this low resolution the cluster information is not well maintained in the learned subspace. Under the fine ratio, nodes are located in clusters exactly matching the ground truth. Moreover, structural holes bridging the two clusters can be easily identified from the two-dimensional view, where nodes in different clusters form nearly orthogonal subspaces and are linearly separable.
To demonstrate the ability of our algorithm in preserving the projection-cost, we compute the relative projection-cost while varying the sketch size. More precisely, we calculate
$\|A - \tilde{P}A\|_F^2 \,/\, \|A - A_k\|_F^2$,
where $A$ is the normalized random walk matrix, $\tilde{P}$ is the rank-$k$ projection learned from the sketch, and $\|A - A_k\|_F^2$ is the residual of the optimal rank-$k$ approximation of $A$. The dimension $k$ is set to a small value, which is sufficient for the applications we are concerned with. We run our algorithm repeatedly at each sketch size; the results are shown in Figure 6. The relative projection-cost decreases rapidly at very small sketch sizes: FIGRL already achieves excellent results at a small sketch size, and as the sketch size grows further, the result becomes nearly optimal for graph representation learning purposes. For large networks, the approximation is even more accurate: since such networks are usually very sparse, the nodes lie in a small subspace compared with the size of the network. Although our algorithm is randomized, the variance of the relative projection-cost at each sketch size is rather small. This implies that we can treat FIGRL as a deterministic algorithm, since the chance that it fails to preserve the projection-cost is rather slim, especially when the sketch size is large.
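This measurement can be reproduced in a few lines. The sketch below is a simplified stand-in for our pipeline, a Gaussian random projection followed by an SVD of the sketch; the matrix sizes and function name are illustrative.

```python
import numpy as np

def relative_projection_cost(A, sketch_size, d, seed=1):
    """Ratio ||A - P~A||_F^2 / ||A - A_d||_F^2, where P~ projects onto the
    top-d left singular subspace of the sketch A @ R and A_d is the optimal
    rank-d approximation of A. The ratio is always >= 1 and approaches 1 as
    the sketch size grows."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    R = rng.standard_normal((n, sketch_size)) / np.sqrt(sketch_size)
    U, _, _ = np.linalg.svd(A @ R, full_matrices=False)
    Ud = U[:, :d]
    approx_cost = np.linalg.norm(A - Ud @ (Ud.T @ A), 'fro') ** 2
    s = np.linalg.svd(A, compute_uv=False)
    opt_cost = float((s[d:] ** 2).sum())  # residual of best rank-d approximation
    return approx_cost / opt_cost
```

Because the optimal rank-d residual is the minimum over all rank-d projections, the ratio can never fall below one; the question the experiment answers is how quickly it approaches one as the sketch grows.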
IV-F Running Time
Finally, the computational time of our algorithm in the static scenario against other competing graph representation algorithms is listed in Table V. All competing algorithms learn representations of the same dimension. FIGRL is outstanding in terms of computational cost: even at the largest tested sketch size (1,000), where FIGRL gives nearly optimal results on the tested datasets, it takes only about two minutes to learn the graph representations on dblp. GraphSAGE, node2vec and DeepWalk, by contrast, take more than 10 hours to accomplish the same task.
Table V: Running time comparison in the static scenario.

| Methods (Sketch Size) | karate | dolphins | usfootball | polblog | cahepth | cacondmat | emailenron | youtube | dblp |
|---|---|---|---|---|---|---|---|---|---|
| FIGRL (100) | 0.005s | 0.007s | 0.042s | 0.032s | 0.181s | 0.424s | 0.672s | 7.14s | 8.33s |
| FIGRL (200) | 0.006s | 0.009s | 0.051s | 0.064s | 0.388s | 0.913s | 1.508s | 15.85s | 18.09s |
| FIGRL (500) | 0.019s | 0.015s | 0.037s | 0.081s | 0.976s | 2.804s | 6.776s | 45.73s | 57.30s |
| FIGRL (1000) | 0.040s | 0.035s | 0.047s | 0.159s | 2.576s | 7.262s | 11.48s | 1m41s | 2m14s |
| node2vec | 0.807s | 3.110s | 1.442s | 33.34s | 74.83s | 2m57s | 48m17s | >10h | >10h |
| DeepWalk | 4.123s | 10.876s | 10.92s | 2m10s | 15m59s | 43m9s | 1h18m | >10h | >10h |
| GraphSAGE | 18.348s | 43.791s | 37.252s | 6m3s | 49m20s | 3h51m | 4h39m | >10h | >10h |
V Related Work
V-A Graph Representation Learning
Graph representation learning has become an important problem, facilitating the application of classic machine learning and data mining algorithms to graphs. Some methods explicitly preserve proximity between nodes: [8] introduces an edge-sampling method, [9] develops a semi-supervised deep model, and [32] enhances communities and structural holes via non-backtracking random walks. Some methods exploit matrix factorization techniques, e.g., [33] factorizes asymmetric-transitivity-related matrices on directed graphs, and [13] proposes an update algorithm on matrix forms. Some algorithms formulate the problem as a traditional machine learning task: DeepWalk [11] learns latent representations by treating truncated random walks as sentences, and [12] optimizes a max-margin classifier. Several methods focus on the heterogeneous scenario: [6] models multi-view graph data as tensors, [10] learns the representations of clusters, [7] learns context-aware representations, [34] creates a multi-resolution deep architecture, [14] formulates a DeepWalk-based matrix factorization incorporating text features, and [5] introduces metapath-based random walks for representation learning.
One line of work similar to our approach is graph representation learning on dynamic networks. [35] investigates the role of closed triads at different time steps. [36] integrates node attributes by utilizing spectral decomposition and matrix perturbation theory in a dynamic setting. [37] aims at preserving high-order proximity by using non-parametric probabilistic modeling and deep learning, which can be generalized to unseen nodes. [22] presents several types of aggregators for aggregating features from nodes' local neighborhoods. In contrast to these works, by introducing randomization and approximation strategies, our approach focuses on building a graph representation learning framework that is fast, theoretically guaranteed, and able to generalize to unseen nodes.
V-B Randomized Dimension Reduction
Randomized algorithms are often adopted in dimension reduction due to their speed and the solid theory supporting them. [38] surveys randomized algorithms for low-rank approximation and presents several algorithms addressing different situations; the proposed algorithm is more robust than Krylov subspace methods for sparse input matrices. [39] adapts a well-known streaming algorithm for approximating item frequencies to find a matrix sketch; combined with SVD and a special update strategy, the proposed algorithm becomes deterministic and computationally competitive. [16] devises a theoretical framework by deriving a series of bounds on the dimensions required for random row projection, column selection, and approximate SVD, which can be used to better solve k-means clustering and low-rank approximation problems. [40] uses random projection in a cluster ensemble approach to achieve better and more robust clustering performance. [41] uses the random projection technique to deal with text and image data, and empirically demonstrates that random projection yields results comparable to conventional deterministic methods (e.g., PCA) while being computationally significantly less expensive. [42] presents the first provably accurate feature selection method for k-means clustering, along with two feature extraction methods using random projection and fast approximate SVD that improve upon existing results in terms of time complexity. Our approach brings state-of-the-art randomized strategies to graph representation learning, and several bounds and theorems are proved to guarantee the quality of the learned representations.
VI Conclusion
In this paper, we propose a fast inductive graph representation learning framework, namely FIGRL, to transform the topological structure of graphs into a low-dimensional space. It explicitly decouples the relational information in graphs into a randomized subspace spanned by a random projection matrix. The obtained sketch is much smaller, yet inherits the property associated with the normalized cut by preserving the projection-cost. By exploiting constrained low-rank approximation, the dimension of the sketch is further reduced and the compact hidden pattern is finally extracted. The connection between randomized algorithms and graph representation learning is established through thorough theoretical analysis. FIGRL is flexible enough to deal with massive-scale graphs and graphs with unseen nodes. Overall, our algorithm is fast, easy to implement, and theoretically guaranteed. The empirical study demonstrates the superiority of our algorithm in both efficacy and efficiency.
Acknowledgement
This work is supported in part by the National Key R&D Program of China through grant 2016YFB0800700, NSF through grants IIS-1526499, IIS-1763325, and CNS-1626432, and NSFC through grants 61672313, 61672051, and 61872101.
References
 [1] H. Qiu and E. R. Hancock, “Clustering and embedding using commute times,” TPAMI, 2007.
 [2] L. He, C.T. Lu, J. Ma, J. Cao, L. Shen, and P. S. Yu, “Joint community and structural hole spanner detection via harmonic modularity,” in KDD, 2016.
 [3] Y. Lin, Z. Liu, M. Sun, Y. Liu, and X. Zhu, “Learning entity and relation embeddings for knowledge graph completion.” in AAAI, vol. 15, 2015, pp. 2181–2187.
 [4] L. van der Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, no. Nov, pp. 2579–2605, 2008.
 [5] Y. Dong, N. V. Chawla, and A. Swami, “metapath2vec: Scalable representation learning for heterogeneous networks,” in KDD. ACM, 2017.
 [6] G. Ma, L. He, C.-T. Lu, W. Shao, P. S. Yu, A. D. Leow, and A. B. Ragin, “Multi-view clustering with graph embedding for connectome analysis,” in CIKM. ACM, 2017.
 [7] C. Tu, H. Liu, Z. Liu, and M. Sun, “CANE: Context-aware network embedding for relation modeling,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, 2017, pp. 1722–1731.
 [8] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei, “LINE: Large-scale information network embedding,” in Proceedings of the 24th International Conference on World Wide Web. ACM, 2015, pp. 1067–1077.
 [9] D. Wang, P. Cui, and W. Zhu, “Structural deep network embedding,” in KDD. ACM, 2016.
 [10] S. Cavallari, V. W. Zheng, H. Cai, K. C.C. Chang, and E. Cambria, “Learning community embedding with community detection and node embedding on graphs,” in CIKM. ACM, 2017.
 [11] B. Perozzi, R. Al-Rfou, and S. Skiena, “DeepWalk: Online learning of social representations,” in KDD. ACM, 2014.
 [12] C. Tu, W. Zhang, Z. Liu, and M. Sun, “Max-margin DeepWalk: Discriminative learning of network representation,” in IJCAI, 2016, pp. 3889–3895.
 [13] C. Yang, M. Sun, Z. Liu, and C. Tu, “Fast network embedding enhancement via high order proximity approximation,” in Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI, 2017, pp. 19–25.
 [14] C. Yang, Z. Liu, D. Zhao, M. Sun, and E. Y. Chang, “Network representation learning with rich text information.” in IJCAI, 2015, pp. 2111–2117.
 [15] J. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEE Transactions on pattern analysis and machine intelligence, vol. 22, no. 8, pp. 888–905, 2000.
 [16] M. B. Cohen, S. Elder, C. Musco, C. Musco, and M. Persu, “Dimensionality reduction for k-means clustering and low rank approximation,” in Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing. ACM, 2015, pp. 163–172.
 [17] X. S. Chen, W. Li, W. W. Xu et al., “Perturbation analysis of the eigenvector matrix and singular vector matrices,” Taiwanese Journal of Mathematics, vol. 16, no. 1, pp. 179–194, 2012.
 [18] C. Boutsidis, P. Drineas, and M. W. Mahoney, “Unsupervised feature selection for the k-means clustering problem,” in Advances in Neural Information Processing Systems, 2009, pp. 153–161.
 [19] S. Dasgupta, “Experiments with random projection,” in Proceedings of the Sixteenth conference on Uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc., 2000, pp. 143–151.
 [20] J. Leskovec and A. Krevl, “SNAP Datasets: Stanford large network dataset collection,” http://snap.stanford.edu/data, 2014.
 [21] F. Jiang, Y. Yang, S. Jin, and J. Xu, “Fast search to detect communities by truncated inverse page rank in social networks,” in Mobile Services (MS), 2015 IEEE International Conference on. IEEE, 2015, pp. 239–246.
 [22] W. Hamilton, Z. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” in Advances in Neural Information Processing Systems, 2017, pp. 1025–1035.
 [23] A. Grover and J. Leskovec, “node2vec: Scalable feature learning for networks,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016, pp. 855–864.
 [24] R. S. Burt, Structural holes: The social structure of competition. Harvard university press, 2009.
 [25] L. Page, S. Brin, R. Motwani, and T. Winograd, “The pagerank citation ranking: Bringing order to the web.” Stanford InfoLab, Tech. Rep., 1999.
 [26] U. Brandes, “A faster algorithm for betweenness centrality,” Journal of mathematical sociology, 2001.
 [27] T. Lou and J. Tang, “Mining structural hole spanners through information diffusion in social networks,” in WWW, 2013.
 [28] M. Rezvani, W. Liang, W. Xu, and C. Liu, “Identifying top-k structural hole spanners in large-scale social networks,” in CIKM, 2015.
 [29] M. E. Newman, “Modularity and community structure in networks,” PNAS, 2006.
 [30] T. Chakraborty, S. Srinivasan, N. Ganguly, and S. Bhowmick, “On the permanence of vertices in network communities,” in KDD, 2014.
 [31] W. W. Zachary, “An information flow model for conflict and fission in small groups,” Journal of anthropological research, vol. 33, no. 4, pp. 452–473, 1977.
 [32] F. Jiang, L. He, Y. Zheng, E. Zhu, J. Xu, and P. S. Yu, “On spectral graph embedding: A non-backtracking perspective and graph approximation,” in Proceedings of the 2018 SIAM International Conference on Data Mining. SIAM, 2018, pp. 324–332.
 [33] M. Ou, P. Cui, J. Pei, Z. Zhang, and W. Zhu, “Asymmetric transitivity preserving graph embedding,” in KDD, 2016.
 [34] S. Chang, W. Han, J. Tang, G.J. Qi, C. C. Aggarwal, and T. S. Huang, “Heterogeneous network embedding via deep architectures,” in KDD. ACM, 2015.
 [35] L. Zhou, Y. Yang, X. Ren, F. Wu, and Y. Zhuang, “Dynamic network embedding by modeling triadic closure process,” in AAAI, 2018.
 [36] J. Li, H. Dani, X. Hu, J. Tang, Y. Chang, and H. Liu, “Attributed network embedding for learning in a dynamic environment,” in Proceedings of CIKM. ACM, 2017.
 [37] J. Ma, P. Cui, and W. Zhu, “DepthLGP: Learning embeddings of out-of-sample nodes in dynamic networks,” in AAAI, 2018.
 [38] N. Halko, P.G. Martinsson, and J. A. Tropp, “Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions,” SIAM review, vol. 53, no. 2, pp. 217–288, 2011.
 [39] E. Liberty, “Simple and deterministic matrix sketching,” in KDD. ACM, 2013, pp. 581–588.
 [40] X. Z. Fern and C. E. Brodley, “Random projection for high dimensional data clustering: A cluster ensemble approach,” in Proceedings of the 20th International Conference on Machine Learning (ICML03), 2003, pp. 186–193.
 [41] E. Bingham and H. Mannila, “Random projection in dimensionality reduction: applications to image and text data,” in KDD. ACM, 2001.
 [42] C. Boutsidis, A. Zouzias, M. W. Mahoney, and P. Drineas, “Randomized dimensionality reduction for k-means clustering,” IEEE Transactions on Information Theory, vol. 61, no. 2, pp. 1045–1062, 2015.