
# Spectral Network Embedding: A Fast and Scalable Method via Sparsity

Jie Zhang, Yan Wang, Jie Tang
Department of Computer Science and Technology, Tsinghua University
###### Abstract.

Network embedding aims to learn low-dimensional representations of nodes in a network, while the network structure and inherent properties are preserved. It has attracted tremendous attention recently due to significant progress in downstream network learning tasks, such as node classification, link prediction, and visualization. However, most existing network embedding methods suffer from expensive computation due to the large scale of networks. In this paper, we propose a faster network embedding method, called Progle, which elegantly utilizes the sparsity property of online networks together with spectral analysis. In Progle, we first construct a sparse proximity matrix and train the network embedding efficiently via sparse matrix decomposition. Then we introduce a network propagation pattern via spectral analysis to incorporate local and global structure information into the embedding. Besides, this model can be generalized to efficiently integrate network information into other insufficiently trained embeddings. Benefiting from sparse spectral network embedding, our experiments on four different datasets show that Progle outperforms or is comparable to state-of-the-art unsupervised comparison approaches (DeepWalk, LINE, node2vec, GraRep, and HOPE) regarding accuracy, while being faster than the fastest word2vec-based method. Finally, we validate the scalability of Progle both on real large-scale networks and on synthetic networks of multiple scales.

network embedding, unsupervised learning, network spectral analysis, scalability
ccs: Information systems, Data mining; ccs: Information systems, Social networks; ccs: Computing methodologies, Dimensionality reduction and manifold learning; copyright: none

## 1. Introduction

Networks exist in many applications, e.g., social networks, gene-protein networks, road networks, and the World Wide Web. One critical issue in network analysis is how to represent each node in the network. It is particularly challenging, as the network topology is often sophisticated and the scale of the network is extensive in many cases.

Network embedding, also called network representation learning, is a promising method to project nodes in a network to a low-dimensional continuous space while preserving some network properties. It could benefit a wide range of real applications such as link prediction, community detection, and node classification. Take node classification as an example: we can utilize network embedding to predict node tags, such as interests of users in a social network, or functional labels of proteins in a protein-protein interaction network.

Roughly speaking, methodologies for network embedding fall into three categories: spectral embeddings, word2vec-based embeddings, and convolution-based embeddings. Spectral embedding approaches typically exploit the spectral properties of various matrix representations of graphs, especially the Laplacian and the adjacency matrices. Classical spectral methods can be viewed as dimensionality reduction techniques, and related works include Isomap (Tenenbaum et al., 2000), Laplacian Eigenmaps (Belkin and Niyogi, 2001), and spectral clustering (Yan et al., 2009). Involving matrix eigendecomposition, the expensive computational cost and poor scalability limit their application in real networks.

Recently there has been a dramatic increase in the scalability of network embedding methods due to the introduction of the neural language model, word2vec (Mikolov et al., 2013). The word2vec-based embedding methods, such as Deepwalk (Perozzi et al., 2014), and node2vec (Grover and Leskovec, 2016), analogize nodes into words and capture network structure via random walks, which results in a large “corpus” to train the node representations. They utilize SGD to optimize a neighborhood preserving likelihood objective that is usually non-convex. Though these methods are scalable and have achieved significant performance improvements in node classification and link prediction, they still suffer from high sampling time cost and optimization computation cost.

More recently, a few works have attempted to connect spectral embedding with word2vec-based embedding. Levy and Goldberg (2014) have proven that neural word embedding is equivalent to implicit matrix factorization. Following this thread, several matrix factorization embedding approaches using spectral dimensionality reduction techniques (e.g., SVD) have been proposed, including GraRep (Cao et al., 2015) and NetMF (Qiu et al., 2017). These spectral methods demonstrate better statistical performance than the original spectral approaches and their word2vec-based counterparts. However, they also inherit the defects of the spectral methods, that is, expensive computation and low scalability.

The third category is convolution-based embedding, for example, Graph Convolution Network (GCN) (Kipf and Welling, 2016), Graph Attention Network (GAT) (Velickovic et al., 2017), Message Passing Neural Networks (MPNNs) (Gilmer et al., 2017), and GraphSage (Hamilton et al., 2017). The convolution-based embedding methods, being supervised learning methods, further improve the quality of embedding learning. Nevertheless, they need labeled data, are limited to specific tasks, and most of them do not easily scale up to large networks. More importantly, task-independent embeddings learned in an unsupervised way often closely match task-specific supervised convolution-based approaches in predictive accuracy, while benefiting a broader range of real applications. Therefore, we only discuss the unsupervised learning methods.

In summary, existing methods often suffer from expensive computation or poor scalability. In this paper, inspired by the fact that most online networks follow power-law and long-tailed distributions, and by the relationship between higher-order Cheeger's inequality and spectral graph partitioning (Lee et al., 2014; Bandeira et al., 2013), we propose a scalable spectral embedding method, called Progle, with high efficiency and high accuracy. In particular, we first construct the network proximity matrix by exploiting the sparsity property. Second, we show that the first-phase network embedding in our model (called sparse network embedding) reduces to decomposing a sparse matrix, so we can train network embeddings via the sparse truncated Lanczos SVD algorithm. Finally, leveraging the relationship between higher-order Cheeger's inequality and multi-way graph partitioning, we modulate the network spectrum and obtain the final embedding (called spectral network embedding), mainly by propagating the sparse network embedding in the modulated network to incorporate local and global network information.

The whole algorithm, illustrated in Figure 1, only involves sparse matrix products and sparse spectral methods, and is thus scalable and several orders of magnitude faster than common spectral embedding methods (e.g., GraRep and NetMF). Compared with word2vec-based methods, Figure 2 shows that Progle also achieves significantly better efficiency on all the datasets. Moreover, regarding accuracy, Progle outperforms all the comparison methods (DeepWalk, LINE, node2vec, GraRep, and HOPE), significantly on most datasets and marginally in some cases.

Organization. The rest of the paper is organized as follows. Section 2 formulates the problem definition. Section 3 describes the model framework and algorithms. Section 4 presents the experimental results. Section 5 reviews related works and Section 6 concludes the paper.

## 2. Problem Formulation

Network embedding aims to project network nodes into a low-dimensional continuous vector space. The resultant embedding vectors can be leveraged as features for various network learning tasks, such as node classification, link prediction, and community detection.

For the convenience of narration, we introduce the following notations. Let $G = (V, E)$ be a connected network with node set $V$ ($|V| = n$) and edge set $E$. We use $A$ to denote the adjacency matrix (binary or weighted), and $D$ to denote the degree matrix, with $D_{ii} = \sum_j A_{ij}$.

###### Definition 2.1 ().

Node Proximity and Proximity Matrix: Node proximity defines a graph kernel mapping two nodes to a real value, which reflects the similarity between nodes $v_i$ and $v_j$ in geometric distance or in the structural roles they play in the network, e.g., the network hub and the structural hole (Lou and Tang, 2013). The corresponding kernel matrix is called the proximity matrix $P$.

###### Definition 2.2 ().

Network Embedding: Given a network $G = (V, E)$, network embedding is a mapping function $f: V \rightarrow \mathbb{R}^d$ from the network space to a $d$-dimensional space, where $d \ll n$. In this space, the distance between embedding vectors characterizes some proximity between nodes $v_i$ and $v_j$ in the original network, so that the embedding vectors can be utilized as features for machine learning on the network.

By convention, downstream learning tasks are in turn utilized to evaluate the quality of the learned embedding. Generally, there are two goals for network embedding: reconstruction of the original network from the learned embedding, and support of network inference (Cui et al., 2017). The limitation of the former is stated in (Cui et al., 2017): the embedding may overfit the adjacency matrix. We use the standard task used in DeepWalk to validate the embedding's inference ability, multi-label classification, where each node may have multiple labels to infer (Tang et al., 2009; Perozzi et al., 2014). All comparison methods involved are unsupervised learning algorithms.

## 3. Model Framework

In this section, we first leverage the sparsity property of networks to learn a sparse network embedding, then incorporate local and global network information into the embedding via spectral methods. We show that both the time and space complexity of our model are linear in the volume of the network, so the proposed approach is efficient and scalable for large networks.

### 3.1. Sparse Network Embedding

Since our method is based on the spectral method, we first analyze the cause of expensive computation cost and poor scalability of state-of-the-art spectral matrix factorization embedding methods.

#### 3.1.1. Cause of Expensive Computation Cost and Poor Scalability

The notable word embedding method, word2vec, which learns distributional representations for words, is based on the distribution hypothesis (Harris, 1954), which states that words in similar contexts have similar meanings. Similarly, the word2vec-based embedding methods (DeepWalk, node2vec, LINE, etc.) and some spectral methods (GraRep, NetMF, etc.) transfer this hypothesis to networks and assume that nodes in similar network contexts are similar. In general, the contexts of a node are defined as the set of nodes it can reach within $m$ steps. These methods define the following node-context proximity matrix implicitly or explicitly (Levy and Goldberg, 2014; Qiu et al., 2017):

 (1) $P = \frac{P_1 + P_2 + \cdots + P_m}{m}$

where $P_1 = D^{-1}A$ is called the transmission matrix, $P_i = (D^{-1}A)^i$, and the default value of $m$ is the skip-gram window size. The entry $P_{i,j}$ reflects the probability proximity between node $v_i$ and its context node $v_j$. The objective of state-of-the-art spectral embedding methods is to use a function of the dot product of the node embedding and the context node embedding to approximate $P_{i,j}$.

Equation (1) is also inherited from the linear bag-of-words context assumption in the skip-gram model (Mikolov et al., 2013), in which context words within a fixed-size window of the target word share the same weight. Although equation (1) is expressive for learning word embeddings, given the sequential nature of sentences, it ignores the spatial locality and sparsity of the network, resulting in a dense matrix $P$. Since these matrix factorization methods factorize a matrix derived from $P$, this dense matrix is a cause of expensive computation cost and poor scalability, with at least $O(n^2 d)$ time complexity and $O(n^2)$ storage complexity.
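A small numerical sketch (toy random graph, not the paper's code) makes this density problem concrete: even when the adjacency matrix is sparse, the averaged $m$-step proximity matrix fills in quickly.

```python
import numpy as np

# Sketch: on a small random graph, the m-step proximity matrix
# P = (P_1 + ... + P_m)/m becomes dense even though A is sparse.
rng = np.random.default_rng(0)
n = 200
A = (rng.random((n, n)) < 0.03).astype(float)
A = np.maximum(A, A.T)            # symmetrize
np.fill_diagonal(A, 0)
deg = A.sum(axis=1)
deg[deg == 0] = 1                 # guard isolated nodes
P1 = A / deg[:, None]             # transmission matrix D^{-1} A

m = 10
P, Pk = np.zeros((n, n)), np.eye(n)
for _ in range(m):
    Pk = Pk @ P1                  # P_i = (D^{-1} A)^i
    P += Pk
P /= m

sparsity_A = np.count_nonzero(A) / n**2
sparsity_P = np.count_nonzero(P) / n**2
```

On such a graph, `sparsity_P` is many times larger than `sparsity_A`, which is exactly the blow-up that makes dense factorization expensive.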

Even if we make a trade-off between efficiency and effectiveness and set $m = 1$, the objective matrix is still dense. The denseness of the objective factorized matrix also comes from the negative sampling trick (Mikolov et al., 2013), which results in the factorized matrix taking the form of a shifted PMI matrix (Church and Hanks, 1990) with a global shift bias $\ln \lambda$ ($\lambda$ is the negative sampling number). Even if all negative values in the objective matrix are replaced by $0$ (Levy and Goldberg, 2014; Cao et al., 2015), this cannot reduce the order of magnitude of the non-zero entries.

To deal with these causes of expensive computation cost and poor scalability, we first give a reasonable sparse proximity matrix, restrict our mathematical derivation to the sparse node-context edge set, and conclude that our embedding can be obtained by sparse spectral matrix decomposition. In the Spectral Propagation section, we further eliminate the possible loss of accuracy and model expression capacity resulting from the sparse processing. The final experiments support our model.

#### 3.1.2. Sparse Proximity Matrix

The transmission matrix raised to different powers in equation (1) reflects different-order proximities between nodes and context nodes. Since the network is nonlinear, the weights of different-order proximities should be inconsistent with the linear bag-of-words model. Intuitively, lower-order proximities reveal the basic connectivity of networks and hence play the main part in the proximity matrix. In GraRep (Cao et al., 2015), different-order proximities are considered separately. In HOPE (Ou et al., 2016), the Katz matrix is used to replace equation (1):

 (2) $S_{\text{Katz}} = \sum_{i=1}^{\infty} \beta^i A^i$

where $\beta$ is a decay parameter.

Here we use an efficient edge-dropout method to obtain an attenuated sum of different-order proximities while keeping the resultant sum sparse. We draw the edges of the adjacency matrix $A$ with a dropout ratio to obtain a series of matrices $\hat{A}_j$. We replace the non-zero elements of a matrix with $1$ and denote this operation as $\langle \cdot \rangle$. Our sparse proximity matrix is

 (3) $P \sim \sum_{i=1}^{m} (D^{-1}A)^i \circ \left\langle A \prod_{j=1}^{i-1} \hat{A}_j \right\rangle$

where $\circ$ is the Hadamard product, also named the element-wise product. The dropout ratio controls the sparsity and plays the role of a decay factor. $m$ only needs to be set to a small value, as the possible loss can be compensated later.

After normalizing the right-hand side of equation (3), we get the sparse proximity matrix, still with high-order proximities. The whole process only involves sparse matrix products and is efficient.
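The construction in equation (3) can be sketched as follows; the toy graph, the keep ratio, and the helper names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
import scipy.sparse as sp

# Sketch of equation (3): each power of the transmission matrix is
# masked by the support of A * prod(A_hat_j), where A_hat_j are
# edge-dropout copies of A, so the accumulated proximity stays sparse.
rng = np.random.default_rng(1)
n, m, keep = 300, 3, 0.5          # keep = 1 - dropout ratio (assumed)
A = sp.random(n, n, density=0.005, random_state=1, format="csr")
A = ((A + A.T) > 0).astype(float).tocsr()
deg = np.asarray(A.sum(axis=1)).ravel()
deg[deg == 0] = 1
P1 = sp.diags(1.0 / deg) @ A      # D^{-1} A

def dropout(M, keep, rng):
    """Randomly drop edges of a sparse matrix."""
    M = M.tocoo()
    mask = rng.random(M.nnz) < keep
    return sp.csr_matrix((M.data[mask], (M.row[mask], M.col[mask])),
                         shape=M.shape)

P = sp.csr_matrix((n, n))
Pk, support = sp.identity(n, format="csr"), A.copy()
for i in range(m):
    Pk = (Pk @ P1).tocsr()                            # (D^{-1}A)^i
    P = P + Pk.multiply((support > 0).astype(float))  # Hadamard with <.>
    support = support @ dropout(A, keep, rng)         # next-order support
```

The resulting `P` keeps far fewer non-zeros than the dense $m$-step proximity matrix would.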

#### 3.1.3. Network Embedding as Sparse Matrix Decomposition

The edges of the sparse proximity matrix form a node-context pair set $D$. We define the occurrence probability of context $c_j$ given node $v_i$ as

 (4) $\hat{p}_{i,j} = \sigma(c_j^T r_i)$

where $\sigma$ is the sigmoid function, $c_j$ is the context vector of context node $v_j$, and $r_i$ is the representation embedding vector of node $v_i$. The objective function can be expressed as the weighted sum of log loss:

 (5) $\ell = -\sum_{(i,j)\in D} p_{i,j} \ln \hat{p}_{i,j}$

where $p_{i,j}$ is the entry of the proximity matrix and indicates the weight of the pair $(i,j)$ in the node-context pair set $D$.

However, this objective admits a trivial solution in which $c_j^T r_i = K$ ($K$ very large) and hence $\hat{p}_{i,j} \approx 1$ for every pair $(i,j)$, with the consequence that all the node embeddings are the same and cannot represent any network information about the nodes.

To prevent this trivial solution, we suppose that, given an observed node-context pair $(i, j)$, the appearance of the context is caused either by $v_i$'s stimulation or by the background noise of the node-context set $D$. A binary classifier should distinguish the noise as a negative sample. So the modified loss is:

 (6) $\ell = -\sum_{(i,j)\in D}\left[p_{i,j} \ln \sigma(c_j^T r_i) + \lambda P_{D,j} \ln \sigma(-c_j^T r_i)\right]$

where $\lambda$ is the negative noise sample ratio. $P_{D,j}$, the background noise of the node-context set associated with context node $v_j$, is defined below:

 (7) $P_{D,j} = \frac{\sum_{i:(i,j)\in D} p_{i,j}}{\sum_{(i,j)\in D} p_{i,j}}$

A sufficient condition for minimizing the objective (6) is to let its partial derivative with respect to $c_j^T r_i$ be zero. We get:

 (8) $r_i^T c_j = \ln p_{i,j} - \ln(\lambda P_{D,j}), \quad (i,j)\in D$

We define a matrix $M$ whose entries are given below:

 (9) $M_{i,j} = \begin{cases} \ln p_{i,j} - \ln(\lambda P_{D,j}), & (i,j)\in D \\ 0, & (i,j)\notin D \end{cases}$

From equation (9), the objective is transformed into approximating the matrix $M$ with the low-rank product of the representation embedding matrix $R_d$ and the context embedding matrix $C_d$. An alternative optimization method is truncated Singular Value Decomposition (tSVD), which achieves the optimal rank-$d$ factorization with respect to the Frobenius-norm loss. The objective is shown below:

 (10) $\min_{R_d, C_d} \|M - R_d C_d^T\|_F$

where $R_d$ and $C_d$ are $n \times d$ matrices whose rows stand for a node's embedding and context embedding, respectively. The normalized $R_d$ is:

 (11) $R_d \leftarrow R_d \Sigma_d^{1/2}$

Benefiting from the sparse proximity matrix construction, with $|D| \ll n^2$, sparse truncated Lanczos SVD can be utilized on the matrix $M$ with time complexity linear in the number of non-zero entries $|D|$.
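Equations (9)-(11) can be sketched with SciPy's sparse truncated SVD; the random stand-in proximity matrix and the value $\lambda = 1$ are illustrative assumptions, not the paper's setup.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds

# Sketch of equations (9)-(11): build M from a sparse proximity matrix,
# then factorize it with sparse truncated (Lanczos-type) SVD.
rng = np.random.default_rng(2)
n, d, lam = 500, 16, 1.0
P = sp.random(n, n, density=0.01, random_state=2, format="csr")
P = P / P.sum()                                  # normalize to a distribution

col_noise = np.asarray(P.sum(axis=0)).ravel()    # P_{D,j}: column marginals
M = P.tocoo()
vals = np.log(M.data) - np.log(lam * col_noise[M.col])
M = sp.csr_matrix((vals, (M.row, M.col)), shape=(n, n))

U, s, Vt = svds(M, k=d)                          # sparse truncated SVD
R_d = U * np.sqrt(s)                             # R_d <- R_d Sigma_d^{1/2}
```

Only the non-zero entries of `M` are ever touched, which is what keeps the cost linear in $|D|$.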

### 3.2. Spectral Propagation

In this section, we introduce a propagation method via spectral analysis and explain the intuition for why spectral propagation can incorporate global clustering information and spatial locality smoothing into the sparse embedding. Hence it further eliminates the possible loss of accuracy or expression capacity resulting from the sparse processing. Moreover, spectral propagation is also a general embedding enhancement method.

#### 3.2.1. Network Laplacian Transform

The unnormalized network Laplacian is an essential operator in spectral network analysis and is defined as $L_{un} = D - A$. The random-walk normalized Laplacian is defined as $L = E_n - D^{-1}A$, where $E_n$ is the identity matrix. The random-walk normalized Laplacian can be decomposed as $L = U \Lambda U^{T}$, where $\Lambda = \mathrm{diag}([\lambda_1, \ldots, \lambda_n])$ with $0 = \lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$, and $U$ is the square ($n \times n$) matrix whose $i$-th column is the eigenvector $u_i$.

The complete set of orthonormal eigenvectors is also known as the graph Fourier modes, and the associated eigenvalues are identified as the frequencies of the graph. The graph Fourier transform of a signal $x$ is defined as $\hat{x} = U^T x$, while the inverse transform is $x = U\hat{x}$. They are the transforms between the original "temporal" (vertex) space and the spectral (frequency) space. A simple network propagation can be interpreted as follows: the signal is first transformed into the spectral space, scaled by the eigenvalues, and then transformed back.
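A tiny worked example of this transform pair on a 4-node path graph (a sketch, not the paper's code); since the random-walk Laplacian is not symmetric in general, the sketch uses $U^{-1}$ for the forward transform.

```python
import numpy as np

# Graph Fourier transform on the random-walk Laplacian of a path graph.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D_inv = np.diag(1.0 / A.sum(axis=1))
L = np.eye(4) - D_inv @ A
lam, U = np.linalg.eig(L)
order = np.argsort(lam.real)
lam, U = lam.real[order], U.real[:, order]     # frequencies and Fourier modes

x = np.array([1.0, 0.0, 0.0, 0.0])   # a signal on the nodes
x_hat = np.linalg.solve(U, x)        # forward transform (U^{-1} x)
x_back = U @ x_hat                   # inverse transform recovers x
```

The smallest frequency is 0, as expected for a connected graph.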

We will demonstrate next that the eigenvalues in the spectral space are closely associated with the network's spatial locality smoothing and global clustering, according to higher-order Cheeger's inequality (Lee et al., 2014; Bandeira et al., 2013).

#### 3.2.2. Multi-way Graph Partitioning and Higher-order Cheeger’s Inequality

We first introduce the definitions of expansion and the $k$-way Cheeger constant.

###### Definition 3.1 ().

Expansion: For a node subset $S \subseteq V$,

 $\phi(S) = \frac{|E(S)|}{\min\{\mathrm{vol}(S), \mathrm{vol}(V - S)\}}$

where $E(S)$ is the set of edges with exactly one endpoint in $S$ and $\mathrm{vol}(\cdot)$ is the sum of the nodes' degrees in a set.

###### Definition 3.2 ().

$k$-way Cheeger constant:

 $\rho_G(k) = \min\left\{\max\{\phi(S_i)\} : S_1, S_2, \ldots, S_k \subseteq V \text{ disjoint}\right\}$

According to the definitions, the expansion indicates the quality of the graph partition induced by a subset $S$, and the $k$-way Cheeger constant reflects the quality of partitioning the graph into $k$ parts. A smaller value means a better partitioning. An example is illustrated in Figure 3.
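Definition 3.1 can be checked on a toy graph with two dense clusters joined by a single bridge edge (an illustrative sketch; the graph itself is an assumption, not the one in Figure 3).

```python
import numpy as np

# Computing the expansion phi(S) from Definition 3.1 on a small graph.
edges = [(0, 1), (0, 2), (1, 2),        # cluster {0, 1, 2}
         (3, 4), (3, 5), (4, 5),        # cluster {3, 4, 5}
         (2, 3)]                        # single bridge edge
n = 6
A = np.zeros((n, n))
for u, v in edges:
    A[u, v] = A[v, u] = 1

def expansion(S):
    S = set(S)
    boundary = sum(1 for u, v in edges if (u in S) != (v in S))
    vol = lambda T: sum(A[t].sum() for t in T)
    return boundary / min(vol(S), vol(set(range(n)) - S))

good_cut = expansion({0, 1, 2})   # cuts only the bridge edge: 1/7
bad_cut = expansion({0, 3})       # cuts through both clusters: 5/5
```

The natural cluster cut has a much smaller expansion than an arbitrary cut, matching the intuition that smaller values mean better partitions.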

Higher-order Cheeger's inequality (Lee et al., 2014; Bandeira et al., 2013) bridges the gap between network spectral analysis and graph partitioning by controlling the bounds of the $k$-way Cheeger constant:

 (12) $\frac{\lambda_k}{2} \le \rho_G(k) \le O(k^2)\sqrt{\lambda_k}$

A basic fact in spectral graph theory is that the number of connected components in an undirected graph is equal to the multiplicity of the eigenvalue zero of the Laplacian matrix, which can be concluded from inequality (12) by setting $\lambda_k = 0$.
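This fact is easy to verify numerically on two disjoint triangles, using the unnormalized Laplacian (a sketch, not part of the paper's pipeline):

```python
import numpy as np

# The multiplicity of eigenvalue 0 of L = D - A equals the number of
# connected components: here, two disjoint triangles.
A = np.zeros((6, 6))
for u, v in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    A[u, v] = A[v, u] = 1
L = np.diag(A.sum(axis=1)) - A
eigvals = np.sort(np.linalg.eigvalsh(L))
num_zero = int(np.sum(np.abs(eigvals) < 1e-8))   # expect 2 components
```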

From inequality (12), we can conclude that small (large) eigenvalues control the global clustering (local smoothing) effect, i.e., the network partitioned into a few large parts (many small parts). This inspires us to incorporate global and local network information into the embedding by propagating the raw embedding in a modulated network. For the convenience of narration, we discuss the modulation of the Laplacian matrix and get the new embedding via

 (13) $R_d \leftarrow \tilde{D}^{-1}\tilde{A} R_d = (E_n - \tilde{L}) R_d$

Let $\tilde{L} = U g(\Lambda) U^T$, where $g$ is the modulator in the spectrum.

To take both local and global structure information into consideration, we design the spectral modulator as $g(\lambda) = e^{-\frac{1}{2}[(\lambda-\mu)^2 - 1]\theta}$, where $\mu$ and $\theta$ are the position and sharpness parameters of the filter. Therefore,

 (14) $\tilde{L} = U\,\mathrm{diag}\left(\left[e^{-\frac{1}{2}[(\lambda_1-\mu)^2-1]\theta}, \ldots, e^{-\frac{1}{2}[(\lambda_n-\mu)^2-1]\theta}\right]\right) U^T$

$g$ is a band-pass filter kernel (Shuman et al., 2016; Hammond et al., 2011) that passes eigenvalues within a certain range and attenuates eigenvalues outside that range; hence $\tilde{L}$ is attenuated for the corresponding top largest and top smallest eigenvalues, resulting in amplified global and local network information. A band-pass graph modulation example is illustrated in Figure 1. A low-pass or high-pass filter, which only amplifies the local or global information respectively, can be achieved by tuning the position parameter $\mu$ of the band-pass filter. Therefore, the band-pass filter is a general spectral network modulator.
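The modulation in equation (14) can be sketched directly in the spectral domain on a toy graph; the use of the symmetric normalized Laplacian and the values of $\mu$ and $\theta$ here are assumptions for illustration, not the paper's defaults.

```python
import numpy as np

# Applying the band-pass modulator g(lambda) = exp(-((lambda-mu)^2 - 1) * theta / 2)
# in the spectral domain of a (symmetric) normalized Laplacian.
mu, theta = 0.5, 1.0
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
d = A.sum(axis=1)
L_sym = np.eye(4) - np.diag(d ** -0.5) @ A @ np.diag(d ** -0.5)
lam, U = np.linalg.eigh(L_sym)

g = np.exp(-0.5 * ((lam - mu) ** 2 - 1) * theta)   # spectral modulator
L_mod = U @ np.diag(g) @ U.T                       # modulated Laplacian, eq (14)
```

The eigenvalues of `L_mod` are exactly the filtered values `g`, so eigenvalues far from $\mu$ are attenuated.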

#### 3.2.3. Chebyshev Expansion

Note that the Fourier transform and inverse Fourier transform in equation (14) involve eigendecomposition; we utilize truncated Chebyshev expansion to avoid the explicit transforms. The Chebyshev polynomials of the first kind are defined by the recurrence relation $T_{i+1}(x) = 2xT_i(x) - T_{i-1}(x)$ with $T_0(x) = 1$ and $T_1(x) = x$. Then

 (15) $\tilde{L} \approx U \sum_{i=0}^{k-1} c_i(\theta) T_i(\bar{\Lambda}) U^T = \sum_{i=0}^{k-1} c_i(\theta) T_i(\bar{L})$

where $\bar{L} = L - E_n$, and the scaled eigenvalues $\bar{\lambda} = \lambda - 1$ lie in $[-1, 1]$. The new kernel, in the scaled variable, is $e^{-\bar{\lambda}\theta}$.

As the $T_i$ are orthogonal with the weight $1/\sqrt{1-x^2}$ on the interval $[-1, 1]$, that is,

 (16) $\int_{-1}^{1} \frac{T_i(x) T_j(x)}{\sqrt{1-x^2}}\, dx = \begin{cases} 0, & i \neq j \\ \pi, & i = j = 0 \\ \frac{\pi}{2}, & i = j \neq 0 \end{cases}$

the coefficients of the Chebyshev expansion for $e^{-x\theta}$ can be obtained by:

 (17) $c_i(\theta) = \begin{cases} \frac{1}{\pi}\int_{-1}^{1} \frac{T_i(x)\, e^{-x\theta}}{\sqrt{1-x^2}}\, dx = (-1)^i I_i(\theta), & i = 0 \\ \frac{2}{\pi}\int_{-1}^{1} \frac{T_i(x)\, e^{-x\theta}}{\sqrt{1-x^2}}\, dx = 2(-1)^i I_i(\theta), & i \neq 0 \end{cases}$

where $I_i(\theta)$ is the notable special function, the modified Bessel function of the first kind (Andrews and Andrews, 1992).
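This expansion can be verified numerically for scalar arguments: the truncated series with modified Bessel coefficients reproduces $e^{-x\theta}$ on $[-1, 1]$ (SciPy's `iv` is the modified Bessel function $I_i$; the values of $\theta$ and $k$ are assumed for illustration).

```python
import numpy as np
from scipy.special import iv  # modified Bessel function of the first kind

# Verify: e^{-x*theta} = I_0(theta)*T_0(x) + 2*sum_i (-1)^i I_i(theta)*T_i(x).
theta, k = 0.5, 12
x = np.linspace(-1, 1, 101)

# Chebyshev polynomials via the recurrence T_{i+1} = 2x T_i - T_{i-1}.
T = [np.ones_like(x), x]
for i in range(2, k):
    T.append(2 * x * T[-1] - T[-2])

approx = iv(0, theta) * T[0]
for i in range(1, k):
    approx += 2 * (-1) ** i * iv(i, theta) * T[i]

max_err = np.max(np.abs(approx - np.exp(-theta * x)))
```

The coefficients decay rapidly, so a small number of terms $k$ already gives near machine-precision accuracy.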

Then, we get the series expansion of the modulated Laplacian:

 (18) $\tilde{L} \approx I_0(\theta) T_0(\bar{L}) + 2\sum_{i=1}^{k-1} (-1)^i I_i(\theta) T_i(\bar{L})$

The embedding matrix propagated by the spectral modulated network is:

 (19) $R_d \leftarrow \tilde{D}^{-1}\tilde{A} R_d = (E_n - \tilde{L}) R_d = \left\{E_n - \left[I_0(\theta) T_0(\bar{L}) + 2\sum_{i=1}^{k-1}(-1)^i I_i(\theta) T_i(\bar{L})\right]\right\} R_d$

To alleviate the perturbation of the spectrum of $\tilde{L}$, we further modify expression (19):

 (20) $R_d \leftarrow D^{-1}A\,\tilde{D}^{-1}\tilde{A} R_d = D^{-1}A (E_n - \tilde{L}) R_d = D^{-1}A\left\{E_n - \left[I_0(\theta) T_0(\bar{L}) + 2\sum_{i=1}^{k-1}(-1)^i I_i(\theta) T_i(\bar{L})\right]\right\} R_d$

The computation of expression (20) can be efficiently executed in a recursive way. Denote $X_i = T_i(\bar{L}) R_d$; then $X_{i+1} = 2\bar{L}X_i - X_{i-1}$ with $X_0 = R_d$ and $X_1 = \bar{L}R_d$. Note that $R_d$ is $n \times d$ and $\bar{L}$ is sparse, so the overall complexity of equation (20) is $O(k\,d\,|E|)$.
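The recurrence can be sketched as follows; the random graph, the stand-in raw embedding, and the parameter values are assumptions for illustration, not the paper's settings.

```python
import numpy as np
import scipy.sparse as sp
from scipy.special import iv

# Sketch of equation (20): X_i = T_i(L_bar) R_d is built with the
# recurrence X_{i+1} = 2 L_bar X_i - X_{i-1}, so only sparse
# matrix-times-embedding products are needed (no eigendecomposition).
rng = np.random.default_rng(3)
n, d, k, theta = 100, 8, 10, 0.5
A = sp.random(n, n, density=0.05, random_state=3, format="csr")
A = ((A + A.T) > 0).astype(float)
deg = np.asarray(A.sum(axis=1)).ravel()
deg[deg == 0] = 1
L = sp.identity(n) - sp.diags(1.0 / deg) @ A   # random-walk Laplacian
L_bar = (L - sp.identity(n)).tocsr()           # shift spectrum into [-1, 1]

R = rng.standard_normal((n, d))                # stand-in raw embedding
X_prev, X_curr = R, L_bar @ R                  # X_0, X_1
acc = iv(0, theta) * X_prev - 2 * iv(1, theta) * X_curr
for i in range(2, k):
    X_prev, X_curr = X_curr, 2 * (L_bar @ X_curr) - X_prev
    acc += 2 * (-1) ** i * iv(i, theta) * X_curr   # acc approximates L_tilde R

R_new = sp.diags(1.0 / deg) @ A @ (R - acc)    # D^{-1} A (E_n - L_tilde) R
```

Each step costs one sparse matrix times a thin $n \times d$ matrix, which is where the $O(kd|E|)$ bound comes from.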

In addition, while the spectral propagation incorporates more structure information into the embedding, it also ruins the orthogonality of the original embedding space achieved by sparse SVD. Hence, we mend it via a common SVD on $R_d$. As $R_d$ is $n \times d$ with $d \ll n$, the time complexity is $O(nd^2)$.

#### 3.2.4. Spectral Propagation as a General Enhancement Method

Though we propose spectral propagation to eliminate the possible loss of accuracy due to sparse processing, spectral propagation is largely independent of the sparse network embedding model. In other words, it is a general method for further incorporating network information efficiently, and can thus enhance the quality of embeddings trained by other models. We validate the effectiveness of this enhancement in the experiment section.

### 3.3. Algorithm summary

In this part, we summarize the proposed algorithm, Progle. We first construct the network proximity matrix via the sparsity property; second, we deduce that the raw network embedding in our model amounts to decomposing a sparse matrix; next, we train network embeddings via the sparse truncated Lanczos SVD algorithm; finally, according to higher-order Cheeger's inequality, we modulate the network's spectrum and propagate the raw embedding in the modulated network to incorporate global and local network structure information.

Without product and decomposition operations on dense matrices, the proposed method scales to large networks with competitive performance and takes up fewer computation resources due to sparsity (Berry, 1991, 1992).

#### 3.3.1. Complexity analysis

In the sparse network embedding part, the sparse proximity matrix generation costs time linear in the number of retained edges, and the sparse network embedding training (sparse truncated SVD) costs time linear in $|D|$. In the spectral propagation part, the propagation complexity is $O(kd|E|)$ and the final SVD costs $O(nd^2)$. In all, the total time complexity is linear in the volume of the network.

The space complexity to store the sparse proximity matrix and the embedding is $O(|D| + nd)$.

Both the time complexity and the space complexity are linear in the volume of the network, which accounts for the efficiency and scalability.

## 4. Experiments

In this section, we conduct the multi-label node classification experiments to evaluate the proposed model and the improvement for other embedding methods. Then we compare our time cost with the baselines, and further validate Progle’s scalability and efficiency in large-scale real networks and multiple scales of synthetic networks.

### 4.1. Experimental Setup

#### 4.1.1. Datasets

We employ four widely-used datasets for the node classification experiments. The statistics are shown in Table 1.

• BlogCatalog (Zafarani and Liu, 2009): BlogCatalog is a social blogger network, where nodes and edges stand for bloggers and their pairwise social relationships, respectively. Bloggers submit their blogs with interest tags, which are considered as ground-truth labels.

• Wikipedia (Mahoney, 2009): This is a co-occurrence network of words in the first million bytes of the Wikipedia dump, and the labels represent the Part-of-Speech (POS) tags.

• Protein-Protein Interactions (PPI) (Breitkreutz et al., 2008): This is a subgraph of PPI network for Homo Sapiens. The subgraph corresponds to the graph induced by nodes which have labels from hallmark gene sets and represent biological states.

• DBLP(Tang et al., 2008): DBLP is an academic network dataset where authors are treated as nodes, citation relationships as edges and academic conferences or activities as labels.

Another three real large-scale networks without labels, and multiple scales of synthetic networks are added to validate Progle’s efficiency and scalability.

• Synthetic networks: They are regular graphs. The network degree ranges from 10 to 1000, and the number of nodes ranges from 1,000 to 10,000,000.

• Flickr (Zafarani and Liu, 2009): A user contact network dataset crawled from Flickr.

• Youtube (Leskovec and Sosič, 2016): A friendship network of Youtube users.

• wiki-topcats (Leskovec and Sosič, 2016): A web graph of Wikipedia hyperlinks collected in September 2011.

#### 4.1.2. Comparison baselines

The following methods are baselines in the experiments:

• DeepWalk(Perozzi et al., 2014). DeepWalk transforms a graph structure into linear sequences by random walks and processes the sequences using word2vec model (skip-gram).

• LINE(Tang et al., 2015). It defines loss functions to preserve first-order or second-order proximity in the embedding.

• node2vec (Grover and Leskovec, 2016). It can be treated as the biased random walk version of DeepWalk.

• GraRep (Cao et al., 2015). It decomposes the $k$-step probability transition matrices to train the node embedding, then concatenates all the $k$-step representations.

• HOPE (Ou et al., 2016). It is a matrix factorization spectral method, approximating high-order proximity based on factorizing the Katz matrix.

#### 4.1.3. Evaluation Methods

For effectiveness validation, the prediction performance of an embedding is evaluated via the average Micro-F1 in the multi-label classification task, which has also been used in the works of DeepWalk, LINE, node2vec, and GraRep. Following the same experimental procedure as DeepWalk, we randomly sample different percentages of the labeled nodes for training and use the rest for testing, varying the training ratio per dataset. For each comparison method, the resultant embeddings are used to train a one-vs-rest logistic regression classifier with L2 regularization. In the test phase, the one-vs-rest model yields a ranking of labels rather than an exact label assignment. To avoid the thresholding effect (Tang et al., 2009), we assume the number of labels for each test node is given (Tang et al., 2009; Perozzi et al., 2014). We repeat each experiment multiple times and report the average Micro-F1 and standard deviation. Analogous results hold for Macro-F1 and other accuracy metrics, and thus are not shown.
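The top-$k$ label-assignment step of this protocol can be sketched as follows, with toy scores and label counts (assumed values, not real experiment data):

```python
import numpy as np

# Sketch of the evaluation protocol: the one-vs-rest model outputs a
# score per label, and each test node is assigned its top-k labels,
# where k is the node's known number of labels.
scores = np.array([[0.9, 0.2, 0.7],     # per-node, per-label scores (toy)
                   [0.1, 0.8, 0.3]])
num_labels = [2, 1]                      # known label counts per test node

pred = []
for row, k in zip(scores, num_labels):
    top = np.argsort(row)[::-1][:k]      # take the k highest-scoring labels
    pred.append(sorted(top.tolist()))
# pred is then compared against ground truth to compute Micro-F1
```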

In the efficiency and scalability experiments, the efficiency is evaluated by the CPU time the method takes to get the embedding. The scalability of Progle is analyzed by the time cost in the multiple scale and different density networks.

The experiments were conducted on a Red Hat server with four Intel Xeon(R) CPU E5-4650 (2.70GHz) and 1T RAM.

#### 4.1.4. Parameter Settings

To facilitate comparison of methods, we set the dimension of all the embeddings to the same $d$. For DeepWalk and node2vec, we follow DeepWalk's preferred parameters for the window size, walks per node, and walk length. The return and in-out parameters $p, q$ in node2vec are searched over {0.25, 0.50, 1, 2, 4}. For LINE, we use its suggested learning rate and negative sample number. For GraRep, the dimension of the concatenated embedding matches $d$ for fairness. For HOPE, $\beta$ is calculated in the authors' code and searched over a range to improve performance. For Progle, the term number of the Chebyshev expansion is $k$, and $\mu$, $\theta$ are set to their default values.

### 4.2. Multi-label classification on different datasets

Table 3 summarizes the prediction performance of all methods on the four datasets. It shows that Progle achieves significant improvement over the baselines on different datasets at different training ratios. On BlogCatalog, Progle only achieves a relatively marginal improvement over DeepWalk and node2vec, but with much higher efficiency (see Figure 2).

Combined with the efficiency experiment, we can conclude that Progle outperforms, or in some cases is comparable to, these state-of-the-art comparison methods, while being faster and taking up fewer computation resources.

### 4.3. Enhancement for other embedding methods

As we mentioned in the model section, spectral propagation is a general model to further incorporate network information and thus can enhance the quality of embedding trained by other models.

On the Wikipedia dataset, the accuracy of the baselines can be improved further. We input their learned embeddings into Progle and modulate them in the spectral space when propagating them in the network. The outputs are denoted as "Pro+" embeddings. The improvement is significant, with a notable average relative gain for all methods at different training ratios, most prominently for HOPE at the 80% training ratio. The accuracy of the "Pro+" versions of these baselines is shown in Figure 4.

These improvements are completed within a second, as the input embeddings are low-rank matrices.

### 4.4. Efficiency and Scalability Experiment

In this part, we first compare the efficiency of Progle with the other scalable embedding methods, that is, DeepWalk, LINE, and node2vec. Then we show the time cost of Progle on larger networks and analyze its scalability to validate the time and space complexity analysis. Notice that Progle is single-threaded while DeepWalk and node2vec are trained using 20 workers with the iteration epoch set to 1, so Progle is even more efficient than shown.

Table 4 displays the time cost of Progle and the word2vec-based methods. Utterly different from the other spectral matrix factorization embedding methods, which automatically take up all the CPU processes and still run slowly, Progle is single-threaded, yet still faster than the word2vec-based embedding methods. In a word, Progle takes up fewer computation resources and is faster without loss of effectiveness.

The time cost of Progle on real large-scale networks and on synthetic regular networks of multiple scales and densities is shown in Table 5 and Figure 5. For the density-varying synthetic networks, the number of nodes is fixed at 10,000 and the degree grows from 10 to 1000; for synthetic networks A, B, C, D, and E, the degree is fixed and the number of nodes varies from 1,000 to 10,000,000. The time cost increases nearly linearly with the number of nodes and edges, validating Progle's scalability. Note that the time cost of Progle on the largest networks in Table 5 is still comparable to that of the baselines on much smaller networks (Table 4).

Figure 5 shows that both the total time cost and the sparse network embedding time cost increase with network volume and density; the gap between the two curves is the time spent in the spectral propagation phase. Therefore, the time cost of both the sparse network embedding phase and the spectral propagation phase is linear in network volume.

As for other spectral matrix factorization embedding methods, e.g., GraRep, they run out of memory once the number of network nodes grows large.

## 5. Related Work

Roughly, related work falls into three categories: spectral matrix factorization embeddings, word2vec-based embeddings, and convolution-based embeddings.

Some of the original spectral network embedding works were related to spectral dimension reduction methods such as Isomap (Tenenbaum et al., 2000), Laplacian Eigenmaps (Belkin and Niyogi, 2001), and spectral clustering (Yan et al., 2009). These methods typically exploit the spectral properties of graph matrices, especially the Laplacian and adjacency matrices. Their computation involves dense matrix decomposition and is therefore time-consuming, with time complexity at least quadratic in the number of nodes, which limits their applicability.
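As a point of reference, a Laplacian-Eigenmaps-style embedding can be sketched as follows. This is a minimal illustration on a toy graph, not the method of this paper; the graph and dimensionality are made up, and `scipy.sparse.linalg.eigsh` is used so that only sparse operations are needed (whereas the classic formulations rely on dense decomposition).

```python
# Minimal sketch of a Laplacian-Eigenmaps-style spectral embedding.
# Toy graph and dimensions are illustrative assumptions.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

# Toy undirected graph: a 6-node cycle, stored sparsely.
edges = [(i, (i + 1) % 6) for i in range(6)]
rows, cols = zip(*edges)
A = sp.coo_matrix((np.ones(len(edges)), (rows, cols)), shape=(6, 6))
A = A + A.T  # symmetrize

deg = np.asarray(A.sum(axis=1)).ravel()
L = sp.diags(deg) - A  # unnormalized graph Laplacian

# The smallest non-trivial eigenvectors give the embedding coordinates;
# k = dim + 1 because the constant eigenvector (eigenvalue 0) is discarded.
dim = 2
vals, vecs = eigsh(L.tocsc().astype(float), k=dim + 1, which="SM")
order = np.argsort(vals)
vals, vecs = vals[order], vecs[:, order]
embedding = vecs[:, 1:]  # drop the trivial eigenvector
print(embedding.shape)  # (6, 2)
```

For large networks the bottleneck is exactly this eigendecomposition, which motivates the sparsity-based shortcuts discussed in this paper.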

Another series of network embedding works has arisen since DeepWalk (Perozzi et al., 2014) built a bridge between network representation and neural language models. LINE (Tang et al., 2015) preserves the first-order and second-order proximity of networks. Node2vec (Grover and Leskovec, 2016) designs a biased random walk procedure to trade off homophily similarity against structural equivalence. These methods are often based on sampling and optimize a non-convex objective, resulting in sampling error and high optimization cost. A matrix-factorization interpretation of these word2vec-based methods has also been proposed, e.g., GraRep (Cao et al., 2015), NetMF (Qiu et al., 2017), and HOPE (Ou et al., 2016). Like the original spectral methods, these matrix decomposition versions of word2vec-based methods are expensive in both time and space, and are not scalable.

The third category is convolution-based embedding methods, e.g., the Graph Convolution Network (GCN) (Kipf and Welling, 2016), the Graph Attention Network (GAT) (Velickovic et al., 2017), Message Passing Neural Networks (MPNNs) (Gilmer et al., 2017), and GraphSage (Hamilton et al., 2017). In these works, the convolution operation is defined in the spectral space and parametric filters are learned via back-propagation. Most of them are also not easily scaled up to handle large networks. Different from them, our model features a band-pass filter incorporating both spatial locality smoothing and global structure, with coefficients calculated in advance via integration. Furthermore, a remarkable distinction is that our method requires no side-information features, is unsupervised and task-independent, and thus benefits a broader range of downstream applications.
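The contrast above can be made concrete with a sketch of how a fixed polynomial filter is applied to embeddings via the Chebyshev recurrence, so that only sparse matrix products are needed (no eigendecomposition and no learned parameters). This is an illustration under stated assumptions, not the exact Progle propagation: the filter coefficients, the toy graph, and all names below are hypothetical stand-ins for coefficients that would be precomputed by integrating the target band-pass response.

```python
# Hedged sketch: applying a precomputed polynomial filter g(L) to node
# embeddings with the Chebyshev recurrence; only sparse matvecs are used.
import numpy as np
import scipy.sparse as sp

def chebyshev_filter(L, X, coeffs, lam_max=2.0):
    """Approximate g(L) @ X given Chebyshev coefficients c_0..c_K.

    L      : sparse normalized Laplacian, eigenvalues assumed in [0, lam_max].
    X      : dense (n, d) embedding matrix.
    coeffs : filter coefficients, assumed precomputed (e.g., by numerically
             integrating the desired band-pass response).
    """
    n = L.shape[0]
    # Rescale so the spectrum lies in [-1, 1], the Chebyshev domain.
    L_tilde = (2.0 / lam_max) * L - sp.identity(n)
    T_prev, T_curr = X, L_tilde @ X                    # T_0, T_1
    out = coeffs[0] * T_prev + coeffs[1] * T_curr
    for c in coeffs[2:]:
        # Chebyshev recurrence: T_k = 2 * L_tilde @ T_{k-1} - T_{k-2}
        T_prev, T_curr = T_curr, 2.0 * (L_tilde @ T_curr) - T_prev
        out = out + c * T_curr
    return out

# Toy usage on a 4-node path graph (illustrative coefficients).
A = sp.csr_matrix(np.array([[0, 1, 0, 0],
                            [1, 0, 1, 0],
                            [0, 1, 0, 1],
                            [0, 0, 1, 0]], dtype=float))
deg = np.asarray(A.sum(axis=1)).ravel()
D_inv_sqrt = sp.diags(1.0 / np.sqrt(deg))
L = sp.identity(4) - D_inv_sqrt @ A @ D_inv_sqrt  # normalized Laplacian
X = np.random.rand(4, 3)
Y = chebyshev_filter(L, X, coeffs=[0.5, 0.3, 0.2])
print(Y.shape)  # (4, 3)
```

Since each step is a sparse matrix times a dense (n, d) matrix, the cost per Chebyshev order is linear in the number of edges, which is what makes this kind of propagation scalable.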

## 6. Conclusions

In this paper, network sparsity is utilized in three ways. First, an edge-dropout technique is used to design the sparse proximity matrix, so that sparse truncated SVD can be applied to obtain the embedding. Second, all matrix products involve only sparse matrices, e.g., in the Chebyshev expansion phase. Third, sparse storage makes large-scale network embedding possible. Besides, spectral propagation in Progle incorporates local and global network structure into embeddings, which not only helps Progle outperform or at least match state-of-the-art baselines, but also enhances other methods at speed. In addition to the innovation in the model, the efficiency and effectiveness of the proposed method are also significant. The model tackles the poor scalability and expensive computation of spectral methods and requires fewer computational resources than existing methods.
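The first point, obtaining an embedding from a sparse proximity matrix via sparse truncated SVD, can be sketched as follows. This is a minimal illustration, not the authors' exact pipeline: the random matrix stands in for the edge-dropout-sparsified proximity matrix, and the `sqrt`-of-singular-values scaling is a common convention assumed here.

```python
# Minimal sketch: embedding a sparse proximity matrix via truncated SVD.
# The random matrix below is a stand-in for the sparsified proximity matrix.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds

n, density, dim = 1000, 0.01, 32  # illustrative sizes

# Stand-in sparse proximity matrix (in Progle, built with edge dropout).
M = sp.random(n, n, density=density, random_state=0, format="csr")

# Truncated SVD touches only the nonzeros, so cost scales with nnz(M),
# not n^2 -- the key benefit of keeping the proximity matrix sparse.
U, S, Vt = svds(M, k=dim)
embedding = U * np.sqrt(S)  # common scaling choice, an assumption here
print(embedding.shape)  # (1000, 32)
```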

## References

• Andrews and Andrews (1992) Larry C Andrews and Larry C Andrews. 1992. Special functions of mathematics for engineers. McGraw-Hill New York.
• Bandeira et al. (2013) Afonso S Bandeira, Amit Singer, and Daniel A Spielman. 2013. A Cheeger inequality for the graph connection Laplacian. SIAM J. Matrix Anal. Appl. 34, 4 (2013), 1611–1630.
• Belkin and Niyogi (2001) Mikhail Belkin and Partha Niyogi. 2001. Laplacian eigenmaps and spectral techniques for embedding and clustering. In NIPS, Vol. 14. 585–591.
• Berry (1991) Michael Waitsel Berry. 1991. Multiprocessor sparse SVD algorithms and applications. Ph.D. Dissertation. University of Illinois at Urbana-Champaign.
• Berry (1992) Michael W Berry. 1992. Large-scale sparse singular value computations. The International Journal of Supercomputing Applications 6, 1 (1992), 13–49.
• Breitkreutz et al. (2008) Bobby-Joe Breitkreutz, Chris Stark, Teresa Reguly, Lorrie Boucher, Ashton Breitkreutz, Michael Livstone, Rose Oughtred, Daniel H Lackner, Jürg Bähler, Valerie Wood, et al. 2008. The BioGRID interaction database: 2008 update. Nucleic acids research 36, suppl 1 (2008), D637–D640.
• Cao et al. (2015) Shaosheng Cao, Wei Lu, and Qiongkai Xu. 2015. Grarep: Learning graph representations with global structural information. In Proceedings of CIKM. ACM, 891–900.
• Church and Hanks (1990) Kenneth Ward Church and Patrick Hanks. 1990. Word association norms, mutual information, and lexicography. Computational linguistics 16, 1 (1990), 22–29.
• Cui et al. (2017) Peng Cui, Xiao Wang, Jian Pei, and Wenwu Zhu. 2017. A Survey on Network Embedding. arXiv preprint arXiv:1711.08752 (2017).
• Gilmer et al. (2017) Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. 2017. Neural message passing for quantum chemistry. arXiv preprint arXiv:1704.01212 (2017).
• Grover and Leskovec (2016) Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In Proceedings of KDD. ACM, 855–864.
• Hamilton et al. (2017) Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems. 1025–1035.
• Hammond et al. (2011) David K Hammond, Pierre Vandergheynst, and Rémi Gribonval. 2011. Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis 30, 2 (2011), 129–150.
• Harris (1954) Zellig S Harris. 1954. Distributional structure. Word 10, 2-3 (1954), 146–162.
• Kipf and Welling (2016) Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
• Lee et al. (2014) James R Lee, Shayan Oveis Gharan, and Luca Trevisan. 2014. Multiway spectral partitioning and higher-order cheeger inequalities. Journal of the ACM (JACM) 61, 6 (2014), 37.
• Leskovec and Sosič (2016) Jure Leskovec and Rok Sosič. 2016. Snap: A general-purpose network analysis and graph-mining library. ACM Transactions on Intelligent Systems and Technology (TIST) 8, 1 (2016), 1.
• Levy and Goldberg (2014) Omer Levy and Yoav Goldberg. 2014. Neural word embedding as implicit matrix factorization. In NIPS. 2177–2185.
• Lou and Tang (2013) Tiancheng Lou and Jie Tang. 2013. Mining structural hole spanners through information diffusion in social networks. In Proceedings of WWW. ACM, 825–836.
• Mahoney (2009) Matt Mahoney. 2009. Large text compression benchmark. URL: http://www.mattmahoney.net/text/text.html (2009).
• Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS. 3111–3119.
• Ou et al. (2016) Mingdong Ou, Peng Cui, Jian Pei, Ziwei Zhang, and Wenwu Zhu. 2016. Asymmetric transitivity preserving graph embedding. In Proceedings of KDD. ACM, 1105–1114.
• Perozzi et al. (2014) Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. Deepwalk: Online learning of social representations. In Proceedings of KDD. ACM, 701–710.
• Qiu et al. (2017) Jiezhong Qiu, Yuxiao Dong, Hao Ma, Jian Li, Kuansan Wang, and Jie Tang. 2017. Network Embedding as Matrix Factorization: Unifying DeepWalk, LINE, PTE, and node2vec. arXiv preprint arXiv:1710.02971 (2017).
• Shuman et al. (2016) David I Shuman, Benjamin Ricaud, and Pierre Vandergheynst. 2016. Vertex-frequency analysis on graphs. Applied and Computational Harmonic Analysis 40, 2 (2016), 260–291.
• Tang et al. (2015) Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. 2015. Line: Large-scale information network embedding. In Proceedings of WWW. ACM, 1067–1077.
• Tang et al. (2008) Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, and Zhong Su. 2008. Arnetminer: extraction and mining of academic social networks. In Proceedings of KDD. ACM, 990–998.
• Tang et al. (2009) Lei Tang, Suju Rajan, and Vijay K Narayanan. 2009. Large scale multi-label classification via metalabeler. In Proceedings of the 18th international conference on World wide web. ACM, 211–220.
• Tenenbaum et al. (2000) Joshua B Tenenbaum, Vin De Silva, and John C Langford. 2000. A global geometric framework for nonlinear dimensionality reduction. science 290, 5500 (2000), 2319–2323.
• Velickovic et al. (2017) Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2017. Graph Attention Networks. arXiv preprint arXiv:1710.10903 (2017). arXiv:1710.10903 http://arxiv.org/abs/1710.10903
• Yan et al. (2009) Donghui Yan, Ling Huang, and Michael I Jordan. 2009. Fast approximate spectral clustering. In Proceedings of KDD. ACM, 907–916.
• Zafarani and Liu (2009) Reza Zafarani and Huan Liu. 2009. Social computing data repository at ASU.