Fast and Accurate Network Embeddings via Very Sparse Random Projection
Abstract.
We present FastRP, a scalable and performant algorithm for learning distributed node representations in a graph. FastRP is over 4,000 times faster than state-of-the-art methods such as DeepWalk and node2vec, while achieving comparable or even better performance as evaluated on several real-world networks on various downstream tasks. We observe that most network embedding methods consist of two components: constructing a node similarity matrix and then applying dimension reduction techniques to this matrix. We show that the success of these methods should be attributed to the proper construction of the similarity matrix, rather than to the dimension reduction method employed.
FastRP is proposed as a scalable algorithm for network embeddings. Two key features of FastRP are: 1) it explicitly constructs a node similarity matrix that captures transitive relationships in a graph and normalizes matrix entries based on node degrees; 2) it utilizes very sparse random projection, a scalable, optimization-free method for dimension reduction. An extra benefit of combining these two design choices is that they allow the iterative computation of node embeddings, so the similarity matrix need not be explicitly constructed, which further speeds up FastRP. FastRP is also advantageous for its ease of implementation, parallelization and hyperparameter tuning. The source code is available at https://github.com/GTmac/FastRP.
1. Introduction
Network embedding methods learn low-dimensional distributed representations of nodes in a network. These learned representations can serve as latent features for a variety of inference tasks on graphs, such as node classification (Perozzi et al., 2014), link prediction (Grover and Leskovec, 2016) and network reconstruction (Wang et al., 2016).
Research on network embeddings dates back to the early 2000s in the context of dimension reduction, when methods such as LLE (Roweis and Saul, 2000), IsoMap (Tenenbaum et al., 2000) and Laplacian Eigenmaps (Belkin and Niyogi, 2002) were proposed. These methods are general in that they embed an arbitrary $n \times n$ feature matrix ($n$ is the number of data points) into an $n \times d$ embedding matrix, where $d \ll n$. Although these methods produce high-quality embeddings, their time complexity is at least $O(n^2)$, which is prohibitive for large $n$.
More recent work in this area shifts its focus to embedding graph data, which represents a special class of sparse feature matrix whose number of nonzero entries is linear in $n$. The sparsity and discreteness of real-world graphs permit the design of more scalable network embedding algorithms. The pioneering work here is DeepWalk (Perozzi et al., 2014), which essentially samples node pairs from $k$-step transition matrices with different values of $k$, and then trains a Skip-gram (Mikolov et al., 2013) model on these pairs to obtain node embeddings.
The most significant contribution of DeepWalk is that it introduces a two-component paradigm for representation learning on graphs: first explicitly constructing a node similarity matrix or implicitly sampling node pairs from it, then performing dimension reduction on the matrix to produce node embeddings. Much subsequent work has followed, proposing different strategies for both steps (Tang et al., 2015b; Grover and Leskovec, 2016; Wang et al., 2016; Tsitsulin et al., 2018).
Although most such methods are considered scalable, with time complexity linear in the number of nodes and/or edges, we note that the constant factor is often too large to be ignored. The reason is twofold. First, many of these methods are sampling-based, and a huge number of samples is required to learn high-quality embeddings. For example, DeepWalk samples about 32,000 context nodes per node under its default setting.¹ Second, the dimension reduction methods being used also incur substantial computational cost, making this constant factor even larger. Popular methods such as DeepWalk (Perozzi et al., 2014), LINE (Tang et al., 2015b) and node2vec (Grover and Leskovec, 2016) all adopt Skip-gram for learning node embeddings. However, optimizing the Skip-gram model is time-consuming due to the large number of gradient updates needed before the model converges. As such, despite being among the most scalable state-of-the-art network embedding algorithms, DeepWalk and LINE still take several days of CPU time to embed the Youtube graph (Tang and Liu, 2009), a moderately sized graph with 1M nodes and 3M edges.

¹We consider the recommended hyperparameter settings in the DeepWalk paper, where 80 random walks of length 40 are sampled per node and the window size for Skip-gram is 10.
Can we design a truly scalable network embedding method that produces node embeddings for million-scale graphs in several minutes without compromising representation quality? To answer this question, we analyze several state-of-the-art network embedding methods by examining their design choices for both similarity matrix construction and dimension reduction. Our analysis motivates us to propose FastRP, which offers much more scalable solutions to both steps without compromising embedding quality.
To illustrate the effectiveness of FastRP, we visualize the node representations produced by FastRP, DeepWalk (Perozzi et al., 2014) and an earlier random projection-based method, RandNE (Zhang et al., 2018a), on the WWW network (Figure 1). The nodes are hostnames such as youtube.com and instagram.com. For the purpose of visualization, we use t-SNE (Maaten and Hinton, 2008) to project the node embeddings into two-dimensional space. We take the websites from five countries: United Kingdom (.uk), Japan (.jp), Brazil (.br), France (.fr) and Spain (.es), as indicated by the color of the dots. Observe that for FastRP and DeepWalk, the websites from each top-level domain form clusters that are very well separated. For the RandNE embeddings, there is no clear boundary between the websites from different top-level domains. FastRP achieves similar quality to DeepWalk while being over 4,000 times faster.
To sum up, our contributions are the following:
Improved Understanding of Existing Network Embedding Algorithms. By viewing representative network embedding algorithms as a procedure with two components, similarity matrix construction and dimension reduction, we gain an improved understanding of why these algorithms work and why they have scalability issues. This improved understanding motivates us to propose new solutions for both components.
Better Formulation of the Node Similarity Matrix. We construct a node similarity matrix with two unique properties: 1) it considers the implicit, transitive relationships between nodes; 2) it normalizes pairwise similarity of nodes based on node degrees.
More Scalable Dimension Reduction Algorithm. Different from previous work that relies on time-consuming dimension reduction methods such as Skip-gram and SVD, we obtain node embeddings via very sparse random projection of the node similarity matrix. An additional benefit of combining these two design choices is that it allows the iterative computation of node embeddings, which has linear cost in the size of the graph.
DeepWalk-Quality Embeddings Produced Over 4,000 Times Faster. Extensive experimental results show that FastRP produces high-quality node embeddings comparable to state-of-the-art methods while being at least three orders of magnitude faster.
2. Preliminaries
In this section, we give the formal definition of network embeddings and introduce the paradigm of network embeddings as a two-component process. We then detail the design decisions of several state-of-the-art methods for both components and show why they have scalability issues.
2.1. Notation and Task Definition
We consider the problem of embedding a network: given an undirected graph, the goal is to learn a low-dimensional latent representation for each node in the graph.² Formally, let $G = (V, E)$ be a graph, where $V$ is the set of nodes and $E$ is the set of edges. Let $n = |V|$ be the number of nodes, $m = |E|$ be the number of edges, $d_i$ be the degree of the $i$-th node, and $\mathbf{A}$ be the adjacency matrix of $G$. The goal of network embeddings is to develop a mapping $f: V \to \mathbb{R}^d$. For a node $v \in V$, we call the $d$-dimensional vector $f(v)$ its embedding vector (or node embedding).

²We use network and graph interchangeably.
Network embeddings can be viewed as performing dimension reduction on graphs: the input is an $n \times n$ feature matrix associated with the graph, on which we apply dimension reduction techniques to reduce its dimensionality to $d \ll n$. This leads to two questions:

What is an appropriate node similarity matrix to perform dimension reduction on?

What dimension reduction techniques should be used?
We now review the existing solutions to both questions.
2.2. Node Similarity Matrix
The most straightforward input matrices to consider are the adjacency matrix $\mathbf{A}$ and the transition matrix $\mathbf{P}$:

$$\mathbf{P} = \mathbf{D}^{-1} \mathbf{A},$$

where $\mathbf{D}$ is the degree matrix of $G$:

$$\mathbf{D}_{ij} = \begin{cases} d_i & \text{if } i = j, \\ 0 & \text{otherwise.} \end{cases}$$
However, directly applying dimension reduction techniques to $\mathbf{A}$ or $\mathbf{P}$ is problematic. Real-world graphs are usually extremely sparse, which means most of the entries in $\mathbf{A}$ are zero. However, the absence of an edge between two nodes $u$ and $v$ does not imply that there is no association between them. In particular, if two nodes are not adjacent but are connected by a large number of paths, it is still likely that there is a strong association between them.
The intuition above motivates us to exploit higher-order relationships in the graph. A natural high-order node similarity matrix is the $k$-step transition matrix:

(1) $$\mathbf{P}^k = (\mathbf{D}^{-1} \mathbf{A})^k.$$
The $(i, j)$-th entry of $\mathbf{P}^k$ denotes the probability of reaching node $j$ from node $i$ in exactly $k$ steps of a random walk. We will show that many existing methods adopt variations of this definition of the similarity matrix.
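As a concrete illustration, the transition matrix and its powers can be computed with a few lines of sparse linear algebra. This is a minimal sketch (the helper name is ours, not from the FastRP codebase), assuming an undirected graph with no isolated nodes:

```python
import numpy as np
import scipy.sparse as sp

def transition_matrix(adj):
    """Row-normalize the adjacency matrix: P = D^{-1} A."""
    degrees = np.asarray(adj.sum(axis=1)).ravel()
    return sp.diags(1.0 / degrees) @ adj

# A 4-node path graph 0-1-2-3; every row of P^k sums to 1.
A = sp.lil_matrix((4, 4))
for u, v in [(0, 1), (1, 2), (2, 3)]:
    A[u, v] = A[v, u] = 1.0
P = transition_matrix(A.tocsr())
P2 = (P @ P).toarray()  # entry (i, j): probability of reaching j from i in exactly 2 steps
```

For the path graph, a walker starting at node 0 must step to node 1 and then returns to node 0 with probability 0.5, so `P2[0, 0] == 0.5`.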
2.3. Dimension Reduction Techniques
Once an $n \times n$ similarity matrix is constructed, network embedding methods perform dimension reduction on it to obtain node representations. In this section, we introduce two commonly used dimension reduction techniques: singular value decomposition (SVD) and Skip-gram.
SVD. SVD (Halko et al., 2011) is a classical matrix factorization method for dimension reduction. SVD factorizes a feature matrix $\mathbf{M}$ into the product of three matrices: $\mathbf{M} = \mathbf{U} \mathbf{\Sigma} \mathbf{V}^\top$, where $\mathbf{U}$ and $\mathbf{V}$ are orthonormal and $\mathbf{\Sigma}$ is a diagonal matrix consisting of the singular values. To perform dimension reduction with SVD, it is common to take the top $d$ singular values from $\mathbf{\Sigma}$ and the corresponding columns from $\mathbf{U}$ and $\mathbf{V}$.
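As a small illustration of SVD-based dimension reduction (a generic sketch, not the paper's code): for an input of rank at most $d$, the rank-$d$ embedding preserves all pairwise inner products exactly.

```python
import numpy as np

def svd_embed(M, d):
    """Rank-d embedding: columns of U scaled by the top-d singular values."""
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :d] * S[:d]

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 4)) @ rng.standard_normal((4, 8))  # a rank-4 matrix
E = svd_embed(M, d=4)
# For this rank-4 input, E @ E.T equals M @ M.T up to numerical error.
```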
Skip-gram. Skip-gram (Mikolov et al., 2013) is a method for learning word embeddings, which has also been shown to be performant in the context of network embeddings. Skip-gram works by sampling word pairs (or node pairs) from a word co-occurrence matrix (or node similarity matrix) and modeling the probability that a given word-context pair is observed in the corpus or not. Levy and Goldberg (Levy and Goldberg, 2014) showed that Skip-gram implicitly factorizes a shifted pointwise mutual information (PMI) matrix of word co-occurrences. Formally, the matrix Skip-gram seeks to factorize has elements:

(2) $$\mathbf{M}_{ij} = \log \frac{\#(w_i, c_j) \cdot |\mathcal{D}|}{\#(w_i) \cdot \#(c_j)} - \log b,$$

where $\#(\cdot)$ denotes the occurrence count of a word, context, or word-context pair in the corpus $\mathcal{D}$, and $b$ denotes the number of negative samples in Skip-gram.
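For intuition, the shifted PMI transform of Eq. 2 can be computed directly from a small dense co-occurrence count matrix. This is an illustrative sketch (undefined entries from zero counts are simply left at zero here):

```python
import numpy as np

def shifted_pmi(C, b=5):
    """Shifted PMI (Eq. 2): log( #(i,j) * |D| / (#(i) * #(j)) ) - log b.

    C is a co-occurrence count matrix; cells with zero counts stay 0.
    """
    total = C.sum()                       # |D|
    row = C.sum(axis=1, keepdims=True)    # #(w_i)
    col = C.sum(axis=0, keepdims=True)    # #(c_j)
    with np.errstate(divide="ignore"):
        M = np.log(C * total / (row * col)) - np.log(b)
    M[~np.isfinite(M)] = 0.0
    return M

C = np.array([[2.0, 1.0], [1.0, 2.0]])
M = shifted_pmi(C, b=1)  # with b = 1 this reduces to plain PMI
```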
2.4. Representative Network Embedding Methods
In this section, we discuss how three representative network embedding methods, DeepWalk (Perozzi et al., 2014), LINE (Tang et al., 2015b) and GraRep (Cao et al., 2015), fit into the two-component procedure described above. The analysis of node2vec (Grover and Leskovec, 2016) is similar to that of DeepWalk, so we omit it here.
DeepWalk (Perozzi et al., 2014). DeepWalk's core idea is to sample node pairs from a weighted combination of $\mathbf{P}^1, \mathbf{P}^2, \dots, \mathbf{P}^T$ (where $T$ is the window size), and then train a Skip-gram model on these samples. Making use of Eq. 2, it can be shown that DeepWalk is implicitly factorizing the following matrix (Qiu et al., 2018):

(3) $$\log \left( \frac{\mathrm{vol}(G)}{T} \left( \sum_{r=1}^{T} \mathbf{P}^r \right) \mathbf{D}^{-1} \right) - \log b,$$

where $\mathrm{vol}(G) = \sum_i d_i = 2m$.
LINE (Tang et al., 2015b). LINE can be seen as a variation of DeepWalk that only considers node pairs that are at most two hops away. Using a similar derivation, it can be shown that LINE implicitly factorizes:

(4) $$\log \left( \mathrm{vol}(G)\, \mathbf{P} \mathbf{D}^{-1} \right) - \log b.$$
GraRep (Cao et al., 2015). GraRep can be regarded as the matrix factorization version of DeepWalk. Instead of sampling from $\mathbf{P}^k$, it directly computes these matrices and then factorizes the corresponding shifted PMI matrix for each $k = 1, \dots, T$.
2.5. Scalability of Representative Methods
Putting existing methods into this two-component framework reveals their intrinsic scalability issues, as follows.
Node Similarity Matrix Construction. Many previous studies have demonstrated the importance of preserving high-order proximity between nodes (Perozzi et al., 2014; Cao et al., 2015; Qiu et al., 2018; Zhang et al., 2018b, a), which is typically done by raising $\mathbf{P}$ to the $k$-th power and optionally normalizing it afterward (see Eq. 3 for an example). This causes scalability issues, since both computing $\mathbf{P}^k$ and applying a transformation to each of its elements take time at least quadratic in $n$. For methods such as DeepWalk and node2vec, this problem is slightly mitigated by sampling node pairs from $\mathbf{P}^k$ instead. Still, a huge number of samples is required to obtain an accurate enough estimation of $\mathbf{P}^k$.
Dimension Reduction. The dimension reduction techniques adopted by these methods also limit their scalability: neither Skip-gram nor SVD is among the fastest dimension reduction algorithms (Vempala, 2005).
In the next section, we present our solutions to both problems that allow for better scalability.
3. Method
In this section, we introduce FastRP. We first describe the usage of very sparse random projection for dimension reduction and its merit in preserving highorder proximity. Then, we present our design of the node similarity matrix. This matrix is carefully designed so that: 1) it preserves transitive relationships in the input graph; 2) its entries are properly normalized; 3) it can be formulated as matrix chain multiplication, so that applying random projection on this matrix only costs linear time. Lastly, we discuss several additional advantages of FastRP.
3.1. Very Sparse Random Projection
Random projection is a dimension reduction method that preserves pairwise distances between data points with strong theoretical guarantees (Vempala, 2005). The idea behind it is very simple: to reduce an $n \times m$ feature matrix $\mathbf{M}$ (for graph data, we have $m = n$) to an $n \times d$ matrix where $d \ll m$, we can simply multiply the feature matrix by an $m \times d$ random projection matrix $\mathbf{R}$:

(5) $$\mathbf{N} = \mathbf{M} \cdot \mathbf{R}.$$

As long as the entries of $\mathbf{R}$ are i.i.d. with zero mean, $\mathbf{N}$ is able to preserve the pairwise distances in $\mathbf{M}$ (Arriaga and Vempala, 1999).
The difference among random projection algorithms lies mostly in the construction of $\mathbf{R}$. The most studied one is Gaussian random projection, where the entries of $\mathbf{R}$ are sampled i.i.d. from a Gaussian distribution $\mathcal{N}(0, 1)$. Since $\mathbf{R}$ is a dense matrix, the time complexity of Gaussian random projection is $O(nmd)$.
As an improvement to Gaussian random projection, Achlioptas (Achlioptas, 2003) proposed sparse random projection, where the entries of $\mathbf{R}$ are sampled i.i.d. from

(6) $$\mathbf{R}_{ij} = \sqrt{s} \cdot \begin{cases} +1 & \text{with probability } \frac{1}{2s}, \\ 0 & \text{with probability } 1 - \frac{1}{s}, \\ -1 & \text{with probability } \frac{1}{2s}, \end{cases}$$

where $s = 3$ is used. This leads to a 3x speedup, since two-thirds of the entries of $\mathbf{R}$ are zero. Additionally, this configuration of $\mathbf{R}$ requires hardly any floating-point computation, since the multiplication by $\sqrt{s}$ can be delayed, providing additional speedup.
Li et al. (Li et al., 2006) extended Achlioptas (Achlioptas, 2003) by showing that $s \gg 3$ can be used to further speed up the computation. They recommend setting $s = \sqrt{m}$, which achieves a $\sqrt{m}$-fold speedup over Gaussian random projection while ensuring the quality of the embeddings. In this work, we adopt this very sparse random projection method for dimension reduction of the node similarity matrix.
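A sketch of sampling the very sparse projection matrix with $s = \sqrt{m}$ (the helper name is ours, not from the reference implementation):

```python
import numpy as np

def very_sparse_R(m, d, s=None, seed=0):
    """Li et al.-style projection matrix: entries are sqrt(s) * {+1, 0, -1}
    with probabilities 1/(2s), 1 - 1/s, 1/(2s); default s = sqrt(m)."""
    s = np.sqrt(m) if s is None else s
    rng = np.random.default_rng(seed)
    return rng.choice([np.sqrt(s), 0.0, -np.sqrt(s)], size=(m, d),
                      p=[1 / (2 * s), 1 - 1 / s, 1 / (2 * s)])

R = very_sparse_R(m=10_000, d=64)
# Entries have zero mean and unit variance; with s = 100, only ~1% are nonzero.
```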
As an optimization-free dimension reduction method, very sparse random projection wins over SVD and Skip-gram due to its superior computational efficiency. The fact that it only requires matrix multiplication also enables faster computation on accelerators such as GPUs, as well as easy parallelization.
Apart from these advantages, the random projection approach also benefits from the associative property of matrix multiplication. To see why this is important, consider the basic form of the high-order similarity matrix $\mathbf{P}^k$, as defined in Eq. 1. To compute its random projection $\mathbf{N}_k = \mathbf{P}^k \mathbf{R}$, there is no need to calculate $\mathbf{P}^k$ from scratch, since the computation can be done iteratively:

(7) $$\mathbf{N}_k = \mathbf{P}^k \mathbf{R} = \mathbf{P} \left( \mathbf{P} \cdots \left( \mathbf{P} \mathbf{R} \right) \right).$$

Each of the $k$ multiplications involves a sparse matrix with $O(m)$ nonzeros and an $n \times d$ dense matrix, which reduces the time complexity from $O(n^2 d)$ to $O(kmd)$.
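Eq. 7 translates directly into a loop of sparse-times-dense products. A minimal sketch:

```python
import numpy as np
import scipy.sparse as sp

def project_power(P, R, k):
    """Compute P^k @ R iteratively (Eq. 7) without forming the dense P^k.

    Each of the k steps costs O(m * d) for a graph with m edges."""
    N = R
    for _ in range(k):
        N = P @ N
    return N

# Transition matrix of a triangle graph; compare against the explicit P^3 @ R.
P = sp.csr_matrix(np.array([[0.0, 0.5, 0.5],
                            [0.5, 0.0, 0.5],
                            [0.5, 0.5, 0.0]]))
R = np.arange(6, dtype=float).reshape(3, 2)
N = project_power(P, R, k=3)
```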
3.2. Similarity Matrix Construction
The next step is to construct a proper node similarity matrix that leverages the associative property of matrix multiplication. We make two key observations about the similarity matrices used by existing methods. First, it is important to preserve high-order proximity in the input graph, which is typically done by raising $\mathbf{P}$ to the $k$-th power. Second, element-wise normalization (taking the logarithm for DeepWalk, LINE and GraRep) is performed on the raw similarity matrix before dimension reduction.
Most previous matrix-based network embedding methods emphasize the importance of high-order proximity but skip the element-wise normalization step, either for better scalability or for ease of analysis (Qiu et al., 2018, 2019; Zhang et al., 2018a). Is normalization of the node similarity matrix important? If so, is there a normalization method that allows for scalable computation? We answer these questions by analyzing the properties of $\mathbf{P}^k$ from a spectral graph theory perspective.
To begin with, we consider a transformation of $\mathbf{P}$ defined as $\tilde{\mathbf{P}} = \mathbf{D}^{1/2} \mathbf{P} \mathbf{D}^{-1/2} = \mathbf{D}^{-1/2} \mathbf{A} \mathbf{D}^{-1/2}$. Since $\tilde{\mathbf{P}}$ is a real symmetric matrix, it can be decomposed as $\tilde{\mathbf{P}} = \mathbf{U} \mathbf{\Lambda} \mathbf{U}^\top$, where $\mathbf{\Lambda} = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$ is a diagonal matrix of the eigenvalues of $\tilde{\mathbf{P}}$ (sorted so that $\lambda_1 \ge \dots \ge \lambda_n$), and $\mathbf{U}$ is an orthogonal matrix consisting of the corresponding eigenvectors $\mathbf{u}_1, \dots, \mathbf{u}_n$.

It is easy to verify that $(1, \mathbf{u}_1)$ with $\mathbf{u}_1 = (\sqrt{d_1/2m}, \dots, \sqrt{d_n/2m})^\top$ is an eigenpair of $\tilde{\mathbf{P}}$. Following the Frobenius-Perron Theorem (Frobenius et al., 1912), we have:

$$\lambda_1 = 1$$

and

$$|\lambda_i| \le 1 \quad \text{for } i = 2, \dots, n.$$

Now:

$$\mathbf{P}^k = \mathbf{D}^{-1/2} \tilde{\mathbf{P}}^k \mathbf{D}^{1/2} = \mathbf{D}^{-1/2} \mathbf{U} \mathbf{\Lambda}^k \mathbf{U}^\top \mathbf{D}^{1/2} = \sum_{l=1}^{n} \lambda_l^k\, \mathbf{D}^{-1/2} \mathbf{u}_l \mathbf{u}_l^\top \mathbf{D}^{1/2},$$

where $u_{li}$ denotes the $i$-th entry of $\mathbf{u}_l$. For a particular entry $(i, j)$ we have:

(8) $$\mathbf{P}^k_{ij} = \sum_{l=1}^{n} \lambda_l^k\, u_{li} u_{lj} \sqrt{\frac{d_j}{d_i}} = \frac{d_j}{2m} + \sum_{l=2}^{n} \lambda_l^k\, u_{li} u_{lj} \sqrt{\frac{d_j}{d_i}}.$$

This derivation illustrates the importance of normalization. Since $|\lambda_l| < 1$ holds for $l \ge 2$ (assuming $G$ is connected and non-bipartite), we have $\mathbf{P}^k_{ij} \to d_j / 2m$ when $k \to \infty$. Since many real-world graphs are scale-free (Barabási and Bonabeau, 2003), it follows that the entries of $\mathbf{P}^k$ also have a heavy-tailed distribution.
The heavy-tailed distribution of the data causes problems for dimension reduction methods (Li et al., 2006): the pairwise distances between data points become dominated by the columns with exceptionally large values, rendering them less meaningful. In practice, term weighting schemes are applied to heavy-tailed data to reduce its kurtosis and skewness (Salton and Buckley, 1988). Here, we consider a scaled version of the Tukey power transformation (Tukey et al., 1957): each column of the similarity matrix is scaled by its typical magnitude raised to a power $\beta$, where $\beta$ controls the strength of normalization. Now the only problem is that the exact feature values in $\mathbf{P}^k$ are not known, and we do not want to calculate these values, for better scalability. But again, we can rely on the fact that $\mathbf{P}^k_{ij}$ converges to $d_j / 2m$. The normalization we consider is therefore to scale column $j$ by:

(9) $$\left( \frac{d_j}{2m} \right)^{\beta}.$$
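Since $\mathbf{P}^k_{ij}$ converges to $d_j/2m$, the normalization of Eq. 9 can be precomputed from the degree sequence alone. A sketch (the helper name is ours):

```python
import numpy as np
import scipy.sparse as sp

def degree_normalizer(adj, beta):
    """Diagonal matrix L with L_jj = (d_j / 2m)^beta (Eq. 9).

    beta < 0 downweights the columns of high-degree nodes; beta = 0 is a no-op."""
    degrees = np.asarray(adj.sum(axis=1)).ravel()
    return sp.diags((degrees / degrees.sum()) ** beta)  # degrees.sum() == 2m

# Star graph: the center (degree 3) is downweighted relative to the leaves.
A = sp.csr_matrix(np.array([[0, 1, 1, 1],
                            [1, 0, 0, 0],
                            [1, 0, 0, 0],
                            [1, 0, 0, 0]], dtype=float))
L = degree_normalizer(A, beta=-1.0)
```

With $\beta = -1$ and $2m = 6$, the center gets weight $(3/6)^{-1} = 2$ while each leaf gets $(1/6)^{-1} = 6$.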
3.3. Our Algorithm: FastRP
Let $\mathbf{L}$ be the diagonal matrix with $\mathbf{L}_{jj} = \left( \frac{d_j}{2m} \right)^{\beta}$; the normalization scheme in Eq. 9 can then be represented in matrix form as $\mathbf{P}^k \mathbf{L}$. This allows for matrix chain multiplication when performing random projection:

$$\mathbf{N}_k = \mathbf{P}^k \mathbf{L} \mathbf{R} = \mathbf{P} \left( \mathbf{P} \cdots \left( \mathbf{P} \left( \mathbf{L} \mathbf{R} \right) \right) \right).$$

We further consider a weighted combination of different powers of $\mathbf{P}$, so that the embedding matrix $\mathbf{N}$ is computed as follows:

$$\mathbf{N} = \alpha_1 \mathbf{N}_1 + \alpha_2 \mathbf{N}_2 + \dots + \alpha_k \mathbf{N}_k,$$

where $\alpha_1, \dots, \alpha_k$ are the weights. The outline of FastRP is presented in Algorithm 1.
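Putting the pieces together, the whole pipeline fits in a few lines. This is an illustrative sketch of Algorithm 1, not the released implementation; the hyperparameter values are placeholders, and the graph is assumed to have no isolated nodes:

```python
import numpy as np
import scipy.sparse as sp

def fastrp(adj, d=128, beta=-0.9, weights=(0.0, 0.0, 1.0, 1.0), seed=0):
    """Sketch of FastRP: normalize, project with a very sparse R, iterate.

    weights[i] is the coefficient alpha_{i+1} of N_{i+1} = P^{i+1} (L R)."""
    n = adj.shape[0]
    degrees = np.asarray(adj.sum(axis=1)).ravel()
    P = sp.diags(1.0 / degrees) @ adj                  # transition matrix D^{-1} A
    L = sp.diags((degrees / degrees.sum()) ** beta)    # normalization (Eq. 9)

    s = np.sqrt(n)                                     # very sparse projection (Li et al.)
    rng = np.random.default_rng(seed)
    R = rng.choice([np.sqrt(s), 0.0, -np.sqrt(s)], size=(n, d),
                   p=[1 / (2 * s), 1 - 1 / s, 1 / (2 * s)])

    N = L @ R                                          # start from the normalized projection
    emb = np.zeros((n, d))
    for alpha in weights:                              # N <- P N yields N_1, N_2, ...
        N = P @ N
        emb += alpha * N
    return emb

# Ring graph on 10 nodes.
rows = np.arange(10)
cols = (rows + 1) % 10
A = sp.csr_matrix((np.ones(20), (np.r_[rows, cols], np.r_[cols, rows])), shape=(10, 10))
emb = fastrp(A, d=8)
```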
3.4. Time Complexity
The time complexity of FastRP is $O(nd)$ for constructing the sparse random projection matrix (line 1), $O(md)$ for the random projection (lines 2 to 5) for each power of $\mathbf{P}$, and $O(knd)$ for merging the embedding matrices (line 6). Overall, the time complexity of FastRP is $O(k(m + n)d)$, which is linear in the number of nodes and edges of $G$.
3.5. Implementation, Parallelization and Hyperparameter Tuning
Implementation. The implementation of FastRP is simple and straightforward, requiring less than 100 lines of Python code.
Parallelization. Besides its ease of implementation, our algorithm is also easy to parallelize, since the only operation involved is matrix multiplication. One easy way to speed up matrix multiplication is to perform block partitioning on the input matrices $\mathbf{M}$ and $\mathbf{R}$:

(10) $$\mathbf{M} = \begin{pmatrix} \mathbf{M}_{11} & \mathbf{M}_{12} \\ \mathbf{M}_{21} & \mathbf{M}_{22} \end{pmatrix}, \qquad \mathbf{R} = \begin{pmatrix} \mathbf{R}_{1} \\ \mathbf{R}_{2} \end{pmatrix}.$$

Then it is easy to see that:

(11) $$\mathbf{M} \mathbf{R} = \begin{pmatrix} \mathbf{M}_{11} \mathbf{R}_{1} + \mathbf{M}_{12} \mathbf{R}_{2} \\ \mathbf{M}_{21} \mathbf{R}_{1} + \mathbf{M}_{22} \mathbf{R}_{2} \end{pmatrix}.$$
The recursive matrix multiplications and summations can be performed in parallel. The smaller block matrices can be further partitioned recursively for execution on more processors (Blumofe et al., 1996).
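A serial sketch of the blocking in Eq. 10-11; the four sub-products are independent and could be dispatched to separate workers:

```python
import numpy as np

def blocked_matmul(M, R, split):
    """Compute M @ R via the block partition of Eq. 10-11."""
    M11, M12 = M[:split, :split], M[:split, split:]
    M21, M22 = M[split:, :split], M[split:, split:]
    R1, R2 = R[:split], R[split:]
    # The four products below do not depend on one another.
    top = M11 @ R1 + M12 @ R2
    bottom = M21 @ R1 + M22 @ R2
    return np.vstack([top, bottom])

rng = np.random.default_rng(1)
M, R = rng.standard_normal((6, 6)), rng.standard_normal((6, 3))
```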
Efficient Hyperparameter Tuning. FastRP also allows for highly efficient hyperparameter tuning. The idea is to first precompute the embeddings derived from the different orders of proximity matrices. Since the final embedding matrix is a weighted combination of $\mathbf{N}_1, \dots, \mathbf{N}_k$, we only need to perform a weighted summation during hyperparameter optimization. Furthermore, according to the Johnson-Lindenstrauss lemma (Dasgupta and Gupta, 1999), the embedding dimensionality $d$ determines the approximation error of random projection. The implication is that we can efficiently tune hyperparameters using smaller values of $d$. This is in contrast to most existing algorithms, which require retraining the entire model for each hyperparameter configuration.
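The per-trial cost of this tuning scheme is just a weighted sum of precomputed matrices, e.g.:

```python
import numpy as np

def combine(precomputed, alphas):
    """Final embedding as a weighted sum of precomputed N_1, ..., N_k.

    Each tuning trial only re-runs this O(k * n * d) sum, not the projection."""
    out = np.zeros_like(precomputed[0])
    for alpha, N in zip(alphas, precomputed):
        out += alpha * N
    return out

Ns = [np.full((4, 2), 1.0), np.full((4, 2), 2.0)]  # toy precomputed N_1, N_2
emb = combine(Ns, alphas=(0.5, 1.0))
```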
Besides these merits, the biggest advantage of FastRP is its superior computational efficiency. In the next section, we show that FastRP is orders of magnitude faster than state-of-the-art methods while achieving comparable or even better performance.
4. Experiments
In this section, we conduct experiments to evaluate the performance of FastRP. We first provide an overview of the datasets. Then, we compare FastRP with a number of baseline methods both in terms of running time and performance on downstream tasks. We further discuss the performance of our method with regard to several important hyperparameters and its scalability.
4.1. Datasets
Table 1. Overview of the datasets.

| Name | # Vertices | # Edges | # Classes | Task |
|---|---|---|---|---|
| WWW200K | 200,000 | 32,822,166 | – | K-Nearest Neighbors |
| WWW10K | 10,000 | 3,904,610 | 50 | Node Classification |
| Blogcatalog | 10,312 | 333,983 | 39 | Node Classification |
| Flickr | 80,513 | 5,899,882 | 195 | Node Classification |
Table 1 gives an overview of the datasets used in experiments.

WWW200K and WWW10K (Crawl, [n. d.]): these graphs are derived from the Web graph provided by Common Crawl, where the nodes are hostnames and the edges are hyperlinks between the websites. For simplicity, we treat this graph as undirected. The original graph has 385 million nodes and 2.5 billion edges, which is too large to be loaded into the memory of our machine. Thus, we construct subgraphs by taking the top 200,000 and top 10,000 websites respectively, as ranked by Harmonic Centrality (Boldi and Vigna, 2014). We also use the WWW10K graph for node classification, where the label of a node is its top-level domain, such as .org, .edu or .es.

Blogcatalog (Tang and Liu, 2009): this is a network between bloggers on the Blogcatalog website. The labels are the categories a blogger publishes in.

Flickr (Tang and Liu, 2009): this is a network between the users on the photo sharing website Flickr. The labels represent the interest groups a user joins.
4.2. Baseline Methods
We compare FastRP against the following baseline methods:

DeepWalk (Perozzi et al., 2014) – DeepWalk is a network embedding method that samples short random walks from the input graph. These random walks are then fed into the Skip-gram model to produce node embeddings.

node2vec (Grover and Leskovec, 2016) – node2vec extends DeepWalk by performing biased random walks that balance between DFS and BFS. It also adopts Skip-gram with negative sampling (SGNS) as the dimension reduction method.

LINE (Tang et al., 2015b) – LINE samples node pairs that are either adjacent or two hops away. LINE adopts Skipgram with negative sampling (SGNS) to learn network embeddings from the samples.

RandNE (Zhang et al., 2018a) – RandNE constructs a node similarity matrix that preserves high-order proximity by raising the adjacency (or transition) matrix to the $k$-th power. Then, node embeddings are obtained by applying Gaussian random projection to this matrix.
We realize that there are many other recently proposed network embedding methods. We do not include these methods as baselines since their performance is generally inferior to DeepWalk according to a recent comparative study (Khosla et al., 2019). Moreover, many of them are not scalable (Khosla et al., 2019).
4.3. Parameter Settings
Here we present the parameter settings for the baseline models and our model.
FastRP. For FastRP, we set the embedding dimensionality $d$ to 512 and the maximum power $k$ to 4. For the weights $\alpha_1, \dots, \alpha_4$, we observe that a weighted combination emphasizing the higher-order embeddings is already enough to achieve competitive results. Thus, we fix all but the last of the weights and tune only $\alpha_4$.
Overall, we only have two hyperparameters to tune: the normalization strength $\beta$ and the weight $\alpha_4$. We use optuna³, a Bayesian hyperparameter optimization framework, to tune them. The hyperparameter optimization is performed for 20 rounds on a small validation set of labeled data, with bounded search ranges for $\beta$ and $\alpha_4$. We also use a lower embedding dimensionality during tuning to speed up the process.

³https://github.com/pfnet/optuna
RandNE. For RandNE, we set the embedding dimensionality to 512 and the maximum order to 3. We note that for RandNE, incorporating embeddings from higher orders does not improve the quality of embeddings in our experiments. To ensure a fair comparison, we also conduct a hyperparameter search for the weights in RandNE using the same procedure as for FastRP. The only difference is that instead of tuning $\beta$ and $\alpha_4$, we optimize the weights of each order's embedding matrix.
DeepWalk. For DeepWalk, we need to set the following parameters: the number of random walks per node $\gamma$, the walk length $t$, the window size $w$ for the Skip-gram model, and the representation size $d$. We adopt the hyperparameter settings recommended in the original paper: $\gamma = 80$, $t = 40$, $w = 10$, $d = 128$.
node2vec. Since node2vec is built upon DeepWalk, we use the same parameter settings as for DeepWalk: $\gamma = 80$, $t = 40$, $w = 10$, $d = 128$. We notice that these settings lead to better results than the defaults described in the node2vec paper, possibly because the total number of samples is larger. For the in-out parameter $q$ and the return parameter $p$, we conduct a grid search over $\{0.25, 0.50, 1, 2, 4\}$, as suggested in the paper.
LINE. We use LINE with both the first order and second order proximity with the recommended hyperparameters. Concretely, we set the dimensionality of embeddings to 200, the number of node pair samples to 10 billion and the number of negative samples to 5.
All the experiments are conducted on a single machine with 128 GB memory and 40 CPU cores at 2.2 GHz. We note that FastRP and all the baseline methods support multithreading. However, for a fair running time comparison, we run all methods with a single thread and measure the CPU time (process time) consumed by each method.
4.4. Runtime Comparison
We first showcase the superior efficiency of our method by reporting the CPU time of FastRP and the baseline methods on all datasets in Table 2. FastRP achieves at least a 4,000x speedup over the state-of-the-art method DeepWalk. For example, it takes FastRP less than 3 minutes to embed the WWW200K graph, whereas DeepWalk takes almost a week to finish. Node2vec is even slower; although LINE is several times faster than DeepWalk and node2vec, it is still a few hundred times slower than FastRP. The only method with comparable running time is RandNE, which uses Gaussian random projection for dimension reduction, but it is still slightly slower than FastRP. Moreover, in the experiments below, we will show that the quality of embeddings produced by FastRP is significantly better than that of RandNE.
Table 2. CPU time of FastRP and the baseline methods, and the speedup of FastRP over DeepWalk.

| Dataset | FastRP | RandNE | LINE | DeepWalk | node2vec | Speedup over DeepWalk |
|---|---|---|---|---|---|---|
| WWW200K | 136.0 seconds | 169.8 seconds | 4.6 hours | 6.9 days | 63.8 days | 4383x |
| WWW10K | 7.8 seconds | 13.6 seconds | 3.2 hours | 9.2 hours | 59.8 hours | 4246x |
| Blogcatalog | 6.0 seconds | 10.5 seconds | 3.0 hours | 8.7 hours | 41.2 hours | 5220x |
| Flickr | 33.1 seconds | 45.1 seconds | 4.2 hours | 3.1 days | 28.5 days | 8091x |
Table 3. Top-5 nearest neighbors of four representative websites, based on the embeddings produced by FastRP, DeepWalk and RandNE.

nytimes.com:

| FastRP | DeepWalk | RandNE |
|---|---|---|
| huffingtonpost.com | washingtonpost.com | huffingtonpost.com |
| washingtonpost.com | huffingtonpost.com | washingtonpost.com |
| cnn.com | cnn.com | forbes.com |
| npr.org | cbsnews.com | cnn.com |
| latimes.com | time.com | npr.org |

delta.com:

| FastRP | DeepWalk | RandNE |
|---|---|---|
| aa.com | aa.com | aa.com |
| united.com | united.com | southwest.com |
| usairways.com | usairways.com | united.com |
| alaskaair.com | southwest.com | expedia.com |
| jetblue.com | jetblue.com | priceline.com |

vldb.org:

| FastRP | DeepWalk | RandNE |
|---|---|---|
| sigmod.org | sigmod.org | comp.nus.edu.sg |
| comp.nus.edu.sg | morganclaypool.com | cs.sfu.ca |
| sigops.org | kdd.org | cs.rpi.edu |
| cidrdb.org | doi.acm.org | nlp.stanford.edu |
| cse.iitb.ac.in | informatik.uni-trier.de | theory.stanford.edu |

arsenal.com:

| FastRP | DeepWalk | RandNE |
|---|---|---|
| chelseafc.com | chelseafc.com | liverpoolfc.com |
| mcfc.co.uk | tottenhamhotspur.com | manutd.com |
| nufc.co.uk | manutd.com | chelseafc.com |
| avfc.co.uk | mcfc.co.uk | skysports.com |
| tottenhamhotspur.com | thefa.com | tottenhamhotspur.com |
4.5. Qualitative Case Study: WWW200K Network
We first conduct a case study on the WWW200K network to compare the embeddings produced by different network embedding algorithms qualitatively. For this part, we take RandNE and DeepWalk as the baselines since the results produced by LINE and node2vec are very similar to those of DeepWalk on this network.
Examining the $k$-nearest neighbors of a word is a common way to measure the quality of word embeddings (Mikolov et al., 2013). In the same spirit, we examine the nearest neighbors of several representative websites in the node embedding space, using cosine similarity as the similarity metric. Table 3 lists the top 5 nearest neighbors of four representative websites: nytimes.com, delta.com, vldb.org and arsenal.com, based on the node embeddings produced by FastRP, DeepWalk and RandNE.
nytimes.com is the website of The New York Times, which is one of the most influential news websites in the US. We find that all three methods produce highquality nearest neighbors for nytimes.com: huffingtonpost.com, washingtonpost.com and cnn.com are also wellknown American news sites. It is also interesting to see that FastRP lists latimes.com (The Los Angeles Times) among the top 5 nearest neighbors of nytimes.com.
delta.com is the homepage of Delta Air Lines, a major American airline. Again, we find that the most similar websites discovered by FastRP and DeepWalk are the official websites of other major American airlines: American Airlines (aa.com and usairways.com), United Airlines (united.com), Alaska Airlines (alaskaair.com), etc. The nearest-neighbor list provided by RandNE is worse, since it includes general-purpose travel websites such as expedia.com and priceline.com.
vldb.org is the official website of the VLDB Endowment, which steers the VLDB conference, a leading database research conference. Both FastRP and DeepWalk list sigmod.org as the most similar website to vldb.org; this is a positive sign, since SIGMOD is another top database research conference. On the other hand, all three methods include several universities' CS department websites in the nearest-neighbor lists, such as comp.nus.edu.sg and cs.sfu.ca. In particular, all top five websites provided by RandNE are CS department websites. In our opinion, this is reasonable but less satisfactory than having other CS research conferences' websites in the list, such as sigops.org (the ACM Special Interest Group in Operating Systems) and cidrdb.org (the Conference on Innovative Data Systems Research).
arsenal.com represents a football club that plays in the Premier League. It can be seen that all three methods list the other football clubs in the Premier League as the nearest neighbors of arsenal.com, such as Chelsea (chelseafc.com) and Manchester City (mcfc.co.uk). The only exception is RandNE, which also lists skysports.com (the dominant subscription TV sports brand in the United Kingdom) as one of the nearest neighbors. skysports.com has higher popularity but is less relevant to arsenal.com. In contrast, FastRP avoids this problem by properly downweighting the influence of popular nodes. Overall, we find that the quality of nearest neighbors produced by FastRP is comparable to DeepWalk and significantly better than that of RandNE.
The experiments above serve as a qualitative evaluation of FastRP. In the following sections, we will conduct quantitative evaluations of FastRP on different downstream tasks across multiple datasets.
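Nearest-neighbor queries like those above can be run directly on any learned embedding matrix. The sketch below is a generic illustration (not code from the FastRP repository; the function and variable names are our own) that ranks nodes by cosine similarity:

```python
import numpy as np

def nearest_neighbors(embeddings, names, query, k=5):
    """Return the k nodes whose embeddings have the highest cosine
    similarity to the query node's embedding (the query is excluded)."""
    idx = names.index(query)
    # L2-normalize rows so that dot products equal cosine similarities.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    sims = unit @ unit[idx]
    sims[idx] = -np.inf            # exclude the query itself
    top = np.argsort(-sims)[:k]
    return [names[i] for i in top]

# Toy example with 2-d embeddings: "a" is closest to "b", then "c".
names = ["a", "b", "c", "d"]
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.5, 0.5], [-1.0, 0.0]])
print(nearest_neighbors(emb, names, "a", k=2))  # ['b', 'c']
```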
Table 4. Macro-F1 scores (%) for multi-label node classification.

Algorithm | WWW-10K | BlogCatalog | Flickr
--------- | ------- | ----------- | ------
LINE      |  6.66   | 19.63       | 10.69
node2vec  | 27.42   | 21.44       | 11.89
DeepWalk  | 25.54   | 21.30       | 14.00
RandNE    | 15.68   | 20.88       | 13.64
FastRP    | 26.92   | 23.43       | 15.02
4.6. Multilabel Node Classification
For the task of node classification, we evaluate our method using the same experimental setup as DeepWalk (Perozzi et al., 2014). First, a portion of nodes along with their labels are randomly sampled from the graph as training data, and the goal is to predict the labels of the remaining nodes. A one-vs-rest logistic regression model with L2 regularization is then trained on the node embeddings; we use the logistic regression implementation in LibLinear (Fan et al., 2008). To ensure the reliability of our experiments, the above process is repeated 10 times and the average Macro-F1 score is reported. Other evaluation metrics, such as the Micro-F1 score and accuracy, follow the same trend as the Macro-F1 score and are thus not shown.
Table 4 reports the Macro-F1 scores achieved on WWW-10K, BlogCatalog and Flickr with 1%, 10% and 1% labeled nodes, respectively. FastRP achieves the state-of-the-art result on two of the three datasets and nearly matches it on the third. On BlogCatalog and Flickr, FastRP achieves gains of 9.3% and 7.3% over the best-performing baseline method, respectively. On WWW-10K, the absolute difference in Macro-F1 score between FastRP and node2vec is only 0.5%, but FastRP is over 10,000 times faster.
For a more detailed comparison between FastRP and the baseline methods, we vary the portion of labeled nodes used for classification and present the Macro-F1 scores in Figure 2. FastRP consistently outperforms or matches the other neural baseline methods while being at least three orders of magnitude faster. It also outperforms RandNE by a large margin at every labeling rate, demonstrating the effectiveness of our node similarity matrix design.
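The evaluation protocol above can be sketched in a few lines. For brevity this sketch uses scikit-learn (whose `liblinear` solver wraps LIBLINEAR); the random embeddings and labels are stand-ins for real node embeddings and ground-truth labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
# Stand-ins for learned node embeddings and node labels.
X = rng.normal(size=(200, 16))
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)

scores = []
for seed in range(10):                     # repeat the random split 10 times
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=0.1, random_state=seed, stratify=y)  # 10% labeled
    clf = OneVsRestClassifier(
        LogisticRegression(penalty="l2", solver="liblinear"))
    clf.fit(X_tr, y_tr)
    scores.append(f1_score(y_te, clf.predict(X_te), average="macro"))
print(f"average Macro-F1: {np.mean(scores):.3f}")
```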
4.7. Parameter Sensitivity
To examine how the hyperparameters affect the quality of the learned representations, we conduct a parameter sensitivity study on BlogCatalog. The parameters we investigate are the normalization strength, the weight of the first-order proximity term, and the embedding dimensionality d. We report the Macro-F1 score achieved with 10% labeled data in Figure 3.
Normalization strength. Figure 3(a) shows the effectiveness of our normalization scheme. At the best normalization strength, FastRP achieves its highest Macro-F1 score (over 20.6%). When the normalization strength is set to 0, no normalization is applied to the node similarity matrix and the score drops to 19.5%. We set d to 128 for this experiment.
Weight of the first-order term. Figure 3(b) shows that this weight also plays an important role; on this dataset, the best score is achieved at intermediate values of the weight.
Embedding Dimensionality. Figure 3(c) shows that increasing the embedding dimensionality in general yields better node embeddings. On the other hand, we notice that FastRP already outperforms all baseline methods with an embedding dimensionality of 256.
4.8. Scalability
In Section 3.4, we show that the time complexity of FastRP is linear in both the number of nodes n and the number of edges m. Here, we empirically verify this by learning embeddings on random graphs generated by the Erdős–Rényi model. In Figure 4(a), we fix the number of edges and vary the number of nodes; in Figure 4(b), we fix the number of nodes and vary the number of edges. In both figures, we report the CPU time FastRP takes to embed the graph. The empirical running time of FastRP indeed scales linearly with both n and m.
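Since the dominant cost in projection-based embedding is the sparse matrix–dense matrix product, a rough version of this timing experiment can be sketched as follows (an illustration only; the graph generator and function names are ours, not the paper's benchmark code):

```python
import time
import numpy as np
import scipy.sparse as sp

def random_er_graph(n, m, seed=0):
    """Sparse adjacency matrix of an Erdos-Renyi-style random graph
    with roughly m undirected edges (duplicate edges are merged)."""
    rng = np.random.default_rng(seed)
    rows = rng.integers(0, n, size=m)
    cols = rng.integers(0, n, size=m)
    a = sp.coo_matrix((np.ones(m), (rows, cols)), shape=(n, n)).tocsr()
    return ((a + a.T) > 0).astype(float)   # symmetrize, binarize

def time_embedding(a, d=64, k=3, seed=0):
    """Time k rounds of multiplying the adjacency matrix into a random
    projection matrix -- each round is an O(m * d) sparse-dense product."""
    rng = np.random.default_rng(seed)
    emb = rng.normal(size=(a.shape[0], d))
    start = time.perf_counter()
    for _ in range(k):
        emb = a @ emb
    return time.perf_counter() - start

for n in (1_000, 2_000, 4_000):   # runtime should grow roughly linearly
    t = time_embedding(random_er_graph(n, m=10 * n))
    print(f"n={n}: {t:.4f}s")
```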
5. Related Work
Network embeddings. Most of the early methods view network embeddings as a dimension reduction problem, where the goal is to preserve the local or global distances between data points in a lowdimensional manifold (Roweis and Saul, 2000; Tenenbaum et al., 2000; Belkin and Niyogi, 2002). With time complexity at least quadratic in the number of data points (or nodes), these methods do not scale to graphs with hundreds of thousands of nodes.
Inspired by the success of scalable neural methods for learning word embeddings, in particular the Skip-gram model (Mikolov et al., 2013), neural methods have been proposed for network embeddings (Perozzi et al., 2014; Tang et al., 2015b; Grover and Leskovec, 2016). These methods typically sample node pairs that are close to each other and then train a Skip-gram model on the pairs to obtain node embeddings; they differ mostly in the node-pair sampling strategy. DeepWalk (Perozzi et al., 2014) samples node pairs that are at most a fixed number of hops apart via random walks on the graph. Node2vec (Grover and Leskovec, 2016) introduces a biased random walk strategy that mixes DFS and BFS. LINE (Tang et al., 2015b) considers node pairs that are one or two hops away from each other. These methods not only produce high-quality node embeddings but also scale to networks with millions of nodes.
In Levy and Goldberg’s seminal work (Levy and Goldberg, 2014) on interpreting Skip-gram with negative sampling (SGNS), they prove that SGNS implicitly factorizes a shifted pointwise mutual information (PMI) matrix of word co-occurrences. Using a similar methodology, it has been shown that methods such as DeepWalk (Perozzi et al., 2014), LINE (Tang et al., 2015b), PTE (Tang et al., 2015a) and node2vec (Grover and Leskovec, 2016) all implicitly approximate and factorize a node similarity matrix, which is usually some transformation of the k-step transition matrices (Yang et al., 2015; Qiu et al., 2018). Following these analyses, matrix-factorization-based methods have also been proposed for network embeddings (Yang et al., 2015; Qiu et al., 2018; Cao et al., 2015). A representative method is GraRep (Cao et al., 2015), which can be seen as the matrix factorization version of DeepWalk: it uses SVD to factorize the shifted PMI matrix of the k-step transition matrices. However, GraRep is not scalable, due to the high cost of both raising the transition matrix to higher powers and taking the element-wise logarithm of the result, which is a dense matrix.
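A minimal sketch of the kind of similarity matrix GraRep factorizes, assuming a shifted element-wise logarithm of the k-step transition matrix (the exact GraRep construction differs in details such as column normalization), makes the scalability issue concrete: the matrix is dense even when the graph is sparse.

```python
import numpy as np

def shifted_log_similarity(adj, k=2, shift=0.25):
    """Dense node-similarity matrix in the spirit of GraRep: take the
    k-step transition matrix and apply an element-wise shifted log.
    This costs O(n^2) memory and O(n^3) time -- the bottleneck noted
    above."""
    deg = adj.sum(axis=1, keepdims=True)
    p = adj / np.clip(deg, 1.0, None)        # 1-step transition matrix
    pk = np.linalg.matrix_power(p, k)        # dense k-step transitions
    # log(0) is undefined, so clip tiny entries before the logarithm;
    # negative entries are truncated to 0 as in shifted-PMI matrices.
    return np.maximum(np.log(np.clip(pk, 1e-12, None)) - np.log(shift), 0.0)

# 4-node path graph: nodes two hops apart get a nonzero 2-step score.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
sim = shifted_log_similarity(adj, k=2)
print(sim.shape)  # (4, 4)
```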
A few recent works thus propose to speed up the construction of such a node similarity matrix (Zhang et al., 2018b; Qiu et al., 2018, 2019), drawing on spectral graph theory. The basic idea is that if the top-d eigendecomposition of the adjacency matrix A is U Λ U^T, then an element-wise transformation f(A) can be approximated with U f(Λ) U^T (Qiu et al., 2018; Zhang et al., 2018b). The major drawback of these methods is that they must drop the element-wise normalization (such as taking the logarithm) of the similarity matrix to achieve better scalability, which harms the quality of the embeddings (Chen et al., 2017). A recent method (Qiu et al., 2019) proposes to sparsify the similarity matrix for better scalability. However, the matrix remains far from sparse even after sparsification: for a graph with n nodes and m edges, the number of entries in the sparsified matrix can still far exceed m (Qiu et al., 2019).
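The eigendecomposition trick can be illustrated in a few lines. For a symmetric matrix, with d equal to the full rank, the "approximation" is exact, which gives an easy sanity check (a generic sketch, not any particular paper's implementation):

```python
import numpy as np

def spectral_filter(adj, f, d):
    """Approximate f(A) (f applied to A as a matrix function) with the
    top-d eigenpairs: f(A) ~= U_d f(Lambda_d) U_d^T for symmetric A."""
    vals, vecs = np.linalg.eigh(adj)        # eigenpairs of symmetric A
    idx = np.argsort(-np.abs(vals))[:d]     # keep d largest in magnitude
    u, lam = vecs[:, idx], vals[idx]
    # (u * f(lam)) scales each eigenvector column by f(eigenvalue).
    return (u * f(lam)) @ u.T

# Sanity check on a small symmetric matrix: with d = n, approximating
# f(x) = x**2 is exact and reproduces A @ A.
a = np.array([[0., 1., 1.],
              [1., 0., 0.],
              [1., 0., 0.]])
approx = spectral_filter(a, lambda x: x ** 2, d=3)
print(np.allclose(approx, a @ a))  # True
```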
Perhaps the most relevant work is RandNE (Zhang et al., 2018a), which considers a Gaussian random projection approach for network embeddings. There are three key differences between our work and RandNE:

We are the first to identify the two key factors for constructing the node similarity matrix: high-order proximity preservation and element-wise normalization. In particular, the importance of normalization has been overlooked in many previous studies, including RandNE (Zhang et al., 2018a; Qiu et al., 2018; Zhang et al., 2018b).

Based on theoretical analysis, we derive a normalization algorithm that properly downweights the influence of highdegree nodes in the node similarity matrix. An additional advantage of our normalization approach is that it can be formalized as a simple matrix multiplication operation, which enables fast iterative computation when combined with random projection.

We explore the use of very sparse random projection for network embeddings, which is more efficient than traditional Gaussian random projection. As shown in the experiments, FastRP achieves substantially better performance on challenging downstream tasks while also being faster.
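The two design choices can be combined into a short sketch. This illustrates the general recipe (a very sparse random projection followed by iterative propagation with degree-based normalization), not the reference FastRP implementation; the parameter names and default values here are our own:

```python
import numpy as np
import scipy.sparse as sp

def sparse_projection_matrix(n, d, s=3.0, seed=0):
    """Very sparse random projection matrix of Li et al.: entries are
    +sqrt(s), 0, -sqrt(s) with probabilities 1/(2s), 1 - 1/s, 1/(2s)."""
    rng = np.random.default_rng(seed)
    return rng.choice([np.sqrt(s), 0.0, -np.sqrt(s)],
                      size=(n, d),
                      p=[1 / (2 * s), 1 - 1 / s, 1 / (2 * s)])

def fastrp_like(adj, d=32, weights=(0.0, 1.0, 1.0), beta=-0.5, seed=0):
    """Sketch of the iterative scheme: project once with a sparse random
    matrix, then repeatedly propagate through the transition matrix and
    mix the resulting powers with degree^beta scaling. The k-step
    similarity matrix is never materialized."""
    n = adj.shape[0]
    deg = np.asarray(adj.sum(axis=1)).ravel()
    p = sp.diags(1.0 / np.clip(deg, 1.0, None)) @ adj   # transition matrix
    scale = sp.diags(np.clip(deg, 1.0, None) ** beta)   # degree normalization
    emb = sparse_projection_matrix(n, d, seed=seed)     # project first...
    out = np.zeros((n, d))
    for w in weights:                                   # ...then propagate
        emb = p @ emb                                   # one O(m*d) product
        out += w * (scale @ emb)
    return out

adj = sp.csr_matrix(np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], float))
emb = fastrp_like(adj, d=8)
print(emb.shape)  # (3, 8)
```

Because each step touches only the m nonzeros of the transition matrix, the total cost stays linear in the graph size, which is the property verified empirically in Section 4.8.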
Graph-based Recommendation Systems. Our work is also related to graph-based recommendation systems, which consider a special kind of graph: the bipartite graph between users and items. Typically, the goal is to generate the top-K items a user will be most interested in.
Several early works in this field emphasize the importance of high-order, transitive relationships between users and items (Pirotte et al., 2007; Fouss et al., 2005; Cooper et al., 2014; Christoffel et al., 2015) for top-K recommendation. Fouss et al. (Pirotte et al., 2007; Fouss et al., 2005) present P^3, which directly uses the entries of the third power of the transition matrix to rank items. Although P^3 achieves competitive recommendation performance, its rankings are strongly influenced by item popularity (Cooper et al., 2014; Christoffel et al., 2015): popular items tend to dominate the recommendation list for most users. To this end, reweighted versions of P^3 are proposed (Cooper et al., 2014; Christoffel et al., 2015). Their idea of downweighting popular items is similar to the normalization strategy in this paper. However, these methods are heuristics designed specifically for bipartite graphs and the task of top-K recommendation, and do not generalize to other scenarios. Moreover, the third power of the transition matrix is either computed exactly (Cooper et al., 2014) or approximated by sampling a large number of random walks (Christoffel et al., 2015), neither of which is scalable.
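The P^3-style reweighting idea can be sketched as follows. This is a toy illustration (the published methods differ in exactly how the damping exponent is applied): items are scored by the cubed transition matrix, optionally divided by item popularity raised to a damping exponent.

```python
import numpy as np

def p3_scores(adj, alpha=0.0):
    """Score items for each user by entries of the cubed transition
    matrix, optionally divided by item_degree^alpha to damp the
    influence of popular items (P^3-style reweighting, sketched)."""
    deg = adj.sum(axis=1, keepdims=True)
    p = adj / np.clip(deg, 1.0, None)       # row-normalized transitions
    scores = np.linalg.matrix_power(p, 3)   # 3-step reachability scores
    item_deg = adj.sum(axis=0)              # popularity of each column
    return scores / np.clip(item_deg, 1.0, None) ** alpha

# Bipartite user-item graph folded into one adjacency matrix:
# rows/cols 0-1 are users, 2-4 are items; item 2 is the most popular.
adj = np.array([
    [0, 0, 1, 1, 0],
    [0, 0, 1, 0, 1],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
], dtype=float)
plain = p3_scores(adj, alpha=0.0)
damped = p3_scores(adj, alpha=1.0)
# Damping lowers the popular item's score relative to the plain ranking.
print(plain[0, 2] > damped[0, 2])  # True
```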
6. Conclusion
We present FastRP, a scalable algorithm for obtaining distributed representations of nodes in a graph. FastRP first constructs a node similarity matrix that captures high-order proximity between nodes and normalizes its entries based on the convergence properties of the transition matrix powers. Very sparse random projection is then applied to this similarity matrix to obtain node embeddings. Experimental results show that FastRP achieves a speedup of over three orders of magnitude over the state-of-the-art method DeepWalk, while producing embeddings of comparable or even better quality.
7. Acknowledgements
This work was partially supported by NSF grants IIS1546113 and IIS1927227.
References
 Achlioptas (2003) Dimitris Achlioptas. 2003. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences 66, 4 (2003), 671–687.
 Arriaga and Vempala (1999) Rosa I Arriaga and Santosh Vempala. 1999. An algorithmic theory of learning: Robust concepts and random projection. In 40th Annual Symposium on Foundations of Computer Science (Cat. No. 99CB37039). IEEE, 616–623.
 Barabási and Bonabeau (2003) Albert-László Barabási and Eric Bonabeau. 2003. Scale-free networks. Scientific American 288, 5 (2003), 60–69.
 Belkin and Niyogi (2002) Mikhail Belkin and Partha Niyogi. 2002. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Advances in neural information processing systems. 585–591.
 Blumofe et al. (1996) Robert D Blumofe, Christopher F Joerg, Bradley C Kuszmaul, Charles E Leiserson, Keith H Randall, and Yuli Zhou. 1996. Cilk: An efficient multithreaded runtime system. Journal of parallel and distributed computing 37, 1 (1996), 55–69.
 Boldi and Vigna (2014) Paolo Boldi and Sebastiano Vigna. 2014. Axioms for centrality. Internet Mathematics 10, 3–4 (2014), 222–262.
 Cao et al. (2015) Shaosheng Cao, Wei Lu, and Qiongkai Xu. 2015. Grarep: Learning graph representations with global structural information. In CIKM. ACM, 891–900.
 Chen et al. (2017) Siheng Chen, Sufeng Niu, Leman Akoglu, Jelena Kovačević, and Christos Faloutsos. 2017. Fast, warped graph embedding: Unifying framework and one-click algorithm. arXiv preprint arXiv:1702.05764 (2017).
 Christoffel et al. (2015) Fabian Christoffel, Bibek Paudel, Chris Newell, and Abraham Bernstein. 2015. Blockbusters and wallflowers: Accurate, diverse, and scalable recommendations with random walks. In Proceedings of the 9th ACM Conference on Recommender Systems. ACM, 163–170.
 Cooper et al. (2014) Colin Cooper, Sang Hyuk Lee, Tomasz Radzik, and Yiannis Siantos. 2014. Random walks in recommender systems: exact computation and simulations. In Proceedings of the 23rd International Conference on World Wide Web. ACM, 811–816.
 Crawl ([n. d.]) Common Crawl. [n. d.]. Common Crawl’s Web graph data. http://commoncrawl.org/2017/05/hostgraph2017febmaraprcrawls/. Accessed: 20181201.
 Dasgupta and Gupta (1999) Sanjoy Dasgupta and Anupam Gupta. 1999. An elementary proof of the Johnson-Lindenstrauss lemma. International Computer Science Institute, Technical Report 22, 1 (1999), 1–5.
 Fan et al. (2008) Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9 (2008), 1871–1874.
 Fouss et al. (2005) Francois Fouss, Alain Pirotte, and Marco Saerens. 2005. A novel way of computing similarities between nodes of a graph, with application to collaborative recommendation. In Proceedings of the 2005 IEEE/WIC/ACM International Conference on Web Intelligence. IEEE Computer Society, 550–556.
 Frobenius (1912) Georg Frobenius. 1912. Über Matrizen aus nicht negativen Elementen. (1912).
 Grover and Leskovec (2016) Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 855–864.
 Halko et al. (2011) Nathan Halko, PerGunnar Martinsson, and Joel A Tropp. 2011. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review 53, 2 (2011), 217–288.
 Khosla et al. (2019) Megha Khosla, Avishek Anand, and Vinay Setty. 2019. A Comprehensive Comparison of Unsupervised Network Representation Learning Methods. arXiv preprint arXiv:1903.07902 (2019).
 Levy and Goldberg (2014) Omer Levy and Yoav Goldberg. 2014. Neural word embedding as implicit matrix factorization. In Advances in neural information processing systems. 2177–2185.
 Li et al. (2006) Ping Li, Trevor J Hastie, and Kenneth W Church. 2006. Very sparse random projections. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 287–296.
 Maaten and Hinton (2008) Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9 (2008), 2579–2605.
 Mikolov et al. (2013) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).
 Perozzi et al. (2014) Bryan Perozzi, Rami AlRfou, and Steven Skiena. 2014. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 701–710.
 Pirotte et al. (2007) Alain Pirotte, Jean-Michel Renders, Marco Saerens, et al. 2007. Random-walk computation of similarities between nodes of a graph with application to collaborative recommendation. IEEE Transactions on Knowledge & Data Engineering 3 (2007), 355–369.
 Qiu et al. (2019) Jiezhong Qiu, Yuxiao Dong, Hao Ma, Jian Li, Chi Wang, and Kuansan Wang. 2019. NetSMF: Large-Scale Network Embedding as Sparse Matrix Factorization. In The World Wide Web Conference. ACM.
 Qiu et al. (2018) Jiezhong Qiu, Yuxiao Dong, Hao Ma, Jian Li, Kuansan Wang, and Jie Tang. 2018. Network embedding as matrix factorization: Unifying deepwalk, line, pte, and node2vec. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. ACM, 459–467.
 Roweis and Saul (2000) Sam T Roweis and Lawrence K Saul. 2000. Nonlinear dimensionality reduction by locally linear embedding. science 290, 5500 (2000), 2323–2326.
 Salton and Buckley (1988) Gerard Salton and Christopher Buckley. 1988. Termweighting approaches in automatic text retrieval. Information processing & management 24, 5 (1988), 513–523.
 Tang et al. (2015a) Jian Tang, Meng Qu, and Qiaozhu Mei. 2015a. Pte: Predictive text embedding through largescale heterogeneous text networks. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1165–1174.
 Tang et al. (2015b) Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. 2015b. LINE: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web. 1067–1077.
 Tang and Liu (2009) Lei Tang and Huan Liu. 2009. Scalable learning of collective behavior based on sparse social dimensions. In Proceedings of the 18th ACM conference on Information and knowledge management. ACM, 1107–1116.
 Tenenbaum et al. (2000) Joshua B Tenenbaum, Vin De Silva, and John C Langford. 2000. A global geometric framework for nonlinear dimensionality reduction. science 290, 5500 (2000), 2319–2323.
 Tsitsulin et al. (2018) Anton Tsitsulin, Davide Mottin, Panagiotis Karras, and Emmanuel Müller. 2018. Verse: Versatile graph embeddings from similarity measures. In Proceedings of the 2018 World Wide Web Conference on World Wide Web. 539–548.
 Tukey et al. (1957) John W Tukey et al. 1957. On the comparative anatomy of transformations. The Annals of Mathematical Statistics 28, 3 (1957), 602–632.
 Vempala (2005) Santosh S Vempala. 2005. The random projection method. Vol. 65. American Mathematical Soc.
 Wang et al. (2016) Daixin Wang, Peng Cui, and Wenwu Zhu. 2016. Structural deep network embedding. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 1225–1234.
 Yang et al. (2015) Cheng Yang, Zhiyuan Liu, Deli Zhao, Maosong Sun, and Edward Chang. 2015. Network representation learning with rich text information. In TwentyFourth International Joint Conference on Artificial Intelligence.
 Zhang et al. (2018a) Ziwei Zhang, Peng Cui, Haoyang Li, Xiao Wang, and Wenwu Zhu. 2018a. Billion-scale Network Embedding with Iterative Random Projection. In 2018 IEEE International Conference on Data Mining (ICDM). IEEE, 787–796.
 Zhang et al. (2018b) Ziwei Zhang, Peng Cui, Xiao Wang, Jian Pei, Xuanrong Yao, and Wenwu Zhu. 2018b. Arbitrary-order proximity preserved network embedding. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, 2778–2786.