struc2gauss: Structural Role Preserving Network Embedding via Gaussian Embedding
Abstract
Network embedding (NE) plays a principal role in network mining because it maps nodes to compact low-dimensional embedding vectors. However, state-of-the-art NE methods have two major limitations: role preservation and uncertainty modeling. Almost all previous methods represent a node as a point in space and focus on local structural information, i.e., neighborhood information. Neighborhood information does not capture global structural information, and a point vector representation cannot model the uncertainty of node representations. In this paper, we propose a new NE framework, struc2gauss, which learns node representations in the space of Gaussian distributions and performs network embedding based on global structural information. struc2gauss first employs a given node similarity metric to measure the global structural information, then generates structural context for nodes, and finally learns node representations via Gaussian embedding. Different structural similarity measures of networks and energy functions of Gaussian embedding are investigated. Experiments conducted on real-world networks demonstrate that struc2gauss effectively captures global structural information where state-of-the-art network embedding methods fail to, outperforms other methods on structure-based clustering and classification tasks, and provides more information on the uncertainties of node representations.
Keywords:
Gaussian Embedding, Structural Similarity, Uncertainty Modeling
1 Introduction
Network analysis consists of numerous tasks including community detection (fortunato2010community), role discovery (rossi2015role), link prediction (liben2007link), etc. Because relations between nodes violate the i.i.d. assumption, it is nontrivial to apply traditional data mining techniques to networks directly. Network embedding (NE) fills the gap by mapping nodes in a network into a low-dimensional space according to their structural information in the network. It has been reported that using embedded node representations can achieve promising performance on many network analysis tasks (perozzi2014deepwalk; grover2016node2vec; cao2015grarep; ribeiro2017struc2vec).
Previous NE techniques mainly relied on eigendecomposition (shaw2009structure; tenenbaum2000global), but the high computational complexity of eigendecomposition makes it difficult to apply to real-world networks. With the fast development of neural network techniques, unsupervised embedding algorithms have been widely used in natural language processing (NLP), where words or phrases from the vocabulary are mapped to vectors in the learned embedding space, e.g., word2vec (mikolov2013efficient; mikolov2013distributed) and GloVe (pennington2014glove). By drawing an analogy between paths consisting of several nodes on networks and word sequences in text, DeepWalk (perozzi2014deepwalk) learns node representations based on random walks using the same mechanism as word2vec. Afterwards, a sequence of studies has been conducted to improve DeepWalk either by extending the definition of neighborhood to higher-order proximity (cao2015grarep; tang2015line; grover2016node2vec; perozzi2016walklets) or by incorporating more information into node representations, such as attributes (li2017attributed; wang2017attributed) and heterogeneity (chang2015heterogeneous; tang2015pte).
Although a variety of NE methods have been proposed, two major limitations exist in previous NE studies: role preservation and uncertainty modeling. Previous methods addressed only one of these two limitations while neglecting the other. In particular, for role preservation, most studies applied random walks to learn representations. However, random walk based embedding strategies and their higher-order extensions can only capture local structural information, i.e., first-order and higher-order proximity within the neighborhood of the target node (lyu2017enhancing). Local structural information is reflected in the community structures of networks, but these methods may fail to capture global structural information, i.e., structural roles (rossi2015role; pei2018dynmf). Global structural information represents the roles of nodes in networks, where two nodes have the same role if they are structurally similar from a global perspective. An example of global structural information (roles) and local structural information (communities) is shown in Fig. 1. In summary, nodes that belong to the same community require dense local connections, while nodes that have the same role may have no common neighbors at all (tu2018deep). Empirical evidence illustrating this limitation on this example will be shown in Section 5.2. For uncertainty modeling, most previous methods represented a node as a point vector in the learned embedding space. However, real-world networks may be noisy and imbalanced. For example, node degree distributions in real-world networks are often skewed, so that some low-degree nodes may contain less discriminative information (tu2018deep). Point vector representations learned by these methods are deterministic (dos2016multilabel) and are not capable of modeling the uncertainties of node representations.
A few studies in the literature try to address these limitations. For instance, struc2vec (ribeiro2017struc2vec) builds a hierarchy to measure similarity at different scales, and constructs a multilayer graph to encode the structural similarities. SNS (lyu2017enhancing) discovers graphlets as a preprocessing step to obtain structurally similar nodes. DRNE (tu2018deep) learns network embeddings by modeling regular equivalence (wasserman1994social). However, these studies aim only to solve the problem of role preservation to some extent, so the limitation of uncertainty modeling remains a challenge. (dos2016multilabel) and (bojchevski2017deep) improve classification tasks by embedding nodes into Gaussian distributions, but both methods capture only the neighborhood information based on random walk techniques. DVNE (zhu2018deep) learns Gaussian embeddings for nodes in the Wasserstein space as the latent representations to capture uncertainties of nodes, but, like previous methods, it focuses only on first- and second-order proximity of networks. Therefore, the problem of role preservation has not been solved in these studies.
In this paper, we propose struc2gauss, a new structural role preserving network embedding framework. struc2gauss learns node representations in the space of Gaussian distributions and performs NE based on global structural information so that it can address both limitations simultaneously. On the one hand, struc2gauss generates node context based on a global structural similarity measure to learn node representations so that global structural information can be taken into consideration. On the other hand, struc2gauss learns node representations via Gaussian embedding: each node is represented as a Gaussian distribution where the mean indicates the position of this node in the embedding space and the covariance represents its uncertainty. Furthermore, we analyze and compare two different energy functions for Gaussian embedding to calculate the closeness of two embedded Gaussian distributions, i.e., expected likelihood and KL divergence. To investigate the influence of structural information, we also compare the similarity measure adopted in struc2gauss with two other structural similarity measures for networks, i.e., MatchSim and SimRank.
We summarize the contributions of this paper as follows:

We propose a flexible structure preserving network embedding framework, struc2gauss, which learns node representations in the space of Gaussian distributions. struc2gauss is capable of preserving structural roles and modeling uncertainties.

We investigate the influence of different energy functions in Gaussian embedding and compare different structural similarity measures in preserving global structural information of networks.

We conduct extensive experiments in node clustering and classification tasks which demonstrate the effectiveness of struc2gauss in capturing the global structural role information of networks and modeling the uncertainty of learned node representations.
The rest of the paper is organized as follows. Section 2 provides an overview of the related work. We present the problem statement in Section 3. Section 4 explains the technical details of struc2gauss. In Section 5 we then discuss our experimental study. The possible extensions of struc2gauss are discussed in Section 6. Finally, in Section 7 we draw conclusions and outline directions for future work.
2 Related Work
2.1 Network Embedding
Network embedding methods map nodes in a network into a low-dimensional space according to their structural information in the network. The learned node representations can boost performance in many network analysis tasks, e.g., community detection and link prediction. Previous methods mainly viewed NE as part of dimensionality reduction techniques (goyal2018graph). They first construct a pairwise similarity graph based on neighborhood and then embed the nodes of the graph into a lower-dimensional vector space. Locally Linear Embedding (LLE) (tenenbaum2000global) and Laplacian Eigenmaps (belkin2002laplacian) are two representative methods in this category. SPE (shaw2009structure) learns a low-rank kernel matrix to capture the structures of the input graph via a set of linear inequalities as constraints. However, their high computational complexity makes these methods difficult to apply to real-world networks.
With the increasing attention attracted by neural network research, unsupervised neural network techniques have opened up a new world for embedding. word2vec, with its Skip-Gram and CBOW models (mikolov2013efficient; mikolov2013distributed), learns low-rank representations of words in text based on word context and shows promising results on different NLP tasks. Based on word2vec, DeepWalk (perozzi2014deepwalk) first introduced this embedding mechanism to networks by treating nodes as words and random walks as sentences. Afterwards, a sequence of studies has been conducted to improve DeepWalk either by extending the definition of neighborhood to higher-order proximity (cao2015grarep; tang2015line; grover2016node2vec; perozzi2016walklets) or by incorporating more information into node representations, such as attributes (li2017attributed; wang2017attributed) and heterogeneity (chang2015heterogeneous; tang2015pte). Recently, deeper neural networks have also been introduced to the NE problem to capture the nonlinear characteristics of networks, such as SDNE (wang2016structural). However, these approaches represent a node as a point vector in the learned embedding space and are not capable of modeling the uncertainties of node representations. To solve this problem, inspired by (vilnis2014word), Gaussian embedding has been used in NE. (bojchevski2017deep) learns node embeddings by leveraging Gaussian embedding to capture uncertainties. (dos2016multilabel) combines Gaussian embedding and a classification loss function for multi-label network classification. DVNE (zhu2018deep) learns a Gaussian embedding for each node in the Wasserstein space as the latent representation so that the uncertainties can be modeled. We refer the reader to (hamilton2017repre; cui2018survey; cai2018comprehensive) for more details.
Table 1: A brief summary of NE methods (per-method entries missing; the only surviving property column header is "uncertainty").
Method
DeepWalk (perozzi2014deepwalk)
LINE (tang2015line)
GraRep (cao2015grarep)
PTE (tang2015pte)
Walklets (perozzi2016walklets)
node2vec (grover2016node2vec)
EP (duran2017learning)
GraphSage (hamilton2017inductive)
struc2vec (ribeiro2017struc2vec)
DRNE (tu2018deep)
GraphWave (donnat2018learning)
DVNE (zhu2018deep)
SNS (lyu2017enhancing)
our method
Recent years have witnessed increasing interest in neural networks on graphs. Graph neural networks (scarselli2008graph) can also learn node representations, but they use more complicated operations such as convolution. (kipf2016semi) proposes a GCN model using an efficient layer-wise propagation rule based on a first-order approximation of spectral convolutions on graphs. (gilmer2017neural) introduces a general message passing neural network framework to interpret different previous neural models for graphs. GraphSAGE (hamilton2017inductive) learns node representations in an inductive manner by sampling a fixed-size neighborhood of each node and then applying a specific aggregator over it. Embedding Propagation (EP) (duran2017learning) learns representations of graphs by passing messages forward and backward in an unsupervised setting. Graph Attention Networks (GATs) (velickovic2017graph) extend graph convolutions by utilizing masked self-attention layers to assign different importances to different nodes with different-sized neighborhoods.
However, most NE methods as well as graph neural networks only concern the local structural information represented by paths consisting of linked nodes, i.e., the community structures of networks, and they fail to capture global structural information, i.e., structural roles. SNS (lyu2017enhancing), struc2vec (ribeiro2017struc2vec) and DRNE (tu2018deep) are exceptions which take global structural information into consideration. SNS uses graphlet information for structural similarity calculation as a preprocessing step. struc2vec applies dynamic time warping to measure the similarity between two nodes' degree sequences and builds a new multilayer graph based on the similarity. A mechanism similar to that of DeepWalk is then used to learn node representations. DRNE explicitly models regular equivalence, which is one way to define the structural role, and leverages a layer normalized LSTM (ba2016layer) to learn the representations of nodes. Another related work focusing on global structural information is REGAL (heimann2018regal). REGAL aims at matching nodes across different graphs, so global structural patterns must be considered. However, its target is network alignment rather than representation learning. A brief summary of these NE methods is listed in Table 1.
2.2 Structural Similarity
Structure based network analysis tasks can be categorized into two types: structural similarity calculation and network clustering.
Calculating structural similarities between nodes has been a hot topic in recent years and different methods have been proposed. SimRank (jeh2002simrank) is one of the most representative notions of structural similarity. It implements a recursive definition of node similarity based on the assumption that two objects are similar if they relate to similar objects. SimRank++ (antonellis2008simrank++) adds an evidence weight which partially compensates for the neighbor matching cardinality problem. PRank (zhao2009p) extends SimRank by jointly encoding both in- and out-link relationships into structural similarity computation. MatchSim (lin2009matchsim) uses maximal matching of neighbors to calculate the structural similarity. RoleSim (jin2011axiomatic) is the only similarity measure which can satisfy the automorphic equivalence properties.
Network clusters can be based on either global or local structural information. Graph clustering based on global structural information is the problem of role discovery (rossi2015role). In social science research, roles are represented as concepts of equivalence (wasserman1994social). Graph-based methods and feature-based methods have been proposed for this task. Graph-based methods take nodes and edges as input and directly partition nodes into groups based on their structural patterns. For example, the Mixed Membership Stochastic Blockmodel (airoldi2008mixed) infers the role distribution of each node using a Bayesian generative model. Feature-based methods first transform the original network into feature vectors and then use clustering methods to group nodes. For example, RolX (henderson2012rolx) employs ReFeX (henderson2011s) to extract features of networks and then uses non-negative matrix factorization to cluster nodes. Clustering based on local structural information corresponds to the problem of community detection (fortunato2010community). A community is a group of nodes that interact with each other more frequently than with those outside the group. Thus, it captures only local connections between nodes.
3 Problem Statement
We illustrated local community structure and global role structure in Section 1 using the example in Fig. 1. In this section, definitions of community and role will be presented and then we formally define the problem of structural role preserving network embedding.
The notion of structural role originates in social science and is used to describe nodes in a network from a global perspective. Formally,
Definition 1
Structural role. In a network, a set of nodes have the same role if they share similar structural properties (such as degree, clustering coefficient, and betweenness) and structural roles can often be associated with various functions in a network.
For example, hub nodes with high degree in a social network are more likely to be opinion leaders, whereas bridge nodes with high betweenness are gatekeepers to connect different groups. Structural roles can reflect the global structural information because two nodes which have the same role could be far from each other and have no direct links or shared neighbors. In contrast to roles, community structures focus on local connections between nodes.
Definition 2
Community structure. In a network, communities can represent the local structures of nodes, i.e., the organization of nodes in communities, with many edges joining nodes of the same community and comparatively few edges joining nodes of different communities (fortunato2010community). A community is a set of nodes where nodes in this set are densely connected internally.
It can be seen that community structure focuses on internal and local connections, so it captures the local structural information of networks.
In this study, we only consider the global structural information, i.e., structural role information, so without mentioning it explicitly, structural information indicates the global one and the keyphrases “structural role information” and “global structural information” are used interchangeably.
Definition 3
Structural Role Preserving Network Embedding. Given a network $G = (V, E)$, where $V$ is a set of nodes and $E$ is a set of edges between the nodes, the problem of Structural Role Preserving Network Embedding aims to represent each node $v \in V$ as a Gaussian distribution with mean $\mu$ and covariance $\Sigma$ in a low-dimensional space $\mathbb{R}^d$, i.e., learning a function
$$f: V \rightarrow \mathcal{N}(x; \mu, \Sigma),$$
where $\mu \in \mathbb{R}^d$ is the mean, $\Sigma \in \mathbb{R}^{d \times d}$ is the covariance and $d \ll |V|$. In the space $\mathbb{R}^d$, the global structural role information of nodes introduced in Definition 1 can be preserved, i.e., if two nodes have the same role their means should be similar, and the uncertainty of node representations can be captured, i.e., the values of the variances indicate the levels of uncertainty of the learned representations.
4 struc2gauss
An overview of our proposed struc2gauss framework is shown in Fig. 2. Given a network, a similarity measure is employed to calculate the similarity matrix, and then the training set, which consists of positive and negative pairs, is sampled based on the similarity matrix. Finally, Gaussian embedding techniques are applied to the training set to generate the embedded Gaussian distributions, which serve as the node representations together with the uncertainties of these representations. In addition, we analyze the computational complexity and the flexibility of our struc2gauss framework.
4.1 Structural Similarity Calculation
It has been theoretically proved that random walk sampling based NE methods are not capable of capturing structural equivalence (lyu2017enhancing) which is one way to model the structural roles in networks (wasserman1994social). Thus, to capture the global structural information, we calculate the pairwise structural similarity as a preprocessing step similar to (lyu2017enhancing; ribeiro2017struc2vec).
In the literature, a variety of structural similarity measures have been proposed to calculate node similarity based on the structures of networks, e.g., SimRank (jeh2002simrank), MatchSim (lin2009matchsim) and RoleSim (jin2011axiomatic; jin2014scalable). However, not all of these measures can capture the global structural role information, and we will show the empirical evidence in the experiments in Section 5. Therefore, in this paper we leverage RoleSim for the structural similarity since it satisfies all the requirements of the Axiomatic Role Similarity Properties for modeling the equivalence (jin2011axiomatic), i.e., the structural roles. RoleSim also generalizes the Jaccard coefficient and corresponds linearly to the maximal weighted matching. The RoleSim similarity between two nodes $u$ and $v$ is defined as:
$$\mathrm{RoleSim}(u, v) = (1 - \beta) \max_{M(u,v)} \frac{\sum_{(x,y) \in M(u,v)} \mathrm{RoleSim}(x, y)}{|N(u)| + |N(v)| - |M(u,v)|} + \beta \tag{1}$$
where $|N(u)|$ and $|N(v)|$ are the numbers of neighbors of nodes $u$ and $v$, respectively. $M(u,v)$ is a matching between $N(u)$ and $N(v)$, i.e., $M(u,v) \subseteq N(u) \times N(v)$ is a bijection between subsets of $N(u)$ and $N(v)$. The parameter $\beta$ is a decay factor where $0 < \beta < 1$. The intuition of RoleSim is that two nodes are structurally similar if their corresponding neighbors are also structurally similar. This intuition is consistent with the notion of automorphic and regular equivalence (wasserman1994social).
In practice, RoleSim values can be computed iteratively and are guaranteed to converge. The procedure of computing RoleSim consists of three steps:

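Step 1: Initialize the matrix of RoleSim scores $R^0$;

Step 2: Compute the $k$-th iteration scores $R^k$ from the $(k-1)$-th iteration's values $R^{k-1}$, using:
$$R^k(u, v) = (1 - \beta) \max_{M(u,v)} \frac{\sum_{(x,y) \in M(u,v)} R^{k-1}(x, y)}{|N(u)| + |N(v)| - |M(u,v)|} + \beta \tag{2}$$
Step 3: Repeat Step 2 until the $R$ values converge for each pair of nodes.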
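The three steps can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names, the all-ones initialization, the default `beta`, the stopping tolerance, and the handling of nodes without neighbors are assumptions of the sketch, and the maximal matching is replaced by a greedy matching, which only approximates it.

```python
def greedy_matching_weight(nu, nv, sim):
    """Total weight of a greedy matching between neighbor lists nu and nv."""
    # Visit candidate pairs from the most similar to the least similar.
    pairs = sorted(((sim[x][y], x, y) for x in nu for y in nv),
                   key=lambda t: -t[0])
    used_x, used_y, total = set(), set(), 0.0
    for w, x, y in pairs:
        if x not in used_x and y not in used_y:
            used_x.add(x)
            used_y.add(y)
            total += w
    return total


def rolesim(adj, beta=0.15, max_iter=100, tol=1e-8):
    """Iterate the RoleSim update on a graph given as {node: [neighbors]}."""
    nodes = list(adj)
    # Step 1: all-ones initialization (an assumption of this sketch).
    sim = {u: {v: 1.0 for v in nodes} for u in nodes}
    for _ in range(max_iter):
        new = {u: {} for u in nodes}
        delta = 0.0
        # Step 2: recompute every pair from the previous iteration's scores.
        for u in nodes:
            for v in nodes:
                nu, nv = adj[u], adj[v]
                if nu and nv:
                    # The matching has min(|N(u)|, |N(v)|) pairs.
                    denom = len(nu) + len(nv) - min(len(nu), len(nv))
                    score = greedy_matching_weight(nu, nv, sim) / denom
                else:
                    score = 0.0  # assumption: nothing to match
                new[u][v] = (1.0 - beta) * score + beta
                delta = max(delta, abs(new[u][v] - sim[u][v]))
        sim = new
        # Step 3: stop once no score changed by more than tol.
        if delta < tol:
            break
    return sim


# Toy star graph: node 0 is the hub, nodes 1-3 are leaves.
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
scores = rolesim(star)
```

On the star graph the leaves are structurally interchangeable, so their pairwise scores stay at the maximum, while the hub-leaf score settles at a much lower value.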
Note that strategies other than structural similarity can also be used to capture the global structural role information; these possible strategies will be discussed in Section 6. The advantage of RoleSim over other structural measures in capturing structural roles will be discussed empirically in Section 5.6.
4.2 Training Set Sampling
The target of structural role preserving network embedding is to map nodes in the network to a latent space where the learned latent representations of two nodes are (1) more similar if these two nodes are structurally similar, and (2) more dissimilar if these two nodes are not structurally similar. Hence, we need to generate structurally similar and dissimilar node pairs as the training set based on the similarity we learned in Section 4.1. We name the structurally similar pairs of nodes the positive set and the structurally dissimilar pairs the negative set.
In detail, for node $v$, we rank its similarity values towards other nodes and then select the top-$K$ most similar nodes $u_i$, $i = 1, \ldots, K$, to form its positive set $\Gamma_+ = \{(v, u_i) \mid i = 1, \ldots, K\}$. For the negative set, we randomly select the same number of nodes $u'_i$, $i = 1, \ldots, K$, as in (vilnis2014word) and other random walk sampling based methods (perozzi2014deepwalk; tang2015line; grover2016node2vec), i.e., $\Gamma_- = \{(v, u'_i) \mid i = 1, \ldots, K\}$. Therefore, $K$ is a parameter indicating the number of positive/negative nodes per node. We generate $r$ positive and negative sets for each node, where $r$ is a parameter indicating the number of samples per node. The influence of these parameters will be analyzed empirically in Section 5.7. Note that the selection of the positive set is similar to that in DeepWalk; the difference is that we follow the similarity rank to select the positive nodes instead of random walks.
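The sampling step can be sketched as follows. The function name, the nested-dictionary layout of the similarity matrix, and the `seed` argument are hypothetical, and the sketch makes one simplifying assumption: negatives are drawn uniformly, with replacement, from all nodes other than the target node.

```python
import heapq
import random


def sample_training_pairs(sim, k, seed=0):
    """Build positive and negative pairs from a similarity matrix {u: {v: score}}."""
    rng = random.Random(seed)
    positives, negatives = [], []
    for v in sim:
        others = [u for u in sim if u != v]
        # Positive set: the k nodes most similar to v.
        top = heapq.nlargest(k, others, key=lambda u: sim[v][u])
        positives.extend((v, u) for u in top)
        # Negative set: the same number of nodes, drawn uniformly at random.
        negatives.extend((v, rng.choice(others)) for _ in top)
    return positives, negatives


# Toy similarity matrix over four nodes.
toy_sim = {
    "a": {"a": 1.0, "b": 0.9, "c": 0.2, "d": 0.1},
    "b": {"a": 0.9, "b": 1.0, "c": 0.3, "d": 0.2},
    "c": {"a": 0.2, "b": 0.3, "c": 1.0, "d": 0.8},
    "d": {"a": 0.1, "b": 0.2, "c": 0.8, "d": 1.0},
}
pos_pairs, neg_pairs = sample_training_pairs(toy_sim, k=1)
```

Calling the function several times with different seeds would give several positive and negative sets per node.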
4.3 Gaussian Embedding
Overview
Recently, language modeling techniques such as word2vec have been extensively used to learn word representations, and almost all NE studies are based on these word embedding techniques. However, these NE studies map each entity to a fixed point vector in a low-dimensional space, so the uncertainties of the learned embeddings are ignored. Gaussian embedding aims to solve this problem by learning density-based distributed embeddings in the space of Gaussian distributions (vilnis2014word). Gaussian embedding has been utilized in different graph mining tasks including triplet classification on knowledge graphs (he2015learning), multi-label classification on heterogeneous graphs (dos2016multilabel), and link prediction and node classification on attributed graphs (bojchevski2017deep).
Gaussian embedding is trained with a ranking-based loss defined on the ranks of positive and negative samples. Following (vilnis2014word), we choose the max-margin ranking objective, which pushes the scores of positive pairs above those of negative pairs by a margin:
$$\mathcal{L} = \sum_{(v, u^+) \in \Gamma_+} \sum_{(v, u^-) \in \Gamma_-} \max\left(0,\; m - E(P_v, P_{u^+}) + E(P_v, P_{u^-})\right) \tag{3}$$
where $(v, u^+) \in \Gamma_+$ and $(v, u^-) \in \Gamma_-$ are the positive and negative pairs, respectively. $E(\cdot, \cdot)$ is the energy function which is used to measure the similarity of two distributions, $P_v$ and $P_u$ are the learned Gaussian distributions for nodes $v$ and $u$, and $m$ is the margin separating positive and negative pairs. In this paper, we present two different energy functions to measure the similarity of two distributions for node representation learning, i.e., expected likelihood and KL divergence based energy functions. For the learned Gaussian distribution $P_v \sim \mathcal{N}(\mu_v, \Sigma_v)$ for node $v$, to reduce the computational complexity, we restrict the covariance matrix $\Sigma_v$ to be either diagonal or spherical in this work.
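To make the max-margin objective concrete, the sketch below sums the hinge terms for each node over its own positive and negative samples. The function names and the dictionary layout are hypothetical. The energy used in the example is a stand-in (negative squared distance between point vectors), not one of the two Gaussian energies studied in the paper.

```python
def max_margin_loss(energy, embeddings, pos, neg, margin=1.0):
    """Sum of max(0, m - E(v, u+) + E(v, u-)) over each node's samples."""
    loss = 0.0
    for v in pos:
        for u_pos in pos[v]:
            for u_neg in neg.get(v, []):
                e_pos = energy(embeddings[v], embeddings[u_pos])
                e_neg = energy(embeddings[v], embeddings[u_neg])
                # No penalty once the positive pair beats the negative by the margin.
                loss += max(0.0, margin - e_pos + e_neg)
    return loss


def neg_sq_dist(a, b):
    """Stand-in energy for illustration: higher means more similar."""
    return -sum((x - y) ** 2 for x, y in zip(a, b))


toy_emb = {"v": [0.0, 0.0], "near": [0.1, 0.0], "far": [3.0, 0.0]}
loss_good = max_margin_loss(neg_sq_dist, toy_emb, {"v": ["near"]}, {"v": ["far"]})
loss_bad = max_margin_loss(neg_sq_dist, toy_emb, {"v": ["far"]}, {"v": ["near"]})
```

When the positive sample is already much closer than the negative one, the hinge term vanishes; swapping the two samples produces a large penalty.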
Expected Likelihood based Energy
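Although both dot product and inner product can be used to measure the similarity between two distributions, dot product considers only the means and does not incorporate the covariances. Thus, we use the inner product to measure the similarity. Formally, the integral of the inner product between two Gaussian distributions $P_i$ and $P_j$ (the learned Gaussian embeddings for nodes $i$ and $j$, respectively), a.k.a. the expected likelihood, is defined as:
$$E(P_i, P_j) = \int_{x \in \mathbb{R}^d} \mathcal{N}(x; \mu_i, \Sigma_i)\, \mathcal{N}(x; \mu_j, \Sigma_j)\, dx = \mathcal{N}(0; \mu_i - \mu_j, \Sigma_i + \Sigma_j) \tag{4}$$
For simplicity in computation and comparison, we use the logarithm of Eq. (4) as the final energy function:
$$E_{EL}(P_i, P_j) = \log E(P_i, P_j) = \log \mathcal{N}(0; \mu_i - \mu_j, \Sigma_i + \Sigma_j) = \frac{1}{2}\Big\{ -(\mu_i - \mu_j)^{\top} (\Sigma_i + \Sigma_j)^{-1} (\mu_i - \mu_j) - \log\det(\Sigma_i + \Sigma_j) - d \log(2\pi) \Big\} \tag{5}$$
where $d$ is the number of dimensions. The gradient of this energy function with respect to the means $\mu$ and covariances $\Sigma$ can be calculated in a closed form as:
$$\frac{\partial E_{EL}}{\partial \mu_i} = -\frac{\partial E_{EL}}{\partial \mu_j} = -\Delta_{ij}, \qquad \frac{\partial E_{EL}}{\partial \Sigma_i} = \frac{\partial E_{EL}}{\partial \Sigma_j} = \frac{1}{2}\left(\Delta_{ij}\Delta_{ij}^{\top} - (\Sigma_i + \Sigma_j)^{-1}\right) \tag{6}$$
where $\Delta_{ij} = (\Sigma_i + \Sigma_j)^{-1}(\mu_i - \mu_j)$ (he2015learning; vilnis2014word). Note that expected likelihood is a symmetric similarity measure, i.e., $E_{EL}(P_i, P_j) = E_{EL}(P_j, P_i)$.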
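For diagonal covariances the log expected likelihood reduces to a sum over dimensions. The sketch below computes the log-density of a zero vector under a Gaussian whose mean is the difference of the two means and whose covariance is the sum of the two covariances. The function name and the list-of-variances representation are assumptions made for illustration.

```python
import math


def expected_likelihood_energy(mu_i, var_i, mu_j, var_j):
    """log N(0; mu_i - mu_j, Sigma_i + Sigma_j) for diagonal covariances.

    var_i and var_j hold the diagonal entries (variances) of each covariance.
    """
    d = len(mu_i)
    quad = 0.0
    log_det = 0.0
    for a, va, b, vb in zip(mu_i, var_i, mu_j, var_j):
        s = va + vb                  # diagonal entry of Sigma_i + Sigma_j
        quad += (a - b) ** 2 / s     # Mahalanobis term
        log_det += math.log(s)       # log-determinant of a diagonal matrix
    return 0.5 * (-quad - log_det - d * math.log(2.0 * math.pi))


same = expected_likelihood_energy([0.0], [0.5], [0.0], [0.5])
apart = expected_likelihood_energy([0.0], [0.5], [2.0], [0.5])
swapped = expected_likelihood_energy([2.0], [0.5], [0.0], [0.5])
```

The energy is highest when the means coincide, drops as they move apart, and does not depend on the order of the two arguments.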
KL Divergence based Energy
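KL divergence is another straightforward way to measure the similarity between two distributions, so we utilize an energy function based on the KL divergence to measure the similarity between Gaussian distributions $P_i$ and $P_j$ (the learned Gaussian embeddings for nodes $i$ and $j$, respectively). The negative sign makes a larger energy correspond to more similar distributions, as required by the ranking objective in Eq. (3):
$$E_{KL}(P_i, P_j) = -D_{KL}(P_j \,\|\, P_i) = -\int_{x \in \mathbb{R}^d} \mathcal{N}(x; \mu_j, \Sigma_j) \log \frac{\mathcal{N}(x; \mu_j, \Sigma_j)}{\mathcal{N}(x; \mu_i, \Sigma_i)}\, dx = -\frac{1}{2}\Big\{ \mathrm{tr}(\Sigma_i^{-1}\Sigma_j) + (\mu_i - \mu_j)^{\top} \Sigma_i^{-1} (\mu_i - \mu_j) - \log\frac{\det \Sigma_j}{\det \Sigma_i} - d \Big\} \tag{7}$$
where $d$ is the number of dimensions. Similarly, we can compute the gradients of this energy function with respect to the means $\mu$ and covariances $\Sigma$:
$$\frac{\partial E_{KL}}{\partial \mu_i} = -\frac{\partial E_{KL}}{\partial \mu_j} = -\Delta'_{ij}, \qquad \frac{\partial E_{KL}}{\partial \Sigma_i} = \frac{1}{2}\left(\Sigma_i^{-1}\Sigma_j\Sigma_i^{-1} + \Delta'_{ij}\Delta'^{\top}_{ij} - \Sigma_i^{-1}\right), \qquad \frac{\partial E_{KL}}{\partial \Sigma_j} = \frac{1}{2}\left(\Sigma_j^{-1} - \Sigma_i^{-1}\right) \tag{8}$$
where $\Delta'_{ij} = \Sigma_i^{-1}(\mu_i - \mu_j)$.
Note that the KL divergence based energy is asymmetric, but it can easily be extended to a symmetric similarity measure as follows:
$$E(P_i, P_j) = \frac{1}{2}\left(E_{KL}(P_i, P_j) + E_{KL}(P_j, P_i)\right) \tag{9}$$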
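A small numeric sketch of the KL-based energy for diagonal covariances follows. The function names are hypothetical, and the sketch assumes the convention that the energy is the negative KL divergence, so that a larger value means more similar distributions, which is what the ranking loss needs.

```python
import math


def kl_divergence(mu_a, var_a, mu_b, var_b):
    """D_KL(N(mu_a, var_a) || N(mu_b, var_b)) for diagonal covariances."""
    total = 0.0
    for ma, va, mb, vb in zip(mu_a, var_a, mu_b, var_b):
        # Per dimension: trace term + Mahalanobis term - 1 - log-determinant ratio.
        total += va / vb + (mb - ma) ** 2 / vb - 1.0 - math.log(va / vb)
    return 0.5 * total


def kl_energy(mu_i, var_i, mu_j, var_j):
    """Assumed convention: E_KL(P_i, P_j) = -D_KL(P_j || P_i)."""
    return -kl_divergence(mu_j, var_j, mu_i, var_i)


def symmetric_kl_energy(mu_i, var_i, mu_j, var_j):
    """Average of the two directed energies."""
    return 0.5 * (kl_energy(mu_i, var_i, mu_j, var_j)
                  + kl_energy(mu_j, var_j, mu_i, var_i))


e_ij = kl_energy([0.0], [1.0], [1.0], [2.0])
e_ji = kl_energy([1.0], [2.0], [0.0], [1.0])
e_sym = symmetric_kl_energy([0.0], [1.0], [1.0], [2.0])
```

The two directed energies differ whenever the variances differ, while the averaged version gives the same value in either order.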
4.4 Learning
To prevent the means from growing too large and to ensure that the covariances are positive definite and reasonably sized, we regularize the means and covariances when learning the embedding (vilnis2014word). Because of their different geometric characteristics, two different hard constraint strategies are used for means and covariances, respectively. Note that we only consider diagonal and spherical covariances. In particular, we have
$$\|\mu_i\|_2 \le C, \quad \forall i \tag{10}$$
$$c_{min} I \prec \Sigma_i \prec c_{max} I, \quad \forall i \tag{11}$$
The constraint on the means guarantees that they are sufficiently small, and the constraint on the covariances ensures that they are positive definite and of appropriate size. For example, $\Sigma_{ii} \leftarrow \max(c_{min}, \min(c_{max}, \Sigma_{ii}))$ can be used to regularize diagonal covariances.
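Both hard constraints can be enforced by a projection after each parameter update. The helper names and bound values below are illustrative assumptions, and rescaling a mean onto the norm ball is one common way to enforce a norm bound, not necessarily the one used by the authors.

```python
import math


def clip_mean(mu, c):
    """Rescale mu so that its Euclidean norm does not exceed c."""
    norm = math.sqrt(sum(x * x for x in mu))
    if norm <= c:
        return list(mu)
    return [x * c / norm for x in mu]


def clip_variances(var, c_min, c_max):
    """Clamp each diagonal covariance entry into [c_min, c_max]."""
    return [max(c_min, min(c_max, s)) for s in var]


clipped_mu = clip_mean([3.0, 4.0], 1.0)
clipped_var = clip_variances([0.001, 0.5, 10.0], 0.05, 2.0)
```

Keeping every variance above a positive lower bound is what keeps a diagonal covariance positive definite.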
We use AdaGrad (duchi2011adaptive) to optimize the parameters. The learning procedure is described in Algorithm 1. Initialization phase is from line 1 to 4, context generation is shown in line 7, and Gaussian embeddings are learned from line 8 to 14.
4.5 Computational Complexity
The complexity of the different components of struc2gauss is analyzed as follows:

For structural similarity calculation using RoleSim, the computational complexity is $O(k n^2 d')$, where $n$ is the number of nodes, $k$ is the number of iterations and $d'$ is the average of $y \log y$ over all node-pair bipartite graphs in $G$ (jin2011axiomatic), where $y = |N(u)| \times |N(v)|$ for each pair of nodes $u$ and $v$. The $y \log y$ term comes from the fast greedy matching algorithm, which offers a $\frac{1}{2}$-approximation of the globally optimal matching.

To generate the training set based on the similarity matrix, we need to sample from the $K$ most similar nodes for each node, i.e., to select the $K$ largest numbers from an unsorted array. Using a heap, the complexity is $O(n \log K)$.

For Gaussian embedding, the operations include matrix addition, multiplication and inversion. In practice, as stated above, we only consider two types of covariance matrices, i.e., diagonal and spherical, so all these operations have a complexity of $O(d)$, where $d$ is the embedding dimension.
Overall, the similarity calculation is the bottleneck of the framework. One possible and effective way to optimize this part is to set the similarity to 0 if two nodes have a large difference in degrees. The reasons are: (1) we generate the context based only on the most similar nodes; and (2) two nodes are less likely to be structurally similar if their degrees are very different.
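The degree-based shortcut can be sketched as a pre-filter that keeps only node pairs with comparable degrees. All other pairs would be assigned similarity 0 without being evaluated. The function name and the `max_gap` threshold are hypothetical; the text does not specify how large a degree difference counts as "large".

```python
from itertools import combinations


def degree_filtered_pairs(adj, max_gap):
    """Return node pairs whose degrees differ by at most max_gap."""
    keep = []
    for u, v in combinations(adj, 2):
        if abs(len(adj[u]) - len(adj[v])) <= max_gap:
            keep.append((u, v))
    return keep


# Star graph: the hub (degree 3) is never paired with a leaf (degree 1).
star_graph = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
kept_pairs = degree_filtered_pairs(star_graph, max_gap=0)
```

Only the kept pairs would then need to be scored by the similarity measure.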
5 Experiments
We evaluate struc2gauss on different tasks in order to understand its effectiveness in capturing structural information, its capability of modeling the uncertainties of embeddings and the stability of the model with respect to its parameters. We also study the influence of different similarity measures empirically. The source code of struc2gauss is available online.
5.1 Experimental Setup
Datasets
We conduct experiments on two types of network datasets: networks with and without ground-truth labels, where the labels represent the global structural role information of nodes in the networks. For networks with labels, to compare with the state of the art, we use the air-traffic networks from (ribeiro2017struc2vec), where the networks are undirected, nodes are airports, edges indicate the existence of commercial flights and labels correspond to the airports' levels of activity. For networks without labels, we select five real-world networks in different domains from Network Repository.
Type            Dataset        # nodes  # edges  # groups
with labels     Brazilian-air      131     1038         4
with labels     European-air       399     5995         4
with labels     USA-air           1190    13599         4
without labels  Arxiv GR-QC       5242    28980         8
without labels  Advogato          6551    51332        11
without labels  Hamsterster       2426    16630        10
without labels  Anybeat          12645    67053        15
without labels  Epinion          26588   100120        18
Baselines
We compare struc2gauss with several stateoftheart NE methods.

DeepWalk (perozzi2014deepwalk): It learns node representations based on random walks using the same mechanism as word2vec, by drawing an analogy between paths consisting of several nodes on networks and word sequences in text. The structural information is captured by the paths of nodes generated by random walks.

node2vec (grover2016node2vec): It extends DeepWalk to learn latent representations from the node paths generated by biased random walks. Two hyperparameters $p$ and $q$ are used to control whether the random walk behaves in a breadth-first or depth-first manner. In this way, node2vec can capture the structural information in networks. Note that when $p = q = 1$, node2vec degrades to DeepWalk.

LINE (tang2015line): It learns node embeddings via preserving both the local and global network structures. By extending DeepWalk, LINE aims to capture both the first-order proximities, i.e., the neighbors of nodes, and the second-order proximities, i.e., the shared neighborhood structures of nodes.

Embedding Propagation (EP) (duran2017learning): EP is an unsupervised learning framework for network embedding that learns vector representations of graphs by passing two types of messages between neighboring nodes. EP, as a graph neural network, is similar to graph convolutional networks (GCN) (kipf2016semi). The difference is that EP is unsupervised whereas GCN is designed for semi-supervised learning.

struc2vec (ribeiro2017struc2vec): It learns latent representations for the structural identity of nodes. Due to its high computational complexity, we use the combination of all optimizations proposed in the paper for large networks.

graph2gauss (vilnis2014word): It maps each node into a Gaussian distribution where the mean indicates the position of a node in the embedded space and the covariance denotes the uncertainty of the learned representation. (bojchevski2017deep) and (dos2016multilabel) extend the original Gaussian embedding method to the network embedding task.

DRNE (tu2018deep): It learns node representations based on the concept of regular equivalence. DRNE utilizes a layer normalized LSTM to represent each node by aggregating the representations of their neighborhoods in a recursive way so that the global structural information can be preserved.

GraphWave (donnat2018learning): It leverages heat wavelet diffusion patterns to learn a multidimensional structural embedding for each node based on the diffusion of a spectral graph wavelet centered at the node. Then the wavelets as distributions are used to capture structural similarity in graphs.
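To make the relation between the two random walk baselines concrete, the following is a minimal sketch of a biased second-order walk using the return parameter p and in-out parameter q of node2vec. The function name `biased_walk` and the adjacency-dictionary input are our own illustrative choices, not the authors' released implementation. With p = q = 1 all neighbors receive the same weight, so the walk reduces to the uniform walk used by DeepWalk.

```python
import random


def biased_walk(neighbors, start, length, p=1.0, q=1.0, seed=0):
    """Sample one node2vec-style walk (illustrative sketch only).

    neighbors: dict mapping each node to a set of adjacent nodes.
    p: return parameter; q: in-out parameter.
    """
    rng = random.Random(seed)
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        nbrs = sorted(neighbors[cur])
        if not nbrs:
            break  # dead end: isolated node
        if len(walk) == 1:
            # The first step has no previous node, so it is uniform.
            walk.append(rng.choice(nbrs))
            continue
        prev = walk[-2]
        weights = []
        for nxt in nbrs:
            if nxt == prev:
                weights.append(1.0 / p)  # step back to the previous node
            elif nxt in neighbors[prev]:
                weights.append(1.0)      # stay close to the previous node
            else:
                weights.append(1.0 / q)  # move further away
        walk.append(rng.choices(nbrs, weights=weights, k=1)[0])
    return walk


# Toy path graph 0 - 1 - 2 - 3.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
walk = biased_walk(path, start=0, length=6, p=1.0, q=1.0, seed=1)
```

Because every step moves to an adjacent node, walks of this kind only expose neighborhood co-occurrence, which is why such methods emphasize local community structure.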
For all baselines, we use the implementation released by the original authors. For our framework struc2gauss, we test four variants: struc2gauss with expected likelihood and diagonal covariance (s2g_el_d), expected likelihood and spherical covariance (s2g_el_s), KL divergence and diagonal covariance (s2g_kl_d), and KL divergence and spherical covariance (s2g_kl_s). Note that we only use means of Gaussian distributions as the node embeddings in role clustering and classification tasks. The covariances are left for uncertainty modeling.
Other settings, including parameters and evaluation metrics, are described separately for each task.
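The four variants differ only in the energy function and the covariance form. As a minimal sketch, the two energies can be written in closed form for diagonal Gaussians. The function names are ours, the covariance is passed as a list of per-dimension variances, and the direction of the KL divergence is a convention that we fix arbitrarily here. A spherical covariance is the special case where all entries are equal.

```python
import math


def log_expected_likelihood(mu_i, var_i, mu_j, var_j):
    """Log of the integral of N(x; mu_i, var_i) * N(x; mu_j, var_j) dx.

    For Gaussians this equals log N(0; mu_i - mu_j, var_i + var_j).
    """
    total = -0.5 * len(mu_i) * math.log(2.0 * math.pi)
    for mi, vi, mj, vj in zip(mu_i, var_i, mu_j, var_j):
        s = vi + vj
        total += -0.5 * math.log(s) - 0.5 * (mi - mj) ** 2 / s
    return total


def kl_divergence(mu_i, var_i, mu_j, var_j):
    """KL(N_i || N_j) for diagonal Gaussians (asymmetric)."""
    total = 0.0
    for mi, vi, mj, vj in zip(mu_i, var_i, mu_j, var_j):
        total += vi / vj + (mj - mi) ** 2 / vj - 1.0 + math.log(vj / vi)
    return 0.5 * total


def spherical(sigma2, dim):
    """Spherical covariance: the same variance in every dimension."""
    return [sigma2] * dim


# Two 2-D nodes; variances 0.5 and 2 are the covariance bounds used in our settings.
el = log_expected_likelihood([0.0, 0.0], spherical(0.5, 2), [1.0, 0.0], [0.5, 2.0])
kl_ij = kl_divergence([0.0], [0.5], [0.0], [2.0])
kl_ji = kl_divergence([0.0], [2.0], [0.0], [0.5])
```

The expected likelihood is symmetric in its two arguments while the KL divergence is not, which is one reason the two energy functions can behave differently.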
5.2 Case Study: Visualization in 2D space
We use the toy example shown in Fig. 1 to demonstrate the effectiveness of struc2gauss in capturing the global structural information and the failure of other state-of-the-art techniques in this task. The toy network consists of ten nodes and they can be clustered from two different perspectives:

from the perspective of the global role structure, they belong to three groups, i.e., (yellow color), (blue color) and (red color) because different groups have different structural functions in this network;

from the perspective of the local community structure, they belong to two groups, i.e., and because there are denser connections/more edges inside each community than outside the community.
Note that from the perspective of role discovery, these three groups of nodes can be explained to play the roles of periphery, star and bridge, respectively.
In this study, we aim to preserve the global structural information in network embedding. Fig. 3 shows the node representations learned by different methods. For parameters shared by all methods, we use the same default settings: representation dimension: 2, number of walks per node: 20, walk length: 80, skip-gram window size: 5. For node2vec, we set and . For graph2gauss and struc2gauss, the number of walks per node is 20 and the number of positive/negative nodes per node is 5. The constraint for means is 2 and constraints for covariances and are 0.5 and 2, respectively. From the visualization results, it can be observed that:

Our proposed struc2gauss outperforms all other methods. Both diagonal and spherical covariances can separate nodes based on global structural information and struc2gauss with spherical covariances performs better than diagonal covariances since it can recognize star and bridge nodes better.

Methods that aim to capture the global structural information perform better than random walk sampling based methods. For example, struc2vec can solve this problem to some extent. However, there is overlap between nodes 6 and 9. It has been stated that node2vec can capture structural equivalence, but the visualization shows that it still captures the local structural information, similar to DeepWalk.

DeepWalk, LINE and graph2gauss fail to capture the global structural information because these methods are based on random walks, which only capture the local community structures. DeepWalk is capable of capturing the local structural information, since nodes are separated into two parts corresponding to the two communities shown in Fig. 1.
5.3 Structural Role Clustering
The most common network mining application based on global structural information is role discovery, which is essentially a clustering task. Thus, we consider this task to illustrate the potential of node representations learned by struc2gauss. We use the latent representations learned by different methods (in struc2gauss, we use the means of the learned Gaussian distributions) as features and K-means as the clustering algorithm to cluster nodes.
Table 3: NMI for node clustering on networks with labels.

| Method | Brazilair | Europeair | USAair |
|---|---|---|---|
| DeepWalk (perozzi2014deepwalk) | 0.1303 | 0.0458 | 0.0766 |
| LINE (tang2015line) | 0.2215 | 0.1563 | 0.1275 |
| node2vec (grover2016node2vec) | 0.2516 | 0.1722 | 0.0945 |
| EP (duran2017learning) | 0.2283 | 0.1405 | 0.1007 |
| graph2gauss (vilnis2014word) | 0.1204 | 0.1109 | 0.0896 |
| struc2vec (ribeiro2017struc2vec) | 0.3758 | 0.2729 | 0.2486 |
| DRNE (tu2018deep) | 0.5244 | 0.2766 | 0.2918 |
| GraphWave (donnat2018learning) | 0.5040 | 0.3230 | 0.2452 |
| s2g_el_d | 0.5615 | 0.3234 | 0.3188 |
| s2g_el_s | 0.5396 | 0.2974 | 0.2967 |
| s2g_kl_d | 0.5527 | 0.3145 | 0.3212 |
| s2g_kl_s | 0.5675 | 0.3280 | 0.3217 |
Parameters. For these baselines, we use the same settings as in the literature: representation dimension: 128, number of walks per node: 20, walk length: 80, skip-gram window size: 10. For node2vec, we set and . For graph2gauss and struc2gauss, we set the constraint for means to be 2 and constraints for covariances and to be 0.5 and 2, respectively. The number of walks per node is 10, the number of positive/negative nodes per node is 120 and the representation dimension is also 128.
Evaluation Metrics. To quantitatively evaluate clustering performance in labeled networks, we use Normalized Mutual Information (NMI) as the evaluation metric. NMI evaluates clustering quality from an information-theoretic perspective: it divides the mutual information between the obtained cluster assignments and the ground-truth labels by the arithmetic average of their entropies:
$$\mathrm{NMI}(C, C') = \frac{2\, I(C; C')}{H(C) + H(C')} \tag{12}$$
where $C$ is the obtained clustering and $C'$ is the ground-truth clustering. The mutual information is defined as $I(C; C') = H(C) - H(C \mid C')$ and $H(\cdot)$ is the entropy.
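As a worked illustration of this metric, the following minimal sketch computes NMI with arithmetic-mean normalization from two label lists. The helper names are ours, and returning 1.0 when both labelings consist of a single cluster is a convention we assume for the degenerate case.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in nats) of a labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information (in nats) between two labelings of the same nodes."""
    n = len(a)
    count_a, count_b, joint = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (c / n) * math.log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


def nmi(a, b):
    """Mutual information divided by the arithmetic mean of the two entropies."""
    denom = entropy(a) + entropy(b)
    if denom == 0.0:
        return 1.0  # assumed convention: both labelings are a single cluster
    return 2.0 * mutual_information(a, b) / denom


same_partition = nmi([0, 0, 1, 1], [1, 1, 0, 0])  # identical up to renaming
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])     # statistically independent
```

The score depends only on the partition, not on the cluster identifiers, so relabeling the K-means output does not change the result.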
For unlabeled networks, we use normalized goodness-of-fit as the evaluation metric. Goodness-of-fit measures how well the representation of roles and the relations among these roles fit a given network (wasserman1994social). It assumes that the output of a role discovery method is an optimal model in which nodes belonging to the same role are perfectly structurally equivalent. In real-world social networks, nodes belonging to the same role are only approximately structurally equivalent. The essence of goodness-of-fit indices is to measure just how approximate these structural equivalences are. If the optimal model holds, then all nodes belonging to the same role are exactly structurally equivalent.
In detail, given a social network with vertices and roles, we have the adjacency matrix and the role set , where indicates node belongs to the th role, as obtained using DyNMF. Note that partitions , in the sense that each belongs to exactly one . Then the density matrix is defined as:
(13) 
We also define the block matrix based on the discovered roles. Several criteria can be used to build the block matrix, including the perfect fit, zero-block, one-block and density criteria (wasserman1994social). Since real social network data rarely contain perfectly structurally equivalent nodes (faust1992blockmodels), the perfect fit, zero-block and one-block criteria would not work well on real-world data, so we use the density criterion to construct the block matrix :
(14) 
where is the threshold to determine the values in blocks. The density criterion is based on the density of edges between nodes belonging to the same role and is defined as
(15) 
Based on the definitions of density matrix and block matrix , the goodness-of-fit index is defined as
(16) 
To make the evaluation metric value in the range of , we normalize goodness-of-fit by dividing where is the number of groups/roles. For more details about goodness-of-fit indices, please refer to (wasserman1994social).
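The computation can be sketched as follows. This is a hedged illustration in the style of a standard density blockmodel rather than a transcription of Eqs. (13) to (16). In particular, we assume that the threshold is the overall edge density of the network, that the index sums the absolute differences between the block matrix and the density matrix, and that normalization divides by the squared number of roles. All function names are hypothetical.

```python
def density_matrix(adj, roles, num_roles):
    """Edge density between every pair of roles (self-pairs excluded)."""
    members = [[v for v, r in enumerate(roles) if r == k] for k in range(num_roles)]
    delta = [[0.0] * num_roles for _ in range(num_roles)]
    for i in range(num_roles):
        for j in range(num_roles):
            ties = sum(adj[u][v] for u in members[i] for v in members[j] if u != v)
            if i == j:
                pairs = len(members[i]) * (len(members[i]) - 1)
            else:
                pairs = len(members[i]) * len(members[j])
            delta[i][j] = ties / pairs if pairs else 0.0
    return delta


def normalized_goodness_of_fit(adj, roles, num_roles):
    """Assumed form: sum of |block - density| divided by num_roles squared."""
    n = len(adj)  # assumes n > 1 and no self-loops
    threshold = sum(map(sum, adj)) / (n * (n - 1))  # assumed: overall density
    delta = density_matrix(adj, roles, num_roles)
    error = 0.0
    for i in range(num_roles):
        for j in range(num_roles):
            block = 1.0 if delta[i][j] >= threshold else 0.0
            error += abs(block - delta[i][j])
    return error / (num_roles * num_roles)


# Star graph: node 0 is the hub, nodes 1-3 are leaves.
star = [[0, 1, 1, 1], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]]
good = normalized_goodness_of_fit(star, [0, 1, 1, 1], 2)  # hub vs. leaves
poor = normalized_goodness_of_fit(star, [0, 0, 1, 1], 2)  # hub mixed with a leaf
```

Under these assumptions a lower value indicates a better fit: the hub/leaf assignment reproduces the blocks exactly, while mixing the hub with a leaf does not.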
Results. The NMI values for node clustering on networks with labels are shown in Table 3 and the normalized goodness-of-fit values for networks without labels are shown in Fig. 4. Note that random walk and neighbor based embedding methods, including DeepWalk, LINE, node2vec, EP and graph2gauss, aim at capturing local structural information and so are incapable of preserving structural roles. Hence, for simplicity, we do not compare them with the role-preserving methods on networks without clustering labels.
From these results, some conclusions can be drawn:

For both types of networks with and without clustering labels, struc2gauss outperforms all other methods in different evaluation metrics. It indicates the effectiveness of struc2gauss in capturing the global structural information.

Comparing struc2gauss with diagonal and spherical covariances, it can be observed that spherical covariance can achieve better performance in node clustering. This finding is similar to the results of word embedding in (vilnis2014word). A possible explanation is that spherical covariance requires the diagonal elements to be the same, which limits the representation power of the covariance matrices but in turn enhances the representation power of the learned means. Since we only use means to represent nodes, the method with spherical covariance could learn more relaxed means, which leads to better performance.

Among the baselines, struc2vec, GraphWave and DRNE can capture the structural role information to some extent, since they perform better than the random walk based methods (DeepWalk and node2vec) and the neighbor-based methods (EP and graph2gauss), all of which fail to capture the global structural information for node clustering.
5.4 Structural Role Classification
Node classification is another widely used task for embedding evaluation. Unlike previous studies, which focused on community structures, our approach aims to preserve the global role structures. Thus, we evaluate the effectiveness of struc2gauss in the role classification task. As in the node clustering task in Section 5.3, we use the latent representations learned by different methods as features. Each dataset is split into a training set and a test set (we explore the classification performance with different training set percentages). To focus on the learned representation, we use logistic regression as the classifier.
Since structural role classification is a supervised task, ground-truth labels are required. Thus we only use two air-traffic networks for evaluation. We compare our approach to the same state-of-the-art NE baselines used in Section 5.3, i.e., DeepWalk, LINE, node2vec, EP, graph2gauss, struc2vec, GraphWave and DRNE. Following (tu2018deep), we also compare to four centrality measures, i.e., closeness centrality, betweenness centrality, eigenvector centrality and k-core. Since the combination of these four measures performs best (tu2018deep), we only report the classification performance of the combination as features in this task. For the parameters of the baselines and struc2gauss, we use the same settings as in Section 5.3.
The average accuracies for structural role classification in Europeair and USAair are shown in Fig. 5 and 6. From the results, we can observe that:

struc2gauss outperforms almost all other methods in both networks, except DRNE in the Europeair network. In the Europeair network, struc2gauss with expected likelihood and spherical covariances, i.e., s2g_el_s, performs best. struc2gauss with KL divergence and spherical covariances, i.e., s2g_kl_s, achieves the second best performance, especially when the training ratio is larger than 0.7. The variants with diagonal covariances, i.e., s2g_el_d and s2g_kl_d, are on par with GraphWave, DRNE and struc2vec and outperform the other methods. In the USAair network, struc2gauss with different settings outperforms all baselines. This indicates the effectiveness of struc2gauss in modeling the structural role information. Although the best-performing combination of energy function and covariance form differs between the two networks, a struc2gauss variant is always the best.

Among the baselines, only struc2vec, GraphWave and DRNE can capture the structural role information so that they achieve better classification accuracy than other baselines. DRNE performs the best among these baselines since it captures regular equivalence. GraphWave and struc2vec are the second best baselines because they also aim to capture structural roles.

Random walk and neighbor based NE methods only capture local community structures, so they perform worse than struc2vec, GraphWave, DRNE and our proposed struc2gauss. Note that methods such as DeepWalk, LINE and node2vec, although they consider first-, second- and/or higher-order proximity, are still not capable of modeling structural role information.
5.5 Uncertainty Modeling
Mapping a node in a network into a distribution rather than a point vector allows us to model the uncertainty of the learned representation which is another advantage of struc2gauss. Different factors can lead to uncertainties of data. It is intuitive that the more noisy edges a node has, the less discriminative information it contains, thus making its embedding more uncertain. Similarly, incompleteness of information in the network can also bring uncertainties to the representation learning. Therefore, in this section, we study two factors: noisy information and incomplete information.
To verify these hypotheses, we conduct the following experiment using the Brazilair and Europeair networks. For noisy information, we randomly insert a certain number of edges into the network and then learn the latent representations and covariances. The average variance is used to measure the uncertainties. For the Brazilair network, the number of noisy edges ranges from 50 to 300 and for Europeair it ranges from 500 to 3000. For incomplete information, we randomly delete a certain number of edges from the network to make it incomplete and then learn the latent representations and covariances. Similarly, for the Brazilair network, the number of removed edges ranges from 50 to 300 and for Europeair it ranges from 500 to 3000. The other parameter settings are the same as in Section 5.3.
The results are shown in Fig. 7 and Fig. 9. It can be observed that average variance values become larger (1) as more noisy edges are added to the networks and (2) as more edges are removed from the networks. struc2gauss with different energy functions and covariance forms shows the same trend. This demonstrates that our proposed struc2gauss is able to model the uncertainties of learned node representations. It is interesting that struc2gauss with expected likelihood and diagonal covariance (s2g_el_d) always has the lowest average variance, while struc2gauss with KL divergence and diagonal covariance (s2g_kl_d) always has the largest value. This may result from how the different energy functions measure the distance between two distributions during learning. To clarify the results, we also list the NMI for the clustering task in Tables 5 and 6. Compared to the original Gaussian embedding method, we again show the effectiveness of our method in preserving structural roles and modeling uncertainties.
5.6 Influence of Similarity Measures
As mentioned above, not all structural similarity measures can capture the global structural role information. To validate the rationale for selecting RoleSim as the similarity measure for structural role information, we investigate the influence of different similarity measures on learning node representations. Specifically, we select two other widely used structural similarity measures, i.e., SimRank (jeh2002simrank) and MatchSim (lin2009matchsim), and we incorporate these measures by replacing RoleSim in our framework. The datasets and evaluation metrics used in this experiment are the same as in Section 5.3. For simplicity, we only show the results of struc2gauss using KL divergence with spherical covariance in this experiment, because different variants perform similarly in previous experiments.
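Table 4: NMI for node clustering using different similarity measures.

| Similarity measure | Brazilair | Europeair | USAair |
|---|---|---|---|
| SimRank | 0.1695 | 0.0524 | 0.0887 |
| MatchSim | 0.3534 | 0.2389 | 0.0913 |
| RoleSim | 0.5675 ± 0.032 | 0.3280 ± 0.019 | 0.3217 ± 0.023 |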
The NMI values for networks with labels are shown in Table 4 and the goodness-of-fit values are shown in Fig. 8. We can come to the following conclusions:

RoleSim outperforms the other two similarity measures in both types of networks, with and without clustering labels. This indicates that RoleSim can better capture the global structural information. The performance of MatchSim varies across networks and is similar to that of struc2vec. Thus, it can capture the global structural information to some extent.

SimRank performs worse than the other similarity measures as well as struc2vec (Table 3). SimRank is built on the basic assumption that “two objects are similar if they relate to similar objects”, so it also computes similarity via relations between nodes. Its mechanism is therefore similar to that of random walk based methods, which have been shown to be incapable of capturing the global structural information (lyu2017enhancing).
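The locality of SimRank can be seen in a small sketch of its standard iterative form on an undirected graph. The decay factor 0.8 and the iteration count below are illustrative choices, not the settings used in our experiments, and the function name is ours.

```python
def simrank(neighbors, decay=0.8, iterations=10):
    """Naive iterative SimRank on an undirected graph (illustrative sketch).

    neighbors: dict mapping each node to a list of adjacent nodes.
    """
    nodes = list(neighbors)
    sim = {u: {v: 1.0 if u == v else 0.0 for v in nodes} for u in nodes}
    for _ in range(iterations):
        new = {u: {} for u in nodes}
        for u in nodes:
            for v in nodes:
                if u == v:
                    new[u][v] = 1.0
                elif not neighbors[u] or not neighbors[v]:
                    new[u][v] = 0.0
                else:
                    # Average similarity over all pairs of neighbors.
                    total = sum(sim[a][b] for a in neighbors[u] for b in neighbors[v])
                    new[u][v] = decay * total / (len(neighbors[u]) * len(neighbors[v]))
        sim = new
    return sim


# Two disconnected stars with identical structure.
two_stars = {
    "A": ["a1", "a2"], "a1": ["A"], "a2": ["A"],
    "B": ["b1", "b2"], "b1": ["B"], "b2": ["B"],
}
scores = simrank(two_stars)
hub_similarity = scores["A"]["B"]     # same role, no connecting path
leaf_similarity = scores["a1"]["a2"]  # leaves sharing the same hub
```

The two hubs play exactly the same role but receive a similarity of zero because no path connects them, whereas two leaves attached to the same hub score highly. This is the behavior expected of a proximity-driven measure rather than a role-driven one.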
5.7 Parameter Sensitivity
We consider two types of parameters in struc2gauss: (1) parameters also used in other NE methods, including latent dimensions, number of samples per node and number of positive/negative nodes per node; and (2) parameters only used in Gaussian embedding, including the mean constraint and the covariance constraint (note that we fix the minimal covariance to be 0.5 for simplicity). To evaluate how changes to these parameters affect performance, we conducted the same node clustering experiment on the labeled USAair network introduced in Section 5.3. In the interest of brevity, we tune one parameter at a time while fixing all others. Specifically, the number of latent dimensions varies from 10 to 200, the number of samples varies from 5 to 15 and the number of positive/negative nodes varies from 40 to 190. The mean constraint ranges from 1 to 10, and the covariance constraint ranges from 1 to 10.
The results of parameter sensitivity are shown in Fig. 10 and Fig. 11. It can be observed from Fig. 10 (a) and 10 (b) that the trends are relatively stable, i.e., the performance is insensitive to changes in the representation dimension and the number of samples. Clustering performance improves as the number of positive/negative nodes increases, as shown in Fig. 10 (c). Therefore, we can conclude that struc2gauss is more stable than other methods. It has been reported that other methods, e.g., DeepWalk (perozzi2014deepwalk), LINE (tang2015line) and node2vec (grover2016node2vec), are sensitive to many parameters. In general, more dimensions, more walks and more context can achieve better performance. However, it is difficult to search for the best combination of parameters in practice and it may also lead to overfitting. For Gaussian embedding specific parameters and , both trends are stable, i.e., the selection of these constraints has little effect on the performance. The NMI decreases with a larger mean constraint , but the difference is small.
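Table 5: NMI for the clustering task on Brazilair with different numbers of noisy edges.

| # noisy edges | 0 | 50 | 100 | 150 | 200 | 250 | 300 |
|---|---|---|---|---|---|---|---|
| graph2gauss | 0.1204 | 0.1032 | 0.0903 | 0.0913 | 0.0852 | 0.0833 | 0.0683 |
| s2g_el_d | 0.5615 | 0.5165 | 0.5161 | 0.5122 | 0.4810 | 0.4754 | 0.4787 |
| s2g_el_s | 0.5396 | 0.4338 | 0.4180 | 0.4152 | 0.4102 | 0.3956 | 0.3924 |
| s2g_kl_d | 0.5527 | 0.5186 | 0.5036 | 0.4940 | 0.4824 | 0.4736 | 0.4103 |
| s2g_kl_s | 0.5527 | 0.5310 | 0.5214 | 0.4951 | 0.4895 | 0.4621 | 0.4651 |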
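Table 6: NMI for the clustering task on Europeair with different numbers of noisy edges.

| # noisy edges | 0 | 500 | 1000 | 1500 | 2000 | 2500 | 3000 |
|---|---|---|---|---|---|---|---|
| graph2gauss | 0.1109 | 0.0776 | 0.0727 | 0.0716 | 0.0634 | 0.0702 | 0.0613 |
| s2g_el_d | 0.3234 | 0.1767 | 0.1634 | 0.1694 | 0.1492 | 0.1431 | 0.1413 |
| s2g_el_s | 0.2974 | 0.1613 | 0.1505 | 0.1432 | 0.1452 | 0.1285 | 0.1042 |
| s2g_kl_d | 0.3145 | 0.2664 | 0.2014 | 0.1854 | 0.1802 | 0.1634 | 0.1361 |
| s2g_kl_s | 0.3280 | 0.3024 | 0.2930 | 0.1504 | 0.1514 | 0.1414 | 0.1367 |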
5.8 Efficiency and Effectiveness Study
As discussed in Section 4.5, high computational complexity is one of the major issues of our method. In this experiment, we empirically study this issue by comparing the runtime and performance of struc2gauss, different baselines that preserve global structure, and a heuristic method that accelerates the RoleSim computation. The heuristic method, named Fast struc2gauss, is introduced in Section 4.5: we set the similarity to 0 if two nodes have a large difference in degrees, which avoids further computation for dissimilar node pairs. For simplicity, we only test struc2gauss with KL divergence and spherical covariance. Also, we only consider embedding methods that can preserve the structural role information as baselines, i.e., GraphWave, struc2vec and DRNE.
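The pruning idea behind the heuristic can be sketched as follows. The absolute-difference test and its threshold `max_gap` are hypothetical, since the exact criterion for a "large difference in degrees" is the one given in Section 4.5, and `similarity` stands in for whichever pairwise measure is used.

```python
def pruned_similarity(degrees, similarity, max_gap=2):
    """Evaluate `similarity` only for node pairs with comparable degrees.

    degrees: dict mapping node -> degree.
    similarity: callable (u, v) -> float, the expensive pairwise measure.
    max_gap: hypothetical threshold on the absolute degree difference.
    Returns the pairwise scores and the number of expensive evaluations.
    """
    nodes = sorted(degrees)
    scores = {}
    evaluated = 0
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            if abs(degrees[u] - degrees[v]) > max_gap:
                scores[(u, v)] = 0.0  # treated as dissimilar without computing
            else:
                scores[(u, v)] = similarity(u, v)
                evaluated += 1
    return scores, evaluated


toy_degrees = {0: 1, 1: 2, 2: 10}
toy_scores, toy_evaluated = pruned_similarity(toy_degrees, lambda u, v: 0.5)
```

In an iterative measure such as RoleSim, the same degree test would decide which pairs are updated in each iteration. The saving therefore grows with the fraction of pairs that are pruned.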
We conduct the experiments on the larger networks without ground-truth labels because on smaller networks the runtime differences are not significant. The runtime comparison is shown in Table 7 and the performance comparison is shown in Table 8. NA in these tables indicates that the method reported a memory error and did not obtain any results. To make a fair comparison, all methods are run on the same machine with 128GB memory, and the GPU was not used for DRNE. From these results, it can be observed that: (1) although the computational issue still exists, our method can achieve good performance compared to state-of-the-art structural role preserving network embedding methods such as GraphWave and struc2vec. (2) Although DRNE is much faster, its performance is worse than our method and other baselines. Moreover, it is incapable of modeling uncertainties. (3) Fast struc2gauss can effectively accelerate the RoleSim computation and achieve comparable performance in role clustering.
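Table 7: Runtime comparison.

| Network | GraphWave | struc2vec | DRNE | struc2gauss | Fast struc2gauss |
|---|---|---|---|---|---|
| Arxiv | 90.68s | 10+h | 159.43s | 2h | 886.93s |
| Advogato | 172.13s | 10+h | 191.52s | 4h | 1962.68s |
| Hamsterster | 24.25s | 10+h | 85.93s | 1h | 456.24s |
| anybeat | NA | NA | 1094.64s | 13h | 5h |
| Epinion | NA | NA | 2938.83s | 20h+ | 12h |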
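Table 8: Performance comparison.

| Network | GraphWave | struc2vec | DRNE | struc2gauss | Fast struc2gauss |
|---|---|---|---|---|---|
| Arxiv | 0.5435 | 0.3674 | 0.6822 | 0.1880 | 0.1983 |
| Advogato | 0.3938 | 0.2751 | 0.6102 | 0.1852 | 0.2012 |
| Hamsterster | 0.3385 | 0.1878 | 0.5939 | 0.1666 | 0.1790 |
| anybeat | NA | NA | 0.5639 | 0.1597 | 0.1622 |
| Epinion | NA | NA | 0.4978 | 0.2270 | 0.2452 |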
6 Discussion
The proposed struc2gauss is a flexible framework for node representations. As shown in Fig. 2, different similarity measures can be incorporated into this framework; empirical studies were presented in Section 5.6. Furthermore, other types of methods which model structural information can be utilized in struc2gauss as well.
To illustrate the potential to incorporate different methods, we categorize different methods for capturing structural information into three types:

Similarity-based methods. Similarity-based methods calculate pairwise similarity based on the structural information of a given network. Related work has been reviewed in Section 2.2.

Ranking-based methods. PageRank (page1999pagerank) and HITS (kleinberg1999authoritative) are the two most representative ranking-based methods that learn structural information. PageRank has been used for NE in (ma2017preserving).

Partition-based methods. Methods of this type, e.g., role discovery methods such as REGE (borgatti1993two) and RolX (henderson2012rolx), aim to partition nodes into disjoint or overlapping groups.
In this paper, we focus on similarity-based methods. For ranking-based methods, we can use a fixed sliding window on the ranking list, so that for a given node the nodes within the window can be viewed as its context. In fact, this mechanism is similar to DeepWalk. For partition-based methods, we can consider the nodes in the same group as the context for each other.
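Both alternatives amount to a different way of generating the structural context of a node. A minimal sketch under our own naming follows: `scores` would come from a ranking method such as PageRank, `groups` from a role discovery method, and the window size is an illustrative parameter.

```python
def ranking_context(scores, window=2):
    """Context of a node = nodes within `window` positions in the ranking list."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    context = {}
    for pos, node in enumerate(ranked):
        lo = max(0, pos - window)
        hi = min(len(ranked), pos + window + 1)
        context[node] = [ranked[k] for k in range(lo, hi) if k != pos]
    return context


def partition_context(groups):
    """Context of a node = all other members of the groups it belongs to."""
    context = {}
    for members in groups:
        for node in members:
            context.setdefault(node, [])
            context[node].extend(m for m in members if m != node)
    return context


rank_ctx = ranking_context({"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}, window=1)
part_ctx = partition_context([["a", "b"], ["b", "c"]])  # overlapping groups
```

Either context could then replace the similarity-based context as input to the Gaussian embedding step, leaving the rest of the framework unchanged.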
7 Conclusions and Future Work
Two major limitations exist in previous NE studies: structure preservation and uncertainty modeling. Random-walk based NE methods fail to capture global structural information, and methods that represent a node as a point vector are not capable of modeling the uncertainties of node representations.
We proposed a flexible structure preserving network embedding framework, struc2gauss, to tackle these limitations. On the one hand, struc2gauss learns node representations based on structural similarity measures so that global structural information can be taken into consideration. On the other hand, struc2gauss utilizes Gaussian embedding to represent each node as a Gaussian distribution where the mean indicates the position of this node in the embedding space and the covariance represents its uncertainty.
We experimentally compared three different structural similarity measures for networks and two different energy functions for Gaussian embedding. By conducting experiments from different perspectives, we demonstrated that struc2gauss excels in capturing global structural information, compared to state-of-the-art NE techniques such as DeepWalk, node2vec and struc2vec. It outperforms other competitor methods in the role discovery task and in structural role classification on several real-world networks. It also overcomes the limitation of uncertainty modeling and is capable of capturing different levels of uncertainties. Additionally, struc2gauss is less sensitive to different parameters, which makes it more stable in practice without requiring additional effort in parameter tuning.
In the future, we will explore faster RoleSim measures for more scalable NE methods, for example, a fast method to select the most similar nodes for a given node. Also, it is a promising research direction to investigate strategies other than structural similarity for modeling global structural information in NE tasks. Besides, other future investigations in this area include learning node representations in dynamic and temporal networks.
References
Footnotes
 email: {y.pei.1,x.du,g.h.l.fletcher,m.pechenizkiy}@tue.nl
 email: zjp@ndsc.com.cn
 https://bitbucket.org/paulpei/struc2gauss/src/master/
 http://networkrepository.com/index.php