Abstract
Stochastic blockmodels (SBM) and their variants, e.g., mixed-membership and overlapping stochastic blockmodels, are latent-variable-based generative models for graphs. They have proven to be successful for various tasks, such as discovering the community structure and link prediction on graph-structured data. Recently, graph neural networks, e.g., graph convolutional networks, have also emerged as a promising approach to learn powerful representations (embeddings) for the nodes in the graph, by exploiting graph properties such as locality and invariance. In this work, we unify these two directions by developing a sparse variational autoencoder for graphs that retains the interpretability of SBMs, while also enjoying the excellent predictive performance of graph neural networks. Moreover, our framework is accompanied by a fast recognition model that enables fast inference of the node embeddings (which are of independent interest for inference in SBM and its variants). Although we develop this framework for a particular type of SBM, namely the overlapping stochastic blockmodel, the proposed framework can be adapted readily for other types of SBMs. Experimental results on several benchmarks demonstrate encouraging results on link prediction while learning an interpretable latent structure that can be used for community discovery.
Stochastic Blockmodels meet Graph Neural Networks
Nikhil Mehta ^{1 } Lawrence Carin ^{1 } Piyush Rai ^{2 }
1. Introduction
Learning the latent structure in graph-structured data (Fortunato, 2010; Goldenberg et al., 2010; Schmidt & Morup, 2013) is an important problem in a wide range of domains, such as social and biological network analysis and recommender systems. These latent structures help in discovering the underlying communities in the network, as well as in predicting potential links between nodes. Latent space models (Hoff et al., 2002) and their structured extensions, such as the stochastic blockmodel (Nowicki & Snijders, 2001) and variants like the infinite relational model (IRM) (Kemp et al., 2006), the mixed-membership stochastic blockmodel (MMSB) (Airoldi et al., 2008), and the overlapping stochastic blockmodel (OSBM) (Miller et al., 2009a; Latouche et al., 2011a), accomplish this by learning low-dimensional, interpretable node embeddings defined via structured latent variables. These embeddings can be used to identify the community membership(s) of each node in the graph, as well as for tasks such as link prediction.
The overlapping stochastic blockmodel (OSBM), also known as the latent feature relational model (LFRM), is a particularly appealing model for relational data (Miller et al., 2009a; Latouche et al., 2011a; Zhu et al., 2016). The OSBM/LFRM models each node in the graph as belonging to one or more communities using a binary membership vector, and defines the link probability between any pair of nodes as a bilinear function of their community membership vectors. Despite its appealing properties, the OSBM/LFRM has a number of limitations. In particular, although usually considered to be more expressive (Miller et al., 2009a) than models such as the IRM and MMSB, a single layer of binary node embeddings and the bilinear model for link generation can still limit the expressiveness of the OSBM/LFRM. Moreover, it has a challenging inference procedure, which primarily relies on MCMC (Miller et al., 2009a; Latouche et al., 2011a) or mean-field variational inference (Zhu et al., 2016). Although recent models have tried to improve the expressiveness of the OSBM/LFRM, e.g., by assuming a deep hierarchy of binary-vector-based node embeddings (Hu et al., 2017), inference in such models remains intractable, requiring expensive MCMC-based inference. It is therefore desirable to have a model that retains the basic spirit of the OSBM/LFRM (i.e., easy interpretability and strong link-prediction performance), but with greater expressiveness, and a simpler and scalable inference procedure.
Motivated by these desiderata, we develop a deep generative framework for graph-structured data that inherits the easy interpretability of overlapping stochastic blockmodels, but is much more expressive and enjoys a fast inference procedure. Our framework is based on a novel, sparse variant of the variational autoencoder (VAE) (Kingma & Welling, 2013), designed to model graph-structured data. Our VAE-based setup comprises a nonlinear generator/decoder for the graph and a nonlinear encoder based on the graph convolutional network (GCN) (Kipf & Welling, 2016a) (although other graph neural networks can also be used). Our framework posits each node of the graph to have an embedding in the form of a sparse latent representation (modeled by a Beta-Bernoulli process (Griffiths & Ghahramani, 2011), which also enables learning the size of the embeddings). The generator/decoder part of the VAE models the probability of a link between two nodes via a nonlinear function (defined by a deep neural network) of their associated embeddings. The encoder part of the VAE consists of a fast recognition model designed by leveraging reparameterization methods for the Beta and Bernoulli distributions (Maddison et al., 2017; Nalisnick & Smyth, 2017). The recognition model, based on stochastic gradient variational Bayes (SGVB) inference, enables fast inference of the node embeddings. In contrast, traditional stochastic blockmodels rely on iterative MCMC or variational inference procedures for inferring the node embeddings. Consequently, the SGVB inference algorithm we develop is also of independent interest, since the recognition model enables fast inference of the node embeddings in single-layer overlapping stochastic blockmodels.
We first introduce notation and then briefly describe the overlapping stochastic blockmodel (OSBM) (Latouche et al., 2011a; Miller et al., 2009a; Zhu et al., 2016). As described in the next section, our deep generative model enriches the OSBM by endowing it with a deep architecture based on a sparse variational autoencoder, and a fast inference algorithm based on a recognition model. We assume that the graph is given as an adjacency matrix $A \in \{0,1\}^{N \times N}$, where $N$ denotes the number of nodes. We assume $A_{ij} = 1$ if there exists a link between node $i$ and node $j$, and $A_{ij} = 0$ otherwise. In addition to $A$, for each node we may also be provided node features. These are given in the form of an $N \times D$ matrix $X$, with row $x_n$ being the node features for node $n$, and $D$ being the number of observed features.
The OSBM (Latouche et al., 2011a; Miller et al., 2009a; Zhu et al., 2016) is a stochastic blockmodel for networks; it assumes each node $n$ has an associated binary vector (node embedding) $z_n \in \{0,1\}^K$, also termed a latent feature vector. Within the node embedding, $z_{nk} = 1$ denotes that node $n$ belongs to cluster/community $k$, and $z_{nk} = 0$ otherwise. The OSBM allows each node to simultaneously belong to multiple communities, and defines the link probability between two nodes via a bilinear function of their latent feature vectors
$p(A_{ij} = 1 \mid z_i, z_j) = \sigma\left(z_i^\top W z_j\right) \qquad (1)$
where entry $W_{k\ell}$ of the matrix $W \in \mathbb{R}^{K \times K}$ affects the probability of a link between node $i$ and node $j$ belonging to community $k$ and community $\ell$, respectively, and $\sigma(\cdot)$ is the sigmoid function.
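As a concrete illustration, the bilinear link probability in Eq. (1) can be sketched in a few lines of NumPy; the toy embeddings and weight matrix below are hypothetical, not values from the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def osbm_link_prob(z_i, z_j, W):
    """Bilinear OSBM link probability of Eq. (1): sigma(z_i^T W z_j)."""
    return sigmoid(z_i @ W @ z_j)

# Two nodes that share community 0 (toy example).
z_i = np.array([1.0, 0.0, 1.0])
z_j = np.array([1.0, 1.0, 0.0])
W = np.eye(3) * 2.0  # diagonal W: shared communities raise the link odds
p = osbm_link_prob(z_i, z_j, W)  # sigma(2.0), roughly 0.88
```

With a diagonal $W$, the logit simply counts shared communities, which is the usual intuition behind this likelihood.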
The nonparametric latent feature relational model (LFRM) is a specific type of OSBM that leverages the Indian Buffet Process (IBP) prior (Miller et al., 2009a) on the binary matrix of node-community membership vectors. Use of the IBP enables learning the number of communities. Inference in the LFRM/OSBM is typically performed via MCMC or variational inference (Miller et al., 2009a; Latouche et al., 2011a; Zhu et al., 2016), which tends to be slow and often cannot scale easily beyond a few hundred nodes.
We now present our sparse-VAE-based deep generative framework for the overlapping stochastic blockmodel. The proposed architecture, depicted in Fig. 1 (left), associates each link $A_{ij}$ with two latent embeddings $z_i$ and $z_j$ (for the nodes associated with this link). Each link probability is modeled as a nonlinear function of the embeddings of its associated nodes. Unlike the standard VAE, which assumes dense, Gaussian-distributed embeddings, we impose sparsity on the node embeddings, since we wish to use them to also infer the community membership(s) of each node (one of the goals of stochastic blockmodels). This is done by modeling them as a sparse vector of the form $z_n = b_n \odot r_n$, where $b_n \in \{0,1\}^K$ is a binary vector modeled using a stick-breaking process prior (Teh et al., 2007) and $r_n \in \mathbb{R}^K$ is a real-valued vector with a Gaussian prior. Modeling $b_n$ using the stick-breaking prior enables learning the node embedding size from data. Note that, unlike the OSBM/LFRM, which assumes the node embedding to be a strictly binary vector, our framework models it as a sparse real-valued vector, providing a more flexible and informative representation for the nodes. In particular, this enables inference of not just the node's membership in communities, but also the strength of the membership in each of the communities the node belongs to. Specifically, $b_{nk}$ denotes whether node $n$ belongs to community $k$ or not, and $r_{nk}$ denotes the strength of that membership.
Given the node embeddings $\{z_n\}_{n=1}^{N}$, the VAE decoder generates each link in the graph as $A_{ij} \sim p_\theta(A_{ij} \mid z_i, z_j)$, where the probability distribution $p_\theta$ defines a decoder or generator model for the graph. This decoder can consist of one or more layers of deterministic nonlinear transformations of the node embeddings. Denoting the overall transformation for a node embedding $z_n$ as $f(z_n)$, we model the probability of a link as $p(A_{ij} = 1 \mid z_i, z_j) = \sigma\left(f(z_i)^\top f(z_j)\right)$, where the nonlinear function $f(\cdot)$ can be modeled by a deep neural network (in our experiments, we use a deep neural net with each hidden layer having leaky-ReLU nonlinearity). Figure 1 (left) depicts the generator.
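A minimal sketch of such a nonlinear decoder, assuming a one-hidden-layer MLP with leaky-ReLU followed by a sigmoid of the inner product of the transformed embeddings; the layer sizes and random weights below are hypothetical, for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def decoder_link_prob(z_i, z_j, W1, W2):
    """Nonlinear decoder sketch: pass each embedding through a small MLP
    (leaky-ReLU hidden layer), then sigmoid of the inner product."""
    f = lambda z: leaky_relu(z @ W1) @ W2
    logit = f(z_i) @ f(z_j)
    return 1.0 / (1.0 + np.exp(-logit))

K, H, D_out = 8, 16, 4  # embedding, hidden, output sizes (hypothetical)
W1 = rng.normal(scale=0.1, size=(K, H))
W2 = rng.normal(scale=0.1, size=(H, D_out))
b = (rng.random(K) < 0.3).astype(float)   # sparse memberships b_n
r = rng.normal(size=K)                    # membership strengths r_n
z_i, z_j = b * r, b * rng.normal(size=K)  # z_n = b_n * r_n (elementwise)
p = decoder_link_prob(z_i, z_j, W1, W2)   # a probability in (0, 1)
```

In a real implementation the weights would be trained end-to-end with the encoder via the ELBO; here they are random just to show the shape of the computation.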
We model the binary vector $b_n$, denoting node-community memberships, using the stick-breaking construction of the IBP (Teh et al., 2007), which enables learning of the effective number of communities by specifying a sufficiently large truncation level $K$. The stick-breaking construction is given as follows
$v_k \sim \mathrm{Beta}(\alpha, 1), \qquad \pi_k = \prod_{j=1}^{k} v_j \qquad (2)$
$b_{nk} \sim \mathrm{Bernoulli}(\pi_k), \qquad k = 1, \ldots, K \qquad (3)$
We further assume a Gaussian prior on the membership strengths, i.e., $r_n \sim \mathcal{N}(0, \sigma^2 I_K)$.
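The generative process for one node embedding under this prior (the truncated stick-breaking construction of Eqs. (2)-(3), plus the Gaussian on $r_n$) can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_node_embedding(alpha=2.0, K=20, sigma=1.0):
    """Sample z_n = b_n * r_n under the truncated stick-breaking IBP prior."""
    v = rng.beta(alpha, 1.0, size=K)        # stick proportions v_k ~ Beta(alpha, 1)
    pi = np.cumprod(v)                      # pi_k = prod_{j<=k} v_j (non-increasing)
    b = (rng.random(K) < pi).astype(float)  # b_nk ~ Bernoulli(pi_k)
    r = rng.normal(0.0, sigma, size=K)      # r_nk ~ N(0, sigma^2)
    return b * r, pi

z, pi = sample_node_embedding()
# pi decreases with k, so higher-indexed communities are used less often,
# which is what lets the model infer the effective number of communities.
```

The truncation level `K` and concentration `alpha` here are illustrative choices, not the paper's experimental settings.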
We employ a nonlinear encoder to infer the node embedding for each node, using a fast non-iterative recognition model (Kingma & Welling, 2013). Denoting the parameters of the variational posterior for the embeddings of all the nodes collectively as $\phi$, we approximate the model's true posterior with a variational posterior of the form $q_\phi(Z \mid A, X)$. For simplicity, we consider a mean-field approximation, which allows us to factorize the posterior as $q_\phi(Z) = \prod_{n=1}^{N} q(v_n)\, q(b_n)\, q(r_n)$. Our nonlinear encoder, as shown in Fig. 1 (right), assumes variational distributions on the local variables of each node, $v_n$, $b_n$, and $r_n$, and defines the variational parameters of these distributions as the outputs of a graph convolutional network (GCN) (Kipf & Welling, 2016a), which takes as input the network $A$ and the node feature matrix $X$. The GCN has recently emerged as a flexible encoder of graph-structured data (similar in spirit to convolutional neural networks for images), which makes it an ideal choice for the encoder in our VAE-based generative model for graphs. The forward propagation rule for each layer of the GCN is defined as $H^{(l+1)} = f(\tilde{A} H^{(l)} W^{(l)})$, where $H^{(0)} = X$ ($H^{(0)} = I_N$ when no side information is present), $W^{(l)}$ is the weight matrix of layer $l$, $f$ is the nonlinear activation, and $\tilde{A}$ is the symmetric normalization of the adjacency matrix $A$. Although here we have used the vanilla GCN in our architecture, more general variants of the GCN, such as GraphSAGE (Hamilton et al., 2017), can also be used as the encoder. The variational distributions have the following forms
$q(v_{nk}) = \mathrm{Beta}(c_{nk}, d_{nk}) \qquad (4)$
$q(b_{nk}) = \mathrm{Bernoulli}(\pi_{nk}), \qquad \pi_{nk} = \prod_{j=1}^{k} v_{nj} \qquad (5)$
$q(r_n) = \mathcal{N}(\mu_n, \mathrm{diag}(\sigma_n^2)) \qquad (6)$
where $c_{nk}$, $d_{nk}$, $\mu_n$, and $\sigma_n$ are outputs of a GCN. We use the stochastic gradient variational Bayes (SGVB) algorithm (Kingma & Welling, 2013) to infer the parameters of the variational distributions. Details on reparameterization and the loss formulation are provided below.
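A minimal sketch of one GCN propagation step with the symmetric normalization described above; the ReLU activation, random weights, and the tiny path graph are illustrative choices, not the paper's configuration:

```python
import numpy as np

def normalize_adj(A):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} used by the GCN."""
    A_hat = A + np.eye(A.shape[0])      # add self-loops
    d = A_hat.sum(axis=1)               # degrees of A + I
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def gcn_layer(A_norm, H, W):
    """One propagation step: H^{(l+1)} = f(A_norm @ H^{(l)} @ W^{(l)})."""
    return np.maximum(A_norm @ H @ W, 0.0)  # f = ReLU here

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)   # 3-node path graph
X = rng.normal(size=(3, 5))              # node features, H^{(0)} = X
W0 = rng.normal(size=(5, 4))
H1 = gcn_layer(normalize_adj(A), X, W0)  # shape (3, 4)
```

In the encoder, the outputs of the final GCN layer would be split (or passed through separate heads) to produce the variational parameters in Eqs. (4)-(6).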
Existing models for graph-structured data can be seen as special cases of our framework. Recall that we model the node embeddings as $z_n = b_n \odot r_n$, and our generative model is of the form $p_\theta(A_{ij} \mid z_i, z_j)$. If we ignore the community-strength latent variable $r_n$, i.e., $z_n$ is defined simply as $z_n = b_n$ (just a binary vector), and further define $p_\theta(A_{ij} \mid z_i, z_j)$ as a Bernoulli distribution with its probability being a bilinear function of the embeddings $z_i$ and $z_j$, then we recover the OSBM/LFRM (Latouche et al., 2011a; Miller et al., 2009a). Note, however, that while the OSBM/LFRM typically relies on MCMC or variational inference, our framework can leverage SGVB for efficient inference.
Likewise, if we define $z_n = r_n$, i.e., a dense real-valued vector, and define $p_\theta(A_{ij} \mid z_i, z_j)$ as a Bernoulli distribution with its probability being a bilinear function of the embeddings, we recover the Eigenmodel or latent-space model (LSM) (Hoff et al., 2002). Note that this model cannot infer community memberships, since the binary vector $b_n$ is not present. Finally, if $p_\theta(A_{ij} \mid z_i, z_j)$ is a Bernoulli distribution with its probability being a nonlinear function of the embeddings, then we recover the VGAE model (Kipf & Welling, 2016b), which can also be seen as a nonlinear extension of the LSM. Moreover, note that a key limitation of the LSM and VGAE is that they cannot be used to infer the community structure (due to the non-sparse nature of the embeddings) and usually can only be used for link-prediction tasks.
We define the factorized variational posterior as $q_\phi(Z \mid A, X) = \prod_{n=1}^{N} q(v_n)\, q(b_n)\, q(r_n)$, where the variational parameters are a function of the GCN encoder, with inputs $A$ and $X$. We define the loss function, parameterized by the inference network (encoder) parameters $\phi$ and generator parameters $\theta$, by minimizing the negative of the evidence lower bound (ELBO)
$\mathcal{L}(\phi, \theta) = -\,\mathbb{E}_{q_\phi(Z)}\!\left[\log p_\theta(A \mid Z)\right] - \mathbb{E}_{q_\phi(Z)}\!\left[\log p_\theta(X \mid Z)\right] + \mathrm{KL}\!\left(q_\phi(Z)\,\|\,p(Z)\right) \qquad (7)$
where $\mathrm{KL}(q\,\|\,p)$ is the Kullback-Leibler divergence between $q_\phi(Z)$ and the prior $p(Z)$. Note that here we have also included the loss from the reconstruction of the side information $X$. We assume that the side information $X$ and the links $A$ are conditionally independent given the node embeddings $Z$. When there is no side information, we can ignore the reconstruction term for $X$ in the loss function. For the encoder and decoder parameters we infer point estimates, while we learn a full variational distribution over the latent variables $Z$.
Our variational autoencoder for link generation is trained using stochastic gradient variational Bayes (SGVB) (Kingma & Welling, 2013). SGVB can be used to perform inference for a broad class of non-conjugate models and is therefore appealing for Bayesian nonparametric models, such as those based on the stick-breaking priors we use in our framework. SGVB uses differentiable Monte Carlo (MC) expectations to learn the model parameters. Specifically, it requires a differentiable, non-centered parameterization (DNCP) (Kingma & Welling, 2014) to allow backpropagation. However, our model has expectations over Beta and Bernoulli distributions, neither of which permits easy reparameterization as required by SGVB. We leverage recent developments on reparameterizing these distributions (Maddison et al., 2017; Nalisnick & Smyth, 2017), which consequently leads to a simple inference procedure.
Following (Nalisnick & Smyth, 2017), we approximate the Beta posterior in (4) with the Kumaraswamy distribution, defined as $p(x; a, b) = a\,b\,x^{a-1}(1 - x^a)^{b-1}$ for $x \in (0, 1)$ and $a, b > 0$. The closed-form inverse CDF allows easy reparameterization, and samples for $v$ (with parameters $a$ and $b$) can be drawn using:
$v = \left(1 - u^{1/b}\right)^{1/a}, \qquad u \sim \mathrm{Uniform}(0, 1) \qquad (8)$
We compute the KL divergence between the Kumaraswamy and the Beta distributions by taking a finite approximation of the infinite sum, as described in (Nalisnick & Smyth, 2017).
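The inverse-CDF sampler of Eq. (8) can be sketched directly; the parameter values below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_kumaraswamy(a, b, size=None):
    """Reparameterized Kumaraswamy sample via the closed-form inverse CDF:
    v = (1 - u^{1/b})^{1/a}, with u ~ Uniform(0, 1)."""
    u = rng.random(size)
    return (1.0 - u ** (1.0 / b)) ** (1.0 / a)

v = sample_kumaraswamy(2.0, 5.0, size=10000)
# All samples lie in [0, 1]. Because the sample is a deterministic,
# differentiable function of (a, b) given u, gradients flow through
# the variational parameters, unlike direct Beta sampling.
```

In the model, `a` and `b` would be the GCN-predicted variational parameters $c_{nk}$ and $d_{nk}$.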
For the Bernoulli random variables, we use the Binary Concrete distribution (Maddison et al., 2017; Jang et al., 2017) at training time as a continuous relaxation, yielding biased but low-variance estimates of the gradient. The KL divergence between two Bernoulli distributions is correspondingly relaxed using two Binary Concrete distributions.
We reparameterize $b_{nk}$, defined by a Bernoulli with probability $\pi_{nk}$ (in (3) and (5)), as:
$b_{nk} = \sigma\!\left(\frac{\sigma^{-1}(\pi_{nk}) + \sigma^{-1}(u)}{\tau}\right), \qquad u \sim \mathrm{Uniform}(0, 1) \qquad (9)$
where $\sigma(\cdot)$ is the sigmoid function, $\sigma^{-1}(\cdot)$ is the inverse-sigmoid (logit) function, $\tau$ is the relaxation temperature, and $u$ is a uniform random variable.
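The Binary Concrete relaxation of Eq. (9) can be sketched as follows; the temperature and probability values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(x):
    return np.log(x) - np.log(1.0 - x)

def binary_concrete(pi, tau, size=None):
    """Relaxed Bernoulli(pi) sample:
    b = sigmoid((logit(pi) + logit(u)) / tau), u ~ Uniform(0, 1)."""
    u = rng.random(size)
    return sigmoid((logit(pi) + logit(u)) / tau)

# Low temperature pushes samples toward {0, 1}; at test time one can
# threshold, or draw from the hard Bernoulli instead.
b_soft = binary_concrete(0.9, tau=0.5, size=10000)
frac_near_one = np.mean(b_soft > 0.5)  # concentrates near pi = 0.9
```

Because the sample is a differentiable function of $\pi$ given $u$, gradients can be backpropagated through the membership probabilities during training.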
Structured Mean-Field: Since vanilla mean-field variational inference ignores the posterior dependence among the latent variables, we also considered Structured Stochastic Variational Inference (SSVI) (Hoffman, 2014; Hoffman et al., 2013), which allows global-local parameter dependencies and improves upon the mean-field approximation. We consider the stick variables $v$ (and their variational parameters) as global, and impose a hierarchical structure on $b_n$ by conditioning it on $v$. The variational posterior of our framework using SSVI can then be factorized as $q(Z) = q(v) \prod_{n=1}^{N} q(b_n \mid v)\, q(r_n)$, with the conditional $q(b_n \mid v)$ governed by additional variational parameters to be learned. In practice, we have found structured mean-field to perform better than mean-field, and our model implementation uses the former.
The proposed framework can be seen as bridging two lines of research on modeling graphs: (i) structured latent variable models for graphs, such as stochastic blockmodels and their variants (Kemp et al., 2006; Airoldi et al., 2008; Miller et al., 2009a; Latouche et al., 2011a); and (ii) deep learning models for graphs, such as graph convolutional networks (Kipf & Welling, 2016a). Our effort is motivated by the goal of harnessing their complementary strengths to develop a deep generative stochastic blockmodel for graphs that also enjoys an efficient inference procedure.
The most prominent stochastic blockmodels include models that associate each node with a single community (Nowicki & Snijders, 2001; Kemp et al., 2006), a mixture of communities (Airoldi et al., 2008), or an overlapping set of communities (Miller et al., 2009a; Latouche et al., 2011a; Yang & Leskovec, 2012; Zhou, 2015). While stochastic blockmodels have nice interpretability, they usually model the links of the network as a simple bilinear function of the node embeddings, which may not be able to capture the nonlinear interactions between the nodes (Yan et al., 2011). An approach to model such nonlinear interactions was proposed in (Yan et al., 2011), using a matrix-variate Gaussian process. However, despite the modeling flexibility, inference in this model is challenging and the model is usually infeasible to run on networks with more than 100 nodes.
There is also significant recent interest in non-probabilistic deep learning models for graphs. Some of the prominent works in this direction include DeepWalk (Perozzi et al., 2014) and graph autoencoders (GAE) (Kipf & Welling, 2016a; Hamilton et al., 2017). DeepWalk is inspired by the idea of word embeddings: it treats each node as a "document," by starting a random walk at that node and taking the nodes encountered along the path as the words in that document. It then uses document/word embedding methods to learn the embedding of each node. In contrast, the GAE approaches are based on the idea of graph convolutional networks (GCN) (Kipf & Welling, 2016a). This line of work nicely complements our contribution, since modules like the GCN can be effectively used to design the encoder model for our deep generative framework. In particular, as noted in the model description, our encoder is essentially a GCN. We believe that such advances in graph encoding can be used as modules to design new deep generative models for relational data.
Despite the significant success of deep generative models for images and text, there has been relatively little work on deep generative models for relational data (You et al., 2018; Hu et al., 2017; Wang et al., 2017; Kipf & Welling, 2016b). GraphRNN (You et al., 2018) learns a single representation of an entire graph to model the joint distribution over different graphs. The focus of GraphRNN is on generating small-sized graphs, whereas we focus on link prediction and community detection for a given graph. Among other existing methods, (Hu et al., 2017) proposed an extension of the LFRM via a deep hierarchy of binary latent features for each node. However, this model relies on expensive batch MCMC inference, precluding its applicability to large-scale networks. Another deep latent variable model was proposed recently in (Wang et al., 2017); however, this model also has a difficult, model-specific inference procedure, and its node embeddings are not interpretable. Perhaps the closest in spirit to our work is the recent work on variational graph autoencoders (VGAE) (Kipf & Welling, 2016b). Graphite (Grover et al., 2018) extends the VGAE by using a multi-layer iterative decoder that alternates between message passing and graph refinement. A similar decoding scheme could also be applied in our framework; however, the focus of this work is on learning sparse, interpretable node embeddings. Both VGAE and Graphite are built on top of the standard VAE, and consequently lack the direct interpretability of node embeddings desired by stochastic blockmodels. Our sparse construction leads to a model with different properties and a different inference procedure compared to (Kipf & Welling, 2016b). Moreover, our VAE architecture is nonparametric in nature and can infer the node embedding size.
We report experimental results on several synthetic and real-world datasets to demonstrate the efficacy of our model. Our experimental results include quantitative comparisons on the task of link prediction, as well as qualitative results, such as using the embeddings to discover the underlying communities in the network data. The qualitative results demonstrate the expressiveness of the latent space our model infers, which is the result of the sparse and interpretable embedding learned for each node of the graph. In particular, we show that these sparse embeddings can be interpreted as the memberships, and strengths of membership, of each node in one or more communities.
First, we evaluate our model on link prediction, comparing it with various baselines on several benchmark datasets of moderate (about 2000 nodes) to large (about 20,000 nodes) size. We then analyze the latent structure learned by our model on a synthetic and a real-world co-authorship dataset. We compare the latent structure with the embeddings learned by the variational graph autoencoder (VGAE) (Kipf & Welling, 2016b). We also examine the community structure on the real-world co-authorship dataset, and show that the proposed framework is able to readily capture the underlying communities. We refer to our framework as DGLFRM (Deep Generative Latent Feature Relational Model), which refers to our most general model, with sparse embeddings, a nonlinear generator, and a nonlinear encoder. We also consider a variant of DGLFRM with binary embeddings $z_n = b_n$, which we refer to as DGLFRM-B (the 'B' here denotes "binary"). Note that DGLFRM-B can be seen as a deep generalization of the LFRM (Miller et al., 2009b)/OSBM (Latouche et al., 2011b), with another key difference from the LFRM/OSBM being the fact that we use amortized inference.
For link prediction, we compare the proposed model with four baselines, one of which is a simplified variant of DGLFRM akin to the LFRM (Miller et al., 2009a), which is an overlapping stochastic blockmodel. The original LFRM, which uses MCMC-based inference, was infeasible to run on the datasets used in these experiments. On the other hand, DGLFRM with $z_n = b_n$ and a bilinear decoder (link generation model) is similar to the LFRM, but with much faster SGVB-based inference (we refer to this simplified variant of DGLFRM as LFRM).
Among the other three baselines, Spectral Clustering (SC) and DeepWalk (DW) (Perozzi et al., 2014) learn node embeddings, which we use to compute the link probability as $\sigma(z_i^\top z_j)$. The third baseline is the recently proposed variational graph autoencoder (VGAE) (Kipf & Welling, 2016b). Note that none of these baselines can be used for community detection, since the real-valued embeddings they learn are not interpretable (unlike our model, which learns sparse embeddings, with nonzeros denoting community memberships).
Table 1. AUC-ROC scores for link prediction.

Method  NIPS12  Yeast  Cora  Citeseer  Pubmed

SC
DW
VGAE
LFRM
DGLFRM-B
DGLFRM
Table 2. Average Precision (AP) scores for link prediction.

Method  NIPS12  Yeast  Cora  Citeseer  Pubmed

SC
DW
VGAE
LFRM
DGLFRM-B
DGLFRM
Table 3. Examples of communities inferred by DGLFRM on the NIPS12 co-authorship network (authors ordered by decreasing membership strength).

Cluster  Authors

Probabilistic Modeling  Sejnowski T, Hinton G, Dayan P, Jordan M, Williams C
Reinforcement Learning  Barto A, Singh S, Sutton R, Connolly C, Precup D
Robotics/Vision  Shibata T, Peper F, Thrun S, Giles C, Michel A
Computational Neuroscience  Baldi P, Stein C, Rinott Y, Weinshall D, Druzinsky R
Neural Networks  Pearlmutter B, Abu-Mostafa Y, LeCun Y, Sejnowski T, Tang A
We consider five real-world datasets, three of which include side information in the form of node features, while the other two have only the link information. For the link-prediction experiments, all models are provided a partially observed network (with the unknown part to be predicted). The node features (when available) are provided to all the models. The description of each dataset is as follows:

NIPS12: The NIPS12 co-author network (Zhou, 2015) includes all 2037 authors of NIPS papers from volumes 1–12, with 3134 edges. This network has no side information.

Yeast: The Yeast protein interaction network (Zhou, 2015) has 2361 nodes and 6646 non-self edges. This network has no side information.

Cora: The Cora network is a citation network consisting of 2708 documents. The dataset contains a sparse bag-of-words feature vector of length 1433 for each document; these are used as node features. The network has a total of 5278 links.

Citeseer: Citeseer is a citation network consisting of 3312 scientific publications from six categories: agents, AI, databases, human-computer interaction, machine learning, and information retrieval. The side information for this dataset is the category label of each paper, converted into a one-hot representation. These one-hot vectors are used as node features. The network has a total of 4552 links.

Pubmed: A citation network consisting of 19,717 nodes. The dataset contains a sparse bag-of-words feature vector of length 500 for each document, used as node features. The network has a total of 44,324 links.
We use the area under the ROC curve (AUC) and average precision (AP) to compare our model with the baselines for link prediction. For all datasets, we hold out 10% and 5% of the links as our test and validation sets, respectively, and use the validation set to fine-tune the hyperparameters. We average the AUC-ROC and AP scores over 10 random splits of each dataset. The AUC-ROC scores of our models and the various baselines are shown in Table 1, and the AP scores are shown in Table 2. As the tables show, our models outperform the baselines on almost all datasets. We again highlight that, unlike baselines such as VGAE that cannot learn interpretable embeddings, our model also learns embeddings that can be interpreted as memberships of nodes in communities. The superior results of DGLFRM and DGLFRM-B demonstrate the benefit of our deep generative models. Their significantly better results compared to LFRM also show the benefit of endowing the LFRM with a deep architecture, with a nonlinear decoder and a nonlinear encoder. The hyperparameter settings used for all experiments are included in the Supplementary Material. We also performed an experiment to investigate the model's ability to leverage node features. As expected, the model performs better when using the features than when ignoring them. This experiment is included in the Supplementary Material.
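For reference, the two evaluation metrics can be computed from held-out link labels and predicted link probabilities as follows; this is a self-contained sketch with a toy example (standard libraries such as scikit-learn provide equivalent functions):

```python
import numpy as np

def auc_roc(y_true, y_prob):
    """AUC as the probability a random positive outranks a random negative
    (ties counted as half)."""
    pos = y_prob[y_true == 1]
    neg = y_prob[y_true == 0]
    diffs = pos[:, None] - neg[None, :]
    return np.mean(diffs > 0) + 0.5 * np.mean(diffs == 0)

def average_precision(y_true, y_prob):
    """AP: precision evaluated at the rank of each true positive, averaged."""
    order = np.argsort(-y_prob)
    y = y_true[order]
    precision_at_k = np.cumsum(y) / (np.arange(len(y)) + 1)
    return np.sum(precision_at_k * y) / np.sum(y)

# Toy labels: 1 = held-out edge, 0 = sampled non-edge (hypothetical scores).
y_true = np.array([1, 1, 1, 0, 0, 0])
y_prob = np.array([0.9, 0.8, 0.4, 0.6, 0.2, 0.1])
auc = auc_roc(y_true, y_prob)            # 8 of 9 pairs correctly ordered
ap = average_precision(y_true, y_prob)
```

In the link-prediction experiments, `y_prob` would come from the decoder's predicted probabilities on held-out node pairs.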
To demonstrate the interpretable nature of the embeddings learned by our model, we generate a synthetic dataset with 100 nodes and 10 communities. The dataset is generated by fixing the ground-truth communities (by creating a binary vector for each node) such that some of the nodes belong to the same communities. The adjacency matrix is then generated using a simple inner product, followed by the sigmoid operation (Figure 2a). We train using 85% of the synthetic adjacency matrix for link prediction and for visualizing the latent structure that our model learns. The latent structure obtained using DGLFRM is plotted in Figure 2(b). Figure 2(c) shows that, using only the first two dimensions of the latent structure, we can reconstruct the graph reasonably well. This illustrates an important property of the stick-breaking IBP prior, which encourages the most commonly selected communities (the columns on the left in Figure 2(b)) to be dense, while communities with higher indices (columns on the right) remain sparse. This shows that DGLFRM can learn the effective number of communities for a given graph. Finally, we can quantize the latent space into discrete intervals to extract nodes belonging to different communities. In our experiments, we observed that the learned latent structure is in fact close to the ground-truth community assignments we started with. In Figure 2(d), we compare the community structure from our model with the latent structure obtained by running VGAE. Note that the Gaussian latent structure in VGAE is dense and therefore fails to learn community memberships that are readily interpretable.
We also perform a qualitative analysis on the NIPS12 dataset. Again, we train DGLFRM and VGAE using 85% of the adjacency matrix. Table 3 shows five of the communities inferred by DGLFRM. The authors shown under each community are ordered by the strength of their community memberships (in decreasing order). As Table 3 shows, each of the communities represents a subfield, with authors working on similar topics. Moreover, note that some authors (e.g., Sejnowski) are inferred as belonging to more than one community. This qualitative experiment demonstrates that our model can learn interpretable embeddings that can be used for tasks such as (overlapping) clustering. We have included a visualization of the latent structure learned on the NIPS12 data in the Supplementary Material. Note that our model can infer the number of communities naturally, via the stick-breaking prior. The stick-breaking prior requires specifying a large truncation level on the number of communities; our model then effectively infers the "active" communities for the given truncation level. As shown in Fig. 2(b)-(c), posterior inference in our model is able to "turn off" the unnecessary columns of the binary membership matrix. Although we do not know the ground truth for the number of communities, the number of inferred active communities is similar to what is reported in prior work on nonparametric Bayesian overlapping stochastic blockmodels (Miller et al., 2009a). Note that VGAE embeddings require an additional step (such as K-means clustering) to cluster nodes. Moreover, a method such as K-means cannot detect overlapping communities, and it is also sensitive to the choice of K (the estimated number of communities). For reference, we have included the clustering results on the VGAE embeddings in the Supplementary Material.
We have presented a deep generative framework for overlapping community discovery and link prediction. This work combines the interpretability of stochastic blockmodels, such as the latent feature relational model, with the modeling power of deep generative models. Moreover, leveraging a nonparametric Bayesian prior on the node embeddings enables learning the node embedding size (i.e., the number of communities) from data. Our framework is modular, and a wide variety of decoder and encoder models can be used. In particular, it can leverage recent advances in non-probabilistic autoencoders for graphs, such as the graph convolutional network (Kipf & Welling, 2016a) or its extensions (Hamilton et al., 2017). Inference in the model is based on SGVB, which does not require conjugacy. This further widens the applicability of our framework to different types of networks (e.g., those with weighted or count-valued edges, or power-law degree distributions). We believe this combination of discrete-latent-variable stochastic blockmodels and graph neural networks will help leverage their respective strengths, and will fuel further research and advance the state of the art in (deep) generative modeling of graph-structured data.
Although SGVB inference makes our model fairly efficient, it can be scaled up further for massive networks using minibatch-based inference (Chen et al., 2018). Another possibility for scaling up the model is to replace the Bernoulli-logistic likelihood with a Bernoulli-Poisson link (Zhou, 2015), which enables scaling the model in the number of nonzeros (i.e., the number of edges) in the network. Given that our framework can work with a wide variety of decoder/generator models, such modifications can be made without much difficulty.
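To illustrate why the Bernoulli-Poisson link scales in the number of edges, the sketch below computes the log-likelihood with per-pair work only for the observed edges; the total rate over all pairs is recovered from column sums. This is a hypothetical numpy helper under our own naming (`bp_loglik`, `Z`, `W`), not the paper's implementation.

```python
import numpy as np

def bp_loglik(edges, Z, W, eps=1e-10):
    """Bernoulli-Poisson log-likelihood sketch (hypothetical helper).
    P(A_ij = 1) = 1 - exp(-lam_ij) with lam_ij = z_i @ W @ z_j,
    for nonnegative Z (N x K) and W (K x K).

    Each zero entry contributes -lam_ij, and the total over ALL ordered
    pairs equals m @ W @ m with m the column sums of Z, so per-pair
    computation is needed only for the observed edges."""
    m = Z.sum(axis=0)
    lam_all = m @ W @ m                          # sum of lam_ij over all ordered pairs
    zi, zj = Z[edges[:, 0]], Z[edges[:, 1]]
    lam_e = np.einsum('ek,kl,el->e', zi, W, zj)  # rates only for observed edges
    # log-lik = sum_edges log(1 - exp(-lam)) + sum_non-edges (-lam)
    return np.sum(np.log(1.0 - np.exp(-lam_e) + eps)) + lam_e.sum() - lam_all
```

For a graph with N nodes, E edges, and K communities this costs O(EK^2 + NK + K^2) rather than O(N^2 K^2), which is the property exploited by Zhou (2015).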
Finally, in this work we model each node as having a single binary vector denoting its membership in one or more communities. An interesting extension would be to consider multiple layers of latent variables, which can model a node’s membership in a hierarchy of communities (Ho et al., 2011; Blundell & Teh, 2013; Hu et al., 2017).
Acknowledgements: PR acknowledges support from a Google AI/ML faculty award and a DST-SERB Early Career Research Award. The Duke investigators acknowledge the support of DARPA and ONR.
References
 Airoldi et al. (2008) Airoldi, E. M., Blei, D. M., Fienberg, S. E., and Xing, E. P. Mixed membership stochastic blockmodels. JMLR, 2008.
 Blundell & Teh (2013) Blundell, C. and Teh, Y. W. Bayesian hierarchical community discovery. In NIPS, 2013.
 Chen et al. (2018) Chen, J., Ma, T., and Xiao, C. FastGCN: Fast learning with graph convolutional networks via importance sampling. CoRR, abs/1801.10247, 2018. URL http://arxiv.org/abs/1801.10247.
 Fortunato (2010) Fortunato, S. Community detection in graphs. Physics reports, 486(3):75–174, 2010.
 Goldenberg et al. (2010) Goldenberg, A., Zheng, A. X., Fienberg, S. E., and Airoldi, E. M. A survey of statistical network models. Foundations and Trends® in Machine Learning, 2010.
 Griffiths & Ghahramani (2011) Griffiths, T. L. and Ghahramani, Z. The Indian buffet process: An introduction and review. JMLR, 2011.
 Grover et al. (2018) Grover, A., Zweig, A., and Ermon, S. Graphite: Iterative generative modeling of graphs. arXiv preprint arXiv:1803.10459, 2018.
 Hamilton et al. (2017) Hamilton, W., Ying, Z., and Leskovec, J. Inductive representation learning on large graphs. In NIPS, 2017.
 Ho et al. (2011) Ho, Q., Parikh, A., Song, L., and Xing, E. Multiscale community blockmodel for network exploration. In AISTATS, 2011.
 Hoff et al. (2002) Hoff, P. D., Raftery, A. E., and Handcock, M. S. Latent space approaches to social network analysis. JASA, 2002.
 Hoffman (2014) Hoffman, M. D. Stochastic structured meanfield variational inference. CoRR, abs/1404.4114, 2014.
 Hoffman et al. (2013) Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.
 Hu et al. (2017) Hu, C., Rai, P., and Carin, L. Deep generative models for relational data with side information. In International Conference on Machine Learning, pp. 1578–1586, 2017.
 Jang et al. (2017) Jang, E., Gu, S., and Poole, B. Categorical reparameterization with Gumbel-Softmax. In ICLR, 2017.
 Kemp et al. (2006) Kemp, C., Tenenbaum, J. B., Griffiths, T. L., Yamada, T., and Ueda, N. Learning systems of concepts with an infinite relational model. In Proceedings of the national conference on artificial intelligence, 2006.
 Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014. URL http://arxiv.org/abs/1412.6980.
 Kingma & Welling (2013) Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 Kingma & Welling (2014) Kingma, D. P. and Welling, M. Efficient gradient-based inference through transformations between Bayes nets and neural nets. CoRR, abs/1402.0480, 2014. URL http://arxiv.org/abs/1402.0480.
 Kipf & Welling (2016a) Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016a.
 Kipf & Welling (2016b) Kipf, T. N. and Welling, M. Variational graph autoencoders. arXiv preprint arXiv:1611.07308, 2016b.
 Latouche et al. (2011a) Latouche, P., Birmelé, E., and Ambroise, C. Overlapping stochastic block models with application to the French political blogosphere. The Annals of Applied Statistics, 2011a.
 Latouche et al. (2011b) Latouche, P., Birmelé, E., and Ambroise, C. Overlapping stochastic block models with application to the French political blogosphere. The Annals of Applied Statistics, 2011b.
 Maddison et al. (2017) Maddison, C. J., Mnih, A., and Teh, Y. W. The concrete distribution: A continuous relaxation of discrete random variables. In ICLR, 2017.
 Miller et al. (2009a) Miller, K., Griffiths, T., and Jordan, M. Nonparametric latent feature models for link prediction. NIPS, 2009a.
 Miller et al. (2009b) Miller, K. T., Griffiths, T. L., and Jordan, M. I. Nonparametric latent feature models for link prediction. In NIPS, 2009b.
 Nalisnick & Smyth (2017) Nalisnick, E. and Smyth, P. Stick-breaking variational autoencoders. In ICLR, 2017.
 Nowicki & Snijders (2001) Nowicki, K. and Snijders, T. A. B. Estimation and prediction for stochastic blockstructures. JASA, 2001.
 Perozzi et al. (2014) Perozzi, B., Al-Rfou, R., and Skiena, S. DeepWalk: Online learning of social representations. In KDD, 2014.
 Schmidt & Morup (2013) Schmidt, M. N. and Morup, M. Nonparametric Bayesian modeling of complex networks: An introduction. Signal Processing Magazine, IEEE, 30(3), 2013.
 Teh et al. (2007) Teh, Y. W., Görür, D., and Ghahramani, Z. Stick-breaking construction for the Indian buffet process. In AISTATS, 2007.
 Wang et al. (2017) Wang, H., Shi, X., and Yeung, D.Y. Relational deep learning: A deep latent variable model for link prediction. In AAAI, pp. 2688–2694, 2017.
 Yan et al. (2011) Yan, F., Xu, Z., and Qi, Y. Sparse matrix-variate Gaussian process blockmodels for network modeling. In UAI, 2011.
 Yang & Leskovec (2012) Yang, J. and Leskovec, J. Community-affiliation graph model for overlapping network community detection. In ICDM, 2012.
 You et al. (2018) You, J., Ying, R., Ren, X., Hamilton, W. L., and Leskovec, J. GraphRNN: A deep generative model for graphs. CoRR, abs/1802.08773, 2018. URL http://arxiv.org/abs/1802.08773.
 Zhou (2015) Zhou, M. Infinite edge partition models for overlapping community detection and link prediction. In AISTATS, 2015.
 Zhu et al. (2016) Zhu, J., Song, J., and Chen, B. Max-margin nonparametric latent feature models for link prediction. arXiv preprint arXiv:1602.07428, 2016.
Quantitative results:
The proposed framework uses a stick-breaking IBP prior, which has two parameters: a concentration parameter, which encodes an initial guess of the number of nonzero entries in each binary vector, and a truncation parameter, which caps the number of communities. In general, a higher value of the concentration parameter worked better for DGLFRM-B and LFRM, as compared to the value used in the DGLFRM model. This difference reflects the inherent capacity of the latent space of these models: the embeddings learned by DGLFRM, while highly sparse, are real-valued, giving them more capacity to represent the data than the binary latent space of DGLFRM-B and LFRM.
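The stick-breaking IBP prior described above can be sketched as follows. This is a minimal numpy illustration under our own naming (`sample_ibp_stick_breaking` and the concentration value are assumptions, not the paper's code); activation probabilities decay with the community index, so under a large truncation level only a data-dependent number of communities stay active.

```python
import numpy as np

def sample_ibp_stick_breaking(num_nodes, alpha, truncation, rng=None):
    """Sample binary community-membership vectors under a truncated
    stick-breaking IBP prior (hypothetical helper; alpha is the IBP
    concentration parameter, truncation caps the number of communities)."""
    rng = np.random.default_rng(rng)
    # Stick-breaking: v_k ~ Beta(alpha, 1), pi_k = prod_{j<=k} v_j,
    # so community-activation probabilities decay with the index k.
    v = rng.beta(alpha, 1.0, size=truncation)
    pi = np.cumprod(v)
    # Each node's binary membership vector: z_nk ~ Bernoulli(pi_k).
    Z = (rng.random((num_nodes, truncation)) < pi).astype(np.int8)
    return Z, pi

Z, pi = sample_ibp_stick_breaking(num_nodes=100, alpha=3.0, truncation=50, rng=0)
active = int((Z.sum(axis=0) > 0).sum())  # communities used by at least one node
```

Because `pi` is non-increasing, high-index columns of `Z` are mostly zero, which is the mechanism by which posterior inference can "turn off" unnecessary communities.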
The encoder networks for DGLFRM and DGLFRM-B had two nonlinear GCN layers. The width of the first nonlinear layer was fixed to 32/64 for the datasets with side information (Cora, Citeseer, and Pubmed), and to 128/256 otherwise. The second layer of the GCN encoder had hidden units. The decoder networks for DGLFRM and DGLFRM-B had two layers with dimensions 32 and 16. All models were trained for 500-1000 iterations using the Adam optimizer (Kingma & Ba, 2014) with a learning rate of . We used a dropout rate of 0.5. The temperature of the Binary Concrete distribution (Maddison et al., 2017) was set to 0.5 for the prior and 1.0 for the posterior.
Qualitative results: For the experiments on synthetic data (with 100 nodes and 10 communities), the DGLFRM model had two GCN encoding layers with 32 and hidden units, and the decoder was a simple inner-product layer. The VGAE model used the same set of hyperparameters. For the qualitative experiment on the NIPS12 coauthorship dataset, the encoder had two hidden layers with 64 and hidden units. The parameter for this experiment was fixed to 2.
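An inner-product decoder of the kind used in the synthetic-data experiment can be sketched as below; the function name and the toy embedding matrix are our own illustrative choices.

```python
import numpy as np

def inner_product_decoder(Z):
    """Inner-product decoder sketch: P(A_ij = 1) = sigmoid(z_i . z_j).
    Z is an N x K embedding matrix; returns the N x N matrix of
    predicted edge probabilities."""
    logits = Z @ Z.T
    return 1.0 / (1.0 + np.exp(-logits))

# Toy example: nodes 0 and 1 share a community direction, node 2 does not.
Z = np.array([[2.0, 0.0],
              [2.0, 0.0],
              [0.0, 2.0]])
P = inner_product_decoder(Z)
```

Nodes with aligned embeddings get a high edge probability, while orthogonal embeddings give the uninformative value 0.5; this is what makes the decoder interpretable in terms of shared communities.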
The Variational Graph Autoencoder (VGAE), unlike the proposed model, does not learn readily interpretable embeddings: node clustering requires additional post-processing, such as K-means on the learned embeddings. Moreover, a method like K-means does not yield overlapping communities. In this section, we compare the clusters obtained by applying K-means to the embeddings learned by VGAE with the overlapping communities obtained directly from our framework.
We run K-means on the node embeddings learned by VGAE for the NIPS12 (3134 authors) coauthorship data. The K-means results are shown in Table 4. We performed two experiments with different numbers of clusters (K=5 and K=20). For reference, Table 3 shows the communities that our model infers directly. For both models, we show only prominent authors and their clusters.
As Table 4 shows, ad-hoc post-processing of the embeddings can break apart coherent communities that our model infers. It is also important to note that, unlike our model, K-means provides no strength indicator for community membership.
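The contrast above can be made concrete: overlapping communities can be read directly off an inferred membership matrix, whereas a hard clustering assigns each node to exactly one group. The sketch below is an illustrative helper (the threshold and toy matrix are assumptions, not values from the paper).

```python
import numpy as np

def overlapping_communities(Z, threshold=0.5):
    """Read overlapping communities off a membership matrix Z (N x K),
    whose entries are per-community membership strengths (sketch).
    A node joins every community whose strength exceeds the threshold,
    so one node may appear in several communities; empty communities
    are dropped."""
    members = Z >= threshold
    return [np.flatnonzero(members[:, k]) for k in range(Z.shape[1])
            if members[:, k].any()]

# Toy matrix: node 0 belongs strongly to communities 0 AND 1.
Z = np.array([[0.9, 0.8, 0.0],
              [0.7, 0.1, 0.0],
              [0.0, 0.6, 0.0]])
comms = overlapping_communities(Z)
```

Here node 0 appears in two communities at once, something no single hard K-means assignment can express, and the strengths themselves serve as the membership indicator that K-means lacks.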
The latent structure of the NIPS12 dataset learned by DGLFRM and VGAE is shown in Figure 2. In this experiment, the truncation parameter for the stick-breaking prior was set to 50. As shown in Figure 2(b), posterior inference in DGLFRM is naturally able to “turn off” the unnecessary columns of the embedding matrix. The average number of active communities per node was found to be 8. The sparse nature of the embedding matrix allows us to treat each column as a possible community of a given node. For visualization, we ordered the community indices (columns of the embedding matrix) so that communities with more active nodes receive lower indices. For the VGAE model, we used a two-layer GCN with dimensions 32 and 16. Figure 2(c) depicts the dense node embeddings learned by VGAE.
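The visualization ordering described above (communities with more active nodes get lower column indices) can be sketched as follows; the helper name and toy matrix are ours.

```python
import numpy as np

def order_by_activity(Z):
    """Reorder community columns so communities with more active nodes
    come first, the visualization convention used for the embedding
    matrix. Sketch; Z is a binary N x K membership matrix."""
    counts = (Z != 0).sum(axis=0)
    order = np.argsort(-counts, kind='stable')  # most active community first
    return Z[:, order], counts[order]

# Toy matrix: community 1 has three active nodes, communities 0 and 2 have one.
Z = np.array([[0, 1, 1],
              [0, 1, 0],
              [1, 1, 0]])
Z_sorted, counts = order_by_activity(Z)
```

After reordering, the active columns cluster at the left of the plot and the "turned off" columns at the right, which is what makes the sparsity pattern visible in a figure like Figure 2(b).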
We also performed an experiment to investigate the model’s ability to leverage side information associated with the nodes. For this experiment, we ran our model on three datasets (Cora, Citeseer, and Pubmed) with and without node features. We compare the AUC-ROC results in Fig. 4. As expected, the model performs better when using the node side information than when ignoring it.
Cluster (K=5)  Authors
Cluster 1  Hinton G, Dayan P, Jordan M, Tang A, Sejnowski T, Williams C
Cluster 2  Weinshall D, Rinott Y, Barto A, Singh S, Sutton R, Giles C, Connolly C, Baldi P, Precup D
Cluster 3  Thrun S, Shibata T, Stein C, Peper F, Michel A, Druzinsky R, Abu-Mostafa Y
Cluster 4  LeCun Y, Pearlmutter B
Cluster (K=20)  Authors 
Cluster 1  Hinton G, Williams C 
Cluster 2  Jordan M, Connolly C, Barto A, Singh S, Sutton R 
Cluster 3  Michel A, Tang A 
Cluster 4  Dayan P, Sejnowski T 
Cluster 5  Thrun S, Peper F 
Cluster 6  Baldi P, Weinshall D 
Cluster 7  Shibata T, Druzinsky R 
Cluster 8  Stein C 
Cluster 9  Precup D 
Cluster 10  Giles C 
Cluster 11  Pearlmutter B 
Cluster 12  LeCun Y 
Cluster 13  Rinott Y 
Cluster 14  Abu-Mostafa Y
Cluster  Authors
Probabilistic Modeling  Sejnowski T, Hinton G, Dayan P, Jordan M, Williams C
Reinforcement Learning  Barto A, Singh S, Sutton R, Connolly C, Precup D
Robotics/Vision  Shibata T, Peper F, Thrun S, Giles C, Michel A
Computational Neuroscience  Baldi P, Stein C, Rinott Y, Weinshall D, Druzinsky R
Neural Networks  Pearlmutter B, Abu-Mostafa Y, LeCun Y, Sejnowski T, Tang A