Multi-MotifGAN (MMGAN): Motif-Targeted Graph Generation and Prediction
Abstract
Generative graph models create instances of graphs that mimic the properties of real-world networks. Generative models are successful at retaining pairwise associations in the underlying networks but often fail to capture higher-order connectivity patterns known as network motifs. Different types of graphs contain different network motifs, an example being the triangles that often arise in social and biological networks. It is hence vital to capture these higher-order structures in order to simulate real-world networks accurately. We propose Multi-MotifGAN (MMGAN), a motif-targeted generative adversarial network (GAN) that generalizes the benchmark NetGAN approach. The generalization consists of combining multiple biased random walks, each of which captures a different motif structure. MMGAN outperforms NetGAN at creating new graphs that accurately reflect the network motif statistics of input graphs such as Citeseer, Cora, and Facebook.
Anuththari Gamage, Eli Chien, Jianhao Peng, Olgica Milenkovic (Department of Electrical and Computer Engineering, University of Illinois Urbana-Champaign). The work was supported by the NSF Center for Science of Information under grant number 0939370. Keywords: generative adversarial networks, higher-order networks, multi-view graphs, network motifs.
1 Introduction
Given the ubiquity of network structures in real-world data, graph generative models have been studied extensively as a means of simulating graphs with different properties. Classical stochastic models, such as the Erdős–Rényi model, the Barabási–Albert model, and the stochastic block model, generate graphs based on a predefined set of parameters, such as the probabilities of edge formation within and between communities [Easley2010]. In contrast, modern approaches to graph generation based on deep learning, including NetGAN [Bojchevski2018], GraphGAN [Wang2018], and GraphRNN [You2018], are flexible enough to learn multiple properties of an input graph simultaneously. The graphs generated by these architectures may be used for downstream learning tasks such as data augmentation [chakrabarti2006], recommendation [yu2014], and link prediction [gao2011].
Many real-world networks consist of entities with complex mutual interrelations. Such networks cannot be modeled effectively as graphs with simple pairwise relations, despite the fact that pairwise relations provide a wealth of information for learning. Studying higher-order relationships in a graph is fundamental to our understanding of network behavior and function. Higher-order relationships are usually termed hyperedges (collections of more than two nodes) [Zhou2006, chien2019] or network motifs (recurrent node connectivity patterns that are statistically significant compared to some ground-truth random graph model) [Milo2002]. These higher-order structures are the actual building blocks of complex networks, as they capture fundamental functional properties.
Network motifs were originally studied in the context of gene regulatory networks [Milo2002, ShenOrr2002], but the presence of distinct network motifs in different types of real-world networks (food webs, the World Wide Web, social networks, power grid networks, etc.) has been established in prior literature [Milo2002, Ugander2013, Dey2017]. For example, gene regulatory networks, neuronal networks, and social networks all contain a large number of triangles [Milo2002, Ugander2013]. When generating graphs that are statistically similar to a real-world network, or when trying to predict unobserved subgraphs, it is vital to preserve the motif structures present in the network under consideration.
Existing implicit graph generative models successfully capture pairwise relationships within the graph and the associated graph statistics, but they are not as successful at retaining higher-order relationships such as motifs or hyperedges. To address this issue, we propose Multi-MotifGAN (MMGAN), a novel motif-targeted graph generative model that preserves network motif statistics in the output graphs. MMGAN generalizes NetGAN, an architecture that uses random walks on an input graph to learn characteristics of the network. The generalization consists of combining multiple random walk statistics, where each type of random walk is biased towards one type of motif structure. We consider two variants of MMGAN: the first is designed to reflect the motif statistics of the input graph accurately, and the second aims to improve motif prediction in networks with missing edges. Both variants combine multiple random walk outputs generated by differently biased GANs, each of which targets a specific motif type.
We show experimentally that MMGAN outperforms benchmark generative models such as NetGAN at retaining multiple network motif statistics of the original graph, as evidenced by its competitive results in generation and link prediction on real-world social networks such as Citeseer, Cora, and Facebook [mccallum2000, sen2008, leskovec2012learning]. For example, when trained on Citeseer, which contains 1084 triangles, MMGAN produces networks with an average of 1285 triangles, compared to an average of 625 produced by NetGAN. Similarly, in terms of motif prediction, MMGAN obtains an average precision of 99.29% on Cora while NetGAN achieves 92.23%. For simplicity and due to space constraints, we only discuss results for motifs with up to 3 nodes. However, it is straightforward to adapt MMGAN to motifs with a different constant number of nodes.
Relation to Existing Work: MMGAN uses multiple techniques for learning on graphs and combines them into a motif-aware model. Random walks on graphs are widely used to learn the local and global topology of a graph [Perozzi2014, Grover2016, Li2019], while biased random walks are used to characterize higher-order network structures such as hyperedges and network motifs [Lee2011, Tsourakakis2017, Backstrom2011, Han2016, Dayeh2012, Zhou2006]. Generative adversarial networks (GANs) are highly effective at learning implicit features of a data set and using them to generate realistic data samples. They are therefore a natural choice both for prediction tasks on incomplete data and for sample generation. Combining GANs that provide multiple views of the same system is a new feature of our architecture, and it is expected to improve the quality of inference on the underlying data. There exist many methods for link prediction in networks [liben2007link], but to the best of our knowledge, MMGAN is the only GAN-based generative and predictive model for motifs.
The paper is organized as follows: Section 2 introduces the MMGAN architecture, while Section 3 presents a summary of our experimental findings.
2 Multi-MotifGAN
Let $G = (V, E)$ be a graph with node set $V$ and edge set $E$. A subgraph $H$ of $G$ is a graph contained within $G$, i.e., one with $V(H) \subseteq V$ and $E(H) \subseteq E$. The frequency of a subgraph $H$ is the number of appearances of subgraphs in $G$ that are isomorphic to $H$. Furthermore, let $\mathcal{R}(G)$ be a random graph model with the same number of nodes and the same node degree distribution as $G$. A network motif is defined as a subgraph that recurs in a network with a higher frequency than in the chosen random graph model [Milo2002].
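As a concrete illustration of these definitions, the two connected 3-node subgraph types used throughout the paper (wedges, denoted V, and triangles, denoted T) can be counted by brute force. The sketch below is for exposition only; for large graphs the paper relies on FANMOD-style motif sampling instead.

```python
from itertools import combinations

def count_3node_motifs(edges):
    """Count wedges (V) and triangles (T) in an undirected graph.

    A wedge is a path on three nodes; a triangle is a 3-clique.
    Brute-force enumeration over node triples -- fine for small
    examples, computationally prohibitive for large networks.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    wedges = triangles = 0
    for a, b, c in combinations(sorted(adj), 3):
        # number of edges among the triple {a, b, c}
        k = (b in adj[a]) + (c in adj[a]) + (c in adj[b])
        if k == 3:
            triangles += 1
        elif k == 2:
            wedges += 1
    return wedges, triangles
```

For example, a triangle on nodes 0, 1, 2 with a pendant edge (2, 3) contains one triangle and two wedges ({0, 2, 3} and {1, 2, 3}).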
2.1 Graph Generation using NetGAN
We base our motif-targeted generative model on an existing implicit graph generative architecture, NetGAN [Bojchevski2018]. NetGAN is a generative adversarial network that uses random walks on a graph to generate realistic graphs that are statistically similar to a training graph. It consists of a generator $G$ and a discriminator $D$, trained under the Wasserstein GAN objective [Arjovsky2017] for increased stability. The generator $G$ outputs sets of random walks similar to those sampled from the input graph one wants to mimic, while $D$ learns to distinguish between random walks generated by $G$ and those sampled from the input graph. Thus, NetGAN requires only a single undirected graph as input, from which it samples a set of random walks to act as the training data set. It is highly efficient in cases where one does not have a large collection of similar graphs to serve as a training set.
Once $G$ and $D$ are trained, NetGAN generates a new graph using the frequencies of edges in the generated set of random walks. It constructs a score matrix $S$ whose $(i, j)$-th entry counts the number of times edge $(i, j)$ appears in the generated random walks. The score matrix is normalized by its row sums so that, for every node, one obtains a probability distribution over its neighboring nodes. To add an edge, a node is selected uniformly at random and its neighbor is sampled according to the corresponding probability distribution derived from the normalized score matrix. Subsequently, an edge between these two nodes is added to the output graph. The procedure continues until the number of edges in the input graph is reached.
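A minimal sketch of this edge-assembly step follows (illustrative only; the actual NetGAN implementation includes further details such as symmetrization of the score matrix):

```python
import numpy as np

def assemble_graph(scores, num_edges, rng=None):
    """Sample an undirected edge set from a NetGAN-style score matrix.

    scores[i, j] counts how often edge (i, j) appeared in the generated
    random walks. Rows are normalized into per-node transition
    distributions; edges are drawn until num_edges distinct edges
    have been collected.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    n = scores.shape[0]
    probs = scores / scores.sum(axis=1, keepdims=True)
    edges = set()
    while len(edges) < num_edges:
        i = int(rng.integers(n))            # pick a node at random
        j = int(rng.choice(n, p=probs[i]))  # sample a neighbor from its row
        if i != j:
            edges.add((min(i, j), max(i, j)))
    return edges
```

Sampling without replacement (via the `set`) matches the goal of producing a simple graph with the same edge count as the input.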
NetGAN has been shown to outperform state-of-the-art graph generative models at preserving various topological features of the input graph (e.g., maximum degree, clustering coefficient, inter- and intra-community edge density) in its generated output. The method also exhibits competitive performance in link prediction on incomplete graphs, which indicates that it is capable of generalization rather than mere memorization of the input graph. Despite the efficacy of NetGAN in the above tasks, we observe that the graphs it generates (as well as those of other state-of-the-art generative models comparable to NetGAN) fail to approximate the network motif statistics of the input graph. For example, NetGAN systematically underestimates the number of triangles in social networks by 40-60% (see Table 1). This is a major shortcoming for applications that aim to generate graphs that realistically mimic real-world networks or to predict unobserved motif structures.
2.2 Multi-MotifGAN (MMGAN)
For our proposed algorithm, we generalize the NetGAN random-walk-based architecture, which lends itself to characterizing the local properties of nodes (depending on how the random walk is performed). To generate the training set of random walks, NetGAN employs a second-order random walk, node2vec, which captures the local and global structure of the graph effectively via a two-step weighting scheme [Grover2016]: given an edge $(v, x)$, suppose that the previous transition of the random walk was from some node $t$ to $v$. The second-order bias is chosen as
$$\alpha_{pq}(t, x) = \begin{cases} 1/p, & d_{tx} = 0, \\ 1, & d_{tx} = 1, \\ 1/q, & d_{tx} = 2, \end{cases}$$
where $p$ and $q$ are the return and in-out parameters of node2vec and $d_{tx}$ is the shortest-path length between $t$ and $x$. The unnormalized transition probability from $v$ to $x$ equals
$$\pi_{vx} = \alpha_{pq}(t, x)\, w_{vx},$$
where $w_{vx}$ is the weight of edge $(v, x)$ (equal to $1$ for unweighted graphs). In MMGAN, we change this weight to incorporate the 3-node motif statistics of the graph and bias the random walk towards edges that are more likely to be part of a particular network motif. This bias is different from the bias $\alpha_{pq}$ introduced to control the extent of exploration in the graph.
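The node2vec weighting above can be sketched compactly. In this minimal illustration, `adj` maps each node to its neighbor set, and `p`, `q` are the standard node2vec return and in-out parameters:

```python
def node2vec_bias(t, x, adj, p, q):
    """Second-order node2vec bias alpha_pq(t, x).

    t: node the walk came from; x: candidate next node.
    d_tx = 0 means stepping back to t, d_tx = 1 means t and x
    are adjacent, and d_tx = 2 covers all remaining neighbors.
    """
    if x == t:          # d_tx = 0: return to the previous node
        return 1.0 / p
    if x in adj[t]:     # d_tx = 1: x is also a neighbor of t
        return 1.0
    return 1.0 / q      # d_tx = 2: move outward

def transition_weights(t, v, adj, p, q, w=None):
    """Unnormalized transition probabilities pi_vx = alpha_pq(t, x) * w_vx
    for all neighbors x of the current node v (w_vx = 1 if unweighted)."""
    return {x: node2vec_bias(t, x, adj, p, q) * (w or {}).get((v, x), 1.0)
            for x in adj[v]}
```

Normalizing the returned dictionary yields the actual transition distribution of the walk.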
To find the correct biases, we first count the motifs in the graph of interest. While a complete enumeration of the motifs present in a large-scale network is computationally prohibitive, a number of efficient motif-sampling algorithms exist that approximate the frequencies of different motifs in a network [MasoudiNejad2012, Wernicke2006]. In our analysis, we use FANMOD, a fast network motif detection algorithm that can handle both directed and undirected networks and finds motifs containing up to 8 nodes [Wernicke2006]. For simplicity, we focus on 3-node motifs, since they represent the most significant structures in social networks and are likely to be contained in other higher-order motifs; this allows one to implicitly include information about higher-order interactions while limiting the complexity of the MMGAN platform. However, it is straightforward to adapt MMGAN to account for motifs of larger sizes, with adjustments in the weight calculation and graph combination procedures so as to account for different nested non-isomorphic motifs.
Using FANMOD, we first estimate the total number of 3-node motifs, which are of types V (a wedge, i.e., a path on three nodes) and T (a triangle), as listed in Figure 1. The concentration of motif type $i \in \{V, T\}$ equals $c_i = n_i / (n_V + n_T)$, where $n_i$ denotes the number of motifs of type $i$ in the graph. For an edge $e = (u, v)$, we define $M_i(e)$ to be the number of motifs of type $i$ in which the edge participates. Then, the motif-biased weight of the edge equals
$$w_e = \sum_{i} \gamma_i\, M_i(e),$$
where each coefficient $\gamma_i$ is an appropriate function of the concentration $c_i$, chosen to emphasize the targeted motif type.
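The motif-biased weight computation can be sketched as follows, assuming a weighted-sum form $w_e = \sum_i \gamma_i M_i(e)$ consistent with the description above; the `gamma` values in the example are purely illustrative, not the paper's actual choices:

```python
def motif_biased_weights(edges, edge_motif_counts, gamma):
    """Motif-biased edge weights w_e = sum_i gamma[i] * M_i(e).

    edge_motif_counts[e][i] is M_i(e), the number of motifs of type i
    (here 'V' or 'T') containing edge e; gamma[i] is the per-type
    coefficient derived from the motif concentrations (the exact
    choice of gamma is a design knob -- illustrative values only).
    """
    return {e: sum(gamma[i] * edge_motif_counts[e].get(i, 0) for i in gamma)
            for e in edges}
```

The resulting weights replace $w_{vx}$ in the node2vec transition probabilities, steering the walk toward motif-dense edges.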
Thus, $w_e$ is a weighted average of the motif counts, weighted by an appropriate function of the concentration. The chosen bias will lead to a higher frequency of the particular motif in the output graph compared to the input graph. In order to obtain motif counts that reflect those of the input, we combine the output score matrices of three GANs with random walks biased as follows: without motif weights, as in NetGAN ($S^{(0)}$); with weights biased towards V motifs ($S^{(V)}$); and with weights biased towards T motifs ($S^{(T)}$). The matrix $S^{(0)}$ leads to a good characterization of the input edge set. From $S^{(V)}$, we obtain a better characterization of the V motifs in the graph than from $S^{(0)}$, but with a frequency that is too high. Similarly, $S^{(T)}$ ensures a good characterization of triangles, albeit once again with a higher count than observed in the input. These three 'views' of the graph provide a close approximation of the actual motif frequencies and concentrations once properly combined. To handle different tasks such as motif generation and link prediction, we propose two different ways of combining the score matrices:
I. Multi-view combination for link prediction (MMGAN-Avg): We combine the three score matrices via averaging, resulting in $\bar{S} = (S^{(0)} + S^{(V)} + S^{(T)})/3$. Edges are sampled in the same manner as in NetGAN: we first normalize $\bar{S}$ to produce a transition probability matrix, then select a node at random and choose one of its neighbors according to the resulting distribution. We add an edge between the corresponding nodes in the output graph and continue adding edges in this manner until the number of edges in the input graph is reached.
II. Multi-view combination for graph generation (MMGAN): In this scheme, we sample both edges and motifs from the three views at random and add them directly to the output graph as follows. We first randomly choose one of the views $S^{(0)}$, $S^{(V)}$, or $S^{(T)}$ with probabilities $p_0$, $p_V$, and $p_T$, respectively. Then, we choose one of two sampling methods, sampling by maximum score or random sampling, with probabilities $\delta$ and $1 - \delta$, respectively, where $\delta$ is small to avoid overfitting.
If we choose sampling by maximum score, we first select the edge $(u, v)$ with the highest score in the chosen view. Then, we add the corresponding subgraph structure to the output graph. In more detail, if the chosen view is $S^{(0)}$, we add $(u, v)$ to the output graph. If it is $S^{(V)}$, we find all possible V motifs containing $(u, v)$, compute the average score of each candidate motif, select the motif with the highest average score, and add its two edges to the output graph. Similarly, if the view is $S^{(T)}$, we compute the average scores of all T motifs (triangles) containing $(u, v)$ and add the three edges of the highest-scoring motif to the output. After adding the edge(s), we remove the corresponding scores from the view to enforce sampling without replacement. We repeat this procedure with the next-highest score in the score matrix and continue until the output graph has the same number of edges as the input.
If we choose random sampling, we first select a node $u$ uniformly at random. Then, similar to the previous combination method, we randomly sample two other nodes $v$ and $w$ according to the probability distribution defined by the normalized score matrix. Finally, if the chosen view is $S^{(0)}$, we add edge $(u, v)$ to the output graph. If it is $S^{(V)}$, we add the V motif with edges $(u, v)$ and $(u, w)$, and if it is $S^{(T)}$, we add the triangle on all three nodes. We continue until the output contains the same number of edges as the input.
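One step of the random-sampling branch can be sketched as below. This is one plausible reading of how the two extra nodes are drawn (both from $u$'s transition distribution); the view names `'E'`, `'V'`, `'T'` are illustrative labels for $S^{(0)}$, $S^{(V)}$, $S^{(T)}$:

```python
import numpy as np

def add_random_motif(view, probs, out_edges, rng):
    """One random-sampling step of the MMGAN combination scheme (sketch).

    view: 'E' (plain edge view), 'V' (wedge view), or 'T' (triangle view);
    probs: the chosen view's row-normalized score matrix;
    out_edges: set of undirected edges accumulated so far.
    """
    n = probs.shape[0]
    u = int(rng.integers(n))              # node chosen uniformly at random
    v = int(rng.choice(n, p=probs[u]))    # drawn from u's transition row
    if view == 'E':
        new = [(u, v)]
    else:
        w = int(rng.choice(n, p=probs[u]))
        new = [(u, v), (u, w)]            # wedge centered at u
        if view == 'T':
            new.append((v, w))            # close the triangle
    for a, b in new:
        if a != b:                        # drop degenerate self-loops
            out_edges.add((min(a, b), max(a, b)))
```

In a full generation loop, this step would be repeated (with views drawn according to $p_0$, $p_V$, $p_T$) until the output edge count matches the input.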
Choosing some of the maximum-scoring edges and motifs ensures that the key edges appearing repeatedly in the sampled set of random walks are included in the output graph. Repeated appearances indicate that an edge has a high weight and is therefore part of a large number of motifs. Every time we sample one of these heavily weighted edges, we add an entire motif to the output graph. Thus, adding even a small sample of them leads to a higher frequency of motifs in the output. Furthermore, by adjusting $p_0$, $p_V$, and $p_T$, we can control the frequencies of the different motif types as needed. This approach leads to a closer approximation of the motif counts of the original graph compared to MMGAN-Avg, at the potential expense of link and motif prediction accuracy.
3 Experimental Results
We test the performance of MMGAN and MMGAN-Avg against NetGAN, which has been shown to outperform a number of other benchmark graph generative models in terms of preserving input graph statistics [Bojchevski2018]. For data, we use three real-world social networks, Cora [sen2008], Citeseer [mccallum2000], and Facebook [leskovec2012learning], with the characteristics described in Table 3. Note that in all of these networks, triangles (T) are statistically significant, i.e., they occur with higher frequency in the real network than in randomized networks. Thus, we are generally interested in keeping the triangle count of our generated output comparable to that of the input network.
In each of the experiments described, we train NetGAN, MMGAN, and MMGAN-Avg to 60% edge overlap (one of the early-stopping criteria in NetGAN) and average results over 5 runs. We use an 80/20% training/testing split of the 3-node motifs in the original graph.
Motif-targeted graph generation: We evaluate the ability of MMGAN and MMGAN-Avg to preserve the motif structures of the graph by comparing motif counts and motif concentrations in the output. For this, we combine the multiple score matrices using the combination schemes described in I and II. For II, we set the view-selection probabilities $p_0$, $p_V$, and $p_T$ proportional to the number of edges in the corresponding structures (1, 2, and 3, respectively), thereby emphasizing triangles, and keep $\delta$ fixed for every experiment. The results for both combination schemes I and II are shown for comparison in Tables 1 and 2.
Link and motif prediction: We evaluate the predictive ability of MMGAN and MMGAN-Avg as follows. For motif prediction, we use the test set of motifs held out during training and construct an equally sized set of test non-motifs. For link prediction, we use the corresponding edges as test edges and non-edges.
Table 1: Motif counts of the generated graphs. The normalized motif count is the output count divided by the input count; the error is its absolute deviation from 1.

Dataset  | Motif | Input Count | NetGAN    | MMGAN     | MMGAN-Avg | NetGAN norm. (error) | MMGAN norm. (error) | MMGAN-Avg norm. (error)
Citeseer | V     | 22,763      | 18,369    | 23,280    | 17,464    | 0.8069 (0.1931)      | 1.0227 (0.0227)     | 0.7672 (0.2328)
Citeseer | T     | 1084        | 632       | 1285      | 722       | 0.5830 (0.4170)      | 1.1854 (0.1854)     | 0.6661 (0.3339)
Cora     | V     | 47,239      | 39,401    | 58,967    | 35,640    | 0.8340 (0.1660)      | 1.2426 (0.2426)     | 0.7546 (0.2454)
Cora     | T     | 1558        | 796       | 1819      | 1006      | 0.5110 (0.4890)      | 1.1675 (0.1675)     | 0.6457 (0.3543)
Facebook | V     | 1,238,448   | 1,337,952 | 1,204,147 | 1,329,432 | 1.0803 (0.0803)      | 0.9723 (0.0277)     | 1.0735 (0.0735)
Facebook | T     | 420,329     | 233,566   | 168,607   | 236,144   | 0.5557 (0.4443)      | 0.4011 (0.5989)     | 0.5618 (0.4382)
Table 2: Motif concentrations of the generated graphs and KL divergence between the input and output motif distributions.

Dataset  | Motif   | Input  | NetGAN | MMGAN  | MMGAN-Avg
Citeseer | V       | 95.45% | 96.68% | 94.75% | 96.03%
Citeseer | T       | 4.55%  | 3.32%  | 5.25%  | 3.97%
Citeseer | KL div. |        | 0.2764 | 0.0777 | 0.0583
Cora     | V       | 96.81% | 98.02% | 97.00% | 97.25%
Cora     | T       | 3.19%  | 1.98%  | 3.00%  | 2.75%
Cora     | KL div. |        | 0.3942 | 0.0086 | 0.0474
Facebook | V       | 74.66% | 85.14% | 87.72% | 84.92%
Facebook | T       | 25.34% | 14.86% | 12.28% | 15.08%
Facebook | KL div. |        | 4.6922 | 7.5672 | 4.4839
Table 3: Dataset characteristics: number of nodes and edges, motif concentrations in the real network, and the corresponding concentrations in randomized networks.

Network  | Nodes | Edges  | c_V (%) | c_T (%) | c_V rand. (%) | c_T rand. (%)
Cora     | 2485  | 10,138 | 96.81   | 3.19    | 99.97         | 0.03
Citeseer | 2118  | 7358   | 95.45   | 4.55    | 99.94         | 0.06
Facebook | 1034  | 53,498 | 74.66   | 25.34   | 96.68         | 3.32
We use the average scores of these test motifs and edges to compute two metrics: AUC (area under the receiver operating characteristic curve) and AP (average precision), which are standard metrics for link-prediction evaluation [Bojchevski2018]. Tables 4 and 5 show the results under each metric.
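Both metrics admit a compact computation; the sketch below assumes `pos` and `neg` are lists of model scores for the held-out positives (true motifs or edges) and the sampled negatives:

```python
def auc_score(pos, neg):
    """AUC: probability that a randomly chosen positive scores higher
    than a randomly chosen negative; ties count 1/2."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(pos, neg):
    """AP: mean of the precision values at each true positive when all
    test items are ranked by score in descending order."""
    ranked = sorted([(s, 1) for s in pos] + [(s, 0) for s in neg],
                    key=lambda t: -t[0])
    tp, precisions = 0, []
    for rank, (_, y) in enumerate(ranked, 1):
        if y:
            tp += 1
            precisions.append(tp / rank)
    return sum(precisions) / len(precisions)
```

A perfect ranking (every positive above every negative) yields 1.0 under both metrics.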
4 Discussion
While all three algorithms are quite successful, MMGAN-Avg outperforms the other methods on every dataset under all metrics and should be the method of choice for motif prediction. The two GAN-combining schemes essentially trade off exploration and exploitation in different manners. MMGAN targets edges that are more likely to produce motifs and adds them to the output, thus ensuring that we obtain counts close to those of the input. MMGAN-Avg, on the other hand, incorporates information from all three views equally, resulting in a graph that better reflects the edge connectivity of the input network. Nevertheless, it appears plausible that large-scale tuning of the motif sampling probabilities and of the proportions of maximum-score and random selection in MMGAN may lead to improved performance compared to MMGAN-Avg. These tunings will be described in the full version of the paper.
We further note that even without explicitly incorporating statistics of 4-node motifs in the input network, MMGAN approximates their counts better than NetGAN. For example, we compare the square (4-node cycle) counts in the outputs of both methods when trained on Citeseer. NetGAN generates graphs with a normalized square count of 0.1204 on average, while MMGAN achieves 0.3012 on average. This supports our assumption that, since 3-node motifs are likely to be contained in other higher-order motifs, using only 3-node motif statistics still allows us to implicitly capture information about higher-order motifs.
Table 4: AUC scores for link and motif prediction.

Dataset  | Type  | NetGAN | MMGAN  | MMGAN-Avg
Citeseer | Link  | 0.9599 | 0.9265 | 0.9675
Citeseer | Motif | 0.9974 | 0.9958 | 0.9982
Cora     | Link  | 0.9159 | 0.8947 | 0.9340
Cora     | Motif | 0.9961 | 0.9907 | 0.9977
Facebook | Link  | 0.9779 | 0.9751 | 0.9981
Facebook | Motif | 0.9733 | 0.9585 | 0.9770
Table 5: Average precision (AP) scores for link and motif prediction.

Dataset  | Type  | NetGAN | MMGAN  | MMGAN-Avg
Citeseer | Link  | 0.9655 | 0.9391 | 0.9730
Citeseer | Motif | 0.9962 | 0.9950 | 0.9970
Cora     | Link  | 0.9223 | 0.9010 | 0.9429
Cora     | Motif | 0.9959 | 0.9902 | 0.9969
Facebook | Link  | 0.9735 | 0.9743 | 0.9816
Facebook | Motif | 0.9578 | 0.9337 | 0.9632