Multi-MotifGAN (MMGAN): Motif-targeted Graph Generation and Prediction

Abstract

Generative graph models create instances of graphs that mimic the properties of real-world networks. Generative models are successful at retaining pairwise associations in the underlying networks but often fail to capture higher-order connectivity patterns known as network motifs. Different types of graphs contain different network motifs, an example being triangles, which often arise in social and biological networks. It is hence vital to capture these higher-order structures in order to simulate real-world networks accurately. We propose Multi-MotifGAN (MMGAN), a motif-targeted Generative Adversarial Network (GAN) that generalizes the benchmark NetGAN approach. The generalization consists of combining multiple biased random walks, each of which captures a different motif structure. MMGAN outperforms NetGAN at creating new graphs that accurately reflect the network motif statistics of input graphs such as Citeseer, Cora, and Facebook.

Anuththari Gamage, Eli Chien, Jianhao Peng, Olgica Milenkovic
Department of Electrical and Computer Engineering, University of Illinois Urbana-Champaign
(The work was supported by the NSF Center for Science of Information under grant number 0939370.)

Keywords: Generative adversarial networks, higher-order networks, multi-view graphs, network motifs.

1 Introduction

Given the ubiquity of network structures in real-world data, graph generative models have been studied extensively as a means of simulating graphs with different properties. Classical stochastic models, such as the Erdős-Rényi, Barabási-Albert, and the stochastic block model, generate graphs based on a predefined set of parameters, such as the probability of edge formation within and between communities [Easley2010]. In contrast, modern approaches to graph generation based on deep learning, including NetGAN [Bojchevski2018], GraphGAN [Wang2018], and GraphRNN [You2018], are flexible enough to learn multiple different properties of an input graph simultaneously. The graphs generated by these architectures may be used for downstream learning tasks such as data augmentation [chakrabarti2006], recommendation [yu2014], and link prediction [gao2011].

Many real-world networks consist of entities with complex mutual interrelations. Such networks cannot be modeled effectively as graphs with simple pairwise relations, despite the fact that pairwise relations provide a wealth of information for learning. Studying higher-order relationships in a graph is fundamental for our understanding of the network behavior and function. Higher-order relationships are usually termed hyperedges (collections of more than two nodes) [Zhou2006, chien2019] or network motifs (recurrent node connectivity patterns that are statistically significant compared to some ground truth random graph model) [Milo2002]. These higher-order structures are the actual building blocks of complex networks, as they capture fundamental functional properties.

Network motifs were originally studied in the context of gene regulatory networks [Milo2002, Shen-Orr2002], but the presence of distinct network motifs in different types of real-world networks (food webs, the world wide web, social networks, power grid networks etc.) has been established in prior literature [Milo2002, Ugander2013, Dey2017]. For example, gene regulatory networks, neuronal networks, and social networks all contain a large number of triangles [Milo2002, Ugander2013]. When generating graphs that are statistically similar to a real-world network or trying to predict unobserved subgraphs, it is vital to preserve the motif structures present in the network under consideration.

Existing implicit graph generative models successfully capture pairwise relationships within the graph and associated graph statistics, but they are not as successful in retaining higher-order relationships like motifs or hyperedges. To address this issue, we propose Multi-MotifGAN (MMGAN), a novel motif-targeted graph generative model that preserves network motif statistics in the output graphs. MMGAN generalizes NetGAN, an architecture that uses random walks on an input graph to learn characteristics of the network. The generalization consists of combining multiple random walk statistics, where each type of random walk is biased towards one type of motif structure. We consider two variants of MMGAN: the first is designed to reflect the motif statistics of the input graph accurately, and the second aims to improve motif prediction in networks with missing edges. Both variants combine multiple random walk outputs generated by differently biased GANs, each of which targets a specific motif type.

We show experimentally that MMGAN outperforms benchmark generative models such as NetGAN at retaining multiple network motif statistics of the original graph, as evidenced by its competitive results in generation and link prediction on real-world social networks such as Citeseer, Cora, and Facebook [mccallum2000, sen2008, leskovec2012learning]. For example, when trained on Citeseer, which contains 1084 triangles, MMGAN produces networks with an average of 1285 triangles, compared to an average of 625 produced by NetGAN. Similarly, in terms of motif prediction, MMGAN obtains an average precision of 99.29% on Cora while NetGAN achieves 92.23%. For simplicity and due to space constraints, we only discuss results on motifs with up to three nodes. However, it is straightforward to adapt MMGAN to any other constant motif size.

Relation to Existing Work: MMGAN uses multiple techniques for learning on graphs and combines them into a motif-aware model. Random walks on graphs are widely used to learn the local and global topology of a graph [Perozzi2014, Grover2016, Li2019], while biased random walks are used to characterize higher-order network structures such as hyperedges and network motifs [Lee2011, Tsourakakis2017, Backstrom2011, Han2016, Dayeh2012, Zhou2006]. Generative Adversarial Networks (GANs) are highly effective at learning implicit features of a data set and using these to generate realistic data samples. They are therefore a natural choice for both prediction tasks on incomplete data and sample generation. Combining GANs that provide multiple views of the same system is a new feature of our architecture, and it is expected to improve the quality of inference tasks on the underlying data. There exist many methods for link prediction in networks [liben2007link], but to the best of our knowledge, MMGAN is the only GAN-based generative and predictive model for motifs.

The paper is organized as follows: Section 2 introduces the MMGAN architecture, while Section 3 presents a summary of our experimental findings.

2 Multi-MotifGAN

Let $G = (V, E)$ be a graph with node set $V$ and edge set $E$. A subgraph $H = (V_H, E_H)$ of $G$ is a graph contained within $G$ such that $V_H \subseteq V$ and $E_H \subseteq E$. The frequency of a subgraph $H$ is the number of appearances of subgraphs in $G$ that are isomorphic to $H$. Furthermore, let $G_{\text{rand}}$ be a random graph model with the same number of nodes and the same node degree distribution as $G$. A network motif is defined as a subgraph that recurs in a network with a higher frequency than in the chosen random graph model $G_{\text{rand}}$ [Milo2002].

Figure 1: Motifs in social networks [Ugander2013]. (Row 1) We focus on motifs with at most three nodes: edges, V motifs (pairs of edges sharing a vertex), and T motifs (triangles). (Row 2) Motifs involving four vertices.

2.1 Graph Generation using NetGAN

We base our motif-targeted generative model on an existing implicit graph generative architecture, NetGAN [Bojchevski2018]. NetGAN is a Generative Adversarial Network that uses random walks on a graph to generate realistic graphs that are statistically similar to a training graph. NetGAN consists of a generator $G$ and a discriminator $D$, which are trained under the Wasserstein GAN objective [Arjovsky2017] for increased stability. The generator $G$ outputs sets of random walks that are similar to those sampled from the input graph, while $D$ learns to distinguish between random walks generated by $G$ and those sampled from the input graph. Thus, NetGAN requires only one undirected graph as input, from which it samples a set of random walks to act as a training data set. It is highly efficient in cases where one does not have a large set of similar graphs that can serve as the training set.

Once $G$ and $D$ are trained, NetGAN generates a new graph using the frequency of edges in the generated set of random walks. It constructs a score matrix $S$ whose $(i,j)$th entry represents the number of times the edge $(i,j)$ appears in the generated random walks. The score matrix is normalized by its row sums so that, for every node, one obtains a probability distribution over its neighboring nodes. To add an edge, a node is selected at random and one of its neighbors is sampled according to the corresponding probability distribution constructed from the normalized score matrix. Subsequently, an edge between these two nodes is added to the output graph. The procedure continues until the number of edges in the input graph is reached.
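The edge-assembly step above can be sketched as follows. This is a minimal illustration, not the NetGAN implementation; the function name and the dict-of-dicts score representation are our own choices.

```python
import random

def sample_graph_from_scores(scores, num_edges, rng=random.Random(0)):
    """Sample an undirected graph from an edge-score structure, NetGAN-style.

    `scores` maps each node to a dict {neighbor: count}, where the count is
    the number of times the edge appeared in the generated random walks.
    """
    # Row-normalize the scores into per-node transition distributions.
    probs = {}
    for u, nbrs in scores.items():
        total = sum(nbrs.values())
        probs[u] = {v: c / total for v, c in nbrs.items()}

    edges = set()
    nodes = list(scores.keys())
    while len(edges) < num_edges:
        u = rng.choice(nodes)                      # pick a node at random
        nbrs, wts = zip(*probs[u].items())
        v = rng.choices(nbrs, weights=wts)[0]      # neighbor ~ normalized score
        if u != v:
            edges.add((min(u, v), max(u, v)))      # undirected, no duplicates
    return edges
```

Edges are stored as sorted pairs so that each undirected edge is counted once, and the loop stops once the target edge count is reached.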

Figure 2: The MMGAN architecture, consisting of NetGAN ($G_1$, $D_1$) and the two motif-biased GANs ($G_2$, $D_2$) and ($G_3$, $D_3$). Each generator $G_i$ produces a set of random walks, while each discriminator $D_i$ determines which walks plausibly come from the input graph; each GAN yields a score matrix. The score matrices are combined under two different schemes to obtain the output graph.

NetGAN has been shown to outperform state-of-the-art graph generative models at preserving various topological features of the input graph (e.g. maximum degree, clustering coefficient, inter- and intra-community edge density) in its generated output. The method also exhibits competitive performance at link prediction on incomplete graphs, which indicates that it is capable of generalization rather than only memorizing the input graph. Despite the efficacy of NetGAN in the above-mentioned tasks, we observe that the graphs generated by NetGAN (as well as other state-of-the-art generative models comparable to NetGAN) fail to approximate the network motif statistics of the input graph. For example, NetGAN systematically underestimates the number of triangles in social networks by 40-60% (see Table 1). This is a major shortcoming for applications that aim to generate graphs that realistically mimic real-world networks or predict unobserved motif structures.

2.2 Multi-MotifGAN (MMGAN)

For our proposed algorithm, we generalize the NetGAN random-walk-based architecture, which lends itself to characterizing the local properties of nodes (depending on how the random walk is performed). To generate the training set of random walks, NetGAN employs a second-order random walk, node2vec, which captures the local and global structure of the graph effectively via a two-step weighting scheme [Grover2016]: given an edge $(v, x)$, suppose that the previous transition of the random walk was from some node $t$ to $v$. The second-order bias is chosen as

$$\alpha_{pq}(t, x) = \begin{cases} \frac{1}{p}, & \text{if } d_{tx} = 0, \\ 1, & \text{if } d_{tx} = 1, \\ \frac{1}{q}, & \text{if } d_{tx} = 2, \end{cases}$$

where $p, q > 0$ and $d_{tx}$ is the shortest-path distance between $t$ and $x$. The unnormalized transition probability from $v$ to $x$ equals

$$\pi_{vx} = \alpha_{pq}(t, x) \cdot w_{vx},$$

where $w_{vx}$ is the weight of edge $(v, x)$ (equal to $1$ for unweighted graphs). In MMGAN, we change this weight to incorporate the 3-node motif statistics of the graph and bias the random walk towards edges that are more likely to be part of a particular network motif. This bias is different from the bias $\alpha_{pq}$ introduced to control the extent of exploration in the graph.
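The node2vec bias and transition scores above can be sketched in a few lines. This is an illustrative implementation of the standard node2vec weighting, with function names of our own choosing; the graph is a plain adjacency-set dict.

```python
def node2vec_alpha(t, x, graph, p, q):
    """Second-order bias alpha_{pq}(t, x) for stepping from v to a neighbor x,
    given that the walk arrived at v from t.  `graph` maps each node to its
    set of neighbors, so the distance d(t, x) can only be 0, 1, or 2."""
    if x == t:                 # d(t, x) = 0: step back to the previous node
        return 1.0 / p
    if x in graph[t]:          # d(t, x) = 1: x is also a neighbor of t
        return 1.0
    return 1.0 / q             # d(t, x) = 2: step away from t

def transition_scores(t, v, graph, p, q, weights=None):
    """Unnormalized transition scores pi(v -> x) = alpha_{pq}(t, x) * w_{vx};
    the edge weight w_{vx} defaults to 1, as in an unweighted graph."""
    w = weights or {}
    return {x: node2vec_alpha(t, x, graph, p, q) * w.get((v, x), 1.0)
            for x in graph[v]}
```

Passing a `weights` dict is the hook where MMGAN's motif-biased edge weights replace the unit weights of the plain walk.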

To find the correct biases, we first count the motifs in the graph of interest. While a complete enumeration of the motifs present in a large-scale network is computationally prohibitive, a number of efficient motif-sampling algorithms exist that approximate the frequencies of different motifs in a network [Masoudi-Nejad2012, Wernicke2006]. In our analysis, we use FANMOD, a fast network motif detection algorithm that can handle both directed and undirected networks and finds motifs containing up to eight nodes [Wernicke2006]. For simplicity, we focus on 3-node motifs, since they represent the most significant structures in social networks and are likely to be contained in other higher-order motifs; this allows one to implicitly include information about higher-order interactions while limiting the complexity of the MMGAN platform. However, it is straightforward to adapt MMGAN to account for motifs of larger sizes, with adjustments in the weight calculation and graph combination procedures so as to account for different nested non-isomorphic motifs.

Using FANMOD, we first estimate the total number of 3-node motifs, which are of the types V and T listed in Figure 1. The concentration of a motif of type $i$ equals

$$c_i = \frac{n_i}{\sum_j n_j},$$

where $n_i$ denotes the number of motifs of type $i$ in the graph. For an edge $(u, v)$, we define $m_i(u, v)$ to be the number of motifs of type $i$ in which the edge participates. Then, the motif-biased weight of the edge equals

$$\tilde{w}_{uv} = \sum_i \beta_i \, m_i(u, v),$$

where each coefficient $\beta_i$ is an appropriate function of the concentration $c_i$, normalized so that $\sum_i \beta_i = 1$.

Thus, $\tilde{w}_{uv}$ is a weighted average of the motif counts, weighted by an appropriate function of concentration. The chosen bias will lead to a higher frequency of the particular motif in the output graph compared to the input graph. In order to obtain motif counts that reflect those of the input, we combine the output score matrices of three GANs whose random walks are biased as follows: without motif weights, as in NetGAN (score matrix $S^{(1)}$); with weights biased towards V motifs ($S^{(2)}$); and with weights biased towards T motifs ($S^{(3)}$). The matrix $S^{(1)}$ leads to a good characterization of the input edge set. From $S^{(2)}$, we obtain a better characterization of the V motifs in the graph when compared to $S^{(1)}$, but with a frequency that is higher than in the input. Similarly, $S^{(3)}$ ensures a good characterization of triangles, albeit once again with a higher count than observed in the input. These three ‘views’ of the graph provide a close approximation of the actual motif frequencies and concentrations once properly combined. To handle different tasks, such as motif generation and link prediction, we propose two different ways of combining the score matrices:
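One plausible reading of the motif-biased weight can be sketched as follows. The choice of coefficients (inverse concentration, normalized to sum to one, so that rarer motifs contribute more) and the unit base weight are our assumptions, not a specification from the paper.

```python
def motif_biased_weights(motif_counts, concentrations):
    """Hypothetical sketch of the motif-biased edge weight: a weighted average
    of the per-edge motif counts m_i(u, v), with coefficients beta_i chosen as
    a function of the global concentrations c_i (here: inverse concentration,
    normalized to sum to one).  A base weight of 1 is kept so that edges in no
    motif remain reachable -- an assumption of this sketch.

    motif_counts:   {motif_type: {edge: count}}
    concentrations: {motif_type: c_i}
    """
    inv = {i: 1.0 / c for i, c in concentrations.items() if c > 0}
    z = sum(inv.values())
    beta = {i: b / z for i, b in inv.items()}       # sum_i beta_i = 1
    edges = next(iter(motif_counts.values())).keys()
    return {e: 1.0 + sum(beta[i] * motif_counts[i][e] for i in motif_counts)
            for e in edges}
```

Biasing towards a single motif type, as done for $S^{(2)}$ and $S^{(3)}$, corresponds to concentrating the coefficient mass on that type.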

I. Multi-view combination for link prediction (MMGAN-Avg): We combine the three score matrices via averaging, resulting in $\bar{S} = \frac{1}{3}\left(S^{(1)} + S^{(2)} + S^{(3)}\right)$. Edges are sampled in the same manner as in NetGAN by first normalizing $\bar{S}$ to produce a transition probability matrix, then selecting a node at random and choosing one of its neighbors according to the above distribution. We add an edge between the corresponding nodes in the output graph and continue adding edges similarly until the same number of edges as in the input graph is reached.
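The averaging-and-normalizing step of scheme I is straightforward; a minimal sketch (function name ours, plain nested lists for the matrices):

```python
def mmgan_avg(score_matrices):
    """Scheme I: average the score matrices of the differently biased GANs,
    then row-normalize so that each row is a transition distribution.

    `score_matrices` is a list of n x n nested lists.
    """
    n = len(score_matrices[0])
    k = len(score_matrices)
    avg = [[sum(m[i][j] for m in score_matrices) / k for j in range(n)]
           for i in range(n)]
    for row in avg:
        total = sum(row)
        if total > 0:                      # leave all-zero rows untouched
            row[:] = [x / total for x in row]
    return avg
```

The resulting matrix plays the same role as NetGAN's normalized score matrix, so edge sampling proceeds unchanged.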

II. Multi-view combination for graph generation (MMGAN): In this scheme, we sample both edges and motifs from the three views at random and add them to the output graph directly, as follows. We first randomly choose one of the views $S^{(1)}$, $S^{(2)}$, $S^{(3)}$ with probabilities $p_1$, $p_2$, $p_3$, respectively. Then, we choose one of two sampling methods, sampling by maximum score or random sampling, with probabilities $\epsilon$ and $1-\epsilon$, respectively, where $\epsilon$ is small to avoid overfitting.

If we choose sampling by maximum score, we first select the edge $(u, v)$ with the highest score in the chosen view. Then, we add the corresponding subgraph structure to the output graph. In more detail, if the chosen view is the unbiased one, we add $(u, v)$ to the output graph. If it is the V-biased view, we find all possible V motifs containing $(u, v)$ and compute the average score of each. Then, we select the motif with the highest average score and add its two edges to the output graph. Similarly, if it is the T-biased view, we compute the average scores of all T motifs (triangles) containing $(u, v)$ and add the three edges of the highest-scoring motif to the output. After adding the edge(s), we remove the corresponding scores from the views to enforce sampling without replacement. We repeat this procedure with the next highest score in the score matrix and continue until the output graph has the same number of edges as the input.

If we choose random sampling, we first select a node $u$ uniformly at random. Then, similar to the previous combination method, we randomly sample two other nodes $v$ and $w$ according to the probability distribution defined by the normalized score matrix. Finally, if the chosen view is the unbiased one, we add the edge $(u, v)$ to the output graph. If it is the V-biased view, we add the V motif with edges $(u, v)$ and $(u, w)$, and if it is the T-biased view, we add the triangle on all three nodes. We continue until the output contains the same number of edges as the input.
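The control flow of scheme II can be sketched as below. This is a deliberately simplified version under stated assumptions: each draw adds a single edge, whereas the full scheme would add the whole two- or three-edge motif for the V- and T-biased views; all names are ours.

```python
import random

def mmgan_generate(score_views, view_probs, eps, num_edges, rng=random.Random(0)):
    """Simplified sketch of combination scheme II: repeatedly pick one of the
    views, then either take its current highest-scoring edge (probability eps,
    'sampling by maximum score') or draw an edge in proportion to its score
    (probability 1 - eps, 'random sampling').

    `score_views` is a list of dicts mapping edge -> score.
    """
    views = [dict(v) for v in score_views]   # copies: sample w/o replacement
    edges = set()
    while len(edges) < num_edges and any(views):
        k = rng.choices(range(len(views)), weights=view_probs)[0]
        if not views[k]:
            continue
        if rng.random() < eps:                         # maximum score
            edge = max(views[k], key=views[k].get)
        else:                                          # proportional sampling
            cand, wts = zip(*views[k].items())
            edge = rng.choices(cand, weights=wts)[0]
        edges.add(edge)
        for v in views:                                # remove used scores
            v.pop(edge, None)
    return edges
```

Removing a chosen edge's score from all views mirrors the sampling-without-replacement step described above.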

Choosing some of the maximum-scoring edges and motifs ensures that the key edges appearing repeatedly in the sampled set of random walks are included in the output graph. The repeated appearances indicate that an edge has a high weight and is therefore part of a larger number of motifs. Every time we sample from these heavily weighted edges, we add an entire motif to the output graph. Thus, adding even a small sample of them leads to a higher frequency of motifs in the output. Furthermore, by adjusting the probabilities $p_1$, $p_2$, and $p_3$, we can control the frequency of the different motif types as needed. This approach leads to a closer approximation of the motif counts in the original graph compared to MMGAN-Avg, at the potential expense of link and motif prediction accuracy.

3 Experimental Results

We test the performance of MMGAN and MMGAN-Avg against NetGAN, which has been shown to outperform a number of other benchmark graph generative models in terms of preserving input graph statistics [Bojchevski2018]. For data, we use three real-world social networks, Cora [sen2008], Citeseer [mccallum2000], and Facebook [leskovec2012learning], with the characteristics described in Table 3. Note that in all of these networks, triangles (T) are statistically significant (they occur with higher frequency in the real network than in randomized networks). Thus, we are generally interested in keeping the triangle count of our generated output comparable to that of the input network.

In each of the experiments described, we train NetGAN, MMGAN, and MMGAN-Avg to 60% edge overlap (one of the methods of early stopping in NetGAN) and average results over 5 runs. We use an 80-20% training and testing split of the total 3-node motifs in the original graph.

Motif-targeted graph generation: We evaluate the ability of MMGAN and MMGAN-Avg to preserve the motif structures of the graph by comparing motif counts and motif concentrations in the output. For this, we combine the multiple score matrices using the combination schemes described in I and II. For II, we set the view-sampling probabilities $p_1 < p_2 < p_3$, emphasizing triangles, and keep $\epsilon$ fixed for every experiment. The choice of the probabilities is governed by the number of edges in the motifs, namely 1, 2, and 3, respectively. The results for both combination schemes I and II are shown for comparison in Tables 1 and 2.

Link and motif prediction: We evaluate the predictive ability of the MMGAN and MMGAN-Avg as follows. For motif prediction, we use the test set of motifs held out during training and construct an equally-sized set of test non-motifs. For link prediction, we use the corresponding edges as test edges and non-edges.

Dataset    Motif   Input       Motif Count                           Normalized Motif Count (error)
                               NetGAN      MMGAN      MMGAN-Avg      NetGAN           MMGAN            MMGAN-Avg
Citeseer   V       22,763      18,369      23,280     17,464         0.8069 (0.1931)  1.0227 (0.0227)  0.7672 (0.2328)
           T       1084        632         1285       722            0.5830 (0.4170)  1.1854 (0.1854)  0.6661 (0.3339)
Cora       V       47,239      39,401      58,967     35,640         0.8340 (0.1660)  1.2426 (0.2426)  0.7546 (0.2454)
           T       1558        796         1819       1006           0.5110 (0.4890)  1.1675 (0.1675)  0.6457 (0.3543)
Facebook   V       1,238,448   1,337,952   1,204,147  1,329,432      1.0803 (0.0803)  0.9723 (0.0277)  1.0735 (0.0735)
           T       420,329     233,566     168,607    236,144        0.5557 (0.4443)  0.4011 (0.5989)  0.5618 (0.4382)

Table 1: Raw motif counts in the generated graphs, with counts normalized with respect to the input count for easier comparison. A dark shade denotes the best result (least error) over all methods, and a light shade denotes any of our methods that outperforms NetGAN.
Dataset    Motif   Input     Motif Concentration               KL Divergence
                             NetGAN    MMGAN     MMGAN-Avg     NetGAN    MMGAN     MMGAN-Avg
Citeseer   V       95.45%    96.68%    94.75%    96.03%        0.2764    0.0777    0.0583
           T       4.55%     3.32%     5.25%     3.97%
Cora       V       96.81%    98.02%    97.00%    97.25%        0.3942    0.0086    0.0474
           T       3.19%     1.98%     3.00%     2.75%
Facebook   V       74.66%    85.14%    87.72%    84.92%        4.6922    7.5672    4.4839
           T       25.34%    14.86%    12.28%    15.08%
Table 2: Motif distributions in the generated graphs, compared using the Kullback-Leibler divergence with respect to the input distribution (one divergence per method and dataset).
Network    N      E        c_V      c_T      c_V^rand   c_T^rand
Cora       2485   10,138   96.81%   3.19%    99.97%     0.03%
Citeseer   2118   7358     95.45%   4.55%    99.94%     0.06%
Facebook   1034   53,498   74.66%   25.34%   96.68%     3.32%
Table 3: Statistics of the real-world networks used for testing. N and E are the number of nodes and edges in the largest connected component of the graph, respectively. c_V and c_T represent the concentration of each motif type (the proportion of motifs of each type among all 3-node motifs). c_V^rand and c_T^rand show the average concentration of each motif type in a set of graphs drawn from the random graph model.

We use the average scores of these test motifs and edges to compute two metrics: AUC (Area Under the Curve of the Receiver Operating Characteristic) and AP (Average Precision), which are standard metrics for link prediction evaluation [Bojchevski2018]. Tables 4 and 5 show the results under each metric.

4 Discussion

While all three algorithms are quite successful, MMGAN-Avg outperforms all other methods on every dataset under all metrics and should be the method of choice for motif prediction. The two GAN-combining schemes essentially trade off exploration and exploitation in different ways. MMGAN targets edges that are more likely to produce motifs and adds them to the output, thus ensuring motif counts close to those of the input. MMGAN-Avg, on the other hand, incorporates information from all three views equally, resulting in a graph that better reflects the edge connectivity of the input network. Nevertheless, it appears plausible that large-scale tuning of the motif sampling probabilities and of the proportions of maximum-score and random selection in MMGAN may lead to improved performance compared to MMGAN-Avg. These experiments will be described in the full version of the paper.

We further note that even without explicitly incorporating statistics of 4-node motifs of the input network, MMGAN approximates their counts better than NetGAN. For example, we compare the square (4-node cycle) counts in the outputs of models trained on Citeseer. NetGAN generates graphs with an average normalized count of 0.1204, while MMGAN achieves an average normalized count of 0.3012 in its output graphs. This supports our assumption that, since 3-node motifs are likely to be contained in other higher-order motifs, using only the 3-node motif statistics still allows us to implicitly include information about the higher-order motifs.

Dataset Type NetGAN MMGAN MMGAN-Avg
Citeseer Link 0.9599 0.9265 0.9675
Motif 0.9974 0.9958 0.9982
Cora Link 0.9159 0.8947 0.9340
Motif 0.9961 0.9907 0.9977
Facebook Link 0.9779 0.9751 0.9981
Motif 0.9733 0.9585 0.9770
Table 4: Link and motif prediction quality measured using Area Under the Curve (AUC).
Dataset Type NetGAN MMGAN MMGAN-Avg
Citeseer Link 0.9655 0.9391 0.9730
Motif 0.9962 0.9950 0.9970
Cora Link 0.9223 0.9010 0.9429
Motif 0.9959 0.9902 0.9969
Facebook Link 0.9735 0.9743 0.9816
Motif 0.9578 0.9337 0.9632
Table 5: Link and motif prediction quality measured using Average Precision.

References
