Minfer: Inferring Motif Statistics From Sampled Edges

# Minfer: Inferring Motif Statistics From Sampled Edges

###### Abstract

Characterizing motif (i.e., locally connected subgraph patterns) statistics is important for understanding complex networks such as online social networks and communication networks. Previous work made the strong assumption that the graph topology of interest is known, and that the dataset either fits into main memory or is stored on disk such that it is not expensive to obtain all neighbors of any given node. In practice, researchers have to deal with the situation where the graph topology is unknown, either because the graph is dynamic, or because it is expensive to collect and store all topological and meta information on disk. Hence, what is available to researchers is only a snapshot of the graph generated by sampling edges from the graph at random, which we call a “RESampled graph”. Clearly, a RESampled graph’s motif statistics may be quite different from those of the underlying original graph. To solve this challenge, we propose a framework and implement a system called Minfer, which can take the given RESampled graph and accurately infer the underlying graph’s motif statistics. We also use Fisher information to bound the errors of our estimates. Experiments using large scale datasets show our method to be accurate.

Pinghui Wang¹, John C.S. Lui², and Don Towsley³
¹Noah’s Ark Lab, Huawei, Hong Kong
²Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong
³Department of Computer Science, University of Massachusetts Amherst, MA, USA
{wang.pinghui}@huawei.com, cslui@cse.cuhk.edu.hk, towsley@cs.umass.edu

Complex networks are widely studied across many fields of science and technology, from physics to biology, and from nature to society. Networks which have similar global topological features such as degree distribution and graph diameter can exhibit significant differences in their local structures. There is a growing interest in exploring these local structures (also known as “motifs”), which are small connected subgraph patterns that form during the growth of a network. Motifs have many applications: for example, they are used to characterize communication and evolution patterns in online social networks (OSNs) [?, ?, ?, ?], pattern recognition in gene expression profiling [?], protein-protein interaction prediction [?], and coarse-grained topology generation of networks [?]. For instance, 3-node motifs such as “the friend of my friend is my friend” and “the enemy of my enemy is my friend” are well known evolution patterns in signed (i.e., friend/foe) social networks. Kunegis et al. [?] considered the significance of motifs in Slashdot Zoo (www.slashdot.org) and how they impact the stability of signed networks. Other more complex examples include 4-node motifs such as bi-fans and bi-parallels defined in [?].

Although motifs are important characteristics that help researchers understand the underlying network, one major technical hurdle is that it is computationally expensive to compute motif frequencies, since this requires one to enumerate and count all subgraphs in a network, and there exist a large number of subgraphs even for a medium size network with fewer than one million edges. For example, the graphs Slashdot [?] and Epinions [?], which contain approximately nodes and edges have more than 4-node connected and induced subgraphs [?]. To address this problem, several sampling methods have been proposed to estimate the frequency distribution of motifs [?, ?, ?, ?]. All these methods require that the entire graph topology fit into memory, or that an I/O efficient neighbor query API be available so that one can explore the graph topology, which is stored on disk. In summary, previous work focuses on designing computationally efficient methods to characterize motifs when the entire graph of interest is given.

In practice the graph of interest may not be known, but instead the available dataset is a subgraph sampled from the original graph. This can be due to the following reasons:

• Data collection: Sampling is inevitable for collecting a large dynamic graph given as a high speed stream of edges. For example, sampling is used to collect network traffic on backbone routers in order to study the network graph, where a node in the graph represents a host and an edge $(u, v)$ represents a connection from host $u$ to host $v$, because capturing the entire traffic is prohibitive given the high traffic speed and the limited resources (e.g., memory and computation) of network devices.

• Data transportation: Sampling may also be required to reduce the high cost of transporting an entire dataset to a remote data analysis center.

• Memory and computation: Sometimes the graph of interest is given in a memory expensive format such as a raw text file, and may be too large to fit into memory. Moreover, it may be too expensive to preprocess and organize it on disk. In such cases, it may be useful to build a relatively small graph consisting of edges sampled from the graph file at random, and compute its motif statistics in memory.

A simple example is given in Fig. [?], where the sampled graph $G^*$ is derived from the dataset representing $G$. In this work, we assume the available graph $G^*$ is obtained through random edge sampling (i.e., each edge is independently sampled with the same probability $p$), which is popular and easy to implement in practice. Formally, we denote the graph $G^*$ as a RESampled graph of $G$. One can easily see that a RESampled graph’s motif statistics will differ from those of the original graph due to uncertainties introduced by sampling. For example, Fig. [?] shows that $s^*$ is a 4-node induced subgraph in the RESampled graph $G^*$, and we do not know from which original induced subgraph $s$ in $G$ it derives. $s$ could be any one of the five subgraphs depicted in Fig. [?].

Unlike previous methods [?, ?, ?, ?], we aim to design an accurate method to infer motif statistics of the original graph $G$ from the available RESampled graph $G^*$. These previous methods focus on designing computationally efficient sampling methods based on sampling induced subgraphs in $G$ to avoid the problem shown in Fig. [?]. Hence they fail to infer the underlying graph’s motif statistics from the given RESampled graph. The gSH method in [?] can be used to estimate the number of connected subgraphs from sampled edges. However, it cannot be applied to characterize motifs, i.e., connected and induced subgraphs (or CISes), because motif statistics can differ from connected subgraphs’ statistics. For example, Fig. [?] shows that of a graph’s 4-node connected subgraphs are isomorphic to a 4-node line (i.e., the first motif in Fig. [?] (b)), while of its 4-node CISes are isomorphic to a 4-node line.

Contribution: Our contribution can be summarized as: To the best of our knowledge, we are the first to study and provide an accurate and efficient solution to estimate motif statistics from a given RESampled graph. We introduce a probabilistic model to study the relationship between motifs in the RESampled graph and in the underlying graph. Based on this model, we propose an accurate method, Minfer, to infer the underlying graph’s motif statistics from the RESampled graph. We also provide a Fisher information based method to bound the error of our estimates. Experiments on a variety of real world datasets show that our method can accurately estimate the motif statistics of a graph based on a small RESampled graph.

This paper is organized as follows: The problem formulation is presented in Section [?]. Section [?] presents our method (i.e., Minfer) for inferring subgraph class concentrations of the graph under study from a given RESampled graph. Section [?] presents methods for computing the given RESampled graph’s motif statistics. The performance evaluation and testing results are presented in Section [?]. Section [?] summarizes related work. Concluding remarks then follow.

In this section, we first introduce the concept of motif concentration, then we discuss the challenges of computing motif concentrations in practice.

Denote the underlying graph of interest as a labeled undirected graph $G = (V, E, L)$, where $V$ is a set of nodes, $E$ is a set of undirected edges, $E \subseteq V \times V$, and $L$ is a set of labels associated with the edges in $E$. For example, we attach a label to indicate the direction of the edge for a directed network. Edges may have other labels too; for instance, in a signed network, edges have positive or negative labels to represent a friend or foe relationship. If $L$ is empty, then $G$ is an unlabeled undirected graph, which is equivalent to the regular undirected graph.

Motif concentration is a metric that represents the distribution of various subgraph patterns that appear in $G$. To illustrate, we show the 3-, 4- and 5-node subgraph patterns in Figs. [?], [?], and [?] respectively. To define motif concentration formally, we first need to introduce some notation. For ease of presentation, Table [?] depicts the notation used in this paper.

An induced subgraph of $G$, $G' = (V', E', L')$ with $V' \subseteq V$, $E' \subseteq E$, and $L' \subseteq L$, is a subgraph whose edges and associated labels are all in $G$, i.e., $E' = \{(u, v) \in E : u, v \in V'\}$ and $L'$ consists of the labels of the edges in $E'$. We define $C^{(k)}$ as the set of all connected induced subgraphs (CISes) with $k$ nodes in $G$, and denote $n^{(k)} = |C^{(k)}|$. For example, Fig. [?] depicts all possible 4-node CISes. Let $T_k$ denote the number of $k$-node motifs and $M_i^{(k)}$ denote the $i$-th $k$-node motif. For example, $T_4 = 6$ and $M_1^{(4)}, \ldots, M_6^{(4)}$ are the six 4-node undirected motifs depicted in Fig. [?] (b). Then we partition $C^{(k)}$ into $T_k$ equivalence classes, or $C_1^{(k)}, \ldots, C_{T_k}^{(k)}$, where CISes within $C_i^{(k)}$ are isomorphic to $M_i^{(k)}$.

Let $n_i^{(k)}$ denote the frequency of the motif $M_i^{(k)}$, i.e., the number of the CISes in $G$ isomorphic to $M_i^{(k)}$. Formally, we have $n_i^{(k)} = |C_i^{(k)}|$, which is the number of CISes in $C_i^{(k)}$. Then the concentration of $M_i^{(k)}$ is defined as

$$\omega_i^{(k)} = \frac{n_i^{(k)}}{n^{(k)}}, \quad 1 \le i \le T_k.$$

Thus, $\omega_i^{(k)}$ is the fraction of $k$-node CISes isomorphic to the motif $M_i^{(k)}$ among all $k$-node CISes. In this paper, we make the following assumptions:

• Assumption 1: The complete $G$ is not available to us, but a RESampled graph $G^* = (V^*, E^*, L^*)$ of $G$ is given, where $V^*$, $E^*$, and $L^*$ are the node, edge, and edge label sets of $G^*$ respectively. $G^*$ is obtained by random edge sampling, i.e., each edge in $E$ is independently sampled with the same probability $p$, where $p$ is known in advance.

• Assumption 2: The label of a sampled edge $(u, v) \in E^*$ is the same as that of $(u, v)$ in $G$, i.e., sampling does not alter edge labels.

These two assumptions are satisfied by many applications’ data collection procedures. For instance, the source data of online applications such as network traffic monitoring is given as a stream of directed edges, and the following simple method is computationally and memory efficient for collecting edges and generating a small RESampled graph, which will be sent to a remote network traffic analysis center: Each incoming directed edge $(u, v)$ is sampled when its hash value $\tau(u, v)$ falls below a threshold determined by $\rho$ and $p$, where $\rho$ is an integer (e.g., 10,000) and $\tau(u, v)$ is a hash function satisfying $\tau(u, v) = \tau(v, u)$ and mapping edges into integers uniformly. The property $\tau(u, v) = \tau(v, u)$ guarantees that edges $(u, v)$ and $(v, u)$ are sampled or discarded simultaneously. Hence the label of a sampled edge $(u, v) \in E^*$ is the same as that of $(u, v)$ in $G$. Using universal hashing [?], a simple instance of $\tau(u, v)$ is given as the following function when each $v \in V$ is an integer smaller than $\Delta$:

$$\tau(u, v) = \Big(\big(a(\min\{u, v\}\Delta + \max\{u, v\}) + b\big) \bmod \gamma\Big) \bmod \rho,$$

where $\gamma$ is a prime larger than $\Delta^2$, and $a$ and $b$ are any integers with $a \in \{1, \ldots, \gamma - 1\}$ and $b \in \{0, \ldots, \gamma - 1\}$. We can easily find that $\tau(u, v) = \tau(v, u)$ and that $\tau(u, v)$ maps edges into integers uniformly. The computational and space complexities of the above sampling method are both $O(1)$, which makes it easy to use for data collection in practice. As alluded to before, in this paper we aim to accurately infer the motif concentrations of $G$ based on the given RESampled graph $G^*$.
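The hash-based sampling rule above can be sketched in a few lines of Python. This is only an illustrative sketch: the parameter values (`DELTA`, `GAMMA`, `RHO`, `a`, `b`) are hypothetical, and the strict threshold test `tau(u, v) < rho * p` is an assumption about how the hash value is compared against the sampling probability.

```python
def make_tau(a, b, gamma, rho, delta):
    """Build a symmetric hash tau(u, v) with values in {0, ..., rho - 1}."""
    def tau(u, v):
        lo, hi = min(u, v), max(u, v)  # ordering the pair makes tau symmetric
        return ((a * (lo * delta + hi) + b) % gamma) % rho
    return tau


def sample_edges(edges, p, tau, rho):
    """Keep an edge when its hash is below rho * p (assumed threshold rule)."""
    threshold = rho * p
    return [(u, v) for (u, v) in edges if tau(u, v) < threshold]


# Hypothetical parameters: node ids are below DELTA, GAMMA is a prime > DELTA**2.
DELTA = 1000
GAMMA = 1000003
RHO = 10000
tau = make_tau(a=12345, b=6789, gamma=GAMMA, rho=RHO, delta=DELTA)

# A toy stream of directed edges; both directions of a pair share one hash value.
edges = [(u, v) for u in range(200) for v in range(200) if u != v]
sampled = sample_edges(edges, p=0.1, tau=tau, rho=RHO)
```

Because the decision depends only on the unordered pair, no per-edge state has to be stored, which is what keeps the collector cheap.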

The motif statistics of the RESampled graph $G^*$ and the original graph $G$ can be quite different. In this section, we introduce a probabilistic model to bridge the gap between the motif statistics of $G^*$ and $G$. Using this model, we will show there exists a simple and concise relationship between the motif statistics of $G$ and $G^*$. We then propose an efficient method to infer the motif concentration of $G$ from $G^*$. Finally, we also give a method to construct confidence intervals of our estimates of motif concentrations.

To estimate the motif statistics of $G$ based on $G^*$, we develop a probabilistic method to model the relationship between the motifs in $G^*$ and $G$. Define $P = [P_{i,j}]_{1 \le i, j \le T_k}$, where $P_{i,j}$ is the probability that $s^*$ is isomorphic to motif $M_i^{(k)}$ given that $s$ is isomorphic to motif $M_j^{(k)}$, i.e., $P_{i,j} = P(M(s^*) = M_i^{(k)} \mid M(s) = M_j^{(k)})$.

To obtain $P_{i,j}$, we first compute $\phi_{i,j}$, which is the number of subgraphs of $M_j^{(k)}$ isomorphic to $M_i^{(k)}$. For example, $M_2^{(3)}$ (i.e., the triangle) includes three subgraphs isomorphic to $M_1^{(3)}$ (i.e., the wedge) for the undirected graph shown in Fig. [?] (a). Thus, we have $\phi_{1,2} = 3$ for 3-node undirected motifs. When $i = j$, $\phi_{i,j} = 1$. It is not easy to compute $\phi_{i,j}$ manually for 4- and 5-node motifs. Hence we provide a simple method to compute $\phi_{i,j}$ in Algorithm 1. The computational complexity is . Note that the cost of computing $\phi_{i,j}$ is not a big concern because in practice, $k$ is usually five or less for motif discovery. Denote by $V(s)$ and $E(s)$ the sets of nodes and edges in subgraph $s$ respectively. We have the following equation

$$P_{i,j} = \phi_{i,j}\, p^{|E(M_i^{(k)})|}\, q^{|E(M_j^{(k)})| - |E(M_i^{(k)})|},$$

where $q = 1 - p$. For all CISes in $G$ isomorphic to $M_j^{(k)}$, the above model tells us that approximately a fraction $P_{i,j}$ of these CISes are expected to appear as CISes isomorphic to $M_i^{(k)}$ in $G^*$.
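For small $k$, the counts $\phi_{i,j}$ and the matrix $P$ can be checked by brute force. The sketch below is not the paper's Algorithm 1; it is a stand-in that tries every edge subset of one motif and compares canonical forms over all node relabelings, which is affordable only because $k \le 5$. The motif encodings (`WEDGE`, `TRIANGLE`) and function names are illustrative assumptions.

```python
from itertools import combinations, permutations


def canon(edges, k):
    """Canonical form of a k-node graph: smallest edge list over all relabelings."""
    best = None
    for perm in permutations(range(k)):
        relabeled = tuple(sorted(tuple(sorted((perm[u], perm[v]))) for u, v in edges))
        if best is None or relabeled < best:
            best = relabeled
    return best


def phi(motif_i, motif_j, k):
    """Number of edge subsets of motif_j that are isomorphic to motif_i."""
    target = canon(motif_i, k)
    return sum(1 for sub in combinations(motif_j, len(motif_i))
               if canon(sub, k) == target)


def transition_matrix(motifs, k, p):
    """P[i][j] = phi_ij * p^|E(M_i)| * q^(|E(M_j)| - |E(M_i)|), with q = 1 - p."""
    q = 1.0 - p
    size = len(motifs)
    matrix = [[0.0] * size for _ in range(size)]
    for i, m_i in enumerate(motifs):
        for j, m_j in enumerate(motifs):
            count = phi(m_i, m_j, k)
            if count:  # zero whenever m_i cannot be obtained from m_j
                matrix[i][j] = count * p ** len(m_i) * q ** (len(m_j) - len(m_i))
    return matrix


# 3-node undirected motifs as edge lists over nodes 0..2 (illustrative encoding).
WEDGE = [(0, 1), (1, 2)]
TRIANGLE = [(0, 1), (1, 2), (0, 2)]
P3 = transition_matrix([WEDGE, TRIANGLE], k=3, p=0.5)
```

For the two 3-node motifs this reproduces $\phi_{1,2} = 3$ and an upper-triangular $P$, since sampling can only remove edges.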

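Using the above probabilistic model, we propose a method Minfer to estimate motif statistics of $G$ from $G^*$. Denote by $m_i^{(k)}$, $1 \le i \le T_k$, the number of CISes in $G^*$ isomorphic to the motif $M_i^{(k)}$. The method to compute $m_i^{(k)}$ is presented in the next section. Then, the expectation of $m_i^{(k)}$ is computed as

$$E[m_i^{(k)}] = \sum_{1 \le j \le T_k} n_j^{(k)} P_{i,j}. \tag{1}$$

In matrix notation, Equation (1) can be expressed as

$$E[\mathbf{m}^{(k)}] = P\,\mathbf{n}^{(k)},$$

where $P = [P_{i,j}]_{1 \le i, j \le T_k}$, $\mathbf{n}^{(k)} = (n_1^{(k)}, \ldots, n_{T_k}^{(k)})^{\mathsf T}$, and $\mathbf{m}^{(k)} = (m_1^{(k)}, \ldots, m_{T_k}^{(k)})^{\mathsf T}$. Then, we have

$$\mathbf{n}^{(k)} = P^{-1} E[\mathbf{m}^{(k)}].$$

Thus, we estimate $\mathbf{n}^{(k)}$ as

$$\hat{\mathbf{n}}^{(k)} = P^{-1} \mathbf{m}^{(k)},$$

where $\hat{\mathbf{n}}^{(k)} = (\hat{n}_1^{(k)}, \ldots, \hat{n}_{T_k}^{(k)})^{\mathsf T}$. We easily have

$$E[\hat{\mathbf{n}}^{(k)}] = E[P^{-1} \mathbf{m}^{(k)}] = P^{-1} E[\mathbf{m}^{(k)}] = \mathbf{n}^{(k)},$$

therefore $\hat{\mathbf{n}}^{(k)}$ is an unbiased estimator of $\mathbf{n}^{(k)}$. Finally, we estimate $\omega_i^{(k)}$ as follows

$$\hat{\omega}_i^{(k)} = \frac{\hat{n}_i^{(k)}}{\sum_{j=1}^{T_k} \hat{n}_j^{(k)}}, \quad 1 \le i \le T_k. \tag{2}$$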

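Denote by $\rho_i^{(k)}$ the concentration of motif $M_i^{(k)}$ in $G^*$, i.e., $\rho_i^{(k)} = m_i^{(k)} / \sum_{j=1}^{T_k} m_j^{(k)}$. Then we observe that (2) is equivalent to the following equation, which directly describes the relationship between motif concentrations of $G$ and $G^*$. Let $\boldsymbol{\rho} = (\rho_1^{(k)}, \ldots, \rho_{T_k}^{(k)})^{\mathsf T}$ and $\hat{\boldsymbol{\omega}} = (\hat{\omega}_1^{(k)}, \ldots, \hat{\omega}_{T_k}^{(k)})^{\mathsf T}$. Then

$$\hat{\boldsymbol{\omega}} = \frac{P^{-1} \boldsymbol{\rho}}{W}, \tag{3}$$

where $W = \mathbf{1}^{\mathsf T} P^{-1} \boldsymbol{\rho}$ is a normalizer that makes the entries of $\hat{\boldsymbol{\omega}}$ sum to one. For 3-node undirected motifs, $P$ is computed as

$$P = \begin{pmatrix} p^2 & 3qp^2 \\ 0 & p^3 \end{pmatrix},$$

and the inverse of $P$ is

$$P^{-1} = \begin{pmatrix} p^{-2} & -3qp^{-3} \\ 0 & p^{-3} \end{pmatrix}.$$

Expressions for $P$ and $P^{-1}$ for 3-node signed undirected motifs, 3-node directed motifs, 4-node undirected motifs, and 5-node undirected motifs can be found in the Appendix.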
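As a concrete instance of the estimator, the 3-node undirected case can be written out directly from the closed-form inverse given above. The function name `minfer_3node` and its argument names are illustrative assumptions; the inputs are the wedge and triangle counts observed in the RESampled graph and the known sampling probability $p$.

```python
def minfer_3node(m_wedge, m_triangle, p):
    """Estimate 3-node motif counts and concentrations of G from counts in G*.

    Applies P^{-1} = [[p^-2, -3 q p^-3], [0, p^-3]] to (m_wedge, m_triangle).
    """
    q = 1.0 - p
    n_wedge = m_wedge / p**2 - 3.0 * q * m_triangle / p**3
    n_triangle = m_triangle / p**3
    total = n_wedge + n_triangle
    counts = (n_wedge, n_triangle)
    concentrations = (n_wedge / total, n_triangle / total)
    return counts, concentrations


# Feeding in the expected sampled counts for n = (100, 10) and p = 0.5,
# i.e. E[m] = P n = (28.75, 1.25), returns the original counts.
counts, concentrations = minfer_3node(28.75, 1.25, p=0.5)
```

With real sampled counts the estimate fluctuates around the true value, and for small $p$ the negative off-diagonal term can even push the wedge estimate below zero, so a practical implementation may want to clip or flag such cases.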

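It is difficult to directly analyze the errors of our estimate $\hat{\boldsymbol{\omega}}^{(k)}$, because it is complex to model the dependence of sampled CISes due to their shared edges and nodes. Instead, we derive a lower bound on the mean squared error (MSE) of $\hat{\boldsymbol{\omega}}^{(k)}$ using the Cramér-Rao lower bound (CRLB) of $\hat{\boldsymbol{\omega}}^{(k)}$, which gives the smallest MSE that any unbiased estimator of $\boldsymbol{\omega}^{(k)}$ can achieve. For a $k$-node CIS $s$ selected from the $k$-node CISes of $G$ at random, the probability that $s$ is isomorphic to the $j$-th $k$-node motif is $\omega_j^{(k)}$. Let $s^*$ be the induced subgraph of the node set $V(s)$ in the RESampled graph $G^*$. Clearly, $s^*$ may not be connected. Furthermore, there may exist nodes in $V(s)$ that are not present in $G^*$. We say $s$ is evaporated in $G^*$ for these two scenarios. Let $P_{0,j}$ denote the probability that $s$ is evaporated given that its original CIS is isomorphic to the $j$-th $k$-node motif. Then, we have

$$P_{0,j} = 1 - \sum_{l=1}^{T_k} P_{l,j}.$$

For a random $k$-node CIS $s$ of $G$, the probability that its associated $s^*$ in $G^*$ is isomorphic to the $i$-th $k$-node motif is

$$\xi_i = P(M(s^*) = M_i^{(k)}) = \sum_{j=1}^{T_k} P_{i,j}\,\omega_j^{(k)}, \quad 1 \le i \le T_k,$$

and the probability that $s$ is evaporated is

$$\xi_0 = \sum_{j=1}^{T_k} P_{0,j}\,\omega_j^{(k)}.$$

When $s$ is evaporated, we denote $M(s^*) = 0$. Then, the likelihood function of $M(s^*)$ with respect to $\boldsymbol{\omega}^{(k)}$ is

$$f(i \mid \boldsymbol{\omega}^{(k)}) = \xi_i, \quad 0 \le i \le T_k.$$

The Fisher information of $M(s^*)$ with respect to $\boldsymbol{\omega}^{(k)}$ is defined as a matrix $J = [J_{i,j}]_{1 \le i, j \le T_k}$, where

$$J_{i,j} = E\left[\frac{\partial \ln f(l \mid \boldsymbol{\omega}^{(k)})}{\partial \omega_i} \frac{\partial \ln f(l \mid \boldsymbol{\omega}^{(k)})}{\partial \omega_j}\right] = \sum_{l=0}^{T_k} \frac{\partial \ln f(l \mid \boldsymbol{\omega}^{(k)})}{\partial \omega_i} \frac{\partial \ln f(l \mid \boldsymbol{\omega}^{(k)})}{\partial \omega_j}\, \xi_l = \sum_{l=0}^{T_k} \frac{P_{l,i} P_{l,j}}{\xi_l}.$$

For simplicity, we assume that the CISes of $G$ are independent (i.e., they have no overlapping edges). Then the Fisher information matrix of all $k$-node CISes is $n^{(k)} J$. The Cramér-Rao Theorem states that the MSE of any unbiased estimator is lower bounded by the inverse of the Fisher information matrix, i.e.,

$$\mathrm{MSE}(\hat{\omega}_i^{(k)}) = E[(\hat{\omega}_i^{(k)} - \omega_i^{(k)})^2] \ge \frac{\left(J^{-1} - \boldsymbol{\omega}^{(k)} (\boldsymbol{\omega}^{(k)})^{\mathsf T}\right)_{i,i}}{n^{(k)}}$$

provided some weak regularity conditions hold [?]. Here the term $\boldsymbol{\omega}^{(k)} (\boldsymbol{\omega}^{(k)})^{\mathsf T}$ corresponds to the accuracy gain obtained by accounting for the constraint $\sum_{i=1}^{T_k} \omega_i^{(k)} = 1$.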
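To make the bound tangible, the following sketch evaluates it for 3-node undirected motifs, where $J$ is a $2 \times 2$ matrix and can be inverted by hand. It assumes the constraint term enters through the diagonal entries $(\omega_i^{(k)})^2$ of $\boldsymbol{\omega}^{(k)} (\boldsymbol{\omega}^{(k)})^{\mathsf T}$, and the function name `crlb_3node` is hypothetical.

```python
def crlb_3node(omega, p, n_cis):
    """Lower bound on MSE(omega_hat_i) for wedge/triangle concentrations.

    omega: true concentrations (wedge, triangle); p: edge sampling probability;
    n_cis: number of 3-node CISes in G, treated as independent.
    """
    q = 1.0 - p
    P = [[p**2, 3.0 * q * p**2],
         [0.0, p**3]]
    # Row 0 holds the evaporation probabilities P_{0,j} = 1 - sum_l P_{l,j}.
    evaporated = [1.0 - P[0][j] - P[1][j] for j in range(2)]
    rows = [evaporated] + P
    xi = [sum(row[j] * omega[j] for j in range(2)) for row in rows]
    # J_{i,j} = sum_l P_{l,i} P_{l,j} / xi_l, skipping outcomes with xi_l = 0.
    J = [[sum(row[i] * row[j] / x for row, x in zip(rows, xi) if x > 0.0)
          for j in range(2)] for i in range(2)]
    det = J[0][0] * J[1][1] - J[0][1] * J[1][0]
    inv_diag = (J[1][1] / det, J[0][0] / det)
    return tuple((inv_diag[i] - omega[i] ** 2) / n_cis for i in range(2))


full = crlb_3node((0.8, 0.2), p=1.0, n_cis=100)   # no sampling loss
half = crlb_3node((0.8, 0.2), p=0.5, n_cis=100)   # half of the edges kept
```

With $p = 1$ the bound reduces to the multinomial variance $\omega_i(1 - \omega_i)/n^{(k)}$, and it grows as $p$ decreases, which matches the intuition that sparser samples carry less information.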

The existing generalized graph enumeration method [?] can be used for enumerating all $k$-node CISes in the RESampled graph $G^*$; however, it is complex to apply and is inefficient for small values of $k$. In this section, we first present a method (an extension of the NodeIterator++ method in [?]) to enumerate all 3-node CISes in $G^*$. Then, we propose new methods to enumerate 4- and 5-node CISes in $G^*$ respectively. In what follows we denote by $N(u)$ the neighbors of $u$ in $G^*$. Note that in this section $G^*$ is the default graph when a function’s underlying graph is omitted for simplicity. For example, the CIS with nodes $u$, $v$, and $w$ refers to the CIS with nodes $u$, $v$, and $w$ in $G^*$.

Algorithm 2 shows our 3-node CISes enumeration method. Similar to the NodeIterator++ method in [?], we “pivot” (the associated operation is discussed later) each node $u$ to enumerate CISes including $u$. For any two neighbors $v$ and $w$ of $u$, we can easily find that the induced graph with nodes $u$, $v$, and $w$ is a 3-node CIS. Thus, we enumerate all pairs of two nodes in $N(u)$, and update their associated 3-node CIS for $u$. We call this process “pivoting” $u$ for 3-node CISes.

Clearly, a 3-node CIS $s$ is counted three times when the associated undirected graph of $s$, obtained by discarding edge labels, is isomorphic to a triangle: once by pivoting each of its nodes $u$, $v$, and $w$. Let $\prec$ be a total order on all of the nodes, which can be easily defined and obtained, e.g., from array position or pointer addresses. To ensure each CIS is enumerated once and only once, we let one and only one node in each CIS be “responsible” for making sure the CIS gets counted. When we “pivot” $u$ and enumerate a CIS $s$, $s$ is counted if $u$ is the “responsible” node of $s$. Otherwise, $s$ is discarded and not counted. We use the same method as in [?, ?], i.e., we let the node with lowest order in a CIS whose associated undirected graph is isomorphic to a triangle be the “responsible” node. For the other classes of CISes, their associated undirected graphs are isomorphic to an unclosed wedge, i.e., the first motif in Fig. [?] (a). For each of these CISes, we let the node in the middle of its associated undirected graph (e.g., the node with degree 2 in the unclosed wedge) be the “responsible” node.
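The pivoting rule for 3-node CISes can be sketched as follows for an unlabeled undirected graph. This is a simplified illustration, not Algorithm 2 itself: it only counts wedges and triangles, it uses Python's built-in ordering of node ids as the total order, and the helper names `build_adj` and `count_3node_cises` are assumptions.

```python
from itertools import combinations


def build_adj(edges):
    """Adjacency sets for an undirected graph given as a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def count_3node_cises(adj):
    """Return (wedges, triangles), counting each CIS once via its responsible node."""
    wedges = triangles = 0
    for u in adj:  # pivot u
        for v, w in combinations(sorted(adj[u]), 2):  # v comes before w
            if w in adj[v]:
                # Triangle: only the lowest-order node is responsible.
                if u < v:
                    triangles += 1
            else:
                # Unclosed wedge: the middle node u is responsible.
                wedges += 1
    return wedges, triangles


# A triangle {0, 1, 2} with a pendant node 3 attached to node 2.
adj = build_adj([(0, 1), (1, 2), (0, 2), (2, 3)])
wedges, triangles = count_3node_cises(adj)
```

The work per pivot is quadratic in its degree, so high-degree nodes dominate the running time, which is one reason the sampled graph is much cheaper to process than the original.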

Algorithm 3 shows our 4-node CISes enumeration method. To enumerate 4-node CISes, we “pivot” each node $u$ as follows: For each pair of $u$’s neighbors $v$ and $w$ where $v \prec w$, we compute the neighborhood of $u$, $v$, and $w$, defined as $N(u, v, w) = N(u) \cup N(v) \cup N(w) - \{u, v, w\}$. For any node $x \in N(u, v, w)$, we observe that the induced graph $s$ consisting of nodes $u$, $v$, $w$, and $x$ is a 4-node CIS. Thus, we enumerate each node $x$ in $N(u, v, w)$, and update the 4-node CIS consisting of $u$, $v$, $w$, and $x$. We repeat this process until all pairs of $u$’s neighbors $v$ and $w$ are enumerated and processed.

Similar to 3-node CISes, some 4-node CISes might be enumerated and counted more than once when we “pivot” each node $u$ as above. To solve this problem, we propose the following method for making sure each 4-node CIS $s$ is enumerated and gets counted once and only once: When and , we discard $s$. Otherwise, consider the associated undirected graph of $s$ obtained by discarding edge labels. When this graph includes one and only one node having at least 2 neighbors in it, we let that node be the “responsible” node of $s$. For example, node 4 is the “responsible” node of the first subgraph in Fig. [?]. When this graph includes more than one node having at least 2 neighbors in it, we let the node with lowest order among the nodes having at least 2 neighbors be the “responsible” node of $s$. For example, nodes 6 and 3 are the “responsible” nodes of the second and third subgraphs in Fig. [?].
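The 4-node pivoting step can be illustrated with a simplified sketch for unlabeled undirected graphs. Note the swapped-in technique: instead of the “responsible” node rule, this sketch deduplicates with a set of already-seen node sets, which is simpler to state but needs memory proportional to the number of CISes. The motif names and the degree-sequence lookup are illustrative assumptions; the sketch relies on the fact that the six connected 4-node graphs have distinct degree sequences.

```python
from collections import Counter
from itertools import combinations

# Sorted degree sequence inside the induced subgraph -> motif name.
MOTIF_BY_DEGREES = {
    (1, 1, 2, 2): "path",
    (1, 1, 1, 3): "star",
    (2, 2, 2, 2): "cycle",
    (1, 2, 2, 3): "tailed-triangle",
    (2, 2, 3, 3): "chordal-cycle",
    (3, 3, 3, 3): "clique",
}


def build_adj(edges):
    """Adjacency sets for an undirected graph given as a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def count_4node_cises(adj):
    """Count 4-node connected induced subgraphs per motif by pivoting each node."""
    seen = set()
    counts = Counter()
    for u in adj:  # pivot u
        for v, w in combinations(adj[u], 2):
            # Neighborhood of u, v, w: any x here keeps {u, v, w, x} connected.
            candidates = (adj[u] | adj[v] | adj[w]) - {u, v, w}
            for x in candidates:
                nodes = frozenset((u, v, w, x))
                if nodes in seen:  # set-based deduplication (not the paper's rule)
                    continue
                seen.add(nodes)
                degrees = tuple(sorted(len(adj[a] & nodes) for a in nodes))
                counts[MOTIF_BY_DEGREES[degrees]] += 1
    return counts


path_counts = count_4node_cises(build_adj([(0, 1), (1, 2), (2, 3), (3, 4)]))
```

Every connected 4-node subgraph contains a node with at least two neighbors inside it, so pivoting over all nodes reaches each CIS at least once; the responsible-node rule described above achieves the same deduplication without the extra set.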

Algorithm 4 describes our 5-node CISes enumeration method. For a 5-node CIS $s$, we classify it into two types according to its associated undirected graph:

• 5-node CIS with type 1: the associated undirected graph has at least one node having more than two neighbors in it;

• 5-node CIS with type 2: the associated undirected graph has no node having more than two neighbors in it, i.e., it is isomorphic to a 5-node line or a circle, i.e., the first or sixth motifs in Fig. [?] (c).

We propose two different methods to enumerate these two types of 5-node CISes respectively.

To enumerate 5-node CISes with type 1, we “pivot” each node $u$ as follows: When $u$ has at least three neighbors, we enumerate each combination of three nodes $v, w, x \in N(u)$ where $v \prec w \prec x$, and then compute the neighborhood of $u$, $v$, $w$, and $x$, defined as $N(u, v, w, x) = N(u) \cup N(v) \cup N(w) \cup N(x) - \{u, v, w, x\}$. For any node $y \in N(u, v, w, x)$, we observe that the induced graph $s$ consisting of nodes $u$, $v$, $w$, $x$, and $y$ is a 5-node CIS. Thus, we enumerate each node $y$ in $N(u, v, w, x)$, and update the associated 5-node CIS consisting of $u$, $v$, $w$, $x$, and $y$. We repeat this process until all combinations of three nodes $v, w, x \in N(u)$ are enumerated and processed. Similar to 4-node CISes, we propose the following method to make sure each 5-node CIS $s$ is enumerated and gets counted once and only once: When and , we discard $s$. Otherwise, we consider the associated undirected graph of $s$ and pick the node with lowest order among the nodes having more than two neighbors in it to be the “responsible” node. The third and fourth subgraphs in Fig. [?] are two corresponding examples.

To enumerate 5-node CISes with type 2, we “pivot” each node $u$ as follows: When $u$ has at least two neighbors, we first enumerate each pair of $u$’s neighbors $v$ and $w$ where . Then, we compute the set of $v$’s neighbors not including $u$ and $w$ and not connecting to $u$ and $w$. Similarly, we compute the set of $w$’s neighbors not including $u$ and $v$ and not connecting to $u$ and $v$. Clearly, . For any $x$ in the first set and $y$ in the second set, we observe that the induced graph $s$ consisting of nodes $x$, $v$, $u$, $w$, and $y$ is a 5-node CIS with type 2. Thus, we enumerate each pair of $x$ and $y$, and update the 5-node CIS consisting of $x$, $v$, $u$, $w$, and $y$. We repeat this process until all pairs of $u$’s neighbors $v$ and $w$ are enumerated and processed. To make sure each CIS $s$ is enumerated and gets counted once and only once, we let the node with lowest order be the “responsible” node when the associated undirected graph of $s$ is isomorphic to a 5-node circle. When it is isomorphic to a 5-node line, we let the node in the middle of the line be the “responsible” node. The first and second subgraphs in Fig. [?] are two examples respectively.

In this section, we first introduce our experimental datasets. Then we present results of experiments used to evaluate the performance of our method, Minfer, for characterizing CIS classes of size $k = 3, 4, 5$.

We evaluate the performance of our methods on publicly available datasets taken from the Stanford Network Analysis Platform (SNAP, www.snap.stanford.edu), which are summarized in Table [?]. We start by evaluating the performance of our methods in characterizing 3-node CISes over million-node graphs: Flickr, Pokec, LiveJournal, YouTube, Web-Google, and Wiki-talk, contrasting our results with the ground truth computed through an exhaustive method. It is computationally intensive to calculate the ground truth of 4-node and 5-node CIS classes in large graphs. For example, we can easily observe that a node with degree $d$ is included in at least $\binom{d}{3}$ 4-node CISes and $\binom{d}{4}$ 5-node CISes, therefore it requires more than and operations to enumerate the 4-node and 5-node CISes of the Wiki-talk graph, which has a node with 100,029 neighbors. Even for a relatively small graph such as soc-Slashdot08, it takes almost 20 hours to compute all of its 4-node CISes. To solve this problem, the experiments for 4-node CISes are performed on five medium-sized graphs, soc-Epinions1, soc-Slashdot08, soc-Slashdot09, com-DBLP, and com-Amazon, and the experiments for 5-node CISes are performed on four relatively small graphs, ca-GR-QC, ca-HEP-TH, ca-CondMat, and p2p-Gnutella08, where computing the ground truth is feasible. We also evaluate the performance of our methods for characterizing signed CIS classes in graphs sign-Epinions, sign-Slashdot08, and sign-Slashdot09.

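In our experiments, we focus on the normalized root mean square error (NRMSE) to measure the relative error of the estimator $\hat{\omega}_i$ of the subgraph class concentration $\omega_i$, $i = 1, 2, \ldots$. $\mathrm{NRMSE}(\hat{\omega}_i)$ is defined as:

$$\mathrm{NRMSE}(\hat{\omega}_i) = \frac{\sqrt{\mathrm{MSE}(\hat{\omega}_i)}}{\omega_i}, \quad i = 1, 2, \ldots,$$

where $\mathrm{MSE}(\hat{\omega}_i)$ is defined as the mean square error (MSE) of an estimate $\hat{\omega}$ with respect to its true value $\omega$, that is

$$\mathrm{MSE}(\hat{\omega}) = E[(\hat{\omega} - \omega)^2] = \mathrm{var}(\hat{\omega}) + (E[\hat{\omega}] - \omega)^2.$$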
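In an experiment, these quantities are typically computed empirically from repeated runs of the estimator. The sketch below assumes a list of estimates from independent runs and a known ground-truth concentration; the function names and the sample values are made up for illustration. It also checks the variance plus squared-bias identity from the MSE definition numerically.

```python
from statistics import fmean, pvariance


def mse(estimates, truth):
    """Empirical mean square error of repeated estimates around the true value."""
    return fmean([(e - truth) ** 2 for e in estimates])


def nrmse(estimates, truth):
    """Empirical NRMSE: root of the MSE divided by the true value."""
    return mse(estimates, truth) ** 0.5 / truth


# Illustrative numbers only: four runs estimating a concentration of 0.2.
runs = [0.18, 0.22, 0.21, 0.19]
truth = 0.2
error = nrmse(runs, truth)

# Decomposition check: MSE = variance + squared bias (population variance).
bias = fmean(runs) - truth
decomposed = pvariance(runs) + bias ** 2
```

Using the population variance (`pvariance`) rather than the sample variance is what makes the decomposition hold exactly for the empirical quantities.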

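We note that $\mathrm{MSE}(\hat{\omega})$ decomposes into a sum of the variance and the squared bias of the estimator $\hat{\omega}$. Both quantities are important and need to be as small as possible to achieve good estimation performance. When $\hat{\omega}$ is an unbiased estimator of $\omega$, then we have $\mathrm{MSE}(\hat{\omega}) = \mathrm{var}(\hat{\omega})$ and thus $\mathrm{NRMSE}(\hat{\omega})$ is equivalent to the normalized standard error of $\hat{\omega}$, i.e.,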