Moss: A Scalable Tool for Efficiently Sampling and Counting 4- and 5-Node Graphlets

# Moss: A Scalable Tool for Efficiently Sampling and Counting 4- and 5-Node Graphlets

###### Abstract

Counting the frequencies of 3-, 4-, and 5-node undirected motifs (also known as graphlets) is widely used for understanding complex networks such as social and biological networks. However, computing these metrics for a large graph is a great challenge because the computation is intensive. Despite recent efforts to count triangles (i.e., 3-node undirected motif counting), little attention has been given to developing scalable tools that can be used to characterize 4- and 5-node motifs. In this paper, we develop computationally efficient methods to sample and count 4- and 5-node undirected motifs. Our methods provide unbiased estimators of motif frequencies, and we derive simple and exact formulas for the variances of the estimators. Moreover, our methods are designed to fit vertex-centric programming models, so they can be easily applied to current graph computing systems such as Pregel and GraphLab. We conduct experiments on a variety of real-world datasets, and experimental results show that our methods are several orders of magnitude faster than the state-of-the-art methods under the same estimation errors.


 Pinghui Wang, Jing Tao, Junzhou Zhao, and Xiaohong Guan MOE Key Laboratory for Intelligent Networks and Network Security, Xi’an Jiaotong University, China {phwang, jtao, jzzhao, xhguan}@sei.xjtu.edu.cn


Designing tools for counting the frequencies of 3-, 4-, and 5-node connected subgraph patterns (i.e., motifs, also known as graphlets) in a graph is important for understanding and exploring networks such as online social networks (OSNs) and computer networks. For example, a variety of motif-based network analysis techniques have been widely used to characterize communication and evolution patterns in OSNs [?, ?, ?, ?], Internet traffic classification and anomaly detection [?, ?], pattern recognition in gene expression profiling [?], protein-protein interaction prediction [?], and coarse-grained topology generation [?].

Due to combinatorial explosion, it is computationally intensive to enumerate and count motif frequencies even for a moderately sized graph. For example, medium-size networks Slashdot [?] and Epinions [?] have nodes and edges but have more than 4-node connected and induced subgraphs (CISes) [?]. To address this problem, cheaper methods such as sampling can be used instead of brute-force enumeration. Unfortunately, existing methods of estimating motif concentrations [?, ?, ?, ?, ?] cannot be used to estimate motif frequencies, which are more fundamental than motif concentrations.

Despite recent efforts to count triangles [?, ?, ?, ?], little attention has been given to developing scalable tools that can be used to characterize 4- and 5-node motifs. Jha et al. [?] develop sampling methods to estimate 4-node undirected motifs' frequencies. In our experiments we observe that their methods do not bound the estimation error tightly, so they significantly over-estimate the sampling budget required to achieve a certain accuracy. Meanwhile, their methods cannot be easily extended to characterize 5-node undirected motifs. Moreover, their methods use an edge-centric programming model, so it is difficult to implement them on current graph computing systems such as Pregel [?], GraphLab [?] and GraphChi [?]. In this paper, we propose new methods to estimate the frequencies of 4- and 5-node motifs. Our contributions are summarized as follows: 1) Our methods of sampling 4- and 5-node motifs are computationally efficient and scalable. Meanwhile, they can be easily implemented via vertex-centric programming models, which are required by most current graph computing systems. 2) To validate our methods, we perform an in-depth analysis. We find that our methods provide unbiased estimators of motif frequencies. To the best of our knowledge, we are the first to derive simple and exact formulas for the variances of the estimators, which are critical for determining a proper sampling budget in practice. Moreover, we conduct experiments on a variety of publicly available datasets, and experimental results show that our methods significantly outperform the state-of-the-art methods.

The rest of this paper is organized as follows. We first present the problem formulation and then introduce the preliminaries used in this paper. Next, we present our 4- and 5-node motif sampling methods, followed by the performance evaluation and testing results and a summary of related work. Concluding remarks then follow.

Let $G=(V,E)$ be the undirected graph of interest, where $V$ and $E$ are the sets of nodes and edges respectively. To formally define the 4- and 5-node motif frequencies of $G$, we first introduce some notation. An induced subgraph of $G$, $G'=(V',E')$, is a subgraph that contains all edges of $G$ between its nodes, i.e., $V'\subseteq V$ and $E'=\{(u,v): u,v\in V', (u,v)\in E\}$. We would like to point out that if we do not say "induced" in this paper, we mean that a subgraph is not necessarily induced. Panel (a) of the motif figure shows all six 4-node motifs of any undirected network. The frequency $n_i$ of the $i$-th 4-node motif is defined as the number of 4-node CISes in $G$ isomorphic to that motif, $1\le i\le 6$. Panel (b) of the motif figure shows all 21 5-node motifs of any undirected network. The frequency $\eta_i$ of the $i$-th 5-node motif is defined as the number of 5-node CISes in $G$ isomorphic to that motif, $1\le i\le 21$. In this paper, we aim to develop computational methods to estimate $n_i$ and $\eta_i$. For ease of reading, we list the notation used throughout the paper in a table, and we present the proofs of all theorems in the Appendix.

###### Theorem 1

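Suppose we have $n$ unbiased estimates $c_1,\ldots,c_n$ of $c$, i.e., $E[c_i]=c$, $1\le i\le n$. When these estimates are independent and their variances are $\mathrm{Var}(c_i)$, $1\le i\le n$, we can use all of them to obtain a more accurate unbiased estimate $\hat{c}=\sum_{i=1}^{n}\alpha_i c_i$ of $c$ by solving

$$\min_{\sum_{i=1}^{n}\alpha_i=1}\mathrm{Var}(\hat{c})=\mathrm{Var}\Bigl(\sum_{i=1}^{n}\alpha_i c_i\Bigr).$$

The optimum is attained by the inverse-variance weights $\alpha_i=\frac{1/\mathrm{Var}(c_i)}{\sum_{j=1}^{n}1/\mathrm{Var}(c_j)}$, for which $\mathrm{Var}(\hat{c})=\bigl(\sum_{j=1}^{n}1/\mathrm{Var}(c_j)\bigr)^{-1}$. We can also estimate the confidence interval of $\hat{c}$ by the Central Limit Theorem. That is, in the large-sample limit, for any $\varepsilon>0$, we have

$$\Pr\Bigl(|\hat{c}-c|\ge\varepsilon\sqrt{\mathrm{Var}(\hat{c})}\Bigr)\rightarrow\frac{1}{\sqrt{2\pi}}\int_{\varepsilon}^{+\infty}e^{-\frac{t^2}{2}}\,dt\approx\frac{e^{-\frac{\varepsilon^2}{2}}}{\sqrt{2\pi}\,\varepsilon}.$$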
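
The combination step in Theorem 1 can be illustrated with a minimal Python sketch. It assumes the standard inverse-variance solution of the minimization above; the function name `combine_estimates` and its return layout are hypothetical and not part of the paper.

```python
def combine_estimates(estimates, variances):
    """Combine independent unbiased estimates by inverse-variance weighting.

    Hypothetical helper: returns (combined estimate, its variance, weights).
    """
    if len(estimates) != len(variances) or not estimates:
        raise ValueError("need one positive variance per estimate")
    inverse = [1.0 / v for v in variances]          # 1 / Var(c_i)
    total = sum(inverse)
    weights = [w / total for w in inverse]          # alpha_i, summing to 1
    combined = sum(a * c for a, c in zip(weights, estimates))
    combined_variance = 1.0 / total                 # Var of the combination
    return combined, combined_variance, weights
```

The combined variance is never larger than the smallest input variance, which is why merging two estimators of the same frequency can only help when their variances are known or well approximated.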

To describe the state-of-the-art 4-node motif sampling methods, 3-path sampling and centered 3-path sampling [?], we first introduce some notation. Let $N_u$ be the set of neighbors of a node $u$ in $G$. Denote the degree of $u$ as $d_u$, which is defined as the number of neighbors of $u$ in $G$, i.e., $d_u=|N_u|$. Let $\succ$ be a total order on all of the nodes in $V$, which can be easily defined and obtained. For example, suppose we order all nodes based on their degrees and node IDs, and we define $u\succ v$ if $d_u>d_v$, or if $d_u=d_v$ and the node ID of $u$ is larger than that of $v$. Let $N_{u,v}$ denote the set of $u$'s neighbors with order larger than $v$, i.e.,

$$N_{u,v}=\{x: x\in N_u \text{ and } x\succ v\}.$$

Denote $d_{u,v}=|N_{u,v}|$.
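
As a small illustration of the degree-then-ID order and the neighbor sets it induces, the following Python sketch assumes the graph is stored as a dictionary from node ID to neighbor list. The names `rank` and `higher_neighbors` are hypothetical.

```python
def rank(adj, x):
    """Sort key for the total order: compare degrees first, then node IDs."""
    return (len(adj[x]), x)


def higher_neighbors(adj, u, v):
    """Neighbors of u that are ranked strictly above v in the total order."""
    threshold = rank(adj, v)
    return [x for x in adj[u] if rank(adj, x) > threshold]


# Example graph: a triangle 0-1-2 with a pendant node 3 attached to node 0.
example = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
print(higher_neighbors(example, 0, 1))  # neighbors of 0 ranked above node 1
```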

To sample a 4-node CIS, the 3-path sampling method mainly consists of four steps: 1) Sample an edge $(u,v)$ from $E$ according to the distribution

$$\Bigl\{\pi_{(u,v)}=\frac{(d_u-1)(d_v-1)}{\sum_{(u',v')\in E}(d_{u'}-1)(d_{v'}-1)}:(u,v)\in E\Bigr\},$$

i.e., the probability of sampling an edge $(u,v)$ is $\pi_{(u,v)}$; 2) Sample a node $w$ from $N_u-\{v\}$ uniformly at random; 3) Sample a node $r$ from $N_v-\{u\}$ uniformly at random; 4) Retrieve the CIS $s$ including nodes $u$, $v$, $w$, and $r$. Note that $s$ might be a 3-node CIS when $w=r$.
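
The steps above can be sketched in Python as follows. This is a simplified illustration rather than the implementation of [?]: it assumes an adjacency-list dictionary with comparable node IDs, recomputes the edge weights on every call for brevity, and uses the hypothetical name `sample_3path`.

```python
import random


def sample_3path(adj, rng):
    """Draw one 3-path (w, u, v, r); w == r means the sample is a triangle."""
    # List each undirected edge once (assumes node IDs can be compared).
    edges = [(u, v) for u in adj for v in adj[u] if v > u]
    # Edge weight (d_u - 1)(d_v - 1): the number of 3-paths centered on it.
    weights = [(len(adj[u]) - 1) * (len(adj[v]) - 1) for u, v in edges]
    u, v = rng.choices(edges, weights=weights, k=1)[0]
    w = rng.choice([x for x in adj[u] if x != v])   # other neighbor of u
    r = rng.choice([x for x in adj[v] if x != u])   # other neighbor of v
    return w, u, v, r
```

A practical implementation would precompute the weights once and reuse them across the whole sampling budget.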

Compared to 3-path sampling, centered 3-path sampling is tailored to estimate the frequencies of the 4-node motifs with class IDs 3, 5, and 6, which appear infrequently in many real networks. To sample a 4-node CIS, the centered 3-path sampling method mainly consists of four steps: 1) Sample an edge $(u,v)$ from $E$ according to the distribution

$$\Bigl\{\pi_{(u,v)}=\frac{d_{u,v}d_{v,u}}{\sum_{(u',v')\in E}d_{u',v'}d_{v',u'}}:(u,v)\in E\Bigr\};$$

2) Sample a node $w$ from $N_{u,v}$ at random; 3) Sample a node $r$ from $N_{v,u}$ at random; 4) Retrieve the CIS $s$ including nodes $u$, $v$, $w$, and $r$. Similarly, $s$ might be a 3-node CIS.

Vertex-centric programming models require users to express their algorithms by “thinking like a vertex". Each node contains information about itself and all its immediate neighbors, and the algorithms’ operations are expressed at the level of a single node. For example, the operations of a node in Pregel involve receiving messages from other nodes, updating the state of itself and its edges, and sending messages to other nodes. Vertex-centric models are very easy to program and have been widely used for many graph mining and machine learning algorithms.

In this section, we introduce our sampling methods: MOSS-4 and MOSS-4Min. MOSS-4 is used to estimate all 4-node motifs’ frequencies. We observe that MOSS-4 might exhibit large errors for characterizing rare motifs (i.e., motifs with low frequencies) for a small sampling budget. In addition to MOSS-4, we also develop a method MOSS-4Min to further reduce the errors for characterizing rare motifs.

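Denote $\Gamma_v=(d_v-1)\sum_{x\in N_v}(d_x-1)$. We assign the weight $\Gamma_v$ to each node $v\in V$. Define $\Gamma=\sum_{v\in V}\Gamma_v$ and $\rho_v=\Gamma_v/\Gamma$. Our method of sampling a 4-node CIS mainly consists of five steps: 1) Sample a node $v$ from $V$ according to the distribution $\rho=\{\rho_v: v\in V\}$; 2) Sample a random node $u$ from $N_v$ according to the distribution $\sigma^{(v)}=\{\sigma^{(v)}_u: u\in N_v\}$, where $\sigma^{(v)}_u$ is defined as

$$\sigma^{(v)}_u=\frac{d_u-1}{\sum_{x\in N_v}(d_x-1)},\quad u\in N_v; \tag{1}$$

3) Sample a node $w$ from $N_v-\{u\}$ uniformly at random; 4) Sample a node $r$ from $N_u-\{v\}$ uniformly at random; 5) Retrieve the CIS $s$ including nodes $v$, $u$, $w$, and $r$. We set the sampling budget as $K$, i.e., we run the above method $K$ times to obtain $K$ CISes $s_1,\ldots,s_K$. The pseudo-code of MOSS-4 is shown in Algorithm 1. In Algorithm 1, function WeightRandomVertex$(V,\rho)$ returns a node sampled from $V$ according to the distribution $\rho$, function RandomVertex$(X)$ returns a node sampled from a node set $X$ at random, and the CIS-retrieval function returns the CIS with the node set $\{v,u,w,r\}$ in $G$.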
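
A minimal Python sketch of one MOSS-4 draw is given below. It is an illustration, not Algorithm 1 itself: it assumes the node weight $(d_v-1)\sum_{x\in N_v}(d_x-1)$, stores the graph as a dictionary of neighbor lists, recomputes the weights on every call, and uses the hypothetical name `moss4_sample`.

```python
import random


def moss4_sample(adj, rng):
    """Return (v, u, w, r); the sample is a triangle when w == r."""
    nodes = list(adj)
    # Assumed node weight: (d_v - 1) * sum over neighbors x of (d_x - 1).
    gamma = [(len(adj[v]) - 1) * sum(len(adj[x]) - 1 for x in adj[v])
             for v in nodes]
    v = rng.choices(nodes, weights=gamma, k=1)[0]            # step 1
    sigma = [len(adj[x]) - 1 for x in adj[v]]                # unnormalized Eq. (1)
    u = rng.choices(adj[v], weights=sigma, k=1)[0]           # step 2
    w = rng.choice([x for x in adj[v] if x != u])            # step 3
    r = rng.choice([x for x in adj[u] if x != v])            # step 4
    return v, u, w, r
```

Only the degrees of $v$, $u$, and their neighbors are consulted, which is what makes the procedure fit a vertex-centric model.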

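Let $\varphi^{(1)}_i$, $1\le i\le 6$, be the number of subgraphs in 4-node motif $i$ that are isomorphic to motif 1 (the 3-path). We can easily compute $\varphi^{(1)}_1=1$, $\varphi^{(1)}_2=0$, $\varphi^{(1)}_3=4$, $\varphi^{(1)}_4=2$, $\varphi^{(1)}_5=6$, and $\varphi^{(1)}_6=12$. To remove the error introduced by sampling, we analyze the bias of MOSS-4 as follows: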

###### Theorem 2

When the sampling budget $K=1$, MOSS-4 samples a given CIS isomorphic to 4-node motif $i$ with probability $p_i=2\varphi^{(1)}_i/\Gamma$, $1\le i\le 6$.

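We let $M(s)$ be the 4-node motif class ID of $s$ when $s$ is a 4-node CIS, and $-1$ otherwise (i.e., $s$ is a triangle). Let $\mathbf{1}(P)$ denote the indicator function that equals one when predicate $P$ is true, and zero otherwise. Denote $m_i=\sum_{k=1}^{K}\mathbf{1}(M(s_k)=i)$. For $i\in\{1,3,4,5,6\}$, $p_i$ is larger than zero and we estimate $n_i$ as

$$\hat{n}_i=\frac{m_i}{Kp_i},\quad i\in\{1,3,4,5,6\}.$$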

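Let $\Lambda_3=\sum_{v\in V}\binom{d_v}{3}$. Then, the number of all 4-node subgraphs (not necessarily induced) in $G$ isomorphic to motif 2 (the 3-star) is $\Lambda_3$. Let $\varphi^{(2)}_i$, $1\le i\le 6$, be the number of subgraphs in motif $i$ that are isomorphic to motif 2. We have $\varphi^{(2)}_1=0$, $\varphi^{(2)}_2=1$, $\varphi^{(2)}_3=0$, $\varphi^{(2)}_4=1$, $\varphi^{(2)}_5=2$, and $\varphi^{(2)}_6=4$. We can easily find that

$$\Lambda_3=\sum_{i=1}^{6}\varphi^{(2)}_i n_i=n_2+n_4+2n_5+4n_6. \tag{2}$$

Thus, we estimate $n_2$ as

$$\hat{n}_2=\Lambda_3-\hat{n}_4-2\hat{n}_5-4\hat{n}_6.$$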
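
The correction in Eq. (2) is cheap because the left-hand side depends only on node degrees. The Python sketch below assumes that $\Lambda_3$ is the number of 3-star subgraphs, $\sum_{v}\binom{d_v}{3}$, which is consistent with the coefficients in Eq. (2); the function names are hypothetical.

```python
from math import comb


def star_count(adj):
    """Number of (not necessarily induced) 3-star subgraphs in the graph."""
    return sum(comb(len(neighbors), 3) for neighbors in adj.values())


def estimate_n2(adj, n4_hat, n5_hat, n6_hat):
    """Estimate the induced 3-star frequency from the other three estimates."""
    return star_count(adj) - n4_hat - 2 * n5_hat - 4 * n6_hat
```

For a 4-clique, `star_count` returns 4 and the only CIS is the clique itself, so the corrected estimate is 0, as expected.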

###### Theorem 3

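$\hat{n}_i$ is an unbiased estimator of $n_i$, $1\le i\le 6$. The variance of $\hat{n}_i$ is

$$\mathrm{Var}(\hat{n}_i)=\frac{n_i}{K}\Bigl(\frac{1}{p_i}-n_i\Bigr),\quad i\in\{1,3,4,5,6\}.$$

The variance of $\hat{n}_2$ is computed as

$$\mathrm{Var}(\hat{n}_2)=\frac{1}{K}\Bigl(\frac{n_4}{p_4}+\frac{4n_5}{p_5}+\frac{16n_6}{p_6}-(n_4+2n_5+4n_6)^2\Bigr).$$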

From Theorem 1, we can easily compute a sampling budget that can guarantee for any and , .
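
To make the estimator and its variance concrete, the sketch below computes $\hat{n}_i=m_i/(Kp_i)$ and a plug-in variance obtained by substituting $\hat{n}_i$ for the unknown $n_i$ in the variance formula of Theorem 3. The plug-in step and the name `estimate_frequency` are assumptions made for illustration.

```python
def estimate_frequency(m_i, budget, p_i):
    """Return (n_hat, plug-in variance) for one motif class.

    m_i: number of sampled CISes of the class, budget: K,
    p_i: probability that one draw returns a given CIS of the class.
    """
    if budget < 1 or not (p_i > 0 and 1 >= p_i):
        raise ValueError("budget must be positive and p_i must lie in (0, 1]")
    n_hat = m_i / (budget * p_i)
    # Variance formula with n_i replaced by its estimate (an approximation).
    variance_hat = n_hat / budget * (1.0 / p_i - n_hat)
    return n_hat, variance_hat
```

Because the variance scales as $1/K$, doubling the budget halves the variance, which gives a simple rule of thumb when choosing $K$ for a target relative error.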

Initialization: For each node $v$, we store its degree $d_v$ and use a list to store its neighbors $N_v$. Therefore, it requires $O(d_v)$ operations to compute $\Gamma_v$, and the computational complexity of processing all nodes is $O(|E|)$.

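WeightRandomVertex$(V,\rho)$: We use a list $V[1,\ldots,|V|]$ to store the nodes in $V$. We store an array $ACC_\rho[1,\ldots,|V|]$ in memory, where $ACC_\rho[i]$ is defined as $ACC_\rho[i]=\sum_{j=1}^{i}\rho_{V[j]}$, $1\le i\le|V|$. Clearly, $ACC_\rho[|V|]=1$. Let $ACC_\rho[0]=0$. Then, WeightRandomVertex$(V,\rho)$ can be easily achieved by the following three steps:

• Step 1: Select a number $rnd$ from the interval $(0,1)$ at random;

• Step 2: Find $i$ satisfying $rnd\in(ACC_\rho[i-1],ACC_\rho[i]]$, which can be solved by the binary search algorithm;

• Step 3: Return $V[i]$.

Its computational complexity is $O(\log|V|)$.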
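
The cumulative-array lookup described above can be sketched with Python's standard `bisect` module. The sketch uses zero-based indexing, and the names `build_acc` and `weight_random_vertex` are hypothetical.

```python
import bisect
import itertools
import random


def build_acc(weights):
    """Cumulative distribution array built from non-negative weights."""
    total = sum(weights)
    return list(itertools.accumulate(w / total for w in weights))


def weight_random_vertex(items, acc, rng):
    """Sample items[i] with probability acc[i] - acc[i - 1] by binary search."""
    x = rng.random()                              # uniform number in [0, 1)
    i = bisect.bisect_right(acc, x)               # first index with acc[i] > x
    return items[min(i, len(items) - 1)]          # guard against rounding
```

Building the array costs linear time once, after which each draw costs one binary search.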

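WeightRandomVertex$(N_v,\sigma^{(v)})$: We use a list $N_v[1,\ldots,d_v]$ to store the neighbors of $v$. We store an array $ACC_{\sigma^{(v)}}[1,\ldots,d_v]$ in memory, where $ACC_{\sigma^{(v)}}[i]$ is defined as $ACC_{\sigma^{(v)}}[i]=\sum_{j=1}^{i}\sigma^{(v)}_{N_v[j]}$, $1\le i\le d_v$. Let $ACC_{\sigma^{(v)}}[0]=0$. Then, WeightRandomVertex$(N_v,\sigma^{(v)})$ can be easily achieved by the following three steps:

• Step 1: Select a number $rnd$ from the interval $(0,1)$ at random;

• Step 2: Find $i$ satisfying $rnd\in(ACC_{\sigma^{(v)}}[i-1],ACC_{\sigma^{(v)}}[i]]$, which can be solved by the binary search algorithm;

• Step 3: Return $N_v[i]$.

Its computational complexity is $O(\log d_v)$.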

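RandomVertex$(N_v-\{u\})$: Let $POS_{v,u}$ denote the index of $u$ in the list $N_v[1,\ldots,d_v]$, i.e., $N_v[POS_{v,u}]=u$. Then, function RandomVertex$(N_v-\{u\})$ can be achieved by the following steps:

• Step 1: Select a number $rnd$ from $\{1,\ldots,d_v\}-\{POS_{v,u}\}$ at random;

• Step 2: Return $N_v[rnd]$.

Its computational complexity is $O(1)$.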
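
Sampling uniformly from a neighbor list while skipping one known position can be done in constant time by drawing from a range that is one element shorter and shifting past the excluded index. The Python sketch below uses zero-based positions and the hypothetical name `random_vertex_excluding`; it assumes the list has at least two entries.

```python
import random


def random_vertex_excluding(neighbors, pos, rng):
    """Uniformly sample an element of neighbors other than neighbors[pos]."""
    j = rng.randrange(len(neighbors) - 1)   # one slot fewer than the list
    if j >= pos:
        j += 1                              # skip over the excluded position
    return neighbors[j]
```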

In summary, the complexity of MOSS-4 sampling CISes is .

From the variance formulas derived above, we can see that MOSS-4 might exhibit larger errors for 4-node motifs with lower frequencies when a small sampling budget $K$ is allocated. To solve this problem, we develop a better method, MOSS-4Min, to further reduce the errors of estimating the frequencies of 4-node motifs 3, 5, and 6.

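Let $\check\Gamma_v=\sum_{u\in N_v}d_{u,v}d_{v,u}$, $v\in V$. MOSS-4Min assigns the weight $\check\Gamma_v$ to each node $v\in V$. Define $\check\Gamma=\sum_{v\in V}\check\Gamma_v$ and $\check\rho_v=\check\Gamma_v/\check\Gamma$. MOSS-4Min mainly consists of five steps: 1) Sample a node $v$ from $V$ according to the distribution $\check\rho=\{\check\rho_v: v\in V\}$; 2) Sample a node $u$ from $N_v$ according to the distribution $\check\sigma^{(v)}=\{\check\sigma^{(v)}_u: u\in N_v\}$, where $\check\sigma^{(v)}_u$ is defined as

$$\check\sigma^{(v)}_u=\frac{d_{u,v}d_{v,u}}{\check\Gamma_v},\quad u\in N_v; \tag{3}$$

3) Sample a node $w$ from $N_{v,u}$ at random; 4) Sample a node $r$ from $N_{u,v}$ at random; 5) Retrieve the CIS $s$ including nodes $v$, $u$, $w$, and $r$. We set the sampling budget as $\check{K}$ to obtain $\check{K}$ CISes $\check{s}_1,\ldots,\check{s}_{\check{K}}$.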
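
One MOSS-4Min draw can be sketched in Python as follows. The sketch assumes the degree-then-ID total order, takes the node weight to be the sum over neighbors $u$ of $d_{u,v}d_{v,u}$ as implied by Eq. (3), and recomputes everything on each call for brevity; `rank`, `higher`, and `moss4min_sample` are hypothetical names.

```python
import random


def rank(adj, x):
    """Sort key for the total order: degree first, node ID as tie-breaker."""
    return (len(adj[x]), x)


def higher(adj, a, b):
    """Neighbors of a that are ranked strictly above b."""
    return [x for x in adj[a] if rank(adj, x) > rank(adj, b)]


def moss4min_sample(adj, rng):
    """Return (v, u, w, r) with w ranked above u and r ranked above v."""
    def edge_weight(v, u):
        return len(higher(adj, u, v)) * len(higher(adj, v, u))

    nodes = list(adj)
    gamma = [sum(edge_weight(v, u) for u in adj[v]) for v in nodes]
    v = rng.choices(nodes, weights=gamma, k=1)[0]                      # step 1
    u = rng.choices(adj[v], weights=[edge_weight(v, x) for x in adj[v]],
                    k=1)[0]                                            # step 2
    w = rng.choice(higher(adj, v, u))                                  # step 3
    r = rng.choice(higher(adj, u, v))                                  # step 4
    return v, u, w, r
```

On a 4-cycle with nodes 0 to 3, exactly one edge has non-zero weight, so every draw returns the same four nodes.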

###### Theorem 4

When the sampling budget , MOSS-4Min samples CISes , , and with probabilities , , and respectively.

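We estimate $n_3$, $n_5$, and $n_6$ as

$$\check{n}_i=\frac{\check{m}_i}{\check{K}\check{p}_i},\quad i=3,5,6,$$

where $\check{m}_i=\sum_{k=1}^{\check{K}}\mathbf{1}(M(\check{s}_k)=i)$. The variances of $\check{n}_3$, $\check{n}_5$, and $\check{n}_6$ are given in the following theorem. We omit the proof, which is analogous to that of Theorem 3.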

###### Theorem 5

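$\check{n}_i$ is an unbiased estimator of $n_i$, $i=3,5,6$. Its variance is

$$\mathrm{Var}(\check{n}_i)=\frac{n_i}{\check{K}}\Bigl(\frac{1}{\check{p}_i}-n_i\Bigr),\quad i=3,5,6.$$

From Theorems 1, 3, and 5, we can easily obtain a more accurate estimator of $n_i$ by combining $\hat{n}_i$ and $\check{n}_i$, $i=3,5,6$.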

We can easily extend the implementation methods described above for MOSS-4 to design the sampling functions in Algorithm 2. The computational complexity of MOSS-4Min sampling CISes is .

In this subsection, we show that MOSS-4 and MOSS-4Min can be easily implemented via vertex-centric programming models.

First, we sample $K$ nodes in $V$ according to $\rho$. Let $k_v$ denote the number of times a node $v$ is sampled. Thus, $\sum_{v\in V}k_v=K$. For each node $v$, we set $k_v$ as its node value, and then repeat the following set of four operations $k_v$ times

 $u\leftarrow\text{WeightRandomVertex}(N_v,\sigma^{(v)})$,
 $w\leftarrow\text{RandomVertex}(N_v-\{u\})$,
 $\text{Update}(A)$ and then $\text{MSG}(v,*,w,*,A)\rightarrow u$,

where $A$ is the adjacency matrix of the CIS consisting of nodes $v$, $u$, $w$, and $r$, which are the variables in Algorithm 1, i.e., the nodes sampled at the first, second, third, and fourth steps respectively. Note that here $r$ and some entries in $A$ are unknown. Function $\text{Update}(A)$ is used to get the values of unknown entries in $A$ based on the edges of the current node $v$. Function $\text{MSG}(v,*,w,*,A)\rightarrow u$ generates a message $(v,*,w,*,A)$ and sends the message to $u$, which is a neighbor of $v$.

We process the messages that a node receives as follows:

• When a node $u$ receives a message like $(v,*,w,*,A)$, do

 $r\leftarrow\text{RandomVertex}(N_u-\{v\})$,
 $\text{Update}(A)$ and then $\text{MSG}(v,u,w,*,A)\rightarrow r$.

• When a node $r$ receives a message like $(v,u,w,*,A)$, we first $\text{Update}(A)$. From $A$, we then have all the edges between $v$, $u$, $w$, and $r$. Last, we set $m_i\leftarrow m_i+1$, where $i$ is the motif class of the CIS consisting of $v$, $u$, $w$, and $r$.
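
The message flow above can be simulated on a single machine to check that each node only needs its own adjacency list. The Python sketch below is a toy simulation, not a Pregel or GraphLab program: it assumes the node weight $(d_v-1)\sum_{x\in N_v}(d_x-1)$, represents the partially known adjacency matrix as a set of known edges, and uses the hypothetical name `vertex_centric_moss4`.

```python
import random
from collections import Counter, defaultdict


def vertex_centric_moss4(adj, budget, rng):
    """Simulate three supersteps; return a list of ((v, u, w, r), edges)."""
    nodes = list(adj)
    gamma = [(len(adj[v]) - 1) * sum(len(adj[x]) - 1 for x in adj[v])
             for v in nodes]
    # Node values: how many times each node is chosen as the first node v.
    counts = Counter(rng.choices(nodes, weights=gamma, k=budget))

    # Superstep 1: v picks u and w, records its own edges, messages u.
    inbox_u = defaultdict(list)
    for v, times in counts.items():
        for _ in range(times):
            u = rng.choices(adj[v], weights=[len(adj[x]) - 1 for x in adj[v]],
                            k=1)[0]
            w = rng.choice([x for x in adj[v] if x != u])
            known = {frozenset((v, u)), frozenset((v, w))}
            inbox_u[u].append((v, w, known))

    # Superstep 2: u picks r, adds the edges it can see, messages r.
    inbox_r = defaultdict(list)
    for u, messages in inbox_u.items():
        for v, w, known in messages:
            r = rng.choice([x for x in adj[u] if x != v])
            known = known | {frozenset((u, x)) for x in (w, r) if x in adj[u]}
            inbox_r[r].append((v, u, w, known))

    # Superstep 3: r adds its edges to v and w; the edge set is now complete.
    samples = []
    for r, messages in inbox_r.items():
        for v, u, w, known in messages:
            known = known | {frozenset((r, x)) for x in (v, w)
                             if x != r and x in adj[r]}
            samples.append(((v, u, w, r), known))
    return samples
```

Every one of the six node pairs is inspected by a node that owns one endpoint, so the final edge set describes the induced subgraph, and a motif classifier can then be applied to it at node $r$.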

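Similar to MOSS-4, we sample $\check{K}$ nodes in $V$ according to $\check\rho$. Let $\check{k}_v$ denote the number of times a node $v$ is sampled. Thus, $\sum_{v\in V}\check{k}_v=\check{K}$. For each node $v$, we set $\check{k}_v$ as its node value, and then repeat the following set of four operations $\check{k}_v$ times

 $u\leftarrow\text{WeightRandomVertex}(N_v,\check\sigma^{(v)})$,
 $w\leftarrow\text{RandomVertex}(N_{v,u})$,
 $\text{Update}(A)$ and then $\text{MSG}(v,*,w,*,A)\rightarrow u$.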

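We process the messages that a node receives as follows:

• When a node $u$ receives a message like $(v,*,w,*,A)$, do

 $r\leftarrow\text{RandomVertex}(N_{u,v})$,
 $\text{Update}(A)$ and then $\text{MSG}(v,u,w,*,A)\rightarrow r$.

• When a node $r$ receives a message like $(v,u,w,*,A)$, we first $\text{Update}(A)$ and then set $\check{m}_i\leftarrow\check{m}_i+1$, where $i$ is the motif class of the CIS consisting of $v$, $u$, $w$, and $r$.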

MOSS-4 and MOSS-4Min can be viewed as the vertex-centric versions of the 3-path and centered 3-path sampling methods respectively. Suppose we use 4 bytes to store a node ID and its weight . The 3-path and centered 3-path sampling methods require bytes of memory, but MOSS-4 and MOSS-4Min need only bytes, which is orders of magnitude smaller than for many real-world large networks. Therefore, MOSS-4 and MOSS-4Min are suitable for disk-based graph computing systems such as GraphChi and VENUS [?], which aim to analyze big graphs that cannot fit into memory. Moreover, MOSS-4 and MOSS-4Min can be easily implemented in distributed vertex-centric graph computing systems such as Pregel and GraphLab. We would also like to point out that we give closed-form formulas for the variances of MOSS-4 and MOSS-4Min. They are critical for evaluating the error of an estimate and for determining a proper sampling budget that guarantees a certain accuracy. Moreover, they can help us choose the right sampling strategy in advance. An example is given in the following subsection.

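From Theorems 3 and 5, when $K=\check{K}$, we have

$$\frac{\mathrm{Var}(\hat{n}_i)}{\mathrm{Var}(\check{n}_i)}=\frac{1/p_i-n_i}{1/\check{p}_i-n_i}\approx\frac{\check{p}_i}{p_i},\quad i=3,5,6,$$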

where , , and . Thus, the value of helps us to determine whether it is necessary to apply MOSS-4Min to further reduce the errors of estimating , , and . For example, the graph ca-GrQc [?] has . In our experiments we observe that MOSS-4Min slightly improves the accuracy of MOSS-4 for estimating and of ca-GrQc, and exhibits a larger error than MOSS-4 for estimating of ca-GrQc.

MOSS-5, our method of estimating frequency of all 5-node motifs, consists of two sub-methods: T-5 and Path-5. We develop T-5 to sample 5-node CISes that include at least one subgraph isomorphic to . Similarly, Path-5 is developed to sample 5-node CISes that include at least one subgraph isomorphic to . Finally, we propose a method to estimate the frequency of all 5-node motifs based on sampled CISes given by T-5 and Path-5.

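The pseudo-code of T-5 is shown in Algorithm 3. Let

$$\Gamma^{(1)}_v=(d_v-1)(d_v-2)\sum_{x\in N_v}(d_x-1),\quad v\in V.$$

We assign the weight $\Gamma^{(1)}_v$ to each node $v\in V$. Define $\Gamma^{(1)}=\sum_{v\in V}\Gamma^{(1)}_v$ and $\rho^{(1)}_v=\Gamma^{(1)}_v/\Gamma^{(1)}$. To sample a 5-node CIS, T-5 mainly consists of five steps: 1) Sample a node $v$ from $V$ according to the distribution $\rho^{(1)}=\{\rho^{(1)}_v: v\in V\}$; 2) Sample a node $u$ of $N_v$ according to the distribution $\sigma^{(v)}$, where $\sigma^{(v)}$ is defined the same as in (1); 3) Sample two different nodes $w$ and $r$ from $N_v-\{u\}$ at random; 4) Sample a node $t$ from $N_u-\{v\}$ uniformly at random; 5) Retrieve the CIS $s$ including nodes $v$, $u$, $w$, $r$, and $t$. We run the above method $K_1$ times to obtain $K_1$ CISes.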
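
The T-5 steps can be sketched in Python in the same style. The sketch assumes the node weight $(d_v-1)(d_v-2)\sum_{x\in N_v}(d_x-1)$, a dictionary of neighbor lists, and weights recomputed on each call; `t5_sample` and the variable `t` for the last sampled node are hypothetical names.

```python
import random


def t5_sample(adj, rng):
    """Return (v, u, w, r, t); fewer than five distinct nodes is possible."""
    nodes = list(adj)
    gamma1 = [(len(adj[v]) - 1) * (len(adj[v]) - 2)
              * sum(len(adj[x]) - 1 for x in adj[v]) for v in nodes]
    v = rng.choices(nodes, weights=gamma1, k=1)[0]                   # step 1
    u = rng.choices(adj[v], weights=[len(adj[x]) - 1 for x in adj[v]],
                    k=1)[0]                                          # step 2
    w, r = rng.sample([x for x in adj[v] if x != u], 2)              # step 3
    t = rng.choice([x for x in adj[u] if x != v])                    # step 4
    return v, u, w, r, t
```

A node with positive weight has degree at least three and a neighbor of degree at least two, so steps 3 and 4 always have candidates to choose from.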

Let , , be the number of subgraphs in motif that are isomorphic to motif . The value of is given in Table Moss: A Scalable Tool for Efficiently Sampling and Counting 4- and 5-Node Graphlets. The following theorem shows the sampling bias of the 5-node T-sampling method.

###### Theorem 6

When the sampling budget , T-5 samples a CIS with probability , .

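We let $M(s)$ be the 5-node motif class ID of $s$ when $s$ is a 5-node CIS, and $-1$ otherwise. Denote by $m^{(1)}_i$ the number of the $K_1$ sampled CISes whose motif class ID is $i$, and by $p^{(1)}_i$ the sampling probability given in Theorem 6. Let $\Omega_1=\{i: p^{(1)}_i>0\}$. For $i\in\Omega_1$, $p^{(1)}_i$ is larger than zero and we then estimate $\eta_i$ as

$$\hat\eta^{(1)}_i=\frac{m^{(1)}_i}{K_1p^{(1)}_i}.$$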

###### Theorem 7

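For $i\in\Omega_1$, $\hat\eta^{(1)}_i$ is an unbiased estimator of $\eta_i$ and its variance is

$$\mathrm{Var}(\hat\eta^{(1)}_i)=\frac{\eta_i}{K_1}\Bigl(\frac{1}{p^{(1)}_i}-\eta_i\Bigr). \tag{4}$$

The covariance of $\hat\eta^{(1)}_i$ and $\hat\eta^{(1)}_j$ is

$$\mathrm{Cov}(\hat\eta^{(1)}_i,\hat\eta^{(1)}_j)=-\frac{\eta_i\eta_j}{K_1},\quad i\ne j \text{ and } i,j\in\Omega_1.$$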

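The pseudo-code of Path-5 is shown in Algorithm 4. Let

$$\Gamma^{(2)}_v=\Bigl(\sum_{x\in N_v}(d_x-1)\Bigr)^2-\sum_{x\in N_v}(d_x-1)^2,\quad v\in V.$$

We assign the weight $\Gamma^{(2)}_v$ to each node $v\in V$. Define $\Gamma^{(2)}=\sum_{v\in V}\Gamma^{(2)}_v$ and $\rho^{(2)}_v=\Gamma^{(2)}_v/\Gamma^{(2)}$. To sample a 5-node CIS, Path-5 mainly consists of six steps: 1) Sample a node $v$ from $V$ according to the distribution $\rho^{(2)}=\{\rho^{(2)}_v: v\in V\}$; 2) Sample a node $u$ from $N_v$ according to the distribution $\tau^{(v)}$, where $\tau^{(v)}=\{\tau^{(v)}_u: u\in N_v\}$ and $\tau^{(v)}_u$ is defined as

$$\tau^{(v)}_u=\frac{(d_u-1)\bigl(\sum_{y\in N_v-\{u\}}(d_y-1)\bigr)}{\Gamma^{(2)}_v},\quad u\in N_v;$$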

3) Sample a node from according to the distribution , where and