Moss: A Scalable Tool for Efficiently Sampling and Counting 4- and 5-Node Graphlets
Abstract
Counting the frequencies of 3-, 4-, and 5-node undirected motifs (also known as graphlets) is widely used for understanding complex networks such as social and biological networks. However, computing these metrics for a large graph is a great challenge because of the intensive computation required. Despite recent efforts to count triangles (i.e., 3-node undirected motif counting), little attention has been given to developing scalable tools that can be used to characterize 4- and 5-node motifs. In this paper, we develop computationally efficient methods to sample and count 4- and 5-node undirected motifs. Our methods provide unbiased estimators of motif frequencies, and we derive simple and exact formulas for the variances of the estimators. Moreover, our methods are designed to fit vertex-centric programming models, so they can be easily applied to current graph computing systems such as Pregel and GraphLab. We conduct experiments on a variety of real-world datasets, and experimental results show that our methods are several orders of magnitude faster than the state-of-the-art methods under the same estimation errors.
Pinghui Wang, Jing Tao, Junzhou Zhao, and Xiaohong Guan 
MOE Key Laboratory for Intelligent Networks and Network Security, Xi’an Jiaotong University, China 
{phwang, jtao, jzzhao, xhguan}@sei.xjtu.edu.cn 
Designing tools for counting the frequencies of 3-, 4-, and 5-node connected subgraph patterns (i.e., motifs, also known as graphlets) in a graph is important for understanding and exploring networks such as online social networks (OSNs) and computer networks. For example, a variety of motif-based network analysis techniques have been widely used for characterizing communication and evolution patterns in OSNs [?, ?, ?, ?], Internet traffic classification and anomaly detection [?, ?], pattern recognition in gene expression profiling [?], protein-protein interaction prediction [?], and coarse-grained topology generation [?].
Due to combinatorial explosion, it is computationally intensive to enumerate and count motif frequencies even for a moderately sized graph. For example, medium-sized networks Slashdot [?] and Epinions [?] have nodes and edges but have more than 4-node connected and induced subgraphs (CISes) [?]. To address this problem, cheaper methods such as sampling can be used instead of brute-force enumeration. Unfortunately, existing methods of estimating motif concentrations [?, ?, ?, ?, ?] cannot be used to estimate motif frequencies, which are more fundamental than motif concentrations.
Despite recent efforts to count triangles [?, ?, ?, ?], little attention has been given to developing scalable tools that can be used to characterize 4- and 5-node motifs. Jha et al. [?] develop sampling methods to estimate the frequencies of 4-node undirected motifs. In our experiments we observe that their methods do not bound the estimation error tightly, so they significantly overestimate the sampling budget required to achieve a given accuracy. Moreover, their methods cannot be easily extended to characterize 5-node undirected motifs, and they use an edge-centric programming model, which makes them difficult to implement on current graph computing systems such as Pregel [?], GraphLab [?], and GraphChi [?]. In this paper, we propose new methods to estimate the frequencies of 4- and 5-node motifs. Our contributions are summarized as follows: 1) Our methods of sampling 4- and 5-node motifs are computationally efficient and scalable. They can also be easily implemented via vertex-centric programming models, which are required by most current graph computing systems. 2) To validate our methods, we perform an in-depth analysis. We find that our methods provide unbiased estimators of motif frequencies. To the best of our knowledge, we are the first to derive simple and exact formulas for the variances of the estimators, which are critical for determining a proper sampling budget in practice. Moreover, we conduct experiments on a variety of publicly available datasets, and experimental results show that our methods significantly outperform the state-of-the-art methods.
The rest of this paper is organized as follows. We first present the problem formulation and then introduce the preliminaries used in this paper. Next, we present our 4- and 5-node motif sampling methods, followed by the performance evaluation and testing results. We then summarize related work. Concluding remarks follow.
Let be the undirected graph of interest, where and are the sets of nodes and edges respectively. To formally define the 4- and 5-node motif frequencies of , we first introduce some notation. An induced subgraph of , , is a subgraph whose edges are all in , i.e. , . Unless we explicitly say “induced” in this paper, a subgraph is not necessarily induced. Panel (a) of the motif figure shows all 4-node motifs of any undirected network. Denote as the set of 4-node CISes in isomorphic to motif , and then the motif frequency of is defined as , . Panel (b) of the same figure shows all 5-node motifs of any undirected network. Denote as the set of 5-node CISes in isomorphic to motif , and then the motif frequency of is defined as , . In this paper, we aim to develop computational methods to estimate and . For ease of reading, we list the notation used throughout the paper in the notation table, and we present the proofs of all theorems in the Appendix.
is the undirected graph of interest.
the set of neighbors of a node in
4-node undirected motifs
4-node motif class ID of CIS
the set of 4-node CISes in isomorphic to motif , .
the frequency of motif , i.e., , .
5-node undirected motifs
5-node motif class ID of CIS
the set of 5-node CISes in isomorphic to motif , .
the frequency of motif , i.e., , .
sampling budget of MOSS4
sampling budget of MOSS4Min
, sampling budgets of MOSS5
the number of subgraphs in motif that are isomorphic to motif
the number of subgraphs in motif that are isomorphic to motif
the number of subgraphs in motif that are isomorphic to motif
Theorem 1
Suppose we have unbiased estimates of , i.e., . When these estimates are independent and their variances are , , we can use all of them to obtain a more accurate unbiased estimate of by solving
We can easily obtain the optimal when . We can also estimate the confidence interval of by the Central Limit Theorem. That is, as , for any , we have
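As an illustration of combining independent unbiased estimates, the following minimal sketch uses inverse-variance weighting, which is the standard minimum-variance linear combination when the variances are known. It is an assumption that this matches the optimization referred to in Theorem 1; the function names `combine_estimates` and `normal_interval` are hypothetical.

```python
import math

def combine_estimates(estimates, variances):
    """Combine independent unbiased estimates by inverse-variance weighting.

    ASSUMPTION: all variances are known and strictly positive.
    Returns the combined estimate and its variance.
    """
    if not estimates or len(estimates) != len(variances):
        raise ValueError("need one positive variance per estimate")
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    combined = sum(w * x for w, x in zip(weights, estimates)) / total
    # The variance of the weighted combination is 1 / (sum of 1/variance).
    return combined, 1.0 / total

def normal_interval(estimate, variance, z=1.96):
    """Approximate confidence interval based on the Central Limit Theorem."""
    half_width = z * math.sqrt(variance)
    return estimate - half_width, estimate + half_width
```

The combined variance is never larger than the smallest input variance, which is why pooling several estimates can only help under these assumptions.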
To describe the state-of-the-art 4-node motif sampling methods, 3-path sampling and centered 3-path sampling [?], we first introduce some notation. Let be the set of neighbors of a node in . Denote the degree of as , which is defined as the number of neighbors of in , i.e., . Let be a total order on all of the nodes in , which can be easily defined and obtained. For example, suppose we order all nodes based on their degrees and node IDs, and we define if or, if while the node ID of is larger than that of . Let denote the set of ’s neighbors with order larger than , i.e.,
Denote .
To sample a 4-node CIS, the 3-path sampling method mainly consists of four steps: 1) Sample an edge from according to the distribution
i.e., the probability of sampling an edge is ; 2) Sample a node from uniformly at random; 3) Sample a node from uniformly at random; 4) Retrieve the CIS including nodes , , , and . Note that might be a 3-node CIS when .
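The steps of 3-path sampling can be sketched as follows. This is a minimal illustration, not the reference implementation: it assumes that an edge (u, v) is drawn with probability proportional to (d_u - 1)(d_v - 1), which is the weighting commonly associated with 3-path sampling, and that the graph is given as a dictionary `adj` mapping each node to a list of its neighbors. The names `three_path_sample` and `induced_edges` are hypothetical.

```python
import random

def induced_edges(adj, nodes):
    """Return the edges of the subgraph induced by `nodes`."""
    node_set = set(nodes)
    return {(a, b) for a in node_set for b in adj[a]
            if b in node_set and a < b}

def three_path_sample(adj, rng):
    """Draw one sample; returns (node set, induced edges) or None."""
    edges = [(u, v) for u in adj for v in adj[u] if u < v]
    # ASSUMPTION: edge weight (d_u - 1) * (d_v - 1), i.e. the number of
    # 3-paths that have (u, v) as their middle edge.
    weights = [(len(adj[u]) - 1) * (len(adj[v]) - 1) for u, v in edges]
    if sum(weights) == 0:
        return None  # the graph contains no 3-path
    u, v = rng.choices(edges, weights=weights, k=1)[0]
    w = rng.choice([x for x in adj[u] if x != v])  # neighbor of u other than v
    r = rng.choice([x for x in adj[v] if x != u])  # neighbor of v other than u
    nodes = {w, u, v, r}  # only three nodes when w == r (a triangle)
    return nodes, induced_edges(adj, nodes)
```

Rebuilding the edge list on every call keeps the sketch short; a practical implementation would precompute the edge weights once.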
Compared to 3-path sampling, centered 3-path sampling is tailored to estimate the frequencies of 4-node motifs , , and , which usually appear infrequently in real networks. To sample a 4-node CIS, the centered 3-path sampling method mainly consists of four steps: 1) Sample an edge from according to the distribution
2) Sample a node from at random; 3) Sample a node from at random; 4) Retrieve the CIS including nodes , , , and . Similarly, might be a 3-node CIS.
Vertex-centric programming models require users to express their algorithms by “thinking like a vertex”. Each node contains information about itself and all its immediate neighbors, and the algorithms’ operations are expressed at the level of a single node. For example, the operations of a node in Pregel involve receiving messages from other nodes, updating the state of itself and its edges, and sending messages to other nodes. Vertex-centric models are easy to program and have been widely used for many graph mining and machine learning algorithms.
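To make the receive, update, and send cycle concrete, here is a toy single-machine sketch of a synchronous vertex-centric computation. It does not use the API of Pregel or any other real system; `superstep`, `send_degree`, `sum_received`, and `neighbor_degree_sums` are hypothetical names. The example lets every node learn the sum of its neighbors' degrees using only local information and messages.

```python
def superstep(adj, values, inboxes, compute):
    """Run `compute` once on every vertex; return the next round's inboxes."""
    next_inboxes = {v: [] for v in adj}
    for v in adj:
        # A vertex only sees its own value, its neighbor list, and its inbox.
        values[v], outgoing = compute(v, values[v], adj[v], inboxes[v])
        for target, message in outgoing:
            next_inboxes[target].append(message)
    return next_inboxes

def send_degree(v, value, neighbors, messages):
    """First superstep: send our own degree to every neighbor."""
    degree = len(neighbors)
    return value, [(u, degree) for u in neighbors]

def sum_received(v, value, neighbors, messages):
    """Second superstep: store the sum of received degrees, send nothing."""
    return sum(messages), []

def neighbor_degree_sums(adj):
    """Return, for each node, the sum of the degrees of its neighbors."""
    values = {v: 0 for v in adj}
    inboxes = {v: [] for v in adj}
    inboxes = superstep(adj, values, inboxes, send_degree)
    superstep(adj, values, inboxes, sum_received)
    return values
```

Quantities of this kind, built from the degrees of a node and its neighbors, are the sort of per-node state that a vertex-centric sampler can compute without any global view of the edge set.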
In this section, we introduce our sampling methods: MOSS4 and MOSS4Min. MOSS4 is used to estimate the frequencies of all 4-node motifs. We observe that MOSS4 might exhibit large errors when characterizing rare motifs (i.e., motifs with low frequencies) under a small sampling budget. We therefore also develop a method, MOSS4Min, to further reduce the errors for rare motifs.
Denote by . We assign a weight to each node . Define and . Our method of sampling a 4-node CIS mainly consists of five steps: 1) Sample a node from according to the distribution ; 2) Sample a random node from according to the distribution , where is defined as
(1) 
3) Sample a node from uniformly at random; 4) Sample a node from uniformly at random; 5) Retrieve the CIS including nodes , , , and . We set the sampling budget as , i.e., we run the above method times to obtain CISes . The pseudocode of MOSS4 is shown in Algorithm 1. In Algorithm 1, function returns a node sampled from according to the distribution , function returns a node sampled from at random, and function returns the CIS with the node set in .
Let , , be the number of subgraphs in motif that are isomorphic to motif . We can easily compute , , , , , and . To remove the error introduced by sampling, we analyze the bias of MOSS4 as follows:
Theorem 2
When the sampling budget , MOSS4 samples a CIS with probability , .
We let be the 4-node motif class ID of when is a 4-node CIS, and 1 otherwise (i.e., is a triangle). Let denote the indicator function that equals one when predicate is true, and zero otherwise. Denote . For , is larger than zero and we estimate as
Let . Then, the number of all 4-node subgraphs (not necessarily induced) in isomorphic to motif is . Let , , be the number of subgraphs in motif that are isomorphic to motif . We have , , , , , and . We can easily find that
(2) 
Thus, we estimate as
Theorem 3
is an unbiased estimator of , . The variance of is
The variance of is computed as
From Theorem 1, we can easily compute a sampling budget that can guarantee for any and , .
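The general idea behind estimators of this kind is inverse-probability weighting: if a single draw returns any particular CIS of class i with a known probability, then dividing the observed count by the budget times that probability gives an unbiased frequency estimate. The sketch below illustrates this principle only; the per-class probabilities are passed in as an assumed input (`sampling_prob`) rather than derived from the paper's formulas, and `estimate_frequencies` is a hypothetical name.

```python
from collections import Counter

def estimate_frequencies(sampled_classes, sampling_prob):
    """Inverse-probability estimate of the motif frequencies.

    sampled_classes: motif class ID of each of the K sampled CISes.
    sampling_prob:   ASSUMED input; maps class ID i to the probability that
                     one draw returns a given CIS of class i (must be > 0).
    """
    budget = len(sampled_classes)
    if budget == 0:
        raise ValueError("at least one sample is required")
    counts = Counter(sampled_classes)
    # E[count_i] = budget * frequency_i * p_i, so dividing by budget * p_i
    # yields an unbiased estimate of frequency_i.
    return {i: counts.get(i, 0) / (budget * p)
            for i, p in sampling_prob.items()}
```

Classes with a very small sampling probability receive very large weights, which is the usual source of high variance for rare motifs under a small budget.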
Initialization: For each node , we store its degree and use a list to store its neighbors . Therefore, it requires operations to compute , and the computational complexity of processing all nodes is .
: We use a list to store the nodes in . We store an array in memory, where is defined as , . Clearly, . Let . Then, can be easily achieved by the following three steps:

Step 1: Select a number from at random;

Step 2: Find satisfying , which can be solved by the binary search algorithm;

Step 3: Return .
Its computational complexity is .
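The three steps above amount to weighted sampling with a prefix-sum array and binary search. A minimal sketch follows, assuming non-negative integer weights with a positive total; `build_cumulative` and `weighted_pick` are hypothetical names, and the standard-library `bisect` module performs the binary search.

```python
import bisect
import itertools
import random

def build_cumulative(weights):
    """Prefix sums of the weights; computed once in linear time."""
    return list(itertools.accumulate(weights))

def weighted_pick(nodes, cumulative, rng):
    """Return nodes[i] with probability weights[i] / sum(weights).

    ASSUMPTION: integer weights, at least one of them positive.
    """
    total = cumulative[-1]
    x = rng.randint(1, total)  # Step 1: uniform integer in {1, ..., total}
    # Step 2: smallest index i with cumulative[i] >= x (binary search).
    i = bisect.bisect_left(cumulative, x)
    return nodes[i]            # Step 3: return the selected node
```

Nodes with zero weight are never returned, because their prefix sum equals that of the preceding entry and the search always stops at the earlier index.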
: We use a list to store the neighbors of . We store an array in memory, where is defined as , . Let . Then, can be easily achieved by the following three steps:

Step 1: Select a number from at random;

Step 2: Find satisfying
which can be solved by the binary search algorithm;

Step 3: Return .
Its computational complexity is .
: Let denote the index of in the list , i.e., . Then, function can be achieved by the following steps:

Step 1: Select a number from at random;

Step 2: Return .
Its computational complexity is .
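Sampling a neighbor uniformly while excluding one known node can be done in constant time when the position of the excluded node in the neighbor list is stored. The sketch below shows one way to do this; the name `pick_neighbor_excluding` and the zero-based `skip_index` argument are assumptions made for the illustration.

```python
import random

def pick_neighbor_excluding(neighbors, skip_index, rng):
    """Uniformly pick an element of `neighbors` other than neighbors[skip_index]."""
    if len(neighbors) < 2:
        raise ValueError("no other neighbor to choose from")
    # Draw from the len - 1 allowed positions, then shift past the skipped one.
    r = rng.randrange(len(neighbors) - 1)
    if r >= skip_index:
        r += 1
    return neighbors[r]
```

No copy of the neighbor list is made, so the cost is independent of the node's degree.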
In summary, the complexity of MOSS4 sampling CISes is .
From the variance formulas derived above, we can see that MOSS4 might exhibit larger errors for 4-node motifs with lower frequencies when the sampling budget is small. To solve this problem, we develop a method, MOSS4Min, to further reduce the errors for estimating the frequencies of 4-node motifs , , and .
Let , . MOSS4Min assigns a weight to each node . Define and . MOSS4Min mainly consists of five steps: 1) Sample a node from according to the distribution . 2) Sample a node from according to the distribution , where is defined as
(3) 
3) Sample a node from at random; 4) Sample a node from at random; 5) Retrieve the CIS including nodes , , , and . We set the sampling budget as to obtain CISes .
Theorem 4
When the sampling budget , MOSS4Min samples CISes , , and with probabilities , , and respectively.
We estimate , , and as
where . The variances of , , and are given in the following theorem. We omit the proof, which is analogous to that of Theorem 3.
Theorem 5
is an unbiased estimator of , . Its variance is
The methods described earlier for MOSS4 extend easily to the design of functions and in Algorithm 2. The computational complexity of MOSS4Min sampling CISes is .
In this subsection, we show that MOSS4 and MOSS4Min can be easily implemented via vertex-centric programming models.
First, we sample nodes in according to . Let denote the number of times a node is sampled. Thus, . For each node , we set as its node value, and then repeat the following four operations times
where is the adjacency matrix of the CIS consisting of nodes , , , and , which are the variables in Algorithm 1, i.e., the nodes sampled at the 1st, 2nd, 3rd, and 4th steps respectively. Note that here and some entries in are unknown. Function is used to get the values of unknown entries in based on the edges of the current node . Function generates a message , and sends the message to , which is a neighbor of .
We process the messages that a node receives as follows:

When a node receives a message like , do

When a node receives a message like , we first . From , we then have all the edges between , , , and . Last, we set , where is the motif class of the CIS consisting of , , , and .
Similar to MOSS4, we sample nodes in according to . Let denote the number of times a node is sampled. Thus, . For each node , we set as its node value, and then repeat the following four operations times
We process the messages that a node receives as follows:

When a node receives a message like , do

When a node receives a message like , we first and then set , where is the motif class of the CIS consisting of , , , and .
MOSS4 and MOSS4Min can be viewed as the vertex-centric versions of the 3-path and centered 3-path sampling methods respectively. Suppose we use 4 bytes to store a node ID and its weight . The 3-path and centered 3-path sampling methods require bytes of memory, but MOSS4 and MOSS4Min need only bytes, which is orders of magnitude smaller than for many large real-world networks. Therefore, MOSS4 and MOSS4Min are suitable for disk-based graph computing systems such as GraphChi and VENUS [?], which aim to analyze big graphs that do not fit into memory. Moreover, MOSS4 and MOSS4Min can be easily implemented in distributed vertex-centric graph computing systems such as Pregel and GraphLab. We also give closed-form formulas for the variances of MOSS4 and MOSS4Min. These formulas are critical for evaluating the error of an estimate and determining a sampling budget that guarantees a certain accuracy. They can also help us choose the right sampling strategy in advance. An example is given in the following subsection.
From Theorems 3 and 5, when , we have
where , , and . Thus, the value of helps us to determine whether it is necessary to apply MOSS4Min to further reduce the errors of estimating , , and . For example, the graph ca-GrQc [?] has . In our experiments we observe that MOSS4Min slightly improves the accuracy of MOSS4 for estimating and of ca-GrQc, and exhibits a larger error than MOSS4 for estimating of ca-GrQc.
MOSS5, our method of estimating the frequencies of all 5-node motifs, consists of two submethods: T5 and Path5. We develop T5 to sample 5-node CISes that include at least one subgraph isomorphic to . Similarly, Path5 is developed to sample 5-node CISes that include at least one subgraph isomorphic to . Finally, we propose a method to estimate the frequencies of all 5-node motifs based on the sampled CISes given by T5 and Path5.
The pseudocode of T5 is shown in Algorithm 3. Let
We assign a weight to each node . Define and . To sample a 5-node CIS, T5 mainly consists of five steps: 1) Sample a node from according to the distribution ; 2) Sample a node of according to the distribution , where is defined the same as in (1); 3) Sample two different nodes and from at random; 4) Sample a node from uniformly at random; 5) Retrieve the CIS including nodes , , , and . We run the above method times to obtain CISes .
Let , , be the number of subgraphs in motif that are isomorphic to motif . The value of is given in the following table. The following theorem shows the sampling bias of the 5-node T-sampling method.
Theorem 6
When the sampling budget , T5 samples a CIS with probability , .
 1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21
 0   0   1   1   2   0   2   2   4   4   5   4   6  10   9  12  10  20  20  36  60
 1   0   0   2   2   5   1   0   4   7   2   4   6  10   6   6  14  24  18  36  60
 0   1   0   0   0   0   0   1   0   0   1   1   0   1   1   2   0   1   2   3   5
We let be the 5-node motif class ID of when is a 5-node CIS, and 1 otherwise. Denote . Let . For , is larger than zero and we then estimate as
Theorem 7
For , is an unbiased estimator of and its variance is
(4) 
The covariance of and is
The pseudocode of Path5 is shown in Algorithm 4. Let
We assign a weight to each node . Define and . To sample a 5-node CIS, Path5 mainly consists of six steps: 1) Sample a node from according to the distribution ; 2) Sample a node from according to the distribution , where and is defined as
3) Sample a node from according to the distribution , where and