Minfer: Inferring Motif Statistics From Sampled Edges
Abstract
Characterizing motif (i.e., locally connected subgraph patterns) statistics is important for understanding complex networks such as online social networks and communication networks. Previous work made the strong assumption that the graph topology of interest is known, and that the dataset either fits into main memory or is stored on disk such that it is not expensive to obtain all neighbors of any given node. In practice, researchers have to deal with the situation where the graph topology is unknown, either because the graph is dynamic, or because it is expensive to collect and store all topological and meta information on disk. Hence, what is available to researchers is only a snapshot of the graph generated by sampling edges from the graph at random, which we called a “RESampled graph”. Clearly, a RESampled graph’s motif statistics may be quite different from the underlying original graph. To solve this challenge, we propose a framework and implement a system called Minfer, which can take the given RESampled graph and accurately infer the underlying graph’s motif statistics. We also use Fisher information to bound the errors of our estimates. Experiments using large scale datasets show our method to be accurate.
Pinghui Wang, John C.S. Lui, and Don Towsley 
Noah’s Ark Lab, Huawei, Hong Kong 
Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong 
Department of Computer Science, University of Massachusetts Amherst, MA, USA 
wang.pinghui@huawei.com, cslui@cse.cuhk.edu.hk, towsley@cs.umass.edu
Complex networks are widely studied across many fields of science and technology, from physics to biology, and from nature to society. Networks which have similar global topological features such as degree distribution and graph diameter can exhibit significant differences in their local structures. There is growing interest in exploring these local structures (also known as "motifs"), which are small connected subgraph patterns that form during the growth of a network. Motifs have many applications: for example, they are used to characterize communication and evolution patterns in online social networks (OSNs) [?, ?, ?, ?], pattern recognition in gene expression profiling [?], protein-protein interaction prediction [?], and coarse-grained topology generation of networks [?]. For instance, 3-node motifs such as "the friend of my friend is my friend" and "the enemy of my enemy is my friend" are well-known evolution patterns in signed (i.e., friend/foe) social networks. Kunegis et al. [?] considered the significance of motifs in Slashdot Zoo (www.slashdot.org) and how they impact the stability of signed networks. Other more complex examples include 4-node motifs such as bifans and biparallels defined in [?].
Although motifs are important characteristics that help researchers understand the underlying network, one major technical hurdle is that it is computationally expensive to compute motif frequencies, since this requires one to enumerate and count all subgraphs in a network, and a large number of subgraphs exist even for a medium-sized network with less than one million edges. For example, the graphs Slashdot [?] and Epinions [?], which each contain approximately 80,000 nodes and 500,000 edges, have billions of 4-node connected and induced subgraphs [?]. To address this problem, several sampling methods have been proposed to estimate the frequency distribution of motifs [?, ?, ?, ?]. All these methods require that the entire graph topology fit into memory, or that an I/O-efficient neighbor query API be available so that one can explore the graph topology stored on disk. In summary, previous work focuses on designing computationally efficient methods to characterize motifs when the entire graph of interest is given.
In practice the graph of interest may not be known, but instead the available dataset is a subgraph sampled from the original graph. This can be due to the following reasons:

Data collection: Sampling is inevitable when collecting a large dynamic graph given as a high speed stream of edges. For example, sampling is used to collect network traffic on backbone routers in order to study the traffic graph, where a node in the graph represents a host and an edge represents a connection from one host to another, because capturing the entire traffic is prohibitive due to the high traffic speed and the limited resources (e.g., memory and computation) of network devices.

Data transportation: Sampling may also be required to reduce the high cost of transporting an entire dataset to a remote data analysis center.

Memory and computation: Sometimes the graph of interest is given in a memory-expensive format such as a raw text file, and may be too large to fit into memory. Moreover, it may be too expensive to preprocess and organize it on disk. In such cases, it may be useful to build a relatively small graph consisting of edges sampled from the graph file at random, and compute its motif statistics in memory.
A simple example is given in Fig. 1, where the sampled graph is derived from the original graph by randomly sampling its edges. In this work, we assume the available graph is obtained through random edge sampling (i.e., each edge is independently sampled with the same probability p), which is popular and easy to implement in practice. Formally, we denote the sampled graph G' as a RESampled graph of the original graph G. One can easily see that a RESampled graph's motif statistics may be quite different from those of the original graph due to uncertainties introduced by sampling. For example, Fig. 1 shows a 4-node induced subgraph s' in the RESampled graph G' for which we do not know the original induced subgraph s in G from which it derives; s could be any one of the five subgraphs depicted in Fig. 1.
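Random edge sampling itself is straightforward to implement. Below is a minimal sketch of building a RESampled graph, assuming edges are given as node-pair tuples (the function name is illustrative, not from the paper):

```python
import random

def resample_graph(edges, p, seed=None):
    """Return a RESampled graph: every edge of the original graph is
    kept independently with the same probability p (Bernoulli sampling)."""
    rng = random.Random(seed)
    sampled = [e for e in edges if rng.random() < p]
    # The node set of the RESampled graph contains only endpoints of
    # sampled edges; nodes whose edges were all discarded do not survive.
    nodes = {u for e in sampled for u in e}
    return nodes, sampled
```

With p = 0.1, for instance, the RESampled graph holds roughly 10% of the original edges, which is what makes its raw motif counts biased relative to the original graph.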
Unlike previous methods [?, ?, ?, ?], we aim to design an accurate method to infer motif statistics of the original graph G from the available RESampled graph G'. These previous methods focus on designing computationally efficient methods based on sampling induced subgraphs in G, which avoids the problem shown in Fig. 1; hence they cannot infer the underlying graph's motif statistics from a given RESampled graph. The gSH method in [?] can be used to estimate the number of connected subgraphs from sampled edges. However, it cannot be applied to characterize motifs, i.e., connected and induced subgraphs (CISes), because motif statistics can differ from connected-subgraph statistics. For example, the fraction of a graph's 4-node connected subgraphs isomorphic to a 4-node line (i.e., the first motif in Fig. 2(b)) can differ considerably from the fraction of its 4-node CISes isomorphic to a 4-node line.
Contribution: Our contribution can be summarized as follows: To the best of our knowledge, we are the first to study and provide an accurate and efficient solution to estimate motif statistics from a given RESampled graph. We introduce a probabilistic model to study the relationship between motifs in the RESampled graph and in the underlying graph. Based on this model, we propose an accurate method, Minfer, to infer the underlying graph's motif statistics from the RESampled graph. We also provide a Fisher information based method to bound the error of our estimates. Experiments on a variety of real-world datasets show that our method can accurately estimate the motif statistics of a graph based on a small RESampled graph.
This paper is organized as follows: The problem formulation is presented in Section 2. Section 3 presents our method (i.e., Minfer) for inferring subgraph class concentrations of the graph under study from a given RESampled graph. Section 4 presents methods for computing the given RESampled graph's motif statistics. The performance evaluation and testing results are presented in Section 5. Section 6 summarizes related work. Concluding remarks then follow.
G = (V, E, L)  the graph under study
G' = (V', E', L')  a RESampled graph of G
V^(s)  set of nodes for k-node CIS s
E^(s)  set of edges for k-node CIS s
M(s)  associated motif of CIS s
T_k  number of k-node motif classes
M_i^(k)  i-th k-node motif, 1 ≤ i ≤ T_k
C(G)  set of k-node CISes in G
C_i(G)  set of CISes in G isomorphic to M_i^(k)
n  number of k-node CISes in G
n_i  number of CISes in G isomorphic to M_i^(k)
m_i  number of CISes in G' isomorphic to M_i^(k)
ω_i  concentration of motif M_i^(k) in G
P = [P_{j,i}]  T_k × T_k matrix
P_{j,i}  probability that a k-node CIS in G' is isomorphic to M_i^(k) given that its original CIS in G is isomorphic to M_j^(k)
φ_{j,i}  number of subgraphs of M_j^(k) isomorphic to M_i^(k)
θ_i  concentration of motif M_i^(k) in G'
p  probability of sampling an edge
In this section, we first introduce the concept of motif concentration, then we discuss the challenges of computing motif concentrations in practice.
Denote the underlying graph of interest as a labeled undirected graph G = (V, E, L), where V is a set of nodes, E ⊆ V × V is a set of undirected edges, and L is a set of labels associated with the edges in E. For example, we attach a label to indicate the direction of an edge for a directed network. Edges may have other labels too; for instance, in a signed network, edges have positive or negative labels to represent friend or foe relationships. If L is empty, then G is an unlabeled undirected graph, which is equivalent to a regular undirected graph.
Motif concentration is a metric that represents the distribution of the various subgraph patterns that appear in G. To illustrate, we show the 3-, 4- and 5-node subgraph patterns in Fig. 2 (a), (b), and (c) respectively. To define motif concentration formally, we first need to introduce some notation. For ease of presentation, Table 1 depicts the notation used in this paper.
An induced subgraph s = (V^(s), E^(s), L^(s)) of G, with V^(s) ⊂ V, E^(s) ⊂ E, and L^(s) ⊂ L, is a subgraph whose edges and associated labels are all in G, i.e., E^(s) = (V^(s) × V^(s)) ∩ E. We define C(G) as the set of all connected induced subgraphs (CISes) with k nodes in G, and denote n = |C(G)|. Let T_k denote the number of k-node motifs and M_i^(k) denote the i-th k-node motif. For example, T_4 = 6, and M_1^(4), …, M_6^(4) are the six 4-node undirected motifs depicted in Fig. 2(b). Then we partition C(G) into T_k equivalence classes, C_1(G), …, C_{T_k}(G), where the CISes within C_i(G) are isomorphic to M_i^(k).
Let n_i denote the frequency of the motif M_i^(k), i.e., the number of the CISes in C(G) isomorphic to M_i^(k). Formally, we have n = Σ_{i=1}^{T_k} n_i, which is the number of CISes in C(G). Then the concentration of M_i^(k) is defined as

ω_i = n_i / n,  1 ≤ i ≤ T_k.

Thus, ω_i is the fraction of k-node CISes isomorphic to the motif M_i^(k) among all k-node CISes. In this paper, we make the following assumptions:

Assumption 1: The complete G is not available to us, but a RESampled graph G' = (V', E', L') of G is given, where V', E', and L' are the node, edge, and edge label sets of G' respectively. G' is obtained by random edge sampling, i.e., each edge in E is independently sampled with the same probability p, where p is known in advance.

Assumption 2: The label of a sampled edge e ∈ E' is the same as that of e in G, i.e., L'(e) = L(e).
These two assumptions are satisfied by many applications' data collection procedures. For instance, the source data of online applications such as network traffic monitoring is given as a stream of directed edges, and the following simple method is computationally and memory efficient for collecting edges and generating a small RESampled graph, which will be sent to a remote network traffic analysis center: Each incoming directed edge (u, v) is sampled when h(u, v) < pN, where N is an integer (e.g., 10,000) and h is a hash function satisfying h(u, v) = h(v, u) and mapping edges into the integers 0, 1, …, N − 1 uniformly. The property h(u, v) = h(v, u) guarantees that the edges (u, v) and (v, u) are sampled or discarded simultaneously. Hence the label of a sampled edge e is the same as that of e in G. Using universal hashing [?], a simple instance of h is given by the following function when each node ID is an integer smaller than q:

h(u, v) = ((a (u + v) + b) mod q) mod N,

where q is a prime larger than N, and a and b are any integers with 0 < a < q and 0 ≤ b < q. We can easily find that h(u, v) = h(v, u) and that h maps edges into integers uniformly. The computational and space complexities of the above sampling method are both O(1), which makes it easy to use for data collection in practice. As alluded to before, in this paper we aim to accurately infer the motif concentrations of G based on the given RESampled graph G'.
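The hash-based stream sampling described above can be sketched as follows. The concrete parameter values and the symmetric form depending on u + v are illustrative assumptions, not prescribed by the paper:

```python
def make_edge_hash(a, b, prime, n_buckets):
    """Symmetric universal-style hash h(u, v) = ((a*(u+v) + b) % prime) % n_buckets.
    Because it depends only on u + v, h(u, v) == h(v, u), so the directed
    edges (u, v) and (v, u) are sampled or discarded together."""
    def h(u, v):
        return ((a * (u + v) + b) % prime) % n_buckets
    return h

def sample_edge_stream(stream, p, h, n_buckets):
    """Keep an incoming edge (u, v) iff h(u, v) < p * n_buckets,
    so each edge is kept with probability approximately p."""
    threshold = p * n_buckets
    return [(u, v) for (u, v) in stream if h(u, v) < threshold]
```

Per edge this does O(1) work and keeps no state beyond the sampled edges themselves, matching the O(1) complexity claim in the text.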
The motif statistics of a RESampled graph G' and the original graph G can be quite different. In this section, we introduce a probabilistic model to bridge the gap between the motif statistics of G' and G. Using this model, we will show that there exists a simple and concise relationship between the motif statistics of G' and G. We then propose an efficient method to infer the motif concentrations of G from G'. Finally, we also give a method to construct confidence intervals for our estimates of the motif concentrations.
To estimate the motif statistics of G based on G', we develop a probabilistic method to model the relationship between the motifs in G' and G. For a k-node CIS s in G, denote by s' the subgraph induced by V^(s) in G'. Define P = [P_{j,i}], 1 ≤ j, i ≤ T_k, where P_{j,i} is the probability that s' is isomorphic to motif M_i^(k) given that s is isomorphic to motif M_j^(k), i.e., P_{j,i} = Pr[M(s') = M_i^(k) | M(s) = M_j^(k)].
To obtain P_{j,i}, we first compute φ_{j,i}, which is the number of subgraphs of M_j^(k) isomorphic to M_i^(k). For example, M_2^(3) (i.e., the triangle) includes three subgraphs isomorphic to M_1^(3) (i.e., the wedge) among the undirected motifs shown in Fig. 2(a). Thus, we have φ_{2,1} = 3 for 3-node undirected motifs. When j = i, φ_{j,i} = 1. It is not easy to compute φ_{j,i} manually for 4- and 5-node motifs, hence we provide a simple method to compute φ_{j,i} in Algorithm 1. The cost of computing φ_{j,i} is not a big concern because in practice k is usually five or less for motif discovery. Denote by V^(M) and E^(M) the sets of nodes and edges in subgraph M respectively. We have the following equation

P_{j,i} = φ_{j,i} · p^{e_i} · (1 − p)^{e_j − e_i},

where e_t = |E^(M_t^(k))|. For all CISes in G isomorphic to M_j^(k), the above model tells us that approximately a fraction P_{j,i} of these CISes are expected to appear as CISes isomorphic to M_i^(k) in G'.
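The transition matrix can be assembled directly from the subgraph counts φ and the sampling probability p. A minimal Python sketch under the model P[j][i] = φ[j][i] · p^{e_i} · (1 − p)^{e_j − e_i}; the function name is illustrative, and the 3-node values follow the wedge/triangle example above:

```python
def transition_matrix(phi, edge_counts, p):
    """P[j][i]: probability that a CIS isomorphic to motif j in the
    original graph appears as a CIS isomorphic to motif i in the
    RESampled graph, under independent edge sampling with probability p."""
    T = len(phi)
    P = [[0.0] * T for _ in range(T)]
    for j in range(T):
        for i in range(T):
            if phi[j][i]:  # skip impossible transitions (phi == 0)
                P[j][i] = (phi[j][i] * p ** edge_counts[i]
                           * (1 - p) ** (edge_counts[j] - edge_counts[i]))
    return P

# 3-node undirected motifs: M1 = wedge (2 edges), M2 = triangle (3 edges).
phi3 = [[1, 0],
        [3, 1]]   # a triangle contains 3 wedges and itself
edges3 = [2, 3]
```

For example, `transition_matrix(phi3, edges3, p)` yields the 2 × 2 matrix whose second row says a triangle survives as a wedge with probability 3p²(1 − p) and as a triangle with probability p³.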
Using the above probabilistic model, we propose a method, Minfer, to estimate the motif statistics of G from G'. Denote by m_i, 1 ≤ i ≤ T_k, the number of CISes in G' isomorphic to the motif M_i^(k). The method to compute m_1, …, m_{T_k} is presented in the next section. Then, the expectation of m_i is computed as

E[m_i] = Σ_{j=1}^{T_k} P_{j,i} n_j,  1 ≤ i ≤ T_k.  (1)

In matrix notation, Equation (1) can be expressed as

E[m] = P^T n,

where m = (m_1, …, m_{T_k})^T, n = (n_1, …, n_{T_k})^T, and P^T is the transpose of P. Then, we have

n = (P^T)^{−1} E[m].

Thus, we estimate n as

n̂ = (P^T)^{−1} m,

where n̂ = (n̂_1, …, n̂_{T_k})^T. We easily have

E[n̂] = (P^T)^{−1} E[m] = n,

therefore n̂_i is an unbiased estimator of n_i. Finally, we estimate ω_i as follows:

ω̂_i = n̂_i / Σ_{t=1}^{T_k} n̂_t.  (2)
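The inversion and normalization steps above can be sketched with NumPy; the function name is illustrative:

```python
import numpy as np

def minfer_estimate(m, P):
    """Inversion step of Minfer: given the motif counts m observed in the
    RESampled graph and the transition matrix P, recover the unbiased
    counts n_hat = (P^T)^{-1} m and the concentrations omega_hat."""
    P = np.asarray(P, dtype=float)
    m = np.asarray(m, dtype=float)
    n_hat = np.linalg.solve(P.T, m)   # solves P^T n_hat = m
    omega_hat = n_hat / n_hat.sum()
    return n_hat, omega_hat
```

Using `np.linalg.solve` rather than explicitly inverting P^T is the standard numerically stable choice; with the 3-node matrix for p = 0.5 and observed counts m = (40, 5), the recovered counts are n̂ = (100, 40).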
Denote by θ_i the concentration of motif M_i^(k) in G', i.e., θ_i = m_i / Σ_{t=1}^{T_k} m_t. Then we observe that (2) is equivalent to the following equation, which directly describes the relationship between the motif concentrations of G and G'. Let θ = (θ_1, …, θ_{T_k})^T and ω̂ = (ω̂_1, …, ω̂_{T_k})^T. Then

ω̂ = c^{−1} (P^T)^{−1} θ,  (3)

where c = 1^T (P^T)^{−1} θ is a normalizer. For 3-node undirected motifs, P is computed as

P = [ p^2        0
      3p^2(1−p)  p^3 ],

and the inverse of P is

P^{−1} = [ p^{−2}         0
           −3(1−p)p^{−3}  p^{−3} ].

Expressions for P and P^{−1} for 3-node signed undirected motifs, 3-node directed motifs, 4-node undirected motifs, and 5-node undirected motifs can be found in the Appendix.
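As a quick numerical sanity check, the closed-form 3-node undirected matrices, assuming the forms P = [[p², 0], [3p²(1 − p), p³]] and P⁻¹ = [[p⁻², 0], [−3(1 − p)p⁻³, p⁻³]] derived from the model above, can be verified to be inverses:

```python
import numpy as np

p = 0.3  # any sampling probability in (0, 1] works here
P = np.array([[p**2,               0.0 ],
              [3 * p**2 * (1 - p), p**3]])
P_inv = np.array([[p**-2,                0.0  ],
                  [-3 * (1 - p) * p**-3, p**-3]])
# Both products must be the identity matrix.
assert np.allclose(P @ P_inv, np.eye(2))
assert np.allclose(P_inv @ P, np.eye(2))
```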
It is difficult to directly analyze the errors of our estimates ω̂_i, because it is complex to model the dependence of sampled CISes due to their shared edges and nodes. Instead, we derive a lower bound on the mean squared error (MSE) of ω̂ using the Cramér-Rao lower bound (CRLB) of ω, which gives the smallest MSE that any unbiased estimator of ω can achieve. For a k-node CIS s selected from the k-node CISes of G at random, the probability that s is isomorphic to the i-th k-node motif is ω_i. Let s' be the subgraph induced by the node set V^(s) in the RESampled graph G'. Clearly, s' may not be connected. Furthermore, there may exist nodes in V^(s) that are not present in G'. We say s is evaporated in G' in these two scenarios. Let β_j denote the probability that s is evaporated given that its original CIS is isomorphic to the j-th k-node motif. Then, we have

β_j = 1 − Σ_{i=1}^{T_k} P_{j,i}.

For a random k-node CIS s of G, the probability that its associated s' in G' is isomorphic to the i-th k-node motif is

q_i = Σ_{j=1}^{T_k} ω_j P_{j,i},  1 ≤ i ≤ T_k,

and the probability that s is evaporated is

q_0 = Σ_{j=1}^{T_k} ω_j β_j.

When s is evaporated, we denote M(s') = M_0^(k). Then, the likelihood function of s' with respect to ω is

ℓ(s'; ω) = Π_{i=0}^{T_k} q_i^{1(M(s') = M_i^(k))}.

The Fisher information of s' with respect to ω is defined as a matrix J(ω), where

J_{x,y}(ω) = E[ (∂ log ℓ(s'; ω) / ∂ω_x) (∂ log ℓ(s'; ω) / ∂ω_y) ],  1 ≤ x, y ≤ T_k.

For simplicity, we assume that the CISes of G are independent (i.e., no overlapping edges). Then the Fisher information matrix of all n k-node CISes is nJ(ω). The Cramér-Rao theorem states that the MSE of any unbiased estimator ω̂ is lower bounded by the inverse of the Fisher information matrix, i.e.,

MSE(ω̂) ⪰ (nJ(ω))^{−1} − Δ,

provided some weak regularity conditions hold [?]. Here the term Δ corresponds to the accuracy gain obtained by accounting for the constraint Σ_{i=1}^{T_k} ω_i = 1.
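Under the independence assumption, the Fisher information for the 3-node undirected case can be computed directly. A minimal sketch that treats ω₁ as the single free parameter (so ω₂ = 1 − ω₁ and the sum-to-one constraint is built in) and includes "evaporated" as an extra observation category; the function name is illustrative:

```python
import numpy as np

def fisher_info_3node(omega1, p):
    """Fisher information of omega_1 for one randomly chosen 3-node CIS,
    using the transition model P and the evaporation probabilities beta.
    The CRLB on the variance of any unbiased estimate of omega_1 built
    from n independent CISes is then 1 / (n * J)."""
    P = np.array([[p**2,               0.0 ],
                  [3 * p**2 * (1 - p), p**3]])
    beta = 1.0 - P.sum(axis=1)                 # evaporation probabilities
    omega = np.array([omega1, 1.0 - omega1])
    q = np.append(omega @ P, omega @ beta)     # category probabilities, sum to 1
    dq = np.append(P[0] - P[1], beta[0] - beta[1])  # d q / d omega_1
    return np.sum(dq**2 / q)                   # J = sum_c (dq_c)^2 / q_c
```

For instance, with ω₁ = 0.5 and p = 0.5 this evaluates to J = 0.4, so any unbiased estimate of ω₁ from n independent CISes has variance at least 1/(0.4 n) under this simplified model.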
The existing generalized graph enumeration method [?] can be used to enumerate all k-node CISes in the RESampled graph G'; however, it is complex to apply and is inefficient for small values of k. In this section, we first present a method (an extension of the NodeIterator++ method in [?]) to enumerate all 3-node CISes in G'. Then, we propose new methods to enumerate 4- and 5-node CISes in G' respectively. In what follows we denote by N(v) the neighbors of node v in G'. Note that in this section G' is the default graph when a function's underlying graph is omitted for simplicity. For example, the CIS with nodes u, v, and w refers to the CIS with nodes u, v, and w in G'.
Algorithm 2 shows our 3-node CIS enumeration method. Similar to the NodeIterator++ method in [?], we "pivot" (the associated operation is discussed later) each node v to enumerate the CISes including v. For any two neighbors u and w of v, we can easily find that the subgraph induced by nodes u, v, and w is a 3-node CIS. Thus, we enumerate all pairs of nodes in N(v), and update their associated 3-node CISes. We call this process "pivoting" for 3-node CISes.
Clearly, a 3-node CIS s is counted three times when the associated undirected graph of s obtained by discarding edge labels is isomorphic to a triangle, once by pivoting each of its three nodes. Let ≺ be a total order on all of the nodes, which can be easily defined and obtained, e.g., from array positions or pointer addresses. To ensure each CIS is enumerated once and only once, we let one and only one node in each CIS be "responsible" for making sure the CIS gets counted. When we pivot v and enumerate a CIS s, s is counted if v is the "responsible" node of s; otherwise, s is discarded and not counted. We use the same method as in [?, ?], i.e., we let the node with the lowest order in a CIS whose associated undirected graph is isomorphic to a triangle be the "responsible" node. The other classes of CISes have associated undirected graphs isomorphic to an unclosed wedge, i.e., the first motif in Fig. 2(a); for each of these CISes, we let the node in the middle of its associated undirected graph (i.e., the node with degree 2 in the unclosed wedge) be the "responsible" node.
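The pivoting and responsible-node rules for 3-node CISes can be sketched as follows, for the unlabeled undirected case, with the graph given as a dict of adjacency sets (the function name is illustrative):

```python
def count_3node_cises(adj):
    """Enumerate every 3-node CIS exactly once by pivoting each node v:
    each pair of neighbors u, w of v induces a 3-node CIS on {u, v, w}.
    A wedge is counted only at its middle node; a triangle is counted
    only at its lowest-ordered node (the 'responsible' node)."""
    counts = {"wedge": 0, "triangle": 0}
    for v in adj:
        nbrs = sorted(adj[v])
        for i in range(len(nbrs)):
            for j in range(i + 1, len(nbrs)):
                u, w = nbrs[i], nbrs[j]
                if u in adj[w]:              # closed: a triangle
                    if v < u and v < w:      # v is the responsible node
                        counts["triangle"] += 1
                else:                        # open wedge: v is the middle node
                    counts["wedge"] += 1
    return counts
```

Running this on a triangle graph yields one triangle and no wedges, and on a 3-node path yields one wedge, matching the once-and-only-once counting guarantee.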
Algorithm 3 shows our 4-node CIS enumeration method. To enumerate 4-node CISes, we pivot each node v as follows: For each pair of v's neighbors u and w where u ≠ w, we compute the neighborhood of v, u, and w, defined as N(v, u, w) = (N(v) ∪ N(u) ∪ N(w)) \ {v, u, w}. For any node x ∈ N(v, u, w), we observe that the subgraph induced by nodes v, u, w, and x is a 4-node CIS. Thus, we enumerate each node x in N(v, u, w), and update the 4-node CIS consisting of v, u, w, and x. We repeat this process until all pairs of v's neighbors u and w are enumerated and processed.
Similar to 3-node CISes, some 4-node CISes might be enumerated and counted more than once when we pivot each node as above. To solve this problem, we propose the following rules for making sure each 4-node CIS s is enumerated and counted once and only once: When x ∈ N(v) and x ≺ v, we discard s. Otherwise, denote by UG(s) the associated undirected graph of s obtained by discarding edge labels. When UG(s) includes one and only one node having at least 2 neighbors in UG(s), we let that node be the "responsible" node of s; for example, node 4 is the "responsible" node of the first subgraph in Fig. 3. When UG(s) includes more than one node having at least 2 neighbors in UG(s), we let the node with the lowest order among those nodes be the "responsible" node of s; for example, nodes 6 and 3 are the "responsible" nodes of the second and third subgraphs in Fig. 3.
Algorithm 4 describes our 5-node CIS enumeration method. For a 5-node CIS s, we classify it into two types according to its associated undirected graph UG(s):

5-node CIS of type 1: UG(s) has at least one node having more than two neighbors in UG(s);

5-node CIS of type 2: UG(s) has no node having more than two neighbors in UG(s), i.e., UG(s) is isomorphic to a 5-node line or a 5-node circle, i.e., the first or sixth motif in Fig. 2(c).
We propose two different methods to enumerate these two types of 5-node CISes respectively.
To enumerate 5-node CISes of type 1, we pivot each node v as follows: When v has at least three neighbors, we enumerate each combination of three nodes u, w, x ∈ N(v), and then compute the neighborhood of v, u, w, and x, defined as N(v, u, w, x) = (N(v) ∪ N(u) ∪ N(w) ∪ N(x)) \ {v, u, w, x}. For any node y ∈ N(v, u, w, x), we observe that the subgraph induced by nodes v, u, w, x, and y is a 5-node CIS. Thus, we enumerate each node y in N(v, u, w, x), and update the associated 5-node CIS consisting of v, u, w, x, and y. We repeat this process until all combinations of three nodes in N(v) are enumerated and processed. Similar to 4-node CISes, we propose the following rule to make sure each 5-node CIS s is enumerated and counted once and only once: When y ∈ N(v) and y ≺ v, we discard s. Otherwise, let UG(s) be the associated undirected graph of s, and we pick the node with the lowest order among the nodes having more than two neighbors in UG(s) to be the "responsible" node. The third and fourth subgraphs in Fig. 3 are two corresponding examples.
To enumerate 5-node CISes of type 2, we pivot each node v as follows: When v has at least two neighbors, we first enumerate each pair of v's neighbors u and w where u ≠ w. Then, we compute A(u), defined as the set of u's neighbors not including v and w and not connected to v and w, that is, A(u) = N(u) \ ({v, w} ∪ N(v) ∪ N(w)). Similarly, we compute A(w), defined as the set of w's neighbors not including v and u and not connected to v and u, i.e., A(w) = N(w) \ ({v, u} ∪ N(v) ∪ N(u)). Clearly, A(u) ∩ A(w) = ∅. For any x ∈ A(u) and y ∈ A(w), we observe that the subgraph induced by nodes v, u, w, x, and y is a 5-node CIS of type 2. Thus, we enumerate each pair x and y, and update the 5-node CIS consisting of v, u, w, x, and y. We repeat this process until all pairs of v's neighbors u and w are enumerated and processed. To make sure each CIS s is enumerated and counted once and only once, we let the node with the lowest order be the "responsible" node when the associated undirected graph UG(s) is isomorphic to a 5-node circle. When UG(s) is isomorphic to a 5-node line, we let the node in the middle of the line be the "responsible" node. The first and second subgraphs in Fig. 3 are two examples respectively.
Graph  nodes  edges  max-degree
Flickr [?]  1,715,255  15,555,041  27,236
Pokec [?]  1,632,803  22,301,964  14,854
LiveJournal [?]  5,189,809  48,688,097  15,017
YouTube [?]  1,138,499  2,990,443  28,754
Wiki-Talk [?]  2,394,385  4,659,565  100,029
Web-Google [?]  875,713  4,322,051  6,332
soc-Epinions1 [?]  75,897  405,740  3,044
soc-Slashdot08 [?]  77,360  469,180  2,539
soc-Slashdot09 [?]  82,168  504,230  2,552
sign-Epinions [?]  119,130  704,267  3,558
sign-Slashdot08 [?]  77,350  416,695  2,537
sign-Slashdot09 [?]  82,144  504,230  2,552
com-DBLP [?]  317,080  1,049,866  343
com-Amazon [?]  334,863  925,872  549
p2p-Gnutella08 [?]  6,301  20,777  97
ca-GrQc [?]  5,241  14,484  81
ca-CondMat [?]  23,133  93,439  279
ca-HepTh [?]  9,875  25,937  65
Flickr  Pokec  LiveJournal  Wiki-Talk  Web-Google
undirected 3-node motifs
1  9.60e-01  9.84e-01  9.55e-01  9.99e-01  9.81e-01
2  4.04e-02  1.60e-02  4.50e-02  7.18e-04  1.91e-02
directed 3-node motifs
1  2.17e-01  1.77e-01  7.62e-02  8.91e-01  1.27e-02
2  6.04e-02  1.11e-01  4.83e-02  4.04e-02  1.60e-02
3  1.28e-01  1.60e-01  3.28e-01  3.91e-03  9.28e-01
4  2.44e-01  1.74e-01  1.14e-01  5.43e-02  3.09e-03
5  1.31e-01  1.91e-01  1.73e-01  5.48e-03  1.92e-02
6  1.80e-01  1.71e-01  2.15e-01  3.88e-03  1.92e-03
7  5.69e-05  7.06e-05  2.74e-05  1.37e-05  4.91e-05
8  6.52e-03  2.49e-03  8.66e-03  1.81e-04  6.82e-03
9  1.58e-03  1.03e-03  1.06e-03  8.42e-05  2.84e-04
10  5.19e-03  1.91e-03  6.63e-03  1.28e-04  2.77e-03
11  6.46e-03  2.03e-03  6.27e-03  8.03e-05  5.98e-03
12  1.07e-02  5.13e-03  9.82e-03  1.78e-04  1.21e-03
13  9.86e-03  3.45e-03  1.26e-02  6.65e-05  2.00e-03
Flickr  Pokec  LiveJournal  Wiki-Talk  Web-Google
1  1.92e-03  3.26e-03  2.69e-03  5.21e-03  2.93e-04
2  4.56e-02  6.92e-02  1.64e-01  2.67e-01  4.00e-01
1  2.90e-04  4.10e-04  2.64e-04  6.06e-04  2.92e-05
2  6.90e-03  8.68e-03  1.61e-02  3.11e-02  3.99e-02
sign-Epinions  sign-Slashdot08  sign-Slashdot09
1  6.69e-01  6.58e-01  6.68e-01
2  2.12e-01  2.32e-01  2.25e-01
3  9.09e-02  1.02e-01  9.96e-02
4  2.29e-02  5.86e-03  5.75e-03
5  2.76e-03  9.74e-04  9.34e-04
6  2.49e-03  1.14e-03  1.13e-03
7  3.81e-04  1.80e-04  1.76e-04
soc-Epinions1  soc-Slashdot08  soc-Slashdot09  com-Amazon
1  3.24e-01  2.93e-01  2.90e-01  2.10e-01
2  6.15e-01  6.86e-01  6.89e-01  6.99e-01
3  2.78e-03  1.25e-03  1.30e-03  2.37e-03
4  5.45e-02  1.86e-02  1.84e-02  7.69e-02
5  3.01e-03  7.77e-04  8.48e-04  1.05e-02
6  2.25e-04  9.19e-05  9.36e-05  1.55e-03
com-Amazon  com-DBLP  p2p-Gnutella08  ca-GrQc  ca-CondMat  ca-HepTh
1  2.9e-2  1.4e-1  2.6e-1  9.8e-2  1.4e-1  2.6e-1
2  7.5e-1  1.8e-1  1.8e-1  5.2e-2  2.2e-1  8.2e-2
3  1.6e-1  4.4e-1  4.6e-1  2.1e-1  4.3e-1  4.4e-1
4  6.0e-3  4.8e-2  1.1e-2  1.0e-1  4.9e-2  6.0e-2
5  2.3e-3  1.1e-3  2.7e-2  1.4e-3  2.1e-3  5.4e-3
6  3.6e-5  5.0e-5  1.4e-3  9.2e-5  1.1e-4  4.1e-4
7  1.5e-2  5.6e-2  2.7e-2  1.1e-1  5.5e-2  6.4e-2
8  3.5e-2  7.9e-2  2.2e-2  1.2e-1  8.0e-2  5.2e-2
9  1.4e-3  4.2e-3  1.4e-3  1.5e-2  7.0e-3  8.4e-3
10  1.7e-4  1.4e-4  1.0e-3  6.5e-4  3.0e-4  8.0e-4
11  7.3e-3  8.1e-3  4.3e-3  2.3e-2  9.9e-3  1.0e-2
12  5.3e-4  6.4e-3  2.8e-4  2.3e-2  4.5e-3  3.6e-3
13  8.2e-5  3.5e-6  7.4e-4  4.5e-6  6.4e-6  3.5e-5
14  3.9e-4  5.2e-4  1.7e-4  2.8e-3  6.6e-4  1.0e-3
15  6.7e-4  2.6e-2  7.6e-5  1.5e-1  5.9e-3  5.3e-3
16  7.1e-4  3.4e-4  1.4e-4  1.4e-3  9.2e-4  4.4e-4
17  3.9e-5  1.1e-5  8.0e-5  4.3e-5  2.9e-5  8.4e-5
18  2.3e-5  4.9e-6  6.0e-6  2.3e-5  8.5e-6  3.0e-5
19  2.4e-4  2.8e-3  1.5e-5  1.9e-2  9.8e-4  5.8e-4
20  5.8e-5  4.2e-4  7.0e-7  8.0e-3  1.4e-4  8.2e-5
21  7.2e-6  7.9e-3  1.5e-8  6.1e-2  1.5e-4  3.2e-3
In this section, we first introduce our experimental datasets. Then we present results of the experiments used to evaluate the performance of our method, Minfer, for characterizing CIS classes of size k = 3, 4, 5.
We evaluate the performance of our methods on publicly available datasets taken from the Stanford Network Analysis Platform (SNAP, www.snap.stanford.edu), which are summarized in Table 2. We start by evaluating the performance of our methods in characterizing 3-node CISes over million-node graphs: Flickr, Pokec, LiveJournal, YouTube, Web-Google, and Wiki-Talk, contrasting our results with the ground truth computed through an exhaustive method. It is computationally intensive to calculate the ground truth of 4-node and 5-node CIS classes in large graphs. For example, we can easily observe that a node with degree d is included in at least (d choose 3) 4-node CISes and (d choose 4) 5-node CISes; therefore it requires more than 10^14 and 10^18 operations respectively to enumerate the 4-node and 5-node CISes of the Wiki-Talk graph, which has a node with 100,029 neighbors. Even for a relatively small graph such as soc-Slashdot08, it takes almost 20 hours to compute all of its 4-node CISes. To solve this problem, the experiments for 4-node CISes are performed on five medium-sized graphs soc-Epinions1, soc-Slashdot08, soc-Slashdot09, com-DBLP, and com-Amazon, and the experiments for 5-node CISes are performed on four relatively small graphs ca-GrQc, ca-HepTh, ca-CondMat, and p2p-Gnutella08, where computing the ground truth is feasible. We also evaluate the performance of our methods for characterizing signed CIS classes in the graphs sign-Epinions, sign-Slashdot08, and sign-Slashdot09.
In our experiments, we focus on the normalized root mean square error (NRMSE) to measure the relative error of the estimator ω̂_i of the subgraph class concentration ω_i, 1 ≤ i ≤ T_k. NRMSE(ω̂_i) is defined as:

NRMSE(ω̂_i) = sqrt(MSE(ω̂_i)) / ω_i,

where MSE(ω̂_i) is defined as the mean square error (MSE) of an estimate ω̂_i with respect to its true value ω_i, that is

MSE(ω̂_i) = E[(ω̂_i − ω_i)^2] = Var(ω̂_i) + (E[ω̂_i] − ω_i)^2.

We note that MSE(ω̂_i) decomposes into a sum of the variance and the squared bias of the estimator ω̂_i. Both quantities are important and need to be as small as possible to achieve good estimation performance. When ω̂_i is an unbiased estimator of ω_i, we have E[ω̂_i] = ω_i and thus NRMSE(ω̂_i) is equivalent to the normalized standard error of ω̂_i, i.e., NRMSE(ω̂_i) = sqrt(Var(ω̂_i)) / ω_i.
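The error metric can be computed from repeated estimates as follows; a minimal sketch where `nrmse` is an illustrative helper name:

```python
import numpy as np

def nrmse(estimates, truth):
    """NRMSE of repeated estimates omega_hat of a true concentration omega:
    sqrt(E[(omega_hat - omega)^2]) / omega. The MSE equals the variance of
    the estimates plus the squared bias."""
    estimates = np.asarray(estimates, dtype=float)
    mse = np.mean((estimates - truth) ** 2)
    # Equivalent decomposition: np.var(estimates) + (np.mean(estimates) - truth) ** 2
    return np.sqrt(mse) / truth
```

For example, estimates (0.6, 0.4) of a true concentration 0.5 give NRMSE = 0.2, i.e., a 20% relative error.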