Scalable Positional Analysis for Studying Evolution of Nodes in Networks


Pratik Vinay Gupte (Indix India, IIT Madras Research Park, Chennai, India 600 113. Email: pratik.gupte@gmail.com. This work was done by the author when he was a Research Scholar at the Department of Computer Science and Engineering, IIT Madras.) and Balaraman Ravindran (Department of Computer Science and Engineering, Indian Institute of Technology Madras, Chennai, India 600 036. Email: ravi@cse.iitm.ac.in.)
Abstract

In social network analysis, the fundamental idea behind the notions of position and role is to discover actors who have similar structural signatures. Positional analysis of social networks involves partitioning the actors into disjoint sets using a notion of equivalence which captures the structure of relationships among actors. Classical approaches to positional analysis, such as regular equivalence and equitable partitions, are too strict in grouping actors and often lead to a trivial partitioning of actors in real world networks. An ε-Equitable Partition (εEP) of a graph, which is similar in spirit to stochastic blockmodels, is a useful relaxation of the notion of structural equivalence which results in a meaningful partitioning of actors. In this paper we propose and implement a new scalable distributed algorithm based on the MapReduce methodology to find the εEP of a graph. Empirical studies on random power-law graphs show that our algorithm is highly scalable for sparse graphs, thereby giving us the ability to study positional analysis on very large scale networks. We also present the results of our algorithm on time evolving snapshots of the Facebook and Flickr social graphs. Results show the importance of positional analysis on large dynamic networks.

Keywords: ε-Equitable Partition, Structural Equivalence, Positional Analysis, Distributed Graph Partitioning



The authors were partly funded by a grant from Ericsson Global Research. This manuscript is the pre-final version of the work which has been accepted at the workshop on Mining Networks and Graphs: A Big Data Analytic Challenge, to be held in conjunction with the SIAM Data Mining Conference in April 2014 (SDM-14).

1 Introduction

Positional Analysis (PA) [26, 1] has long been used by sociologists to draw equivalence classes from networks of social relationships. Finding equivalences in the underlying social relations gives us the ability to model social behaviour, which further aids in drawing out the social and organizational structure prevalent in the network. In PA, actors who have the same structural correspondence to other actors in a network are said to occupy the same “position”. As an example, head coaches in different football teams occupy the position manager by virtue of having similar kinds of relationships with players, assistant coaches, medical staff and the team management. An individual coach at the position manager may or may not interact with other coaches at the same position. Given an organizational setting and the interaction patterns that exist amongst the individuals of this organization, we naturally tend to draw some abstraction around the structure and try to model its behaviour. For example, in our football team setting, the actors at the position manager can be a “Coach” to actors at the position player or a “Colleague” to the actors at the position assistant coach. Similarly, the actors at the position medical staff can be a “Physiotherapist” or a “Doctor” to actors at the position player. While positional analysis is a very intuitive way of understanding interactions in networks, it has not been widely studied for large networks due to the difficulty in developing tractable algorithms. In this paper we present a positional analysis approach that has good scaling behaviour, and hence opens up the study of positions in large social networks.

The key element in finding positions which aid in the meaningful interpretation of the data is the notion of equivalence used to partition the network. Classical methods of finding equivalence, such as structural equivalence [13], regular equivalence [27], automorphic equivalence [9] and equitable partitions [15], often lead to a trivial partitioning of the actors in the network. This trivial partitioning is primarily attributed either to their strictness, in the case of structural, automorphic and equitable equivalence, which results in largely singleton positions, or to their leniency in the number of connections the actors at one position can have with the actors at another position, as in the case of regular equivalence, which results in a giant equivalence class.

An ε-equitable partition (εEP) [11] is a notion of equivalence which has many advantages over the classical methods. εEP allows a leeway of ε in the number of connections the actors at the same position can have with the actors at another position. The notion of εEP is similar in spirit to the notion of stochastic blockmodels [25], in that both approaches permit a bounded deviation from perfect equivalence among actors. On the Indian movies dataset from IMDB, the authors of [11] have shown that actors who fall in the same cell of the partition tend to have acted in similar kinds of movies. Further, the authors also show that people who belong to the same position of an εEP tend to evolve similarly. In large social networks, tagging people who belong to the same position has potentially many advantages, from both business and individual perspectives, such as position based targeted promotions, the ability to find anomalies, user churn prediction and personalised recommendations.

Though efficient graph partition refinement techniques and their application in finding the regular equivalence of a graph are well studied in the graph theoretic literature [3, 20], the application of these techniques to the positional analysis of very large social graphs and networks has so far remained unexplored. In this work, we propose a new algorithm to find the ε-equitable partition of a graph and focus on scaling this algorithm. We have validated our algorithm with detailed studies on the Facebook social graph, highlighting the advantages of doing positional analysis on time evolving social network graphs, and we present results on a relatively large component of the Flickr social network. Furthermore, an empirical scalability analysis of the proposed algorithm shows that it is highly scalable for very large sparse graphs.

The contribution of our work is twofold. First, we propose a new algorithm with better heuristics for finding the ε-equitable partition of a graph. Second, we parallelize this algorithm using the MapReduce [6] methodology, which gives us the ability to study positional analysis on large dynamic social networks.

The rest of the paper is organized as follows. Section 2 discusses a few mathematical preliminaries along with the definition of εEP. We discuss the Fast and Parallel εEP algorithms and their implementation in Section 3. We present the scalability analysis, evaluation methodology, dataset details and experimental results in Section 4. Section 5 concludes the paper with possible future directions.

2 Mathematical Preliminaries

Definition 2.1

(Partition of a graph)

Given a graph G = (V, E), where V is the vertex set and E is the edge set, a partition π = {C_1, C_2, …, C_k} of V is defined such that:

  • C_i ≠ ∅ for i = 1 to k,
  • C_i ∩ C_j = ∅ for all i ≠ j, and
  • C_1 ∪ C_2 ∪ ⋯ ∪ C_k = V.

Thus, a partition of a graph consists of k non-empty subsets of the vertex set V that are pairwise disjoint and whose union is V. These subsets are called the cells or blocks of the partition π.

Definition 2.2

(Equitable partition)

A partition π = {C_1, C_2, …, C_k} on the vertex set V of graph G is said to be equitable [15] if, for all cells C_i, C_j ∈ π and for all vertices u, v ∈ C_i,

d(u, C_j) = d(v, C_j),

where,

d(v, C_j) = |{u ∈ C_j : (u, v) ∈ E}|.    (2.1)

The term d(v, C_j) denotes the number of vertices in cell C_j adjacent to the vertex v. An equitable partition can be used to define positions in a network; each cell C_j corresponds to a position and d(v, C_j) corresponds to the number of connections the actor v has to the position C_j.

McKay’s algorithm (the equitable refinement procedure [15]) for finding the equitable partition takes as input an ordered partition π on V and the graph G. The initial partition π is usually the unit partition (i.e., all vertices belong to a single cell) of the graph G. An active list is used to hold the indices of all the unprocessed cells from π, and is updated in every iteration of the refinement procedure. W is the set of vertices from the current active cell of the partition π. The initial active cell of a unit partition is therefore the entire vertex set V. Additionally, a function d_W, which maps every vertex v ∈ V to its degree to W, is used. Mathematically, d_W is defined as follows:

d_W(v) = |{u ∈ W : (u, v) ∈ E}|.    (2.2)

The procedure then sorts (in ascending order) the vertices in each cell of the partition π using the value assigned to each vertex by the function d_W as the key for comparison. The procedure then splits the contents of each cell wherever the keys differ, thereby creating new cells. The partition π is updated accordingly, and the indices corresponding to any new cells formed after the split are added to the active list. The procedure exits when the active list is empty. The resulting partition is the coarsest equitable partition of the graph G.

Definition 2.3

(ε-Equitable partition) A partition π = {C_1, C_2, …, C_k} of the vertex set V is defined as an ε-equitable partition if, for all cells C_i, C_j ∈ π and for all vertices u, v ∈ C_i,

|d(u, C_j) − d(v, C_j)| ≤ ε.

The above definition is a relaxation of the strict partitioning condition of the equitable partition (Definition 2.2): equivalent actors can now differ by at most ε in the number of connections they have to other cells in the partition.

Definition 2.4

(Degree vector of a vertex)

Given a partition π = {C_1, C_2, …, C_k} of the vertex set V of a graph G, the degree vector of a vertex v ∈ V is defined as follows:

D(v) = [d(v, C_1), d(v, C_2), …, d(v, C_k)].    (2.3)

Thus, the degree vector of a node v is a vector of size k (the total number of cells in π), where each component of the vector is the number of neighbours v has in the corresponding member cell of the partition π.
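To make Definitions 2.3 and 2.4 concrete, the following is a minimal Python sketch (not the authors' code; the adjacency-dictionary representation is our assumption) that computes degree vectors as in Equation 2.3 and checks the ε-equitability condition of Definition 2.3 for a given partition.

from itertools import combinations

def degree_vector(adj, partition, v):
    """Number of neighbours of v in each cell of the partition (Equation 2.3)."""
    return [len(adj[v] & cell) for cell in partition]

def is_epsilon_equitable(adj, partition, eps):
    """True if every pair of vertices sharing a cell has degree vectors that
    differ by at most eps in every component (Definition 2.3)."""
    for cell in partition:
        for u, v in combinations(cell, 2):
            du = degree_vector(adj, partition, u)
            dv = degree_vector(adj, partition, v)
            if any(abs(a - b) > eps for a, b in zip(du, dv)):
                return False
    return True

# Example: a path on 4 vertices; {end points} and {middle vertices} is equitable (eps = 0).
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
partition = [{0, 3}, {1, 2}]
print(degree_vector(adj, partition, 1))          # [1, 1]
print(is_epsilon_equitable(adj, partition, 0))   # True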

3 Fast and Parallel Epsilon Equitable Partition

3.1 Problems Addressed

Kate and Ravindran [11] proposed an algorithm to find the εEP of a graph. We discuss this algorithm briefly (for more details, interested readers are referred to [11], Section 3.5). The inputs to this algorithm are the graph, the coarsest equitable partition of the graph and a value of ε. The cells of the input equitable partition are arranged in ascending order of their cell degrees (the cell or block degree of a cell of an equitable partition is the degree of the member nodes in that cell). The algorithm then computes the degree vector (Equation 2.3) for each of the vertices in the graph. The algorithm then tries to merge these cells by taking two consecutive cells at a time. If the degree vectors of the member nodes from these two cells are within ε distance of each other, they are merged into a single new cell. For further merging, this new cell becomes the current cell, which is then compared with the next cell for a possible merger. If the merging fails, the next cell becomes the current cell. The algorithm exits when no further merging of cells is possible. Also, the degree vectors need to be updated whenever two cells are merged. The time complexity of this algorithm is discussed in [11].
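The following is a rough Python sketch of this merge-based construction as we read it from the description above (it is not the implementation of [11]; the set-based cell representation and the toy star graph are our assumptions). Consecutive cells, ordered by cell degree, are merged whenever all degree vectors in the merged cell stay within ε of each other.

def degree_vector(adj, partition, v):
    return [len(adj[v] & cell) for cell in partition]

def merge_based_eep(adj, equitable_partition, eps):
    # cells ordered by ascending cell degree; in an equitable partition every
    # member of a cell has the same degree, so any member can be probed
    cells = sorted((set(c) for c in equitable_partition),
                   key=lambda c: len(adj[next(iter(c))]))
    i = 0
    while i + 1 < len(cells):
        candidate = cells[:i] + [cells[i] | cells[i + 1]] + cells[i + 2:]
        members = cells[i] | cells[i + 1]
        vecs = [degree_vector(adj, candidate, v) for v in members]
        ok = all(abs(a - b) <= eps
                 for u in vecs for w in vecs for a, b in zip(u, w))
        if ok:
            cells = candidate      # the merged cell becomes the current cell
        else:
            i += 1                 # merging failed; the next cell becomes current
    return cells

# toy usage on a star graph, starting from its equitable partition {centre}, {leaves}
adj = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
print(merge_based_eep(adj, [{0}, {1, 2, 3, 4}], eps=0))   # no merge: [{1, 2, 3, 4}, {0}]
print(merge_based_eep(adj, [{0}, {1, 2, 3, 4}], eps=3))   # merges into a single cell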

3.2 Fast ε-Equitable Partition

The implementation of our Fast εEP algorithm is directly based on a modification of McKay’s original algorithm [15] for finding the equitable partition of a graph, which iteratively refines an ordered partition until it is equitable (Section 2, Definition 2.2). The key idea in our algorithm is to allow splitting a cell only when the degrees of the member nodes of the cell are more than ε apart. The Fast εEP algorithm and its split function are given in Algorithm 1. The algorithm starts with the unit partition π of the graph and with the current active cell W holding the entire vertex set V. It then computes the function d_W (line 5, Algorithm 1) for each of the vertices of the graph. The algorithm then calls the split function (line 11, Algorithm 1). The split function takes each cell of the partition π and sorts the member vertices of the cell using the function d_W as the comparison key (Equation 2.2, Section 2). Once a cell is sorted, a linear pass through the member vertices of the cell checks whether any two consecutive vertices violate the ε criterion. In case of a violation, the function splits the cell and updates the partition π and the active list accordingly. The algorithm exits either when the active list is empty or when π becomes a discrete partition, i.e., all cells in π are singletons.

Input: graph G, ordered unit partition π, epsilon ε
Output: ε-equitable partition π

1:  active ← indices(π)
2:  while (active ≠ ∅) and (π is not discrete) do
3:      W ← π[active[0]]                                   ▷ current active cell
4:      active ← active \ {active[0]}
5:      compute d_W(v) for every vertex v ∈ V              ▷ Equation 2.2
6:      π′ ← Split(π, ε)                                   ▷ line 11
7:      active ← active ∪ [ordered indices of the newly split cells from π′, while replacing (in place) the indices of the cells in π from which they were split]
8:      π ← π′
9:  end while
10: return π
11: function Split(π, ε)
12:     j ← 1                                              ▷ index into the return partition π′
13:     for each currentCell in π do
14:         sortedCell ← sort(currentCell) using d_W as the comparison key, i.e., if d_W(u) ≤ d_W(v) then u appears before v in sortedCell
15:         π′_j ← ∅                                       ▷ start a new cell for currentCell
16:         for each vertex v_i in sortedCell do
17:             if i = 1 or d_W(v_i) − d_W(v_{i−1}) ≤ ε then
18:                 Add v_i to π′_j
19:             else
20:                 j ← j + 1
21:                 π′_j ← ∅
22:                 Add v_i to π′_j
23:             end if
24:         end for
25:         j ← j + 1
26:     end for
27:     return π′
28: end function
Algorithm 1 Fast ε-Equitable Partition
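For readers who prefer working code, the following is a minimal serial Python sketch of the refinement in Algorithm 1 (our own illustration, not the parallel implementation used in the paper; the adjacency-dictionary input and the simplified re-activation of all cells after a split are assumptions made for brevity). Setting eps = 0 reproduces the equitable refinement of Section 2.

from collections import deque

def fast_eep(adj, eps):
    vertices = list(adj)
    pi = [sorted(vertices)]                 # unit partition: one cell holding all vertices
    active = deque([0])                     # indices of unprocessed cells
    while active and len(pi) < len(vertices):
        W = set(pi[active.popleft()])       # current active cell
        d_W = {v: len(adj[v] & W) for v in vertices}    # Equation 2.2 for every vertex
        new_pi = []
        for cell in pi:
            # Split: sort each cell by d_W and cut wherever consecutive keys differ by > eps
            cell = sorted(cell, key=lambda v: d_W[v])
            start = 0
            for i in range(1, len(cell)):
                if d_W[cell[i]] - d_W[cell[i - 1]] > eps:
                    new_pi.append(cell[start:i])
                    start = i
            new_pi.append(cell[start:])
        # the real algorithm activates only the newly split cells (line 7);
        # this sketch simply re-activates every cell whenever a split occurred
        if len(new_pi) != len(pi):
            active = deque(range(len(new_pi)))
        pi = new_pi
    return pi

# toy example: a star with centre 0 and leaves 1..4
adj = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
print(fast_eep(adj, eps=0))   # [[1, 2, 3, 4], [0]]: leaves and centre get separated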

3.2.1 Time complexity Analysis

Let n be the size of the vertex set V of G. The loop of line 2 can run for at most n iterations: the worst case is when the splits lead to the discrete partition of V, so that active eventually holds the indices of up to n singleton cells. The computation of the function d_W (line 5, Algorithm 1) takes time proportional either to the length of the current active cell or to the length of the adjacency list of the vertex (finding the degree of a vertex to the current active cell translates to finding the cardinality of the intersection between the current active cell and the adjacency list of the vertex; with a good choice of data structure, the time complexity of intersecting two sets is usually proportional to the cardinality of the smaller set). The sort inside the split procedure (line 14, Algorithm 1) is bounded by O(n log n) across all cells. The “splitting” itself (lines 15 to 24, Algorithm 1) is a linear scan and comparison over vertices in an already sorted list, and is hence bounded by O(n). The total running time of the function split is therefore bounded by O(n log n).

The maximum cardinality of the current active cell W can be at most n. Further, for a dense undirected simple graph, the maximum cardinality of the adjacency list of any vertex can also be at most n. Therefore, for n vertices, line 5 of Algorithm 1 performs in O(n²). For sparse graphs, the cardinality of the entire edge set is of the order of n, hence line 5 of Algorithm 1 performs in the order of O(n).

Therefore, the total running time of the proposed Fast ε-Equitable Partitioning algorithm is O(n · (n² + n log n)) = O(n³) for dense graphs and O(n · (n + n log n)) = O(n² log n) for sparse graphs. In practice the running time is considerably lower, since subsequent splits only reduce the cardinality of the current active cell W; we can therefore safely assume that the cardinality of W quickly becomes smaller than the cardinality of the adjacency lists of the vertices of the graph. This analysis applies only to the serial algorithm. Empirical scalability analysis on random power-law graphs shows that our parallel algorithm (Section 3.3, Algorithm 2) runs an order faster in time for sparse graphs.

3.3 Parallel ε-Equitable Partition

This section describes the parallel implementation, using the MapReduce methodology [6], of the Fast ε-Equitable Partition algorithm (Algorithm 1).

In the Parallel εEP Algorithm, we implement the most computationally intensive step of the εEP algorithm, namely the computation of the function d_W (Equation 2.2), as a map operation. Each mapper starts by initializing the current active cell W for the current iteration (line 3, Algorithm 2). The key input to the map phase is the node id v, and the node data corresponding to v is tagged along as the value for this key. The map operation involves finding the degree of the node v to the current active cell W, which translates to finding the size of the intersection of the adjacency list of v and the member elements of W (line 5, Algorithm 2). The map phase emits the node id v as the key and the degree of v to the current active cell W as the value; this corresponds to the value of the function d_W(v) (Equation 2.2) for the node v. Finally, a single reducer performs the split function (line 11 of Algorithm 1). The output of the reduce phase is used by the mappers to initialize the active cell, and by the reducer itself to update the partition π and the active list. A single MapReduce step of the algorithm is depicted in Algorithm 2. The iterative MR job continues till the active list becomes empty or the partition becomes discrete.

1: class Mapper
2:    method initialize()
3:       W ← current active cell, i.e., the cell of π at index active[0]
4:    method map(id v, vertex adjacency list N(v))
5:       d ← |N(v) ∩ W|                      ▷ the value of the function d_W(v), Equation 2.2
6:       emit(id v, value d)
1: class Reducer                             ▷ single reducer
2:    method reduce(⟨id v, value d_W(v)⟩ pairs)
3:       Split(π, ε)                         ▷ Algorithm 1, line 11
4:       update(active)                      ▷ Algorithm 1, line 7
5:       update(π)
Algorithm 2 MapReduce step of the Parallel ε-Equitable Partition
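The following toy, single-process Python simulation (our sketch, not the authors' cluster code) illustrates one such MapReduce step: each mapper shard computes d_W(v) = |N(v) ∩ W| for its vertices, and a single reducer performs the split over the whole partition.

def map_phase(shard, adj, W):
    """Each mapper emits (v, degree of v to the current active cell W)."""
    return [(v, len(adj[v] & W)) for v in shard]

def reduce_phase(pairs, pi, eps):
    """The single reducer sorts every cell by d_W and splits where the eps rule is violated."""
    d_W = dict(pairs)
    new_pi = []
    for cell in pi:
        cell = sorted(cell, key=lambda v: d_W[v])
        start = 0
        for i in range(1, len(cell)):
            if d_W[cell[i]] - d_W[cell[i - 1]] > eps:
                new_pi.append(cell[start:i])
                start = i
        new_pi.append(cell[start:])
    return new_pi

# one step on the star graph, with the map work divided over two "mapper" shards
adj = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
pi, W = [[0, 1, 2, 3, 4]], {0, 1, 2, 3, 4}
pairs = map_phase([0, 1, 2], adj, W) + map_phase([3, 4], adj, W)
print(reduce_phase(pairs, pi, eps=0))   # [[1, 2, 3, 4], [0]]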

3.3.1 Implementation of the Parallel εEP Algorithm (Algorithm 2)

The proposed Parallel εEP algorithm is iterative in nature, which implies that the output of the current iteration becomes the input for the next one. The number of iterations of the Parallel εEP Algorithm for sparse graphs having a million nodes is in the range of a few tens of thousands. Existing MapReduce framework implementations such as Hadoop and Disco [10, 19] follow the programming model and architecture of the original MapReduce paradigm [6]. Hence, these implementations focus more on data reliability and fault tolerance guarantees in large cluster environments. This reliability and fault tolerance is usually associated with high data copy and job setup overhead costs. Although these features are suited for programs with a single map and a single reduce operation, they introduce high job setup overhead times across iterative MR steps [8, 2, 29]. To circumvent this, we implemented a bare minimum MapReduce framework using the open source tools GNU Parallel and rsync [22, 23]: GNU Parallel triggers the parallel map and reduce jobs on the cluster, and rsync is used for data copy across the cluster nodes. Using the custom framework we were able to achieve job setup overhead times in the range of a few milliseconds, as opposed to the order of seconds for Hadoop, on a 10 node cluster isolated over a Gigabit Ethernet switch. A conceptual and a detailed overview of our Lightweight MapReduce Framework implementation are depicted in Figure 1. We sharded the input graph data intelligently across the distributed computing nodes. The node data partitioning is performed based on the number of nodes in the input graph and the number of cores available for computation; the methodology is depicted in Figure 1(b). The node partition splits the input graph into nearly equal sized vertex groups for processing on each of the available cores, and we cache the vertex data for each of these groups on the corresponding compute nodes. This is conceptually similar to the user control over data persistence and data partitioning of Resilient Distributed Datasets in the Spark framework [29, 28], though our implementation was conceived independently of Spark and realized before it. This sharding helped us achieve locality in reading the input graph. Execution time studies on random power-law graphs for the proposed Algorithm 2 are presented in Section 4.4.
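The sketch below illustrates the kind of vertex-group sharding described above (the file naming and layout are hypothetical; the actual framework uses GNU Parallel and rsync to place and trigger these shards on the cluster nodes).

def shard_vertices(vertices, num_cores):
    """Split the vertex list into num_cores nearly equal sized groups."""
    base, extra = divmod(len(vertices), num_cores)
    shards, start = [], 0
    for i in range(num_cores):
        size = base + (1 if i < extra else 0)
        shards.append(vertices[start:start + size])
        start += size
    return shards

def write_shard_files(adj, shards, prefix="graph_part"):
    """Persist each shard's adjacency lists (e.g. graph_part_0.txt, graph_part_1.txt, ...)
    so that every compute node can cache the vertex data of the shard it owns."""
    for i, shard in enumerate(shards):
        with open(f"{prefix}_{i}.txt", "w") as f:
            for v in shard:
                f.write(f"{v}\t{' '.join(map(str, sorted(adj[v])))}\n")

# example: 10 vertices over 4 cores gives group sizes [3, 3, 2, 2]
print([len(s) for s in shard_vertices(list(range(10)), 4)])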

Figure 1: (a) Conceptual overview of our Lightweight MapReduce implementation. (b) Detailed view of our MapReduce implementation. The data partition numbers range from 0 to c − 1, where c is the number of available cores. The node partition splits the input graph into nearly equal sized vertex groups for processing on each of the available cores, and the vertex data for each of these groups is cached on the corresponding compute node.

4 Experimental Evaluation

In this section we present the results of our Fast εEP algorithm. We briefly describe the datasets used for evaluating the proposed algorithm, discuss the evaluation methodology and present our results. Finally, we present the scalability analysis of the proposed Parallel εEP algorithm.

Graph Vertices Edges Upto Date
1 15273 80005 2007-06-22
2 31432 218292 2008-04-07
3 61096 614796 2009-01-22
Table 1: Facebook Dataset Details

Graph Vertices Edges Upto Date
1 1277145 6042808 2006-12-03
2 1856431 10301742 2007-05-19
Table 2: Flickr Dataset Details

4.1 Datasets used for Dynamic Analysis

We have used the Facebook (New Orleans regional network) online social network dataset from [24]. The dataset consists of timestamped friendship link formation information between September 26th, 2006 and January 22nd, 2009. We created three time evolving graph snapshots for the Facebook network; the base network consists of all the links formed between September 26th, 2006 and June 22nd, 2007. The remaining two graphs are created such that the graph at the evolved point of time t + τ contains the graph at time t, along with the new vertices and edges that were added between time t and time t + τ, with τ being 290 days. Table 1 tabulates the dataset properties.

The second dataset that we used is the Flickr social network dataset from [18], which consists of a total of 104 days (November 2nd to December 3rd, 2006, and February 3rd to May 18th, 2007) of crawl data. This dataset consists of timestamped link formation information among nodes. Since contact links in Flickr are directional in nature, we create an undirected dataset as described next. For each outgoing link u → v from user u, if user v reciprocates with the link v → u, we create an undirected edge (u, v). The time of link reciprocation by v is marked as the timestamp of the edge formation. Further, we create a time evolving snapshot from this graph. The base graph contains the data from the first crawl, i.e., between Nov 2nd and Dec 3rd, 2006. The second graph is created in a similar fashion as the Facebook graphs, being the base graph augmented with the data from the second crawl, i.e., between Feb 3rd and May 18th, 2007. Table 2 tabulates the dataset properties.
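As a small illustration of the preprocessing described above, the following Python sketch (our interpretation, with a hypothetical (src, dst, timestamp) input format) keeps only reciprocated contact links and timestamps each undirected edge with the reciprocating link.

def reciprocal_edges(directed_links):
    """directed_links: iterable of (src, dst, timestamp) tuples."""
    first_seen = {}          # (u, v) -> timestamp of the first direction observed
    undirected = {}          # frozenset({u, v}) -> reciprocation timestamp
    for src, dst, ts in sorted(directed_links, key=lambda x: x[2]):
        if (dst, src) in first_seen:                     # the reverse link already exists
            undirected.setdefault(frozenset((src, dst)), ts)
        else:
            first_seen.setdefault((src, dst), ts)
    return undirected

links = [("a", "b", 1), ("b", "a", 5), ("a", "c", 2)]    # a<->b reciprocated, a->c not
print(reciprocal_edges(links))    # {frozenset({'a', 'b'}): 5}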

4.2 Evaluation Methodology

We are primarily interested in studying the effect of PA on dynamic social networks and in characterizing the role PA plays in the co-evolution of nodes in the networks. Given a social network graph G_t at time t and its evolved network graph G_{t+1}, our algorithm returns an ε-equitable partition π_t for G_t and π_{t+1} for G_{t+1}. The methodology used to evaluate our proposed εEP algorithm is as follows.

  1. Partition Similarity: We find the fraction of actors who share the same position across the partitions π_t and π_{t+1} using Equation 4.4. The new nodes in G_{t+1}, which are not present in G_t, are dropped from π_{t+1} before computing the partition similarity score.

    f_sim(π_t, π_{t+1}) = (2|π_D| − 2|π_t ∩ π_{t+1}|) / (2|π_D| − |π_t| − |π_{t+1}|)    (4.4)

    where |π_D| is the size of the discrete partition of G_t (i.e., the number of vertices) and |π_t ∩ π_{t+1}| is the size of the partition obtained by doing a cell-wise intersection among the cells of π_t and π_{t+1}. In Equation 4.4, if the number of actors who share positions across π_t and π_{t+1} is large, the value of |π_t ∩ π_{t+1}| will be almost equal to the size of either π_t or π_{t+1}; hence, the resulting partition similarity score will be close to 1. On the other hand, if the overlap of actors between π_t and π_{t+1} is very small, |π_t ∩ π_{t+1}| will be a large number, resulting in a similarity score close to 0. The terms in the denominator of the equation essentially provide a normalization w.r.t. the sizes of the partitions π_t and π_{t+1}. The value given by Equation 4.4 always lies between 0 and 1. We propose a fast algorithm to compute the partition similarity score in Appendix A.

  2. Graph theoretic network centric properties: Given co-evolving vertex pairs which occupy the same position in the partition π_t, we study the evolution of the network centric properties of these vertex pairs in the time evolved graph G_{t+1}. We study the following properties, which are widely used for characterizing graph structure:

    • Betweenness centrality of a node v is the number of shortest paths across all node pairs that pass through v. It signifies the importance of a node for routing, information diffusion, etc.

    • Degree centrality of a node is the number of nodes it is connected to in a graph. It quantifies the importance of a node w.r.t. the number of connections the node has.

    • Counting the number of triangles a node is part of is an elementary step in finding the clustering coefficient of a node in a graph. The clustering coefficient of a node signifies how strongly knit a node is with its neighbourhood. A scalable algorithm exists for counting the number of triangles in a graph [21].

    • Shapley value centrality corresponds to a game theoretic notion of centrality. It models the importance of a node in information diffusion [17], and it is efficiently computable for large graphs.

    We evaluate the co-evolution of nodes under the various positional analysis methods using these four network centric properties as follows. For each pair of nodes (u, v) which occupy the same position in a partition at time t, we compute the difference d_t(u, v) = s_u^t − s_v^t, where s_u^t and s_v^t are the scores of one of the four properties described previously. For the same pair of nodes, we also compute the difference d_{t+1}(u, v) = s_u^{t+1} − s_v^{t+1}, where s_u^{t+1} and s_v^{t+1} are the corresponding property scores at time t + 1. Finally, we take the absolute value of the difference of these two quantities, i.e., |d_t(u, v) − d_{t+1}(u, v)|. A low value of this quantity signifies that, for a co-evolving node pair at time t, the network centric property of nodes u and v at time t + 1 has also evolved similarly. Note that we are not partitioning based on the centrality scores; our comparisons are across timestamps. A small sketch of this computation is given after this list.
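Below is a minimal Python sketch (hypothetical variable names, not the authors' evaluation code) of this pairwise computation: for every pair of nodes sharing a cell of the partition at time t, the gap in a property score at time t is compared with the gap at time t + 1, and small values indicate similar co-evolution.

from itertools import combinations

def coevolution_gaps(partition_t, score_t, score_t1):
    """partition_t: list of cells (lists of node ids);
    score_t / score_t1: dicts mapping node id to a centrality score at time t and t+1."""
    gaps = []
    for cell in partition_t:
        for u, v in combinations(cell, 2):
            if u in score_t1 and v in score_t1:          # both nodes survive to time t+1
                d_t = score_t[u] - score_t[v]
                d_t1 = score_t1[u] - score_t1[v]
                gaps.append(abs(d_t - d_t1))
    return gaps

# toy usage with degree centrality for four nodes across two snapshots
partition_t = [["a", "b"], ["c", "d"]]
score_t = {"a": 3, "b": 3, "c": 1, "d": 2}
score_t1 = {"a": 5, "b": 4, "c": 1, "d": 2}
print(coevolution_gaps(partition_t, score_t, score_t1))   # [1, 0]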

Figure 2: Co-evolving node pairs for the Facebook graphs (plots (a) to (d) and plots (e) to (h)).
Figure 3: Co-evolving node pairs for the Flickr graphs.
Figure 4: Scalability curves of input size vs. time, for varying power law exponent γ and ε.
Epsilon (ε) 0 1 2 3 4 5 6 7 8 d
Graph1 with Graph2 59.59 66.19 76.60 83.00 86.57 89.43 91.29 92.88 94.18 86.93
Graph1 with Graph3 54.11 57.17 69.33 76.61 80.85 84.37 86.60 88.95 90.72 79.42
Graph2 with Graph3 56.88 67.18 76.80 82.12 85.55 87.99 89.87 91.48 92.93 78.11
Table 3: Percentage of εEP overlap using the partition similarity score (Equation 4.4) for the time evolving graphs of the Facebook network. ε is varied from 0 to 8; ε = 0 corresponds to an equitable partition. d denotes the partition based on degree.

4.3 Results of Dynamic Analysis

We present the evaluation of our proposed algorithm using the methodology described in the previous subsection. The results of the partition similarity score, in percentages, are tabulated in Table 3. We compare our method with the equitable partition (EP) and the degree partition (DP), in which nodes having the same degree occupy the same position, for the Facebook dataset. We study the evolution of actors across graphs 1, 2 and 3 under these three partitioning methods. The ε-equitable partition and the degree partition display a higher percentage of overlap among positions than the equitable partition. The poor performance of the EP under the partition similarity score is attributed to the strict definition of equivalence under EP. As an example, consider two nodes u and v which occupy the same position under EP at time t; this implies that both have exactly the same degree vector. Suppose that at time t + 1 the connections of u remain exactly the same, but node v adds one extra link to another position; then u and v will belong to different positions under EP. The εEP consistently performs better than the DP for higher values of ε. Higher values of ε correspond to a greater bounded relaxation under the definition of εEP. In most cases, for a given graph, the number of positions under εEP decreases as we increase ε. Therefore, given two εEPs π_t and π_{t+1}, both of them will have relatively fewer positions at higher values of ε. This explains the higher partition similarity percentages for εEP. The high values of the partition similarity score for the degree partition can be attributed to the nodes in the network which do not evolve with time.

The question of choosing a correct value of ε, which corresponds to a suitable notion of positions while ensuring strong cohesion among the actors occupying these positions in dynamic networks, is beyond the purview of this paper. Nevertheless, the results in Table 3 highlight a very important property of the ε-equitable partition, namely its “tunability”.

The study of the various network centric properties for the co-evolving node pairs of the Facebook and the Flickr datasets, under the different positional analysis methods, is presented in Figure 2 and Figure 3 respectively. The x-axis corresponds to bins containing the difference of a network centric property. The y-axis corresponds to the number of node pairs that fall in a particular bin, as a percentage of the total number of node pairs that occupy the same position in the partition. The results show that equitable partitioning outperforms both the εEP and the DP for each of the network centric properties, which implies that it models the positions of co-evolving node pairs very well. But the fact that the equitable partition leads to a trivial partitioning of the nodes in a network makes it the least suitable method for performing PA on real-world networks; for instance, a large fraction of the cells of the equitable partition of the Facebook graphs are singletons. The co-evolving node pairs under the εEP outperform the DP in most of the cases for the Facebook networks, especially for smaller values of ε. εEPs with smaller values of ε perform better because of their closeness to the equitable partition. This implies that εEP guarantees a high degree of confidence on the values of the network centric properties of the co-evolving node pairs, along with a partitioning of a reasonable size. The Flickr dataset results in Figure 3 follow a similar trend: the εEP performs better than the DP. The percentage counts of both the properties are more spread out across the initial few bins for the Flickr dataset. The εEP has higher percentage counts of co-evolving node pairs in the bins corresponding to smaller difference values, whereas the co-evolving node pairs from the DP have relatively lower percentage counts in the bins closer to a difference of zero and a high percentage of nodes towards the tail end of the x-axis, especially for the Shapley value centrality, which is not desirable. Also, the degree based partitions give very few positions. Therefore, εEP is a consistent performer, both from the perspective of node co-evolution characteristics and the number of positions it gives.

4.4 Scalability Analysis of the Parallel EP Algorithm

In this section we present empirical studies on the scalability of the proposed parallel Algorithm 2. The algorithm was executed on a single machine having eight cores, utilizing all eight cores for the program. We study the effect of increasing the size of the input on the running time of the algorithm. We perform this analysis on random power law graphs, varying the power law exponent γ. Figure 4 shows the various scalability curves. The size of the input varies from thousands of nodes to a million nodes. The running time of the algorithm increases as we decrease the value of ε; this is because, for small values of ε, the number of splits performed (Algorithm 1, line 11) is quite large, which directly translates into an increase in the number of iterations of the algorithm. Also, decreasing the power law exponent γ increases the running time of the algorithm, since a lower value of γ corresponds to denser graphs, and for dense graphs the computation of the degree of each vertex to the current active cell W becomes a costly operation. The curves strongly suggest that the algorithm scales almost linearly with increasing input graph size for the values of γ considered. It is worth mentioning here that for most real-world graphs γ lies between 2 and 3 [4], with a few exceptions. We also performed curve fitting using polynomial regression to obtain an empirical complexity bound on the algorithm for each value of γ used; the sum of squared residuals for the fitted curves was quite marginal.
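The curve fitting mentioned above can be reproduced along the following lines (a sketch only, not the authors' analysis script; the size and timing values below are made-up placeholders): fitting running time against input size on a log-log scale with numpy.polyfit yields an empirical exponent alpha in time ≈ c · n^alpha, and the residual of the fit can be inspected directly.

import numpy as np

sizes = np.array([1e4, 5e4, 1e5, 5e5, 1e6])        # number of nodes (hypothetical)
times = np.array([2.1, 11.5, 24.0, 130.0, 275.0])  # running time in seconds (hypothetical)

alpha, log_c = np.polyfit(np.log(sizes), np.log(times), deg=1)
print(f"empirical bound: time ~ {np.exp(log_c):.2e} * n^{alpha:.2f}")

# sum of squared residuals of the log-log fit
pred = alpha * np.log(sizes) + log_c
print("sum squared residual:", float(np.sum((np.log(times) - pred) ** 2)))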

5 Conclusion and Future Work

In this paper we have presented a scalable and distributed ε-equitable partition algorithm. To the best of our knowledge, this is the first attempt at doing positional analysis on a large scale online social network dataset. We have been able to compute the εEP of a significantly large component of the Flickr social graph using our Parallel εEP algorithm and its implementation. Further, the results of our algorithm on the Facebook and Flickr datasets show that εEP is a promising tool for analyzing the evolution of nodes in dynamic networks. Empirical scalability studies on random power law graphs show that our algorithm is highly scalable for very large sparse graphs.
In future, it would be interesting to explore the implied advantage of our Parallel εEP Algorithm for finding the coarsest equitable partition of very large graphs by setting ε = 0. Finding the equitable partition of a graph forms an important intermediate stage in all practical graph automorphism finding tools [16, 5]. Another possible research direction is to explore algorithms for positional analysis of very large graphs using vertex-centric computation paradigms such as Pregel and GraphChi [14, 12].

6 Acknowledgements

We thank Inkit Padhi, Intern, RISE Lab, IIT Madras for helping prototype the code for the Fast EP algorithm. We would also like to thank Srikanth R. Madikeri, Research Scholar, DON Lab, IIT Madras for his valuable inputs during discussions on implementing the Parallel EP code.

References

  • [1] S.P. Borgatti and M.G. Everett, Notions of Position in Social Network Analysis, Sociological Methodology, 22 (1992), pp. 1–35.
  • [2] Yingyi Bu, Bill Howe, Magdalena Balazinska, and Michael D Ernst, HaLoop: Efficient Iterative Data Processing on Large Clusters, Proceedings of the VLDB Endowment, 3 (2010), pp. 285–296.
  • [3] A. Cardon and M. Crochemore, Partitioning a Graph in O(|A| log₂ |V|), Theoretical Computer Science, 19 (1982), pp. 85–98.
  • [4] Aaron Clauset, Cosma Rohilla Shalizi, and Mark EJ Newman, Power-Law Distributions in Empirical Data, SIAM review, 51 (2009), pp. 661–703.
  • [5] Paul T Darga, Karem A Sakallah, and Igor L Markov, Faster Symmetry Discovery using Sparsity of Symmetries, in Proceedings of the 45th Annual Design Automation Conference, ACM, 2008, pp. 149–154.
  • [6] Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Communications of the ACM, 51 (2008), pp. 107–113.
  • [7] Lucile Denœud and Alain Guénoche, Comparison of distance indices between partitions, in Data Science and Classification, Springer, 2006, pp. 21–28.
  • [8] Jaliya Ekanayake, Hui Li, Bingjing Zhang, Thilina Gunarathne, Seung-Hee Bae, Judy Qiu, and Geoffrey Fox, Twister: A Runtime for Iterative MapReduce, in Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, ACM, 2010, pp. 810–818.
  • [9] Martin G Everett, Role Similarity and Complexity in Social Networks, Social Networks, 7 (1985), pp. 353–359.
  • [10] Apache Software Foundation, Apache Hadoop. Available: http://hadoop.apache.org/. Accessed February 7, 2014.
  • [11] Kiran Kate and Balaraman Ravindran, Epsilon Equitable Partition: A Positional Analysis method for Large Social Networks, in Proceedings of 15th International Conference on Management of Data, 2009.
  • [12] Aapo Kyrola, Guy Blelloch, and Carlos Guestrin, GraphChi: Large-scale Graph Computation on just a PC, in Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2012, pp. 31–46.
  • [13] F. Lorrain and H.C. White, Structural Equivalence of Individuals in Social Networks, Journal of Mathematical Sociology, 1 (1971), pp. 49–80.
  • [14] Grzegorz Malewicz, Matthew H Austern, Aart JC Bik, James C Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski, Pregel: A System for Large-scale Graph Processing, in Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, ACM, 2010, pp. 135–146.
  • [15] B. D. McKay, Practical Graph Isomorphism, Congressus Numerantium, 30 (1981).
  • [16] Brendan D. McKay, nauty User’s Guide (Version 2.4). http://cs.anu.edu.au/~bdm/nauty/, 2009.
  • [17] Tomasz Michalak, KV Aaditha, PL Szczepanski, Balaraman Ravindran, and Nicholas R Jennings, Efficient Computation of the Shapley Value for Game-Theoretic Network Centrality, Journal of AI Research, 46 (2013), pp. 607–650.
  • [18] Alan Mislove, Hema Swetha Koppula, Krishna P Gummadi, Peter Druschel, and Bobby Bhattacharjee, Growth of the Flickr Social Network, in Proceedings of the first workshop on Online Social Networks, ACM, 2008, pp. 25–30.
  • [19] Prashanth Mundkur, Ville Tuulos, and Jared Flatow, Disco: A Computing Platform for Large-Scale Data Analytics, in Proceedings of the 10th ACM SIGPLAN workshop on Erlang, ACM, 2011, pp. 84–89.
  • [20] Robert Paige and Robert E Tarjan, Three Partition Refinement Algorithms, SIAM Journal on Computing, 16 (1987), pp. 973–989.
  • [21] Siddharth Suri and Sergei Vassilvitskii, Counting Triangles and the Curse of the Last Reducer, in Proceedings of the 20th International Conference on World Wide Web, ACM, 2011, pp. 607–614.
  • [22] Ole Tange, GNU Parallel - The Command-line Power Tool, login: The USENIX Magazine, (2011), pp. 42–47.
  • [23] Andrew Tridgell and Paul Mackerras, The rsync algorithm, Australian National University, Tech. Report TR-CS-96-05, 1996.
  • [24] Bimal Viswanath, Alan Mislove, Meeyoung Cha, and Krishna P Gummadi, On the Evolution of User Interaction in Facebook, in Proceedings of the 2nd ACM Workshop on Online Social Networks, ACM, 2009.
  • [25] Stanley Wasserman and Carolyn Anderson, Stochastic a posteriori Blockmodels: Construction and Assessment, Social Networks, 9 (1987), pp. 1–36.
  • [26] Stanley Wasserman and Katherine Faust, Social Network Analysis: Methods and Applications, Cambridge University Press, 1994.
  • [27] D. R. White and K. Reitz, Graph and Semigroup Homomorphisms on Semigroups of Relations, Social Networks, 5 (1983), pp. 193–234.
  • [28] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J Franklin, Scott Shenker, and Ion Stoica, Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, in Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation, USENIX Association, 2012, pp. 2–2.
  • [29] Matei Zaharia, Mosharaf Chowdhury, Michael J Franklin, Scott Shenker, and Ion Stoica, Spark: Cluster Computing with Working Sets, in Proceedings of the 2nd USENIX Conference on Hot topics in Cloud Computing, 2010.

Appendix A Partition Similarity Score

A.1 Mathematical Preliminaries

This section briefly covers a few mathematical preliminaries which form the basis of our partition similarity score.

Given a graph G = (V, E), where V is the vertex set and E is the edge set, let π denote a partition of V. We define the following for any two partitions π_1 and π_2 of a graph G = (V, E):

  1. Two partitions are equal, iff they both partition the vertex set V of G in exactly the same way.
    Example: {{1, 2}, {3, 4}} = {{4, 3}, {2, 1}},

    i.e., the order of cells in a partition and the order of vertices inside a cell are not important.

  2. We define the intersection of two partitions, π_1 ∩ π_2, as the partition containing the cells obtained by applying the set intersection operator cell-wise to the member cells of π_1 and π_2 (discarding the empty sets).
    Example: {{1, 2, 3}, {4, 5}} ∩ {{1, 2}, {3, 4, 5}} = {{1, 2}, {3}, {4, 5}}.

  3. Two partitions are dissimilar, iff their intersection leads to a discrete partition. A discrete partition is one with only singleton cells.
    Example: {{1, 2}, {3, 4}} and {{1, 3}, {2, 4}}.

    Here, {{1, 2}, {3, 4}} ∩ {{1, 3}, {2, 4}} = {{1}, {2}, {3}, {4}} gives a discrete partition.

A.2 Simplified Representation of the Partition Similarity Score

Equation 4.4 can be represented in a simplified form as follows:

f_sim(π_1, π_2) = (2a − 2b) / (2a − c − d)    (A.1)

Where,

a = |π_D|, the size of the discrete partition (i.e., the number of vertices),

b = |π_1 ∩ π_2|, the size of the cell-wise intersection of the two partitions,

c = |π_1|,

d = |π_2|.

The authors in [7] survey and compare several notions of distance indices between partitions on the same set, which are available in the literature.
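For completeness, the following serial Python sketch computes the cell-wise intersection of item 2 of Appendix A.1 and the similarity score of Equation A.1 on toy partitions (the partitions and vertex count are illustrative; Appendix A.3 gives the MapReduce counterpart for large graphs).

def partition_intersection(p1, p2):
    """Cell-wise set intersection of two partitions, discarding empty cells."""
    return [c1 & c2 for c1 in p1 for c2 in p2 if c1 & c2]

def partition_similarity(p1, p2, num_vertices):
    a = num_vertices                          # |pi_D|, size of the discrete partition
    b = len(partition_intersection(p1, p2))   # |pi_1 ∩ pi_2|
    c, d = len(p1), len(p2)
    return (2 * a - 2 * b) / (2 * a - c - d)

p1 = [{1, 2}, {3, 4}]
p2 = [{1, 2}, {3}, {4}]
print(partition_similarity(p1, p1, 4))                  # 1.0: identical partitions
print(partition_similarity(p1, p2, 4))                  # 0.666...: partial overlap
print(partition_similarity(p1, [{1, 3}, {2, 4}], 4))    # 0.0: dissimilar partitions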

A.3 MapReduce Algorithm to Compute the Partition Similarity Score

The partition similarity score of Equation 4.4 requires the cardinality of the intersection set of the two partitions π_1 and π_2. Finding the intersection of two partitions as per the definition of intersection from item 2 of Appendix A.1 is an O(n²) operation, n being the total number of vertices in the partition. Computing this for very large graphs becomes intractable. To counter this problem, we provide an algorithm based on the MapReduce paradigm [6] to compute the size of the intersection set of π_1 and π_2 (i.e., |π_1 ∩ π_2|). The algorithm is presented in Algorithm box 3. The algorithm initializes by enumerating the cell indices of π_2 against each cell index of partition π_1. For each key from this tuple list, the map operation checks whether the two cells intersect. The map emits a value of 1 for a constant key whenever they do. The reduce operation computes the sum of these individual values. This sum corresponds to |π_1 ∩ π_2|, which is used to compute the partition similarity score from Equation 4.4.

A note on Algorithm 3: the initialize method of Algorithm 3 primarily involves enumerating the cell indices of π_2 against all the cell indices of π_1. Since cross operations are computationally very costly, the tractability of the algorithm is inherently dependent on the ability to generate the cross product set of the cell index tuples of the two input partitions.

Input: Partitions π_1 and π_2. Let π_1 = C_1 : C_2 : ⋯ : C_{k1} and π_2 = D_1 : D_2 : ⋯ : D_{k2}
Output: Partition intersection set cardinality |π_1 ∩ π_2|

1: class Mapper
2:    tupleList enumCells = [ ]
3:
4:    method initialize()
5:       for each cell index i of π_1 (i.e., C_i), enumerate each cell index j of π_2 (i.e., D_j) do
6:          add the tuple (i, j) to enumCells
7:
8:    method map(id t, tuple (i, j))
9:       s ← C_i ∩ D_j                        ▷ Appendix A.1, item 2
10:      if s ≠ ∅ then
11:         emit(id intersect, 1)             ▷ if the two cells overlap, emit the value 1 for the constant key "intersect"
1: class Reducer
2:    method reduce(id key, values)
3:       sum ← 0
4:       for value in values
5:          sum ← sum + value
6:       emit(key, sum)
Algorithm 3 MapReduce Partitions Intersection Set Cardinality