Rethinking Centrality: The Role of Dynamical Processes in Social Network Analysis

Abstract

Many popular measures used in social network analysis, including centrality, are based on the random walk. The random walk is a model of a stochastic process where a node interacts with one other node at a time. However, the random walk may not be appropriate for modeling social phenomena, including epidemics and information diffusion, in which one node may interact with many others at the same time, for example, by broadcasting the virus or information to its neighbors. To produce meaningful results, social network analysis algorithms have to take into account the nature of interactions between the nodes. In this paper we classify dynamical processes as conservative and non-conservative and relate them to well-known measures of centrality used in network analysis: PageRank and Alpha-Centrality. We demonstrate, by ranking users in online social networks used for broadcasting information, that non-conservative Alpha-Centrality generally leads to a better agreement with an empirical ranking scheme than the conservative PageRank.

Keywords: social network analysis, centrality, random walks, epidemics, social media, influence.

Mathematics Subject Classification: Primary: 91D30, 68R10; Secondary: 62P25, 37A60.

Rumi Ghosh

HP Labs

1501 Page Mill Road

Palo Alto, CA 94304, USA

Kristina Lerman

USC Information Sciences Institute

4676 Admiralty Way

Marina del Rey, CA 90292, USA


1 Introduction

Social network analysis algorithms examine topology of a network in order to find interesting structure within it. It has been recognized recently, however, that network structure is the product of both its links and the dynamical processes taking place on the network, which determine how ideas, pathogens, or influence flow along social links [10, 11, 40]. Borgatti [10, 11], for example, argued that a node’s centrality, a measure often used to identify important or influential actors in a social network, gives a summary of its participation in the flow taking place on the network. An appropriate centrality for a given network, therefore, is one whose assumptions match the details of the flow. Some of the best-known measures of centrality, such as PageRank [43] and its variants [33], are based on random walk-like phenomena [48, 17]. A random walk on a graph is a stochastic process that starts at some node, and at each time step transitions to a randomly selected neighbor of the current node. Variants of the random walk are often used to model flows in physical systems, e.g., chemical and heat diffusion, and can be used to model social phenomena resulting from one-to-one interactions, such as Web surfing or phone conversations. Random walks, however, do not model many phenomena of interest to social scientists, such as adoption of innovation [46, 5], spread of epidemics [1, 27] and word-of-mouth recommendations [26], viral marketing campaigns [36, 32], growth of social movements [13] and information diffusion [42]. These phenomena are usually modeled as an epidemic process, where rather than choosing one neighbor, an activated or “infected” node will attempt to activate all its neighbors. For example, on the social media site Twitter users broadcast their posts, called tweets, to all their followers. Similarly, in an epidemic, an infectious person will pass the virus to all susceptible contacts. 
Therefore, unlike the random walk, which conserves the amount of the diffusing substance, epidemic processes are fundamentally non-conservative.

This paper makes two contributions. First, we classify dynamical processes as conservative and non-conservative and study their relationship to two well-known centrality measures: PageRank and Alpha-Centrality. PageRank [43], originally used in Google’s successful search engine, gives the steady state distribution of a conservative dynamical process (specifically, random walk with random restarts [48, 17]). Alpha-Centrality [8] measures the number of paths of any length between two nodes, exponentially attenuated by parameter $\alpha$, so that longer paths contribute less to centrality than shorter paths. We demonstrate that Alpha-Centrality gives the steady state distribution of a class of non-conservative dynamical processes as long as $\alpha$ is bounded by the inverse of the largest eigenvalue of the adjacency matrix of the graph. This quantity, called the epidemic threshold, governs the behavior of many non-conservative processes in networks, for example, the spread of a virus along social links [44, 51]. When the effective transmissibility of the virus is below this threshold, it will die out [51, 45], but above the threshold it will reach a finite fraction of all nodes, resulting in an epidemic. Our analysis provides an intuitive explanation for the location of the epidemic threshold, and a further demonstration of the fundamental connection between network structure and dynamics.

The second contribution of the paper is an empirical study of the ability of PageRank and Alpha-Centrality to identify influential social media users. Specifically, we study the online social networks of the social news aggregator Digg and the microblogging service Twitter, both of which are used by people to share news stories and other content with their followers. The spread of information is often modeled as an epidemic process [52, 21, 50, 25], hence it has a non-conservative flavor. We define two empirical measures of influence based on user activity, and rank users according to these measures. We show that non-conservative Alpha-Centrality generally leads to a better agreement with the activity-based rankings than conservative PageRank. While the effect of dynamical processes on centrality was studied theoretically and in simulation [10], our work provides an empirical demonstration that the choice of centrality impacts our ability to identify important people in real-world social networks.

This paper is organized as follows. In Section 2 we describe conservative and non-conservative dynamical processes and demonstrate, in Section 3, that Alpha-Centrality gives the steady state distribution of a non-conservative dynamical process, for example, a spreading epidemic. Then, in Section 4, we compare Alpha-Centrality to PageRank on the task of identifying influential social media users and show that Alpha-Centrality gives a better agreement with empirical measures of influence. We close with a summary of related work and our conclusions.

2 Classification of Dynamical Processes

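We represent a network by a directed graph $G = (V, E)$ with $|V|$ nodes and $|E|$ edges. The adjacency matrix $A$ of the graph is defined as: $A_{ij} = 1$ if $(i, j) \in E$; otherwise, $A_{ij} = 0$. The set of out-neighbors of node $i$ is $N_{out}(i) = \{j : (i, j) \in E\}$, and the set of in-neighbors is $N_{in}(i) = \{j : (j, i) \in E\}$. Another important quantity is the diagonal out-degree matrix $D$, which is defined as $D_{ii} = \sum_j A_{ij} = (A e^T)_i$ and $D_{ij} = 0$ for $i \neq j$. Here, $e$ is a $|V|$-dimensional row vector of ones, and $e^T$ is its transpose.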

A dynamical process is mediated by interactions between nodes, which can be thought to distribute some quantity, or weight, on a network. Let the $|V|$-dimensional vector $w(t)$ represent the weight of each node at time $t$. A dynamical process is described mathematically by a function that maps the weight vector at time $t-1$ to the weight vector at time $t$.

2.1 Conservative Processes

A stochastic process is conservative if it simply redistributes the weights among the nodes of the graph, with the total weight remaining constant: $\|w(t)\|_1 = \|w(0)\|_1$, where $\|\cdot\|_1$ represents the $L_1$-norm of the argument, i.e., $\|w\|_1 = \sum_i |w_i|$.

To give an intuition for the mathematical formulation of conservative processes, imagine a society where nodes interact by redistributing money among themselves, and the money cannot be created or destroyed. Let $w(t)$ be the amount of money each node has, and $\Delta w(t)$ the amount it receives, at time $t$. Suppose that at each time step a node retains a fraction $(1-\alpha)$ of the amount it received in the previous step and redistributes the rest among its neighbors. Let the transfer matrix entry $T_{ij}$ represent the fraction of the amount transferred by node $i$ to $j$. Therefore, the amount of money nodes receive at time $t$ is $\Delta w(t) = \alpha\, \Delta w(t-1)\, T$. The transfer matrix encodes the rules of interaction. If each member divides equally amongst her out-neighbors, then $T = D^{-1}A$.

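Step by step, the conservative process looks as follows. Initially, the amount each node receives is $\Delta w(0) = w(0)$. At time $t=1$ each node keeps $(1-\alpha)$ of that amount and divides the rest among its out-neighbors, who receive $\Delta w(1) = \alpha\, w(0)\, T$. At time $t=2$, each node retains $(1-\alpha)$ of the amount it received from in-neighbors at $t=1$, and divides the rest among its out-neighbors, who receive $\Delta w(2) = \alpha^2\, w(0)\, T^2$, and so on. The total weight (or amount of money) the nodes have at time $t$, $w(t)$, is the amount they retained from all previous time steps and the amount they received from in-neighbors at time $t$:

$$w(t) = (1-\alpha) \sum_{k=0}^{t-1} \alpha^k\, w(0)\, T^k + \alpha^t\, w(0)\, T^t \qquad (1)$$

As $t \to \infty$, this equation reduces to

$$w(t \to \infty) = (1-\alpha)\, w(0) + \alpha\, w(t \to \infty)\, T \qquad (2)$$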

The transfer matrix $T$ is a stochastic matrix, since its rows sum up to 1. If, instead of distributing evenly among neighbors, each node decided to keep a portion for itself, this variant of a conservative process would be governed by the transfer matrix:

(3)

Random walk on a graph is a prototypical conservative process, since the total probability of finding the walker somewhere on the graph is always one. There exist many flavors of random walk. One of them is the widely studied random walk with random restarts [43, 6, 17], which can be described mathematically as follows. Let the initial probability to find the walker on any node be uniform, i.e., $w(0) = e/|V|$. At any time, with probability $\alpha$ the walker at node $i$ randomly chooses one of the out-neighbors of $i$ and jumps to it. With probability $(1-\alpha)$, it randomly chooses any node on the graph and jumps to it. Let matrix $E$ encode the probability of jumping to any node, $E = e^T e / |V|$, and $T = D^{-1}A$. Then the probability of finding the random walker at each node at time $t$ is given by

$$w(t) = w(t-1)\left[\alpha T + (1-\alpha) E\right] = (1-\alpha) \sum_{k=0}^{t-1} \alpha^k\, w(0)\, T^k + \alpha^t\, w(0)\, T^t,$$

which is exactly the same as Eq. 1.
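The money-redistribution process described in this subsection can be sketched numerically. The example below is a minimal illustration, not the authors' code: it assumes the transfer matrix $T = D^{-1}A$ (equal split among out-neighbors), a retained fraction $1-\alpha$, and a small hypothetical three-node graph. The function names `vec_mat`, `transfer_matrix`, and `conservative_weights` are made up for this sketch.

```python
def vec_mat(v, M):
    """Row vector times matrix: (v M)_j = sum_i v_i * M[i][j]."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]


def transfer_matrix(A):
    """T = D^{-1} A: a node splits what it passes on equally among its out-neighbors."""
    T = []
    for row in A:
        out_degree = sum(row)
        if out_degree == 0:
            raise ValueError("this sketch assumes every node has an out-neighbor")
        T.append([a / out_degree for a in row])
    return T


def conservative_weights(A, w0, alpha, steps):
    """Weights after `steps` rounds: money retained so far plus money still in transit."""
    T = transfer_matrix(A)
    retained = [0.0] * len(w0)
    in_transit = list(w0)  # amount received at t = 0
    for _ in range(steps):
        # Each node keeps (1 - alpha) of what it just received ...
        retained = [r + (1 - alpha) * x for r, x in zip(retained, in_transit)]
        # ... and passes the remaining alpha on to its out-neighbors.
        in_transit = [alpha * x for x in vec_mat(in_transit, T)]
    return [r + x for r, x in zip(retained, in_transit)]


# Hypothetical 3-node graph with edges 0 -> 1, 0 -> 2, 1 -> 0, 2 -> 1.
A = [[0, 1, 1],
     [1, 0, 0],
     [0, 1, 0]]
w0 = [1 / 3, 1 / 3, 1 / 3]
alpha = 0.85
w = conservative_weights(A, w0, alpha, steps=200)
print("total weight:", sum(w))
print("weights:", [round(x, 4) for x in w])
```

The total weight stays at its initial value no matter how many steps are run, which is the defining property of a conservative process; only its distribution over nodes changes.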

2.2 Non-Conservative Processes

A stochastic process where the total weight can change over time is non-conservative: $\|w(t)\|_1 \neq \|w(0)\|_1$. To illustrate the difference between conservative and non-conservative processes, we return to our hypothetical society. Again, imagine that each node has some amount of money; however, it also has a money-minting machine, so that instead of dividing the money it receives among its out-neighbors, it can give each neighbor the same amount by printing extra as needed.

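Let $\Delta w(t)$ represent the amount of money each node receives at time $t$. At the next time step, each node gives a fraction $\alpha$ of this amount to each of its out-neighbors, printing extra as needed. The additional amount it produces can be expressed using the replication matrix $R = A$. Therefore, $\Delta w(t+1) = \alpha\, \Delta w(t)\, R$. Initially, let $\Delta w(0) = w(0)$. At time $t=1$, each node prints $\alpha\, w(0)$ for each out-neighbor: $\Delta w(1) = \alpha\, w(0)\, R$. Continuing this process, the additional amount out-neighbors receive at time $t$ is $\Delta w(t) = \alpha^t\, w(0)\, R^t$. The total amount each node has at time $t$ is obtained by summing what it received from in-neighbors at previous time steps:

$$w(t) = \sum_{k=0}^{t} \alpha^k\, w(0)\, R^k \qquad (4)$$

At time $t \to \infty$, Eq. 4 reduces to

$$w(t \to \infty) = w(0) + \alpha\, w(t \to \infty)\, R \qquad (5)$$

which can be solved to yield

$$w(t \to \infty) = w(0)\, (I - \alpha R)^{-1} \qquad (6)$$

This expression is defined for $\alpha < 1/\lambda_1$, where $\lambda_1$ is the largest eigenvalue of $R = A$.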
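To make the contrast with the conservative case concrete, the sketch below accumulates the series $\sum_k \alpha^k w(0) A^k$ on a small hypothetical graph and estimates the largest eigenvalue $\lambda_1$ by power iteration. It is an illustration under stated assumptions: the replication matrix is taken to be the adjacency matrix, the graph is made up, and the helper names (`largest_eigenvalue`, `nonconservative_weights`) are not from the paper.

```python
def vec_mat(v, M):
    """Row vector times matrix: (v M)_j = sum_i v_i * M[i][j]."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]


def largest_eigenvalue(A, iters=500):
    """Power iteration on A + I; the shift avoids oscillation on periodic graphs."""
    v = [1.0] * len(A)
    lam = 0.0
    for _ in range(iters):
        u = [x + y for x, y in zip(v, vec_mat(v, A))]  # v (A + I)
        lam = max(u)
        v = [x / lam for x in u]
    return lam - 1.0  # undo the shift


def nonconservative_weights(A, w0, alpha, steps):
    """Partial sum over k = 0..steps of alpha^k * w0 * A^k."""
    received = list(w0)  # amount received at t = 0
    total = list(w0)
    for _ in range(steps):
        # Every node prints alpha times what it just received for EACH out-neighbor.
        received = [alpha * x for x in vec_mat(received, A)]
        total = [a + b for a, b in zip(total, received)]
    return total


# Hypothetical 3-node graph with edges 0 -> 1, 0 -> 2, 1 -> 0, 2 -> 1.
A = [[0, 1, 1],
     [1, 0, 0],
     [0, 1, 0]]
w0 = [1.0, 1.0, 1.0]

lam1 = largest_eigenvalue(A)
threshold = 1.0 / lam1
below = nonconservative_weights(A, w0, alpha=0.5, steps=200)      # alpha < 1/lambda_1
above_50 = nonconservative_weights(A, w0, alpha=0.9, steps=50)    # alpha > 1/lambda_1
above_100 = nonconservative_weights(A, w0, alpha=0.9, steps=100)

print("1/lambda_1:", round(threshold, 4))
print("total weight below threshold:", round(sum(below), 4))
print("total weight above threshold after 50 and 100 steps:", sum(above_50), sum(above_100))
```

Below the bound the partial sums settle to a finite steady state whose total exceeds the initial weight; above it they keep growing with the number of steps, so no steady state exists.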

More generally, if along with producing a fraction $\alpha$ of what it receives from each in-neighbor, a node also produces a portion of this amount for itself, this leads to a more general form of the replication matrix:

(7)

2.2.1 Non-Conservative Dynamics and Epidemic Threshold

Non-conservative processes provide a useful framework for thinking about epidemics and other contact processes and lead to insights into the relation between dynamical processes and network structure. Consider a virus spreading on a network, where at each time step, a contagious node may infect its susceptible neighbors with probability $\beta$ (virus birth rate). At each time step, an infected node may also be cured with probability $\delta$ (virus curing rate). Wang et al. [51] modified existing models of SIS dynamics [2] for use on networks. The probability $p_i(t)$ that node $i$ is infected at time $t$ can be written in matrix notation as [51]:

$$p(t) = p(0)\left[\beta A + (1-\delta) I\right]^t \qquad (8)$$

where $p(t)$ is a vector $(p_1(t), \ldots, p_{|V|}(t))$, and $p(0)$ is the initial probability of infection. [Footnote 1: This model holds true only when $\beta$ is very small, and there may be situations where $p_i(t) > 1$. Therefore a more accurate interpretation is that the probability of infection is proportional to $p_i(t)$.] $p(t)$ is exactly equal to the additional weight, $\Delta w(t)$, accrued by a non-conservative process in Eq. 4, with $\alpha = \beta$ and $R = A + \frac{1-\delta}{\beta} I$. Therefore, a SIS-type epidemic is an example of a non-conservative dynamic process.

In the model in Eq. 8, there exists a threshold $\tau$ such that when the effective transmissibility of the virus $\beta/\delta < \tau$, it will die out, and for $\beta/\delta > \tau$ it will spread to a significant portion of the network. For any network, regardless of the details of the spreading mechanism [45], this threshold is given by the inverse of the largest eigenvalue of the adjacency matrix, $\tau = 1/\lambda_1$ [51], where $\lambda_1$ is known as the spectral radius of the graph. In numerical experiments we simulated epidemics on different graphs using the independent cascade model [50]. We found that the observed threshold where epidemics began to reach many nodes was consistent with the inverse of the spectral radius of the respective graph.
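The threshold can be checked on a toy example. The sketch below iterates the linearized update of Wang et al. [51], $p(t) = p(t-1)\,[\beta A + (1-\delta) I]$, on a small hypothetical graph, once with $\beta/\delta$ below $1/\lambda_1$ and once above it. It illustrates the linear model only, not the independent cascade simulations mentioned in the text; the graph, parameter values, and function names are assumptions made for the example.

```python
def vec_mat(v, M):
    """Row vector times matrix: (v M)_j = sum_i v_i * M[i][j]."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]


def largest_eigenvalue(A, iters=500):
    """Power iteration on A + I; the shift avoids oscillation on periodic graphs."""
    v = [1.0] * len(A)
    lam = 0.0
    for _ in range(iters):
        u = [x + y for x, y in zip(v, vec_mat(v, A))]
        lam = max(u)
        v = [x / lam for x in u]
    return lam - 1.0


def simulate_linear_sis(A, p0, beta, delta, steps):
    """Iterate p(t) = p(t-1) [beta * A + (1 - delta) * I]."""
    p = list(p0)
    for _ in range(steps):
        spread = vec_mat(p, A)  # infection arriving along links
        p = [(1 - delta) * p[j] + beta * spread[j] for j in range(len(p))]
    return p


# Hypothetical 3-node graph with edges 0 -> 1, 0 -> 2, 1 -> 0, 2 -> 1.
A = [[0, 1, 1],
     [1, 0, 0],
     [0, 1, 0]]
p0 = [0.01, 0.01, 0.01]
delta = 0.5
tau = 1.0 / largest_eigenvalue(A)  # epidemic threshold, roughly 0.75 for this graph

dies_out = simulate_linear_sis(A, p0, beta=0.30, delta=delta, steps=200)  # beta/delta = 0.6
spreads = simulate_linear_sis(A, p0, beta=0.45, delta=delta, steps=200)   # beta/delta = 0.9

print("threshold tau:", round(tau, 4))
print("total below threshold:", sum(dies_out))
print("total above threshold:", sum(spreads))
```

Because the model is linear, the values above the threshold grow past 1 and should be read as proportional to infection probability rather than as probabilities, in line with the footnote to Eq. 8.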

Threshold behavior appears to be a generic property of non-conservative dynamics. As shown in the Appendix, the expected path length of a non-conservative process, i.e., how far the process spreads as $t \to \infty$, is finite for $\alpha < 1/\lambda_1$ and unbounded for $\alpha > 1/\lambda_1$. Therefore, the expected path length diverges as $\alpha$ approaches $1/\lambda_1$ from below. This is a hallmark of critical behavior. For non-conservative processes, the critical behavior is associated with the epidemic threshold, below which the non-conservative process reaches very few nodes, but above which it reaches a significant fraction of all nodes.

There is another way to think about thresholds. Among epidemiologists, the principal quantity of interest is the reproductive number, $R_0$ [15]. Intuitively, this quantity is just the average number of new infections caused by a single infected person. If $R_0 > 1$, each infection creates new infections indefinitely, and results in an epidemic, while for $R_0 < 1$, the disease eventually dies out. Naively, the reproductive number should just be the average degree times the transmissibility, or contagiousness, of the virus. For the Digg follower graph, for example, with average degree $\langle k \rangle$ we have $R_0 = \langle k \rangle \beta$, where $\beta$ is the transmissibility of the virus. In that case, the epidemic threshold would be at $\beta = 1/\langle k \rangle$, much higher than we observed in simulations of an SIR epidemic (using the independent cascade model) on the Digg follower graph [50]. While a heterogeneous degree distribution (a common property of social networks) can lower the threshold compared to this prediction [4], this computation is not simple, making the basic reproductive number less useful in characterizing epidemics in social networks.
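The gap between the naive estimate and the spectral threshold is easy to see on a deliberately heterogeneous graph. The sketch below uses an undirected star (one hub, 25 leaves), which is a made-up example and not the Digg graph, and compares $1/\langle k \rangle$ with $1/\lambda_1$. The helper names are hypothetical.

```python
def vec_mat(v, M):
    """Row vector times matrix: (v M)_j = sum_i v_i * M[i][j]."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]


def largest_eigenvalue(A, iters=300):
    """Power iteration on A + I; the shift is needed here because a star is bipartite."""
    v = [1.0] * len(A)
    lam = 0.0
    for _ in range(iters):
        u = [x + y for x, y in zip(v, vec_mat(v, A))]
        lam = max(u)
        v = [x / lam for x in u]
    return lam - 1.0


def star_graph(leaves):
    """Symmetric adjacency matrix of a hub (node 0) linked to `leaves` leaf nodes."""
    n = leaves + 1
    A = [[0] * n for _ in range(n)]
    for j in range(1, n):
        A[0][j] = 1
        A[j][0] = 1
    return A


A = star_graph(25)
mean_degree = sum(sum(row) for row in A) / len(A)
naive_threshold = 1.0 / mean_degree      # from R_0 = <k> * beta = 1
lam1 = largest_eigenvalue(A)
spectral_threshold = 1.0 / lam1          # inverse of the largest eigenvalue

print("mean degree:", round(mean_degree, 3))
print("naive threshold 1/<k>:", round(naive_threshold, 3))
print("spectral threshold 1/lambda_1:", round(spectral_threshold, 3))
```

On this graph the hub dominates the spectrum, so the spectral threshold is well below the one predicted from the average degree, which is the qualitative effect the paragraph above attributes to heterogeneous degree distributions.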

3 Dynamical Processes and Centrality

The complex interplay between network structure and dynamics has broad implications for social network analysis. Take the task of identifying influential or prestigious actors in a social network. Over the years many different centrality measures have been developed to address this task, including degree centrality, betweenness centrality [18], eigenvector centrality [7], PageRank [43] and Alpha-Centrality [8], among many others. Applied to the same network, however, each measure leads to a different, even conflicting, notion of who the central actors are. In order to make sense of the scores produced by each centrality measure, it is important to consider the nature of the dynamical process on the network.

3.1 Centrality Measures

We study PageRank and Alpha-Centrality, two widely used measures of centrality, and show their relationship to conservative and non-conservative processes.

PageRank

A PageRank vector $pr$ is the steady state probability distribution of a random walk with restarts with a damping factor $\alpha$ (restart probability $= 1-\alpha$). The starting vector $s$ gives the probability distribution for where the walk transitions to after restarting. The transfer matrix $T$ encodes the transition probabilities of a random walk on the network, $T = D^{-1}A$. The PageRank vector is the unique solution of:

$$pr = (1-\alpha)\, s + \alpha\, pr\, T \qquad (9)$$

Equation 9 is identical to the steady state solution of the linear conservative dynamic process given by Eq. 2 where $w(0) = s$ and $T = D^{-1}A$. Therefore, PageRank is the steady state solution of a conservative process, and it is a conservative measure. Other measures derived from the random walk, such as betweenness centrality, are also conservative.

Alpha-Centrality

Alpha-Centrality measures the total number of paths from a node, exponentially attenuated by their length. Bonacich introduced this measure [8] as a generalization of the index of status proposed by Katz [35], and it is sometimes referred to as Bonacich centrality. It is also similar to the communicability index recently explored by the physics community [16]. For an attenuation parameter $\alpha$, the Alpha-Centrality vector $cr$ is the solution of:

$$cr = s + \alpha\, cr\, A \qquad (10)$$

where the starting vector $s$ is taken as indegree centrality, $s = eA$ [9], with $e$ a row vector of ones. Equation 10 holds while $\alpha < 1/\lambda_1$, where $\lambda_1$ is the spectral radius of the network. This bound, in fact, is the same as the epidemic threshold (Section 2.2.1). For positive values, parameter $\alpha$ determines how far, on average, a node’s effect will be felt and sets the length scale of interactions. [Footnote 2: Bonacich proposed to use the $\alpha < 0$ case to model power relations in social networks. Our focus here is on quantifying influence; therefore, we study the $\alpha > 0$ case.] When $\alpha$ is small, Alpha-Centrality probes only the local structure of the network. As $\alpha$ grows, more distant nodes contribute to the centrality score of a given node [22]. As $\alpha \to 1/\lambda_1$, the length scale of interactions diverges (Sec. 2.2.1) and it becomes a global measure.

One difficulty in using Alpha-Centrality is that it is not defined for $\alpha \geq 1/\lambda_1$. We recently introduced normalized Alpha-Centrality that overcomes this problem [22]. It normalizes the score of each node by the sum of the Alpha-Centrality scores of all the nodes. The new measure avoids the problem of bounded parameters while retaining the desirable characteristics of Alpha-Centrality, namely its ability to differentiate between local and global structures. Normalized Alpha-Centrality is written as:

$$ncr = \frac{cr}{\sum_{i} cr_i} \qquad (11)$$

This is defined for values of $\alpha$ both below and above $1/\lambda_1$. Its value changes with $\alpha$ for $\alpha < 1/\lambda_1$. For $\alpha > 1/\lambda_1$, normalized Alpha-Centrality is independent of $\alpha$ and the ordering found by normalized Alpha-Centrality in this parameter range is equivalent to the ordering found by eigenvector centrality [20].

Alpha-Centrality and its normalized version are equivalent to Eq. 5, with the initial distribution of weight given by $w(0) = s/Z$, where $Z = 1$ for Alpha-Centrality and

$$Z = \left| A (I - \alpha A)^{-1} \right|$$

for normalized Alpha-Centrality. Note that we use the notation $|M| = \sum_{i,j} M_{ij}$ for any matrix $M$. Therefore, (normalized) Alpha-Centrality is the steady state solution of a non-conservative dynamic process. Variations of non-conservative dynamics lead to other non-conservative measures of centrality, such as degree centrality, Katz index [35], SenderRank [37], and eigenvector centrality [7].
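The behavior of normalized Alpha-Centrality on either side of $1/\lambda_1$ can be explored with a short sketch. It rests on an assumption that is my reading rather than a definition taken from [22]: the score is computed from the truncated series $\sum_{k=0}^{t} \alpha^k s A^k$ with $s = eA$, normalized to sum to one, for a large but finite $t$. The graph and the function name `normalized_alpha_centrality` are hypothetical.

```python
def vec_mat(v, M):
    """Row vector times matrix: (v M)_j = sum_i v_i * M[i][j]."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]


def normalized_alpha_centrality(A, alpha, steps=300):
    """Truncated series sum_{k<=steps} alpha^k s A^k with s = eA, scaled to sum to 1."""
    n = len(A)
    s = vec_mat([1.0] * n, A)  # in-degree of every node
    term = list(s)
    total = list(s)
    for _ in range(steps):
        term = [alpha * x for x in vec_mat(term, A)]
        total = [a + b for a, b in zip(total, term)]
    z = sum(total)
    return [x / z for x in total]


# Hypothetical 3-node graph with edges 0 -> 1, 0 -> 2, 1 -> 0, 2 -> 1.
# Its largest eigenvalue is about 1.325, so 1/lambda_1 is about 0.755.
A = [[0, 1, 1],
     [1, 0, 0],
     [0, 1, 0]]

local = normalized_alpha_centrality(A, alpha=0.0)   # reduces to normalized in-degree
mid = normalized_alpha_centrality(A, alpha=0.5)     # below 1/lambda_1
high_a = normalized_alpha_centrality(A, alpha=0.9)  # above 1/lambda_1
high_b = normalized_alpha_centrality(A, alpha=1.0)  # above 1/lambda_1

for name, scores in [("alpha=0.0", local), ("alpha=0.5", mid),
                     ("alpha=0.9", high_a), ("alpha=1.0", high_b)]:
    print(name, [round(x, 4) for x in scores])
```

Under this truncated-series reading, the scores for the two values of $\alpha$ above $1/\lambda_1$ coincide and are proportional to the leading left eigenvector of $A$, consistent with the statement that the measure becomes independent of $\alpha$ and matches eigenvector centrality in that range.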

3.2 Choosing Appropriate Centrality Measure

When applied to the same network, different measures of centrality may lead to different, often incompatible, views of who the central actors are. The natural question to ask is: Which centrality measure is appropriate for a given network? The choice of centrality must be motivated by details of the dynamical process taking place on the network [10]. Thus, a conservative measure such as PageRank is appropriate for analyzing networks on which conservative processes are taking place, for example, web surfing or money exchange. However, for a social network on which information or epidemics are spreading, the non-conservative Alpha-Centrality may be more appropriate.

4 An Empirical Study of Centrality

In this section we use social media data to evaluate the claim that the measure that best identifies central nodes is one that captures details of the dynamical process taking place on the network. Social media sites such as Facebook, Twitter, and Digg have become important hubs of social activity and conduits of information. Correctly identifying central or influential users in these networks can have far-reaching consequences for identifying noteworthy content, targeted information dissemination, and other applications. While a variety of methods [14, 41, 47, 3] have been used to identify influential social media users, each measure produces different results, with no clear understanding of when it is appropriate. Fortunately, by exposing user activity, social media provides a rare opportunity to study the role of dynamic processes on networks.

Both Digg and Twitter allow users to create social networks by listing others as friends. The friend relationship is asymmetric. When user $a$ lists $b$ as a friend ($a \to b$), $a$ follows $b$’s activity, but not vice versa. We call $a$ the follower of $b$ (or fan on Digg). When the follower graph is represented in matrix form, a user’s indegree measures the number of followers she has, and her outdegree the number of friends she follows.

By submitting a story to Digg (or tweeting a URL to a story on Twitter), a user broadcasts it to her followers. When another user votes for the story, she re-broadcasts it to her own followers. Broadcast-driven information diffusion has a non-conservative flavor; therefore, a non-conservative centrality measure should better identify influential users.

We analyzed information diffusion on the follower graphs of Digg and Twitter and used this data to construct an empirical estimate of user influence. We then examined how well different centrality measures agree with the empirical measure of influence.

4.1 Data Sets

The Digg dataset [Footnote 3: http://www.isi.edu/lerman/downloads/digg2009.html] contains more than 3 million votes on some 3500 stories promoted to Digg’s front page in June 2009. More than 139K distinct users voted for at least one story in the data set (submission counts as the story’s first vote). We call these users active users. Next, we extracted the friendship links created by active users and constructed a follower graph that contained active users who were following the activities of others. However, only about 71K active users listed others as friends, resulting in a network with around 300K users and over 1 million links.

The Twitter data set was collected over the period of three weeks in October 2010 using the Gardenhose streaming API. We focused on tweets that included a URL in the body of the message, usually shortened by some URL shortening service, such as bit.ly or tinyurl. In order to ensure that we had the complete retweeting history of each URL, we used Twitter’s search API to retrieve all tweets containing that URL. Users who tweeted the URL are considered active. Data collection process resulted in more than 3 million posts tweeted by 816K users which mentioned 70K distinct shortened URLs. Next, we used the REST API to collect followers of each active user, keeping only those followers who themselves were active, i.e., tweeted at least one URL during data collection period. The resulting follower graph had almost 700K nodes and over 36 million edges. While filtering out non-active followers will change results of centrality calculations, we argue that this is an appropriate simplification to make, both conceptually and to keep the graph of a computationally manageable size. We argue that inactive users do not contribute to information spread, and should not be considered in calculations of centrality.

While voting on Digg represents pure information diffusion (in contrast to Twitter, Digg user can vote only once for a story), tweeting activity in our sample encompassed diverse behaviors from pure information diffusion of newsworthy content to orchestrated manipulation campaigns, robo-tweeting, advertising and spam. Since our analysis applies only to information diffusion-type behavior, we have to filter out latter activities. We used a method described in [23] to automatically classify tweeting behaviors using two information theoretic features. The first feature is the entropy of the distribution of distinct users who re-tweeted the URL. The second feature is the entropy of the distribution of time intervals between successive re-tweets of the same URL. We showed that these two features alone were able to accurately separate re-tweeting activity into meaningful classes. High user entropy implies that many different people re-tweeted the URL, with most people re-tweeting it once. High time interval entropy implies presence of many different time scales, which is a characteristic of human activity. In contrast, low time interval entropy implies that URL is retweeted at one or few regular time intervals, which is characteristic of automated (possibly spam) activity. In this paper, we focus on those URLs from the data set which are characterized by high () user and time interval entropies. These parameter values are associated with the spread of news-worthy content and excludes robotic spamming and manipulation campaigns driven by few individuals.
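The two entropy features can be illustrated with a short sketch. It is not the classifier of [23]: the binning of time intervals into fixed-width buckets (`bin_seconds`), the use of base-2 logarithms, and the function names are assumptions made for the example, and the input data are invented.

```python
import math
from collections import Counter


def shannon_entropy(counts):
    """Shannon entropy (in bits) of the empirical distribution given by `counts`."""
    counts = list(counts)
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def retweet_entropies(retweets, bin_seconds=60):
    """retweets: list of (user, timestamp_in_seconds) pairs for one URL.

    Returns (user entropy, time-interval entropy). Intervals between successive
    retweets are bucketed into bins of `bin_seconds`, a choice made for this sketch.
    """
    user_counts = Counter(user for user, _ in retweets)
    times = sorted(t for _, t in retweets)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    gap_counts = Counter(int(g // bin_seconds) for g in gaps)
    user_entropy = shannon_entropy(user_counts.values())
    time_entropy = shannon_entropy(gap_counts.values()) if gaps else 0.0
    return user_entropy, time_entropy


# Invented example 1: one account posting the same URL every 10 minutes.
robot = [("bot", 600 * i) for i in range(8)]
# Invented example 2: eight different people retweeting at irregular intervals.
organic = list(zip("abcdefgh", [0, 30, 200, 1000, 5000, 20000, 100000, 500000]))

print("robot:", retweet_entropies(robot))
print("organic:", retweet_entropies(organic))
```

The automated pattern scores zero on both features, while the organic pattern scores high on both, which is the separation the paragraph above relies on when it keeps only URLs with high user and time interval entropies.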

4.2 Empirical Estimates of Influence

Katz and Lazarsfeld [34] defined influentials as “individuals who were likely to influence other persons in their immediate environment.” In the years that followed, many attempts were made to identify people who influenced others to adopt a new practice or product [12]. The rise of online social networks has allowed researchers to trace the flow of information through social links on a massive scale. Using the new empirical foundation, some researchers proposed to measure a person’s influence in social media locally, by the number of votes or retweets from followers her posts generate [20, 3], or globally, by the size of cascades her posts trigger [36, 3]. Alternatively, Trusov et al. [49] defined influential people in an online social network as those whose activity stimulates those connected to them to increase their activity, while Cha et al. [14] used the total number of retweets and mentions, including from people not connected either directly or indirectly to the submitter, to measure user influence on Twitter.

Following these works, we measure influence by analyzing user activity in social media. Suppose that a user posts new information on Digg or Twitter, specifically, a URL to a news story. We refer to this user as the story’s submitter. Whether or not her follower will re-broadcast the story (i.e., retweet it on Twitter or vote for it on Digg) depends on its quality and the submitter’s influence. We assume that a story’s quality is uncorrelated with the submitter. [Footnote 4: This may seem like a strong statement, but as other studies of Digg show [30, 31], how interesting a story is to the submitter’s followers does not depend on who the submitter is, at least not on Digg.] Therefore, we can average out its effect by aggregating over all stories submitted by the same user. We claim that the residual difference between submitters can be attributed to variations in influence. We use two empirical measures of a submitter’s influence: (i) the average number of times her submissions are re-broadcast by her followers (local influence [3]), and (ii) the average size of the cascades her posts trigger (global influence [3]).

Figure 1: Analysis of the empirical estimate of influence on Digg (panels a, b) and Twitter (panels c, d). (a, c) The scatter plot shows the average number of times followers rebroadcast a story within its first 100 rebroadcasts vs. the number of followers the submitter has. Each point represents a distinct submitter. (b, d) Probability of the expected number of follower rebroadcasts being generated purely by chance.

4.2.1 Measuring local influence on Digg

To reduce the effect of the front page to which Digg promotes popular stories, we count the number of votes from submitter’s followers within the first 100 votes only. Since few stories are promoted to the front page before they receive that many votes, this ensures social links are mainly responsible for spreading interest in stories [42]. Of the 3552 stories in the Digg data set, 3489 were submitted by 572 connected users. Of these, 289 distinct users submitted two or more stories which received at least one follower vote within the first 100 votes, providing us with enough information to estimate influence. Figure 1(a) shows the average number of follower votes within the first 100 votes received by stories submitted by these users versus the number of followers these users have.
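The counting procedure in this paragraph can be sketched as follows. This is an illustration under assumptions, not the authors' pipeline: each story is represented as a submitter plus a chronological list of voters (submitter first), `followers` maps a user to the set of her followers, and only stories with at least one follower vote in the window are averaged. All names and the sample data are hypothetical.

```python
from collections import defaultdict


def local_influence(stories, followers, window=100, min_stories=2):
    """Average number of follower votes within the first `window` votes, per submitter.

    stories:   list of (submitter, [voters in chronological order, submitter first])
    followers: dict mapping a user to the set of users who follow her
    Only stories with at least one follower vote are counted, and only submitters
    with at least `min_stories` such stories are returned.
    """
    counts = defaultdict(list)
    for submitter, votes in stories:
        fans = followers.get(submitter, set())
        follower_votes = sum(1 for voter in votes[:window] if voter in fans)
        if follower_votes > 0:
            counts[submitter].append(follower_votes)
    return {user: sum(c) / len(c) for user, c in counts.items() if len(c) >= min_stories}


# Invented data: "s" submits two stories, "t" only one.
followers = {"s": {"a", "b"}, "t": {"a"}}
stories = [
    ("s", ["s", "a", "b", "x"]),  # two follower votes
    ("s", ["s", "a", "y"]),       # one follower vote
    ("t", ["t", "a"]),            # too few stories to estimate influence
]
influence = local_influence(stories, followers)
print(influence)
```

Restricting the count to the first `window` votes mirrors the cutoff used in the text to limit the effect of front-page promotion.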

Are these observations significant? Do submitters with more followers simply get more votes due to greater numbers of followers? Or could we have observed that many follower votes purely by chance? Let’s assume that there are $N$ users who vote for stories randomly, independently of who submits them. This type of stochastic voting can be described by the urn model [38]. Imagine an urn that contains $N$ balls, of which $K$ are white. Imagine also that we draw $n$ balls from the urn without replacing them. How many of them will be white? The probability that $k$ of the first $n$ votes come from the submitter’s followers purely by chance is equivalent to the probability that $k$ of the $n$ balls drawn from the urn are white. This probability is given by the hypergeometric distribution:

$$P(X = k \mid K, N, n) = \frac{\binom{K}{k} \binom{N-K}{n-k}}{\binom{N}{n}} \qquad (12)$$
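The urn-model probability can be evaluated directly with exact integer binomial coefficients. In the sketch below, `N` and `n` take the values quoted in the text for Digg, while the follower count `K = 1000` and the observed count `k = 10` are hypothetical numbers chosen only to illustrate the calculation.

```python
from math import comb


def hypergeom_pmf(k, N, K, n):
    """Probability that exactly k of n balls drawn without replacement are white,
    when the urn holds N balls of which K are white."""
    if k < 0 or k > min(K, n):
        return 0.0
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def prob_at_least(k, N, K, n):
    """Probability of seeing k or more white balls purely by chance."""
    return sum(hypergeom_pmf(i, N, K, n) for i in range(k, min(K, n) + 1))


N = 71367  # users who could vote (value used in the text)
n = 100    # only the first 100 votes are considered
K = 1000   # hypothetical number of followers of the submitter
k = 10     # hypothetical number of follower votes observed

p_chance = prob_at_least(k, N, K, n)
print("expected follower votes by chance:", n * K / N)
print("P(at least", k, "follower votes by chance):", p_chance)
```

With these illustrative numbers, random voting would produce between one and two follower votes on average, so observing ten or more is very unlikely to be a chance event.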

Using Eq. 12, we compute the probability (N=71367, n=100) a story submitted by a Digg user with followers received votes from submitter’s followers purely by chance. As shown in Figure 1(b), for , this probability is very small; therefore, it is unlikely () these votes could arise purely by chance. We conclude that average number of follower votes received by stories submitted by a user (with at least 100 followers) is a statistically significant () measure of her influence.

4.2.2 Measuring local influence on Twitter

We analyzed the Twitter data set using the same methodology. There were 174 users who posted at least two URLs that were retweeted at least 100 times. Figure 1(c) shows the average number of times the posts of these users were retweeted by their followers. Figure 1(d) shows the probability these number of retweets could have been observed purely by chance. Since these values are small, we conclude that average number of follower retweets is a statistically significant () estimate of influence on Twitter.

4.2.3 Measuring global influence

Alternatively, we can measure the influence of the submitter by the average size of the cascades her posts trigger. A cascade describes how information spreads on the follower graph. The cascade begins with a seed, e.g., the story submitter, who broadcasts the story to her followers. It grows when these followers choose to vote for or retweet the story, in turn broadcasting it to their own followers, and so on. All nodes in a cascade are connected to the seed through follower relations, either directly or indirectly through other nodes in the cascade.

For each post, we extracted the cascade that starts with the submitter and includes all voters/retweeters who are connected to voters/retweeters in the cascade via follower links. The larger the cascade size (on average), the more influential the submitter.
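One way to extract such a cascade is sketched below. It assumes that votes are processed in chronological order and that a voter joins the cascade only if she follows someone who is already in it at the time of her vote; this ordering rule is an assumption of the sketch, as are the function names and the sample data.

```python
def extract_cascade(voters, friends):
    """Return the cascade for one story.

    voters:  user ids in chronological order, with the submitter first
    friends: dict mapping a user to the set of users she follows
    A voter is added if she follows at least one earlier cascade member.
    """
    if not voters:
        return []
    cascade = [voters[0]]
    members = {voters[0]}
    for user in voters[1:]:
        if user in members:
            continue  # count repeated rebroadcasts by the same user only once
        if friends.get(user, set()) & members:
            cascade.append(user)
            members.add(user)
    return cascade


def average_cascade_size(stories, friends):
    """Mean cascade size over a list of stories, each given as a list of voters."""
    sizes = [len(extract_cascade(voters, friends)) for voters in stories]
    return sum(sizes) / len(sizes)


# Invented example: a follows s, b follows a, c follows x, and x follows nobody.
friends = {"a": {"s"}, "b": {"a"}, "c": {"x"}, "x": set()}
voters = ["s", "a", "x", "b", "c"]
cascade = extract_cascade(voters, friends)
print(cascade)
```

In the example, `x` and `c` voted for the story but are not linked to the submitter through earlier voters, so they are treated as having found the story independently and are left out of the cascade.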

4.3 Comparison of Centrality Measures

Figure 2: Correlation between the rankings produced by the local (average number of follower re-broadcasts) and global (average cascade size) empirical measures of influence with those predicted by normalized Alpha-Centrality and PageRank on Digg and Twitter. Panels: (a) Digg local, (b) Digg global, (c) Twitter local, (d) Twitter global.

We use empirical estimates of influence to rank a subset of users in the Digg and Twitter data sets who submitted more than one story (URL) which received at least 100 votes (retweets). However, of the 174 such submitters on Twitter, only 75 could be classified as not spammers according to the entropy criteria [23] mentioned above; therefore, we restrict analysis to these users. We evaluated centrality measures by comparing how they rank these submitters with how they are ranked by the empirical measures of influence using Pearson’s correlation coefficient. We studied standard PageRank (with uniform starting vector) and Alpha-Centrality (with in-degree as the starting vector), both of which were computed on the follower graph. The effect of using other starting vectors for PageRank (as is done in personalized PageRank) and Alpha-Centrality is the subject of future work.
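The comparison of rankings can be sketched as a Pearson correlation computed on rank vectors. The tie handling (average ranks) is an assumption of this sketch, since the text does not say how ties were treated, and the centrality and influence values below are invented.

```python
import math


def average_ranks(values):
    """Rank values in descending order (rank 1 = largest); ties get the average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def rank_correlation(centrality, influence):
    """Correlation between the ranking by centrality and the ranking by influence."""
    return pearson(average_ranks(centrality), average_ranks(influence))


# Invented scores for four submitters.
centrality = [0.40, 0.10, 0.30, 0.20]
influence = [12.0, 2.5, 3.0, 7.0]
print("rank correlation:", rank_correlation(centrality, influence))
```

Repeating this calculation for each value of the centrality parameter would give one point per parameter value on a correlation curve of the kind shown in Figure 2.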

Figure 2 shows the correlation of the empirical measures of influence with normalized Alpha-Centrality and PageRank on Digg (a,b) and Twitter (c,d). Parameter $\alpha$ stands for the attenuation factor for Alpha-Centrality (see Equations 10 and 11) and the damping factor (restart probability $= 1-\alpha$) for PageRank (see Equation 9). Figures 2(a) and 2(c) show the correlation with the local measure of influence (average number of follower re-broadcasts), while Figures 2(b) and 2(d) show the correlation with the global empirical estimate of influence (average cascade size). On Digg (Fig. 2(a)), Alpha-Centrality correlates better than PageRank with the local measure of influence over a wide range of values; however, on Twitter, PageRank starts to perform better beyond a certain value of $\alpha$ (Fig. 2(c)). This could be because simple epidemics do not completely describe information spread in social media [50, 28].

Though the correlation with global influence on Digg (Fig. 2(b)) is lower overall than with local influence, Alpha-Centrality outperforms PageRank for all values of $\alpha$. Surprisingly, the correlations on Twitter (Fig. 2(d)) are negative. This is consistent with the observations of Bakshy et al. [3], who found that the cascade size of past submissions was not a good predictor of the cascade size of a user’s future submissions on Twitter. The correlations are less negative for PageRank, but it is difficult to conclude anything about the relative performance of Alpha-Centrality and PageRank from these results.

The insets show the interval corresponding to small values of $\alpha$. Note that normalized Alpha-Centrality becomes a global metric very quickly, i.e., over a small range of values. The point at which it becomes constant corresponds to the epidemic threshold. There are interesting differences in the behavior of the correlation with the empirical measure of influence on Digg and Twitter. On Digg, the correlation with Alpha-Centrality grows as $\alpha$ increases, suggesting that global structure becomes more important in determining influence, while on Twitter it has the opposite behavior. These differences could arise from differences in the network structure, and will be addressed in future research.

The empirical results, for the most part, support our claim that Alpha-Centrality is better able to identify important users than PageRank because it more closely models the spread of information on social media, which takes place via broadcasts from users to their followers. However, PageRank sometimes outperforms Alpha-Centrality on the local measure of influence, indicating that information spread is a more complex process than a simple epidemic [50, 28]. Incidentally, the value of $\alpha$ above which PageRank outperforms Alpha-Centrality on Digg was the value suggested by Page et al. [43] for finding important pages in a Web graph. Empirical studies suggest that different values of $\alpha$ are appropriate for different domains [24], although some authors caution [6] against using values of $\alpha$ close to one. Since it is not clear what value of $\alpha$ would be appropriate for social networks, Alpha-Centrality’s better overall performance suggests that it is better suited for identifying influential users on Digg.

Results for the correlation of centrality with the global measure of influence are less conclusive. While Alpha-Centrality does correlate better than PageRank with this measure on Digg for all values of $\alpha$, on Twitter these measures are anti-correlated. One possible explanation could be differences in the user interface on these sites. Another possibility is that information spread deviates from a simple epidemic more on Twitter. Yet another explanation could be differences in network structure on the two sites, or simply an artifact of the biases introduced by our aggressive spam filtering or the small size of the data set. We are addressing these questions in our ongoing work.

5 Related Work

The interplay between structural properties of networks and the diffusion processes occurring on them contributes to their complexity. This has been recognized by several researchers in the past. For example, Lambiotte et al. [39, 40] emphasized that dynamical processes play an important role in characterizing the structure of complex networks. In [39] they measure the quality of a network partition in terms of the statistical properties of the dynamic process taking place on the network. In [40] they study the different equilibrium properties of these processes. However, their works focus on what we call conservative processes: unbiased and biased random walks, discrete and continuous time random walks. In contrast, we also study non-conservative dynamical processes. We also relate these processes to centrality. Although the relationship of PageRank to random walk-type processes is well known, we explain how Alpha-Centrality is related to a type of non-conservative process. We also carry out an empirical study of different centrality measures, unlike previous works.

Non-conservative processes are useful for studying a wide range of social phenomena, including the spread of epidemics within a population and information diffusion in social media, viral marketing, and many others. Many of these phenomena have been investigated by other researchers. The study of epidemics, in particular, has a very long history [27, 2]. It is known, for example, that epidemics exhibit critical behavior, and that the threshold of critical behavior is related to network structure [51, 45, 44]. The present work further confirms the relationship between epidemic threshold and network structure. Moreover, it gives an intuitive explanation for critical behavior of epidemics in terms of diverging length scales of non-conservative interactions.

Borgatti [10] suggested a link between centrality and dynamical processes, defining a node’s centrality in terms of its participation in the flow taking place on the network [11]. Therefore, he claimed, the appropriate centrality measure for a given network is one that takes into account the details of the flow. He proposed a typology of flows, based on the trajectories they follow (e.g., geodesics, paths, trails) and the mechanism of spread (e.g., transfer or broadcast), and used simulations to explore the relationship between flows and centrality measures, such as betweenness, degree, and eigenvector [7] centralities. He showed that centrality whose assumptions matched details of the flow was able to better reproduce key observations, such as how quickly or how frequently the flow reached a node. For example, a flow that follows geodesics (shortest paths) frequently visits nodes with highest betweenness centrality [18]. We propose a simpler classification scheme that differentiates flows based on whether or not they conserve the flowing quantity. Unlike Borgatti’s work, we mathematically explore the relationship between different flows and centrality and empirically study differences between centrality measures.

Estrada et al. [16] studied measures similar to Alpha-Centrality and personalized PageRank (with attenuation factor 1) which they call communicability. They linked the communicability functions to dynamics by showing their relationship to the thermal Green’s function of oscillators. They used communicability to identify important actors in small social networks, demonstrating that different communicability functions led to different judgements of centrality, but did not justify the choice of the particular communicability function in terms of the interactions taking place between actors. Although we study a similar function, the goal of our work is to contrast conservative and non-conservative dynamics and explain how these differences should guide the choice of centrality measure for a given social network.

Researchers are increasingly turning to social media data sets to study the properties of complex networks. Some studies used activity-based measures, such as the number of mentions or re-tweets [14, 3] to identify important social media users. Besides correlating these activity-based measures with degree centrality [14], no study has investigated centrality in social media. Our focus in this paper is to justify the choice of centrality by taking into account the dynamical processes taking place on the network.

6 Conclusion

We described two fundamentally distinct dynamical processes on networks, which can be differentiated based on whether or not they conserve some quantity that is distributed on the network, and studied their relationship to two well-known centrality measures used to identify important or influential actors in a social network: PageRank and Alpha-Centrality. While PageRank represents a steady state distribution of a conservative dynamic process on a network, for example, a random walk with restarts, we showed that Alpha-Centrality is a solution of non-conservative dynamics, examples of which include epidemics and signaling by broadcasts.
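The distinction between the two kinds of dynamics can be made concrete with a single update step of each. This is a simplified sketch under assumptions of our own choosing: `out_links[i]` lists the nodes that node `i` sends to, a random walker on a node with no out-links stays put, and the broadcast step sends an attenuated copy of a node's weight to every out-neighbor. The function names are hypothetical.

```python
from typing import List


def random_walk_step(w: List[float], out_links: List[List[int]]) -> List[float]:
    """Conservative step: each node splits its weight equally among its
    out-neighbors, so the total weight on the network is unchanged."""
    nxt = [0.0] * len(w)
    for i, nbrs in enumerate(out_links):
        if not nbrs:
            nxt[i] += w[i]  # assumption: weight on a sink stays where it is
            continue
        for j in nbrs:
            nxt[j] += w[i] / len(nbrs)
    return nxt


def broadcast_step(w: List[float], out_links: List[List[int]],
                   alpha: float) -> List[float]:
    """Non-conservative step: each node sends a copy of its weight,
    attenuated by alpha, to every out-neighbor; the total is not preserved."""
    nxt = [0.0] * len(w)
    for i, nbrs in enumerate(out_links):
        for j in nbrs:
            nxt[j] += alpha * w[i]
    return nxt
```

Starting from one unit of weight on a node with two out-neighbors, the random walk step leaves a total of one unit split between them, whereas the broadcast step with `alpha = 0.8` leaves 0.8 on each neighbor, a total of 1.6.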

By analyzing data about information diffusion in social media, we found that Alpha-Centrality tends to better correlate with the empirical measures of influence than PageRank, although it is not clearly superior overall. Our recent research suggests that while information diffusion in social media does have a non-conservative flavor, it is a more complex process than a simple epidemic [29]. A centrality measure that takes into account the nature of information spread in social media could better predict influential social media users. We are currently studying the impact of the microscopic mechanics of contagion on centrality.

Centrality is but one type of measurement of network structure. Other types of measurements, for instance, community detection or determining the strength of social ties, may also be affected by the nature of the dynamic processes occurring on networks. We are addressing these in our ongoing work.

Appendix

Replication matrix can be written in terms of its eigenvalues and eigenvectors as:

(13)

where is a matrix whose columns are the eigenvectors of . is a diagonal matrix, whose diagonal elements are the eigenvalues, , arranged according to the ordering of the eigenvectors in . Without loss of generality we assume that . The matrices can be determined from the product

(14)

where is the selection matrix having zeros everywhere except for element  [19]. Therefore,

(15)

where if and if . As obvious from above, for Equation 15 to hold non-trivially, . Now assuming is strictly greater than any other eigenvalue

For any matrix , let Therefore, the expected number of paths is . The expected path length is given by:

Therefore, as and , the expected path length is approximately , and for it is .
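As a hedged sketch of the kind of spectral argument outlined above, consider a generic diagonalizable matrix $R$ standing in for the replication matrix, with eigenvalues $\lambda_1 > |\lambda_i|$ for $i > 1$. All symbols below ($R$, $X$, $\Lambda$, $Y_i$, $Z_i$, $x$) are assumed notation chosen for illustration and need not match the authors' equations (13) to (15).

```latex
% Assumed notation; a sketch, not a reproduction of Equations (13)-(15).
\begin{align}
R &= X \Lambda X^{-1} = \sum_{i} \lambda_i Y_i,
  & Y_i &= X Z_i X^{-1}, \quad Y_i Y_j = \delta_{ij} Y_i, \\
R^k &= \sum_{i} \lambda_i^{k} Y_i, \\
\sum_{k=0}^{\infty} \alpha^k R^k &= \sum_{i} \frac{1}{1 - \alpha \lambda_i}\, Y_i,
  & &\text{valid only if } |\alpha \lambda_i| < 1 \text{ for all } i .
\end{align}
% Near the threshold the i = 1 term dominates, so paths of length k carry
% weight proportional to x^k with x = \alpha \lambda_1, and the mean length is
\begin{equation}
\langle k \rangle \approx \frac{\sum_{k \ge 0} k\, x^{k}}{\sum_{k \ge 0} x^{k}}
  = \frac{x}{1 - x}, \qquad x = \alpha \lambda_1 < 1 .
\end{equation}
```

Under these assumptions the series converges only for $\alpha < 1/\lambda_1$, and the mean path length $x/(1-x)$ diverges as $\alpha \to 1/\lambda_1$, which is consistent with the diverging length scale of non-conservative interactions at the epidemic threshold discussed in the text.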

Acknowledgments

This material is based upon work supported by the National Science Foundation under Grants No. 0915678 and CIF-1217605, the Air Force Office of Scientific Research under Contract Nos. FA9550-10-1-0102 and FA9550-10-1-0569, by the Air Force Research Laboratory under Contract No. FA8750-12-2-0186, and by DARPA under Contract No. W911NF-12-1-0034. KL would like to acknowledge Suradej Intagorn for collecting Digg data, and Tawan Surachawala and Jeon-Hyung Kang for collecting and analyzing Twitter data. The authors would also like to thank Konstantin Voevodski for helpful comments and Shang-Hua Teng for the edifying and insightful conversations.

References

  • [1] R. M. Anderson and R. May, Infectious diseases of humans: dynamics and control, Oxford University Press, 1991.
  • [2] N. Bailey, The Mathematical Theory of Infectious Diseases and its Applications, Griffin, London, 1975.
  • [3] E. Bakshy, J. M. Hofman, W. A. Mason and D. J. Watts, Everyone’s an influencer: quantifying influence on twitter, in Proc. the fourth ACM Int. Conf. on Web search and data mining, New York, NY, USA, 2011, 65–74.
  • [4] A. Barrat, M. Barthélemy and A. Vespignani, Dynamical Processes on Complex Networks, 1st edition, Cambridge University Press, Cambridge, England, 2008.
  • [5] L. M. A. Bettencourt, A. Cintrón-Arias, D. I. Kaiser and C. Castillo-Chávez, The power of a good idea: quantitative modeling of the spread of ideas from epidemiological models, Physica A: Statistical Mechanics and its Applications, In Press, Corrected Proof.
  • [6] P. Boldi, M. Santini and S. Vigna, Pagerank as a function of the damping factor, in Proc. the 14th Int. Conf. on World Wide Web, 2005, 557–566.
  • [7] P. Bonacich, Factoring and weighting approaches to status scores and clique identification, Journal of Mathematical Sociology, 2 (1972), 113–120.
  • [8] P. Bonacich, Power and centrality: a family of measures, The American Journal of Sociology, 92 (1987), 1170–1182.
  • [9] P. Bonacich and P. Lloyd, Eigenvector-like measures of centrality for asymmetric relations, Social Networks, 23 (2001), 191–201.
  • [10] S. Borgatti, Centrality and network flow, Social Networks, 27 (2005), 55–71.
  • [11] S. Borgatti and M. Everett, A graph-theoretic perspective on centrality, Social Networks, 28 (2006), 466–484.
  • [12] J. J. Brown and P. H. Reingen, Social Ties and Word-of-Mouth Referral Behavior, The Journal of Consumer Research, 14 (1987), 350–362.
  • [13] D. Centola and M. Macy, Complex contagions and the weakness of long ties, American Journal of Sociology, 113 (2007), 702–734.
  • [14] M. Cha, H. Haddadi, F. Benevenuto and K. P. Gummadi, Measuring User Influence in Twitter: The Million Follower Fallacy, in Proc. 4th Int. Conf. on Weblogs and Social Media (ICWSM), 2010.
  • [15] K. Dietz, The estimation of the basic reproduction number for infectious diseases, Statistical Methods in Medical Research, 2 (1993), 23–41.
  • [16] E. Estrada, N. Hatano and M. Benzi, The physics of communicability in complex networks, Physics Reports, 514 (2012), 89–119.
  • [17] S. Fortunato and A. Flammini, Random Walks on Directed Networks: the Case of PageRank, International Journal of Bifurcation and Chaos, 17 (2007), 2343–2353.
  • [18] L. C. Freeman, A set of measures of centrality based on betweenness, Sociometry, 40 (1977), 35–41.
  • [19] F. Gebali, Markov chains, Analysis of Computer and Communication Networks, 65:122.
  • [20] R. Ghosh and K. Lerman, Predicting Influential Users in Online Social Networks, in Proc. KDD workshop on Social Network Analysis (SNAKDD), 2010.
  • [21] R. Ghosh and K. Lerman, A Framework for Quantitative Analysis of Cascades on Networks, in Proc. Web Search and Data Mining Conference (WSDM), 2011.
  • [22] R. Ghosh and K. Lerman, Parameterized centrality metric for network analysis, Physical Review E, 83 (2011), 066118+.
  • [23] R. Ghosh, T. Surachawala and K. Lerman, Entropy-based classification of ‘retweeting’ activity on twitter, in Proc. KDD workshop on Social Network Analysis (SNA-KDD), 2011.
  • [24] D. F. Gleich, P. G. Constantine, A. D. Flaxman and A. Gunawardana, Tracking the random surfer: empirically measured teleportation parameters in PageRank, in Proc. 19th international conference on World wide web, 2010, 381–390.
  • [25] S. Goel, D. J. Watts and D. G. Goldstein, The structure of online diffusion networks, in Proc. 13th ACM Conference on Electronic Commerce (EC 2012), 2012, URL http://5harad.com/papers/diffusion.pdf.
  • [26] J. Goldenberg, B. Libai and E. Muller, Talk of the Network: A Complex Systems Look at the Underlying Process of Word-of-Mouth, Marketing Letters, 211–223.
  • [27] H. W. Hethcote, The Mathematics of Infectious Diseases, SIAM REVIEW, 42 (2000), 599–653.
  • [28] N. Hodas and K. Lerman, How limited visibility and divided attention constrain social contagion, submitted to Social Computing, 2012.
  • [29] N. O. Hodas and K. Lerman, The simple rules of social contagion, Scientific Reports, 4.
  • [30] T. Hogg and K. Lerman, Stochastic models of user-contributory web sites, in Proc. 3rd Int. Conf. on Weblogs and Social Media (ICWSM), 2009.
  • [31] T. Hogg and K. Lerman, Social dynamics of digg, to appear in EPJ Data Science.
  • [32] J. L. Iribarren and E. Moro, Impact of Human Activity Patterns on the Dynamics of Information Diffusion, Physical Review Letters, 103 (2009), 038702+.
  • [33] G. Jeh and J. Widom, Scaling personalized web search, in Proc. the 12th Int. Conf. on World Wide Web, New York, NY, USA, 2003, 271–279.
  • [34] E. Katz and P. Lazarsfeld, Personal Influence: The Part Played by People in the Flow of Mass Communications, Transaction Publishers, 2005.
  • [35] L. Katz, A new status index derived from sociometric analysis, Psychometrika, 18 (1953), 39–43.
  • [36] D. Kempe, J. Kleinberg and E. Tardos, Maximizing the spread of influence through a social network, 2003.
  • [37] C. Kiss and M. Bichler, Identification of influencers-measuring influence in customer networks, Decision Support Systems, 46 (2008), 233–253.
  • [38] S. Kotz and N. Balakrishnan, Advances in urn models during the past two decades, in Advances in combinatorial methods and applications to probability and statistics, MR1456736, Birkhauser Boston, Boston, 1997, 203–257.
  • [39] R. Lambiotte, J. C. Delvenne and M. Barahona, Laplacian dynamics and multiscale modular structure in networks.
  • [40] R. Lambiotte, R. Sinatra, J. C. Delvenne, T. S. Evans, M. Barahona and V. Latora, Flow graphs: Interweaving dynamics and structure, Physical Review E, 84 (2011), 017102+.
  • [41] C. Lee, H. Kwak, H. Park and S. Moon, Finding Influentials from Temporal Order of Information Adoption in Twitter, in Proc. 19th World-Wide Web (WWW) Conference (Poster), 2010.
  • [42] K. Lerman and R. Ghosh, Information Contagion: an Empirical Study of the Spread of News on Digg and Twitter Social Networks, in Proc. 4th Int. Conf. on Weblogs and Social Media (ICWSM), 2010.
  • [43] L. Page, S. Brin, R. Motwani and T. Winograd, The PageRank Citation Ranking: Bringing Order to the Web, Technical report, Stanford Digital Library Technologies Project, 1998.
  • [44] R. Pastor-Satorras and A. Vespignani, Epidemic spreading in scale-free networks, Physical Review Letters, 86 (2001), 3200–3203.
  • [45] B. A. Prakash, D. Chakrabarti, M. Faloutsos, N. Valler and C. Faloutsos, Threshold conditions for arbitrary cascade models on arbitrary networks, in Proc. the Int. Conf. on Data Mining, 2011.
  • [46] E. M. Rogers, Diffusion of Innovations, 5th Edition, Free Press, 2003.
  • [47] D. M. Romero, W. Galuba, S. Asur and B. A. Huberman, Influence and passivity in social media, in Proc. the 20th international Conference on World wide web, 2010.
  • [48] H. Tong, C. Faloutsos and J. Pan, Fast Random Walk with Restart and Its Applications, in ICDM ’06: Proc. the Sixth Int. Conf. on Data Mining, Washington, DC, USA, 2006, 613–622.
  • [49] M. Trusov, A. V. Bodapati and R. E. Bucklin, Determining Influential Users in Internet Social Networks, Journal of Marketing Research, XLVII (2010), 643–658.
  • [50] G. Ver Steeg, R. Ghosh and K. Lerman, What stops social epidemics?, in Proc. 5th International AAAI Conference on Weblogs and Social Media (ICWSM), 2011.
  • [51] Y. Wang, D. Chakrabarti, C. Wang and C. Faloutsos, Epidemic Spreading in Real Networks: An Eigenvalue Viewpoint, Reliable Distributed Systems, IEEE Symposium on, 0 (2003), 25+.
  • [52] D. J. Watts and P. S. Dodds, Influentials, Networks, and Public Opinion Formation, Journal of Consumer Research, 34 (2007), 441–458.