Heterogeneity shapes groups growth in social online communities
Many complex systems are characterized by broad distributions capturing, for example, the size of firms, the population of cities or the degree distribution of complex networks. Typically this feature is explained by means of a preferential growth mechanism. Although heterogeneity is expected to play a role in the evolution, it is usually not considered in the modeling, probably due to a lack of empirical evidence on how it is distributed. We characterize the intrinsic heterogeneity of groups in an online community and then show that, together with a simple linear growth and an inhomogeneous birth rate, it explains the broad distribution of group sizes.
pacs:89.20.-a, 89.20.Hh, 89.75.Fb
Many complex systems are characterized by heavy-tailed distributions, e.g., Zipf’s law (originally used to describe the frequency of words Zipf1949Human ()), Pareto’s law (originally describing the wealth of nations Pareto1964Cours ()), and more recently scale-free topologies (capturing the degree distribution of complex networks Barabasi1999Emergence ()) Newman2005Power (); Saichev2009Theory (). This property is typically perceived as a symptom of the rich-gets-richer principle, and models implementing some degree of preferential growth are usually the first approach to explaining heavy-tailed distributions Leskovec2008Microscopic (); Mislove2008Growth (); Eisenberg2003Preferential (); Yamasaki2006Preferential (); Simon1955class (); Barabasi1999Emergence (); Barabasi1999Mean-field (); Huberman1999Internet: (); Dorogovtsev2000Structure (); Bornholdt2001World (); Maruvka2011The (); Hernandez2006Clone (). In line with the rich-gets-richer principle, Gibrat’s law suggests that the expected growth of a firm, a city or a social activity is proportional to its size Gibrat1931Les (); Gabaix1999Zipf (); Rozenfeld2008Laws (); Rybski2009Scaling (). However, in general, less attention has been devoted to the time evolution of complex systems, probably due to the lack of empirical data over time (for some exceptions see Saichev2009Theory (); Barabasi02 (); Palla2007Quantifying (); Tessone10 ()). In many network growth models the time unit is mapped to the number of newly arriving elements, which makes it difficult to compare the results with real data. Moreover, many models assume that the elements are born identical, leading to correlations between age and frequency (of words, wealth, degree or size) which are not fully supported by empirical observations Adamic2000Power-Law (). In many real systems, especially in social systems, individuals or elements are very diverse.
In this direction, some models incorporating heterogeneity in the form of fitness, hidden variables or ranking have been proposed Caldarelli2002Scale-Free (); Soderberg2002General; Boguna2003Class (); Fortunato2006Scale-Free; Ratkiewicz2010Characterizing (). However, there is rather little empirical work showing how intrinsic heterogeneity is distributed and what role it plays in the growth of complex systems Garlaschelli2004Fitness-Dependent (); DeMasi2006Fitness (). Based on data collected on a daily basis on the time evolution of an online social system, we characterize the heterogeneity of the groups and identify this heterogeneity and the distributed birth dates as the key factors explaining the heavy-tailed distribution of group sizes and the apparent growth of groups in proportion to their size.
We study an online community called Flickr Flickr (), where members can create and join groups. The groups in Flickr are mainly used to collaboratively post photos associated with the theme of the group. We will consider each group as an element of the system characterized by the number of members belonging to the group (group size). We have collected two datasets containing in total over 260,000 member-created groups in Flickr, which accounted for over 65% of all public groups existing in Flickr. The first dataset has a high temporal resolution and a wide time window. It contains 9,503 groups tracked for 350 days, between June 5, 2008 and May 20, 2009, by the publicly accessible external service called GroupTrackr GroupTrackr (). The service tracked the number of members of the groups on a daily basis. The second dataset has a shorter time window and minimal temporal resolution, but it covers a larger number of groups. It contains over 260,000 public groups for which we gathered information on the number of members, collected in two snapshots on December 18, 2009 and January 29, 2010. For these groups we also gathered estimated information on their birth date. As an estimate of the group birth date we take the time when the first photo was posted to the group pool, as the first photo is normally posted to the pool soon after the group creation. The oldest groups in our dataset date back to July 16, 2004.
II. Groups’ growth in Flickr
We first analyze the time evolution of groups. In Fig. 1a we show how typical groups grow in number of members on a daily basis over a period of one year. To a first approximation, linear growth captures the individual trends (despite evident deviations in the form of sudden jumps). We have performed a linear regression of the time evolution of the sizes of 9,503 groups over a period of almost one year. For about half of these groups the coefficient of determination $R^2$ is above 0.95, and this fraction is higher among groups with more than 1000 members. The difference arises because larger groups are less affected by fluctuations of size. Aggregated residual plots do not show any clear trend deviating from our linear model. The time series cover a considerable part of the average lifespan of the groups. Thus, we consider that groups grow linearly in time, so that the size $s_i$ of group $i$ evolves as
$$ s_i(t) = g_i\,(t - t_i) = g_i\,\tau_i, \qquad (1) $$
where $g_i$ is the growth per unit of time, $t_i$ is the birth date and $\tau_i = t - t_i$ is the current age of group $i$. We estimate the two parameters of Eq. (1) for 260,000 groups. The growth $g_i$ of each group is calculated as the change of its size over 6 weeks (the interval between the two snapshots), expressed per day. A log-normal distribution provides the best fit to the distribution of growth values (Fig. 1b), with average $\mu$ and standard deviation $\sigma$. Finally, we estimated the current ages $\tau_i$ of all groups, finding that the number of groups created daily has been growing (almost linearly) in time (Fig. 1c).
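The estimation steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`linear_fit_r2`, `snapshot_growth`, `lognormal_params`) and the synthetic time series are hypothetical, the log-normal parameters are estimated here simply as the mean and standard deviation of the logarithm of the growth values, and nothing below reproduces the actual Flickr data or the fitting procedure used for Fig. 1b.

```python
import math
import random
import statistics


def linear_fit_r2(days, sizes):
    """Ordinary least-squares line through (days, sizes): slope, intercept, R^2."""
    mean_x = statistics.fmean(days)
    mean_y = statistics.fmean(sizes)
    sxx = sum((x - mean_x) ** 2 for x in days)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, sizes))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(days, sizes))
    ss_tot = sum((y - mean_y) ** 2 for y in sizes)  # zero for a constant series
    return slope, intercept, 1.0 - ss_res / ss_tot


def snapshot_growth(size_first, size_second, days_between):
    """Growth per day estimated from two size snapshots."""
    return (size_second - size_first) / days_between


def lognormal_params(growths):
    """Mean and standard deviation of ln(g) over the positive growth values."""
    logs = [math.log(g) for g in growths if g > 0]
    return statistics.fmean(logs), statistics.stdev(logs)


# Synthetic daily series for one hypothetical group: linear trend plus noise.
rng = random.Random(0)
days = list(range(350))
sizes = [100 + 2.5 * d + rng.gauss(0, 5) for d in days]
slope, intercept, r2 = linear_fit_r2(days, sizes)
print(f"fitted growth per day: {slope:.3f}, R^2: {r2:.4f}")

# Two snapshots taken 42 days (6 weeks) apart.
print("snapshot growth per day:", snapshot_growth(1000, 1084, 42))
```

Groups that shrink or stay constant between the snapshots have no logarithm of growth; the sketch simply skips them, which is a simplification rather than a statement about how the original analysis treated such groups.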
III. Linear growth model with heterogeneous birth and growth
Based on these findings we propose a minimal model of the time evolution of group sizes in Flickr: a linear growth model with heterogeneous birth and growth, which we will refer to in short as the heterogeneous linear growth model. The model proceeds as follows. At each time step $t$: (i) new groups are created in the system, their number per time step increasing linearly with $t$. Each newly created group $i$ starts with one member and is assigned its own growth value $g_i$, drawn from a log-normal distribution. The growth value remains unchanged throughout the simulation; (ii) the size of each group $i$ is increased by $g_i$.
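The two update rules of the model can be written as a short simulation. The sketch below is a minimal illustration, not the code used for the results reported here: the function name `simulate_groups`, the birth-rate slope and the log-normal parameters `mu` and `sigma` are hypothetical placeholders (the fitted values are not reproduced), and the run is far shorter than the full simulation.

```python
import random


def simulate_groups(n_days, births_per_day_slope, mu, sigma, seed=0):
    """Run the two-step update for n_days; return (birth_day, growth, size) per group."""
    rng = random.Random(seed)
    birth_days, growths, sizes = [], [], []
    for t in range(1, n_days + 1):
        # (i) the number of groups born at step t increases linearly with t
        n_new = round(births_per_day_slope * t)
        for _ in range(n_new):
            birth_days.append(t)
            growths.append(rng.lognormvariate(mu, sigma))  # fixed for the whole run
            sizes.append(1.0)                               # each group starts with one member
        # (ii) every group grows by its own constant growth value
        for i, g in enumerate(growths):
            sizes[i] += g
    return list(zip(birth_days, growths, sizes))


# Hypothetical parameter values, chosen only to keep the example small.
groups = simulate_groups(n_days=200, births_per_day_slope=0.05, mu=-3.0, sigma=1.5)
final_sizes = sorted(size for _, _, size in groups)
print(len(groups), "groups; median size", round(final_sizes[len(final_sizes) // 2], 2),
      "; largest size", round(final_sizes[-1], 2))
```

Because each group grows by a constant amount per step, the final size could equally be computed in closed form from the birth day and the growth value; the explicit loop is kept only to mirror the step-by-step description of the model.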
We have run numerical simulations of the heterogeneous linear growth model where each time step of the simulation corresponds to a single day. We have simulated 1959 days in Flickr, from the moment when the first group from our dataset appeared. As a result of the numerical simulations we obtain the daily evolution of the sizes of over 260,000 artificial groups. The distribution of the final sizes of the groups reproduces the observed distribution with good agreement (Fig. 2a). As can be seen from Fig. 2a, there is a small divergence for large group sizes, which could be explained by deviations from the linear growth assumption, mostly for small groups. The strong fluctuations in the time evolution of the sizes of small groups (see the jumps in Fig. 1) lead to an ’apparent’ growth larger than the real one. Their growth is therefore overestimated and, as a consequence, the model displays a larger number of big groups than the real system.
The average growth of groups of the same size, $\langle g\rangle(s)$, shows that bigger groups grow faster (Fig. 2b), both for the real data and for the model, in accordance with Gibrat’s law: $\langle g\rangle(s) \propto s$. This result is obtained even though the microscopic rules of the model do not implement the rich-gets-richer principle. The average growth is an average over all groups of a given size, each of them growing linearly. Due to the heterogeneity and the linear growth, at a given time larger groups comprise both old groups that grow slowly and younger groups that grow faster. Thus, the observation of preferential growth for groups of the same size does not reflect in this case an underlying rich-gets-richer principle; rather, it is a consequence of the competition of groups with different growth values and ages.
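How an apparent proportional growth can arise without any rich-gets-richer rule can be checked numerically by averaging growth values within size bins. The sketch below is an illustration under assumptions: sizes are generated directly as growth times age, birth days are drawn with a density proportional to time (a linearly increasing birth rate), and the function names, bin width and log-normal parameters are hypothetical.

```python
import math
import random
from collections import defaultdict


def average_growth_by_size(groups, bins_per_decade=2):
    """Map each logarithmic size bin (lower edge) to (mean growth, group count)."""
    total = defaultdict(float)
    count = defaultdict(int)
    for growth, size in groups:
        b = math.floor(math.log10(size) * bins_per_decade)
        total[b] += growth
        count[b] += 1
    return {10 ** (b / bins_per_decade): (total[b] / count[b], count[b])
            for b in sorted(total)}


def synthetic_groups(n_groups, n_days, mu, sigma, seed=0):
    """Groups with independent age and growth; returns (growth, size) pairs."""
    rng = random.Random(seed)
    groups = []
    for _ in range(n_groups):
        birth_day = n_days * math.sqrt(rng.random())  # birth density proportional to t
        age = n_days - birth_day + 1.0                 # always at least one day old
        growth = rng.lognormvariate(mu, sigma)         # drawn independently of age
        groups.append((growth, growth * age))          # size = growth * age
    return groups


table = average_growth_by_size(synthetic_groups(20000, 1000, mu=-3.0, sigma=1.5))
for lower_edge, (mean_growth, n) in table.items():
    if n >= 100:  # skip sparsely populated bins
        print(f"size >= {lower_edge:9.2f}  mean growth {mean_growth:8.3f}  ({n} groups)")
```

In this construction no group's growth depends on its size, yet the mean growth per size bin increases with size, because a large size is more likely to be reached by a group with a large growth value.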
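The statistical properties of the model can be estimated analytically. From the definition, the average growth of groups of the same size is given by:
$$ \langle g\rangle(s) = \frac{1}{P(s)} \int_{g_{\min}}^{\infty} g\,P(s,g)\,dg, \qquad (2) $$
where $P(s,g)$ is the joint probability of having a group of size $s$ and growth rate $g$, and $P(s) = \int_{g_{\min}}^{\infty} P(s,g)\,dg$. The lower limit of the integral, $g_{\min} = s/\tau_{\max}$, is given by Eq. (1) and depends on $s$, and the maximum value of the group age $\tau$ is limited to $\tau_{\max} = t - t_0$, if the first group was created at time $t_0$. We transform Eq. (2) by replacing the joint probability $P(s,g)$ with $P(g,\tau)/g$, where $\tau = s/g$, and making the assumption that $g$ and $\tau$ are independent random variables with distributions $P_g(g)$ and $P_\tau(\tau)$:
$$ \langle g\rangle(s) = \frac{\int_{g_{\min}}^{\infty} P_g(g)\,P_\tau(s/g)\,dg}{\int_{g_{\min}}^{\infty} g^{-1}\,P_g(g)\,P_\tau(s/g)\,dg}. \qquad (3) $$
Under the same assumptions the size distribution is $P(s) = \int_{g_{\min}}^{\infty} g^{-1}\,P_g(g)\,P_\tau(s/g)\,dg$; it is plotted in Fig. 2a. As one can see, the solutions for both the average growth and the size distribution are in good correspondence with the results of numerical simulations, which indicates that the assumptions of independent random variables and linear growth are reasonable. (Footnote 1: The expressions for $\langle g\rangle(s)$ and $P(s)$ are easy to solve if $g$ and $\tau$ are independent random variables and $P_g(g)$ is a power-law distribution. In such a case one can show that $\langle g\rangle(s) \propto s$ and that $P(s)$ is a power-law as well.)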
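The power-law case mentioned in the footnote can be worked out explicitly. The derivation below is a sketch under assumptions that are not taken from the source: the growth distribution is written as $P_g(g) = C g^{-\beta}$ over the relevant range, with a hypothetical exponent $\beta > 1$, the age distribution $P_\tau$ is arbitrary on $(0, \tau_{\max}]$, and the moments $M_k$ defined below are assumed to exist. It starts from linear growth, $s = g\tau$, with $g$ and $\tau$ independent, and uses the substitution $\tau = s/g$, which maps $g \in [s/\tau_{\max}, \infty)$ onto $\tau \in (0, \tau_{\max}]$.

```latex
% Assumed notation: P_g(g) = C g^{-\beta},  M_k = \int_0^{\tau_{\max}} \tau^{k} P_\tau(\tau)\, d\tau.
\begin{align}
P(s,g) &= \frac{1}{g}\, P_g(g)\, P_\tau\!\left(\frac{s}{g}\right), \\
P(s) &= \int_{s/\tau_{\max}}^{\infty} C\, g^{-\beta-1}\, P_\tau\!\left(\frac{s}{g}\right) dg
      = C\, s^{-\beta} \int_{0}^{\tau_{\max}} \tau^{\beta-1} P_\tau(\tau)\, d\tau
      = C\, M_{\beta-1}\, s^{-\beta}, \\
\int_{s/\tau_{\max}}^{\infty} g\, P(s,g)\, dg
     &= C\, s^{1-\beta} \int_{0}^{\tau_{\max}} \tau^{\beta-2} P_\tau(\tau)\, d\tau
      = C\, M_{\beta-2}\, s^{1-\beta}, \\
\langle g \rangle(s) &= \frac{C\, M_{\beta-2}\, s^{1-\beta}}{C\, M_{\beta-1}\, s^{-\beta}}
      = \frac{M_{\beta-2}}{M_{\beta-1}}\; s .
\end{align}
```

After the change of variables the integration limits no longer depend on $s$, so the size dependence factors out: $P(s)$ inherits the exponent of the growth distribution and the average growth is proportional to size, whatever the shape of the age distribution.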
IV. Heterogeneity vs. preferential growth
We have shown that the heterogeneous linear growth model captures the statistical properties which are commonly attributed to the preferential growth mechanism. Thanks to the intrinsic heterogeneity, different growth patterns are permitted, even if groups have the same number of members at some point in time. One can see an example of this in Fig. 1a, where the growth curves of different groups cross, each group continuing to grow after the crossing as it did before. To make a direct comparison between the two mechanisms, heterogeneity vs. preferential growth, we consider the Simon model Simon1955class (). The Simon model was originally proposed to explain the distribution of word frequencies in a written text. At every time step, a word is added to the text: with a given probability it is a new word; otherwise, the word is chosen at random from the text, so the words which appear more frequently are chosen more often. We have adapted the Simon model to our system. We have set the parameters to obtain the same total number of groups and members as in the real case; in addition, the number of new groups created in each time step of the Simon model grows linearly, to isolate the effect of the heterogeneity. First, in the Simon model the final size of groups is heavily determined by their initial size measured one year before (Fig. 3a), thus there is little heterogeneity among the groups, in contrast to the heterogeneous linear growth model, which displays a degree of heterogeneity similar to that of real groups. Second, for the Simon model the correlation of size and age is strong, while it is weak for real groups and the heterogeneous linear growth model (Figs. 3b-d). The wide spread of group sizes corresponds to the high heterogeneity of groups, which is not captured by the preferential growth model (a similarly weak correlation has been observed in other systems, for instance in the World Wide Web, where the number of links to a page is not strongly correlated with the age of the page Adamic2000Power-Law ()). (Footnote 2: In the heterogeneous linear growth model the average size of groups of a given age $\tau$ is $\langle s\rangle(\tau) = \langle g\rangle\,\tau$, where the mean growth $\langle g\rangle$ is fixed by the two parameters of the log-normal distribution. In the Simon model, it depends on the age of the system, on the parameter controlling the number of new users introduced into the system in each time step, and on the probability of new group creation within the model.)
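For comparison, the basic Simon process described above can be sketched as follows, with groups in the role of words and members in the role of word occurrences. This is the textbook version with a constant probability `p_new` of creating a new group, not the adapted variant with a linearly growing number of new groups; the function name, parameter values and seed are hypothetical.

```python
import random


def simon_model(n_steps, p_new, seed=0):
    """Each step adds one member: to a new group with probability p_new,
    otherwise to the group of a uniformly chosen existing member."""
    rng = random.Random(seed)
    sizes = [1]        # sizes[i] = number of members of group i (index = birth order)
    memberships = [0]  # one entry per member, holding that member's group index
    for _ in range(n_steps):
        if rng.random() < p_new:
            sizes.append(1)
            memberships.append(len(sizes) - 1)
        else:
            group = rng.choice(memberships)  # probability proportional to group size
            sizes[group] += 1
            memberships.append(group)
    return sizes


sizes = simon_model(n_steps=20000, p_new=0.05)
half = len(sizes) // 2
print("groups:", len(sizes))
print("mean size, older half:  ", sum(sizes[:half]) / half)
print("mean size, younger half:", sum(sizes[half:]) / (len(sizes) - half))
```

Picking a uniformly random member and joining that member's group is what makes the choice proportional to group size. Since all groups follow the same rule, older groups end up larger on average, which is the strong size-age correlation discussed in the text.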
In summary, we have proposed a simple growth model of heterogeneous elements with associated growing counters, based on the findings for a social system in an online community. We found that the model captures many of the features of the real system of online groups, namely the heavy-tailed distribution of group sizes, the average growth proportional to the current size of groups and the weak correlation between the age and the size of groups. Furthermore, we made a direct comparison of the heterogeneous linear growth model with a preferential growth model and showed the similarities and the differences between these models. In the heterogeneous linear growth model the heavy-tailed distribution of final sizes of elements does not emerge from the growth process itself (e.g., the rich-gets-richer principle), but from the intrinsic heterogeneity of the elements which take part in this growth process. This does not answer the question of why some groups grow faster than others, as we do not yet understand what factors influence the fitness of the groups. However, it shows that faster growth does not have to be due to one group being bigger than another, as in preferential attachment models. The simplicity of our approach suggests that the characterization of the heterogeneity may play an important role in understanding the origin of broad distributions and the time evolution of many real systems.
We thank Dario Taraborelli for access to the GroupTrackr data and for help with the further data collection process. P.A.G. and V.M.E. acknowledge partial support from the NEST program of the European Commission through the PATRES project, and from MICINN (Spain) through projects FISICOS (FIS2007-60327) and MODASS (FIS2011-247852); P.A.G. acknowledges support from the JAEPredoc program of CSIC (Spain).
- (1) G.K. Zipf, Human Behaviour and the Principle of Least Effort: An Introduction to Human Ecology (Addison-Wesley Press, Reading MA, 1949).
- (2) V. Pareto, Cours d’Économie Politique (Librairie Droz, Geneva, 1964).
- (3) A.-L. Barabási and R. Albert, Science 286, 509–512 (1999).
- (4) M.E.J. Newman, Contemporary Physics 46, 323–351 (2005).
- (5) A. Saichev, Y. Malevergne and D. Sornette, Theory of Zipf’s Law and Beyond (Springer, New York, 2009).
- (6) J. Leskovec, L. Backstrom, R. Kumar and A. Tomkins, in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’08, 462–470 (ACM, Las Vegas, Nevada, USA, 2008).
- (7) A. Mislove, H.S. Koppula, K.P. Gummadi, P. Druschel and B. Bhattacharjee, in Proceedings of the First Workshop on Online Social Networks - WOSP ’08, 25–30 (ACM, Seattle, WA, USA, 2008).
- (8) E. Eisenberg and E.Y. Levanon, Phys. Rev. Lett. 91, 138701 (2003).
- (9) K. Yamasaki et al., Phys. Rev. E 74, 035103 (2006).
- (10) H.A. Simon, Biometrika 42, 425–440 (1955).
- (11) A.-L. Barabási, R. Albert and H. Jeong, Physica A 272, 173–187 (1999).
- (12) B.A. Huberman and L.A. Adamic, Nature 401, 131 (1999).
- (13) S.N. Dorogovtsev, J.F.F. Mendes and A.N. Samukhin, Phys. Rev. Lett. 85, 4633–4636 (2000).
- (14) S. Bornholdt and H. Ebel, Phys. Rev. E 64, 035104 (2001).
- (15) Y.E. Maruvka, D.A. Kessler and N.M. Shnerb, PLoS ONE 6, e26480 (2011).
- (16) E. Hernandez-Garcia, A.F. Rozenfeld, V.M. Eguiluz, S. Arnaud-Haond and C.M. Duarte, Physica D 214, 166–173 (2006).
- (17) R. Gibrat, Les Inégalités Économiques (Librairie du Recueil Sirey, Paris, 1931).
- (18) X. Gabaix, Q. J. Econ. 114, 739–767 (1999).
- (19) H.D. Rozenfeld, D. Rybski, J.S. Andrade Jr., M. Batty, H.E. Stanley and H.A. Makse, Proc. Natl. Acad. Sci. U.S.A. 105, 18702 (2008).
- (20) D. Rybski, S.V. Buldyrev, S. Havlin, F. Liljeros and H.A. Makse, Proc. Natl. Acad. Sci. U.S.A. 106, 12640 (2009).
- (21) A.-L. Barabási, H. Jeong, Z. Néda, E. Ravasz, A. Schubert and T. Vicsek, Physica A 311, 590–614 (2002).
- (22) G. Palla, A.-L. Barabási and T. Vicsek, Nature 446, 664 (2007).
- (23) C.J. Tessone, M.M. Geipel and F. Schweitzer, arXiv:1007.1330.
- (24) L.A. Adamic and B.A. Huberman, Science 287, 2115 (2000).
- (25) G. Caldarelli, A. Capocci, P. De Los Rios and M.A. Muñoz, Phys. Rev. Lett. 89, 258702 (2002).
- (26) B. Söderberg, Phys. Rev. E 66, 066121 (2002).
- (27) M. Boguñá and R. Pastor-Satorras, Phys. Rev. E 68, 036112 (2003).
- (28) S. Fortunato, A. Flammini and F. Menczer, Phys. Rev. Lett. 96, 218701 (2006).
- (29) J. Ratkiewicz, S. Fortunato, A. Flammini, F. Menczer and A. Vespignani, Phys. Rev. Lett. 105, 158701 (2010).
- (30) D. Garlaschelli and M.I. Loffredo, Phys. Rev. Lett. 93, 188701 (2004).
- (31) G. De Masi, G. Iori and G. Caldarelli, Phys. Rev. E 74, 066112 (2006).
- (32) http://www.flickr.com.
- (33) http://nitens.org/taraborelli/webcommunities.