Characterizing and modeling the dynamics of online popularity
Online popularity has enormous impact on opinions, culture, policy, and profits. We provide a quantitative, large scale, temporal analysis of the dynamics of online content popularity in two massive model systems, the Wikipedia and an entire country’s Web space. We find that the dynamics of popularity are characterized by bursts, displaying characteristic features of critical systems such as fat-tailed distributions of magnitude and inter-event time. We propose a minimal model combining the classic preferential popularity increase mechanism with the occurrence of random popularity shifts due to exogenous factors. The model recovers the critical features observed in the empirical analysis of the systems analyzed here, highlighting the key factors needed in the description of popularity dynamics.
The dynamics of information and opinions have been deeply affected by the existence of Web-mediated brokers such as blogs, wikis, folksonomies, and search engines, through which anyone can easily publish and promote content online. This “second age of information” is driven by the economy of attention, first theorized by Simon (1971). Sources receiving a lot of attention become popular and have formidable power to impact opinions, culture, and policy, as well as advertising profit. The Web 2.0 and social media Tapscott and Williams (2006) not only modify traditional communication processes with new types of phenomena, but also generate a huge amount of time-stamped data, making it possible for the first time to study the dynamics of online popularity at the global system scale.
In this letter we focus on the dynamics of popularity of Wikipedia topics and Web pages. As popularity proxies we have chosen the traffic of a document, expressed by the number of clicks to that page generated by a specific population of users, and the number of hyperlinks pointing to a document. It is well documented that the statistical properties of these variables in the Web are very heterogeneous, with distributions characterized by fat tails roughly following power-law behavior Albert et al. (1999); Broder et al. (2000); Meiss et al. (2005, 2008). Such distributions have been explained with models based on the rich-get-richer mechanism Simon (1955); de Solla Price (1976); Barabasi and Albert (1999), but validating these models against dynamical behavior is problematic, mainly because of the difficulty of gathering relevant data. The data sets utilized here, however, contain temporal information that makes it possible to observe the growth in popularity of individual topics or pages, and allows us to statistically characterize the microdynamics by which online documents gather popularity.
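Table 1. Basic statistics of the data sets.
| Data set | Pages | Period | Temporal resolution |
|---|---|---|---|
| Wiki | 3,293,102 | Jan 2001 – Mar 2007 | 1 sec. |
| Wiki | 3,490,740 | Feb 2008 – Current | 1 hour |
| Chile | 3,252,779 | 2001 – 2006 | 1 year |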
Prior work on popularity dynamics has focused on news Wu and Huberman (2007); Dezso et al. (2006), videos Szabo and Huberman (2008); Crane and Sornette (2008) and music Salganik et al. (2006). Here, we analyze three large scale data sets that we assembled about two information networks: the entire Wikipedia and the Chilean Web. Wikipedia is a large collaborative online encyclopedia with millions of articles and hundreds of thousands of registered contributors (en.wikipedia.org). By mining the full edit history of every article, we were able to reconstruct the entire Wikipedia structure at any past point in time. The raw data was available until March 2007 (download.wikimedia.org). Traffic data with hourly temporal resolution was obtained by cross-referencing with a separate data set originating from Wikipedia proxy server logs (dammit.lt/wikistats). Our third data source is a yearly sequence of crawls of the Chilean Web, made available by courtesy of the TodoCL search engine (www.todocl.com). This data consists of one complete crawl of the .cl top-level domain for each of the years 2002–2006. Basic statistics on each data set are shown in Table 1. The representative graphs of these data sets have an approximately power-law distribution of indegree Baeza-Yates and Poblete (2006); Capocci et al. (2006); Zlatic et al. (2006), like the Web graph at large.
In order to gauge quantitatively the popularity of documents we consider the number of hyperlinks pointing to a page (indegree in the graph representation of the Web Albert et al. (1999)), and the traffic of the page, expressed by the number of clicks to it. Given either of these two popularity proxies $s_t$ at time $t$, we study its logarithmic derivative $(\Delta s/s)_t = (s_t - s_{t-1})/s_{t-1}$, which represents the relative variation of the measure in the time unit.
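As an illustration of this quantity, the following minimal sketch computes the relative variation between consecutive samples of a popularity series. The function name `log_derivative` and the sample counts are hypothetical, and marking steps with a zero previous value as `None` is an assumption of this sketch rather than a choice documented in the text.

```python
def log_derivative(series):
    """Relative change between consecutive samples: (s[t] - s[t-1]) / s[t-1]."""
    out = []
    for prev, curr in zip(series, series[1:]):
        # The ratio is undefined when the previous value is zero; mark it as None.
        out.append((curr - prev) / prev if prev != 0 else None)
    return out


# Hypothetical indegree counts of one page at five consecutive time steps.
indegree = [10, 10, 15, 30, 33]
rates = log_derivative(indegree)
print(rates)
```

A series of length n yields n - 1 values, one per pair of consecutive samples.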
Fig. 1 shows the logarithmic derivative of the indegree vs time for an example page in the English Wikipedia. Despite a roughly exponential growth, the logarithmic derivative provides a signature by which different topics can be compared on the same scale. Almost all pages experience a burst in the logarithmic derivative near the beginning of their life. Many pages receive little attention thereafter. While some pages maintain a nearly constant positive logarithmic derivative, indicating exponential growth, a number of pages continue to experience intermittent bursts later in their life, as in the example.
The distribution of burst magnitude for the two popularity measures at representative time resolutions is illustrated in Figs. 2a–c. In all cases and at all granularities we observe heavy-tailed behavior. Such heavy-tailed burst magnitude distributions suggest a dynamics lacking a characteristic scale. This is typical of a wide range of “critical” physical, economic, and social systems, such as avalanches, earthquakes, stock market crashes and human communication Barabási (2005); Mandelbrot (1997); Stanley et al. (1996); Gutenberg and Richter (1944); Rybski et al. (2009). Further evidence comes from the study of the distribution of the length of inter-event intervals. For each document we record the time stamp of each burst event, i.e., each time step at which the logarithmic derivative exceeds a fixed threshold, and measure the inter-event times between consecutive events. The distributions of inter-event times in the different data sets (Fig. 2d) are not Poissonian, as expected by queueing theory in traditional systems, but follow a power law with a finite-size cutoff, as in Omori’s law of earthquakes Omori (1894) and other self-organized criticality phenomena Bak et al. (1987).
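The inter-event measurement can be sketched as follows, assuming that an event is a time step whose logarithmic derivative exceeds a threshold. The function name `inter_event_times`, the threshold value of 1.0, and the sample series are all hypothetical and chosen only for illustration.

```python
def inter_event_times(rates, threshold):
    """Gaps between consecutive time steps whose rate exceeds the threshold."""
    # Time stamps (indices) of the events; undefined rates (None) are skipped.
    event_times = [t for t, x in enumerate(rates) if x is not None and x > threshold]
    return [later - earlier for earlier, later in zip(event_times, event_times[1:])]


# Made-up logarithmic derivative series for a single document.
rates = [0.0, 2.5, 0.1, 0.0, 1.8, 0.2, 0.0, 0.0, 0.0, 3.1]
gaps = inter_event_times(rates, threshold=1.0)
print(gaps)
```

Pooling the gaps over all documents gives the sample whose distribution would then be compared against an exponential or a power-law form.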
The clear evidence for the bursty behavior of online popularity dynamics calls for a stylized model able to explain the observed features in terms of the already acquired popularity of each page and the shifts in collective attention triggered by exogenous events.
The rich-get-richer mechanism can be simulated with the classic linear preferential attachment model Barabasi and Albert (1999), in its directed version Dorogovtsev et al. (2000), or with the ranking model by Fortunato et al. (2006). In the latter, items are ranked according to their popularity $s$, and the probability that an existing item $i$ receives a unit (e.g., a click) is $p(i) \propto r_i^{-\alpha}$, where $r_i$ is the rank of $i$ and $\alpha$ is a free parameter that tunes the power-law popularity distribution $P(s) \sim s^{-\gamma}$, such that $\gamma = 1 + 1/\alpha$. Both preferential attachment and ranking models, however, fail to reproduce the long tails observed in the burst magnitude distributions of both indegree and traffic (Figs. 3a-b). Neither model accounts for the occurrence of exogenous factors that shift the attention of users and suddenly increase the popularity of specific topics because of events such as an actor winning a prize, political elections, etc. The minimal assumption in modeling exogenous perturbations consists in considering external stochastic events that interfere with the basic rich-get-richer mechanism by suddenly changing the popularity of a topic. The simplest way to implement this mechanism consists in introducing in the ranking model a reranking probability $\mu$, such that at each iteration every item is moved, with probability $\mu$, to a new position toward the front of the list, chosen randomly with equal probability between 1 (the top position) and the node’s current rank $r_i$. We call this the rank-shift model (see EPAPS, ref. 28).
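The two ingredients described above can be sketched in a few lines. This is a simplified illustration, not the simulation code used for the results reported here. It assumes a fixed number of items and treats the position in an explicit list as the rank, without re-sorting by accumulated popularity. It also names the ranking exponent `alpha` and the reranking probability `mu`. The function name and all parameter values are hypothetical.

```python
import random
from itertools import accumulate


def simulate_rank_shift(n_items, n_steps, alpha, mu, seed=0):
    """Toy rank-shift dynamics on a fixed list of items."""
    rng = random.Random(seed)
    order = list(range(n_items))   # order[r - 1] is the item currently at rank r
    popularity = [0] * n_items
    # Cumulative weights r^-alpha for ranks 1..n_items.
    cum_weights = list(accumulate(r ** -alpha for r in range(1, n_items + 1)))
    for _ in range(n_steps):
        # Rich-get-richer step: one unit goes to a rank drawn with p(r) ~ r^-alpha.
        pos = rng.choices(range(n_items), cum_weights=cum_weights)[0]
        popularity[order[pos]] += 1
        # Rank-shift step: each item jumps toward the front with probability mu.
        for pos in range(n_items):
            if rng.random() < mu:
                new_pos = rng.randint(0, pos)   # uniform over ranks 1..current rank
                order.insert(new_pos, order.pop(pos))
    return order, popularity


order, popularity = simulate_rank_shift(n_items=100, n_steps=5000, alpha=1.0, mu=0.001)
print(sorted(popularity, reverse=True)[:5])
```

Setting `mu=0` in this sketch removes the rank-shift step, so units are distributed by rank alone.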
In Fig. 4a and 4b we show the indegree distribution of the rank-shift model for several values of the reranking probability $\mu$. The ranking model ($\mu = 0$) yields the slope indicated by the dashed line. The reranking probability introduces an exponential cutoff in the distribution, which becomes relevant only for values of $\mu$ larger than the one used in our simulations.
The distribution of $\Delta k/k$, the relative change of the degree $k$ of a node, shows two distinctive features, which are remarkably found in the empirical distributions: a maximum located in the range 0.01–0.1 and a fat tail. Since the reranking probability is low, to understand the existence and the location of the maximum it is convenient to consider the model in the absence of the reranking mechanism. At a large time $t$, the expected value of the degree $k_r$ of the node with rank $r$ is proportional to $M r^{-\alpha}$, where $M$ is the number of links present in the network at time $t$ and $\alpha$ is the exponent of the ranking model. Let $\Delta M$ be the number of links added during the interval $\Delta t$ at whose extremes the ratio $\Delta k/k$ is computed. Let $\Delta M \ll M$, an assumption verified in our calculations. Therefore, one can safely assume that in the period $\Delta t$ the addition of new links does not significantly affect the degree of nodes and their relative ranking. So one can regard the growth process as a multinomial process with probabilities $p_r \propto r^{-\alpha}$. The expected number of new links acquired by a node of rank $r$ is therefore $\Delta k_r = \Delta M\, p_r$. The assumption of (almost) stationarity also provides that $k_r = M p_r$. We therefore expect $\Delta k/k$ for a node to be distributed around $\Delta M/M$, regardless of the node. In Fig. 4c we compare the simulation of the ranking model with that of the multinomial process, using the parameters of the Wikipedia data set of January 2003, which represents an ideal tradeoff between the need for a sufficient number of bursts and a system size not too large for the model to run. The number of nodes/pages and the number of hyperlinks were taken from this data set. Based on the above discussion we expect to observe a maximum in the distribution of $\Delta k/k$ located at $\Delta M/M$. This is exactly where the maxima of the empirical distributions of popularity bursts are located (see Fig. 2a).
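The multinomial argument can be checked numerically with a small sketch. It assumes that each node's degree equals its stationary expectation, proportional to its rank raised to the power $-\alpha$, and that new links are drawn independently with the same probabilities. The function name `burst_ratios` and the values for the number of nodes, existing links, and added links are hypothetical and do not correspond to the Wikipedia parameters.

```python
import random
from collections import Counter
from statistics import median


def burst_ratios(n_nodes, m_links, dm_links, alpha, seed=0):
    """Relative degree change of each node after adding dm_links new links."""
    rng = random.Random(seed)
    weights = [r ** -alpha for r in range(1, n_nodes + 1)]
    total = sum(weights)
    # Stationary expectation: the degree of the rank-r node is M * p_r.
    degree = [m_links * w / total for w in weights]
    # Multinomial growth: each new link lands on rank r with probability p_r.
    hits = Counter(rng.choices(range(n_nodes), weights=weights, k=dm_links))
    return [hits[i] / degree[i] for i in range(n_nodes)]


ratios = burst_ratios(n_nodes=100, m_links=1_000_000, dm_links=50_000, alpha=1.0)
# The ratios should cluster around dm_links / m_links, whatever the rank.
print(median(ratios), 50_000 / 1_000_000)
```

The scatter around the ratio of added to existing links shrinks as the number of added links per node grows, which is why low-ranked nodes show the widest spread.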
The ranking model cannot reproduce the fat tail observed in the real data. This is the reason why we introduced the reranking mechanism in our model. Here, it is the nodes that are suddenly promoted to a higher rank that are responsible for the high values of $\Delta k/k$ in the simulations. We consider a node that at time $t$ (the reference time at which we start measuring $\Delta k/k$) has rank $r$, and is immediately promoted to rank $r'$, with $r'$ chosen uniformly in $[1, r]$. Under the same assumption of stationarity that we made above, the expected degree of the node before promotion is $k \propto M r^{-\alpha}$, where $M$ is the number of links at time $t$ and $\alpha$ is the exponent of the ranking model. Let us further make two simplifying assumptions, which hold for the parameters used in our model. Since the reranking probability is small, we can safely assume that no node is reranked more than once during the observation time $\Delta t$. The expected number of links collected during the period $\Delta t$ is then $\Delta k \propto \Delta M\, r'^{-\alpha}$, where $\Delta M$ is the number of links added in $\Delta t$ and the normalization is the same as for $k$. We expect therefore $\Delta k/k \simeq (\Delta M/M)(r/r')^{\alpha}$. It is straightforward to derive the distribution of $\Delta k/k$ for a generic node that is promoted at the beginning of $\Delta t$ by considering all pairs of values $(r, r')$, with $r'$ uniformly distributed in $[1, r]$. We find $P(\Delta k/k) \sim (\Delta k/k)^{-(1+1/\alpha)}$. In Fig. 4d we highlight the tail of the distribution as produced by the rank-shift model and our expectation for its slope: the match is surprisingly good.
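The tail exponent can be obtained with a change of variables. The sketch below is a reconstruction under stated assumptions, not the authors' own derivation. It assumes that the old rank is large enough for the ratio of new to old rank, written $u$, to be treated as a continuous uniform variable on $(0, 1]$, and that the expected degrees before and after promotion scale with the rank raised to the power $-\alpha$. The symbols $x$, $u$ and $c$ are introduced here for convenience.

```latex
\begin{align}
  x &\equiv \frac{\Delta k}{k}
     \simeq \frac{\Delta M}{M}\left(\frac{r}{r'}\right)^{\alpha}
     = c\,u^{-\alpha},
     \qquad u \equiv \frac{r'}{r} \in (0, 1],
     \quad c \equiv \frac{\Delta M}{M}, \\
  P(x) &= P(u)\left|\frac{du}{dx}\right|
        = \frac{1}{\alpha c}\left(\frac{x}{c}\right)^{-1-1/\alpha}
        \propto x^{-(1+1/\alpha)},
        \qquad x \ge c .
\end{align}
```

Since the result does not depend on the old rank, averaging over promoted nodes of different ranks leaves the exponent unchanged.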
Simulations of the rank-shift model were performed using parameters matching those from the empirical data (e.g., the number of nodes for the Wikipedia in 2003); the free model parameters, the ranking exponent $\alpha$ and the reranking probability $\mu$, were set to fit the empirical distributions. For $\mu = 0$ we recover the original ranking model, which yields a lognormal distribution of burst magnitude, like the preferential attachment (Fig. 3a). For $\mu > 0$ numerical simulations show that the tail of the popularity burst magnitude distribution shifts from a lognormal to a power law. The popularity distribution itself remains a power law; its exponent remains $\gamma = 1 + 1/\alpha$, but with an exponential cutoff depending on $\mu$.
Such a parsimonious model is able to reproduce the most relevant features observed in the empirical data. Not only does rank-shift predict the distributions of both popularity measures in our data sets, but also the long tails of the distributions of indegree and traffic burst size (Fig. 3c). Furthermore, it naturally accounts for the maxima of the empirical distributions. Remarkably, the model captures the long-range distribution of inter-burst intervals as well (Fig. 3d). The random rank-shift mechanism is therefore able to capture the way in which Web sites and pages gain and accumulate popularity: not by a gradual proportional process, but by a sequence of bursts that move them to the forefront of people’s attention. Such bursts are different from those observed in news-driven events Wu and Huberman (2007), where attention fades rapidly and overall popularity is lognormal-distributed. We also found that smaller rank shifts are unable to capture the critical burst behavior observed in the data (see EPAPS, ref. 28).
At the present stage our model is mostly descriptive and simply aims at reproducing at the coarsest level the distributions that characterize popularity changes. Possible refinements may include the effect of search engines, external events, news, word of mouth, social media, marketing campaigns, or any combination of them. The study of traffic patterns and models Meiss et al. (2008); Goncalves et al. (2009); Meiss et al. (2010) may help shed empirical light on this question.
Acknowledgements. We thank R. Baeza-Yates, C. Cattuto, B. Dravid, V. Griffith, V. Loreto, M. Marchiori, and M. Meiss. This work was supported in part by a Lagrange Senior Fellowship from the CRT Foundation to F.M., NSF grant IIS-0513650 to A.V., and the Lilly Endowment Foundation. S.F. gratefully acknowledges ICTeCollective, grant 238597 of the European Commission.
- Simon (1971) H. A. Simon, in Computers, Communication, and the Public Interest, edited by M. Greenberger (The Johns Hopkins Press, Baltimore, 1971), pp. 37–72.
- Tapscott and Williams (2006) D. Tapscott and A. D. Williams, Wikinomics: How Mass Collaboration Changes Everything (Portfolio Hardcover, 2006).
- Albert et al. (1999) R. Albert, H. Jeong, and A.-L. Barabási, Nature 401, 130 (1999).
- Broder et al. (2000) A. Broder et al., Computer Networks 33, 309 (2000).
- Meiss et al. (2005) M. Meiss, F. Menczer, and A. Vespignani, in Proc. 14th Intl. World Wide Web Conf. (2005), pp. 510–518.
- Meiss et al. (2008) M. Meiss et al., in Proc. 1st Intl. Conf. on Web Search and Data Mining (WSDM) (2008), pp. 65–76.
- Simon (1955) H. A. Simon, Biometrika 42, 425 (1955).
- de Solla Price (1976) D. de Solla Price, J. Amer. Soc. Inform. Sci. 27, 292 (1976).
- Barabasi and Albert (1999) A.-L. Barabasi and R. Albert, Science 286, 509 (1999).
- Wu and Huberman (2007) F. Wu and B. A. Huberman, Proc. Natl. Acad. Sci. USA 104, 17599 (2007).
- Dezso et al. (2006) Z. Dezso et al., Phys. Rev. E 73, 066132 (2006).
- Szabo and Huberman (2008) G. Szabo and B. A. Huberman, Tech. Rep., arXiv:0811.0405v1 [cs.CY] (2008).
- Crane and Sornette (2008) R. Crane and D. Sornette, Proc. Natl. Acad. Sci. USA 105, 15649 (2008).
- Salganik et al. (2006) M. J. Salganik, P. S. Dodds, and D. J. Watts, Science 311, 854 (2006).
- Baeza-Yates and Poblete (2006) R. Baeza-Yates and B. Poblete, Comput. Networks 50, 1464 (2006).
- Capocci et al. (2006) A. Capocci et al., Phys. Rev. E 74, 036116 (2006).
- Zlatic et al. (2006) V. Zlatic et al., Phys. Rev. E 74, 016115 (2006).
- Clauset et al. (2009) A. Clauset, C. R. Shalizi, and M. E. J. Newman, SIAM Review 51, 661 (2009).
- Barabási (2005) A.-L. Barabási, Nature 435, 207 (2005).
- Mandelbrot (1997) B. B. Mandelbrot, Fractals and Scaling in Finance: Discontinuity, Concentration, Risk, vol. E of Selecta (Springer, 1997).
- Stanley et al. (1996) M. H. R. Stanley et al., Nature 379, 804 (1996).
- Gutenberg and Richter (1944) B. Gutenberg and C. Richter, Bull. Seismol. Soc. Am. 34, 185 (1944).
- Rybski et al. (2009) D. Rybski et al., Proc. Natl. Acad. Sci. USA 106, 12640 (2009).
- Omori (1894) F. Omori, J. Coll. Sci. Imp. Univ. Japan 7, 111 (1894).
- Bak et al. (1987) P. Bak, C. Tang, and K. Wiesenfeld, Phys. Rev. Lett. 59, 381 (1987).
- Dorogovtsev et al. (2000) S. Dorogovtsev, J. Mendes, and A. Samukhin, Phys. Rev. Lett. 85, 4633 (2000).
- Fortunato et al. (2006) S. Fortunato, A. Flammini, and F. Menczer, Phys. Rev. Lett. 96, 218701 (2006).
- (28) See EPAPS Document No. … for alternative reranking strategies.
- Goncalves et al. (2009) B. Goncalves et al., in Late-breaking results at 2nd Intl. Conf. on Web Search and Data Mining (WSDM) (2009).
- Meiss et al. (2010) M. Meiss et al., in Proc. 21st ACM Conf. on Hypertext and Hypermedia (HT) (2010).