Uncovering the dynamics of citations of scientific papers
We present a comprehensive framework that accounts for the citation dynamics of scientific papers and for the age distribution of references. We show that the citation dynamics of scientific papers is nonlinear and that this nonlinearity has far-reaching consequences, such as diverging citation distributions and runaway papers. We propose a nonlinear stochastic model of citation dynamics based on a link copying/redirection mechanism. The model is fully calibrated by empirical data and does not contain free parameters. This model can be a basis for quantitative probabilistic prediction of the citation dynamics of individual papers and of the journal impact factor.
PACS numbers: 01.75.+m, 02.50.Ey, 89.75.Fb, 89.75.Hc
The growth mechanism of complex networks is frequently attributed to preferential attachment Barabasi (). While this mechanism accounts for the ubiquity of networks that are scale-free or have a heavy-tailed degree distribution, it is too general and does not specifically address the evolving network structure. A more realistic scenario of the dynamics of growing networks is provided by the two-step growth models that have been developed in the context of social networks Jackson (); Pennock (), epidemic-like propagation of ideas Goffman (); Scharnhorst (); Bettencourt (), diffusion of innovations Bass (); Lal (), and citation dynamics Vitanov (). In the context of citations these models are known as redirection/copying Redner2005 (), recursive search Vasquez2001 (); Vasquez2003 (), link copying/referral Simkin (), uniform/preferential attachment Peterson (), and triad formation Holme2009 (); Ren (). Although a citation network is specific (it is ordered, directed, acyclic, and does not allow rewiring Bagrow (); Newman2009 ()), it is an excellent example of a growing network since it is well documented and its dynamics can be reliably traced over long time periods.
We introduce a comprehensive two-step model of a growing citation network. The model is fully calibrated by empirical data and does not contain free parameters. Our measurements revealed an unexpected dynamic nonlinearity that was missing in all previous models. We incorporate this nonlinearity into our framework and arrive at a nonlinear stochastic model of citation dynamics. The model predictions are confirmed by measurements of the age composition of the average reference list on the one hand, and by the statistical distribution of cumulative citations for a large ensemble of papers on the other hand.
Our model can be useful for making probabilistic predictions of citations of scientific papers. This active topic was initiated by Refs. Penner (); Acuna (); Mazloumian (); Barabasi2013 (); Uzzi (), which suggested several linear predictive models containing empirical parameters. The statistical uncertainty of these one-step models is too high. We introduce here a much more realistic nonlinear two-step model in which all parameters have been calibrated in independent measurements. The nonlinearity leads to divergent citation dynamics, which explains why predicting the citation behavior of individual papers is so difficult. Our model can be a basis for probabilistic forecasting of the scientific impact of a paper or of a journal.
II Statistics of references
II.1 Scenario: how an author composes his reference list
Consider a cartoon scenario of an author writing a scientific paper. He reads research journals or media articles, searches the databases, finds the relevant papers and cites some of them in his reference list. Then he studies the reference lists of these preselected papers, picks up relevant references, reads them, cites some of them, and the process continues recursively. We distinguish between the direct references that the author found through media or database search, and indirect references that the author picked up from the reference lists of the preselected papers Vitanov (); Simkin (); Peterson (). Figure 1 shows the corresponding reference network.
The direct and indirect references emerge in another scenario where the author finds each reference independently. Since old references are usually seminal studies, the author’s most recent references will probably cite these old papers as well. In our parlance the older references are indirect ones although the author could choose them without knowing that other preselected papers cite them as well.
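The two-step scenario above can be illustrated with a toy growth rule. The following is only a sketch under stated assumptions, not the calibrated model of this paper: direct references are drawn uniformly at random from older papers, the recursion is cut after one level, and the names `n_direct` and `p_copy` are hypothetical parameters introduced for illustration.

```python
import random

def grow_citation_network(n_papers, n_direct, p_copy, seed=0):
    """Grow a toy citation network with a direct + indirect (copying) rule.

    Illustrative assumptions (not values from the paper):
    n_direct -- number of papers each new author finds by direct search
    p_copy   -- probability of copying each reference of a preselected paper
    """
    rng = random.Random(seed)
    refs = {0: set()}  # paper id -> set of cited (older) paper ids
    for new in range(1, n_papers):
        # Step 1: direct references, chosen uniformly among older papers.
        direct = set(rng.sample(range(new), min(n_direct, new)))
        # Step 2: indirect references, copied from the reference lists
        # of the preselected (direct) papers; one level of recursion only.
        indirect = set()
        for d in sorted(direct):
            for r in sorted(refs[d]):
                if r not in direct and rng.random() < p_copy:
                    indirect.add(r)
        refs[new] = direct | indirect
    return refs

def citation_counts(refs):
    """Count how many times each paper appears in later reference lists."""
    counts = {paper: 0 for paper in refs}
    for cited in refs.values():
        for c in cited:
            counts[c] += 1
    return counts

refs = grow_citation_network(n_papers=2000, n_direct=3, p_copy=0.2)
counts = citation_counts(refs)
print("most cited paper has", max(counts.values()), "citations")
```

Because every reference points to an older paper, the resulting network is ordered, directed, and acyclic, as a citation network must be. Copying favors papers that are already cited by the preselected papers, so older and well-cited papers tend to accumulate indirect citations.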
II.2 Age distribution of references
The above scenario yields a very specific age distribution of references. Indeed, consider a reference list of an average paper that comprises references where is the publication year and is the number of references that were published in the year . The latter consist of the direct and indirect references, . For example, Fig. 2 shows for Physics papers published in .
To find we make a crude approximation that once the author cites some paper, he can cite any of its references with equal probability. An average reference list comprises preselected papers published in year (where ), each of which bringing in average indirect references. The fraction of the latter that were published in the year is . This is equal to since the age composition of the average reference list is fairly independent of the publication year (Fig. 2). Finally, we obtain
Since all variables now refer to the same publication year , we can drop from our notation, in such a way that in Eq. 1 denotes .
Once we know the functions and , we can solve Eq. 1 and find . These functions are the key parameters of the model since they capture the citation habits of an average author. Although these functions could be found by analyzing reference lists of papers, this is not easy since bibliometric databases focus more on citations than on references. However, since there is a duality between citations and references, we can find and by considering citation dynamics of the papers.
III Citation dynamics: a mean-field model
III.1 Reference-citation duality
Since one paper’s citation is another paper’s reference, the reference and citation networks are dual (Fig. 1). Consequently, the age distribution of references for the papers published in one year (diachronous or retrospective citation distribution Nakamoto (); Glanzel ()) (Fig. 2) is very similar to (Fig. 3), the mean citation rate of the papers published in one year (synchronous or prospective citation distribution).
In what follows we analyze consequences of this duality and how it can be used to measure relevant parameters in Eq. 1. Indeed, consider a set of all papers in a certain research field that were published in year . The mean number of citations that a paper from this set garners in the -th year after publication is . Since the majority of citing papers belong to the same research field, the total number of citations garnered by these papers in the year shall be equal to the total number of the references in the reference lists of the papers published in the year ,
Equation 2 relates the total number of citations and references for the papers belonging to the same research field but published in different years. To find corresponding relation for the papers published in the same year, we shall take into account that both the number of publications and the reference list length grow exponentially with time, . We substitute these exponential dependences into Eq. 2, notice that , and find the mathematical expression for the reference-citation duality,
III.2 A mean-field model of citation dynamics
where and has been replaced by using the properties of the convolution. Equation 4 tells us that an average paper published in some year has first-generation citations published in year , each of which generating second-generation citations in some later year . The probability that a second-generation citation induces (indirect) citation of the parent paper is where is the average length of the reference list of the papers published in the year .
To solve Eq. 4 we have to determine functions and . In fact, we used Eq. 4 to find . To this end we measured and (see next section), substituted these functions into Eq. 4, and found an exponential kernel where , yr, and the publication year corresponds to . (These numbers mean that references in the average reference list of a paper are direct and are indirect).
To find the age distribution of references we calculated where for Physics papers (as found in our independent measurements). We substituted , and the functions found above into Eq. 1 and solved it to find . Figure 2 shows that the model prediction fits our measurements very well. We conclude that the mean-field model (Eqs. 1, 2, 4) faithfully accounts for the average age composition of the reference list and for the mean citation dynamics of scientific papers. In what follows we use this model to infer the citation dynamics of individual papers.
IV Citation dynamics of individual papers
IV.1 Linear stochastic model
To derive an equation describing the citation dynamics of individual papers we consider , the number of citations garnered by some paper during a short time interval from to . The cumulative number of citations of this paper is . We assume that is a discrete random variable that follows a time-inhomogeneous Poisson process Golosovsky () with the rate . This latent citation rate Burrell () consists of the direct and indirect contributions, .
To derive the dynamic equation for we note that for the set of papers published in one year, the average rate of direct citations is , the average rate of total citations is , and where . We substitute these equalities into Eq. 4, replace the integral by a sum, dispense with the averaging, and obtain
This discrete stochastic equation is consistent with Eq. 4. Our initial assumption (to be revised soon) is that the functions and , determined from our studies of mean-field citation dynamics, govern citation dynamics of individual papers as well.
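The structure of such a linear model (a direct term plus past citations weighted by a memory kernel) can be explored numerically. The sketch below is a hedged illustration, not the paper's Eq. 5: the exponential direct-citation profile, the kernel amplitude `q`, the decay rate `gamma`, and the time constant `tau` are all hypothetical choices, and the stochastic citation counts are replaced by their expected values.

```python
import math

def direct_rate(t, eta, tau=3.0):
    """Hypothetical direct-citation rate in year t.

    Normalized so that the yearly values sum to eta over t = 0, 1, 2, ...
    """
    return eta * (1.0 - math.exp(-1.0 / tau)) * math.exp(-t / tau)

def kernel(dt, q=0.5, gamma=1.0):
    """Hypothetical exponential memory kernel for a citation made dt >= 1 years ago."""
    return q * math.exp(-gamma * (dt - 1))

def expected_citations(eta, years, q=0.5, gamma=1.0):
    """Expected yearly citations: direct term plus kernel-weighted past citations."""
    k = []
    for i in range(years):
        indirect = sum(kernel(i - j, q, gamma) * k[j] for j in range(i))
        k.append(direct_rate(i, eta) + indirect)
    return k

eta = 10.0
k = expected_citations(eta, years=300)
# Mean number of indirect citations induced by one citation (sum of the kernel).
reproduction = 0.5 / (1.0 - math.exp(-1.0))
print(sum(k), eta / (1.0 - reproduction))
```

In this linear toy version every citation induces on average `reproduction` further citations, so the long-time total equals the number of direct citations multiplied by the geometric-series factor 1/(1 - reproduction), provided that `reproduction` is below one.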
IV.2 Measurements and comparison to the model
To verify Eq. 5 empirically we need to measure the direct and indirect citations separately. To this end we chose 37 representative research papers that were published in Physical Review B in one year, analyzed their first- and second-generation citing papers (Fig. 1), identified the direct and indirect citations, and measured their dynamics. As an aggregate measure of the paper’s individuality we took , the long-time limit of cumulative citations.
IV.2.1 Direct citations
We found (not shown here) that the direct citation rate of a paper can be represented as
where is the numerical parameter and the function is the same for all papers published in one year, whereas . Figure 3 shows that grows immediately after publication of the paper, achieves its maximum after 2 years and slowly decays thereafter. The long tail of is a mathematical expression of the delayed recognition (“sleeping beauty” sleeping-beauty ()) phenomenon.
The parameter is the long-time limit of the number of direct citations and it is a proxy for the so-called “fitness” Simkin (); Bianconi (), which should depend on the scientific quality of the parent paper, the journal where it was published, the popularity of the research field, etc. On the one hand, can be estimated a priori from the initial citation rate of the paper. [Indeed, shortly after publication .] On the other hand, since the solution of Eqs. 5,6 yields , this relation can serve as an a posteriori estimate of . Our measurements (not shown here) yield a sublinear dependence
where the small parameter accounts for the fact that previously uncited papers have some probability of being cited in the future.
IV.2.2 Indirect citations
We found that Eq. 5 with the kernel fits the dynamics of indirect citations of individual papers only if we allow for and to depend on the number of previous citations . The reason for this surprising -dependence is that new citations modify the very structure of citation network associated with the cited paper, this modification being most pronounced for highly-cited papers.
Indeed, consider two generations of citing papers associated with a parent paper. Obviously, the number of the first-generation citations is equal to the number of the first-generation citing papers. However, the numbers of second-generation citations and citing papers can differ. We denote by and , respectively, the long-time limits of the number of second-generation citations and citing papers per one first-generation citing paper, and introduce , the average number of the first-generation citing papers cited by a second-generation citing paper. Figure 5 shows that increases with following the empirical dependence . The inset shows that this growth is associated with the dependence (this means that the citation network is assortative) while is almost independent of . Figure 4 demonstrates that measures the average number of paths leading from the parent paper to a second-generation citing paper. For a low-cited parent paper , indicating that it is connected to each of its second-generation descendants by a single path. For a highly-cited parent paper indicating that it is connected to some of its second-generation descendants by multiple paths.
We found that the parameter in Eq. 5 is also -dependent (not shown here). Therefore, we merge all -dependent parameters together and introduce , the probability that a second-generation citing paper cites the parent paper (indirectly). Our measurements revealed that increases nonlinearly with , the number of citation paths connecting the parent paper with its second-generation descendant, as shown schematically in Fig. 4c. In particular, we found a quadratic dependence indicating constructive interference between multiple paths. Since the number of multiple paths increases with , this translates into the empirical dependence (while in the absence of multipath interference one would obtain ). The dependence stems from the dependence and it has a loose analogy with bootstrap percolation.
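For comparison, it may help to write down a baseline without interference. The following is a sketch in hypothetical notation (the symbols $n$ for the number of paths and $p$ for the per-path probability are introduced here for illustration only): if each of the $n$ paths independently triggered an indirect citation with a small probability $p$, the probability of an indirect citation would grow at most linearly with $n$.

```latex
% Independent-paths baseline (illustrative assumption, not a result of this paper):
P_{\mathrm{indep}}(n) \;=\; 1 - (1 - p)^{n} \;\approx\; n\,p
\qquad \text{for } n\,p \ll 1 .
```

A quadratic dependence on the number of paths grows faster than this baseline, which is what is meant by constructive interference between multiple paths.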
V Nonlinear model of citation dynamics
V.1 Dynamic equation
To introduce nonlinearity into Eq. 5 we replaced the kernel by . Here, is the average number of the second-generation citing papers per one first-generation citing paper (fan-out coefficient), and is the probability that a second-generation citing paper cites the parent paper (indirectly). The novelty here is the dependence which is shown in Fig. 5. We introduce this kernel into Eq. 5, substitute Eq. 6, and obtain our key result, a nonlinear stochastic dynamic equation for the latent citation rate of a paper:
The empirical functions , , and are shown in Figs. 3 and 5. Equation 8 is a nonlinear first-order discrete stochastic differential equation with the initial condition set by . This equation expresses , the latent citation rate of the paper at time , through past citations of the same paper, and . The probability distribution of actual citations at time is given by the Poisson distribution, .
V.2 Stochastic simulation
To verify Eq. 8 we performed a stochastic numerical simulation imitating the citation dynamics of a set of 40195 Physics papers published in 1984. Figure 6 shows the cumulative citation distributions for this set over the time span of 25 years. We wish to imitate these distributions using Eq. 8. This requires that the statistical distribution of initial conditions (“fitness” ) for the actual and “simulated” papers be the same. We estimated for each paper using Eq. 7 and assuming . The inset in Fig. 6 shows the corresponding statistical distribution of .
We ran a stochastic simulation based on Eq. 8 with this distribution of and the empirical functions , , shown in Figs. 3 and 5. Figure 6 shows excellent agreement between the simulated and measured cumulative citation distributions. Moreover, our simulation accounts fairly well for such intricate characteristics of citation dynamics as stochastic variability, temporal autocorrelation, and the dynamics of uncited papers (not shown here). We present here our measurements for Physics papers; we obtained very similar results for Mathematics and Economics papers as well.
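The overall logic of such a simulation can be sketched in a few lines. This is a minimal illustration under loudly stated assumptions, not the simulation used in this paper: the direct-citation profile `m_direct`, the logarithmic form of `feedback`, the exponential memory with rate `gamma`, and the lognormal fitness distribution are all hypothetical stand-ins for the empirical functions and distributions, which are not reproduced here.

```python
import math
import random

def poisson(rng, lam):
    """Draw a Poisson variate with mean lam (Knuth's method, adequate for moderate lam)."""
    threshold, k, p = math.exp(-lam), 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k

def m_direct(t, tau=3.0):
    """Hypothetical direct-citation profile, normalized to sum to 1 over the years."""
    return (1.0 - math.exp(-1.0 / tau)) * math.exp(-t / tau)

def feedback(K, q0=0.3, b=0.1):
    """Hypothetical feedback strength that grows slowly with cumulative citations K."""
    return q0 * (1.0 + b * math.log(1.0 + K))

def simulate_paper(rng, eta, years=25, gamma=1.0):
    """Simulate yearly citation counts of one paper with fitness eta."""
    k, K = [], 0
    for i in range(years):
        # Exponentially fading memory of the paper's own past citations.
        memory = sum(math.exp(-gamma * (i - j - 1)) * k[j] for j in range(i))
        lam = eta * m_direct(i) + feedback(K) * memory  # latent citation rate
        k_i = poisson(rng, min(lam, 500.0))  # cap keeps Knuth's method stable
        k.append(k_i)
        K += k_i
    return k

rng = random.Random(1)
totals = []
for _ in range(500):
    eta = rng.lognormvariate(1.0, 1.0)  # hypothetical fitness distribution
    totals.append(sum(simulate_paper(rng, eta)))
print("mean citations after 25 years:", sum(totals) / len(totals))
```

The only nonlinearity in this sketch is the dependence of `feedback` on the cumulative count `K`; setting `b = 0` recovers a linear self-exciting process. Comparing the histogram of `totals` for `b = 0` and `b > 0` shows how even a weak dependence on `K` fattens the tail of the citation distribution.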
V.3 Analysis of the model
To gain more insight into the citation dynamics of scientific papers we consider a continuous approximation of Eq. 8. Without loss of generality we disregard stochasticity, consider as a continuous variable, and approximate the kernel by the exponential, where and . [The rationale for this approximation is the fact that with yr has much stronger time dependence than . The latter is captured by the term 0.08 yr]. We replace the sum in Eq. 8 by the integral, drop index and arrive at
Equation 9 appears in the context of Bellman-Harris branching (cascade) processes Bellman (). It is well known in population dynamics, where it describes the age-dependent birth-death process with immigration Ebeling (): direct and indirect citations are analogs of immigration and reproduction, respectively, and is the reproduction number.
The dynamic behavior described by Eq. 9 results from the interplay between the positive feedback rate characterized by the factor and the rate of obsolescence characterized by the parameter . The latter should be compared to the average paper longevity (citation lifetime), , that we define empirically using a crude exponential approximation , where characterizes delayed recognition. In the limit Eq. 9 reduces to the first-order autoregressive model of citation dynamics Golosovsky ()
The latter is nothing but the Bass equation for diffusion of innovations Bass (); Lal () in an infinite market. Citations correspond to adopters, direct citations correspond to innovators, and indirect citations correspond to imitators. The connection to the Bass model is not accidental, since each paper can be considered as a new product whose penetration into the market of ideas is gauged by the number of citations. The novelty here is the nonlinear dependence. In the context of diffusion of innovations the nonlinear coefficient of imitation is not unexpected. It would indicate an increased probability of adoption of a new product if several neighbors in the network have already adopted it. To the best of our knowledge, such a possibility has not received much attention.
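For reference, the infinite-market limit of the standard Bass model can be written out explicitly. This is a textbook sketch in the usual Bass notation, which is an assumption here and not the notation of this paper: $N(t)$ is the cumulative number of adopters, $M$ the market size, $p$ the coefficient of innovation, and $q$ the coefficient of imitation.

```latex
\begin{align}
\frac{dN}{dt} &= \left(p + \frac{q}{M}\,N\right)\left(M - N\right)
               = pM + (q - p)\,N - \frac{q}{M}\,N^{2}, \\
\frac{dN}{dt} &\;\longrightarrow\; a + q\,N
\qquad \text{as } M \to \infty \text{ with } a = pM \text{ held fixed.}
\end{align}
```

In this limit the saturation term vanishes: a constant inflow of innovators ($a$, the analog of direct citations) is amplified by imitation at a rate proportional to the number of existing adopters ($qN$, the analog of indirect citations).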
VI Consequences of nonlinearity: runaways
To analyze consequences of the nonlinearity we note that since dependence is weak we can integrate Eq. 9 over time assuming constant . This yields
The first term in the square brackets corresponds to direct citations, the second term stands for indirect citations. Each direct citation induces a cascade of indirect citations that propagates in time if and decays if . In the latter case comes to saturation, (ordinary papers) while in the former case grows exponentially (seminal papers). Since grows with time, an ordinary paper which by pure chance garnered an excessive number of citations can become a seminal paper.
Although the dependence is weak, it is important since it enters in the exponent. This results in a “winner takes all” instability Vasquez2003 (); Redner2001 (); Mondragon (); Krap-kryukov (); Barabasi2012 (). To analyze how this instability develops with we again consider the paper longevity (citation lifetime) . The latter is determined by the exponent , and to a lesser extent, by the function . Equation 12 suggests that . Since increases with , the inverse dependence means that with increasing number of citations, increases and diverges upon approaching the branching (tipping) point Simkin (). This means that each new citation extends a paper’s lifetime (quite the opposite of the situation described in Balzac’s famous novel “La peau de chagrin”).
Figure 7 demonstrates that the citation lifetime indeed increases with increasing and diverges when , in such a way that the papers with more than 600-1000 citations exhibit runaway behavior: their citation career does not saturate even after 25 years. This complements the famous adage “the rich get richer” with “the rich live longer”.
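The competition between feedback and obsolescence can be illustrated with a toy continuous model. The sketch below is not Eq. 9 or Eq. 12 of this paper; it assumes a constant feedback rate `q`, an exponential memory with obsolescence rate `gamma`, and an exponential direct-citation profile with time constant `tau`, all of which are hypothetical. In this toy model the cascade decays when `q < gamma` and grows when `q > gamma`.

```python
import math

def integrate(q, gamma=1.0, eta=1.0, tau=3.0, t_max=25.0, dt=0.001):
    """Euler integration of a toy cascade model.

    rate(t)        = direct(t) + q * memory(t)
    d(memory)/dt   = rate(t) - gamma * memory(t)
    Returns the cumulative citations K and the final citation rate.
    """
    memory, K, rate = 0.0, 0.0, 0.0
    for step in range(int(round(t_max / dt))):
        t = step * dt
        direct = (eta / tau) * math.exp(-t / tau)  # integrates to eta
        rate = direct + q * memory
        memory += (rate - gamma * memory) * dt
        K += rate * dt
    return K, rate

K_sub, rate_sub = integrate(q=0.5)      # q < gamma: cascade decays, K saturates
K_super, rate_super = integrate(q=1.5)  # q > gamma: cascade grows exponentially
print(f"q=0.5: K={K_sub:.3f}, final rate={rate_sub:.2e}")
print(f"q=1.5: K={K_super:.3e}, final rate={rate_super:.2e}")
```

For `q = 0.5` the cumulative count saturates close to `eta / (1 - q / gamma) = 2`, whereas for `q = 1.5` it grows without bound over the 25-year window. If the feedback rate itself increases slowly with the cumulative count, a paper that starts below the threshold can cross it after collecting enough citations, which is the runaway mechanism discussed in this section.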
We developed a nonlinear stochastic model of citation dynamics of scientific papers and validated this model by measurements. The underlying scenario is as follows. We assume that the author of a new scientific paper finds relevant papers from the media or journals and cites them. Then he studies the reference lists of these preselected papers, picks up some references, cites them as well, and continues this process recursively. We add here a new ingredient: if some paper is cited by several preselected papers, the author chooses it with higher probability than that cited by only one preselected paper.
This new ingredient, combined with the assortativity of the citation network, introduces dynamic nonlinearity. Accounting for this nonlinearity is crucial for predicting the future citation behavior of papers. Our nonlinear dynamic model can serve as a basis for probabilistic forecasting of the citation dynamics of a paper or a group of papers (journal impact factor).
Acknowledgements. We are grateful to S. Redner, A. Scharnhorst, L. Muchnik, and D. Shapiro for fruitful discussions, and we appreciate instructive correspondence with M. Simkin. We acknowledge financial support of the EU COST Action TD1210.
- (1) R. Albert and A.L. Barabasi, Statistical mechanics of complex networks, Reviews of Modern Physics, 74 (2002), pp. 47–97.
- (2) M.O. Jackson and B.W. Rogers, Meeting strangers and friends of friends: How random are social networks?, American Economic Review, 97 (2007), pp. 890–915.
- (3) D.M. Pennock, G. W. Flake, S. Lawrence, E. J. Glover, and C. L. Giles, Winners don’t take all: Characterizing the competition for links on the web, P.N.A.S., 99 (2002), pp. 5207–5211.
- (4) W. Goffman and V.A. Newill, Generalization of the epidemic theory. Application to transmission of ideas, Nature, 204 (1964), pp. 225–228.
- (5) E. Bruckner, W. Ebeling, and A. Scharnhorst, The application of evolution models in scientometrics, Scientometrics, 18 (1990), pp. 21–41.
- (6) L.M.A. Bettencourt, A. Cintron-Arias, D. I. Kaiser, and C. Castillo-Chavez, The power of a good idea: Quantitative modeling of the spread of ideas from epidemiological models, Physica A: Statistical Mechanics and its Applications, 364 (2006), pp. 513–536.
- (7) F.M. Bass, New product growth for model consumer durables, Management Science Series a-Theory, 15 (1969), pp. 215–227.
- (8) V. B. Lal, Karmeshu, and S. Kaicker, Modeling Innovation Diffusion with Distributed Time–Lag, Technological Forecasting and Social Change, 34 (1988), pp. 103–113.
- (9) N. K. Vitanov and M. R. Ausloos, in Models of Science Dynamics, ed. by A.Scharnhorst, K.Borner, and P. van den Besselaar, Springer, Berlin, (2012), pp. 69–126.
- (10) P. L. Krapivsky and S. Redner, Network growth by copying, Physical Review E, 71 (2005), p. 036118.
- (11) A. Vazquez, Disordered networks generated by recursive searches, Europhysics Letters, 54 (2001), pp. 430–435.
- (12) A. Vazquez, Growing network with local rules: Preferential attachment, clustering hierarchy, and degree correlations, Physical Review E, 67 (2003), p. 056104.
- (13) M. V. Simkin and V. P. Roychowdhury, A mathematical theory of citing, Journal of the American Society for Information Science and Technology, 58 (2007), pp. 1661–1673.
- (14) G. J. Peterson, S. Presse, and K. A. Dill, Nonuniversal power law scaling in the probability distribution of scientific citations, P.N.A.S., 107 (2010), pp. 16023–16027.
- (15) Z.-X. Wu and P. Holme, Modeling scientific–citation patterns and other triangle–rich acyclic networks, Physical Review E, 80 (2009), p. 037101.
- (16) F.-X. Ren, H.-W. Shen, and X.-Q. Cheng, Modeling the clustering in citation networks, Physica A, 391 (2012), pp. 3533–3539.
- (17) J. P. Bagrow and D. Brockmann, Natural Emergence of Clusters and Bursts in Network Evolution, Phys. Rev. X, 3 (2013), p. 021016.
- (18) B. Karrer and M.E.J. Newman, Random Acyclic Networks, Phys. Rev. Lett., 102 (2009), p. 128701.
- (19) O. Penner, R. K. Pan, A. M. Petersen, K. Kaski, and S. Fortunato, The case for caution in predicting scientists’ future impact, Physics Today, 66 (2013), pp. 8–9.
- (20) D.E. Acuna, S. Allesina, and K.P. Kording, Predicting scientific success, Nature, 489 (2012), pp. 201–202.
- (21) A. Mazloumian, Predicting Scholars’ Scientific Impact, Plos One, 7 (2012), p. e49246.
- (22) A.L. Barabasi, C.M. Song, and D.S. Wang, Handful of papers dominates citation, Nature, 491 (2012), pp. 40–40.
- (23) B. Uzzi, S. Mukherjee, M. Stringer, and B. Jones, Atypical Combinations and Scientific Impact, Science, 342 (2013), pp. 468–472.
- (24) H. Nakamoto, Synchronous and diachronous citation distributions, in Informetrics 87/88, Belgium : Diepenbeek, pp. 157–163 (1988), ed. by L. Egghe and R. Rousseau.
- (25) W. Glanzel, Towards a model for diachronous and synchronous citation analyses, Scientometrics, 60 (2004), pp. 511–522.
- (26) M. Golosovsky and S. Solomon, Stochastic Dynamical Model of a Growing Citation Network Based on a Self–Exciting Point Process, Phys. Rev. Lett., 109 (2012), p. 098701.
- (27) Q.L. Burrell, Predicting future citation behavior, Journal of the American Society for Information Science and Technology, 54 (2003), pp. 372–378.
- (28) G. Bianconi and A.L. Barabasi, Bose–Einstein condensation in complex networks, Physical Review Letters, 86 (2001), pp. 5632–5635.
- (29) W. Ebeling, A. Engel, and V.G. Mazenko, Modeling of selection processes with age-dependent birth and death rates, BioSystems, 19 (1986), pp. 213–221.
- (30) T. E. Harris, The Theory of Branching Processes, Springer–Verlag, Berlin, 2002.
- (31) A.F.J. van Raan, Sleeping beauties in science, Scientometrics, 59 (2004), pp. 467–472.
- (32) J.L. Iribarren and E. Moro, Branching dynamics of viral information spreading, Physical Review E, 84 (2011), p. 046116.
- (33) M.E.J. Newman, The first–mover advantage in scientific publication, EPL, 86 (2009), p. 68001.
- (34) P.L. Krapivsky and S. Redner, Organization of growing random networks, Physical Review E, 63 (2001), p. 066123.
- (35) S. Zhou and R.J. Mondragon, Accurately modeling the internet topology, Physical Review E, 70 (2004), p. 066108.
- (36) P.L. Krapivsky and D. Krioukov, Scale–free networks as preasymptotic regimes of superlinear preferential attachment, Physical Review E, 78 (2008), p. 026114.
- (37) D.S. Wang, C.M. Song, and A.L. Barabasi, Quantifying Long–Term Scientific Impact, Science, 342 (2013), pp. 127–132.