A Derivation of the random diffusion model from the blogger model

Statistical properties of fluctuations of time series representing appearances of words in nationwide blog data and their applications: An example of modelling fluctuation scalings of nonstationary time series

Hayafumi Watanabe (hayafumi.watanabe@gmail.com), Yukie Sano, Hideki Takayasu, Misako Takayasu

Hottolink, Inc., 6 Yonbancho Chiyoda-ku, Tokyo 102-0081, Japan
Risk Analysis Research Center, The Institute of Statistical Mathematics, 10-3 Midori-cho, Tachikawa, Tokyo 190-8562, Japan
Faculty of Engineering, Information and Systems, University of Tsukuba, Tennodai, Tsukuba, Ibaraki 305-8573, Japan
Sony Computer Science Laboratories, 3-14-13 Higashi-Gotanda, Shinagawa-ku, Tokyo 141-0022, Japan
Institute of Innovative Research, Tokyo Institute of Technology, 4259 Nagatsuta-cho, Midori-ku, Yokohama 226-8502, Japan

Abstract

To elucidate the non-trivial empirical statistical properties of fluctuations of a typical non-steady time series representing the appearance of words in blogs, we investigate approximately three billion Japanese blog articles over a period of six years and analyse some corresponding mathematical models. First, we introduce a solvable non-steady extension of the random diffusion model, which can be deduced by modelling the behaviour of heterogeneous random bloggers. Next, we deduce theoretical expressions for both the temporal and ensemble fluctuation scalings of this model, and demonstrate that these expressions can reproduce all empirical scalings over eight orders of magnitude. Furthermore, we show that the model can reproduce other statistical properties of time series representing the appearance of words in blogs, such as functional forms of the probability density and correlations in the total number of blogs. As an application, we quantify the abnormality of special nationwide events by measuring the fluctuation scalings of 1771 basic adjectives.

PACS numbers: 89.75.Da, 89.65.Ef, 89.20.Hh

I Introduction

In order to understand human behaviour with high accuracy, the use of data from social media is rapidly spreading in both practical applications (such as marketing, television shows, politics, and finance) and basic sciences (such as sociology, physics, psychology, and information science) (1); (2); (3); (4); (5); (6); (7). In such analyses of social media data, one of the most important basic objects is the time series representing the appearance of a considered keyword, that is, the sequence of daily counts of the appearances of that word within a huge social media data set. This quantity is mostly used to measure temporal changes in social concerns related to the considered word.

Our research focuses on the “fluctuation” (i.e., occurrence of random noise) in the time series. We aim to describe this fluctuation precisely, whereas the majority of previous research has focused on “trends” in the time series (i.e., nonrandom parts of the time series) for practical reasons. The reasons why we focus on fluctuation are as follows: (i) The information regarding noise is important for extracting essential information from the data in precise observations. For example, this can be used to eliminate noise, detect anomalies, etc. (ii) The fluctuation of a time series of social media data obeys a statistical law known as “fluctuation scaling”, which can be observed in various complex systems relating to both natural and human phenomena (8); (9); (10); (11); (12); (13); (14). Thus, it is also important to understand the properties of fluctuation in social media data in the context of general complex systems science or physical sciences.

Fluctuation scaling (FS), which is also known as “Taylor’s law” (15) in ecology, is a power law relation between the system size (e.g., a mean) and the magnitude of fluctuation (e.g., a standard deviation). FS is observed in various complex systems, such as random walks on a complex network (16), internet traffic (17), river flows (17), animal populations (8), insect numbers (8); (9), cell numbers (9), foreign exchange markets (11), the download numbers of Facebook applications (10), word counts of Wikipedia (12), academic papers (12), old books (12), crimes (14), and Japanese blogs (13).

Note that physicists have studied linguistic phenomena using concepts of complex systems (18) such as competitive dynamics (19), statistical laws (20), and complex networks (21). Our study can also be positioned within this context: we study properties of the time series of word counts in nationwide blogs (a linguistic phenomenon) using FS, which is one of the concepts of complex systems science or statistical physics. From this viewpoint, we can analyse fluctuations very accurately.

A certain type of FS can be explained by the random diffusion (RD) model (16). The RD model, which is described by a Poisson process with a random time-variable Poisson parameter, was introduced as a mean field approximation for a random walk on a complex network. It can be demonstrated that the fluctuation of this model obeys FS, with an exponent of 0.5 for a small system size (i.e., a small mean) or 1.0 for a large system size (i.e., a large mean). Because this model is based only on a Poisson process, it is applicable not only to random walks on complex networks, but also to a wide variety of phenomena related to random processes. For instance, this model can reproduce a type of FS concerning the appearance of words in Japanese blogs (22); (23). However, owing to the assumption of stationarity, this steady model cannot be applied to describe unsteady properties, as observed in real time series regarding the appearance of words in blogs.

There exists a pioneering study regarding the relations between FS and unsteady time series. In Ref. (17), Argollo de Menezes et al. introduced a method of separating “internal fluctuations” corresponding to individual factors and “external fluctuations” corresponding to unsteady shared factors. Moreover, they showed that there are two types of FS for internal fluctuations with exponents of 0.5 or 1.0, by applying this method to empirical data regarding internet routers (0.5), a microchip (0.5), the World Wide Web (1.0), and the highway system (1.0). However, a theoretical basis for these FSs has not been clarified.

In our study, in order to validate the model we explore not only the fluctuation scalings (a scaling between the mean and variance), but also the functional forms of probability distributions. Although the vast majority of previous theoretical and empirical studies have focused only on scalings (24); (25), there have been a few previous studies that investigated the relations between fluctuation scalings and the distributions. A. Fronczak et al. described a relationship with the canonical distribution that is deduced from the second law of thermodynamics (26). S. Wayne et al. demonstrated a relationship with the Tweedie distributions, which were introduced by Tweedie in 1984 and are related to the scale invariance of the family of probability distributions, in order to explain fluctuation scalings (27); (28). Joel E. Cohen examined a relationship with random sampling of a skewed distribution, such as the log-normal distribution (29).

In this paper, we first introduce a simple nonsteady extension of the RD model to describe nonsteady time series that obey FS, such as word appearance in blogs. Second, we derive three types of mathematical expressions for FSs in this model: the raw time series of word appearances, the time series scaled by the total number of blogs, and ensemble scalings at fixed times. In addition, we demonstrate that these expressions reproduce the empirical scalings over eight orders of magnitude, by using three billion Japanese blog articles from 2007. Furthermore, we show that the model can also reproduce other statistical properties, such as the shapes of probability density functions. Third, we apply our model to the quantification of the abnormalities of special nationwide events, and the temporal dependence of an abnormality regarding a particular word. Finally, we conclude with a discussion.

II Data set

In our data analysis, we analyse a time series representing the frequencies with which words appear in Japanese blogs per day. In order to obtain this time series, we employed a large database of Japanese blogs (“Kuchikomi@kakaricho”), which is provided by Hottolink Inc. This database contains three billion articles from Japanese blogs, covering 90 percent of Japanese blogs since November 1st 2006. Fig. 1 shows an example of the time series.

Figure 1: (a) An example of the daily time series of raw word appearances for the word “yowai” (weak). (b) The daily time series of the normalised total number of blogs (see Appendix H). (c) The daily time series of word appearances scaled by the normalised total number of blogs for “yowai” (defined by Eq. 23). (d) The differential time series of word appearances scaled by the normalised total number of blogs for “yowai” (defined by Eq. 30). From these figures, we can confirm that the time variation of the raw word appearances shown in panel (a) is almost the same as that of the total number of blogs shown in panel (b).

III Model

It was reported in Ref. (16) that the FS of a blog has two scaling regions, the exponents of which are 0.5 for a small mean region (Poisson region) and 1.0 for a large mean region (non-Poisson region), and this FS can be explained by the (steady) RD model. The RD model represents a mixture of Poisson models, and was originally introduced as a mean field approximation of the random walk on a complex network, in order to understand the FS of transport on a complex network (16). However, because the original (steady) RD model represents a steady probabilistic process, it is unable to describe non-steady effects on the FS, such as changes in the number of bloggers. Thus, we introduce and analyse a simple nonstationary extension of the RD model (the extended RD model) to describe nonstationary time series.

III.1 Steady RD model

The original (steady) RD model described in (13), (16), which is a Poisson process consisting of a stochastic process whose Poisson parameter (the mean value) varies randomly, is defined as follows for :

(1)
(2)

where is defined by a random variable that obeys the Poisson distribution and has Poisson parameter , is defined such that obeys a uniform distribution with support , and . This equation indicates that the random variable , which represents the j-th observable at time t, is sampled from a Poisson distribution with a Poisson parameter . Here, is a scale factor of the Poisson parameter of the model, and , which is related to the total number of blogs, is a random factor that obeys the independent uniform distribution defined by Eq. 2. In the case of a time series of blogs, the observable corresponds to the frequency with which the j-th word occurs on the t-th day, corresponds to the temporal mean of the frequency , and is related to the total number of blogs (scaled by its temporal mean).
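
The behaviour of the steady RD model can be illustrated numerically. The following is a minimal sketch, not the authors' code: the names `scale` and `half_width` are hypothetical, and the uniform support [1 - half_width, 1 + half_width] is an assumption made here only so that the random factor has mean 1.

```python
import numpy as np

def simulate_steady_rd(scale, half_width, n_steps, rng):
    """Draw counts from a Poisson law whose parameter is scale * factor.

    ASSUMPTION: factor is uniform on [1 - half_width, 1 + half_width],
    so it has mean 1 and variance half_width**2 / 3.
    """
    factor = rng.uniform(1.0 - half_width, 1.0 + half_width, size=n_steps)
    return rng.poisson(scale * factor)

rng = np.random.default_rng(0)
half_width = 0.3  # hypothetical value, chosen only to make both regimes visible
for scale in (0.1, 10.0, 1e4):
    counts = simulate_steady_rd(scale, half_width, 200_000, rng)
    # Law of total variance for a Poisson mixture:
    # var = scale + scale**2 * half_width**2 / 3
    predicted_var = scale + scale**2 * half_width**2 / 3.0
    print(scale, counts.mean(), counts.var(), predicted_var)
```

For small `scale` the variance is close to the mean (standard deviation exponent 0.5), while for large `scale` the quadratic term dominates (exponent 1.0).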

III.2 Extended RD model

We extend the (steady) RD model to precisely describe non-stationary effects (i.e., temporal variations in the usage of words and in the total number of blogs) as follows: (i) The scale parameter is modified from a constant to a time-varying parameter . (ii) The distribution of the random part , which is related to the total number of blogs, is modified from a steady uniform distribution to an arbitrary distribution with time-varying mean and standard deviation . Then, the extended RD model, which is a nonstationary Poisson process consisting of a stochastic process whose Poisson parameter (the mean value) varies randomly, is defined as follows for :

(3)
(4)

The first equation indicates that the random variable is sampled from a Poisson distribution whose Poisson parameter takes a value . Furthermore, is a scale factor of the Poisson parameter of the model, is a random factor, and is the total number of types of words at time t. In the case of a time series of blogs, as in the original steady RD model, the observable corresponds to the frequency with which the j-th word occurs on the t-th day, and larger values of indicate that the j-th word appears more frequently at time t.

is a non-negative random variable with mean and standard deviation . Here, is a shared time-variation factor for the whole system, and we assume that , for normalization. In the case of a time series of blogs, closely corresponds to the normalised number of blogs.

For particular settings, this corresponds to the following known models: (i) In the case that is a constant and is equal to 1, namely, the probability density function of is the delta function , this represents the steady Poisson process with the parameter . (ii) In the case that the probability density function of is the delta function , this represents the non-steady Poisson process with the parameter . (iii) In the case that and , this represents the original (steady) RD model given by (16); (13). Therefore, the proposed model (i.e., the extended RD model) represents an extension of Poisson processes and the steady RD model.
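
The two scaling regimes follow from the law of total variance. As a sketch, write F for the count, c for the scale factor, and Λ for the random factor with mean m and standard deviation Δ. These symbol names are assumed here for illustration only, and the word and time indices are dropped.

```latex
\begin{align}
\mathrm{E}[F] &= \mathrm{E}\bigl[\mathrm{E}[F \mid \Lambda]\bigr] = c\,m, \\
\mathrm{Var}[F] &= \mathrm{E}\bigl[\mathrm{Var}[F \mid \Lambda]\bigr]
                 + \mathrm{Var}\bigl[\mathrm{E}[F \mid \Lambda]\bigr]
                 = c\,m + c^{2}\Delta^{2}.
\end{align}
```

Both steps use only the fact that a Poisson variable with parameter cΛ has conditional mean and conditional variance equal to cΛ. With Δ = 0 the variance equals the mean, which is the pure Poisson case; with Δ > 0 the quadratic term dominates once c is much larger than m/Δ².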

For convenience of analysis, we assume that can be decomposed into a scale component , which corresponds to the temporal mean of the count of the j-th word in an observation period, and a time variance component , such that

(5)

Here, we also assume for normalization that . In addition, we assume for simplicity that is decomposed as

(6)

where is a real constant and is a part which does not depend on . Note that this assumption constitutes a simplification of the correlation between the mean of and the corresponding standard deviation . For instance, with the condition that the standard deviation is immutable, regardless of the mean , and with the condition that the standard deviation is proportional to the mean .

Parameter Meaning Estimation
(i)
scale factor of the j-th word
time-variant factor of the j-th word unable to estimate accurately
(ii) distribution with mean and standard deviation ,
mean of (scaled number of blogs) Appendix H
scale parameter of the standard deviation of 0.021
relation parameter between the standard deviation and the mean 1.0
For , the model does not contradict empirical data. We use for simplicity in our empirical analysis.
Exceptionally, when we calculate the density function, we estimate roughly using the moving average (Eq. 60, Eq. G.3).
Instead of a direct estimation, we use lower or upper bounds on for a rigorous analysis.
Table 1: Summary of the model and parameters
Coefficients of : Coefficients of
(i) Temporal scaling ()
1
-Lower bound 1
-Lower bound
-Lower bound
(ii) Ensemble scaling ()
1
-Lower bound 1
()
-Lower bound
*We assume that , for simplicity.
Table 2: Summary of the coefficients of the scalings of the extended RD model and corresponding lower bounds on for the conditions that , . See Appendix B.

The extended RD model we introduce is determined by five parameters, , , , , and . Hence, in our study we investigate the precise dependence of FSs and their accompanying phenomena on these five parameters. When comparing the model with empirical data, is estimated by the temporal average of the counts of the j-th word, and is estimated by the ensemble median of the counts of the words at time t, as described in Appendix H. (We assume that is constant for simplicity of empirical analysis.) For , the model does not contradict empirical data, and it is not easy to distinguish between different values of the parameter . Thus, we set in this paper for simplicity. A summary of the parameters of the model is presented in Table 1.
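
The estimation steps named above (a temporal average per word and an ensemble median per day) can be sketched as follows. This is a hedged illustration, not the procedure of Appendix H, whose details are not reproduced in this section: the function names are hypothetical, and dividing each count by the word's temporal mean before taking the median is an assumption made here so that words of different sizes are comparable.

```python
import numpy as np

def estimate_scale_factors(counts):
    """Temporal average of each word's daily counts (rows: words, columns: days)."""
    return np.asarray(counts, dtype=float).mean(axis=1)

def estimate_normalised_total(counts):
    """ASSUMED recipe: per-day median over words of count / temporal mean,
    rescaled so that its own temporal mean equals 1."""
    counts = np.asarray(counts, dtype=float)
    scale = estimate_scale_factors(counts)
    usable = scale > 0                      # skip words that never appear
    relative = counts[usable] / scale[usable, None]
    daily_median = np.median(relative, axis=0)
    return daily_median / daily_median.mean()

# Synthetic check: Poisson counts driven by a known shared factor.
rng = np.random.default_rng(0)
days = np.arange(400)
true_total = 1.0 + 0.3 * np.sin(2 * np.pi * days / 50)   # temporal mean 1
true_scale = np.logspace(2, 4, 300)                       # synthetic word scales
counts = rng.poisson(true_scale[:, None] * true_total[None, :])
scale_hat = estimate_scale_factors(counts)
total_hat = estimate_normalised_total(counts)
```

The median is used rather than the mean so that a few words with bursts of their own do not distort the estimate of the shared factor.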

Note that cannot accurately be estimated using data. Therefore, in order to analyse the data rigorously, we do not perform a direct estimation of this parameter. Instead of a direct estimation, we compare the theory and data by considering lower or upper bounds on . Exceptionally, when we calculate density functions we roughly estimate using the moving average (see Eq. 60 and Eq. G.3).

III.3 Derivation of the extended RD model from a micro model of blogger behaviour

The extended RD model (macro model) described above can be deduced from a model that describes simplified heterogeneous blogger behaviour. The details of the derivation are provided in Appendix A. In this section, we only present the results.

The random blogger model (the micro model)

We consider the following model for a system consisting of bloggers and words. The bloggers perform the following behaviour from time to :

  1. The i-th blogger writes his or her blog randomly with probability () . Here, by we denote the set of bloggers who write their blogs at time t.

  2. The i-th blogger who writes a blog in step 1 writes the j-th word times, where is sampled from a Poisson distribution with Poisson parameter .

  3. The total count of the j-th word in the system is calculated by , and the total number of bloggers who write blogs in the system is .
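
The three steps above can be sketched as a short simulation for a single word. The names `write_prob` and `word_rate`, and the distributions used to draw them, are hypothetical choices made here for illustration; the source does not specify them in this section.

```python
import numpy as np

def simulate_bloggers(write_prob, word_rate, n_steps, rng):
    """Random blogger sketch for a single word.

    write_prob[i]: probability that blogger i posts at a given time step.
    word_rate[i]:  Poisson parameter for how often blogger i uses the word in a post.
    Returns (word_counts, active_bloggers), each of length n_steps.
    """
    n_bloggers = len(write_prob)
    word_counts = np.empty(n_steps, dtype=np.int64)
    active_bloggers = np.empty(n_steps, dtype=np.int64)
    for t in range(n_steps):
        # Step 1: each blogger independently decides whether to post.
        active = rng.random(n_bloggers) < write_prob
        # Step 2: each active blogger writes the word a Poisson number of times.
        per_blogger = rng.poisson(word_rate[active])
        # Step 3: aggregate over the whole system.
        word_counts[t] = per_blogger.sum()
        active_bloggers[t] = active.sum()
    return word_counts, active_bloggers

rng = np.random.default_rng(0)
n_bloggers = 5_000
write_prob = rng.uniform(0.05, 0.5, size=n_bloggers)               # hypothetical heterogeneity
word_rate = rng.lognormal(mean=-3.0, sigma=1.0, size=n_bloggers)   # hypothetical heterogeneity
counts, active = simulate_bloggers(write_prob, word_rate, 365, rng)
```

Setting all entries of `write_prob` and of `word_rate` to common values gives the homogeneous case, which can be compared against the heterogeneous run.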

The extended RD model deduced from the blogger model

Here, we consider the probability distribution of . Under the conditions that can be observed and , the distribution of can be approximated as follows:

(7)

Here, the scale factor is given by

(8)

and the mean and the variance of are given by

(9)
(10)
(11)
(12)

where , , and the specific form of the distribution function of is determined by the parameters and .

These equations indicate that we can connect the macro parameters of the extended RD model, given in Eqs. 3, 5, and 6, with the statistics for the parameters for micro bloggers, and .

Under the condition that , we can obtain that from Eq. 11. Thus, the model represents Poisson processes in the case of homogeneous bloggers. In other words, the condition that the model exhibits the particular properties of the RD model is that the bloggers are heterogeneous.

Note that can be deduced from the blogger model. However, as will be mentioned in the following section, empirical observations are not contradicted in the range . Thus, more precise observations will be needed in the future in order to verify that , that is, to verify the validity of the blogger model.

IV Properties of the model

In this section, we investigate the statistical properties of the extended RD model, and compare them with the corresponding properties of blog data. Note that, as mentioned above, the model does not contradict empirical data for , and it is not easy to distinguish between different values of the parameter . For simplicity, we present only the case of in this section. Discussions concerning general are given in Appendix B. (The results of this section can be obtained by substituting into the results of Appendix B.)

Table 2 presents a summary of the fluctuation scalings employed in this section.

IV.1 Temporal fluctuation scaling

First, we discuss the temporal fluctuation scaling (TFS). The TFS of the variable () is defined by the scaling between the temporal mean and the temporal variance ,

(13)

Here, the temporal mean and temporal variance are defined by

(14)
(15)

Note that the above definition of the TFS in Eq. 13 is expressed in terms of the variance, although the standard deviation is usually used in observations. Under this condition, the TFS expressed by the standard deviation can be written as . In addition, we assume in this section that for simplicity, and that the following approximations are rigorous in the limit .
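
In practice, a TFS is measured by computing the temporal mean and variance of each word's series and reading off the slope on a log-log plot. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the population variance (division by the number of days) is assumed, and the data are synthetic Poisson series rather than blog data.

```python
import numpy as np

def temporal_moments(series):
    """Temporal mean and population variance of each row (rows: words, columns: days)."""
    series = np.asarray(series, dtype=float)
    return series.mean(axis=1), series.var(axis=1)

def loglog_slope(x, y):
    """Least-squares slope of log10(y) against log10(x), using strictly positive pairs only."""
    x, y = np.asarray(x), np.asarray(y)
    keep = (x > 0) & (y > 0)
    slope, _intercept = np.polyfit(np.log10(x[keep]), np.log10(y[keep]), 1)
    return slope

rng = np.random.default_rng(0)
true_means = np.logspace(-1, 3, 200)                          # synthetic words
poisson_series = rng.poisson(true_means[:, None], size=(200, 2000))
mean, var = temporal_moments(poisson_series)
variance_exponent = loglog_slope(mean, var)
print(variance_exponent)  # exponent for the variance; halve it for the standard deviation
```

A single fitted slope is only meaningful within one scaling region; for data with a crossover, the fit should be restricted to a range of means on one side of it.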

TFS of raw data

Here, we investigate the TFS of the time series of the raw counts of word appearances (see Fig. 1(a)), which is determined by the steady RD model described in Ref. (22). From Eq. LABEL:V_F_ex in Appendix B, we can obtain the temporal mean

(16)

and the temporal variance

By inserting Eq. 16 into Eq. 17, we obtain the relationship between the variance and the mean of given by

In the case that (), which indicates that the j-th word is steady, from Eq. 17 we can also obtain the following simpler expression:

(19)

which can be written as a function of the mean as

In addition, because this equation also gives a lower bound on over . From this result, we can deduce the following scaling relations:

(21)

and the corresponding scaling as a function of the mean is written as

(22)

where the conditions of the scaling with a single exponent are only that and . That is, the time series is a steady Poisson process (i.e., and ).

From Fig. 2 (a), we can confirm that the theoretical curve given by Eq. 20 is in agreement with the lower bound of empirical data. Here, because cannot accurately be estimated using data, we consider the lower bound for comparing the theory with empirical data. Similarly, in later sections we will also employ lower or upper bounds.

Note that the steady RD model given by Eq. 1 explains the temporal FS of raw data by approximating the unsteady time series of blogs using a steady time series (i.e., and ). However, the model cannot consistently explain certain properties under this condition, which will be discussed later.

Figure 2: (a) TFS of raw time series, . The data shown are empirical results of 1771 adjectives (black triangles), corresponding to the theoretical curve given by Eq. 19 or Eq. 20 (red dashed line). The blue dash-dotted line indicates , which corresponds to the Poisson distribution. (b) The correlations of raw time series between individual time series and the scaled total number of blogs (i.e., the shared Poisson parameter). The data shown are empirical results of 1771 adjectives (black plus triangles), the theoretical curve given by Eq. 56 (red dashed line), and the theoretical upper bound, which is considered for the finite number of observables given in Eq. LABEL:cor_up (grey dash-dotted line). From these panels, we can confirm that the empirical data are in good agreement with the theoretical curve.

TFS of the data scaled by the total number of blogs

Figure 3: (a) TFS of normalised time series, ( is defined by Eq. 23). The data shown are empirical results of 1771 adjectives (black triangles), and the corresponding results of theoretical curves given by Eq. 27 or Eq. 28 (red dashed line). The green dash-dotted line indicates the correction of the theoretical lower bound, which is considered in the nonstationary given by Eq. 29. In addition, the thin blue dash-dotted line indicates , and the thin grey dashed line is the theoretical curve of the raw time series given by Eq. 20. (b) The TFS of the differential of the time series of word appearances ( is defined by Eq. 30). The black triangles indicate the actual data, the red dashed line indicates the theoretical lower bound given by Eq. 32 or Eq. 33, and the blue dash-dotted line indicates . From panel (b), we can confirm that the theoretical curve is in accordance with the empirical data.

Next, we discuss the time series of word appearances scaled by the total numbers of blogs

(23)

In practice, this value is used to estimate , which corresponds to the original time deviation of the j-th word separated from the effects of deviations in the total number of blogs (see Figs. 1(b) and (c)). From Eq. LABEL:V_tilde_F_ex in Appendix B, we can obtain that

(24)

Thus, by combining Eq. 24 and Eq. LABEL:V_tilde_F we obtain the relationship between the variance and mean of given by

In the case that or , which means that the j-th word is steady, we can obtain the simple expression

(27)

and the corresponding variance can be written as a function of the mean as

(28)

where because , this equation also gives a lower bound on .

Fig. 3 (a) shows a comparison between Eq. 28 and the corresponding data. From this figure, we can confirm that the lower bound of the data disagrees slightly with Eq. 28 as given by the red dashed curve. The reason for this disagreement is that cannot be neglected for all (by comparison ). In fact, the corrected lower bound with respect to (green dashed-dotted line) is given by

(29)

which is confirmed to be in agreement with the actual data.

Here, we assume that is roughly estimated by . Note that we limit the words by in order to focus on the words that are dominant in the term of in Eq. LABEL:V_tilde_F.

Next, we investigate the temporal FS of the time difference of a scaled time series of word appearances (see Fig. 1(d)),

(30)

By taking the difference, we can reduce the nonstationary effects on temporal FSs.
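
The two preprocessing steps used here, scaling by the normalised total number of blogs and then differencing, can be sketched as below. The function names are hypothetical, and the plain first difference x(t) - x(t-1) is an assumption: the exact form of Eq. 30 is not reproduced in this section and may include additional factors.

```python
import numpy as np

def scale_by_total(raw_counts, normalised_total):
    """Divide each day's count by the normalised total number of blogs on that day."""
    return np.asarray(raw_counts, dtype=float) / np.asarray(normalised_total, dtype=float)

def first_difference(scaled):
    """Plain first difference x(t) - x(t-1) along the time axis (ASSUMED form)."""
    return np.diff(np.asarray(scaled, dtype=float), axis=-1)

raw = np.array([10.0, 22.0, 36.0])      # toy daily counts of one word
total = np.array([1.0, 1.1, 1.2])       # toy normalised total number of blogs
scaled = scale_by_total(raw, total)     # removes the shared growth in blog volume
diffed = first_difference(scaled)       # removes the word's own slow trend
```

Differencing shortens the series by one point, so the temporal mean used as the horizontal axis of the scaling plot is still taken from the undifferenced scaled series.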

From Eq. LABEL:delta_V_mf_a in Appendix B, the variance is given as follows:

(31)

Using the facts that and , we can obtain the lower bound

and by using Eq. 24, we can obtain the corresponding lower bound as a function of the mean as

(33)

The condition of scaling with a single exponent is that . That is, the Poisson parameter does not fluctuate. In terms of the blogger model, this condition also corresponds to the case that bloggers are homogeneous.

From Fig. 3 (b), we can confirm that the theoretical curve given by Eq. 32 or Eq. 33 is in agreement with the corresponding lower bound of the empirical data.

IV.2 Ensemble fluctuation scaling

Figure 4: The box ensemble scaling of the differential of the time series of word appearances , where (defined by Eq. 45). (a) The box EFSs: 06.07.2007 (black triangles) for a typical date; 26.06.2010 (red circles) in the case of a special nationwide event, the FIFA World Cup. The corresponding theoretical lower bounds are given by Eq. 47 or Eq. 48: the black solid line for 06.07.2007 and the red dashed line for 26.06.2010. The thick purple dashed line indicates the theoretical curve in the case of the box size , as given by Eq. 49 or Eq. 50, and the thin blue dash-dotted line is . (b) The corresponding figure for the Great East Japan Earthquake on 11.03.2011 (black triangles), 16.03.2011 (red circles), and 29.03.2011 (green circles); the inset shows the deviations from theory, with as defined by Eq. 52, where vertical lines correspond to the dates of the main figure. (c) The time series of deviations from the theoretical lower bound defined by Eq. 52 (i.e., the ensemble abnormality ) is shown by the black line. The vertical lines indicate the dates shown in panels (a) and (b): 06.07.2007 (the red dashed line), 26.06.2010 (the green dash-dotted line), and 11.03.2011, 16.03.2011, and 29.03.2011 (the blue dash-double-dotted lines). The horizontal lines are the 50th percentile of the set (the red dashed line) and the 90th percentile (the green dash-dotted line). (d) The relation between the deviation and the EFSs. The median of the EFS regarding the dates on which is in the lower 50th percentile , defined by Eq. 53, where we use dates under the 50th percentile as shown by the horizontal red dashed line in panel (c). The error bars correspond to the 25th percentile and the 75th percentile, and . The green lines indicate the corresponding statistics of the upper 90th percentile, , for which we use the dates above the 90th percentile shown by the horizontal green dash-dotted line in panel (c). The thick purple dashed line indicates the theoretical curve given by Eq. 49, and the thin blue dash-dotted line indicates .

Next, we consider ensemble fluctuation scaling (EFS). The EFS is defined by a power-law scaling between the ensemble mean and the ensemble variance at a fixed time, . Here, the ensemble mean and the ensemble variance of the value of at time are defined as follows:

(34)
(35)

where we take the ensemble of words whose temporal means take the same particular value, .

Note that in the actual data analysis, we approximate the ensemble scalings by the following box ensemble scaling:

(36)
(37)

where the box ensemble scaling is in agreement with the ensemble scaling in the case that . In this paper, we assume that with .
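
A box ensemble statistic of this kind can be sketched as follows: words are grouped into boxes according to their temporal means, and the mean and variance across the words in each box are taken at one fixed time. The function name, the half-open box convention, and the requirement of at least two words per box are assumptions made here for illustration; the source's own box definition (Eqs. 36 and 37) is not reproduced in this section.

```python
import numpy as np

def box_ensemble_moments(values_at_t, temporal_means, edges):
    """Ensemble mean and variance of values_at_t within boxes of temporal_means.

    edges: increasing box boundaries; box k collects the words with
    edges[k] <= temporal mean < edges[k + 1]. Words outside all boxes are ignored.
    """
    values_at_t = np.asarray(values_at_t, dtype=float)
    box = np.digitize(temporal_means, edges) - 1
    n_boxes = len(edges) - 1
    ens_mean = np.full(n_boxes, np.nan)
    ens_var = np.full(n_boxes, np.nan)
    for k in range(n_boxes):
        members = values_at_t[box == k]
        if members.size >= 2:            # a variance needs at least two words
            ens_mean[k] = members.mean()
            ens_var[k] = members.var()
    return ens_mean, ens_var

values = np.array([1.0, 3.0, 10.0, 14.0, 100.0])     # toy values of five words at one time
means = np.array([1.0, 2.0, 10.0, 20.0, 1000.0])     # their temporal means
box_mean, box_var = box_ensemble_moments(values, means, edges=[1.0, 10.0, 100.0])
```

Narrower boxes approximate the exact ensemble scaling more closely but leave fewer words per box, so the choice of box width trades bias against statistical noise.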

From Eq. B.81 in Appendix B, the box ensemble mean and the variance of are given as:

(38)
(39)

Here, we assume that . Thus, by combining Eq. 38 and Eq. 39 we obtain the relationship between the ensemble variance and ensemble mean of given by

(40)

By using the fact that , we can obtain the following theoretical lower bound over :

(41)

and as a function of the mean,

In addition, in the case of the ensemble scaling, meaning , we can obtain the simpler relationship

(43)

and the corresponding expression written by the mean,

(44)

Here, cannot be neglected in the observations, in the same manner as the temporal scaling . Therefore, we also consider an ensemble scaling of the differential of appearances of a word, for comparing data.

From Eq. B.95 in Appendix B, the ensemble scalings of the differential of the time series of word appearances,

(45)

is given by

(46)

where is defined by and we assume that is constant with and . By using the facts that , , and , we can obtain the lower bound

and the corresponding expression as a function of the mean is written as

In addition, in the case of ensemble scaling, meaning