Statistical properties of fluctuations of time series representing appearances of words in nationwide blog data and their applications: An example of modelling fluctuation scalings of nonstationary time series
To elucidate the non-trivial empirical statistical properties of fluctuations of a typical non-steady time series representing the appearance of words in blogs, we investigate approximately three billion Japanese blog articles over a period of six years and analyse some corresponding mathematical models. First, we introduce a solvable non-steady extension of the random diffusion model, which can be deduced by modelling the behaviour of heterogeneous random bloggers. Next, we deduce theoretical expressions for both the temporal and ensemble fluctuation scalings of this model, and demonstrate that these expressions can reproduce all empirical scalings over eight orders of magnitude. Furthermore, we show that the model can reproduce other statistical properties of time series representing the appearance of words in blogs, such as functional forms of the probability density and correlations in the total number of blogs. As an application, we quantify the abnormality of special nationwide events by measuring the fluctuation scalings of 1771 basic adjectives.
pacs:89.75.Da, 89.65.Ef, 89.20.Hh
In order to understand human behaviour with high accuracy, the use of data from social media is rapidly spreading in both practical applications (such as marketing, television shows, politics, and finance) and basic sciences (such as sociology, physics, psychology, and information science) (1); (2); (3); (4); (5); (6); (7). In such analyses of social media data, one of the most important basic objects is the time series representing the appearance of considered keywords, that is, a sequence of daily counts of the appearances of a considered word within a huge social media data set. This quantity is mostly used to measure temporal changes in social concerns related to the considered word.
Our research focuses on the “fluctuation” (i.e., occurrence of random noise) in the time series. We aim to describe this fluctuation precisely, whereas the majority of previous research has focused on “trends” in the time series (i.e., nonrandom parts of the time series) for practical reasons. The reasons why we focus on fluctuation are as follows: (i) The information regarding noise is important for extracting essential information from the data in precise observations. For example, this can be used to eliminate noise, detect anomalies, etc. (ii) The fluctuation of a time series of social media data obeys a statistical law known as “fluctuation scaling”, which can be observed in various complex systems relating to both natural and human phenomena (8); (9); (10); (11); (12); (13); (14). Thus, it is also important to understand the properties of fluctuation in social media data in the context of general complex systems science or physical sciences.
Fluctuation scaling (FS), which is also known as “Taylor’s law” (15) in ecology, is a power law relation between the system size (e.g., a mean) and the magnitude of fluctuation (e.g., a standard deviation). FS is observed in various complex systems, such as random walks on a complex network (16), internet traffic (17), river flows (17), animal populations (8), insect numbers (8); (9), cell numbers (9), foreign exchange markets (11), the download numbers of Facebook applications (10), word counts of Wikipedia (12), academic papers (12), old books (12), crimes (14), and Japanese blogs (13).
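As a rough illustration of how an FS exponent is read off in practice, the following Python sketch fits the slope of the logarithm of the standard deviation against the logarithm of the mean by ordinary least squares. The function name `fs_exponent` and the synthetic inputs are illustrative assumptions and are not part of the analysis reported in this paper.

```python
import math

def fs_exponent(means, sds):
    """Least-squares slope of log(sd) against log(mean)."""
    xs = [math.log(m) for m in means]
    ys = [math.log(s) for s in sds]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Synthetic inputs (hypothetical): a Poisson-like system where sd = mean**0.5,
# and a system whose sd is proportional to the mean.
means = [1.0, 10.0, 100.0, 1000.0]
poisson_like_sds = [m ** 0.5 for m in means]
proportional_sds = [0.2 * m for m in means]

alpha_small = fs_exponent(means, poisson_like_sds)
alpha_large = fs_exponent(means, proportional_sds)
print(alpha_small, alpha_large)
```

On real data the fit would normally be restricted to a range of means in which a single scaling regime holds, since a crossover between regimes biases a global slope.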
Note that physicists have studied linguistic phenomena using concepts of complex systems (18) such as competitive dynamics (19), statistical laws (20), and complex networks (21). Our study can also be positioned within this context: we study properties of the time series of word counts in nationwide blogs (a linguistic phenomenon) using FS, which is one of the concepts of complex systems science or statistical physics. From this viewpoint, fluctuations can be analysed precisely.
A certain type of FS can be explained by the random diffusion (RD) model (16). The RD model, which is described by a Poisson process with a random time-variable Poisson parameter, has been introduced as a mean field approximation for a random walk on a complex network. It can be demonstrated that the fluctuation of this model obeys FS, with an exponent of 0.5 for a small system size (i.e., a small mean) or 1.0 for a large system size (i.e., a large mean). Because this model is based only on a Poisson process, it is not only applicable to random walks on complex networks, but also to a wide variety of phenomena related to random processes. For instance, this model can reproduce a type of FS concerning the appearance of words in Japanese blogs (22); (23). However, owing to the assumption of stationarity for the RD model, this steady model cannot be applied to describe unsteady properties, as observed in real time series regarding the appearance of words in blogs.
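The small-mean and large-mean exponents of the RD model, 0.5 and 1.0, follow from the law of total variance. The following worked derivation is a sketch in notation introduced here purely for illustration: $F$ denotes the count, $c$ the scale factor, and $\Lambda$ the random factor, assumed to have unit mean and variance $\sigma^{2}$. These symbols are assumptions and may differ from the notation of the models defined later.

```latex
\begin{align}
F \mid \Lambda &\sim \mathrm{Poisson}(c\,\Lambda), \qquad
  E[\Lambda] = 1, \quad \mathrm{Var}[\Lambda] = \sigma^{2}, \\
E[F] &= c\,E[\Lambda] = c, \\
\mathrm{Var}[F] &= E\bigl[\mathrm{Var}[F \mid \Lambda]\bigr]
  + \mathrm{Var}\bigl[E[F \mid \Lambda]\bigr]
  = c + c^{2}\sigma^{2}, \\
\sqrt{\mathrm{Var}[F]} &\approx
\begin{cases}
  E[F]^{0.5}, & c \ll 1/\sigma^{2}, \\
  \sigma\,E[F]^{1.0}, & c \gg 1/\sigma^{2}.
\end{cases}
\end{align}
```

The Poisson term dominates for small means and the fluctuation of the Poisson parameter dominates for large means, with the crossover near $c \approx 1/\sigma^{2}$.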
There exists a pioneering study regarding the relations between FS and unsteady time series. In Ref. (17), Argollo de Menezes et al. introduced a method of separating “internal fluctuations” corresponding to individual factors and “external fluctuations” corresponding to unsteady shared factors. Moreover, they showed that there are two types of FS for internal fluctuations with exponents of 0.5 or 1.0, by applying this method to empirical data regarding internet routers (0.5), a microchip (0.5), the World Wide Web (1.0), and the highway system (1.0). However, a theoretical basis for these FSs has not been clarified.
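To make the idea of separating shared and individual factors concrete, the following Python sketch implements one common form of such a decomposition: the external part of each component is its long-run share of the total activity multiplied by the instantaneous system-wide total, and the internal part is the remainder. The exact formula, the function name `split_internal_external`, and the toy data are our assumptions for illustration and should be checked against Ref. (17) before use.

```python
def split_internal_external(series):
    """series[i][t] is the activity of component i at time t.

    Returns (internal, external) with the same shape as the input.
    Assumed decomposition: external_i(t) = share_i * total(t), where share_i
    is the fraction of all activity carried by component i over the period.
    """
    n_steps = len(series[0])
    total_t = [sum(row[t] for row in series) for t in range(n_steps)]
    grand_total = sum(total_t)
    internal, external = [], []
    for row in series:
        share = sum(row) / grand_total
        ext = [share * tot for tot in total_t]
        external.append(ext)
        internal.append([x - e for x, e in zip(row, ext)])
    return internal, external

# Toy data (hypothetical): both components follow one shared factor exactly,
# so the internal fluctuations vanish.
shared_only = [[2.0, 4.0, 6.0], [1.0, 2.0, 3.0]]
internal_a, external_a = split_internal_external(shared_only)

# Toy data with individual variation.
mixed = [[3.0, 1.0, 4.0], [1.0, 5.0, 9.0], [2.0, 6.0, 5.0]]
internal_b, external_b = split_internal_external(mixed)
print(internal_a, internal_b)
```

By construction, the internal part of every component sums to zero over the observation period, so it measures deviations from the shared trend only.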
In our study, in order to validate the model we explore not only the fluctuation scalings (a scaling between the mean and variance), but also the functional forms of probability distributions. Although the vast majority of previous theoretical and empirical studies have focused only on scalings (24); (25), there have been a few previous studies that investigated the relations between fluctuation scalings and the distributions. A. Fronczak et al. described a relationship with the canonical distribution that is deduced from the second law of thermodynamics (26). S. Wayne et al. demonstrated a relationship with the Tweedie distributions, which were introduced by Tweedie in 1984 in order to explain fluctuation scalings and are related to the scale invariance of the family of probability distributions (27); (28). J. E. Cohen examined a relationship with random sampling of a skewed distribution, such as the log-normal distribution (29).
In this paper, we first introduce a simple nonsteady extension of the RD model to describe nonsteady time series that obey FS, such as word appearance in blogs. Second, we derive three types of mathematical expressions for FSs in this model: the raw time series of word appearances, the time series scaled by the total number of blogs, and ensemble scalings at fixed times. In addition, we demonstrate that these expressions reproduce the empirical scalings over eight orders of magnitude, by using approximately three billion Japanese blog articles from 2007. Furthermore, we show that the model can also reproduce other statistical properties, such as the shapes of probability density functions. Third, we apply our model to the quantification of the abnormalities of special nationwide events, and the temporal dependence of an abnormality regarding a particular word. Finally, we conclude with a discussion.
II Data set
In our data analysis, we analyse a time series representing the frequencies with which words appear in Japanese blogs per day. In order to obtain this time series, we employed a large database of Japanese blogs (“Kuchikomi@kakaricho”), which is provided by Hottolink Inc. This database contains three billion articles from Japanese blogs, covering 90 percent of Japanese blogs since 1 November 2006. Fig. 1 shows an example of the time series.
It was reported in Ref. (16) that the FS of a blog has two scaling regions, the exponents of which are 0.5 for a small mean region (Poisson region) and 1.0 for a large mean region (non-Poisson region), and this FS can be explained by the (steady) RD model. The RD model represents a mixture of Poisson models, and was originally introduced from a mean field average approximation of the random walk on a complex network, in order to understand the FS of transport on a complex network (16). However, because the original (steady) RD model represents a steady probabilistic process, it is unable to describe non-steady effects on the FS, such as changes in the number of bloggers. Thus, for a theoretical analysis we introduce and analyse a simple nonstationary extension of the RD model (extended RD model) to describe nonstationary time series.
III.1 Steady RD model
The original (steady) RD model described in (13), (16), which is a Poisson process consisting of a stochastic process whose Poisson parameter (the mean value) varies randomly, is defined as follows for :
where is defined by a random variable that obeys the Poisson distribution and has Poisson parameter , is defined such that obeys a uniform distribution with support , and . This equation indicates that the random variable , which represents the j-th observable at time , is sampled from a Poisson distribution with a Poisson parameter . Here, is a scale factor of the Poisson parameter of the model, and , which is related to the total number of blogs, is a random factor that obeys the independent uniform distribution defined by Eq. 2. In the case of a time series of blogs, the observable corresponds to the frequency with which the j-th word occurs on the t-th day, corresponds to the temporal mean of the frequency , and is related to the total number of blogs (scaled by its temporal mean).
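The behaviour of the steady RD model can be checked numerically. The following Python sketch samples counts from a Poisson distribution whose parameter is a scale factor multiplied by a uniform random factor, and compares the sample variance with the value given by the law of total variance. The support of the uniform factor, taken here as the interval from `1 - delta` to `1 + delta`, the names `c` and `delta`, and all numerical values are assumptions made for illustration.

```python
import math
import random
import statistics

def poisson(lam, rng):
    """Knuth's multiplication method; adequate for lam below a few hundred."""
    limit = math.exp(-lam)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

def simulate_steady_rd(c, delta, n_steps, rng):
    """Counts F(t) ~ Poisson(c * L(t)) with L(t) ~ Uniform(1 - delta, 1 + delta)."""
    return [poisson(c * rng.uniform(1.0 - delta, 1.0 + delta), rng)
            for _ in range(n_steps)]

rng = random.Random(0)
delta = 0.3  # hypothetical half-width of the uniform factor
results = {}
for c in (5.0, 400.0):  # one small-mean and one large-mean word
    xs = simulate_steady_rd(c, delta, 5000, rng)
    # Law of total variance: Var[F] = c + c^2 * Var[L], with Var[L] = delta^2 / 3.
    predicted_var = c + (c * delta) ** 2 / 3.0
    results[c] = (statistics.fmean(xs), statistics.pvariance(xs), predicted_var)
    print(c, results[c])
```

For the small scale factor the variance stays close to the mean (the Poisson regime), whereas for the large scale factor the term quadratic in the mean dominates.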
III.2 Extended RD model
We extend the (steady) RD model to precisely describe non-stationary effects (i.e., time-variances in the usages of words and the total number of blogs) as follows: (i) The scale parameter is modified from a constant to a time-varying parameter . (ii) The distribution of the random part , which is related to the total number of blogs, is modified from a steady uniform distribution to an arbitrary distribution with time-varying mean and standard deviation . Then, the extended RD model, which is a nonstationary Poisson process consisting of a stochastic process whose Poisson parameter (the mean value) varies randomly, is defined as follows for :
The first equation indicates that the random variable is sampled from a Poisson distribution whose Poisson parameter takes a value . Furthermore, is a scale factor of the Poisson parameter of the model, is a random factor, and is the total number of types of words at time . In the case of a time series of blogs, as in the original steady RD model, the observable corresponds to the frequency with which the j-th word occurs on the t-th day, and larger values of indicate that the j-th word appears more frequently at time .
is a non-negative random variable with mean and standard deviation . Here, is a shared time-variation factor for the whole system, and we assume that , for normalization. In the case of a time series of blogs, closely corresponds to the normalised number of blogs.
For particular settings, this corresponds to the following known models: (i) In the case that is a constant and is equal to 1, namely, the probability density function of is the delta function , this represents the steady Poisson process with the parameter . (ii) In the case that the probability density function of is the delta function , this represents the non-steady Poisson process with the parameter . (iii) In the case that and , this represents the original (steady) RD model given by (16); (13). Therefore, the proposed model (i.e., the extended RD model) represents an extension of Poisson processes and the steady RD model.
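The extended RD model can be explored with a short simulation. The following Python sketch draws the random factor from a gamma distribution with a prescribed time-varying mean and standard deviation, then draws the count from a Poisson distribution. The gamma law, the power-law link between the standard deviation and the mean, the names `delta0` and `beta`, and the linear growth of the number of blogs are all assumptions for illustration; the values 0.021 and 1.0 are the estimates listed in the parameter table.

```python
import math
import random
import statistics

def poisson(lam, rng):
    """Knuth's multiplication method; adequate for lam below a few hundred."""
    limit = math.exp(-lam)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

def sample_factor(mean, sd, rng):
    """Non-negative random factor with the given mean and sd (gamma law assumed)."""
    if sd == 0.0:
        return mean  # delta-function case: a plain non-steady Poisson process
    shape = (mean / sd) ** 2
    scale = sd * sd / mean
    return rng.gammavariate(shape, scale)

def simulate_extended_rd(c_t, m_t, delta0, beta, rng):
    """Counts for one word; sd of the factor is assumed to be delta0 * m**beta."""
    counts = []
    for c, m in zip(c_t, m_t):
        factor = sample_factor(m, delta0 * m ** beta, rng)
        counts.append(poisson(c * factor, rng))
    return counts

rng = random.Random(1)
n_steps = 4000
# Hypothetical slow growth in the number of blogs, normalised to unit temporal mean.
raw = [1.0 + 0.5 * t / n_steps for t in range(n_steps)]
raw_mean = statistics.fmean(raw)
m_t = [x / raw_mean for x in raw]
c_t = [20.0] * n_steps  # a word whose own usage is steady

counts = simulate_extended_rd(c_t, m_t, 0.021, 1.0, rng)
scaled = [f / m for f, m in zip(counts, m_t)]
print(statistics.fmean(counts), statistics.pvariance(counts), statistics.pvariance(scaled))
```

Setting `delta0` to zero reduces the sketch to the non-steady Poisson process of case (ii), and a constant `m_t` with a uniform factor would correspond to the steady RD model of case (iii).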
For convenience of analysis, we assume that can be decomposed into a scale component , which corresponds to the temporal mean of the count of the j-th word in an observation period, and a time variance component , such that
Here, we also assume for normalization that . In addition, we assume for simplicity that is decomposed as
where is a real constant and is a part which does not depend on . Note that this assumption constitutes a simplification of the correlation between the mean of and the corresponding standard deviation . For instance, with the condition that the standard deviation is immutable, regardless of the mean , and with the condition that the standard deviation is proportional to the mean .
|•||scale factor of the -th word|
|•||time-variant factor of the -th word||unable to estimate accurately|
|(ii)||distribution with mean and standard deviation ,|
|•||mean of (scaled number of blogs)||Appendix H|
|•||scale parameter of the standard deviation of||0.021|
|•||relation parameter between the standard deviation and the mean||1.0|
|For , the model does not contradict empirical data. We use for simplicity in our empirical analysis.|
|Exceptionally, when we calculate the density function, we estimate roughly using the moving average (Eq. 60, Eq. G.3).|
|Instead of a direct estimation, we use lower or upper bounds on for a rigorous analysis.|
|Coefficients of||: Coefficients of|
|*We assume that , for simplicity.|
The extended RD model we introduce is determined by five parameters, , , , , and . Hence, in our study we investigate the precise dependence of FSs and their accompanying phenomena on these five parameters. When comparing the model with empirical data, is estimated by the temporal average of counts of the j-th word ; is estimated by the ensemble median of counts of the words at time t, as described in Appendix H; and (we assume that is constant for simplicity of empirical analysis). For , the model does not contradict empirical data, and it is not easy to distinguish between different values of the parameter . Thus, we set in this paper for simplicity. A summary of the parameters of the model is presented in Table 1.
Note that cannot accurately be estimated using data. Therefore, in order to analyse the data rigorously, we do not perform a direct estimation of this parameter. Instead of a direct estimation, we compare the theory and data by considering lower or upper bounds on . Exceptionally, when we calculate density functions we roughly estimate using the moving average (see Eq. 60 and Eq. G.3).
III.3 Derivation of the extended RD model from a micro model of blogger behaviour
The extended RD model (macro model) described above can be deduced from a model that describes simplified heterogeneous blogger behaviour. The details of the derivation are provided in Appendix A. In this section, we only present the results.
The random blogger model (the micro model)
We consider the following model for a system consisting of bloggers and words. The bloggers perform the following behaviour from time to :
1. The i-th blogger writes a blog at random with probability () . Here, by we denote the set of bloggers who write a blog at time t.
2. The i-th blogger who writes a blog in step 1 writes the j-th word times. is sampled from a Poisson distribution with Poisson parameter .
3. The total count of the j-th word in the system is calculated by , and the total number of bloggers who write blogs in the system is .
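The steps of the random blogger model can be sketched as a direct simulation for a single word. In the following Python sketch, each blogger has an individual writing probability and an individual Poisson parameter for the word. The names `p` and `lam`, the assumption that both are constant in time, and the ranges from which they are drawn are hypothetical choices for illustration.

```python
import math
import random
import statistics

def poisson(lam, rng):
    """Knuth's multiplication method; adequate for small lam."""
    limit = math.exp(-lam)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

def simulate_bloggers(p, lam, n_steps, rng):
    """Return (word_counts, n_writers), one entry per time step, for one word."""
    word_counts, n_writers = [], []
    for _ in range(n_steps):
        # Step 1: blogger i writes a blog with probability p[i].
        writers = [i for i, p_i in enumerate(p) if rng.random() < p_i]
        # Steps 2 and 3: each writer uses the word a Poisson number of times,
        # and the system-wide count is the sum over writers.
        word_counts.append(sum(poisson(lam[i], rng) for i in writers))
        n_writers.append(len(writers))
    return word_counts, n_writers

rng = random.Random(2)
n_bloggers = 100
# Heterogeneous bloggers (hypothetical ranges).
p = [rng.uniform(0.05, 0.3) for _ in range(n_bloggers)]
lam = [rng.uniform(0.1, 2.0) for _ in range(n_bloggers)]

word_counts, n_writers = simulate_bloggers(p, lam, 1000, rng)
expected_count = sum(p_i * l_i for p_i, l_i in zip(p, lam))
expected_writers = sum(p)
print(statistics.fmean(word_counts), expected_count)
print(statistics.fmean(n_writers), expected_writers)
```

Replacing the heterogeneous lists with identical values for every blogger gives the homogeneous case, which can be used to compare the two settings numerically.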
The extended RD model deduced from the blogger model
Here, we consider the probability distribution of . Under the conditions that can be observed and , the distribution of can be approximated as follows:
Here, the scale factor is given by
and the mean and the variance of are given by
where , , and the specific form of the distribution function of is determined by the parameters and .
Under the condition that , we can obtain that from Eq. 11. Thus, the model represents Poisson processes in the case of homogeneous bloggers. In other words, the condition that the model exhibits the particular properties of the RD model is that the bloggers are heterogeneous.
Note that the fact that can be deduced from the blogger model. However, as will be mentioned in the following section, empirical observations are not contradicted in the range . Thus, more precise observations will be needed in the future in order to verify that , that is, to verify the validity of the blogger model.
IV Properties of the model
In this section, we investigate the statistical properties of the extended RD model, and compare them with the corresponding properties of blog data. Note that as mentioned above, the model does not contradict empirical data for , and it is not easy to distinguish between values of the parameter . For simplicity, we present only the case of in this section. Discussions concerning general are given in Appendix B. (The results of this section can be obtained by substituting into the results of Appendix B.)
Table 2 presents a summary of the fluctuation scalings employed in this section.
IV.1 Temporal fluctuation scaling
First, we discuss the temporal fluctuation scaling (TFS). The TFS of the variable () is defined by the scaling between the temporal mean and the temporal variance ,
Here, the temporal mean and temporal variance are defined by
Note that the above definition of the TFS in Eq. 13 is expressed in terms of the variance, although the standard deviation is usually used in observations. Under this condition, the TFS expressed by the standard deviation can be written as . In addition, we assume in this section that for simplicity, and that the following approximations are rigorous in the limit .
TFS of raw data
Here, we investigate the TFS of the time series of the raw counts of word appearances (see Fig. 1(a)), which is determined by the steady RD model described in Ref. (22). From Eq. LABEL:V_F_ex in Appendix B, we can obtain the temporal mean
and the temporal variance
By inserting Eq. 16 into Eq. LABEL:V_F, we obtain the relationship between the variance and the mean of given by
In the case that (), which indicates that the j-th word is steady, from Eq. LABEL:V_F we can also obtain the following simpler expression:
which can be written as a function of the mean as
In addition, because this equation also gives a lower bound on over . From this result, we can deduce the following scaling relations:
and the corresponding scaling as a function of the mean is written as
where the conditions of the scaling with a single exponent are only that and . That is, the time series is a steady Poisson process (i.e., and .).
From Fig. 2 (a), we can confirm that the theoretical curve given by Eq. LABEL:V_F2E is in agreement with the lower bound of empirical data. Here, because cannot accurately be estimated using data, we consider the lower bound for comparing the theory with empirical data. Similarly, in later sections we will also employ lower or upper bounds.
Note that the steady RD model given by Eq. 1 explains the temporal FS of raw data by approximating the unsteady time series of blogs using a steady time series (i.e., and ). However, the model cannot consistently explain certain properties under this condition, which will be discussed later.
TFS of the data scaled by the total number of blogs
Next, we discuss the time series of word appearances scaled by the total numbers of blogs
In practice, this value is used to estimate , which corresponds to the original time deviation of the j-th word separated from the effects of deviations in the total number of blogs (see Figs. 1(b) and (c)). From Eq. LABEL:V_tilde_F_ex in Appendix B, we can obtain that
Thus, by combining Eq. 24 and Eq. LABEL:V_tilde_F we obtain the relationship between the variance and mean of given by
In the case that or , which means that the j-th word is steady, we can obtain the simple expression
and the corresponding variance can be written as a function of the mean as
where because , this equation also gives a lower bound on .
Fig. 3 (a) shows a comparison between Eq. 28 and the corresponding data. From this figure, we can confirm that the lower bound of the data disagrees slightly with Eq. 28 as given by the red dashed curve. The reason for this disagreement is that cannot be neglected for all (by comparison ). In fact, the corrected lower bound with respect to (green dashed-dotted line) is given by
which is confirmed to be in agreement with the actual data.
Here, we assume that is roughly estimated by . Note that we limit the words by in order to focus on the words that are dominant in the term of in Eq. LABEL:V_tilde_F.
Next, we investigate the temporal FS of the time difference of a scaled time series of word appearances (see Fig. 1(d)),
By taking the difference, we can reduce the nonstationary effects on temporal FSs.
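The effect of scaling and differencing can be seen in a small deterministic example. The following Python sketch divides a count series by the normalised number of blogs and then takes first differences. The helper names and the toy series, which contain no Poisson noise, are assumptions for illustration, and the difference is taken between consecutive days.

```python
import statistics

def scale_by_total(counts, m_t):
    """Counts divided by the normalised total number of blogs."""
    return [f / m for f, m in zip(counts, m_t)]

def time_difference(xs):
    """First difference x(t) - x(t - 1)."""
    return [b - a for a, b in zip(xs, xs[1:])]

m_t = [1.0, 1.2, 0.8, 1.1, 0.9]  # hypothetical normalised number of blogs

# A steady word: all variation comes from the number of blogs.
steady_counts = [100.0 * m for m in m_t]
steady_scaled = scale_by_total(steady_counts, m_t)
steady_diffs = time_difference(steady_scaled)

# A word with its own linear trend in usage on top of the blog variation.
trend_counts = [100.0 * (1.0 + 0.1 * t) * m for t, m in enumerate(m_t)]
trend_scaled = scale_by_total(trend_counts, m_t)
trend_diffs = time_difference(trend_scaled)

print(steady_scaled, steady_diffs)
print(statistics.pvariance(trend_scaled), statistics.pvariance(trend_diffs))
```

Scaling removes the shared variation, and differencing turns the remaining linear trend into a constant offset, so the variance of the differenced series no longer reflects the trend.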
From Eq. LABEL:delta_V_mf_a in Appendix B, the variance is given as follows:
Using the facts that and , we can obtain the lower bound
and by using Eq. 24, we can obtain the corresponding lower bound as a function of the mean as
The condition of scaling with a single exponent is that . That is, the Poisson parameter does not fluctuate. In terms of the blogger model, this condition also corresponds to the case that bloggers are homogeneous.
IV.2 Ensemble fluctuation scaling
Next, we consider ensemble fluctuation scaling (EFS). The EFS is defined by a power-law scaling between the ensemble mean and the ensemble variance at a fixed time, . Here, the ensemble mean and the ensemble variance of the value of at time are defined as follows:
where we take the ensemble of words whose temporal means take the same particular value, .
Note that in the actual data analysis, we approximate the ensemble scalings by the following box ensemble scaling:
where the box ensemble scaling is in agreement with the ensemble scaling in the case that . In this paper, we assume that with .
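The box ensemble statistics can be sketched as follows. At a fixed time, the Python function below collects the values of all words whose temporal means fall inside a given box and returns their ensemble mean and variance. The half-open box, the function name `box_ensemble_stats`, and the toy data are assumptions for illustration, since the box definition used in the actual analysis may differ.

```python
import statistics

def box_ensemble_stats(series_by_word, t, lower, upper):
    """Ensemble mean and variance at time t over words whose temporal mean
    lies in the half-open box [lower, upper). Returns None if fewer than
    two words fall inside the box."""
    values = [series[t] for series in series_by_word
              if lower <= statistics.fmean(series) < upper]
    if len(values) < 2:
        return None
    return statistics.fmean(values), statistics.pvariance(values)

# Toy data (hypothetical): two words with temporal mean 10 and one with mean 100.
words = [
    [10.0, 12.0, 8.0],
    [11.0, 9.0, 10.0],
    [100.0, 90.0, 110.0],
]
stats_small = box_ensemble_stats(words, 0, 5.0, 20.0)
stats_large = box_ensemble_stats(words, 0, 50.0, 200.0)
print(stats_small, stats_large)
```

Narrow boxes approximate the exact ensemble more closely but contain fewer words, so the box width trades bias against statistical noise.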
From Eq. B.81 in Appendix B, the box ensemble mean and the variance of are given as:
By using the fact that , we can obtain the following theoretical lower bound over :
and as a function of the mean,
In addition, in the case of the ensemble scaling, meaning , we can obtain the simpler relationship
and the corresponding expression written by the mean,
Here, cannot be neglected in the observations, in the same manner as the temporal scaling . Therefore, we also consider an ensemble scaling of the differential of appearances of a word, for comparing data.
From Eq. B.95 in Appendix B, the ensemble scalings of the differential of the time series of word appearances,
is given by
where is defined by and we assume that is constant with and . By using the facts that , , and , we can obtain the lower bound
and the corresponding expression as a function of the mean is written as
In addition, in the case of ensemble scaling, meaning