Skewed distributions as limits of a formal evolutionary process
Time series of observables measured from complex systems often exhibit non-normal statistics: their statistical distributions (PDFs) are not Gaussian and are often skewed, with roughly exponential tails. Departure from Gaussianity is related to the intermittent development of large-scale coherent structures. The existence of these structures is rooted in the nonlinear dynamical equations obeyed by each system; it is therefore expected that some prior knowledge of, or guess about, these equations is needed if one wishes to infer the corresponding PDF. Conversely, empirical knowledge of the PDF provides information about the underlying dynamics. In this work we suggest that such knowledge of the dynamics is not always necessary. We show how, under some assumptions, a formal evolution equation for the PDF can be written down, corresponding to the progressive accumulation of measurements of a generic observable. The limiting solution to this equation is computed analytically and shown to interpolate between some of the most common distributions: the Gamma, Beta and Gaussian PDFs. The control parameter is just the ratio between the rms of the fluctuations and the range of allowed values. Thus, no information about the dynamics is required.
Keywords: Statistical distributions; data analysis; replicator equation; evolutionary dynamics
Signals produced by noisy, random or chaotic systems fluctuate around their mean value. While the normal (Gaussian) curve is the paradigmatic distribution for these fluctuations, there are several counterexamples of signals produced by complex systems which exhibit non-normal statistical properties: their statistical distributions (PDFs) are not Gaussian, and are often skewed with roughly exponential tails. Typical examples are the Gamma, log-normal and Weibull distributions (1); the fields where they are encountered span practically all scientific disciplines (plasma physics, meteorology, financial data, oceanography, biology, …). Departure from Gaussianity is related to the intermittent development of large-scale coherent structures that break the independence between measurements. Ultimately, the existence of these structures is argued on the basis of the equations governing the dynamics of the system, and in particular of their nonlinearities. In some cases the true equations are not known and are replaced by phenomenological models. Thus, a prerequisite for inferring the possible shape of the PDF should be some knowledge of, or guess about, the system's dynamics (2); (3); (4); (5); (9); (6); (7); (8). Conversely, one might expect to employ empirical knowledge of the PDF in order to infer some information about the unknown underlying dynamics of the system studied. It is interesting to note, incidentally, that on the one hand one and the same kind of distribution may appear in totally different fields but, on the other hand, different experiments measuring the same quantity may yield different PDFs. This is the case, for instance, in laboratory plasma physics, where particle density measurements are mostly fitted by Gamma distributions, but other PDFs have been suggested as suitable candidates as well (10); (11); (12); (13); (14); (15). This might imply that the governing equations must admit several possible classes of solutions.
In this work we propose a different possibility. We argue that the analytical form of the PDF might be fixed by just a few gross constraints extracted from the measured data, without invoking any detailed knowledge of the underlying dynamics. To illustrate this claim we resort to the following paradigmatic case: imagine measuring some positive-definite quantity, e.g. a (number, mass, …) density. Suppose as well that the measured quantity fluctuates and that the typical amplitude of its fluctuations (the rms) is comparable to the average value. It is obvious that negative (below the average) fluctuations are penalized, since no less-than-zero values are permitted, whereas no similar constraint holds for the positive (above the average) ones. Hence, one must expect a priori the PDF to be skewed and non-Gaussian: a log-normal PDF, e.g., might be appropriate in this instance (16). Of course, the dynamical equations do contain the constraint of positiveness, but our point is that, in this instance, it is not necessary to invoke them. Another intriguing aspect is that these constraints need not reflect an intrinsic property of the system; rather, they might be imposed by the measuring apparatus as well. Hence, in principle, the same system observed by different observers may produce different PDFs.
We describe the measurement process as an evolutionary process. The evolution does not occur in real time, but in a fictitious time whose flow parametrizes any increase of information about the system, acquired, e.g., via measurements, in a Bayesian spirit; it is therefore a fairly generic feature. The evolving quantity is the PDF, obtained by binning the measured signal, and the quantities of interest are the stable solutions of its evolution equation.
Let us introduce the model. Any scalar signal is bounded between a minimum and a maximum value, . The two extrema are either fixed by physical constraints or determined empirically by experiments. For the forthcoming analytical manipulations it is convenient to deal with a PDF whose support is semicompact: either or . If both of them are finite, then it is necessary to transform the measured variable, say switch to which converts the initial interval into . In most scenarios the opposite range would be the more convenient choice, since measured signals are ordinarily defined as positive numbers. However, the difference is immaterial, since it amounts to just a change of sign of the observable involved. The present choice has the advantage of being consistent with the existing literature on evolutionary dynamics.
Let thus be the measured quantity, and its prior statistical distribution, which may have been obtained via some sets of measurements. Any further investigation provides new information about and, correspondingly, acts to modify . Formally we can represent this through a state-transition operator:
After subtracting from both members we arrive at
Eq. (2) is an instance of a Master Equation (17). The function accounts for the flow of probability from to . Generically, it is expected to be system-dependent; however, we will look for universal traits that restrict its possible analytical forms. There are actually two constraints; the first one comes from integrating (2) over :
Secondly, the diagonal part of the operator, accounts for a flow of density into itself. Since this is already accounted for by in the lhs of (2), this term must be null: .
These two constraints allow us to guess a plausible form for . First of all we factorize it as: . The first term quantifies the probability flux along distance , regardless of the absolute position of the initial and final points. We must account for inhomogeneity as well, and this is brought in by , which quantifies the propensity for each point to act as final target. Let us start by guessing an analytical expression for . The propensity may be estimated a posteriori on the basis of the past history of the system, which is quantified by itself. Thus, we may guess that , and we choose the simplest proxy: . The function comes accordingly: the simplest choice compliant with the above requirements is
In conclusion, eq. (2) rewrites
Notice that does not depend on nor on ; actually, it quantifies the degree of difference between the two successive estimates : by assuming that each set of measurements increases only marginally the information about , i.e. , we therefore require , and may thus be given the meaning of an infinitesimal increment of a variable that parametrizes the state of knowledge about . Eq. (5) formally takes the form of a finite-difference equation, which may further be turned into a differential equation:
Equation (6) is a non-linear integro-differential equation, often encountered in biosciences and game theory, where it is known as the "replicator equation" (18). In those contexts, is the true time, whereas in our case it is a fictitious time parameterizing the flow of information. Several choices for other than the one above are obviously possible in principle. However, they are penalized in terms of complexity criteria: Eq. (6) is not the only possible evolution equation, but it is the simplest one.
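As an illustration of how an equation of this type behaves, the following minimal sketch integrates the standard continuous replicator equation, dp/dt = (x - <x>) p with <x> the mean of x under p, on a fixed grid. The symbols p, x and t, the grid on [-10, 0], the uniform initial condition and the explicit Euler scheme are all assumptions made for the example; Eq. (6) is taken here to coincide with that standard form.

```python
def mean(p, x, dx):
    """Grid estimate of <x> = integral of x * p(x) dx."""
    return sum(xi * pi for xi, pi in zip(x, p)) * dx


def euler_step(p, x, dx, dt):
    """One explicit Euler step of dp/dt = (x - <x>) p (assumed form of the equation)."""
    m = mean(p, x, dx)
    return [pi + dt * (xi - m) * pi for xi, pi in zip(x, p)]


def integrate(p0, x, dx, dt, n_steps):
    """Advance p over n_steps and record the mean after every step."""
    p = list(p0)
    means = [mean(p, x, dx)]
    for _ in range(n_steps):
        p = euler_step(p, x, dx, dt)
        means.append(mean(p, x, dx))
    return p, means


# Hypothetical setup: semi-infinite support truncated to [-10, 0], midpoint grid.
n = 400
dx = 10.0 / n
x = [-10.0 + (i + 0.5) * dx for i in range(n)]
p0 = [0.1] * n  # uniform prior, normalised so that sum(p0) * dx = 1

p, means = integrate(p0, x, dx, dt=0.01, n_steps=500)
norm = sum(p) * dx  # the Euler update conserves this by construction
print(f"norm = {norm:.6f}, initial mean = {means[0]:.3f}, final mean = {means[-1]:.3f}")
```

With these settings the normalisation is preserved and the mean drifts monotonically toward the upper edge of the support, since each Euler step shifts it by dt times the current variance. The support itself never changes, because the update is multiplicative in p.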
Let us discuss some features of (6). First of all, the support of does not vary over time: if at some time for some then will be zero at all times. It is possible to generalize Eq. (6) by including the possibility for to spread over larger and larger intervals of (see (19)), but for the moment we will not discuss this option and will return to it only in the final part of this work. Secondly, Eq. (6) admits a Dirac-delta stationary solution: . Within our picture, this solution corresponds to the case where enough information has been collected to assign a unique value to the observable : . In actual situations, this never happens. However, in our analysis it will not be necessary to take this limit explicitly: we will show that , regardless of initial conditions, quickly approaches a limiting function that remains unaltered in shape as time flows.
Smerlak and Youssef (19) worked out analytically, through a lengthy analysis, the solution to Eq. (6) for large times. We do not replicate their study here; rather, we proceed the other way round, plugging their result into Eq. (6) and showing that it is a valid solution at all times. The candidate solution takes the form of a flipped Gamma distribution:
The parameter is conveniently expressed in terms of the average value :
which reduces to (6) when
It is easy to check that: (i) as , hence converges towards and is self-similar: it does not change shape as time grows; (ii) the rate of convergence slows down and becomes very small as time increases: . On the other hand, varies rapidly in the first stages, which suggests that, regardless of the starting point, approaches quickly its limiting value (7). The numerical integration of Eq. (6) confirms the validity of this guess (see Fig. 1). Notice that of Eq. (7) is a valid solution at all times, but is also a stable limiting solution: if one starts from a different curve, the solution eventually evolves towards Eq. (7).
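Both the validity at all times and the self-similarity can be checked numerically. The sketch below assumes that the flipped Gamma candidate is parametrised by a shape alpha (a hypothetical name) and a rate equal to the fictitious time t, so that its mean is -alpha/t, and that Eq. (6) is the standard replicator equation dp/dt = (x - <x>) p. The paper's own parametrisation of Eq. (7) may differ.

```python
import math


def flipped_gamma(x, t, alpha):
    """Assumed candidate: Gamma density in -x (x < 0) with shape alpha and rate t."""
    return t**alpha / math.gamma(alpha) * (-x) ** (alpha - 1) * math.exp(t * x)


def residual(x, t, alpha, h=1e-5):
    """dp/dt (central difference) minus (x - <x>) p, with <x> = -alpha / t."""
    dpdt = (flipped_gamma(x, t + h, alpha) - flipped_gamma(x, t - h, alpha)) / (2 * h)
    mean = -alpha / t
    return dpdt - (x - mean) * flipped_gamma(x, t, alpha)


alpha = 2.5                      # arbitrary illustrative shape
times = (0.5, 1.0, 4.0)          # arbitrary sample times
points = (-6.0, -3.0, -1.0, -0.2)

# Largest violation of the evolution equation over the sample points.
max_res = max(abs(residual(x, t, alpha)) for t in times for x in points)

# Self-similarity: p(x, t) = t * f(t * x), with f the t = 1 profile.
self_sim_err = max(
    abs(flipped_gamma(x, t, alpha) / t - flipped_gamma(x * t, 1.0, alpha))
    for t in times
    for x in points
)
print(f"max residual = {max_res:.2e}, self-similarity error = {self_sim_err:.2e}")
```

Under these assumptions the residual is at the level of the finite-difference error, and rescaling x by t collapses the curves at different times onto a single profile.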
Ultimately, we have to revert to the physical variable by replacing into (7) with: . The result is
This expression is not particularly illuminating, but its limiting forms are revealing. The relevant figures here are the ratios , where is a measure of the typical amplitude of the signal: say the average value, or the mean fluctuation around the average. We will consider both to be non-zero in order not to bother with sub-cases. (i) Let us keep finite and take . In practice this obviously corresponds to reverting to the original variable : indeed, . Hence, turns into a Gamma distribution like Eq. (7). (ii) A second limiting case has both very large: . The function has its maximum at . If we expand around this point we get
Thus now, . As long as remains much smaller than , the lowest-order term therefore dominates and reduces to a Gaussian PDF. (iii) Finally, let us consider the case when both and are finite. A fairly versatile family of functions, often employed to model distributions within finite domains, is the Beta distribution (1). With our definitions it reads:
In Fig. 2 we show that Eqs. (11) and (13) agree remarkably well. This is not unexpected since, for , we can Taylor expand and there is an explicit correspondence between Eq. (11) and (13): the Beta PDF can be regarded as the lowest-order series expansion of (11). In conclusion, we may claim that even this family of distributions is accounted for within our model.
Rather than inspecting the whole PDF, a strategy often employed when dealing with experimental data is to study the mutual relations between the lowest-order statistical moments, in particular the third (skewness $S$) and the fourth (kurtosis $K$). Low-order statistical moments are robust quantities to compute from raw data; furthermore, several PDFs feature characteristic correlations between them. For the Gamma distribution (7), in particular, it is easy to show that
$$K = \frac{3}{2} S^2 + 3 \qquad (14)$$
(The Gaussian PDF is a special case of this law, with $S = 0$, $K = 3$.) It is a well-established empirical result that data from very diverse fields obey a law like (14): $K = a S^2 + b$, with $a$, $b$ close to the values $3/2$, $3$, respectively. This scaling has been studied extensively in laboratory plasma physics (12); (13); (15); (20); (21); (22), but also in oceanography (23), meteorology (24), seismology and financial time series (25); (26). In Fig. 3 we plot the curve from Eq. (14) together with a large sample of pairs $(S, K)$ computed from (11) by varying its parameters. Expression (11) does not allow for a unique relation between $S$ and $K$; rather, a cloud of points is generated, which is bounded from above by the Gamma limiting solution (14). Visual comparison of Fig. 3 with similar plots from experiments shows that there is a good overlap between the two data sets, except for one aspect: Eq. (11) does not appear to include solutions featuring $K > \frac{3}{2} S^2 + 3$, whereas part of the experimental data lie in this region. In terms of PDFs this implies that datasets exist whose histograms are not fitted by any of the distributions studied above. This is, at this stage, a shortcoming of the model, but we do not think it is a critical one. One possible way of addressing this issue is to acknowledge that we have so far dealt with an immutable range of allowed values. This is justified as long as the extrema are fixed a priori by the physics of the system or by the measuring apparatus. In several situations, on the other hand, they are only defined a posteriori as the extrema of the signal measured so far, hence they are liable to vary just like the PDF does. Smerlak and Youssef (19) argue that this effect can be accounted for by adding an effective diffusive term to Eq. (6), converting it into a nonlinear Fokker-Planck equation with reaction terms. Their numerical studies of this modified equation show that, transiently, its solutions do move in the $(S, K)$ plane above the curve (14).
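The skewness-kurtosis law for the Gamma distribution discussed above can be verified directly from the raw moments. The sketch below is a check under stated assumptions: a Gamma variable with shape alpha (a hypothetical name) and unit scale, K defined as the non-excess kurtosis, and the relation taken in its standard form K = 3S²/2 + 3. Flipping the sign of the variable changes the sign of S but leaves S² and K unchanged, so the same check applies to the flipped Gamma.

```python
def gamma_raw_moment(alpha, n):
    """E[X^n] for a unit-scale Gamma variable: alpha * (alpha + 1) * ... * (alpha + n - 1)."""
    out = 1.0
    for k in range(n):
        out *= alpha + k
    return out


def skewness_kurtosis(alpha):
    """Skewness S and (non-excess) kurtosis K from the first four raw moments."""
    m1, m2, m3, m4 = (gamma_raw_moment(alpha, n) for n in (1, 2, 3, 4))
    var = m2 - m1**2
    mu3 = m3 - 3 * m1 * m2 + 2 * m1**3                     # third central moment
    mu4 = m4 - 4 * m1 * m3 + 6 * m1**2 * m2 - 3 * m1**4    # fourth central moment
    return mu3 / var**1.5, mu4 / var**2


# Arbitrary illustrative shapes, from strongly skewed to nearly Gaussian.
pairs = {alpha: skewness_kurtosis(alpha) for alpha in (0.5, 1.0, 2.0, 10.0, 100.0)}
for alpha, (S, K) in pairs.items():
    print(f"alpha = {alpha:6.1f}  S = {S:.4f}  K = {K:.4f}  1.5*S^2 + 3 = {1.5 * S**2 + 3:.4f}")
```

As alpha grows the pair (S, K) slides along the parabola toward the Gaussian point (0, 3), which is the sense in which the Gaussian is a special case of the same law.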
In conclusion, we suggest that the statistical distributions encountered in the analysis of experimental data (with the exception of power-law ones) may arise generically, enforced by a few natural constraints. Our whole discussion relies basically on just the three ansätze defining the functions , , . The appealing consequence of our hypothesis is that we are able to reduce everything to just one basic solution, Eq. (11). The best known and most common distributions arise simply as different limiting cases of this solution. The relevant parameters for interpolating between one limit and another are, roughly, the ratios between the typical measured values (say, the rms) and the maximum measurable ones, with the Gaussian PDF appearing when both ratios go to zero. Thus we are able to claim that the universality of PDFs across several different systems is possible, while at the same time providing a rationale for one kind of signal being modelled by different curves in different experiments.
Acknowledgements. The author wishes to thank S. Cappello, D. Escande, N. Vianello, M. Agostini, M. Zuin and I. Predebon for useful discussions.
(1) N.L. Johnson, S. Kotz, N. Balakrishnan, Continuous Univariate Distributions (John Wiley & Sons, New York, 1995), 2nd ed.
(2) F. Sattin, N. Vianello, Phys. Rev. E 72 (2005) 016407
(3) O.E. Garcia, Phys. Rev. Lett. 108 (2012) 265001
(4) O.E. Garcia, R. Kube, A. Theodorsen, H.L. Pécseli, Phys. Plasmas 23 (2016) 052308
(5) D. Guszejnov, N. Lazányi, A. Bencze, S. Zoletnik, Phys. Plasmas 20 (2013) 112305
(6) J.A. Krommes, Phys. Plasmas 15 (2008) 030703
(7) B. Portelli, P.C.W. Holdsworth, J.-F. Pinton, Phys. Rev. Lett. 90 (2003) 104501
(8) V. Carbone, et al., Europhys. Lett. 58 (2002) 349
(9) I. Sandberg, S. Benkadda, X. Garbet, G. Ropokis, K. Hizanidis, D. del-Castillo-Negrete, Phys. Rev. Lett. 103 (2009) 165001
(10) F. Sattin, N. Vianello, M. Valisa, Phys. Plasmas 11 (2004) 5032
(11) J.P. Graves, J. Horacek, R.A. Pitts, K.I. Hopcraft, Plasma Phys. Control. Fusion 47 (2005) L1
(12) B. Labit, I. Furno, A. Fasoli, A. Diallo, S.H. Müller, G. Plyushchev, M. Podestá, F.M. Poli, Phys. Rev. Lett. 98 (2007) 255002
(13) F. Sattin, M. Agostini, R. Cavazzana, P. Scarin, J.L. Terry, Plasma Phys. Control. Fusion 51 (2009) 055013
(14) O.E. Garcia, S.M. Fritzner, R. Kube, I. Cziegler, B. LaBombard, J.L. Terry, Phys. Plasmas 20 (2013) 055901
(15) S. Banerjee, et al., Phys. Plasmas 21 (2014) 072311
(16) F. Sattin, N. Vianello, M. Valisa, V. Antoni, G. Serianni, Journal of Physics: Conf. Series 7 (2005) 247
(17) C.W. Gardiner, Handbook of Stochastic Methods (Springer, 2009)
(18) P.D. Taylor and L.B. Jonker, Math. Biosci. 40 (1978) 145
(19) M. Smerlak and A. Youssef, J. Theor. Biol. 416 (2017) 68
(20) F. Sattin, M. Agostini, R. Cavazzana, G. Serianni, P. Scarin, N. Vianello, Phys. Scr. 79 (2009) 045006
(21) N. Yan et al., Plasma Phys. Control. Fusion 55 (2013) 115007
(22) R. Kube, A. Theodorsen, O.E. Garcia, B. LaBombard, J.L. Terry, Plasma Phys. Control. Fusion 58 (2016) 054001
(23) P. Sura and P.D. Sardeshmukh, J. Phys. Oceanogr. 38 (2008) 638
(24) T.P. Schopflocher and P.J. Sullivan, Boundary-Layer Meteor. 115 (2005) 341
(25) M. Cristelli, A. Zaccaria, L. Pietronero, Phys. Rev. E 85 (2012) 066108
(26) C. Adcock, M. Eling, N. Loperfido, Eur. J. Finance 21 (2015) 1253