Multifractal Diffusion Entropy Analysis: Optimal Bin Width of Probability Histograms
Abstract
In the framework of Multifractal Diffusion Entropy Analysis we propose a method for choosing an optimal binwidth in histograms generated from underlying probability distributions of interest. The method presented uses techniques of Rényi’s entropy and the mean squared error analysis to discuss the conditions under which the error in the multifractal spectrum estimation is minimal. We illustrate the utility of our approach by focusing on a scaling behavior of financial time series. In particular, we analyze the S&P500 stock index as sampled at a daily rate in the time period 19502013. In order to demonstrate a strength of the method proposed we compare the multifractal spectrum for various binwidths and show the robustness of the method, especially for large values of . For such values, other methods in use, e.g., those based on moment estimation, tend to fail for heavytailed data or data with long correlations. Connection between the spectrum and Rényi’s parameter is also discussed and elucidated on a simple example of multiscale time series.
keywords:
Multifractals, Rényi entropy, Stable distributions, Time seriesPacs:
89.65.Gh, 05.45.Tp00 \journalnamePhysica A \runauthP. Jizba and J. Korbel \jnltitlelogoPhysica A \CopyrightLine2014Published by Elsevier Ltd.
1 Introduction
The evolution of many complex systems in natural, economical, medical and biological sciences is usually presented in the form of time datasequences. A global massification of computers together with their improved ability to collect and process large datasets have brought about the need for novel analyzing methods. A considerable amount of literature has been recently devoted to developing and using new dataanalyzing paradigms. These studies include such concepts as fractals and multifractals Kim:04 (), fractional dynamics Machado (); West (), complexity Park (); Lee:physics0607282 (), entropy densities Lee:physics0607282 () or transfer entropies Schreiber (); Marschinski (); transferent (). Particularly in the connection with financial time series there has been rapid development of techniques for measuring and managing the fractal and multifractal scaling behavior from empirical highfrequency data sequences. A nontrivial scaling behavior in a time dataset represents a typical signature of a multitime scale cooperative behavior in much the same way as a nontrivial scaling behavior in secondorder phase transitions reflects the underlying longrange (or multiscale) cooperative interactions. The usefulness of the scaling approach is manifest, for instance, in quantifying critical or closetocritical scaling which typically signalizes onset of financial crises, including stock market crashes, currency crises or sovereign defaults kleinert:a (). A multifractal scaling, in particular, is instrumental in identifying the relevant scales that are involved in both temporal and interasset correlations transferent (). In passing, one can mention that aside from financial data sequences, similar (multi)fractal scaling patterns are also observed (and analyzed) in time datasets of heart rate dynamics Peng2 (); Voit (), DNA sequences Peng (); mantegna (), longtime weather records Talker:00 () or in electrical power loads Bozic:13 ().
In order to identify fractal and multifractal scaling in time series generated by a complex system (of both deterministic and stochastic nature), several tools have been developed over the course of time. To the most popular ones belong the Detrended Fluctuation Analysis Peng (); mfdfa (), Wavelets multiwavelets (), or Generalized Hurst Exponents generhurst (). The purpose of the present paper is to discuss and advance yet another pertinent method, namely the Multifractal Diffusion Entropy Analysis (MFDEA). In doing so we will stress the key rôle that Rényi’s entropy (RE) plays in this context. To this end, we will employ two approaches for the estimation of the scaling exponents that can be directly phrased in terms of RE, namely, the monofractal approach of Scafetta et al. dea () and the multifractal approach of Huang et al. mdea (), with further comments of Morozov commentmfdea (). The most important upshot that will emerge from this study is the proposal for the optimal binwidth in empirical histograms. The latter ensures that the error in the RE (and hence also scaling exponents) evaluation, when the underlying probability density function (PDF) is replaced by its empirical histograms, is minimal in the sense of Rényi’s information divergence and the associated distance. We further show that the ensuing optimal binwidth permits the characterization of the hierarchy of multifractal scaling exponents and in a fully quantitative fashion.
This paper is structured as follows: In Section 2 we briefly review foundations of the multifractal analysis that will be needed in following sections. In particular, we introduce such concepts as Lipschitz–Hölder’s singularity exponent, multifractal spectral function and their Legendre conjugates. In Section 3 we state some fundamentals of the Fluctuation collection algorithm and propose a simple instructive example of a heterogeneous multiscale time series. Within this framework we discuss the MFDEA and highlight the rôle of Rényi’s entropy as a multiscale quantifier. After this preparatory material we turn in Section 4 to the question of the optimal binwidth choice that should be employed in empirical histograms. In particular, we analyze the binwith that is needed to minimize error in the multifractal spectrum evaluation. In Section 5, we demonstrate the usefulness and formal consistency of the proposed error estimate by analyzing time series from S&P500 market index sampled at a daily (end of trading day) rate basis in the period from January 1950 to March 2013 (roughly 16000 data points). We apply the symbolic computations with the open source software to illustrate the strength of the proposed optimal binwidth choice. In particular, we graphically compare the multifractal spectrum for various binwidths. Our numerical results imply that the proposed binwidths are indeed optimal in comparison with other alternatives used. Implications for the spectrum as a function of Rényi’s parameter are also discussed and graphically represented. Conclusions and further discussions are relegated to the concluding section. For the reader’s convenience, we present in A the source code in the language that can be directly employed for efficient estimation of the spectrum (and ensuing generalized dimension ) of longterm data sequences.
2 Multifractal analysis
Let us have a discrete time series , where are obtained from measurements at times with an equidistant time lag . We divide the whole domain of definition of ’s into distinct regions and define the probability of each region as
(1) 
where “card” denotes the cardinality, i.e., the number of elements contained in a set. For every region, we consider that the probability scales as , where are scaling exponents also known as the Lipschitz–Hölder (or singularity) exponents. The key assumption in the multifractal analysis is that in the small limit we can assume that the probability distribution depends smoothly on and thus the probability that some arbitrary region has the scaling exponent in the interval can be considered in the form
(2) 
The corresponding scaling exponent is known as the multifractal spectrum and by its very definition it represents the (boxcounting) fractal dimension of the subset that carries PDF’s with the scaling exponent .
A convenient way how to keep track with various ’s is to examine the scaling of the correspondent moments. To this end one can define a “partition function”
(3) 
Here we have introduced the scaling function which is (modulo a multiplicator) an analogue of thermodynamical free energy Renyi (); stanley (). In its essence is the partition function (3) nothing but the th order moment of the probability distribution^{1}^{1}1 is a random variable with probabilities and values equal to ., i.e., . It is sometimes convenient to introduce the generalized mean of the random variable as
(4) 
The function is also known as the Kolmogorov–Nagumo function Renyi (). For the choice one obtains the socalled mean which is typically denoted as . It is then customary to introduce the scaling exponent of the mean as
(5) 
Such is called a generalized dimension and from (3) it is connected with via the relation: . In some specific situations, the generalized dimension can be identified with other well known fractal dimensions, e.g., for we have the usual boxcounting fractal dimension, for we have the informational dimension and for it corresponds to the correlation dimension. The generalized dimension itself is obtainable from the relation
(6) 
which motivates the introduction of Rényi’s entropy^{2}^{2}2Here and throughout we use the natural logarithm, thought from the information point of view it is more natural to phrase RE via the logarithm to the base two. RE thus defined is then measured in natural units — nats, rather than bits.
(7) 
The generalized dimension defined via (6) is also known as a Rényi dimension of order . Eq. (7), in turn, implies that for small one has where is the term independent of (cf. Ref. Renyi ()). The multifractal spectrum and scaling function are not independent. The relation between them can be found from the expression for the partition function which can be, with the help of (1) and (2), equivalently written as
(8) 
By taking the the steepest descent approximation one finds that
(9) 
Here is such a value of the singularity exponent, that maximizes for given . Together with the assumption of differentiability, one finds that , and hence Eq. (9) is nothing but the Legendre transform between two conjugate pairs and . The Legendre transform of the function contains the same information as , and is the characteristic customarily studied when dealing with multifractals multiwavelets (); mdea (); Renyi (); stanley ().
In passing we can observe that the passage from multifractals to monofractals is done by assuming that the interval gets infinitesimally narrow an the underlying PDF is smooth. In such a case both and collapse to and (cf. Ref. Renyi ()).
3 Diffusion entropy analysis
3.1 Fluctuation collection algorithm
In order to be able to utilize RE for extracting scaling exponents, we need to correctly estimate probability distribution from given empirical data sequence. The method that is particularly useful for our purpose is the socalled Fluctuation collection algorithm (FCA) dea (). To understand what is involved, let us assume that is a stationary time series. This can be converted to a random walk (i.e., a diffusion trajectory) by considering the time series as being a set of onedimensional diffusiongenerating fluctuations. With this a diffusing “particle’s” position at time will be . Here with and being the elementary time lag. The reason behind usage of random walklike process instead of the original (noiselike) data time series is the Central Limit Theorem (CLT) and its Lévy–Gnedenko heavytailed generalization mantegna (); Kolmogorov (). In fact, the ensuing distributions in the CLT basin of attraction allow to identify (and quantify) the truly universal scaling behavior imprinted in the auxiliary diffusion process. This is because the nontrivial scaling can be phrased in terms of parameters of the involved Lévy stable nonGaussian distributions.
As a rule, the intertime scale correlations typically cause the breakdown of the CLT globally but in many cases the validity of the CLT is retained at least locally. In such a case the Lévy scaling parameters depend on the local time scale, which in turn leads to a typical multifractal picture seen in time series born out of complex dynamical systems. An example of this kind of behavior will be presented in the next subsection.
In order to establish the possible existence of a scaling we need to know the probability for the auxiliary diffusion trajectory to be found at the position at time ( an integer satisfying the condition ). The FCA addresses precisely this issue by turning a single time sequence, or better the derived diffusion trajectory into a Gibbs ensemble of diffusion trajectories. To this end one defines subsequences where
(10) 
In particular, each aforementioned subsequence has the time duration . One might envisage that the subsequences with different are formed by shifting a window of a fixed length by a constant amount within the original time sequence. There will be clearly only such shifts possible. For this reason is such a selection of subsequences also known as the mobile window method Grigolini ().
From any subsequence one may form an auxiliary diffusion trajectory (basically a sample path) as
(11) 
This mapping of a time series to a diffusion process generates an overlapping diffusion process. The ensemble of such diffusion trajectories allows now to find the required probability; all obtained values of for different ’s are divided into regions of length and the probability is estimated from the (normalized) equidistant histogram, i.e.
(12) 
The aforementioned FCA is illustrated in Fig. 1 with the time series of the S&P500 index.
In case of multidimensional data series, i.e., when , we estimate dimensional histogram with hypercubical bins of the elementary volume . Unfortunately as it stands, the FCA does not prescribe the size of which is, however, crucial for the correct RE evaluation. In the following subsection we will see that often and hence the optimal bin width will be generally dependent. In Section 4 we will discuss the actual optimal value of such .
It is clear that the FCA is not sensibly applicable in cases of nonergodic and nonMarkovian systems, phrased, e.g., via accelerating, pathdependent or aging random walks. This is because the key assumption behind the whole fluctuation collection construction is a reliable use of the Gibbs ensemble method which, in turn is justified only when the time series in question represents a time evolution of an underlying system, whose complex statistical rules do not change in time. It will be this limited framework of the FCA that will be, for simplicity’s sake, utilized in sections to follow. Despite its limited applicability, one is still able to draw from this scheme valuable conclusions about a scaling behavior of complex multiscale data sets. Interested reader may find some generalizations of the FCA to nonstationary time series, e.g., in Refs. Grigolini (); GrigoliniII ().
3.2 Monofractal and Multifractal diffusion entropy analysis
Statistical fractals and multifractals cannot be described comprehensively by descriptive statistical measures, such as mean or variance, because these do depend on the scale of observation in a power law fashion. In the literature one can find a number of approaches allowing to extract above multifractal scaling exponents from empirical data sequences. To these belong, for instance, Detrended Fluctuation Analysis constructed on random walks Peng (); mfdfa (), signal theory based Wavelet Analysis multiwavelets (), Generalized Hurst Exponents generhurst (), etc. In the following we will employ yet two another approaches for the estimation of scaling exponents that are directly based on the use of RE, namely, the Monofractal Diffusion Entropy Analysis of Scafetta et al. dea () and the MFDEA of Huang et al. mdea () and Morozov commentmfdea (). The reason for choosing these approaches is that unlike other methods in use, the diffusion entropy analysis can determine the correct scaling exponents even when the statistical properties are anomalous or strange dynamics (e.g., longrange patterns in evolution) are involved.
Many techniques in modern finance rely substantially on the assumption that the random variables under investigation follow a Gaussian distribution. However, time series observed in finance — but also in other applications — often deviate from the Gaussian model, in that their marginal distributions are heavytailed and often even asymmetric. In such situations, the appropriateness of the commonly adopted normal assumption is highly questionable. It is often argued that financial asset returns are the cumulative outcome of a large number of pieces of information and individual decisions arriving almost continuously in time. Hence, in the presence of heavytails it is natural to assume that they are approximately governed by some (Lévy) stable nonGaussian distribution. Other leptokurtic distributions, including Student’s , Weibull, and hyperbolic, lack the attractive CLT property. RE can deal very efficiently with heavytailed distributions that occur in realworld complex dynamics epitomized, e.g., in biological or financial time series transferent (). Moreover, as we have already seen, RE is instrumental in uncovering a selfsimilarity present in the underlying distribution beck (). To better appreciate the rôle of RE in a complex data analysis and to develop our study further, it is helpful to examine some simple model situation. To this end, let us begin to assume that the PDF has the selfsimilarity encoded via a simple scaling relation Grigolini ()
(13) 
with being a dimensionless integer (say ) measuring elapsed time by counting timer ticks that are units apart. Alternatively we could use in (13) instead of the (dimensionless) the fraction with both and dimension full. In the latter case and would be measured in seconds.
Such a scaling is known to hold, for instance, in Gaussian distribution, exponential distribution or more generally for Lévy stable distributions. For stable distributions, the relation (13) basically means that the probability , when understood as a sum of random variables is, up to a certain scale factor, identical to that of individual variables, i.e., . This is a typical signature of fractal behavior — the whole looks like its parts. One way how to operationally extract the scaling coefficient is to employ the differential (or continuous) Shannon entropy
(14) 
because in such a case
(15) 
with being a independent constant. So can be decoded from loglin plot in the () plane. Note that for the Brownian motion this gives which implies the wellknown Brownian scaling, and for (Lévy) stable distribution is the exponent related with the Lévy parameter via relation .
Although the above singlescale models capture much of the essential dynamics involved, e.g., in financial markets, they neglect the phenomenon of temporal correlation that in turn leads to both long and shortrange heterogeneities encountered in empirical time sequences. Such correlated time series typically do not obey the central limit theorem and are described either with anomalous singlescaling exponent or via multiplescaling. As a rule, in realistic time series the scaling exponents differ at different time scales, i.e., for each time window of size^{3}^{3}3Here is again a dimensionless (time) tick count variable, i.e., integer. (within some fixed time horizon ) exponents have generally different values for different . In order to avoid technical difficulties, let us consider that such that which means, that is divisible by . One can further assume that for each fixed time scale the underlying processes are statistically independent and so the total PDF with the time horizon can be written as a convolution of respective PDF’s, i.e.
(16) 
where Note in particular that (16) can be equivalently rewritten as
(17) 
where
(18) 
In order to obtain a nontrivial scaledependent scaling we will assume that the PDF is the Lévy stable distribution. For simplicity’s sake we will consider only the symmetric stable Lévy distribution, i.e., distribution of the form
(19) 
where and are the Lévy scale parameter and Lévy stability index, respectively. This allows to identify with and with .
Analogously as in the singlescale case, we can pinpoint a concrete scaledependent scaling by looking at the dependence of the differential entropy on the time scale . As mentioned in Section 2, the monofractals correspond to situations with , which is also the basic reason why the RE of order (i.e., Shannon’s entropy) plays such an important rôle in the monofractal DEA. On the other hand, in the multifractal case (such as the case of scale dependent distribution (16)) we should use instead of the Shannon differential entropy the whole class of differential RE, defined as
(20) 
to obtain the local scaling . Indeed, by plugging (16) with (19) to (20) we receive
(21)  
As a consequence of the asymptotic behavior of for (which at large behaves as ), the formula (21) has a good mathematical sense only for . The limit case corresponds to the Gaussian distribution which is well behaved for any . We may note further that the scale is not explicitly present in (21), instead it is implicitly reflected only via the stability index . This is understandable given that the scale cannot be directly observed, while is, in principle, an observable quantity JA2 (); Bashkirov:06 (). When the time lag is small one might expect that can be analytically extended from to . It is then easy to check that
(22) 
and, in particular
(23)  
Because is a monotonically increasing function of (see, e.g., Ref. Johnson:04 () — the more leptocurtic the higher Shannon’s entropy/ignorance), the derivative term in (23) is positive. Since also (note that is an integer fulfilling ), the condition (23) implies that or equivalently that is a constant. This confirms the former observation that with Shannon’s differential entropy one can well quantify only monofractal time series. On the other hand, when we can address also the case with a nonconstant scaling exponent. This is possible because the expression within can be zero for . In fact, it is not difficult to check numerically that is for monotonically increasing function of while for it is not. This in turn allows (at least for some time horizons ) for a nontrivial dependence of on . This is depicted on Fig. 2. Of course, the fact that does not, per se, preclude a possibility that is still a constant (as seen from (23)). If this would be the case, the time series would be a monofractal, which — being a independent statement — could be verified by looking at few ’s with . If monofractality would be indeed confirmed then further analyzed could done solely via Shannon’s entropy. Whether monofractal or multifractal, from the loglin plot in the () plane one can extract the desired scaling (cf. Ref. Renyi ()). By realizing that the considered (dimensionless) time variable is measured in units , we can the logarithmic term in (21) write as
(24) 
where both and are now measured in (dimensionless) seconds. By comparing this with (7) we see that for sufficiently small time lags (much smaller than a typical autocorrelation time) the generalized dimension . Our toy model system is thus clearly indicative of the RE’s important rôle in multifractal spectrum analysis inasmuch as a particular choice of can “highlight” specific time scales .
For general heterogeneous empirical time sequences, the question remains, if we are able to evaluate the RE for distributions and parameters of interest. For instance, for negative ’s the situation is notoriously problematic, because for PDF’s with unbounded support the integral of does not converge. Of course, we can assume a truncated (or regulated) model, where we would specify minimal and maximal value of the support. This helps formally with the integral convergence, nevertheless, regarding the time dependence of the PDF, it will be very problematic to define such timedependent bounds properly, and what more, this most certainly will affect the scaling behavior. In fact, from the information theoretical ground, one should refrain from using Rényi’s entropy with negative ’s. This is because reliability of a signal decoding (or information extraction) is corrupted for negative ’s (see, e.g., Ref. Renyi ()). In the following we will confine our attention mostly on positive valued ’s.
It should be finally stressed that in contrast to the discrete case (7), the differential RE is not(!) generally positive. In particular, a distribution which is more confined than a unit volume has less Rényi’s entropy than the corresponding entropy of a uniform distribution over a unit volume, and hence it yields a negative (see, e.g., Ref. Renyi ()).
4 Probability histogram estimation and its error
As we have seen in the previous section, when trying to estimate the scaling exponent , one needs to specify the underlying probability distribution in the form of a histogram for each fixed . Multifractality is, however, a continuous property. Any tool designed to validate the multifractal property of a time series faces difficulties implied by finite size and discretization and any technique that is supposed to verify the multifractality involves some interpolation scheme which can make it prone to some bias. Particularly, the formation of the histogram from the (unknown) underlying should be done with care because in the REbased analysis one needs not only all ’s, but also their powers, i.e., ’s for different values of . In this section we evaluate the error when is replaced by its histogram and outline what choice of the binsize is optimal in order to minimize the error in the RE estimation. We begin with basic definition of histogram and its properties, then analyze the situation for different values of and finally derive explicit formulas for optimal binwidths for different values of .
4.1 Basic properties of histograms
Let assume that from the underlying PDF we generate (e.g., via sampling) a histogram with some inherent binwidth , i.e.
(25) 
where is a characteristic function of the interval , is the number of bins, are the actual sampling values at time and are sampled counts, so that . In the latter equality, we formally extended the sum over all , which enlarges the domain of the histogram to the whole real axis, i.e. below and above . Indeed for is . We note further that there is a simple relationship between binwidth and number of bins , namely
(26) 
where denotes the ceiling function (i.e., the smallest integer not less than ). Depending on concrete realization of sampling, i.e., on the series , frequencies acquire values from to , probability that the frequency is equal to is equal to binomial distribution after trial, each with probability , where , and indeed, in the large limit . From this point of view the histogram represents a class of possible approximations of underlying PDF, which we further denote as . The class of all histograms contains piecewise constant functions, which in each interval of length acquire values from the set , and to each such histogram is assigned a probability of occurrence (determined by probabilities of ). Construction of histogram from sampled data corresponds thus to a choice of a one particular histogram from the class . The th power of the histogram (necessary for the RE estimation) is then simply given as
(27) 
which is a simple consequence of the definition of characteristic functions . Our aim, of course, is to create such histogram that would be the best representation of the underlying distribution, in a sense, that it would yield the smallest possible error in the RE evaluation. For this purpose, it is necessary to find an appropriate measure between underlying PDF and sampled histogram that would be both computationally tractable and conceptually well motivated. The crucial question that should be then addressed is whether the optimal binwidth can be independent or whether the minimality of the error will require some nontrivial dependence. Next subsections are devoted to the discussion of different measures used for measuring the distance (or dissimilarity) between underlying PDF and its histogram and to the ensuing dependence issue.
4.2 Distance between histogram and probability distribution
In statistical estimation problems, there exits a number of measures of dissimilarity between probability distributions, among these Hellinger coefficient, Jeffreys distance, Akaike’s criterion, directed divergence, and its symmetrization divergence provide paradigmatic examples. In the context of the REbased analysis, the most natural measure is the Rényi information divergence of from . This is defined as Renyi:76 ()
(28) 
Apart from a multiplicative prefactor, the divergence coincides with the Chernoff probabilistic distance of order (see, e.g., Ref. Basseville:89 ()). From information theory it is known (see, e.g., Ref. transferent (); Renyi (); Renyi:76 ()) that the Rényi information divergence represents a measure of the information lost when the PDF is replaced (or approximated) by the PDF . Note that in the limit one recovers the usual Shannon entropybased Kullback–Leibler divergence. In the case of histogram sampling, the expected Rényi information divergence
(29) 
represents the mean loss of information. Symbol represents the ensemble average with respect to the ensemble of histograms . Unfortunately, working with the measure (28) brings about some technical difficulties related to the nonlinear structure of the RE. This can be circumvented by progressing with other measures that are closely related to Rényi informational divergence and yet are computationally more tractable. Among these, we have found that the squared measure serves best our purposes. The latter quantifies the distance between the underlying PDF and histogram as
(30) 
This measure has number of desirable properties, as we shall see in the following subsections. The relation between the Rényi information divergence and measure is provided as follows: by using Jensen’s inequality for the logarithm
(31) 
(valid for any ), we obtain that
(32) 
where
(33) 
Eq. (32) is the generalization of the Csiszár–Kulback inequality Csiszar:67 () known from Shannon’s information theory. Note that for we have . This is because for we can write
(34) 
where the integrated probabilities are defined as
(35) 
and where denotes a point in the interval . The last inequality in (34) results from Jensen’s inequality for convex functions.
The situation with is less trivial because
(36) 
due to concavity of . A simple (but not very stringent) majorization of can be found by using
(37) 
and hence . Finally, from previous expressions, it is easy to see that
(38) 
where the second inequality follows from Hölder’s inequality gradstein:a (). Thus, we can estimate the expected Rényi information divergence via the distance, because
(39) 
A substantial advantage in using the expected squared distance, i.e.
(40) 
lies in the possibility of interchanging integral and expectation value . In the case, when the expected value stands before the integral, we have to calculate the expected value over all frequencies , such that and . After interchanging integration and , the ensemble averaging acts on only locally, i.e., the expected value is calculated only over one particular frequency , for which . This brings about a significant simplification in calculations. This is the main reason, why to prefer in practical calculations the norm.
Let us note that in the course of calculations we have also used the norm (see, Eq. (32)). Indeed, some authors use the norm to measure the distance between histogram and PDF (e.g., Ref. Hall ()). In the following we will stick to the mean square distance mainly because of its computational superiority — norms are notoriously easier to work with than norms Scott (). Naturally, different measures can generally lead to different results, and from the previous discussion, it might seem that perhaps other measures could be even more convenient to employ. Nevertheless, discussions in Refs. Hall (); Scott (); Scott2 (); Silverman () imply that in case of histograms which can be regarded as a parameter class of step functions, one can “reasonably” well assume that optimal binwidths obtained from different measures will not drastically differ among themselves. This is also the case of Rényi divergence, and squared measures, among which the measure is definitely the most convenient.
4.3 Dependence of binwidth on
A natural question arises, if it is necessary to estimate the binwidth for each separately, or whether it is not simply enough to estimate the binwidth only for one “referential” , e.g., for , and then work with such a histogram for all other cases. We will now briefly motivate the necessity of introduction for dependent binwidth. More thorough discussion will be presented in the section to follow. Let us have a series sampled from . Let us also have a histogram estimated from the data, with a binwidth , that makes optimal among all histograms that can be obtained from sampled data by changing the value of . For further convenience we denote . The squared distance between th powers of and appearing in Eq. (40) can be further conveniently rewritten as
(41)  
Assuming that the error is for each sufficiently small, we may approximate as
(42) 
and therefore
(43) 
where . Denoting
(44) 
the total squared distance can be expressed as
(45) 
In the following we will confine our discussion only to the middle sum . This is because depends only on the choice of the histogram and hence on the value of , while expressions and dependent more on the actual underlying PDF. We shall note that with increasing the values and get closer to the respective borders of the support of . So for sufficiently large the outer errors can be omitted, and the total error can be represented only by . The discussion of can be divided into three distinct situations, according to the value of :

for negative values of , the sum accentuates the error that is particularly pronounced for distributions with extremely small probabilities . This can be partially compensated by smaller binwidth. However, in case of extreme distributions it is very hard to decide, whether the estimated probability is only an inappropriate outlier, or a sign of presence of extreme events in the system. This error is usually the more pronounced the more negative is. Consequently, the estimation of RE for negative ’s is extremely sensitive (in fact, RE is in this case unreliable information measure Renyi ()) and many authors evaluate RE only for positive ’s.

for these values, the exponent is larger than . Even though the errors from small probabilities are again accentuated, the error is bounded, because for , so the error is not as dramatic as in the first case.

in this case is the error diminished, because the factor suppresses the error exponentially with . The prefactor does not grow as fast, and therefore the error is reduced in this case. Against this suppression acts , which increases the error for small . It is thus generally better in this case not to overfit the histogram too much.
From the aforementioned it should be also evident that minimization of the local squared error , or the integrated squared error does not necessarily minimize the error of . This is because we have to create histograms with different (dependent) binwidths, which generally does not minimize the distance between histogram and the underlying PDF .
4.4 Optimal width of the histogram bin
As discussed above, the proper histogram formation from the underlying distribution is a crucial step in the RE estimation. The issue at stake is thus to find the optimal binwidth that minimizes the Rényi informational error in Eq. (29) (or better the meansquared distance (40)) and hence it provides the least biased empirical value for the RE.
There exist several approaches for optimal binwidth choices in the literature that can be well employed in our mathematical framework. In this connection one can mention, e.g., Sturges rule Sturges (), that estimates the number of bins of the histogram as , which is motivated by the histogram of binomial distributions. This rule is very useful when visualizing data, but in case of PDF approximations, one can find more effective prescriptions. Along those lines, particularly suitable is the classic mean square error (MSE) method (see, e.g., Ref. Lehmann ()) which, among others, allows to quantify the difference between and a th power of the underlying PDF in terms of previously discussed meansquared distance. This leads us to the task of solving
(46) 
Expression (46) implies that we minimize integrated expected local deviations between the underlying PDF and th power of it histogram (or simply histogram) . We begin by first calculating (for simplicity’s sake we omit the subscript ) which can be conveniently rewritten as
(47) 
The first term on the righthandside represents local variance of the estimator and the second term represents squared local bias of the histogram, i.e., . Eq. (47) represents a local deviation of the histogram from the th power of underlying PDF, and so from the estimation theory point of view serves as an estimator of .
To progress with computation of , we find from Eq. (27) that it corresponds to the calculation of variance of the quantity , where labels the bin for which . Naturally, is binomially distributed (as discussed in Sect. 4.1), namely , where is the integrated underlying probability adapted to th bin. Thus,
(48) 
This leads to calculation of fractional moments of binomial distribution, which is generally an intractable task, unless is natural. Indeed, when we have enough statistics, then the CLT implies (see, e.g., Refs. Feller (); Box ()) that the binomial distribution can be approximated by the normal distribution as . In this case we get
(49) 
Moment was replaced by the absolute moment