Multifractal Diffusion Entropy Analysis: Optimal Bin Width of Probability Histograms


Petr Jizba p.jizba@fjfi.cvut.cz Jan Korbel korbeja2@fjfi.cvut.cz Faculty of Nuclear Sciences and Physical Engineering, Czech Technical University in Prague, Břehová 7, 11519, Prague, Czech Republic Institute of Theoretical Physics, Freie Universität in Berlin, Arnimallee 14, 14195 Berlin, Germany Max Planck Institute for the History of Science, Boltzmannstrasse 22, 14195 Berlin, Germany
Abstract

In the framework of Multifractal Diffusion Entropy Analysis we propose a method for choosing an optimal bin-width in histograms generated from underlying probability distributions of interest. The presented method uses techniques of Rényi's entropy and mean squared error analysis to discuss the conditions under which the error in the multifractal spectrum estimation is minimal. We illustrate the utility of our approach by focusing on the scaling behavior of financial time series. In particular, we analyze the S&P500 stock index as sampled at a daily rate in the time period 1950-2013. In order to demonstrate the strength of the proposed method we compare the multifractal δ-spectrum for various bin-widths and show the robustness of the method, especially for large values of the Rényi parameter q. For such values, other methods in use, e.g., those based on moment estimation, tend to fail for heavy-tailed data or data with long correlations. The connection between the δ-spectrum and Rényi's parameter q is also discussed and elucidated on a simple example of a multiscale time series.

keywords:
Multifractals, Rényi entropy, Stable distributions, Time series
Pacs:
89.65.Gh, 05.45.Tp

1 Introduction

The evolution of many complex systems in natural, economical, medical and biological sciences is usually presented in the form of time data-sequences. The widespread availability of computers, together with their improved ability to collect and process large data-sets, has brought about the need for novel methods of analysis. A considerable amount of literature has recently been devoted to developing and using new data-analyzing paradigms. These studies include such concepts as fractals and multifractals Kim:04 (), fractional dynamics Machado (); West (), complexity Park (); Lee:physics0607282 (), entropy densities Lee:physics0607282 () or transfer entropies Schreiber (); Marschinski (); transferent (). Particularly in connection with financial time series there has been rapid development of techniques for measuring and managing the fractal and multifractal scaling behavior from empirical high-frequency data sequences. A non-trivial scaling behavior in a time data-set represents a typical signature of a multi-time-scale cooperative behavior, in much the same way as a non-trivial scaling behavior in second-order phase transitions reflects the underlying long-range (or multi-scale) cooperative interactions. The usefulness of the scaling approach is manifest, for instance, in quantifying critical or close-to-critical scaling which typically signals the onset of financial crises, including stock market crashes, currency crises or sovereign defaults kleinert:a (). A multifractal scaling, in particular, is instrumental in identifying the relevant scales that are involved in both temporal and inter-asset correlations transferent (). In passing, one can mention that aside from financial data sequences, similar (multi)fractal scaling patterns are also observed (and analyzed) in time data-sets of heart rate dynamics Peng2 (); Voit (), DNA sequences Peng (); mantegna (), long-time weather records Talker:00 () or in electrical power loads Bozic:13 ().

In order to identify fractal and multifractal scaling in time series generated by a complex system (of both deterministic and stochastic nature), several tools have been developed over the course of time. Among the most popular are the Detrended Fluctuation Analysis Peng (); mfdfa (), Wavelets multiwavelets (), or Generalized Hurst Exponents generhurst (). The purpose of the present paper is to discuss and advance yet another pertinent method, namely the Multifractal Diffusion Entropy Analysis (MF-DEA). In doing so we will stress the key rôle that Rényi’s entropy (RE) plays in this context. To this end, we will employ two approaches for the estimation of the scaling exponents that can be directly phrased in terms of RE, namely, the monofractal approach of Scafetta et al. dea () and the multifractal approach of Huang et al. mdea (), with further comments of Morozov commentmfdea (). The most important upshot that will emerge from this study is the proposal for the optimal bin-width in empirical histograms. The latter ensures that the error in the evaluation of RE (and hence also of the scaling exponents), when the underlying probability density function (PDF) is replaced by its empirical histogram, is minimal in the sense of Rényi’s information divergence and the associated $L_2$-distance. We further show that the ensuing optimal bin-width permits a fully quantitative characterization of the hierarchy of multifractal scaling exponents.

This paper is structured as follows: In Section 2 we briefly review foundations of the multifractal analysis that will be needed in the following sections. In particular, we introduce such concepts as the Lipschitz–Hölder singularity exponent, the multifractal spectral function and their Legendre conjugates. In Section 3 we state some fundamentals of the Fluctuation collection algorithm and propose a simple instructive example of a heterogeneous multiscale time series. Within this framework we discuss the MF-DEA and highlight the rôle of Rényi’s entropy as a multiscale quantifier. After this preparatory material we turn in Section 4 to the question of the optimal bin-width choice that should be employed in empirical histograms. In particular, we analyze the bin-width that is needed to minimize the error in the multifractal spectrum evaluation. In Section 5, we demonstrate the usefulness and formal consistency of the proposed error estimate by analyzing the time series of the S&P500 market index sampled at a daily (end-of-trading-day) rate in the period from January 1950 to March 2013 (roughly 16000 data points). We apply symbolic computations with open-source software to illustrate the strength of the proposed optimal bin-width choice. In particular, we graphically compare the multifractal δ-spectrum for various bin-widths. Our numerical results imply that the proposed bin-widths are indeed optimal in comparison with other alternatives in use. Implications for the δ-spectrum as a function of Rényi’s parameter q are also discussed and graphically represented. Conclusions and further discussions are relegated to the concluding section. For the reader’s convenience, we present in Appendix A the source code that can be directly employed for efficient estimation of the δ-spectrum (and the ensuing generalized dimension) of long-term data sequences.

2 Multifractal analysis

Let us have a discrete time series $\{x_i\}_{i=1}^{N}$, where the values $x_i$ are obtained from measurements at times $t_i$ with an equidistant time lag $\Delta t$. We divide the whole domain of definition of the $x_i$'s into distinct regions $K_j$ of size $\varepsilon$ and define the probability of each region as

$$p_j(\varepsilon) \;=\; \frac{\mathrm{card}\,\{\,x_i : x_i \in K_j\,\}}{N}\,, \qquad\qquad (1)$$

where “card” denotes the cardinality, i.e., the number of elements contained in a set. For every region we assume that the probability scales as $p_j(\varepsilon) \sim \varepsilon^{\alpha_j}$, where the $\alpha_j$ are scaling exponents, also known as the Lipschitz–Hölder (or singularity) exponents. The key assumption in the multifractal analysis is that in the small-$\varepsilon$ limit the probability distribution depends smoothly on $\alpha$, and thus the probability that some arbitrary region has a scaling exponent in the interval $(\alpha, \alpha + \mathrm{d}\alpha)$ can be considered in the form

$$\mathrm{d}\mu(\alpha) \;\sim\; c(\alpha)\,\varepsilon^{-f(\alpha)}\,\mathrm{d}\alpha\,. \qquad\qquad (2)$$

The corresponding scaling exponent $f(\alpha)$ is known as the multifractal spectrum, and by its very definition it represents the (box-counting) fractal dimension of the subset that carries PDF's with the scaling exponent $\alpha$.

A convenient way to keep track of the various $\alpha_j$'s is to examine the scaling of the corresponding moments. To this end one can define a “partition function”

$$Z(q,\varepsilon) \;=\; \sum_j p_j^{\,q}(\varepsilon) \;\sim\; \varepsilon^{\tau(q)}\,. \qquad\qquad (3)$$

Here we have introduced the scaling function $\tau(q)$ which is (modulo a multiplicator) an analogue of the thermodynamical free energy Renyi (); stanley (). In its essence the partition function (3) is nothing but the $(q-1)$-th order moment of the probability distribution (here $p$ is understood as a random variable that attains the values $p_j$ with probabilities $p_j$), i.e., $Z(q,\varepsilon) = \langle p^{\,q-1}\rangle$. It is sometimes convenient to introduce the generalized mean of the random variable $p$ as

$$\langle p \rangle_{\phi} \;=\; \phi^{-1}\!\left(\sum_j p_j\, \phi(p_j)\right). \qquad\qquad (4)$$

The function $\phi$ is also known as the Kolmogorov–Nagumo function Renyi (). For the choice $\phi(x) = x^{\,q-1}$ one obtains the so-called $q$-mean, which is typically denoted as $\langle p \rangle_q$. It is then customary to introduce the scaling exponent of the $q$-mean as

$$D_q \;=\; \lim_{\varepsilon \to 0}\, \frac{\ln \langle p \rangle_q}{\ln \varepsilon}\,. \qquad\qquad (5)$$

Such a $D_q$ is called a generalized dimension, and from (3) it is connected with $\tau(q)$ via the relation $\tau(q) = (q-1)D_q$. In some specific situations the generalized dimension can be identified with other well-known fractal dimensions, e.g., for $q = 0$ we have the usual box-counting fractal dimension, for $q = 1$ we have the informational dimension, and for $q = 2$ it corresponds to the correlation dimension. The generalized dimension itself is obtainable from the relation

$$D_q \;=\; \lim_{\varepsilon \to 0}\, \frac{1}{q-1}\, \frac{\ln \sum_j p_j^{\,q}(\varepsilon)}{\ln \varepsilon}\,, \qquad\qquad (6)$$

which motivates the introduction of Rényi’s entropy (here and throughout we use the natural logarithm, though from the information point of view it is more natural to phrase RE via the logarithm to the base two; RE thus defined is then measured in natural units — nats, rather than bits)

$$S_q(\mathcal{P}) \;=\; \frac{1}{1-q}\, \ln \sum_j p_j^{\,q}\,. \qquad\qquad (7)$$

The generalized dimension defined via (6) is also known as a Rényi dimension of order $q$. Eq. (7), in turn, implies that for small $\varepsilon$ one has $S_q \approx D_q \ln(1/\varepsilon) + c_q$, where $c_q$ is a term independent of $\varepsilon$ (cf. Ref. Renyi ()). The multifractal spectrum $f(\alpha)$ and the scaling function $\tau(q)$ are not independent. The relation between them can be found from the expression for the partition function, which can be, with the help of (1) and (2), equivalently written as

$$Z(q,\varepsilon) \;=\; \int \mathrm{d}\alpha'\; c(\alpha')\, \varepsilon^{-f(\alpha')}\, \varepsilon^{\,q\alpha'}\,. \qquad\qquad (8)$$

By taking the steepest-descent approximation one finds that

$$\tau(q) \;=\; q\, a(q) \,-\, f\bigl(a(q)\bigr)\,. \qquad\qquad (9)$$

Here $a(q)$ is the value of the singularity exponent that maximizes $f(\alpha) - q\alpha$ for a given $q$. Together with the assumption of differentiability, one finds that $q = \mathrm{d}f(\alpha)/\mathrm{d}\alpha|_{\alpha = a(q)}$, and hence Eq. (9) is nothing but the Legendre transform between the two conjugate pairs $(q, \tau(q))$ and $(\alpha, f(\alpha))$. The Legendre transform $\tau(q)$ of the function $f(\alpha)$ contains the same information as $f(\alpha)$ itself, and is the characteristic customarily studied when dealing with multifractals multiwavelets (); mdea (); Renyi (); stanley ().
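The chain $\varepsilon \to p_j(\varepsilon) \to S_q \to D_q \to \tau(q) \to f(\alpha)$ outlined above can be followed numerically. The sketch below (in Python; all function and variable names are our own illustration, not part of any published code) estimates box probabilities of a sample at several box sizes, fits $D_q$ as the slope of $S_q$ against $\ln(1/\varepsilon)$, and obtains $f(\alpha)$ from $\tau(q) = (q-1)D_q$ by a discrete Legendre transform.

```python
import numpy as np

def box_probabilities(x, eps):
    """Normalized occupation probabilities of boxes of size eps."""
    bins = np.arange(x.min(), x.max() + eps, eps)
    counts, _ = np.histogram(x, bins=bins)
    p = counts / counts.sum()
    return p[p > 0]                       # empty boxes do not contribute

def renyi_entropy(p, q):
    """Discrete Renyi entropy in nats; q = 1 falls back to Shannon."""
    if np.isclose(q, 1.0):
        return -np.sum(p * np.log(p))
    return np.log(np.sum(p ** q)) / (1.0 - q)

def renyi_dimensions(x, qs, eps_list):
    """Slope of S_q versus ln(1/eps) gives D_q for each q."""
    log_inv_eps = np.log(1.0 / np.asarray(eps_list))
    D = []
    for q in qs:
        S = [renyi_entropy(box_probabilities(x, e), q) for e in eps_list]
        D.append(np.polyfit(log_inv_eps, S, 1)[0])
    return np.asarray(D)

def legendre_spectrum(qs, D):
    """tau(q) = (q-1) D_q;  alpha = dtau/dq;  f(alpha) = q*alpha - tau."""
    tau = (qs - 1.0) * D
    alpha = np.gradient(tau, qs)
    return alpha, qs * alpha - tau

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_cauchy(50_000)       # heavy-tailed toy sample
    x = x[np.abs(x) < 20]                 # crude truncation of the far tails
    qs = np.linspace(0.5, 4.0, 15)
    D = renyi_dimensions(x, qs, eps_list=[0.05, 0.1, 0.2, 0.4])
    alpha, f = legendre_spectrum(qs, D)
    print(np.round(D, 3))
```

For a monofractal sample the fitted $D_q$ comes out flat in $q$ and the pair $(\alpha, f(\alpha))$ collapses to a single point, in line with the remark below.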

In passing we can observe that the passage from multifractals to monofractals is made by assuming that the $\alpha$-interval becomes infinitesimally narrow and the underlying PDF is smooth. In such a case both $f(\alpha)$ and $D_q$ collapse to a single constant value (cf. Ref. Renyi ()).

3 Diffusion entropy analysis

3.1 Fluctuation collection algorithm

In order to be able to utilize RE for extracting scaling exponents, we need to correctly estimate the probability distribution from a given empirical data sequence. The method that is particularly useful for our purpose is the so-called Fluctuation collection algorithm (FCA) dea (). To understand what is involved, let us assume that $\{\xi_i\}_{i=1}^{N}$ is a stationary time series. This can be converted to a random walk (i.e., a diffusion trajectory) by considering the time series as being a set of one-dimensional diffusion-generating fluctuations. With this, a diffusing “particle’s” position at time $t$ will be $x(t) = \sum_{i \le t} \xi_i$. Here $t = n\Delta t$ with $n \in \mathbb{N}$ and $\Delta t$ being the elementary time lag. The reason behind the usage of a random-walk-like process instead of the original (noise-like) data time series is the Central Limit Theorem (CLT) and its Lévy–Gnedenko heavy-tailed generalization mantegna (); Kolmogorov (). In fact, the ensuing distributions in the CLT basin of attraction allow one to identify (and quantify) the truly universal scaling behavior imprinted in the auxiliary diffusion process. This is because the non-trivial scaling can be phrased in terms of parameters of the involved Lévy stable non-Gaussian distributions.

As a rule, the inter-time scale correlations typically cause the breakdown of the CLT globally but in many cases the validity of the CLT is retained at least locally. In such a case the Lévy scaling parameters depend on the local time scale, which in turn leads to a typical multifractal picture seen in time series born out of complex dynamical systems. An example of this kind of behavior will be presented in the next subsection.

In order to establish the possible existence of a scaling we need to know the probability $p(x,t)$ for the auxiliary diffusion trajectory to be found at the position $x$ at time $t$ ($t$ an integer satisfying the condition $0 < t \le N$). The FCA addresses precisely this issue by turning a single time sequence, or better the derived diffusion trajectory, into a Gibbs ensemble of diffusion trajectories. To this end one defines sub-sequences $\xi^{(k)}$, $k = 0, 1, \ldots, N - t$, where

$$\xi^{(k)}_i \;=\; \xi_{k+i}\,, \qquad i = 1, \ldots, t\,. \qquad\qquad (10)$$

In particular, each aforementioned sub-sequence has the time duration $t$. One might envisage that the sub-sequences with different $k$ are formed by shifting a window of a fixed length $t$ by a constant amount within the original time sequence. There are clearly only $N - t + 1$ such shifts possible. For this reason such a selection of sub-sequences is also known as the mobile window method Grigolini ().

From any sub-sequence $\xi^{(k)}$ one may form an auxiliary diffusion trajectory (basically a sample path) as

$$x^{(k)}(t) \;=\; \sum_{i=1}^{t} \xi^{(k)}_i \;=\; \sum_{i=1}^{t} \xi_{k+i}\,. \qquad\qquad (11)$$

This mapping of a time series to a diffusion process generates an overlapping diffusion process. The ensemble of such diffusion trajectories now allows one to find the required probability: all obtained values of $x^{(k)}(t)$ for different $k$'s are divided into regions (bins) of length $h$ and the probability is estimated from the (normalized) equidistant histogram, i.e.

$$p_j(t) \;\approx\; \frac{\mathrm{card}\,\bigl\{\,k : x^{(k)}(t) \in j\text{-th bin}\,\bigr\}}{N - t + 1}\,. \qquad\qquad (12)$$

The aforementioned FCA is illustrated in Fig. 1 with the time series of the S&P500 index.

Figure 1: Illustration of the fluctuation collection algorithm. From above: a) Time series of the financial index S&P500 from January 1950 to March 2013, containing approx. 16000 entries. b) A subset of S&P500 for the year 2008 alone. c) Fluctuation collection algorithm for the first two months of 2008 at a fixed window length $t$ ($\Delta t$ is one day). The series is partially integrated, i.e., the fluctuation sums (defined in Eq. (11)) are collected into the histogram on the right-hand side. d) Fluctuation collection algorithm for the whole year 2008 ($\Delta t$ is one day). This histogram was estimated independently of the first one (for every histogram the Scott optimal bin-width rule, discussed at length in the following sections, was applied separately), and therefore the estimated bin-widths of the two histograms differ. This is a problem for RE estimation, so it is necessary to find a common bin-width for all scales.

In the case of multidimensional data series, i.e., when $\xi_i \in \mathbb{R}^d$ with $d > 1$, we estimate a $d$-dimensional histogram with hyper-cubical bins of the elementary volume $h^d$. Unfortunately, as it stands, the FCA does not prescribe the size of $h$, which is, however, crucial for the correct RE evaluation. In the following subsection we will see that different Rényi parameters $q$ emphasize different parts of the distribution, and hence the optimal bin width will generally be $q$-dependent. In Section 4 we will discuss the actual optimal value of such $h$.
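A minimal sketch of the FCA reads as follows (again in Python; the helper name fluctuation_collection and the heavy-tailed toy returns are our own illustrative choices). For a given window length $t$ it collects the overlapping partial sums of Eq. (11) and bins them into the normalized histogram of Eq. (12); note that one and the same bin-width $h$ is passed in for every scale $t$, in line with the common-bin-width requirement mentioned in the caption of Fig. 1.

```python
import numpy as np

def fluctuation_collection(xi, t, bin_width):
    """Overlapping partial sums of length t, binned into a normalized histogram.

    xi        : 1-D array of stationary fluctuations (e.g. log-returns)
    t         : window length in elementary time lags
    bin_width : common bin width h used for all scales t
    Returns (bin_edges, probabilities) with the probabilities summing to one.
    """
    xi = np.asarray(xi, dtype=float)
    n = xi.size
    if t >= n:
        raise ValueError("window length must be shorter than the series")
    # x^{(k)}(t) = xi_{k+1} + ... + xi_{k+t} for k = 0, ..., n - t
    cumsum = np.concatenate(([0.0], np.cumsum(xi)))
    x_t = cumsum[t:] - cumsum[:-t]                 # all n - t + 1 overlapping sums
    edges = np.arange(x_t.min(), x_t.max() + bin_width, bin_width)
    counts, edges = np.histogram(x_t, bins=edges)
    return edges, counts / counts.sum()

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    returns = rng.standard_t(df=3, size=16_000) * 1e-2   # heavy-tailed toy returns
    for t in (16, 64, 256):
        edges, p = fluctuation_collection(returns, t, bin_width=0.02)
        print(t, len(p), p.max())
```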

It is clear that the FCA is not sensibly applicable in cases of non-ergodic and non-Markovian systems, phrased, e.g., via accelerating, path-dependent or aging random walks. This is because the key assumption behind the whole fluctuation collection construction is a reliable use of the Gibbs ensemble method which, in turn, is justified only when the time series in question represents a time evolution of an underlying system whose complex statistical rules do not change in time. It is this limited framework of the FCA that will be, for simplicity’s sake, utilized in the sections to follow. Despite its limited applicability, one is still able to draw from this scheme valuable conclusions about the scaling behavior of complex multi-scale data sets. The interested reader may find some generalizations of the FCA to non-stationary time series, e.g., in Refs. Grigolini (); GrigoliniII ().

3.2 Monofractal and Multifractal diffusion entropy analysis

Statistical fractals and multifractals cannot be described comprehensively by descriptive statistical measures, such as the mean or variance, because these depend on the scale of observation in a power-law fashion. In the literature one can find a number of approaches allowing one to extract the above multifractal scaling exponents from empirical data sequences. To these belong, for instance, the Detrended Fluctuation Analysis constructed on random walks Peng (); mfdfa (), the signal-theory-based Wavelet Analysis multiwavelets (), Generalized Hurst Exponents generhurst (), etc. In the following we will employ two further approaches for the estimation of scaling exponents that are directly based on the use of RE, namely, the Monofractal Diffusion Entropy Analysis of Scafetta et al. dea () and the MF-DEA of Huang et al. mdea () and Morozov commentmfdea (). The reason for choosing these approaches is that, unlike other methods in use, the diffusion entropy analysis can determine the correct scaling exponents even when the statistical properties are anomalous or strange dynamics (e.g., long-range patterns in evolution) are involved.

Many techniques in modern finance rely substantially on the assumption that the random variables under investigation follow a Gaussian distribution. However, time series observed in finance — but also in other applications — often deviate from the Gaussian model, in that their marginal distributions are heavy-tailed and often even asymmetric. In such situations, the appropriateness of the commonly adopted normal assumption is highly questionable. It is often argued that financial asset returns are the cumulative outcome of a large number of pieces of information and individual decisions arriving almost continuously in time. Hence, in the presence of heavy tails it is natural to assume that they are approximately governed by some (Lévy) stable non-Gaussian distribution. Other leptokurtic distributions, including Student’s t, Weibull, and hyperbolic, lack the attractive CLT property. RE can deal very efficiently with heavy-tailed distributions that occur in real-world complex dynamics epitomized, e.g., in biological or financial time series transferent (). Moreover, as we have already seen, RE is instrumental in uncovering a self-similarity present in the underlying distribution beck (). To better appreciate the rôle of RE in complex data analysis and to develop our study further, it is helpful to examine a simple model situation. To this end, let us begin by assuming that the PDF has the self-similarity encoded via a simple scaling relation Grigolini ()

$$p(x,t) \;=\; \frac{1}{t^{\delta}}\, F\!\left(\frac{x}{t^{\delta}}\right), \qquad\qquad (13)$$

with $t$ being a dimensionless integer (say $t = 1, 2, \ldots$) measuring elapsed time by counting timer ticks that are $\Delta t$ units apart. Alternatively we could use in (13), instead of the (dimensionless) $t$, the fraction $\bar{t}/\Delta t$ with both $\bar{t}$ and $\Delta t$ dimensionful. In the latter case $\bar{t}$ and $\Delta t$ would be measured in seconds.

Such a scaling is known to hold, for instance, for the Gaussian distribution, the exponential distribution or, more generally, for Lévy stable distributions. For stable distributions, the relation (13) basically means that the probability $p(x,t)$, when understood as the PDF of a sum of $t$ random variables, is, up to a certain scale factor, identical to that of the individual variables, i.e., $p(x,t) = t^{-\delta}\, p(x\, t^{-\delta}, 1)$. This is a typical signature of fractal behavior — the whole looks like its parts. One way to operationally extract the scaling coefficient $\delta$ is to employ the differential (or continuous) Shannon entropy

$$S(t) \;=\; -\int_{\mathbb{R}} \mathrm{d}x\; p(x,t)\, \ln p(x,t)\,, \qquad\qquad (14)$$

because in such a case

$$S(t) \;=\; A \,+\, \delta \ln t\,, \qquad\qquad (15)$$

with $A$ being a $t$-independent constant. So $\delta$ can be decoded from a log-lin plot in the $(\ln t, S(t))$ plane. Note that for the Brownian motion this gives $\delta = 1/2$, which implies the well-known Brownian scaling, and for a (Lévy) stable distribution the $\delta$-exponent is related to the Lévy stability parameter $\alpha$ via the relation $\delta = 1/\alpha$.
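In practice $\delta$ is read off as the slope of the Shannon entropy of the FCA histograms against $\ln t$. A hedged sketch (reusing the fluctuation_collection helper introduced in Sect. 3.1; the names are ours):

```python
import numpy as np

def shannon_entropy_from_histogram(p, bin_width):
    """Differential Shannon entropy (nats) of a normalized histogram."""
    p = p[p > 0]
    return -np.sum(p * np.log(p / bin_width))

def dea_scaling_exponent(xi, t_values, bin_width):
    """Fit S(t) = A + delta * ln(t) and return (delta, A)."""
    entropies = []
    for t in t_values:
        _, p = fluctuation_collection(xi, t, bin_width)   # sketch from Sect. 3.1
        entropies.append(shannon_entropy_from_histogram(p, bin_width))
    delta, A = np.polyfit(np.log(t_values), entropies, 1)
    return delta, A
```

For an uncorrelated Gaussian toy series the fitted slope should come out close to the Brownian value $\delta = 1/2$ quoted above.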

Although the above single-scale models capture much of the essential dynamics involved, e.g., in financial markets, they neglect the phenomenon of temporal correlation, which in turn leads to both long- and short-range heterogeneities encountered in empirical time sequences. Such correlated time series typically do not obey the central limit theorem and are described either by an anomalous single scaling exponent or via multiple scaling. As a rule, in realistic time series the scaling exponents differ at different time scales, i.e., for each time window of size $s$ (here $s$ is again a dimensionless time-tick count variable, i.e., an integer) within some fixed time horizon $T$, the exponents have generally different values for different $s$. In order to avoid technical difficulties, let us consider that $T = n s$ with $n$ a positive integer, which means that $T$ is divisible by $s$. One can further assume that for each fixed time scale the underlying processes are statistically independent, and so the total PDF with the time horizon $T$ can be written as a convolution of the respective PDF's, i.e.

(16)

where the individual factors denote the PDF's of the successive time windows. Note in particular that (16) can be equivalently rewritten as

(17)

where

(18)

In order to obtain a non-trivial scale-dependent scaling we will assume that the PDF is a Lévy stable distribution. For simplicity's sake we will consider only the symmetric $\alpha$-stable Lévy distribution, i.e., a distribution of the form

$$L_{\alpha}(x;\sigma) \;=\; \frac{1}{2\pi} \int_{\mathbb{R}} \mathrm{d}k\; e^{\,ikx}\, e^{-\sigma^{\alpha} |k|^{\alpha}}\,, \qquad\qquad (19)$$

where $\sigma$ and $\alpha$ are the Lévy scale parameter and the Lévy stability index, respectively. This allows one to identify the window-dependent parameters of the multi-scale model with the corresponding $\sigma$'s and $\alpha$'s.
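For numerical experiments with this multi-scale model, symmetric $\alpha$-stable increments can be drawn directly, e.g., with SciPy (a sketch under our own parameter choices; setting the skewness parameter to zero gives the symmetric case of Eq. (19)):

```python
import numpy as np
from scipy.stats import levy_stable

def symmetric_stable_sample(alpha, sigma, size, seed=0):
    """Draw symmetric alpha-stable increments with scale sigma (beta = 0)."""
    rng = np.random.default_rng(seed)
    return levy_stable.rvs(alpha, 0.0, loc=0.0, scale=sigma,
                           size=size, random_state=rng)

# Two-scale toy series: fast fluctuations with alpha = 1.7 plus a slower,
# heavier-tailed component with alpha = 1.3, mimicking the scale-dependent
# convolution structure of Eq. (16).
xi = (symmetric_stable_sample(1.7, 1.0, 10_000, seed=1)
      + symmetric_stable_sample(1.3, 0.3, 10_000, seed=2))
```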

Analogously to the single-scale case, we can pinpoint a concrete scale-dependent scaling by looking at the dependence of the differential entropy on the time scale $s$. As mentioned in Section 2, monofractals correspond to situations with a single ($q$-independent) scaling exponent, which is also the basic reason why the RE of order $q = 1$ (i.e., Shannon's entropy) plays such an important rôle in the monofractal DEA. On the other hand, in the multifractal case (such as the case of the scale-dependent distribution (16)) we should use, instead of the Shannon differential entropy, the whole class of differential RE's, defined as

$$S_q(t) \;=\; \frac{1}{1-q}\, \ln \int_{\mathbb{R}} \mathrm{d}x\; p^{\,q}(x,t)\,, \qquad\qquad (20)$$

to obtain the local scaling. Indeed, by plugging (16) with (19) into (20) we obtain

(21)

As a consequence of the asymptotic behavior of the Lévy stable distribution (which at large $|x|$ behaves as $|x|^{-1-\alpha}$), the formula (21) has a good mathematical sense only for $q > 1/(1+\alpha)$. The limit case $\alpha = 2$ corresponds to the Gaussian distribution, which is well behaved for any positive $q$. We may note further that the scale parameter is not explicitly present in (21); instead, it is implicitly reflected only via the stability index $\alpha$. This is understandable given that the scale cannot be directly observed, while $\alpha$ is, in principle, an observable quantity JA2 (); Bashkirov:06 (). When the time lag is small one might expect that the result can be analytically extended from integer to real values of the time variable. It is then easy to check that

(22)

and, in particular

(23)
Figure 2: Solutions of Eq. (22) as a function of the time scale at a constant time horizon — Fig. a), resp. Fig. b) (for small values of the time scale) — or as a function of the time horizon at a constant time scale — Fig. c), resp. Fig. d) (for small values of the time horizon). These solutions show which $q$'s have the ability to detect non-trivial multifractal scaling in the multi-scale model defined in Sect. 3.2. The curves obtained in the log-lin plots represent the best polynomial fits adapted to the given points. They imply that the required $q$ grows quickly — faster than an exponential function of the time scale, respectively of the time horizon. On the other hand, the exponential dependence is well satisfied at small values.

Because the differential Shannon entropy of a stable distribution is a monotonically increasing function of its heavy-tailedness (see, e.g., Ref. Johnson:04 () — the more leptokurtic the distribution, the higher Shannon's entropy/ignorance), the derivative term in (23) is positive. Since also the remaining factor is non-negative (note that the time variable is a positive integer), the condition (23) implies that the scaling exponent does not depend on the scale, i.e., that it is a constant. This confirms the former observation that with Shannon's differential entropy one can quantify well only monofractal time series. On the other hand, when $q \neq 1$ we can also address the case of a non-constant scaling exponent. This is possible because the expression within the brackets can be zero for $q \neq 1$. In fact, it is not difficult to check numerically that for some values of $q$ the relevant expression is a monotonically increasing function while for others it is not. This in turn allows (at least for some time horizons) for a non-trivial dependence of the scaling exponent on the time scale. This is depicted in Fig. 2. Of course, the fact that $q \neq 1$ does not, per se, preclude the possibility that the scaling exponent is still a constant (as seen from (23)). If this were the case, the time series would be a monofractal, which — being a $q$-independent statement — could be verified by looking at a few $q$'s with $q \neq 1$. If monofractality were indeed confirmed, then further analysis could be done solely via Shannon's entropy. Whether monofractal or multifractal, from the log-lin plot in the $(\ln t, S_q(t))$ plane one can extract the desired scaling (cf. Ref. Renyi ()). By realizing that the considered (dimensionless) time variable is measured in units of $\Delta t$, we can write the logarithmic term in (21) as

(24)

where both $t$ and $\Delta t$ are now measured in seconds. By comparing this with (7) we see that for sufficiently small time lags (much smaller than a typical auto-correlation time) the generalized dimension can be read off directly from the scaling of the differential RE. Our toy model system is thus clearly indicative of RE's important rôle in multifractal spectrum analysis, inasmuch as a particular choice of $q$ can "highlight" specific time scales.
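The multifractal variant of the entropy fit simply repeats the Shannon-entropy procedure above for a whole range of $q$'s, replacing the Shannon entropy of the FCA histogram by its differential Rényi counterpart. A hedged sketch (our own names again, reusing fluctuation_collection from Sect. 3.1); a $q$-dependent slope $\delta(q)$ then signals multifractality, a flat one monofractality:

```python
import numpy as np

def renyi_entropy_from_histogram(p, bin_width, q):
    """Differential Renyi entropy (nats) of a normalized histogram."""
    p = p[p > 0]
    density = p / bin_width
    if np.isclose(q, 1.0):
        return -np.sum(p * np.log(density))
    return np.log(np.sum(p * density ** (q - 1.0))) / (1.0 - q)

def mf_dea_spectrum(xi, qs, t_values, bin_width):
    """Slope of S_q(t) versus ln(t) for each q: the delta(q)-spectrum."""
    log_t = np.log(t_values)
    deltas = []
    for q in qs:
        S = []
        for t in t_values:
            _, p = fluctuation_collection(xi, t, bin_width)
            S.append(renyi_entropy_from_histogram(p, bin_width, q))
        deltas.append(np.polyfit(log_t, S, 1)[0])
    return np.asarray(deltas)
```

Applied to the two-scale toy series xi constructed after Eq. (19), one can inspect how the fitted slopes vary with $q$.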

For general heterogeneous empirical time sequences, the question remains whether we are able to evaluate the RE for the distributions and parameters of interest. For instance, for negative $q$'s the situation is notoriously problematic, because for PDF's with unbounded support the integral of $p^{\,q}$ does not converge. Of course, we can assume a truncated (or regulated) model, where we would specify the minimal and maximal value of the support. This helps formally with the convergence of the integral; nevertheless, regarding the time dependence of the PDF, it would be very problematic to define such time-dependent bounds properly and, what is more, this would most certainly affect the scaling behavior. In fact, on information-theoretic grounds one should refrain from using Rényi's entropy with negative $q$'s. This is because the reliability of signal decoding (or information extraction) is corrupted for negative $q$'s (see, e.g., Ref. Renyi ()). In the following we will confine our attention mostly to positive values of $q$.

It should finally be stressed that, in contrast to the discrete case (7), the differential RE is not(!) generally positive. In particular, a distribution which is more confined than a unit volume has less Rényi entropy than the corresponding entropy of a uniform distribution over a unit volume, and hence it yields a negative differential RE (see, e.g., Ref. Renyi ()).

4 Probability histogram estimation and its error

As we have seen in the previous section, when trying to estimate the scaling exponents, one needs to specify the underlying probability distribution in the form of a histogram for each fixed time scale. Multifractality is, however, a continuous property. Any tool designed to validate the multifractal property of a time series faces difficulties implied by finite size and discretization, and any technique that is supposed to verify multifractality involves some interpolation scheme, which can make it prone to bias. In particular, the formation of the histogram from the (unknown) underlying PDF should be done with care, because in the RE-based analysis one needs not only all the bin probabilities, but also their powers, i.e., the $q$-th powers for different values of $q$. In this section we evaluate the error made when the underlying PDF is replaced by its histogram and outline which choice of the bin-width is optimal in order to minimize the error in the RE estimation. We begin with the basic definition of a histogram and its properties, then analyze the situation for different values of $q$, and finally derive explicit formulas for optimal bin-widths for different values of $q$.

4.1 Basic properties of histograms

Let us assume that from the underlying PDF $f$ we generate (e.g., via sampling) a histogram $\hat{f}$ with some inherent bin-width $h$, i.e.

$$\hat{f}(x) \;=\; \sum_{k} \frac{\nu_k}{N h}\, \chi_{[x_{\min} + k h,\; x_{\min} + (k+1) h)}(x)\,, \qquad\qquad (25)$$

where $\chi_{[a,b)}$ is the characteristic function of the interval $[a,b)$, $m$ is the number of bins, the $x_i$ are the actual sampling values and the $\nu_k$ are the sampled counts, so that $\sum_k \nu_k = N$. In the latter equality we formally extended the sum over all $k \in \mathbb{Z}$, which enlarges the domain of the histogram to the whole real axis, i.e., below $x_{\min}$ and above $x_{\max}$. Indeed, for bins outside the sampled range the counts $\nu_k$ vanish. We note further that there is a simple relationship between the bin-width $h$ and the number of bins $m$, namely

$$m \;=\; \left\lceil \frac{x_{\max} - x_{\min}}{h} \right\rceil, \qquad\qquad (26)$$

where $\lceil \cdot \rceil$ denotes the ceiling function (i.e., the smallest integer not less than its argument). Depending on the concrete realization of the sampling, i.e., on the series $\{x_i\}$, the frequencies $\nu_k$ acquire values from $0$ to $N$; the probability that the frequency $\nu_k$ equals $j$ is given by the binomial distribution after $N$ trials, each with success probability $P_k$ (the underlying probability integrated over the $k$-th bin), and indeed, in the large-$N$ limit $\nu_k/N \to P_k$. From this point of view the histogram represents a class of possible approximations of the underlying PDF, which we further denote as $\mathcal{H}$. The class of all histograms contains piecewise constant functions which in each interval of length $h$ acquire values from the set $\{0, 1/(Nh), 2/(Nh), \ldots, 1/h\}$, and to each such histogram is assigned a probability of occurrence (determined by the probabilities of the $\nu_k$'s). Construction of a histogram from sampled data thus corresponds to a choice of one particular histogram from the class $\mathcal{H}$. The $q$-th power of the histogram (necessary for the RE estimation) is then simply given as

$$\hat{f}^{\,q}(x) \;=\; \sum_{k} \left(\frac{\nu_k}{N h}\right)^{\!q} \chi_{[x_{\min} + k h,\; x_{\min} + (k+1) h)}(x)\,, \qquad\qquad (27)$$

which is a simple consequence of the definition of the characteristic functions. Our aim, of course, is to create a histogram that is the best representation of the underlying distribution, in the sense that it yields the smallest possible error in the RE evaluation. For this purpose it is necessary to find an appropriate measure of the distance between the underlying PDF and the sampled histogram that is both computationally tractable and conceptually well motivated. The crucial question that should then be addressed is whether the optimal bin-width can be $q$-independent or whether the minimality of the error requires some non-trivial $q$-dependence. The next subsections are devoted to the discussion of different measures used for quantifying the distance (or dissimilarity) between the underlying PDF and its histogram and to the ensuing $q$-dependence issue.
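A direct transcription of Eqs. (25)–(27) may look as follows (a sketch; the helper names are ours):

```python
import numpy as np

def histogram_density(x, h):
    """Equidistant histogram density of Eq. (25) with bin width h."""
    x = np.asarray(x, dtype=float)
    m = int(np.ceil((x.max() - x.min()) / h))     # number of bins, Eq. (26)
    edges = x.min() + h * np.arange(m + 1)
    counts, _ = np.histogram(x, bins=edges)
    return counts / (x.size * h), edges           # piecewise-constant density values

def histogram_q_power(x_eval, density, edges, q):
    """Evaluate the q-th power histogram of Eq. (27) at the points x_eval."""
    x_eval = np.asarray(x_eval, dtype=float)
    idx = np.searchsorted(edges, x_eval, side="right") - 1
    inside = (idx >= 0) & (idx < len(density))
    out = np.zeros_like(x_eval)
    out[inside] = density[idx[inside]] ** q
    return out
```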

4.2 Distance between histogram and probability distribution

In statistical estimation problems there exists a number of measures of dissimilarity between probability distributions; among these the Hellinger coefficient, the Jeffreys distance, Akaike's criterion, the directed divergence and its symmetrization, the $J$-divergence, provide paradigmatic examples. In the context of the RE-based analysis, the most natural measure is the Rényi information divergence of the histogram $\hat{f}$ from the underlying PDF $f$. This is defined as Renyi:76 ()

$$\mathcal{D}_q\bigl( f \,\|\, \hat{f}\, \bigr) \;=\; \frac{1}{q-1}\, \ln \int_{\mathbb{R}} \mathrm{d}x\; f^{\,q}(x)\, \hat{f}^{\,1-q}(x)\,. \qquad\qquad (28)$$

Apart from a multiplicative prefactor, the divergence (28) coincides with the Chernoff probabilistic distance of order $q$ (see, e.g., Ref. Basseville:89 ()). From information theory it is known (see, e.g., Ref. transferent (); Renyi (); Renyi:76 ()) that the Rényi information divergence represents a measure of the information lost when the PDF $f$ is replaced (or approximated) by the PDF $\hat{f}$. Note that in the limit $q \to 1$ one recovers the usual Shannon entropy-based Kullback–Leibler divergence. In the case of histogram sampling, the expected Rényi information divergence

$$\bigl\langle \mathcal{D}_q\bigl( f \,\|\, \hat{f}\, \bigr) \bigr\rangle_{\mathcal{H}} \qquad\qquad (29)$$

represents the mean loss of information. The symbol $\langle \cdots \rangle_{\mathcal{H}}$ represents the ensemble average with respect to the ensemble of histograms $\mathcal{H}$. Unfortunately, working with the measure (28) brings about some technical difficulties related to the non-linear structure of the RE. This can be circumvented by resorting to other measures that are closely related to the Rényi information divergence and yet are computationally more tractable. Among these, we have found that the squared $L_2$-measure serves our purposes best. The latter quantifies the distance between the underlying PDF and the histogram as

$$\bigl\| f^{\,q} - \hat{f}^{\,q} \bigr\|_2^2 \;=\; \int_{\mathbb{R}} \mathrm{d}x\; \bigl( f^{\,q}(x) - \hat{f}^{\,q}(x) \bigr)^2\,. \qquad\qquad (30)$$

This measure has a number of desirable properties, as we shall see in the following subsections. The relation between the Rényi information divergence and the $L_2$-measure is provided as follows: by using Jensen's inequality for the logarithm

(31)

(valid for any positive argument), we obtain that

(32)

where

(33)

Eq. (32) is the $q$-generalization of the Csiszár–Kullback inequality Csiszar:67 () known from Shannon's information theory. Note that for we have . This is because for we can write

(34)

where the integrated probabilities are defined as

(35)

and where denotes a point in the interval . The last inequality in (34) results from Jensen’s inequality for convex functions.

The situation with is less trivial because

(36)

due to concavity of . A simple (but not very stringent) majorization of can be found by using

(37)

and hence . Finally, from previous expressions, it is easy to see that

(38)

where the second inequality follows from Hölder's inequality gradstein:a (). Thus, we can estimate the expected Rényi information divergence via the $L_2$-distance, because

(39)

A substantial advantage in using the expected squared $L_2$-distance, i.e.

$$\Bigl\langle \bigl\| f^{\,q} - \hat{f}^{\,q} \bigr\|_2^2 \Bigr\rangle_{\mathcal{H}} \;=\; \Bigl\langle \int_{\mathbb{R}} \mathrm{d}x\; \bigl( f^{\,q}(x) - \hat{f}^{\,q}(x) \bigr)^2 \Bigr\rangle_{\mathcal{H}}\,, \qquad\qquad (40)$$

lies in the possibility of interchanging the integral and the expectation value $\langle \cdots \rangle_{\mathcal{H}}$. In the case when the expected value stands before the integral, we have to calculate the expected value over all frequencies $\nu_k$ simultaneously. After interchanging the integration and $\langle \cdots \rangle_{\mathcal{H}}$, the ensemble averaging acts on $\hat{f}^{\,q}(x)$ only locally, i.e., the expected value is calculated only over the one particular frequency $\nu_k$ for which $x$ lies in the $k$-th bin. This brings about a significant simplification in calculations. This is the main reason why the $L_2$-norm is preferable in practical calculations.
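As an illustration, the two error measures of Eqs. (28) and (30) can be evaluated side by side for a known PDF and its histogram (a sketch under the assumption of a standard normal underlying PDF; numerical quadrature on a grid replaces the exact integrals, and histogram_density refers to the sketch in Sect. 4.1):

```python
import numpy as np
from scipy.stats import norm

def renyi_divergence(f_vals, fhat_vals, dx, q):
    """D_q(f || fhat) of Eq. (28) on a grid; empty-histogram bins are skipped
    (for q > 1 they would make the divergence infinite)."""
    mask = fhat_vals > 0
    integrand = f_vals[mask] ** q * fhat_vals[mask] ** (1.0 - q)
    return np.log(np.sum(integrand) * dx) / (q - 1.0)

def squared_l2_distance(f_vals, fhat_vals, dx, q):
    """|| f^q - fhat^q ||_2^2 of Eq. (30) on a grid."""
    return np.sum((f_vals ** q - fhat_vals ** q) ** 2) * dx

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    sample = rng.standard_normal(5_000)
    density, edges = histogram_density(sample, h=0.25)   # Sect. 4.1 sketch
    grid = np.linspace(edges[0], edges[-1], 4_000)
    dx = grid[1] - grid[0]
    f_vals = norm.pdf(grid)
    idx = np.clip(np.searchsorted(edges, grid, side="right") - 1,
                  0, len(density) - 1)
    fhat_vals = density[idx]
    for q in (0.5, 1.5, 3.0):
        print(q, renyi_divergence(f_vals, fhat_vals, dx, q),
              squared_l2_distance(f_vals, fhat_vals, dx, q))
```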

Let us note that in the course of the calculations we have also used the $L_1$-norm (see Eq. (32)). Indeed, some authors use the $L_1$-norm to measure the distance between a histogram and the PDF (e.g., Ref. Hall ()). In the following we will stick to the mean square distance mainly because of its computational superiority — $L_2$-norms are notoriously easier to work with than $L_1$-norms Scott (). Naturally, different measures can generally lead to different results, and from the previous discussion it might seem that perhaps other measures could be even more convenient to employ. Nevertheless, discussions in Refs. Hall (); Scott (); Scott2 (); Silverman () imply that in the case of histograms, which can be regarded as a one-parameter class of step functions, one can "reasonably" well assume that the optimal bin-widths obtained from different measures will not drastically differ among themselves. This is also the case for the Rényi divergence and the squared $L_1$- and $L_2$-measures, among which the $L_2$-measure is definitely the most convenient.

4.3 Dependence of bin-width on $q$

A natural question arises whether it is necessary to estimate the bin-width for each $q$ separately, or whether it is enough to estimate the bin-width only for one "referential" $q$, e.g., for $q = 1$, and then work with such a histogram for all other $q$-cases. We will now briefly motivate the necessity of introducing a $q$-dependent bin-width; a more thorough discussion will be presented in the section to follow. Let us have a series sampled from the underlying PDF $f$. Let us also have a histogram $\hat{f}$ estimated from the data, with a bin-width that makes $\hat{f}^{\,q}$ optimal among all histograms that can be obtained from the sampled data by changing the value of the bin-width. For further convenience we denote this optimal bin-width by $h_q$. The squared $L_2$-distance between the $q$-th powers of $f$ and $\hat{f}$ appearing in Eq. (40) can be further conveniently rewritten as

(41)

Assuming that the error is sufficiently small in each bin, we may approximate the $q$-th power of the histogram as

(42)

and therefore

(43)

where . Denoting

(44)

the total squared $L_2$-distance can be expressed as

(45)

In the following we will confine our discussion only to the middle sum. This is because it depends only on the choice of the histogram, and hence on the value of the bin-width, while the outer expressions depend more on the actual underlying PDF. We note that with an increasing number of data points the sampled extremes get closer to the respective borders of the support of $f$. So, for sufficiently large samples the outer errors can be omitted and the total $L_2$-error can be represented by the middle sum alone. Its discussion can be divided into three distinct situations, according to the value of $q$:

$q < 0$: for negative values of $q$, the sum accentuates the error, which is particularly pronounced for distributions with extremely small probabilities. This can be partially compensated by a smaller bin-width. However, in the case of extreme distributions it is very hard to decide whether an estimated probability is only an inappropriate outlier or a sign of the presence of extreme events in the system. This error usually becomes more pronounced the more negative $q$ is. Consequently, the estimation of RE for negative $q$'s is extremely sensitive (in fact, RE is in this case an unreliable information measure Renyi ()) and many authors evaluate RE only for positive $q$'s.

$0 < q < 1$: for these values, the relevant exponent $q-1$ is larger than $-1$. Even though the errors from small probabilities are again accentuated, the error remains bounded, because $p^{\,q} \to 0$ for $p \to 0$, so the error is not as dramatic as in the first case.

$q > 1$: in this case the error is diminished, because the factor $p^{\,q-1}$ suppresses the error exponentially with $q$ (for $p < 1$). The pre-factor $q$ does not grow as fast, and therefore the error is reduced in this case. Against this suppression acts the statistical fluctuation of sparsely populated bins, which increases the error for small bin-widths. It is thus generally better in this case not to over-fit the histogram too much.

From the aforementioned it should also be evident that minimization of the local squared error, or of the integrated squared error, for the $q$-th powers does not necessarily minimize the error of the histogram itself. This is because we have to create histograms with different ($q$-dependent) bin-widths, which generally do not minimize the $L_2$-distance between the histogram and the underlying PDF.

4.4 Optimal width of the histogram bin

As discussed above, the proper histogram formation from the underlying distribution is a crucial step in the RE estimation. The issue at stake is thus to find the optimal bin-width that minimizes the Rényi informational error in Eq. (29) (or rather the mean-squared $L_2$-distance (40)) and hence provides the least biased empirical value for the RE.

There exist several approaches for optimal bin-width choices in the literature that can be well employed in our mathematical framework. In this connection one can mention, e.g., Sturges' rule Sturges (), which estimates the number of bins of the histogram as $m = 1 + \log_2 N$ and is motivated by the histogram of binomial distributions. This rule is very useful when visualizing data, but in the case of PDF approximations one can find more effective prescriptions. Along those lines, particularly suitable is the classic mean square error (MSE) method (see, e.g., Ref. Lehmann ()) which, among others, allows one to quantify the difference between the $q$-histogram and the $q$-th power of the underlying PDF in terms of the previously discussed mean-squared $L_2$-distance. This leads us to the task of solving

$$h_q^{\mathrm{opt}} \;=\; \underset{h}{\operatorname{arg\,min}}\, \int_{\mathbb{R}} \mathrm{d}x\; \Bigl\langle \bigl( \hat{f}^{\,q}(x) - f^{\,q}(x) \bigr)^{2} \Bigr\rangle_{\mathcal{H}}\,. \qquad\qquad (46)$$

Expression (46) implies that we minimize the integrated expected local deviations between the $q$-th power of the underlying PDF and the $q$-th power of its histogram (or simply the $q$-histogram) $\hat{f}^{\,q}$. We begin by first calculating the local mean squared error $\bigl\langle \bigl( \hat{f}^{\,q}(x) - f^{\,q}(x) \bigr)^{2} \bigr\rangle$ (for simplicity's sake we omit the subscript), which can be conveniently rewritten as

$$\bigl\langle \bigl( \hat{f}^{\,q}(x) - f^{\,q}(x) \bigr)^{2} \bigr\rangle \;=\; \Bigl\langle \bigl( \hat{f}^{\,q}(x) - \langle \hat{f}^{\,q}(x) \rangle \bigr)^{2} \Bigr\rangle \,+\, \Bigl( \bigl\langle \hat{f}^{\,q}(x) \bigr\rangle - f^{\,q}(x) \Bigr)^{2}\,. \qquad\qquad (47)$$

The first term on the right-hand side represents the local variance of the estimator $\hat{f}^{\,q}(x)$ and the second term represents the squared local bias of the histogram, i.e., $\bigl(\langle \hat{f}^{\,q}(x)\rangle - f^{\,q}(x)\bigr)^{2}$. Eq. (47) represents a local deviation of the $q$-histogram from the $q$-th power of the underlying PDF, and so from the estimation-theory point of view $\hat{f}^{\,q}(x)$ serves as an estimator of $f^{\,q}(x)$.
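The decomposition (47) can be verified empirically by Monte Carlo: one repeatedly resamples histograms from a known PDF and compares the directly measured mean squared error of $\hat{f}^{\,q}$ at a fixed point with the sum of its variance and squared bias (a sketch; the standard normal PDF and the evaluation point are arbitrary choices of ours, and histogram_density again refers to the sketch in Sect. 4.1):

```python
import numpy as np
from scipy.stats import norm

def mse_decomposition(x0, h, q, n_samples, n_repeats, seed=0):
    """Monte-Carlo check of Eq. (47): MSE = variance + bias^2 at the point x0."""
    rng = np.random.default_rng(seed)
    target = norm.pdf(x0) ** q
    estimates = []
    for _ in range(n_repeats):
        sample = rng.standard_normal(n_samples)
        density, edges = histogram_density(sample, h)     # Sect. 4.1 sketch
        k = np.searchsorted(edges, x0, side="right") - 1
        value = density[k] if 0 <= k < len(density) else 0.0
        estimates.append(value ** q)
    estimates = np.asarray(estimates)
    mse = np.mean((estimates - target) ** 2)
    var = np.var(estimates)
    bias2 = (np.mean(estimates) - target) ** 2
    return mse, var + bias2          # the two numbers agree up to rounding

print(mse_decomposition(x0=0.3, h=0.2, q=2.0, n_samples=2_000, n_repeats=500))
```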

To progress with the computation of the variance term, we find from Eq. (27) that it corresponds to the calculation of the variance of the quantity $(\nu_k/(Nh))^q$, where $k$ labels the bin for which $x \in [x_{\min} + kh, x_{\min} + (k+1)h)$. Naturally, $\nu_k$ is binomially distributed (as discussed in Sect. 4.1), namely $\nu_k \sim \mathrm{Bi}(N, P_k)$, where $P_k$ is the underlying probability integrated over the $k$-th bin. Thus,

(48)

This leads to the calculation of fractional moments of the binomial distribution, which is generally an intractable task unless $q$ is a natural number. Indeed, when we have enough statistics, the CLT implies (see, e.g., Refs. Feller (); Box ()) that the binomial distribution can be approximated by the normal distribution as $\mathrm{Bi}(N, P_k) \approx \mathcal{N}\bigl(N P_k,\, N P_k (1 - P_k)\bigr)$. In this case we get

(49)

Moment was replaced by the absolute moment