# Analysis of the phase transition in the 2d Ising ferromagnet using a Lempel-Ziv string parsing scheme and black-box data-compression utilities

O. Melchert, A. K. Hartmann

Institut für Physik, Universität Oldenburg, Carl-von-Ossietzky Strasse, 26111 Oldenburg, Germany

July 3, 2019

###### Abstract

In this work we consider information-theoretical observables to analyze short symbolic sequences, comprising time-series that represent the orientation of a single spin in a 2D Ising ferromagnet on a square lattice of size , for different system temperatures . The latter were chosen from an interval enclosing the critical point of the model. At small temperatures the sequences are thus very regular, at high temperatures they are maximally random. In the vicinity of the critical point, nontrivial, long-range correlations appear. Here, we implement estimators for the entropy rate, excess entropy (i.e. “complexity”) and multi-information. First, we implement a Lempel-Ziv string parsing scheme, providing seemingly elaborate entropy rate and multi-information estimates and an approximate estimator for the excess entropy. Furthermore, we apply easy-to-use black-box data compression utilities, providing approximate estimators only. For comparison and to yield results for benchmarking purposes we also implement the information-theoretic observables based on the well-established $M$-block Shannon entropy, which is more tedious to apply than the first two “algorithmic” entropy estimation procedures. To test how well one can exploit the potential of such data compression techniques, we aim at detecting the critical point of the Ising ferromagnet. Among the above observables, the multi-information, which is known to exhibit an isolated peak at the critical point, is very easy to replicate by means of both efficient algorithmic entropy estimation procedures. Finally, we assess how well the various algorithmic entropy estimates compare to the more conventional block entropy estimates and illustrate a simple modification that yields enhanced results.

###### pacs:
75.40.Mg, 05.70.Jk

## I Introduction

The standard analysis of phase transitions in terms of statistical mechanics involves the analysis of order parameters and other derivatives of the free energy, related to a given model system goldenfeld1992 (). Routinely one studies model systems that involve many degrees of freedom and local interactions that nevertheless result in a non-trivial, seemingly “complex” behavior. Implying a rather naive use of the word, such systems are regarded as being very “complex”, particularly right at the point where a phase transition occurs in the underlying model. From the point of view of statistical mechanics, a large degree of complexity is shown by growing correlations as one approaches the critical point by tuning a proper system parameter. Correspondingly, throughout the analysis of observables related to such model systems it is often desirable to find a measure for what is naively referred to as the “complexity” of the underlying system Grassberger1986 (); Feldman1998 (). However, a precise definition of the term “complexity” is often elusive. An alternative approach to the analysis of phase transitions, which has recently gained popularity in the analysis of complex systems, is based on a purely information-theoretic approach Feldman2008 (); Crutchfield2003 (); crutchfield2010 (). A variety of previous studies employed such information-theoretic methods to measure the entropy rate (i.e. disorder and randomness) and statistical complexity (i.e. structure, patterns and correlations) for -dimensional systems Arnold1996 (); Crutchfield1997 (); Feldman2003 (); Feldman2008 (); Robinson2011 (); Wilms2011 (); Melchert2013 (). In particular for systems, the excess entropy constitutes a well understood information-theoretic measure of complexity, providing a well defined literal sense for that term. Effectively, the excess entropy accounts for the rapidity of entropy convergence.
In order to obtain numerical values for the entropy rate and complexity, the well established information-theoretic approach presented in Ref. Crutchfield2003 () is based on the notion of a “block entropy”, see discussion below. Among several intriguing findings, it led to the analysis of complexity-entropy diagrams that allow for a characterization of the temporal and spatial dynamics of various stochastic processes, including simple maps as well as Ising spin-systems, in purely information-theoretic coordinates Feldman2008 ().

On the other hand, note that there are a variety of other measures for what is known as algorithmic entropy, as, e.g., the description size of a minimal algorithm (or computer, or circuit) which is able to generate an instance of the problem under scrutiny kolmogorov1963 (); chaitin1987 (); machta2006 (). However, such measures are often impractical when it comes to the analysis of large systems. In this regard, as discussed in the literature Ebeling1997 (); Puglisi2003 (), particular data compression algorithms might render a natural and particularly simple candidate for estimating an algorithmic entropy. The pivotal challenge of such data compression algorithms, readily available as black-box data compression utilities as, e.g., the zlib zlib_ref (), bzip2 bzip2_ref () and lzma lzma_ref () utilities, is to discover patterns (synonymous with regularities, correlations, symmetries and structure; see Sect. II of Ref. Shalizi2001 ()) in the given input data and to exploit the respective redundancies in order to minimize the space required to store the data. Interestingly, the pattern discovery and data compression process of particular data compression schemes finds application in contexts as diverse as, e.g., DNA sequence classification Loewenstern1995 (), entropy estimation Ebeling1997 (); Baronchelli2005 (); Grassberger2002 (), and, more generally, time series analysis Puglisi2003 (). However, at this point please note that not all of the applications of such methods reported in the scientific literature are without critique, see Benedetto2002 (); Khmelev2003 (); Goodman2002 ().

In the presented study we aim to assess how well algorithmic entropy (AE) estimates, obtained using a Lempel-Ziv (LZ) string parsing scheme Lesne2009 (); Liu2012 () and black-box data compression utilities, and the results obtained therewith compare to those obtained by means of the respective block entropy (BE) estimates used in the context of the information theoretic approach mentioned earlier. As raw data, to be processed further by both approaches, we consider binary sequences that represent the spin-flip dynamics, induced by single-spin-flip Metropolis updates newman1999 () of the Ising Ferromagnet (FM) on a square lattice of side length with fully periodic boundary conditions at different temperatures . The temperatures are chosen from the interval , enclosing the critical point of the model. In order to model the binary sequences, a particular spin on the lattice is chosen as a “source”, emitting symbols from the binary alphabet (after a simple transformation of the spin variables). Therefore, the orientation of the source-spin is monitored during a number of Monte Carlo (MC) sweeps to yield symbolic sequences of length up to . Before the spin orientation is recorded, a sufficient number of sweeps are performed to ensure that the system is equilibrated. In this regard, for a square lattice with side length , and starting with all spins “up”, and by analyzing the magnetization of the system, we observed an equilibration time of approximately MC sweeps for the lowest temperature. However, for each system considered we discarded the first sweeps to avoid initial transients.

The aim of this work is to use computer science methods, related to the field of lossless data compression, and apply them to a pivotal model system from statistical mechanics, namely the Ising ferromagnet and the continuous ferromagnet-to-paramagnet transition found therein. In particular, we want to clarify whether the phase transition can be detected, located and analyzed numerically with high precision just by looking at entropy and complexity measures derived via data compression utilities. Being well aware that such AE estimates based on sequence parsing schemes and data compression utilities might only be used to obtain upper bounds on the actual entropies of the underlying (finite) symbolic sequences Ebeling1997 (); Schuermann1996 (); Lesne2009 (), we compare our findings to those obtained using the BE estimators that here serve as a benchmark.

The remainder of the presented article is organized as follows. In section II we introduce the information-theoretic observables obtained from the limiting behavior of block entropies and we detail the LZ string parsing and data compression based entropy measures. In section III we discuss the results obtained by applying the aforementioned entropy estimators and in section IV we conclude with a summary.

## II Information-theoretic observables for symbolic sequences

In subsection II.1 we introduce the basic notation from information theory, subsequently used to define the entropy rate, excess entropy and further related measures that might be associated to a one-dimensional () symbolic sequence of finite length. Regarding the definition of the entropy rate and excess entropy we follow the notation used in Refs. Shalizi2001 (); Crutchfield2003 (); Feldman2008 (), where a more elaborate discussion of the individual information-theoretic observables can be found. In subsection II.2 we further introduce the LZ string parsing scheme and data compression based entropy measures considered in the remainder.

### II.1 Block entropy, entropy rate and complexity

Given a symbolic sequence of finite length , i.e. , where the individual symbols assume a symbolic value randomly drawn from an alphabet of finite size. Here, an individual symbol signifies the outcome of a measurement on a random variable, i.e. the orientation of a single Ising spin at a given point in (simulation) time. Therefore, unless otherwise specified, will denote the binary alphabet . For denoting a particular symbol block of length , the -block Shannon entropy, also called “block entropy” (BE), a prerequisite needed to define the subsequent information theoretic observable, reads

$$H_{\mathrm{BE},M}[S] \equiv -\sum_{s^M \in \mathcal{A}^M} \Pr(s^M)\,\log_2\!\big(\Pr(s^M)\big), \tag{1}$$

wherein specifies the joint probability for blocks of consecutive symbols. Hence, considering a finite sequence, represents the empirical rate of occurrence of in the given sequence. Consequently, depends on the spin-flip dynamics of the chosen Ising spin, implemented by the simulation procedure for the Ising FM, over intervals of consecutive time steps. In the above formula, the sum runs over all possible -blocks, i.e., combinations of consecutive symbols, that might be composed by means of the alphabet (considering a binary alphabet, there are such blocks) and we imply to set . In general, is a nondecreasing function of , bounded by . The upper bound is attained in the limit of sequence length if the probability of a string factorizes and each letter has the same probability of occurrence, i.e., . In the limit of large block-sizes and sequence lengths , might thus not converge to a finite value. As a remedy, due to the above bounding value, the entropy rate

$$h \equiv \lim_{M,N\to\infty} \frac{H_{\mathrm{BE},M}[S]}{M} \tag{2}$$

might be considered instead. It specifies the asymptotic rate of increase of the -block Shannon entropy with block length , and, in the limit of large block size and sequence length, indicates an upper bound on the number of bits needed to encode a symbol of the observed sequence.

In most practical applications the finite length of the underlying symbolic sequences imposes certain sampling issues related to subsequences of a long enough length , rendering it unfeasible to proceed towards very large block sizes. E.g., for sequences of length , where symbols are independent and identically distributed (iid), one might experience a severe undersampling of -blocks if, at a given alphabet size , is too large or is too short. In particular, a naive upper bound might be obtained from the constraint Lesne2009 (); Schuermann1996 (). Consequently, it is desirable to consider proper finite- approximations to the entropy rate observed for sequences of finite length , referred to as apparent entropy rates. Two such estimators are given by the per symbol -block entropy

$$h'_{\mathrm{BE},M}[S] = \frac{H_{\mathrm{BE},M}[S]}{M} \tag{3}$$

and the discrete derivative of Eq. 2, defining the entropy rate as a block entropy increment via

$$h_{\mathrm{BE},M}[S] = H_{\mathrm{BE},M}[S] - H_{\mathrm{BE},M-1}[S], \tag{4}$$

for sequences of finite length and both equations presuming . Viewed as a function of block size, the finite- estimates of the entropy rates converge to the asymptotic value from above. Hence, considering small block sizes, the underlying symbolic sequences tend to look more random than in the limit . Finally, note that the entropy rate is a measure of randomness that might be attributed to the underlying sequences Feldman1998 (); Crutchfield2003 (). Here, for sequences of finite size and for a maximally feasible block-size we denote the block entropy based estimate of the entropy rate as

$$h_{\mathrm{BE}}[S] \equiv h_{\mathrm{BE},M_{\max}}[S]. \tag{5}$$
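As a minimal sketch of Eqs. 1, 3 and 4, the block entropy and the two apparent entropy rates might be estimated from overlapping windows of a finite sequence as follows. The function names are hypothetical, and the convention of a vanishing zero-block entropy is an assumption of this sketch, not taken from the text.

```python
from collections import Counter
from math import log2


def block_entropy(seq, M):
    """M-block Shannon entropy (Eq. 1) from overlapping length-M windows."""
    if M == 0:
        return 0.0  # assumed convention: the zero-block entropy vanishes
    n_blocks = len(seq) - M + 1
    # Empirical rate of occurrence of every M-block seen in the sequence.
    counts = Counter(tuple(seq[i:i + M]) for i in range(n_blocks))
    return -sum((c / n_blocks) * log2(c / n_blocks) for c in counts.values())


def entropy_rate_per_symbol(seq, M):
    """Per-symbol M-block entropy (Eq. 3)."""
    return block_entropy(seq, M) / M


def entropy_rate_increment(seq, M):
    """Block entropy increment (Eq. 4)."""
    return block_entropy(seq, M) - block_entropy(seq, M - 1)


# Illustrative input only: a strictly alternating binary sequence.
seq = [0, 1] * 50
print(block_entropy(seq, 1), entropy_rate_increment(seq, 2))
```

For the alternating sequence the single-symbol entropy is one bit, while the increment at block size two is close to zero, reflecting that the next symbol is fully determined by the previous one.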

For one-dimensional symbolic sequences, there are three different but equivalent expressions to define the excess entropy. These are based on the convergence properties of the entropy rate, the subextensive part of the block entropy in the limit of large block sizes and the mutual information between two semi-infinite blocks of variables, see Refs. Feldman2003 (); Feldman2008 (). Here, we focus on the definition of the excess entropy, also termed “(statistical) complexity”, related to the convergence properties of the entropy rate in the form

$$C_{\mathrm{BE}}[S] = \sum_{M=1}^{\infty} \big(h_{\mathrm{BE},M}[S] - h_{\mathrm{BE}}[S]\big). \tag{6}$$

As pointed out above, the conditional entropies constitute upper bounds on the asymptotic entropy rate, allowing, in principle, for an improving estimate of for increasing . Note that since the sum in Eq. 6 extends to , it implies the limit and . However, for practical purposes, i.e. since we consider sequences of finite length only, the sum in Eq. 6 needs to be truncated at a maximally feasible block-size that still yields a reliable estimate of (see discussion above). Hence, for a symbolic sequence of finite length , accounts for the randomness that is present at small values of but vanishes in the limit of large block sizes. The excess entropy is considered a measure of statistical complexity Feldman1998 (); Crutchfield2003 () with the ability to detect structure within the considered sequences Grassberger1986 (). Thereby it satisfies the desirable “one-hump” criterion Crutchfield2000 (), according to which a proper measure for statistical complexity yields a small numerical value for highly ordered and highly disordered sequences. Further, note that other practically computable approaches exist as well, which indeed seem to measure complexity as expected, like mutual information Grassberger1986 (); cover2006 () or statistical complexity Crutchfield1989 ().

For symbolic sequences, a further information-theoretic observable, termed multi information Erb2004 () is given by the first summand in Eq. 6, i.e.

$$I_{\mathrm{BE}}[S] = h_{\mathrm{BE},1}[S] - h_{\mathrm{BE}}[S]. \tag{7}$$

Albeit is closely related to the excess entropy (this holds only in the limit of large block sizes ; see Ref. Erb2004 () for a more general discussion of the multi information), it captures a particular contribution to the convergence of the entropy density. In this regard, for sequences of infinite length, i.e. in the limit , it measures the decrease of the entropy rate observed by switching from the level of single variables (-block) statistics to the statistics attained as . Recently, the multi-information was introduced and used to characterize spin configurations for the 2D Ising FM in the thermodynamic limit by analytic means Erb2004 (). It was found to exhibit an isolated global maximum right at the critical temperature and thus also satisfies the “one-hump” criterion desired for the full complexity measure.
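To make the truncation of Eq. 6 and the definition in Eq. 7 concrete, the following sketch evaluates both observables from a list of finite-block entropy-rate estimates. The function names and the numerical values are purely illustrative assumptions. By default the last entry of the list plays the role of the entropy rate at the maximal feasible block size (Eq. 5).

```python
def excess_entropy(h, h_inf=None):
    """Truncated version of Eq. 6 for h = [h_1, ..., h_Mmax]."""
    if h_inf is None:
        h_inf = h[-1]  # Eq. 5: estimate at the maximal feasible block size
    return sum(h_M - h_inf for h_M in h)


def multi_information(h, h_inf=None):
    """First summand of the excess entropy (Eq. 7)."""
    if h_inf is None:
        h_inf = h[-1]
    return h[0] - h_inf


# Made-up entropy-rate estimates that decrease with block size.
h = [1.0, 0.5, 0.25]
print(excess_entropy(h), multi_information(h))
```

Since the finite-block estimates approach the asymptotic value from above, every summand is non-negative and the multi-information never exceeds the truncated excess entropy.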

### II.2 String parsing and data compression based entropy measures

Below we illustrate the Lempel-Ziv string parsing scheme and the data-compression tools used to implement entropy measures by algorithmic means.

##### Lempel-Ziv string parsing scheme:

The Lempel-Ziv (LZ) string parsing scheme considered here is based on a coarse graining of the input sequence , yielding a quantity termed “Lempel-Ziv” complexity. Therefore, the input sequence is split into several independent blocks. By traversing the input sequence, a new block is completed whenever one encounters a new subsequence of consecutive symbols that does not match a subsequence of the already traversed part of the input sequence.

More formally, given the symbolic sequence , for which the subsequence might be specified as , we obtain a parsed sequence using the following procedure:

• (1) Initialize the parsed sequence and an empty auxiliary sequence . Traverse the sequence from left to right, using an index initialized as .

• (2.1) In step , advance to and extend the auxiliary sequence via symbol .
E.g., after increasing one has . More generally, if the auxiliary sequence already reads , it is extended to .

• (2.2) Check whether the current auxiliary sequence matches any subsequence of . If not, append the auxiliary sequence as a new “block” to the parsed sequence and reset .
E.g., in step one has . If does not match the symbol , then set and reset .

• (3) Repeat (2.1) and (2.2) until and append to to yield the final parsing .

Finally, the LZ complexity associated to sequence is simply the number of consecutive blocks found after the coarse-graining procedure is completed Liu2012 (). As an example, consider the sequence . Following the above procedure yields the parsed sequences for which the LZ complexity reads .
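The parsing procedure above can be sketched in a few lines. This is an illustrative implementation under a stated assumption, not necessarily the authors' code: the auxiliary sequence is compared against the already traversed prefix that excludes only its own last symbol, so overlapping matches are allowed. The function names are hypothetical.

```python
def lz_parse(s):
    """Split the string s into blocks following the parsing steps above."""
    blocks = []
    start = 0  # index where the current auxiliary sequence begins
    for end in range(1, len(s) + 1):
        aux = s[start:end]
        # Assumption: search the traversed prefix without the last symbol.
        if aux not in s[:end - 1]:
            blocks.append(aux)  # a new, previously unseen block is complete
            start = end         # reset the auxiliary sequence
    if start < len(s):
        blocks.append(s[start:])  # append the unfinished auxiliary sequence
    return blocks


def lz_complexity(s):
    """Number of blocks in the parsed sequence."""
    return len(lz_parse(s))


# Short illustrative string (not the example used in the text).
print(lz_parse("0001101001000101"))
# → ['0', '001', '10', '100', '1000', '101']
```

The substring search makes this sketch quadratic or worse in the sequence length, which is acceptable for short demonstrations but would need a more careful implementation for long sequences.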

The LZ complexity provides a means to quantify the degree of order or disorder in observed symbolic sequences Liu2012 (); Lesne2009 (). Therefore the term “complexity” is somewhat misleading in our context. As pointed out in Ref. Lesne2009 (), a proper normalization allows one to relate the LZ complexity to the entropy rate of the symbolic sequence, i.e.

$$h_{\mathrm{LZ}}[S] = \lim_{N\to\infty} h_{\mathrm{LZ}}[S] = \lim_{N\to\infty} \frac{N_{\mathrm{LZ}}[S]\,\ln(N)}{N}. \tag{8}$$

Note that in Ref. Lesne2009 () two different variants of LZ parsing are considered, here we use the parsing scheme that Ref. refers to as LZ .

The above string parsing scheme might also be used to compute an observable that closely follows the definition of the block-entropy based excess entropy. Therefore, bear in mind that in the definition of the excess entropy Eq. 6, the individual terms involve the entropy-rate estimates for finite block-size , i.e. containing symbol correlations up to length , only. By using the LZ string parsing scheme, similar estimates of the entropy rate, restricted to feature correlations up to some specified length , might be obtained by preprocessing the initial sequence by applying a -block standard random shuffle procedure. Thereby, the initial length- sequence is first split into blocks of length and possibly a remaining block of length . Then, these blocks are brought into random order and merged to form the new, -block shuffled surrogate sequence . This maintains the individual symbol frequencies, destroys all correlations that extend over lengths larger than and yields a particular realization of a -block shuffled surrogate sequence. Note however that the distribution of -blocks, obtained by sliding a window of length over the initial sequence and keeping track of all overlapping subsequences of length (e.g. used to compute -block entropies), is not conserved by this procedure. The case corresponds to the standard random shuffling procedure considered in Ref. JimenezMontano2002b (). Finally, a standard random shuffle based excess entropy (or similar: standard random shuffle based complexity) that utilizes the LZ string parsing scheme might be obtained as

$$C_{s}[S] = \sum_{M=1}^{M_{\max}} \big(h_{\mathrm{LZ}}[S^{(M)}] - h_{\mathrm{LZ}}[S]\big), \tag{9}$$

where indicates a maximal feasible block-size for the shuffling procedure. Albeit the above observable is no direct analog of the excess entropy Eq. 6, it is expected to behave in a similar manner.
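The block shuffling step that generates the surrogate sequences entering Eq. 9 might look as follows. This is a sketch with a hypothetical function name; the random number generator is passed in explicitly so that a surrogate can be reproduced.

```python
import random
from collections import Counter


def block_shuffle(seq, M, rng=None):
    """Return an M-block shuffled surrogate of seq as a list."""
    rng = rng or random.Random()
    # Split into consecutive blocks of length M; the last one may be shorter.
    blocks = [list(seq[i:i + M]) for i in range(0, len(seq), M)]
    rng.shuffle(blocks)  # bring the blocks into random order
    return [symbol for block in blocks for symbol in block]


# Illustrative input only.
seq = [0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1]
surrogate = block_shuffle(seq, 2, random.Random(42))
print(surrogate, Counter(surrogate) == Counter(seq))
```

Symbol frequencies are preserved by construction, whereas correlations reaching across block boundaries are destroyed. The special case of block length one reduces to an ordinary random shuffle of the individual symbols.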

Similarly, the string parsing scheme might be used to compute a shuffling based equivalent of the multi-information Erb2004 () Eq. 7 as

$$I_{s}[S] = h_{\mathrm{LZ}}[S^{(1)}] - h_{\mathrm{LZ}}[S]. \tag{10}$$

Therein, simply represents a surrogate sequence obtained by a -block standard random shuffle JimenezMontano2002b () which maintains the symbol frequencies but destroys correlations on all scales. Hence, Eq. 10 reflects the entropy overestimate observed by going from a representation of the sequence with no correlations at all to its original representation including all correlations.

##### Data-compression tools:

In addition we also consider three commonly used black-box data compression tools, namely zlib zlib_ref (), bz2 bzip2_ref (), and lzma lzma_ref (), in order to compute an algorithmic entropy that is based on the compressibility of the underlying sequence according to

$$h_{\mathrm{alg}}[S] = \frac{\mathrm{length}(\mathrm{compress}(S))}{N}, \tag{11}$$

see Ref. Ebeling1997 (). According to the latter reference, this data-compression-tool based entropy should provide an upper bound on the entropy of the underlying sequences. Note that the shuffling based (approximate) excess entropy and multi-information can also readily be computed using Eq. 11.
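A minimal sketch of Eq. 11 using the `zlib`, `bz2` and `lzma` modules of the Python standard library is given below, together with a shuffling based multi-information in the spirit of Eq. 10. The encoding of one byte per symbol, the measurement of the compressed length in bytes, the default compression settings and all function names are assumptions of this sketch. The text does not specify how the sequences were passed to the compression tools.

```python
import bz2
import lzma
import random
import zlib

COMPRESSORS = {"zlib": zlib.compress, "bz2": bz2.compress, "lzma": lzma.compress}


def h_alg(symbols, tool="zlib"):
    """Compressed length per symbol (Eq. 11), here in bytes per symbol."""
    data = bytes(symbols)  # assumption: each symbol is stored as one byte
    return len(COMPRESSORS[tool](data)) / len(data)


def shuffled_multi_information(symbols, tool="zlib", rng=None):
    """Entropy of a 1-block shuffled surrogate minus that of the original."""
    rng = rng or random.Random()
    surrogate = list(symbols)
    rng.shuffle(surrogate)  # keeps symbol frequencies, destroys correlations
    return h_alg(surrogate, tool) - h_alg(symbols, tool)


# Illustrative input only: a perfectly periodic binary sequence.
regular = [0, 1] * 5000
for tool in COMPRESSORS:
    print(tool, h_alg(regular, tool),
          shuffled_multi_information(regular, tool, random.Random(1)))
```

Because each compressed file carries a format-dependent header, absolute values for short sequences are dominated by this overhead. Only differences and trends should be interpreted.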

## III Results

Subsequently we will address two distinct issues: Firstly, in subsection III.1, we will assess how well the aforementioned string-parsing and data-compression based entropy estimators might be used to characterize the ferromagnet-to-paramagnet transition in the Ising ferromagnet. Therefore we consider the information theoretic observables introduced previously in section II and capitalize on their scaling behavior as a function of the system temperature. Since these estimates are obtained by algorithmic means, we here refer to them as “algorithmic entropy estimates”. Secondly, in subsection III.2, we evaluate how well the algorithmic entropy estimates compare to the more conventional block entropy estimates and discuss further means to improve on the difference between the respective estimates.

### III.1 Using algorithmic entropy estimates to locate the critical point

In the presented subsection we will summarize the results on the issue of how well the phase transition in the Ising FM can be resolved by means of the Lempel-Ziv (LZ) string parsing scheme Lesne2009 (); Schuermann1996 (); Liu2012 (), as well as the zlib zlib_ref (), bz2 bzip2_ref (), and lzma lzma_ref () black-box data compression utilities. Therein, we are not interested in the absolute values of the entropy estimates thus obtained, but merely in the data-curve characteristics as function of the system temperature. The final estimates for the transition points, resulting from the considered observables, and their comparison to the literature value of the critical temperature, i.e. , can be used to assess the value of the algorithmic entropy estimators as easy-to-compute utilities, that might yield valuable information on the structural change as visible in measurements of finite-length symbolic sequences.

#### III.1.1 Lempel-Ziv string parsing scheme

##### Entropy rate:

As pointed out previously, for a given input sequence , consisting of consecutive symbols, the LZ string parsing scheme yields a parsed sequence of length , allowing one to compute the respective entropy rates for sequences of finite length via , see Eq. 8. We applied this estimation procedure to ensembles of length- sequences, accounting for the orientation of a selected spin in a 2D Ising FM at different values of the temperature parameter , considering various values of up to , see Fig. 1(a). Therein, the upper inset of Fig. 1(a) illustrates the change of the average asymptotic entropy rate at with increasing sequence length . For the extrapolation to the asymptotic limit, an empirically motivated fit function of the form

$$h_{\mathrm{LZ},N} = h_{\infty} + \frac{a \log_2(N)}{N^{\gamma}}, \tag{12}$$

motivated in Ref. Schuermann1996 () and also used in Ref. Lesne2009 (), was employed. The main plot of Fig. 1(a) shows the extrapolated asymptotic entropy rates . The asymptotic entropy rates are in accord with intuition: at low , the symbolic sequences exhibit a high degree of order, hence the associated entropy rate is small (vanishing in the limit of perfect order). In contrast, at high , the sequences exhibit maximal randomness, i.e. subsequent spin orientations in the underlying model are uncorrelated, and the entropy rate tends towards . Of pivotal interest is the region close to where nontrivial, long-ranged correlations between the successive orientations of the monitored spin build up. Below it will be of interest to check whether the previously introduced measures of statistical complexity are sensitive to these structural changes and can be used to locate the critical point by means of a finite-size scaling analysis using sequences of different length . The bottom inset of the figure indicates the change of the fit-parameter as function of the temperature. As evident from the figure, in the high-temperature regime above the critical point , it assumes a stationary value . In the low-temperature regime it exhibits an increasing value with decreasing temperature. Overall, the agreement between the asymptotic entropy rate , the entropy rate at and the common block-entropy at at is remarkably good, see Tab. 1. In particular, for all values of considered, seems to be a satisfactory approximation to , since both values agree within errorbars.

##### Excess Entropy:

Fig. 1(b) illustrates the block-shuffling based excess entropy Eq. 9 as function of the system temperature, averaged over a large number of input sequences. In the analysis, we restricted the sum to . As evident from the figure, assumes small values for both, the low- and high- regime. In between, i.e. in the paramagnetic phase close to the critical point , it assumes a peak value thus satisfying the naive “one-hump” criterion Crutchfield2000 (); Feldman1998 (). For an increasing sequence length the peak gets more pronounced and shifts towards . In order to assess whether the position of the peak in the asymptotic limit coincides with the critical point we performed a finite-size scaling analysis.

To accomplish this, we fitted cubic splines to the peak region of the excess entropy data curves to obtain the respective sequence length dependent, thus “effective”, peak positions . In Fig. 1(b), the fit-curves are shown as solid lines (dashed lines are a guide for the eye, only). Errorbars for the peak position are computed using bootstrap resampling of the underlying data practicalGuide2009 (). The asymptotic peak position is then extrapolated using a fit to

$$T_{\mathrm{eff}}(N) = T_{\infty} + a N^{-b}, \tag{13}$$

yielding , , and (reduced chi-square ), see inset of Fig. 1(b), in good agreement with the literature value . Albeit not shown here, we further observe that the fluctuations are peaked directly at .
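The extrapolation via Eq. 13 can be illustrated with a small dependency-free sketch. It uses a swapped-in technique: a grid scan over the exponent combined with ordinary linear least squares for the remaining two parameters, ignoring error bars. The fitting routine actually used for the results above is not stated in the text, and the data below are synthetic, generated from Eq. 13 with made-up parameters solely to check that the fit recovers them.

```python
def fit_power_law(N_vals, T_vals, b_grid):
    """Unweighted least-squares fit of T_eff(N) = T_inf + a * N**(-b)."""
    n = len(N_vals)
    y_mean = sum(T_vals) / n
    best = None
    for b in b_grid:
        # For fixed b the model is linear in x = N**(-b).
        x = [N ** (-b) for N in N_vals]
        x_mean = sum(x) / n
        sxx = sum((xi - x_mean) ** 2 for xi in x)
        sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, T_vals))
        a = sxy / sxx
        T_inf = y_mean - a * x_mean
        rss = sum((T_inf + a * xi - yi) ** 2 for xi, yi in zip(x, T_vals))
        if best is None or rss < best[0]:
            best = (rss, T_inf, a, b)
    return best[1], best[2], best[3]


# Synthetic peak positions (hypothetical parameters, not results of the paper).
N_vals = [1000, 4000, 16000, 64000, 256000]
T_vals = [2.3 + 5.0 * N ** -0.5 for N in N_vals]
b_grid = [k / 100 for k in range(10, 151)]
print(fit_power_law(N_vals, T_vals, b_grid))
```

For real data one would weight each residual by the bootstrap error bar of the corresponding peak position and report the reduced chi-square, as done in the analysis above.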

At this point, bear in mind that we study symbolic sequences that represent the time-series of the orientation of a selected spin, recorded for MC sweeps on a square lattice of finite side length . Analyzing the specific heat newman1999 () of the Ising FM we find an accentuated peak at , indicating the “effective” location of the critical point for the finite system (not shown). Note that this value is slightly larger than the asymptotic critical point . Here, we obtain the interesting result that, by performing a scaling analysis for the excess entropy peak-locations for symbolic sequences of different length (all for the finite system size ), the results seem to extrapolate towards which is in striking agreement with the asymptotic critical point . However, the results are also in reasonable agreement with the effective critical point suggested by the specific heat. Hence, within the precision reached by our current analysis we cannot completely rule out that the results extrapolate towards instead of the asymptotic critical point .

##### Multi-information:

The results for the standard random shuffle based multi-information Eq. 10 are shown in Fig. 1(c).

Here, a finite-size scaling analysis of the effective peak positions, again obtained by fitting cubic splines to the peak region of the data curves, using bootstrap resampling to compute errorbars and Eq. 13 to extrapolate to the asymptotic limit, yields the estimates , and (reduced chi-square ). Similar to the findings for the excess entropy, the estimate of is in good agreement with the known value of , indicating that is highly sensitive to the correlations that emerge close to the critical point.

##### S-measure:

For the purpose of comparing an observable computed for symbolic sequences of finite length to their surrogate counterparts, Ref. JimenezMontano2002b () employed the -measure

$$S[S] = \frac{\big|M_{\mathrm{orig}}[S] - \langle M_{\mathrm{surr}}[S] \rangle\big|}{\mathrm{sDev}(M_{\mathrm{surr}}[S])}. \tag{14}$$

Therein, in order to quantify a significant deviation between both observables, it states the difference between the observable for the original sequence and the average value of the observable for an ensemble of proper surrogates, measured in units of the standard deviation found for the surrogate ensemble. Here, to probe the sensitivity of the $S$-measure to structural changes in the symbolic sequences at different temperatures, we choose as an observable the LZ complexity based estimator for the entropy rate. I.e. for a given sequence we consider and (note that this corresponds to the construction procedure 1 for surrogate sequences in Ref. JimenezMontano2002b ()), using surrogate sequences for averaging obtained by standard random shuffling. In Eq. 14, the average and standard deviation are computed from independent surrogate sequences. Fig. 2 illustrates the $S$-measure, averaged over different (original) sequences for various values of . As evident from the figure, the $S$-measure exhibits an isolated peak close to the critical point. This does not come as a surprise: albeit normalized by a temperature dependent quantity, the numerator effectively matches the random shuffle based multi-information Eq. 10. Here, a finite-size scaling analysis of the system-size dependent peak locations (obtained by fitting cubic splines to the data points in the interval ) yields , , and (reduced chi-square ), in agreement with the above results.
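Eq. 14 translates directly into code once the observable has been evaluated for the original sequence and for each surrogate. In the sketch below the function name is hypothetical, the input numbers are made up, and the use of the sample standard deviation (rather than the population one) is an assumption, since the text does not specify which is meant.

```python
from statistics import mean, stdev


def s_measure(m_orig, m_surr):
    """Deviation of m_orig from the surrogate mean in units of sDev (Eq. 14)."""
    # Assumption: sDev is taken as the sample standard deviation.
    return abs(m_orig - mean(m_surr)) / stdev(m_surr)


# Made-up observable values for one original sequence and three surrogates.
print(s_measure(5.0, [1.0, 2.0, 3.0]))
```

At least two surrogate values are required, and the measure is undefined if all surrogates yield the same observable value.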

#### III.1.2 Common data compression utilities

In the previous subsection we analyzed three different observables that appear to be very sensitive to structural changes in the symbolic sequences of finite length, as the critical point of the underlying model is approached. These were the standard random shuffle based excess entropy (also termed “complexity”), multi-information and the -measure for the LZ parsing based entropy rate. Subsequently, we will restrict our further analysis to the shuffling based multi-information since it is very simple to compute and seems to be able to detect and quantify structural changes in the recorded sequences that might be used to locate the phase transition point of the underlying model with ease. Furthermore it has a clear cut interpretation: it yields the entropy rate difference observed for a symbolic sequence including all symbol correlations and a surrogate sequence featuring the same symbol frequencies without correlations.

As pointed out above, we here consider three commonly used black-box data compression tools, namely zlib zlib_ref (), bz2 bzip2_ref (), and lzma lzma_ref (), in order to compute an algorithmic entropy that is based on the compressibility of the underlying sequence according to Eq. 11, see Ref. Ebeling1997 ().

##### Results obtained using zlib:

The results obtained by implementing the statement in Eq. 11 using the data compression tool zlib zlib_ref () are shown in Fig. 3(a). Therein, by fitting the peaks using cubic splines and extrapolating to the asymptotic limit via Eq. 13 we obtain , , and (reduced chi-square ), slightly overestimating the literature value of the critical point but still within a distance of . For comparison, the results obtained by fitting polynomials of order 8 to the data curves are listed in Tab. 2. Regarding the characteristics of the data curves, note that albeit the peak location decreases monotonically towards a value in decent agreement with , the peak height seems to first increase to a value for . For larger values of , the peak height seems to decrease again.

##### Results obtained using bz2:

The results obtained by implementing the statement in Eq. 11 using the data compression tool bz2 bzip2_ref () are shown in Fig. 3(b). Therein, by fitting the peaks using cubic splines and extrapolating to the asymptotic limit we obtain , , and (reduced chi-square ), in agreement with the literature value of the critical point . Again, for comparison, the results obtained by fitting polynomials of order 8 to the data curves are listed in Tab. 2. As evident from Fig. 3(b), and in contrast to the results obtained using the zlib tool, the peaks of the data curves behave similarly to the multi-information considered in the context of the LZ parsing scheme. I.e., the peak consistently shifts towards and the peak height also increases with increasing (however, we made no attempt to quantify the latter).

##### Results obtained using lzma:

Lastly, the results obtained by implementing the statement in Eq. 11 using the data compression tool lzma lzma_ref () are shown in Fig. 3(c). Therein, by fitting the peaks using splines and extrapolating to the limit we obtain , , and (reduced chi-square ), in decent agreement with the literature value of the critical point . As can be seen from Tab. 2, the results obtained using a fitting procedure by means of polynomials of order 8 compare even better to . Note that here, the peak gets narrower with increasing sequence length , with the peak location shifting towards a value consistent with . However, here the effect already observed for the zlib tool, i.e. a decreasing peak height for increasing , is even more pronounced.

### iii.2 Comparison of algorithmic entropy estimates to block entropies

A further issue we addressed is the question of how well the different entropy estimates (based on the LZ string parsing scheme and the three data compression utilities discussed earlier) compare to more conventional block-entropy estimates, obtained using well established procedures, see Sect. II and Refs. Crutchfield2003 (); Feldman2003 ().

Here, in order to prepare reference values for the entropy rate Eq. 4 using the block-entropy estimator Eq. 1 we considered sequences of length and a maximally feasible block size , i.e. . As noted earlier, for iid sequences one might experience a severe undersampling of -blocks if, at a given alphabet size, is too large or is too short. In particular, a naive upper bound might be obtained from the constraint Lesne2009 (); Schuermann1996 (). Here, using and we checked that has converged properly for all considered temperatures. Albeit this provides only an upper bound on the true entropy rate, it might nevertheless yield a reasonable approximation to the actual entropy of the considered sequences.
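For illustration, a block-entropy based reference estimate can be sketched as follows. The sketch assumes the standard definitions, i.e. the Shannon entropy of the empirical distribution of overlapping blocks and an entropy rate given by the difference of two successive block entropies; whether this coincides in every detail with Eqs. 1 and 4 is an assumption, and the function names are hypothetical.

```python
from collections import Counter
from math import log2


def block_entropy(seq, m):
    """Shannon entropy (in bits) of the empirical distribution of
    overlapping blocks of m consecutive symbols."""
    if m == 0:
        return 0.0
    n_blocks = len(seq) - m + 1
    counts = Counter(tuple(seq[i:i + m]) for i in range(n_blocks))
    return -sum(c / n_blocks * log2(c / n_blocks) for c in counts.values())


def entropy_rate(seq, m):
    """Entropy rate estimate at block size m: H(m) - H(m-1)."""
    return block_entropy(seq, m) - block_entropy(seq, m - 1)


# Usage: an alternating sequence has one bit of single-symbol entropy,
# but a vanishing entropy rate once blocks of two symbols are considered.
alternating = [0, 1] * 500
h1 = block_entropy(alternating, 1)
h_rate = entropy_rate(alternating, 2)
```

The number of distinct blocks grows exponentially with the block size, which is why the block size has to stay small compared to the sequence length to avoid undersampling.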

#### iii.2.1 Lempel-Ziv string parsing scheme

At first we compared the LZ string parsing based entropy rate estimates to those obtained using the block entropy estimator. Thereby we performed a comparison to the entropy rates observed for sequences of finite length and to the asymptotic estimates , resulting from extrapolation via Eq. 12. The results are illustrated in Fig. 4. Therein, the main plot shows the entropy rate and as function of the system temperature for the different estimators. The inset indicates the difference

$$\Delta(T) = h_{\mathrm{LZ}}(T) - h_{\mathrm{BE}}(T) \tag{15}$$

between the respective string parsing based entropy rate and the block entropy estimate. As evident from the figure, the absolute difference is typically smaller than , with the largest deviation close to the critical point . While both estimates compare similarly well to the block-entropy estimate at low temperatures , the extrapolated result is slightly closer to for .
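To make the string parsing based estimate more concrete, the following sketch counts the phrases of a Lempel-Ziv (1976) style exhaustive-history parsing and converts the count into an entropy rate estimate. The parsing variant and the normalization c·log2(N)/N are assumptions here, chosen because they are common in the literature; the scheme defined in Sect. II may differ in detail.

```python
from math import log2


def lz76_phrase_count(s):
    """Number of phrases in an LZ76-style parsing of the string s.
    Each phrase is the shortest substring starting at the current position
    that has not occurred earlier (overlapping copies are allowed)."""
    i, n, count = 0, len(s), 0
    while i < n:
        length = 1
        # Extend the phrase while it can be copied from the preceding text.
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        count += 1
        i += length
    return count


def lz_entropy_rate(s):
    """Entropy rate estimate in bits per symbol: c * log2(N) / N."""
    n = len(s)
    return lz76_phrase_count(s) * log2(n) / n


# Usage: the textbook example parses as 0|001|10|100|1000|101.
example_count = lz76_phrase_count("0001101001000101")
h_periodic = lz_entropy_rate("01" * 500)
```

The simple implementation above runs in roughly quadratic time, which is acceptable for short sequences but would need a suffix-structure based parser for long ones.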

#### iii.2.2 Common data compression utilities

As reported in the previous paragraph and illustrated in Fig. 4, the LZ string parsing based entropy rate compares astonishingly well to those estimates obtained using the block-entropy estimator (even for sequences of finite length ). Now, by considering the algorithmic entropy rate estimator Eq. 11, implemented using the three data compression utilities discussed earlier, we find that the entropy rate estimates for binary sequences of length significantly overestimate the results obtained via the block-entropy estimator, see the data curves for , i.e.  (see discussion below), in Figs. 5(a-c). As mentioned earlier and pointed out in Ref. Ebeling1997 (), the algorithmic entropy rate Eq. 11 provides only an upper bound on the true entropy rate of the underlying process. In order to explore possible routes that might support a more reliable entropy estimate we next study auxiliary sequences of length , consisting of independent and identically distributed (iid) symbols, taken from a more general alphabet of size instead of binary sequences only.

In Fig. 5, we compare the entropy rates obtained via Eq. 11 using the three data compression tools zlib, bz2, and lzma, to the value expected for such iid sequences. As evident from the figure, the respective ratio approaches unity for increasing alphabet size and for not too small. E.g., for and for large , the respective estimates read (zlib), (bz2), (lzma). For the larger alphabet-size the respective estimates read (zlib), (bz2), (lzma). Albeit these results are valid only for iid sequences we might nevertheless expect to find a tighter upper bound on the entropy rate for symbolic sequences with possibly long ranged correlations by simply increasing the alphabet-size .
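The comparison for iid sequences can be reproduced qualitatively with a few lines of code. In this sketch the sequence length, the seed, the uniform symbol distribution and the encoding of one symbol per byte are assumptions, and only zlib is shown; the other tools can be substituted for the compression call.

```python
import random
import zlib
from math import log2


def iid_ratio(alphabet_size, length=20000, seed=0):
    """Ratio of the compression-based entropy rate of a uniform iid sequence
    to its expected value log2(alphabet_size), in bits per symbol."""
    rng = random.Random(seed)
    # One byte per symbol, so alphabet_size must not exceed 256.
    data = bytes(rng.randrange(alphabet_size) for _ in range(length))
    h = 8.0 * len(zlib.compress(data, 9)) / length
    return h / log2(alphabet_size)


# Usage: the overestimation shrinks as the alphabet grows.
ratio_binary = iid_ratio(2)
ratio_bytes = iid_ratio(256)
```

With one byte per symbol, a binary sequence wastes most of each byte, and the compressor has to recover this redundancy; for an alphabet of size 256 the input is already close to incompressible.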

This can be achieved in the following manner: instead of monitoring the orientation of a single spin during the simulation of the Ising FM, we monitor the orientation of a number of, say, spins, located within a neighbor-template of size as introduced in Ref. Robinson2011 () (to analyze the local entropy in a frustrated spin system) and used in Ref. Melchert2013 () to systematically parse a configuration of spins into sequences of length . These can then be interpreted as a binary representation of a particular symbol from an alphabet of size . Following this approach, we show in Figs. 4(a-c) the estimates for the entropy rates obtained by considering neighborhood templates of size , i.e. alphabet-sizes , to construct symbolic sequences of length . So as to be able to compare these values to those obtained by means of the block-entropy based estimates (monitoring the orientation of a single spin), we normalize the plain algorithmic entropy estimates resulting from Eq. 11 using . I.e. we compute
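The mapping from the spins in a template to a symbol might look as follows. This is only a sketch: the function names are hypothetical, the bit order (first spin is the most significant bit) and the convention that an up spin maps to bit 1 are assumptions, and the geometry of the template is left to the caller.

```python
def spins_to_symbol(spins):
    """Interpret a tuple of spins (+1 or -1) as the binary representation of
    an integer symbol; k spins give an alphabet of size 2**k."""
    symbol = 0
    for spin in spins:
        symbol = (symbol << 1) | (1 if spin > 0 else 0)
    return symbol


def symbolic_sequence(template_snapshots):
    """Map a time series of template readings to a sequence of symbols."""
    return [spins_to_symbol(snapshot) for snapshot in template_snapshots]


# Usage: three snapshots of a hypothetical two-spin template.
symbols = symbolic_sequence([(1, 1), (-1, -1), (-1, 1)])
```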

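$$h^{\mathrm{norm}}_{\mathrm{alg}}[S] = \frac{h_{\mathrm{alg}}[S]}{\log_2(|A|)} \equiv \frac{\mathrm{length}(\mathrm{compress}(S))}{\mathrm{length}(\mathrm{compress}(S_{\mathrm{iid}}))}, \tag{16}$$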

where signifies an iid sequence with the same length and alphabet-size as . The insets illustrate the difference between and for an alphabet-size as function of the system temperature. As evident from Figs. 4(a-c), the estimates for increasing alphabet size indeed approach the block-entropy based estimates. Thereby, the difference seems to be smallest at low temperatures and increases towards higher temperatures, fluctuating around a plateau value for . Overall, the lzma estimator seems to perform best, exhibiting for , while the zlib based estimator performs worst, exhibiting for . For comparison, close to the critical point, i.e. at , we find , (zlib), (bz2), (lzma).
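The normalization of Eq. 16 can be sketched as the ratio of two compressed lengths. The sketch below assumes a uniform iid reference sequence generated with a fixed seed, one byte per symbol, and the standard-library `lzma` module as the default compressor; the function name is hypothetical.

```python
import lzma
import random


def normalized_h_alg(symbols, alphabet_size, compress=lzma.compress, seed=0):
    """Compressed length of the sequence divided by the compressed length of
    an iid sequence with the same length and alphabet size."""
    rng = random.Random(seed)
    reference = bytes(rng.randrange(alphabet_size) for _ in range(len(symbols)))
    return len(compress(bytes(symbols))) / len(compress(reference))


# Usage: an ordered sequence over 16 symbols versus a disordered one.
ordered = [i % 16 for i in range(8000)]
rng = random.Random(42)
disordered = [rng.randrange(16) for _ in range(8000)]
h_ordered = normalized_h_alg(ordered, 16)
h_disordered = normalized_h_alg(disordered, 16)
```

Dividing by the compressed length of the reference sequence removes most of the compressor-specific overhead, so that a fully disordered sequence yields a value close to one.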

## Iv Summary

In the presented article we considered information theoretical observables to analyze short symbolic sequences, comprising binary time-series that represent the orientation of a single spin in the Ising ferromagnet, for different system temperatures . The latter were chosen from the interval , enclosing the critical point of the model. Here, our focus was set on the estimation of the entropy rate via (i) a Lempel-Ziv based string-parsing scheme, and (ii) common data compression utilities (in particular: zlib zlib_ref (), bz2 bzip2_ref (), and lzma lzma_ref ()). These approaches require a much smaller computational effort compared to the standard block entropy approach. Furthermore, they can be considered as simple yet useful versions of “algorithmic” entropy calculations, which in principle seek the shortest of all programs generating a given sequence.

In a first analysis we demonstrated that certain standard random shuffle based variants of the excess entropy, multi-information as well as an entropy-rate related -measure might be used to obtain reasonable estimates for the critical point of the underlying model. Albeit we obtained good results for all three observables when considering the LZ string parsing scheme, we restricted our analysis of the common data-compression tools to the multi-information since it was easy to compute by means of black-box data compression utilities. As evident from Tab. 2, the estimated critical temperatures, obtained by an extrapolation of the multi-information peak-location via Eq. 13, compare well to the known critical temperature. As pointed out earlier, we here obtain the interesting result that, by performing a scaling analysis for the multi-information peak-locations for symbolic sequences of different length (all for the finite system size ), the results seem to extrapolate towards values (see Tab. 2) which are in striking agreement with the asymptotic critical point . However, the results are also in reasonable agreement with the effective critical point , indicated by the accentuated peak of the specific heat for the square lattice. Hence, within the precision reached by our current analysis we cannot completely rule out that the results extrapolate towards instead of the asymptotic critical point . Note that an unrelated study, Ref. Vogel2012 (), where a data-compression tool for the recognition of magnetic phases was designed (based on a different algorithmic procedure and using different observables to locate the critical point), found (for ) and (for ) and loosely concludes that the findings extrapolate to the known asymptotic critical point. Also note that conceptually similar analyses, considering block-entropy based observables carried out on two-dimensional configurations of spins obtained from a simulation of the Ising FM, reported in Ref. Feldman2008 (), conclude that the excess entropy is peaked at a temperature in the paramagnetic phase slightly above the true critical temperature. Similar results on the mutual information Crutchfield2003 () for the Ising FM (and more general classical spin models) were recently presented in Ref. Wilms2011 (). Therein, the authors conclude that the mutual information reaches a maximum in the high-temperature paramagnetic phase close to the system parameter (for this corresponds to ). Our new results and analyses, which go beyond the cited literature, are presented in our main result part, Sec. III.

In a second analysis we first prepared benchmark data curves for the asymptotic entropy rate of the symbolic sequences via a block-entropy based approach. Subsequently we compared the results of the various algorithmic entropy estimators to the latter. We found that the LZ string parsing scheme yields entropy rate estimates (both for finite sequence length and extrapolated to the asymptotic limit) that compare surprisingly well to the benchmark data curves. Further, for the data-compression based estimators we discussed an approach that allows one to increase the size of the alphabet from which symbols are drawn by monitoring a neighborhood-template Robinson2011 (); Melchert2013 () instead of a single spin only. This was motivated by the observation that data-compression based estimators strongly overestimate the entropy rates used as a benchmark. Consequently, symbolic sequences obtained by means of the amended approach encode temporal as well as spatial correlations between the orientation of the spins within the chosen neighborhood-template. We found that for an increasing alphabet size, the normalized entropy rates for the sequences get closer to the benchmark estimates, supporting the intuition previously gained by analyzing iid sequences. However, the observed difference between both might be due to the finite dictionary size employed during the data-compression procedures and hence the inability of the data-compression based estimators to take advantage of long ranged correlations in the symbolic sequence. In principle we found that the lzma (zlib) based entropy rate estimator performs best (worst).

###### Acknowledgements.
OM acknowledges financial support from the DFG (Deutsche Forschungsgemeinschaft) under grant HA3169/3-1. The simulations were performed at the HPC Cluster HERO, located at the University of Oldenburg (Germany) and funded by the DFG through its Major Instrumentation Programme (INST 184/108-1 FUGG) and the Ministry of Science and Culture (MWK) of the Lower Saxony State.

## References

• [1] N. Goldenfeld. Lectures On Phase Transitions And The Renormalization Group. Westview Press, Jackson, 1992.
• [2] P. Grassberger. How to measure self-generated complexity. Physica A, 140:319–325, 1986.
• [3] D. P. Feldman and J. P. Crutchfield. Measures of statistical complexity: Why? Phys. Lett. A, 238:244–252, 1998.
• [4] D. P. Feldman, C. S. McTague, and J. P. Crutchfield. The organization of intrinsic computation: Complexity-entropy diagrams and the diversity of natural information processing. CHAOS, 18:043106, 2008.
• [5] J. P. Crutchfield and D. P. Feldman. Regularities unseen, randomness observed: Levels of entropy convergence. CHAOS, 13:25–54, 2003.
• [6] J. P. Crutchfield and K. Wiesner. Simplicity and complexity. Physics World, pages 36–38, February 2010.
• [7] D. V. Arnold. Information-theoretic Analysis of Phase Transitions. Complex Systems, 10:143–155, 1996.
• [8] James P. Crutchfield and David P. Feldman. Statistical complexity of simple one-dimensional spin systems. Phys. Rev. E, 55:R1239–R1242, 1997.
• [9] D. P. Feldman and J. P. Crutchfield. Structural information in two-dimensional patterns: Entropy convergence and excess entropy. Phys. Rev. E, 67:051104, 2003. A summary of this article is available at papercore.org, see http://www.papercore.org/Feldman2003.
• [10] M. D. Robinson, D. P. Feldman, and S. R. McKay. Local entropy and structure in a two-dimensional frustrated system. CHAOS, 21:037114, 2011.
• [11] J. Wilms, M. Troyer, and F. Verstraete. Mutual information in classical spin models. J. Stat. Mech., 2011(10):P10011, 2011.
• [12] O. Melchert and A. K. Hartmann. Information-theoretic approach to ground-state phase transitions for two- and three-dimensional frustrated spin systems. Phys. Rev. E, 87:022107, 2013.
• [13] A. N. Kolmogorov. On tables of random numbers. Sankhya, the Indian Journal of Statistics A, 25:369–376, 1963.
• [14] G. Chaitin. Algorithmic Information Theory. Cambridge University Press, New York, 1987.
• [15] J. Machta. Complexity, parallel computation and statistical physics. Complexity, 11:46–64, 2006.
• [16] W. Ebeling. Prediction and entropy of nonlinear dynamical systems and symbolic sequences with LRO. Physica D, 109:42–52, 1997.
• [17] A. Puglisi, D. Benedetto, E. Caglioti, V. Loreto, and A. Vulpiani. Data compression and learning in time sequence analysis. Physica D, page 92, 2003.
• [18] The zlib utilities comprise a general purpose data compression library. Here, we used zlib version 1.2.3, see http://zlib.net.
• [19] The bzip2 utilities comprise a general purpose data compression library. Here, we used bzip2 version 1.0.3, see http://bzip.org.
• [20] The LZMA utilities comprise a general purpose data compression library, which, in practice, is reported to yield a high compression ratio. Here, we used the python module pylzma 0.4.4 as a wrapper to the lzma software development kit, see https://pypi.python.org/pypi/pylzma.
• [21] C. R. Shalizi and J. P. Crutchfield. Computational Mechanics: Pattern and Prediction, Structure and Simplicity. J. Stat. Phys., 104:817–878, 2001.
• [22] D. Loewenstern, H. Hirsh, P. Yianilos, and M. Noordewier. DNA Sequence Classification Using Compression-Based Induction. Technical Report DIMACS Tech. Rep. 95–04, DIMACS Center – Rutgers University, 1995.
• [23] A. Baronchelli, E. Caglioti, and V. Loreto. Measuring complexity with zippers. Eur. J. Phys., page S69, 2005.
• [24] P. Grassberger. Data Compression and Entropy Estimates by Non-sequential Recursive Pair Substitution. preprint arXiv:physics/0207023v1, 2002.
• [25] D. Benedetto, E. Caglioti, and V. Loreto. Language trees and zipping. Phys. Rev. Lett., 88:048702, 2002.
• [26] D. V. Khmelev and W. J. Teahan. Comment on “language trees and zipping”. Phys. Rev. Lett., 90:089803, 2003.
• [27] J. Goodman. Extended Comment on Language Trees and Zipping. eprint arXiv:cond-mat/0202383, 2002.
• [28] A. Lesne, J.-L. Blanc, and L. Pezard. Entropy estimation of very short symbolic sequences. Phys. Rev. E, 79:046208, 2009.
• [29] L. Liu, D. Li, and F. Bai. A relative Lempel-Ziv complexity: Application to comparing biological sequences. Chem. Phys. Lett., 530:107–112, 2012.
• [30] M. E. J. Newman and G. T. Barkema. Monte Carlo Methods in Statistical Physics. Clarendon Press, Oxford, 1999.
• [31] T. Schürmann and P. Grassberger. Entropy estimation of symbol sequences. CHAOS, 6:414, 1996.
• [32] James P. Crutchfield, David P. Feldman, and Cosma Rohilla Shalizi. Comment I on “Simple measure for complexity”. Phys. Rev. E, 62:2996–2997, 2000.
• [33] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, New York, 2006.
• [34] J. P. Crutchfield and K. Young. Inferring statistical complexity. Phys. Rev. Lett., 63:105–108, 1989.
• [35] I. Erb and N. Ay. Multi-Information in the thermodynamic Limit. J. Stat. Phys., 115:949, 2004.
• [36] M. A Jimenez-Montano, W. Ebeling, T. Pohl, and P. E. Rapp. Entropy and complexity of finite sequences as fluctuating quantities. BioSystems, 64:23–32, 2002.
• [37] A. K. Hartmann. Practical Guide to Computer Simulations. World Scientific, Singapore, 2009.
• [38] E. E. Vogel, G. Saravia, and L. V. Cortez. Data compressor designed to improve recognition of magnetic phases. Physica A, 391:1591–1601, 2012.