Learning from Complex Systems: On the Roles of Entropy and Fisher Information in Pairwise Isotropic Gaussian Markov Random Fields
Markov Random Field models are powerful tools for the study of complex systems. However, little is known about how the interactions between the elements of such systems are encoded, especially from an information-theoretic perspective. In this paper, our goal is to shed light on the connection between Fisher information, Shannon entropy, information geometry and the behavior of complex systems modeled by isotropic pairwise Gaussian Markov random fields. We propose analytical expressions to compute local and global versions of these measures using Besag’s pseudo-likelihood function, characterizing the system’s behavior through its Fisher curve, a parametric trajectory across the information space that provides a geometric representation for the study of complex systems. Computational experiments show how the proposed tools can be useful in extracting relevant information from complex patterns. The obtained results quantify and support our main conclusion: in terms of information, moving towards higher entropy states (A → B) is different from moving towards lower entropy states (B → A), since the Fisher curves are not the same given a natural orientation (the direction of time).
With the increasing value of information in modern society and the massive volume of digital data that is available, there is an urgent need to develop novel methodologies for data filtering and analysis in complex systems. In this scenario, the notion of what is informative or not is a top priority. Sometimes, patterns that at first appear to be locally irrelevant may turn out to be extremely informative from a more global perspective. In complex systems, this is a direct consequence of the intricate non-linear relationship between the pieces of data along different locations and scales.
Within this context, information-theoretic measures play a fundamental role in a huge variety of applications, since they represent statistical knowledge in a systematic, elegant and formal framework. Since the first works of Shannon shannon1949, and later with many other generalizations renyi1961; tsallis1988; bashkirov2006, the concept of entropy has been adapted and successfully applied to almost every field of science, among which we can cite physics jaynes1957, mathematics grad1961; adler1965; goodwyn1972, economics samuelson1972 and, fundamentally, information theory costa1983; cover1991; cover1994. Similarly, the concept of Fisher information Frieden2004; Frieden2006 has been shown to reveal important properties of statistical procedures, from lower bounds on estimation methods Lehmann1983; Bickel1991; Casella2002 to information geometry Amari; Kass1989. Roughly speaking, Fisher information can be thought of as the likelihood analog of entropy, which is a probability-based measure of uncertainty.
In general, classical statistical inference is focused on capturing information about the location and dispersion of unknown parameters of a given family of distributions and studying how this information is related to uncertainty in estimation procedures. In typical situations, an exponential family of distributions and an independence hypothesis (independent random variables) are assumed, giving the likelihood function a series of desirable mathematical properties Lehmann1983; Bickel1991; Casella2002.
Although mathematically convenient for many problems, the independence assumption is not reasonable in complex systems modeling, because much of the information is somehow encoded in the relations between the random variables Anandkumar2009; Villegas. In order to overcome this limitation, Markov Random Field (MRF) models appear as a natural generalization of the classical approach, replacing the independence assumption by a more realistic conditional independence assumption. Basically, in every MRF, knowledge of a finite-support neighborhood around a given variable isolates it from all the remaining variables. A further simplification consists in considering a pairwise interaction model, constraining the size of the maximum clique to be two (in other words, the model captures only binary relationships). Moreover, if the MRF model is isotropic, which means that the parameter controlling the interactions between neighboring variables is invariant to changes in direction, all the information regarding the spatial dependence structure of the system is conveyed by a single parameter, from now on denoted by $\beta$ (or simply, the inverse temperature).
In this paper, we assume an isotropic pairwise Gaussian Markov Random Field (GMRF) model Moura1992; Moura1997, also known as auto-normal model or conditional auto-regressive model Besag1974; Besag1975. The question that motivated this work, and that we try to elucidate here, is: what kind of information is encoded by the parameter $\beta$ in such a model? We want to know how this parameter and, as a consequence, the whole spatial dependence structure of a complex system modeled by a Gaussian Markov random field, is related to both local and global information-theoretic measures, more precisely the observed and expected Fisher information as well as self-information and Shannon entropy.
In searching for answers to our fundamental question, investigations led us to an exact expression for the asymptotic variance of the maximum pseudo-likelihood (MPL) estimator of $\beta$ in an isotropic pairwise GMRF model, suggesting that asymptotic efficiency is not guaranteed. In the context of statistical data analysis, Fisher information plays a central role in providing tools and insights for modeling the interactions between complex systems and their components. The advantage of MRF models over the traditional statistical ones is that MRF’s take into account the dependence between pieces of information as a function of the system’s temperature, which may even vary over time. Briefly speaking, this investigation aims to explore ways to measure and quantify distances between complex systems operating in different thermodynamical conditions. By analyzing and comparing the behavior of local patterns observed throughout the system (defined over a regular 2D lattice), it is possible to measure how informative those patterns are for a given inverse temperature, or simply $\beta$ (which encodes the expected global behavior).
The remainder of the paper is organized as follows: Section 2 discusses a technique for estimation called maximum pseudo-likelihood (MPL) and provides derivations for the observed Fisher information in an isotropic pairwise GMRF model. Intuitive interpretations for the two versions of this measure are discussed. In Section 3 we derive analytical expressions for the computation of the expected Fisher information. In Section 4 an expression for the global entropy in a GMRF model is shown. The results suggest a connection between maximum pseudo-likelihood and minimum entropy criteria in GMRF’s. Section 5 discusses the asymptotic variance of the maximum pseudo-likelihood estimator of $\beta$. In Section 6 the definition of the Fisher curve of a system as a parametric trajectory in the information space is proposed. Section 7 shows the experimental setup. Computational simulations with both Markov Chain Monte Carlo algorithms and real data were conducted, showing the effectiveness of the proposed tools in extracting relevant information from complex systems. Finally, Section 8 presents our conclusions, final remarks and possibilities for future work.
II Fisher Information in Isotropic Pairwise GMRF’s
The remarkable Hammersley-Clifford theorem hammersley1971 states the equivalence between Gibbs Random Fields (GRF) and Markov Random Fields (MRF), which implies that any MRF can be defined either in terms of a global (joint Gibbs distribution) or a local (set of local conditional density functions) model. For our purposes, we will choose the latter representation.
An isotropic pairwise Gaussian Markov random field regarding a local neighborhood system $\eta_i$ defined on a lattice $S = \{s_1, s_2, \ldots, s_n\}$ is completely characterized by a set of $n$ local conditional density functions $p(x_i \mid \eta_i, \vec{\theta})$, given by:
$$p\left(x_i \mid \eta_i, \vec{\theta}\right) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2\sigma^2} \left[ x_i - \mu - \beta \sum_{j \in \eta_i} \left( x_j - \mu \right) \right]^2 \right\} \tag{1}$$
with $\vec{\theta} = (\mu, \sigma^2, \beta)$, where $\mu$ and $\sigma^2$ are the expected value and the variance of the random variables, and $\beta$ is the parameter that controls the interaction between the variables (inverse temperature). Note that, for $\beta = 0$, the model degenerates to the usual Gaussian distribution. From an information geometry perspective Amari; Kass1989, it means that we are constrained to a sub-manifold within the Riemannian manifold of probability distributions, where the natural Riemannian metric (tensor) is given by the Fisher information. It has been shown that the geometric structure of exponential family distributions exhibits constant curvature. However, little is known about information geometry on more general statistical models, such as GMRF’s. For $\beta > 0$, some degree of correlation between the observations is expected, making the interactions grow stronger. Typical choices for $\eta_i$ are the first and second order non-causal neighborhood systems, defined by the sets of 4 and 8 nearest neighbors, respectively.
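As a small illustration of the local model, the sketch below evaluates the conditional density of a single site given its neighbors. It assumes the standard auto-normal form, in which $x_i$ given its neighborhood is Gaussian with mean $\mu + \beta \sum_{j}(x_j - \mu)$ and variance $\sigma^2$; the function and argument names are hypothetical and chosen only for this example.

```python
import math

def local_conditional_density(x_i, neighbors, mu, sigma2, beta):
    """Density of x_i given its neighbors (assumed auto-normal form, illustrative names)."""
    # Conditional mean: global mean plus beta times the summed neighbor deviations.
    cond_mean = mu + beta * sum(x_j - mu for x_j in neighbors)
    resid = x_i - cond_mean
    return math.exp(-resid ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

# With beta = 0 the neighbors are ignored and the ordinary Gaussian density is recovered.
p_independent = local_conditional_density(0.0, [5.0, -3.0, 2.0, 1.0], mu=0.0, sigma2=1.0, beta=0.0)
```

Setting `beta=0.0` makes the result independent of the neighbor values, which matches the remark that the model degenerates to the usual Gaussian distribution in that case.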
II.1 Maximum Pseudo-Likelihood Estimation
Maximum likelihood estimation is intractable in MRF parameter estimation due to the existence of the partition function in the joint Gibbs distribution. An alternative, proposed by Besag Besag1974, is maximum pseudo-likelihood estimation, which is based on the conditional independence principle. The pseudo-likelihood function is defined as the product of the local conditional density functions (LCDF’s) for all the variables of the system, modeled as a random field.
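Let an isotropic pairwise GMRF be defined on a lattice $S = \{s_1, s_2, \ldots, s_n\}$ with a neighborhood system $\eta_i$. Assuming that $X^{(t)} = \{x_1^{(t)}, x_2^{(t)}, \ldots, x_n^{(t)}\}$ denotes the set corresponding to the observations at time $t$, the pseudo-likelihood function of the model is defined by:
$$L\left(\vec{\theta}; X^{(t)}\right) = \prod_{i=1}^{n} p\left(x_i \mid \eta_i, \vec{\theta}\right) \tag{2}$$
Note that the pseudo-likelihood function is a function of the parameters. For better mathematical tractability, it is usual to take the logarithm of $L(\vec{\theta}; X^{(t)})$. Plugging equation (1) into equation (2) and taking the logarithm leads to:
$$\log L\left(\vec{\theta}; X^{(t)}\right) = -\frac{n}{2} \log\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left[ x_i - \mu - \beta \sum_{j \in \eta_i} \left( x_j - \mu \right) \right]^2 \tag{3}$$
By differentiating equation (3) with respect to each parameter and properly solving the pseudo-likelihood equations we obtain the following maximum pseudo-likelihood estimators for the parameters $\beta$, $\mu$ and $\sigma^2$:
$$\hat{\beta}_{MPL} = \frac{\sum_{i=1}^{n} \left[ \left( x_i - \mu \right) \sum_{j \in \eta_i} \left( x_j - \mu \right) \right]}{\sum_{i=1}^{n} \left[ \sum_{j \in \eta_i} \left( x_j - \mu \right) \right]^2} \tag{4}$$
$$\hat{\mu}_{MPL} = \frac{1}{n\left(1 - K\beta\right)} \sum_{i=1}^{n} \left( x_i - \beta \sum_{j \in \eta_i} x_j \right) \tag{5}$$
$$\hat{\sigma}^2_{MPL} = \frac{1}{n} \sum_{i=1}^{n} \left[ x_i - \mu - \beta \sum_{j \in \eta_i} \left( x_j - \mu \right) \right]^2 \tag{6}$$
where $K$ denotes the cardinality of the non-causal neighborhood set $\eta_i$. Note that if $\beta = 0$, the MPL estimators of both $\mu$ and $\sigma^2$ become the widely known sample mean and sample variance.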
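Since the cardinality of the neighborhood system, $K = |\eta_i|$, is spatially invariant (we are assuming a regular neighborhood system) and each variable is dependent on a fixed number of neighbors on a lattice, $\hat{\beta}_{MPL}$ can be rewritten in terms of cross covariances:
$$\hat{\beta}_{MPL} = \frac{\sum_{j \in \eta_i} \hat{\sigma}_{ij}}{\sum_{j \in \eta_i} \sum_{k \in \eta_i} \hat{\sigma}_{jk}} \tag{7}$$
where $\hat{\sigma}_{ij}$ denotes the sample covariance between the central variable $x_i$ and $x_j \in \eta_i$. Similarly, $\hat{\sigma}_{jk}$ denotes the sample covariance between two variables $x_j$ and $x_k$ belonging to the neighborhood system $\eta_i$ (the definition of the neighborhood system $\eta_i$ does not include the location $s_i$).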
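The estimator of the interaction parameter can be sketched numerically as a ratio of two sums over the lattice. The code below is a minimal illustration, assuming a first-order (4 nearest neighbors) system on a 2D array, skipping border sites, and plugging in the sample mean for $\mu$ when none is supplied (a simplification, since the joint MPL solution couples $\mu$ and $\beta$). The helper names are hypothetical.

```python
import numpy as np

def neighbor_sum(field, mu):
    """Sum of (x_j - mu) over the 4 nearest neighbors, for interior sites only."""
    d = field - mu
    return d[:-2, 1:-1] + d[2:, 1:-1] + d[1:-1, :-2] + d[1:-1, 2:]

def beta_mpl(field, mu=None):
    """Pseudo-likelihood estimate of beta (first-order neighborhood, borders skipped)."""
    field = np.asarray(field, dtype=float)
    if mu is None:
        mu = field.mean()                 # plug-in assumption, not the joint MPL solution
    s = neighbor_sum(field, mu)           # summed neighbor deviations per interior site
    y = field[1:-1, 1:-1] - mu            # central deviations per interior site
    return float((y * s).sum() / (s ** 2).sum())

# A 3x3 lattice has a single interior site: y = 5, neighbor sum = 1 + 4 + 2 + 3 = 10.
estimate = beta_mpl([[0, 1, 0], [2, 5, 3], [0, 4, 0]], mu=0.0)   # 50 / 100 = 0.5
```

The ratio is the least-squares slope of the central deviation on the summed neighbor deviations, which is why it can equally be written in terms of sample covariances.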
II.2 Fisher information of spatial dependence parameters
Basically, Fisher information measures the amount of information a sample conveys about an unknown parameter. It can be thought of as the likelihood analog of entropy, which is a probability-based measure of uncertainty. Often, when we are dealing with independent and identically distributed (i.i.d.) random variables, the computation of the global Fisher information present in a random sample is quite straightforward, since each observation $x_i$, $i = 1, 2, \ldots, n$, brings exactly the same amount of information (when we are dealing with independent samples, the superscript $t$ is usually suppressed since the underlying dependence structure does not change through time). However, this is not true for spatial dependence parameters in MRF’s, since different configuration patterns ($x_i \cup \eta_i$) provide distinct contributions to the local observed Fisher information, which can be used to derive a reasonable approximation to the global Fisher information Efron1978.
II.3 The Information Equality
It is widely known from statistical inference theory that information equality holds in the case of independent observations in the exponential family Lehmann1983; Bickel1991; Casella2002. In other words, we can compute the Fisher information of a random sample regarding a parameter of interest $\theta$ by:
$$I\left(\theta; X^{(t)}\right) = E\left[ \left( \frac{\partial}{\partial \theta} \log L\left(\theta; X^{(t)}\right) \right)^2 \right] = -E\left[ \frac{\partial^2}{\partial \theta^2} \log L\left(\theta; X^{(t)}\right) \right] \tag{8}$$
where $L(\theta; X^{(t)})$ denotes the likelihood function at a time instant $t$. In our investigations, to avoid the joint Gibbs distribution, often intractable due to the presence of the partition function (global Gibbs field), we replace the usual likelihood function by Besag’s pseudo-likelihood function and work with the local model instead (local Markov field).
However, given the intrinsic spatial dependence structure of Gaussian Markov random field models, information equilibrium is not a natural condition. As we will discuss later, in general, information equality fails. Thus, in a GMRF model we have to consider two kinds of Fisher information, from now on denoted by type-I (due to the first derivative of the pseudo-likelihood function) and type-II (due to the second derivative of the pseudo-likelihood function). Eventually, when certain conditions are satisfied, these two values of information will converge to a unique bound. Essentially, $\beta$ is the parameter responsible for controlling whether both forms of information converge or diverge. Knowing the role of $\beta$ (inverse temperature) in a GMRF model, it is expected that for $\beta = 0$ (or $T \to \infty$) information equilibrium prevails. In fact, we will see in the following sections that as $\beta$ deviates from zero (and long-term correlations start to emerge), the divergence between the two kinds of information increases.
II.4 Observed Fisher information
In order to quantify the amount of information conveyed by a local configuration pattern in a complex system, the concept of observed Fisher information must be defined.
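Consider a MRF defined on a lattice $S = \{s_1, s_2, \ldots, s_n\}$ with a neighborhood system $\eta_i$. The type-I local observed Fisher information for the observation $x_i$ regarding the spatial dependence parameter $\beta$ is defined in terms of its local conditional density function as:
$$\phi_\beta(x_i) = \left[ \frac{\partial}{\partial \beta} \log p\left(x_i \mid \eta_i, \vec{\theta}\right) \right]^2 \tag{9}$$
Hence, for an isotropic pairwise GMRF model, the type-I local observed Fisher information regarding $\beta$ for the observation $x_i$ is given by:
$$\phi_\beta(x_i) = \frac{1}{\sigma^4} \left\{ \left[ x_i - \mu - \beta \sum_{j \in \eta_i} \left( x_j - \mu \right) \right] \left[ \sum_{j \in \eta_i} \left( x_j - \mu \right) \right] \right\}^2 \tag{10}$$
Consider a MRF defined on a lattice $S = \{s_1, s_2, \ldots, s_n\}$ with a neighborhood system $\eta_i$. The type-II local observed Fisher information for the observation $x_i$ regarding the spatial dependence parameter $\beta$ is defined in terms of its local conditional density function as:
$$\psi_\beta(x_i) = -\frac{\partial^2}{\partial \beta^2} \log p\left(x_i \mid \eta_i, \vec{\theta}\right) \tag{11}$$
In the case of an isotropic pairwise GMRF model, the type-II local observed Fisher information regarding $\beta$ for the observation $x_i$ is given by:
$$\psi_\beta(x_i) = \frac{1}{\sigma^2} \left[ \sum_{j \in \eta_i} \left( x_j - \mu \right) \right]^2 \tag{12}$$
Note that $\psi_\beta(x_i)$ depends neither on $x_i$ nor on $\beta$, only on the configuration of the neighborhood system $\eta_i$.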
Consider a MRF defined on a lattice $S = \{s_1, s_2, \ldots, s_n\}$ with a neighborhood system $\eta_i$. The type-I observed Fisher information regarding the spatial dependence parameter $\beta$ for a given global configuration $X^{(t)}$ is defined as:
An unbiased estimator for this quantity can be obtained by invoking the law of large numbers and approximating equation (13) by a sample average of the type-I local observed Fisher information along the field:
$$\hat{\phi}_\beta\left(X^{(t)}\right) = \frac{1}{n} \sum_{i=1}^{n} \phi_\beta(x_i) \tag{14}$$
Consider a MRF defined on a lattice $S = \{s_1, s_2, \ldots, s_n\}$ with a neighborhood system $\eta_i$. The type-II observed Fisher information regarding the spatial dependence parameter $\beta$ for a given global configuration $X^{(t)}$ is defined as:
Similarly to the previous situation, a reasonable approximation for this quantity is obtained by taking the sample average of the type-II local observed Fisher information along the field:
$$\hat{\psi}_\beta\left(X^{(t)}\right) = \frac{1}{n} \sum_{i=1}^{n} \psi_\beta(x_i) \tag{16}$$
Therefore, we have two local measures, $\phi_\beta(x_i)$ and $\psi_\beta(x_i)$, that can be assigned to every element of a system modeled by an isotropic pairwise GMRF. Besides, two other global measures, $\phi_\beta(X^{(t)})$ and $\psi_\beta(X^{(t)})$, provide the same information but on a larger scale. In the following, we will discuss some interpretations for what is really being measured with the proposed tools.
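To make the local and global observed measures concrete, the sketch below computes both local maps on a 2D lattice and averages them. It assumes the auto-normal conditional density, for which the score with respect to $\beta$ is $\frac{1}{\sigma^2}\left[x_i - \mu - \beta \sum_j (x_j - \mu)\right]\sum_j (x_j - \mu)$ and the negative second derivative is $\frac{1}{\sigma^2}\left[\sum_j (x_j - \mu)\right]^2$. A first-order neighborhood is assumed, border sites are skipped, and the function name is hypothetical.

```python
import numpy as np

def local_fisher_maps(field, mu, sigma2, beta):
    """Type-I and type-II local observed Fisher information at interior sites."""
    d = np.asarray(field, dtype=float) - mu
    # Summed deviations of the 4 nearest neighbors of each interior site.
    s = d[:-2, 1:-1] + d[2:, 1:-1] + d[1:-1, :-2] + d[1:-1, 2:]
    resid = d[1:-1, 1:-1] - beta * s          # x_i - mu - beta * neighbor sum
    phi = (resid * s) ** 2 / sigma2 ** 2      # type-I: squared score
    psi = s ** 2 / sigma2                     # type-II: minus the second derivative
    return phi, psi

field = [[0, 1, 0], [2, 5, 3], [0, 4, 0]]
phi, psi = local_fisher_maps(field, mu=0.0, sigma2=2.0, beta=0.5)
# Global observed versions as sample averages over the lattice.
phi_global, psi_global = float(phi.mean()), float(psi.mean())
```

In this toy lattice the central value equals $\beta$ times the neighbor sum, so the type-I value is zero (the pattern is perfectly aligned with the assumed $\beta$), while the type-II value stays positive because the neighbors are not balanced around the mean.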
II.5 The Role of Fisher information in GMRF models
At this point, a relevant issue is the interpretation of these Fisher information measures in a complex system modeled by an isotropic pairwise GMRF. Roughly speaking, $\phi_\beta(x_i)$ is the quadratic rate of change of the logarithm of the local likelihood function at $x_i$, given a global value of $\beta$. As this global value of $\beta$ determines the expected global behavior (if $\beta$ is large, a high degree of correlation among the observations is expected, and if $\beta$ is close to zero the observations are independent), it is reasonable to admit that configuration patterns showing values of $\phi_\beta(x_i)$ close to zero are more likely to be observed throughout the field, since their likelihood values are high (close to the maximum local likelihood condition). In other words, these patterns are more “aligned” with what is considered to be the expected global behavior and therefore they convey little information about the spatial dependence structure (these samples are not informative since they are expected to exist in a system operating at that particular value of inverse temperature).
Now, let us move on to configuration patterns showing high values of $\phi_\beta(x_i)$. Those samples can be considered landmarks, because they convey a large amount of information about the global spatial dependence structure. Roughly speaking, those points are very informative since they are not expected to exist for that particular value of $\beta$ (which guides the expected global behavior of the system). Therefore, type-I local observed Fisher information minimization in GMRF’s can be a useful tool in producing novel configuration patterns that are more likely to exist given the chosen value of inverse temperature. Basically, $\phi_\beta(x_i)$ tells us how informative a given pattern is for that specific global behavior (represented by a single parameter in an isotropic pairwise GMRF model). In summary, this measure quantifies the degree of agreement between an observation $x_i$ and the configuration defined by its neighborhood system $\eta_i$ for a given $\beta$.
As we will see later in the experiments section, typical informative patterns (those showing high values of $\phi_\beta(x_i)$) in an organized system are located at the boundaries of the regions defining homogeneous areas (since these boundary samples show an unexpected behavior for large $\beta$, namely: there is no strong agreement between $x_i$ and its neighbors).
Let us analyze the type-II local observed Fisher information $\psi_\beta(x_i)$. Informally speaking, this measure can be interpreted as a curvature measure, that is, how curved the local likelihood function is at $x_i$. Thus, patterns showing low values of $\psi_\beta(x_i)$ tend to have a nearly flat local likelihood function. It means that we are dealing with a pattern that could have been observed for a variety of $\beta$ values (a large set of $\beta$ values have approximately the same likelihood). An implication of this fact is that in a system dominated by this kind of pattern (patterns for which $\psi_\beta(x_i)$ is close to zero), small perturbations may cause a sharp change in $\beta$ (and therefore in the expected global behavior). In other words, these patterns are more susceptible to changes since they do not have a “stable” configuration (which raises our uncertainty about the true value of $\beta$).
On the other hand, if the global configuration is mostly composed of patterns exhibiting large values of $\psi_\beta(x_i)$, changes in the global structure are unlikely to happen (uncertainty on $\beta$ is sufficiently small). Basically, $\psi_\beta(x_i)$ measures the degree of agreement or dependence among the observations belonging to the same neighborhood system. If at a given $x_i$, the observations belonging to $\eta_i$ are totally symmetric around the mean value, $\psi_\beta(x_i)$ would be zero. This is reasonable to expect, as in this situation there is no information about the induced spatial dependence structure (it means that there is no contextual information available at this point). Notice that the role of $\psi_\beta(x_i)$ is not the same as that of $\phi_\beta(x_i)$. Actually, these two measures are almost inversely related, since if at $x_i$ the value of $\phi_\beta(x_i)$ is high (it is a landmark or boundary pattern), then $\psi_\beta(x_i)$ is expected to be low (at decision boundaries or edges the uncertainty about $\beta$ is higher, causing $\psi_\beta(x_i)$ to be small). In fact, we will observe this behavior in some computational experiments conducted in later sections of the paper.
It is important to mention that these rather informal arguments define the basis for understanding the meaning of the asymptotic variance of maximum pseudo-likelihood estimators, as we will discuss ahead. In summary, $\psi_\beta(x_i)$ is a measure of how sure or confident we are about the local spatial dependence structure (at a given point $x_i$), since a high average curvature is desired for predicting the system’s global behavior in a reasonable manner (reducing the uncertainty of estimation).
III Expected Fisher Information
In order to avoid the use of approximations in the computation of the global Fisher information in an isotropic pairwise GMRF, in this section we provide exact expressions for $\phi_\beta(X^{(t)})$ and $\psi_\beta(X^{(t)})$ as type-I and type-II expected Fisher information. One advantage of using the expected Fisher information instead of its global observed counterpart is the faster computing time. As we will see, instead of computing a local measure for each observation and then taking the average, both the $\Phi_\beta$ and $\Psi_\beta$ expressions depend only on the covariance matrix of the configuration patterns observed along the random field.
III.1 The Type-I Expected Fisher Information
Recall that the type-I expected Fisher information, from now on denoted by $\Phi_\beta$, is given by:
The type-II expected Fisher information, from now on denoted by $\Psi_\beta$, is given by:
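Hence, the expression for $\Phi_\beta$ is composed of four main terms, each one of them involving a summation of higher-order cross moments. According to Isserlis’ theorem isserlis1918, for zero-mean normally distributed random variables, we can compute higher-order moments in terms of the covariance matrix through the following identity, in which the sum runs over all distinct ways of partitioning the $2N$ variables into pairs:
$$E\left[ x_1 x_2 \cdots x_{2N} \right] = \sum \prod E\left[ x_i x_j \right] \tag{22}$$
For four variables this gives $E[x_1 x_2 x_3 x_4] = E[x_1 x_2]E[x_3 x_4] + E[x_1 x_3]E[x_2 x_4] + E[x_1 x_4]E[x_2 x_3]$.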
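For the fourth-order case used in this derivation, Isserlis’ theorem states that for zero-mean jointly Gaussian variables $E[x_1 x_2 x_3 x_4] = \sigma_{12}\sigma_{34} + \sigma_{13}\sigma_{24} + \sigma_{14}\sigma_{23}$. The short sketch below evaluates this sum of pairings from a covariance matrix; the function name and the example matrix are illustrative only.

```python
def isserlis_fourth_moment(cov, i, j, k, l):
    """E[x_i x_j x_k x_l] for zero-mean jointly Gaussian variables with covariance `cov`."""
    # Sum over the three distinct ways of pairing four indices.
    return cov[i][j] * cov[k][l] + cov[i][k] * cov[j][l] + cov[i][l] * cov[j][k]

cov = [[2.0, 0.5],
       [0.5, 1.0]]
m_0000 = isserlis_fourth_moment(cov, 0, 0, 0, 0)   # 3 * var^2 = 12.0
m_0011 = isserlis_fourth_moment(cov, 0, 0, 1, 1)   # var0 * var1 + 2 * cov^2 = 2.5
```

The first value recovers the familiar Gaussian kurtosis identity $E[x^4] = 3\sigma^4$, a quick sanity check on the pairing rule.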
Then, the first term of (21) is reduced to:
where $\sigma_{ij}$ denotes the covariance between variables $x_i$ and $x_j$ (note that in a MRF we have $\sigma_{ij} = 0$ if $x_j \notin \eta_i$). We now proceed to the expansion of the second main term of (21). Similarly, by applying Isserlis’ identity we have:
The third term of (21) can be rewritten as:
Finally, the fourth term of $\Phi_\beta$ is:
Therefore, by combining expressions (23), (24), (25) and (26) we have the complete expression for $\Phi_\beta$, the type-I expected Fisher information for an isotropic pairwise GMRF model regarding the inverse temperature parameter, as:
However, since we are interested in studying how the spatial correlations change as the system evolves, we need to estimate a value for given a single global state . Hence, to compute from a single static configuration (a photograph of the system at a given moment), we consider in the previous equation, which means, among other things, that (which implies ) and observations belonging to different neighborhoods are independent from each other (since we are dealing with a pairwise interaction Markovian process).
Before proceeding, we would like to clarify some points regarding the estimation of the $\beta$ parameter and the computation of the expected Fisher information in the isotropic pairwise GMRF model. Basically, there are two main possibilities: 1) the parameter is spatially-invariant, which means that we have a unique value for a global configuration of the system (this is our assumption); or 2) the parameter is spatially-variant, which means that we have a set of values, for , each one of them estimated from (we are observing the outcomes of a random pattern along time in a fixed position of the lattice). When we are dealing with the first model ($\beta$ is spatially-invariant), all possible observation patterns (samples) are extracted from the global configuration by a sliding window (with the shape of the neighborhood system) that moves through the lattice at a fixed time instant $t$. In this case, we are interested in studying the spatial correlations, not the temporal ones. In other words, we would like to investigate how the spatial structure of a GMRF model is related to Fisher information (this is exactly the scenario described above, for which ). Our motivation here is to characterize, via information-theoretic measures, the behavior of the system as it evolves from states of minimum entropy to states of maximum entropy (and vice versa) by providing a geometrical tool based on the definition of the Fisher curve, which will be introduced in the following sections.
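Therefore, in our case ( ), equation (27) is simplified to (unifying to express the covariances between the random variables in the neighborhood system):
$$\Phi_\beta = \frac{1}{\sigma^4} \left[ \sigma^2 \sum_{j \in \eta_i} \sum_{k \in \eta_i} \sigma_{jk} + 2 \sum_{j \in \eta_i} \sum_{k \in \eta_i} \sigma_{ij} \sigma_{ik} - 6\beta \sum_{j \in \eta_i} \sum_{k \in \eta_i} \sum_{l \in \eta_i} \sigma_{ij} \sigma_{kl} + 3\beta^2 \sum_{j \in \eta_i} \sum_{k \in \eta_i} \sum_{l \in \eta_i} \sum_{m \in \eta_i} \sigma_{jk} \sigma_{lm} \right] \tag{28}$$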
III.2 The Type-II Expected Fisher Information
Following the same methodology of replacing the likelihood function by the pseudo-likelihood function of the GMRF model, a closed-form expression for $\Psi_\beta$ is developed. Plugging equation (3) into (20) leads us to:
$$\Psi_\beta = \frac{1}{\sigma^2} \sum_{j \in \eta_i} \sum_{k \in \eta_i} \sigma_{jk} \tag{29}$$
Note that unlike $\Phi_\beta$, $\Psi_\beta$ does not depend explicitly on $\beta$ (inverse temperature). As we have seen before, $\Phi_\beta$ is a quadratic function of the spatial dependence parameter.
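In order to simplify the notation and also to make computations easier, the expressions for $\Phi_\beta$ and $\Psi_\beta$ can be rewritten in a matrix-vector form. Let $\Sigma_p$ be the covariance matrix of the random vectors $\vec{p}_i$, obtained by lexicographic ordering of the local configuration patterns $x_i \cup \eta_i$. Thus, considering a neighborhood system $\eta_i$ of size $K$, we have $\Sigma_p$ given by a $(K+1) \times (K+1)$ symmetric matrix (for $K+1$ odd, i.e., $K = 4, 8, 12, 20, 24, \ldots$):
Let $\Sigma_p^-$ be the submatrix of dimensions $K \times K$ obtained by removing the central row and central column of $\Sigma_p$ (the covariances between $x_i$ and each one of its neighbors $x_j$). Then for $K+1$ odd, we have:
$$\Sigma_p^- = \left[ \sigma_{jk} \right]_{j, k \in \eta_i} \tag{30}$$
Thus, $\Sigma_p^-$ is a matrix that stores only the covariances among the neighboring variables. Also, let $\vec{\rho}$ be the vector of dimensions $K \times 1$ formed by all the elements of the central row of $\Sigma_p$, excluding the middle one (which is actually a variance), that is:
$$\vec{\rho} = \left[ \sigma_{ij} \right]_{j \in \eta_i}^T \tag{31}$$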
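The construction of the pattern covariance matrix and its two pieces can be sketched as follows. This is an illustrative example, assuming a second-order neighborhood (8 neighbors, so each pattern is a 3x3 window flattened in lexicographic order with the central variable at index 4) and the default unbiased sample covariance of `numpy.cov`. The function names are hypothetical.

```python
import numpy as np

def pattern_covariance(field):
    """Sample covariance of all 3x3 patterns gathered by a sliding window."""
    field = np.asarray(field, dtype=float)
    rows, cols = field.shape
    # One row per window position, each window flattened in lexicographic order.
    patches = np.array([
        field[r:r + 3, c:c + 3].ravel()
        for r in range(rows - 2)
        for c in range(cols - 2)
    ])
    return np.cov(patches, rowvar=False)       # (K+1) x (K+1), here 9 x 9

def split_center(cov_p):
    """Return (rho, cov_minus): center-neighbor covariances and neighbor-only submatrix."""
    c = cov_p.shape[0] // 2                     # index of the central variable
    keep = [i for i in range(cov_p.shape[0]) if i != c]
    rho = cov_p[c, keep]                        # central row without the variance entry
    cov_minus = cov_p[np.ix_(keep, keep)]       # central row and column removed
    return rho, cov_minus

rng = np.random.default_rng(0)
cov_p = pattern_covariance(rng.normal(size=(12, 12)))
rho, cov_minus = split_center(cov_p)
```

For a neighborhood that is not a full square window (for instance the 4 nearest neighbors), the same idea applies after selecting only the relevant window positions.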
Therefore, we can rewrite equation (28) (for ) using Kronecker products. The following definition provides a fast way to compute $\Phi_\beta$ by exploiting these tensor products.
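Let an isotropic pairwise GMRF be defined on a lattice $S = \{s_1, s_2, \ldots, s_n\}$ with a neighborhood system $\eta_i$ of size $K$ (usual choices for $K$ are even values: 4, 8, 12, 20 or 24). Assuming that $X^{(t)} = \{x_1^{(t)}, x_2^{(t)}, \ldots, x_n^{(t)}\}$ denotes the global configuration of the system at time $t$, and $\vec{\rho}$ and $\Sigma_p^-$ are defined as in equations (31) and (30), the type-I expected Fisher information for this state is:
$$\Phi_\beta = \frac{1}{\sigma^4} \left[ \sigma^2 \left\| \Sigma_p^- \right\|_+ + 2 \left\| \vec{\rho} \vec{\rho}^T \right\|_+ - 6\beta \left\| \vec{\rho}^T \otimes \Sigma_p^- \right\|_+ + 3\beta^2 \left\| \Sigma_p^- \otimes \Sigma_p^- \right\|_+ \right] \tag{32}$$
where $\| A \|_+$ denotes the summation of all the entries of the matrix $A$ (not to be confused with a matrix norm) and $\otimes$ denotes the Kronecker (tensor) product. From an information geometry perspective, the presence of tensor products indicates the intrinsic differential geometry of a manifold in the form of the Riemann curvature tensor Amari. Note that all the necessary information for computing the Fisher information is somehow encoded in the covariance matrix of the local configuration patterns, $\Sigma_p$, as would be expected in the case of Gaussian variables (second-order statistics). The same procedure is applied to the type-II expected Fisher information.
Let an isotropic pairwise GMRF be defined on a lattice $S = \{s_1, s_2, \ldots, s_n\}$ with a neighborhood system $\eta_i$ of size $K$ (usual choices for $K$ are 4, 8, 12, 20 or 24). Assuming that $X^{(t)} = \{x_1^{(t)}, x_2^{(t)}, \ldots, x_n^{(t)}\}$ denotes the global configuration of the system at time $t$ and $\Sigma_p^-$ is defined as in equation (30), the type-II expected Fisher information for this state is given by:
$$\Psi_\beta = \frac{1}{\sigma^2} \left\| \Sigma_p^- \right\|_+ \tag{33}$$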
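The matrix-vector form lends itself to a direct numerical sketch. The code below assumes the four-term expression obtained by applying Isserlis’ theorem to $E\{[(x_i - \mu) - \beta S]^2 S^2\}$, with $S$ the summed neighbor deviations, namely $\Phi_\beta = \frac{1}{\sigma^4}\left[\sigma^2 \|\Sigma_p^-\|_+ + 2\|\vec{\rho}\vec{\rho}^T\|_+ - 6\beta\|\vec{\rho}^T \otimes \Sigma_p^-\|_+ + 3\beta^2\|\Sigma_p^- \otimes \Sigma_p^-\|_+\right]$ and $\Psi_\beta = \frac{1}{\sigma^2}\|\Sigma_p^-\|_+$, where $\|\cdot\|_+$ sums all entries. The function name and the toy inputs are hypothetical.

```python
import numpy as np

def expected_fisher(rho, cov_minus, sigma2, beta):
    """Type-I and type-II expected Fisher information from rho and the neighbor covariances."""
    rho = np.asarray(rho, dtype=float)
    cov_minus = np.asarray(cov_minus, dtype=float)

    def total(m):
        return float(np.sum(m))                 # ||.||_+ : sum of all entries

    t1 = sigma2 * total(cov_minus)
    t2 = 2.0 * total(np.outer(rho, rho))
    t3 = -6.0 * beta * total(np.kron(rho.reshape(1, -1), cov_minus))
    t4 = 3.0 * beta ** 2 * total(np.kron(cov_minus, cov_minus))
    phi = (t1 + t2 + t3 + t4) / sigma2 ** 2
    psi = total(cov_minus) / sigma2
    return phi, psi

# Toy inputs: two neighbors, only the first one correlated with the central variable.
phi, psi = expected_fisher([1.0, 0.0], [[2.0, 0.0], [0.0, 2.0]], sigma2=2.0, beta=0.5)
```

Since the sum of the entries of a Kronecker product equals the product of the sums of its factors, the Kronecker terms could also be computed from `rho.sum()` and `cov_minus.sum()` alone, which avoids building the large intermediate matrices.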
III.3 Information Equilibrium in GMRF models
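From the definition of both $\Phi_\beta$ and $\Psi_\beta$, a natural question that arises is: under what conditions do we have $\Phi_\beta = \Psi_\beta$ in an isotropic pairwise GMRF model? As we can see from equations (32) and (33), the difference between $\Phi_\beta$ and $\Psi_\beta$, from now on denoted by $\Delta_\beta(\vec{\rho}, \Sigma_p^-)$, is simply:
$$\Delta_\beta\left(\vec{\rho}, \Sigma_p^-\right) = \frac{1}{\sigma^4} \left[ 2 \left\| \vec{\rho} \vec{\rho}^T \right\|_+ - 6\beta \left\| \vec{\rho}^T \otimes \Sigma_p^- \right\|_+ + 3\beta^2 \left\| \Sigma_p^- \otimes \Sigma_p^- \right\|_+ \right] \tag{34}$$
Then, intuitively, the condition for information equality is achieved when $\Delta_\beta(\vec{\rho}, \Sigma_p^-) = 0$. As $\Delta_\beta(\vec{\rho}, \Sigma_p^-)$ is a simple quadratic function of the inverse temperature parameter $\beta$, we can easily find that the value $\beta^*$ for which $\Delta_\beta(\vec{\rho}, \Sigma_p^-) = 0$ is:
$$\beta^* = \frac{\left\| \vec{\rho}^T \otimes \Sigma_p^- \right\|_+}{\left\| \Sigma_p^- \otimes \Sigma_p^- \right\|_+} \left( 1 \pm \frac{\sqrt{3}}{3} \right) \tag{35}$$
provided that $\| \vec{\rho}^T \otimes \Sigma_p^- \|_+ \neq 0$ and $\| \Sigma_p^- \otimes \Sigma_p^- \|_+ \neq 0$. Note that if $\| \vec{\rho}^T \otimes \Sigma_p^- \|_+ = 0$, then one solution for the above equation is $\beta^* = 0$. In other words, when $\sigma_{ij} = 0$ for all $j \in \eta_i$ (no correlation between $x_i$ and its neighbors $x_j$), information equilibrium is achieved for $\beta = 0$, which in this case is the maximum pseudo-likelihood estimate of $\beta$, since in this matrix-vector notation $\hat{\beta}_{MPL}$ is given by:
$$\hat{\beta}_{MPL} = \frac{\left\| \vec{\rho} \right\|_+}{\left\| \Sigma_p^- \right\|_+} \tag{36}$$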
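In the isotropic pairwise GMRF model, if $\beta = 0$ then we have $\vec{\rho} = \vec{0}$ and as a consequence $\Phi_\beta = \Psi_\beta$. However, the opposite is not necessarily true, that is, we may observe that $\Phi_\beta = \Psi_\beta$ for a non-zero $\beta$. One example is for $\beta = \hat{\beta}_{MPL} \left( 1 \pm \sqrt{3}/3 \right)$, a solution of $\Delta_\beta(\vec{\rho}, \Sigma_p^-) = 0$.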
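The equilibrium condition can be checked numerically. The sketch below assumes that the difference between the two kinds of information is proportional to $3A^2\beta^2 - 6rA\beta + 2r^2$, where $r$ is the sum of the covariances between the central variable and its neighbors and $A$ is the sum of all covariances among the neighbors. Under that assumption the roots are the pseudo-likelihood ratio $r/A$ scaled by $1 \pm 1/\sqrt{3}$. The names `rho_sum` and `cov_sum` are hypothetical.

```python
import math

def equilibrium_betas(rho_sum, cov_sum):
    """Values of beta at which the type-I and type-II expected information coincide."""
    if cov_sum == 0.0:
        raise ValueError("cov_sum must be non-zero")
    beta_mpl = rho_sum / cov_sum               # ratio form of the MPL estimate
    offset = 1.0 / math.sqrt(3.0)
    return beta_mpl * (1.0 - offset), beta_mpl * (1.0 + offset)

def delta_numerator(beta, rho_sum, cov_sum):
    """Bracketed quadratic whose zeros define information equilibrium."""
    return 3.0 * (cov_sum * beta) ** 2 - 6.0 * rho_sum * cov_sum * beta + 2.0 * rho_sum ** 2

low, high = equilibrium_betas(1.0, 4.0)
residuals = (delta_numerator(low, 1.0, 4.0), delta_numerator(high, 1.0, 4.0))
```

Both roots collapse to zero when `rho_sum` is zero, which corresponds to the uncorrelated case discussed in the text.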
IV Entropy in Isotropic Pairwise GMRF’s
Our definition of entropy follows the same process employed to derive $\Phi_\beta$ and $\Psi_\beta$. Since the entropy of a random variable $x$ is defined as the expected value of its self-information, given by $-\log p(x)$, it can be thought of as a probability-based counterpart to the Fisher information.
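Let an isotropic pairwise GMRF be defined on a lattice $S = \{s_1, s_2, \ldots, s_n\}$ with a neighborhood system $\eta_i$. Assuming that $X^{(t)} = \{x_1^{(t)}, x_2^{(t)}, \ldots, x_n^{(t)}\}$ denotes the global configuration of the system at time $t$, then the entropy $H_\beta$ for this state is given by: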
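One way to sketch such an entropy expression is to take the expected value of $-\log p(x_i \mid \eta_i, \vec{\theta})$ under the auto-normal conditional density. Assuming this definition, and writing $r$ for the sum of the covariances between the central variable and its neighbors and $A$ for the sum of all covariances among the neighbors, the result is $H_\beta = \frac{1}{2}\left[\log(2\pi\sigma^2) + 1\right] - \frac{1}{\sigma^2}\left[\beta r - \frac{\beta^2}{2} A\right]$. The code below is an illustration under these assumptions, with hypothetical names, and is not necessarily the exact form used by the authors.

```python
import math

def gmrf_entropy(rho_sum, cov_sum, sigma2, beta):
    """Expected self-information of the local conditional density (sketch)."""
    # Entropy of an ordinary Gaussian with variance sigma2.
    gaussian_part = 0.5 * (math.log(2.0 * math.pi * sigma2) + 1.0)
    # Correction induced by the spatial dependence parameter beta.
    return gaussian_part - (beta * rho_sum - 0.5 * beta ** 2 * cov_sum) / sigma2

h_independent = gmrf_entropy(1.0, 4.0, sigma2=1.0, beta=0.0)   # plain Gaussian entropy
h_at_ratio = gmrf_entropy(1.0, 4.0, sigma2=1.0, beta=0.25)     # beta = rho_sum / cov_sum
```

Under this sketch the entropy is a convex quadratic in $\beta$ whose minimum sits at `rho_sum / cov_sum`, the ratio form of the pseudo-likelihood estimate, which is consistent with the connection between maximum pseudo-likelihood and minimum entropy mentioned in the outline of the paper.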