Exploration of synergistic and redundant information sharing in static and dynamical Gaussian systems

Adam B. Barrett (adam.barrett@sussex.ac.uk)

Sackler Centre for Consciousness Science and Department of Informatics
University of Sussex, Brighton BN1 9QJ, UK

Department of Clinical Sciences, University of Milan, Milan 20157, Italy
Abstract

To fully characterize the information that two ‘source’ variables carry about a third ‘target’ variable, one must decompose the total information into redundant, unique and synergistic components, i.e. obtain a partial information decomposition (PID). However Shannon’s theory of information does not provide formulae to fully determine these quantities. Several recent studies have begun addressing this. Some possible definitions for PID quantities have been proposed, and some analyses have been carried out on systems composed of discrete variables. Here we present the first in-depth analysis of PIDs on Gaussian systems, both static and dynamical. We show that, for a broad class of Gaussian systems, previously proposed PID formulae imply that: (i) redundancy reduces to the minimum information provided by either source variable, and hence is independent of correlation between sources; (ii) synergy is the extra information contributed by the weaker source when the stronger source is known, and can either increase or decrease with correlation between sources. We find that Gaussian systems frequently exhibit net synergy, i.e. the information carried jointly by both sources is greater than the sum of informations carried by each source individually. Drawing from several explicit examples, we discuss the implications of these findings for measures of information transfer and information-based measures of complexity, both generally and within a neuroscience setting. Importantly, by providing independent formulae for synergy and redundancy applicable to continuous time-series data, we open up a new approach to characterizing and quantifying information sharing amongst complex system variables.

1 Introduction

Shannon’s information theory [1] has provided extremely successful methodology for understanding and quantifying information transfer in systems conceptualized as receiver/transmitter, or stimulus/response [2, 3]. Formulating information as reduction in uncertainty, the theory quantifies the information that one variable $X$ holds about another variable $Y$ as the average reduction in the surprise of the outcome of $Y$ when knowing the outcome of $X$ compared to when not knowing the outcome of $X$. (Surprise is defined by how unlikely an outcome is, and is given by the negative of the logarithm of the probability of the outcome. This quantity is usually referred to as the mutual information since it is symmetric in $X$ and $Y$.) Recently, information theory has become a popular tool for the analysis of so-called complex systems of many variables, for example, for attempting to understand emergence, self-organisation and phase transitions, and to measure complexity [4]. Information theory does not however, in its current form, provide a complete description of the informational relationships between variables in a system composed of three or more variables. The information $I(Z : X, Y)$ that two ‘source’ variables $X$ and $Y$ hold about a third ‘target’ variable $Z$ should decompose into four parts (it is our convenient convention of terminology to refer to variables as ‘sources’ and ‘targets’, with $X$ and $Y$ always being the ‘sources’ that contribute information about the ‘target’ variable $Z$; these terms relate to the status of the variables in the informational quantities that we compute, and should not be considered as describing the dynamical roles played by the variables): (i) $U_X$, the unique information that only $X$ (out of $X$ and $Y$) holds about $Z$; (ii) $U_Y$, the unique information that only $Y$ holds about $Z$; (iii) $R$, the redundant information that both $X$ and $Y$ hold about $Z$; and (iv) $S$, the synergistic information about $Z$ that only arises from knowing both $X$ and $Y$ (see Figure 1). The set of quantities $\{U_X, U_Y, R, S\}$ is called a ‘partial information decomposition’ (PID). Information theory gives us the following set of equations for them:

$$I(Z : X) = R + U_X, \qquad (1)$$
$$I(Z : Y) = R + U_Y, \qquad (2)$$
$$I(Z : X, Y) = R + U_X + U_Y + S. \qquad (3)$$

However, these equations do not uniquely determine the PID. One cannot obtain synergy or redundancy in isolation, but only the ‘net synergy’ or ‘Whole-Minus-Sum’ (WMS) synergy:

$$\mathrm{WMS}(Z : X, Y) \equiv S - R = I(Z : X, Y) - I(Z : X) - I(Z : Y). \qquad (4)$$

An additional ingredient to the theory is required, specifically, a definition that determines one of the four quantities in the PID. A consistent and well-understood approach to PIDs would extend Shannon information theory into a more complete framework for the analysis of information storage and transfer in complex systems.
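
To make the bookkeeping concrete, the sketch below shows how, once any one of the four quantities is fixed (here the redundancy), equations (1)–(3) determine the remaining ones, and how the net synergy (4) requires no such choice. This is an illustrative Python sketch; the function and variable names are not from the paper.

```python
# Minimal bookkeeping for a two-source PID: once any one of the four
# quantities is fixed (here the redundancy R), Eqs. (1)-(3) determine the rest.

def pid_from_redundancy(i_zx, i_zy, i_zxy, redundancy):
    """Return (R, U_X, U_Y, S) given the three mutual informations
    and a chosen redundancy value."""
    unique_x = i_zx - redundancy                        # Eq. (1): I(Z:X) = R + U_X
    unique_y = i_zy - redundancy                        # Eq. (2): I(Z:Y) = R + U_Y
    synergy = i_zxy - redundancy - unique_x - unique_y  # Eq. (3)
    return redundancy, unique_x, unique_y, synergy

def net_synergy(i_zx, i_zy, i_zxy):
    """Whole-Minus-Sum net synergy, Eq. (4); needs no choice of redundancy."""
    return i_zxy - i_zx - i_zy
```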

Figure 1: The general structure of the information that two ‘source’ variables $X$ and $Y$ hold about a third ‘target’ variable $Z$. The ellipses indicate $I(Z:X)$, $I(Z:Y)$ and $I(Z:X,Y)$ as labelled, and the four distinct regions enclosed represent the redundancy $R$, the synergy $S$ and the unique informations $U_X$ and $U_Y$ as labelled.

In addition to the four equations above, the minimal further axioms that a PID of information from two sources should satisfy are: (i) that the four quantities $U_X$, $U_Y$, $R$ and $S$ should always all be greater than or equal to zero; (ii) that redundancy and synergy are symmetric with respect to $X$ and $Y$ [5]–[10]. Interestingly, several distinct PID definitions have been proposed, each arising from a distinct idea about what exactly should constitute redundancy and/or synergy. These previous studies of PIDs have focused on systems composed of discrete variables. Here, by considering PIDs on Gaussian systems, we provide the first study of PIDs that focuses on continuous random variables.

One might naively expect that for sources $X$, $Y$ and target $Z$ being jointly Gaussian, the linear relationship between the variables would imply zero synergy, and hence a trivial PID with the standard information theory equations (1)–(3) determining the redundant and unique information. However, this is not the case; net synergy (4), and hence synergy, can be positive [11, 12]. We begin this study (Section 3) by illustrating the prevalence of jointly Gaussian cases for which net synergy (4) is positive. Of particular note is the fact that there can be positive net synergy when sources are uncorrelated. After this motivation for the study, in Section 4 we introduce three distinct previously proposed PID procedures: (i) that of Williams and Beer [5]; (ii) that of Griffith et al. [6, 9] and Bertschinger et al. [8, 10]; and (iii) that of Harder et al. [7]. In addition to satisfying the minimal axioms above, these PIDs have the further commonality that redundant and unique information depend only on the pair of marginal distributions of each individual source with the target, i.e. those of $(X, Z)$ and $(Y, Z)$, while only the synergy depends on the full joint distribution of all three variables $(X, Y, Z)$. Bertschinger et al. [10] have argued for this property by considering unique information from a game-theoretic viewpoint. Our key result, which we then demonstrate, is that for a jointly Gaussian system with a univariate target and sources of arbitrary dimension, any PID with this property reduces to simply taking redundancy as the minimum of the mutual informations $I(Z:X)$ and $I(Z:Y)$, and letting the other quantities follow from (1)–(3). This common PID, which we call the MMI (minimum mutual information) PID, (i) always assigns the source providing less information about the target as providing zero unique information; (ii) yields redundancy as being independent of the correlation between sources; and (iii) yields synergy as the extra information contributed by the weaker source when the stronger source is known. In Section 5 we proceed to explore partial information in several example dynamical Gaussian systems, examining (i) the behaviour of net synergy, which is independent of any assumptions on the particular choice of PID, and (ii) redundancy and synergy according to the MMI PID. We then discuss implications for the transfer entropy measure of information flow (Section 6), and measures that quantify the complexity of a system via information flow analysis (Section 7). We conclude with a discussion of the shortcomings of, and possible extensions to, existing approaches to PIDs and the measurement of information in complex systems.

This paper provides new tools for exploring information sharing in complex systems that go beyond what standard Shannon information theory can provide. By providing a PID for triplets of Gaussian variables, it enables one, for the first time, to study synergy amongst continuous time-series variables independently of redundancy. In the Discussion we consider possible applications to the study of information sharing amongst brain variables in neuroscience. More generally, there is the possibility of application to complex systems in any realm, e.g. climate science, financial systems and computer networks, amongst others.

2 Notation and preliminaries

Let $\mathbf{X}$ be a continuous random variable of dimension $m$. We denote the probability density function by $p(\mathbf{x})$, the mean by $\bar{\mathbf{x}}$, and the matrix of covariances by $\Sigma(\mathbf{X})$. Let $\mathbf{Y}$ be a second random variable of dimension $n$. We denote the matrix of cross-covariances by $\Sigma(\mathbf{X}, \mathbf{Y})$. We define the ‘partial covariance’ of $\mathbf{X}$ with respect to $\mathbf{Y}$ as

$$\Sigma(\mathbf{X} \mid \mathbf{Y}) \equiv \Sigma(\mathbf{X}) - \Sigma(\mathbf{X}, \mathbf{Y})\, \Sigma(\mathbf{Y})^{-1}\, \Sigma(\mathbf{Y}, \mathbf{X}). \qquad (5)$$

If $\mathbf{X} \oplus \mathbf{Y}$ is multivariate Gaussian (we use the symbol ‘$\oplus$’ to denote vertical concatenation of vectors), then the partial covariance is precisely the covariance matrix of the conditional variable $\mathbf{X} \mid \mathbf{Y} = \mathbf{y}$, for any $\mathbf{y}$:

$$\mathbf{X} \mid \mathbf{Y} = \mathbf{y} \;\sim\; \mathcal{N}\!\left(\boldsymbol{\mu}(\mathbf{y}),\, \Sigma(\mathbf{X} \mid \mathbf{Y})\right), \qquad (6)$$

where $\boldsymbol{\mu}(\mathbf{y}) = \bar{\mathbf{x}} + \Sigma(\mathbf{X}, \mathbf{Y})\, \Sigma(\mathbf{Y})^{-1} (\mathbf{y} - \bar{\mathbf{y}})$.
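
As a concrete illustration of Eq. (5), a minimal Python sketch of the partial covariance (function name and example values are illustrative, not from the paper):

```python
import numpy as np

def partial_covariance(sigma_x, sigma_xy, sigma_y):
    """Sigma(X|Y) = Sigma(X) - Sigma(X,Y) Sigma(Y)^{-1} Sigma(Y,X), Eq. (5).
    For jointly Gaussian X and Y this is the covariance of X conditional on Y."""
    return sigma_x - sigma_xy @ np.linalg.solve(sigma_y, sigma_xy.T)

# Illustrative usage: a 2-dimensional X and a 1-dimensional Y.
sigma_x = np.array([[1.0, 0.3],
                    [0.3, 1.0]])
sigma_xy = np.array([[0.5],
                     [0.2]])
sigma_y = np.array([[1.0]])
print(partial_covariance(sigma_x, sigma_xy, sigma_y))
```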

Entropy characterizes uncertainty, and is defined as

$$H(\mathbf{X}) \equiv -\int p(\mathbf{x}) \log p(\mathbf{x})\, \mathrm{d}\mathbf{x}. \qquad (7)$$

(Note, strictly, Eq. (7) is the differential entropy, since entropy itself is infinite for continuous variables. However, considering continuous variables as continuous limits of discrete variable approximations, entropy differences and hence information remain well-defined in the continuous limit and may be consistently measured using Eq. (7) [2]. Moreover, this equation assumes that $\mathbf{X}$ has a density with respect to the Lebesgue measure $\mathrm{d}\mathbf{x}$; this assumption is upheld whenever we discuss continuous random variables.) The conditional entropy is the expected entropy of $\mathbf{X}$ given $\mathbf{Y}$, i.e.,

$$H(\mathbf{X} \mid \mathbf{Y}) \equiv -\int p(\mathbf{x}, \mathbf{y}) \log p(\mathbf{x} \mid \mathbf{y})\, \mathrm{d}\mathbf{x}\, \mathrm{d}\mathbf{y}. \qquad (8)$$

The mutual information between $\mathbf{X}$ and $\mathbf{Y}$ is the average information, or reduction in uncertainty (entropy), about $\mathbf{X}$, knowing the outcome of $\mathbf{Y}$:

$$I(\mathbf{X} : \mathbf{Y}) \equiv H(\mathbf{X}) - H(\mathbf{X} \mid \mathbf{Y}). \qquad (9)$$

Mutual information can also be written in the useful form

$$I(\mathbf{X} : \mathbf{Y}) = \int p(\mathbf{x}, \mathbf{y}) \log \frac{p(\mathbf{x}, \mathbf{y})}{p(\mathbf{x})\, p(\mathbf{y})}\, \mathrm{d}\mathbf{x}\, \mathrm{d}\mathbf{y}, \qquad (10)$$

from which it follows that mutual information is symmetric in $\mathbf{X}$ and $\mathbf{Y}$ [2]. The joint mutual information that two sources $\mathbf{X}$ and $\mathbf{Y}$ share with a target $\mathbf{Z}$ satisfies a chain rule:

$$I(\mathbf{Z} : \mathbf{X}, \mathbf{Y}) = I(\mathbf{Z} : \mathbf{Y}) + I(\mathbf{Z} : \mathbf{X} \mid \mathbf{Y}), \qquad (11)$$

where the conditional mutual information $I(\mathbf{Z} : \mathbf{X} \mid \mathbf{Y})$ is the expected mutual information between $\mathbf{Z}$ and $\mathbf{X}$ given $\mathbf{Y}$. For $\mathbf{X}$ Gaussian,

$$H(\mathbf{X}) = \frac{1}{2} \log\left[(2 \pi e)^{m} \det \Sigma(\mathbf{X})\right], \qquad (12)$$

and for $\mathbf{X} \oplus \mathbf{Y}$ Gaussian

$$H(\mathbf{X} \mid \mathbf{Y}) = \frac{1}{2} \log\left[(2 \pi e)^{m} \det \Sigma(\mathbf{X} \mid \mathbf{Y})\right], \qquad (13)$$
$$I(\mathbf{X} : \mathbf{Y}) = \frac{1}{2} \log\left[\frac{\det \Sigma(\mathbf{X})}{\det \Sigma(\mathbf{X} \mid \mathbf{Y})}\right]. \qquad (14)$$
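
For reference, Eqs. (12) and (14) translate directly into code; a minimal sketch assuming the covariance blocks are supplied as numpy arrays (all names are illustrative):

```python
import numpy as np

def gaussian_entropy(sigma):
    """Differential entropy of a Gaussian with covariance sigma, Eq. (12), in nats."""
    sigma = np.atleast_2d(sigma)
    m = sigma.shape[0]
    return 0.5 * (m * np.log(2 * np.pi * np.e) + np.linalg.slogdet(sigma)[1])

def gaussian_mutual_information(sigma_x, sigma_xy, sigma_y):
    """I(X:Y) = (1/2) log[ det Sigma(X) / det Sigma(X|Y) ], Eq. (14), in nats."""
    sigma_x, sigma_y = np.atleast_2d(sigma_x), np.atleast_2d(sigma_y)
    sigma_xy = np.atleast_2d(sigma_xy)
    sigma_x_given_y = sigma_x - sigma_xy @ np.linalg.solve(sigma_y, sigma_xy.T)
    return 0.5 * (np.linalg.slogdet(sigma_x)[1] - np.linalg.slogdet(sigma_x_given_y)[1])
```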

For a dynamical variable $\mathbf{X}$ evolving in discrete time, we denote the state at time $t$ by $\mathbf{X}_t$, and the infinite past with respect to time $t$ by $\mathbf{X}_t^{-} = (\mathbf{X}_{t-1}, \mathbf{X}_{t-2}, \ldots)$. The $k$ past states with respect to time $t$ are denoted by $\mathbf{X}_t^{(k)} = (\mathbf{X}_{t-1}, \ldots, \mathbf{X}_{t-k})$.

3 Synergy is prevalent in Gaussian systems

Figure 2: The correlational structure of two example systems of univariate Gaussian variables for which $X$ and $Y$ exhibit positive net synergy with respect to information about $Z$. Variables are shown as circles, and the variables that are correlated are joined by lines. (a) $X$ and $Y$ are uncorrelated and yet show synergy. (b) $Y$ and $Z$ are uncorrelated and yet $Y$ contributes synergistic information about $Z$ in conjunction with $X$. See main text for details.

In this section we demonstrate the prevalence of synergy in jointly Gaussian systems, and hence that the PIDs for such systems are typically non-trivial. We do this by computing the ‘Whole-Minus-Sum’ (WMS) net synergy, i.e. synergy minus redundancy (4). Since the axioms for a PID impose that $R$ and $S$ are greater than or equal to zero, this quantity provides a lower bound on synergy, and in particular a sufficient condition for non-zero synergy is $\mathrm{WMS}(Z : X, Y) > 0$. Some special cases have previously been considered in [11, 12]. Here we consider, for the first time, the most general three-dimensional jointly Gaussian system $(X, Y, Z)$ (here we use normal rather than bold typeface for the random variables since they are one-dimensional). Setting the means and variances of the individual variables to 0 and 1 respectively preserves all mutual informations between the variables, and so without loss this system can be specified with a covariance matrix of the form

$$\Sigma(X, Y, Z) = \begin{pmatrix} 1 & c & a \\ c & 1 & b \\ a & b & 1 \end{pmatrix}, \qquad (15)$$

where $a = \mathrm{Corr}(X, Z)$, $b = \mathrm{Corr}(Y, Z)$ and $c = \mathrm{Corr}(X, Y)$ satisfy $-1 < a, b, c < 1$, and

$$1 + 2abc - a^2 - b^2 - c^2 > 0 \qquad (16)$$

(a covariance matrix must be non-singular and positive definite).

Using (5) and (14), the mutual informations between $Z$ and $X$, $Z$ and $Y$, and $Z$ and $(X, Y)$ are given by

$$I(Z : X) = \frac{1}{2} \log\left[\frac{1}{1 - a^2}\right], \qquad (17)$$
$$I(Z : Y) = \frac{1}{2} \log\left[\frac{1}{1 - b^2}\right], \qquad (18)$$
$$I(Z : X, Y) = \frac{1}{2} \log\left[\frac{1 - c^2}{1 + 2abc - a^2 - b^2 - c^2}\right], \qquad (19)$$

and thus the general formula for the net synergy is

$$\mathrm{WMS}(Z : X, Y) = \frac{1}{2} \log\left[\frac{(1 - a^2)(1 - b^2)(1 - c^2)}{1 + 2abc - a^2 - b^2 - c^2}\right]. \qquad (20)$$

This quantity is often greater than zero. Two specific examples illustrate the prevalence of net synergy in an interesting way. Consider first the case $a = b$ and $c = 0$, i.e. the sources each have the same correlation with the target, but the two sources are uncorrelated [see Figure 2(a)]. Then there is net synergy since

$$\mathrm{WMS}(Z : X, Y) = \frac{1}{2} \log\left[\frac{(1 - a^2)^2}{1 - 2a^2}\right] > 0. \qquad (21)$$

It is remarkable that there can be net synergy when the two sources are not correlated. However, this can be explained by the concave property of the logarithm function. If one instead quantified information as reduction in covariance, the net synergy would be zero in this case. That is, if we were to define $\Delta I(Z : X) \equiv \Sigma(Z) - \Sigma(Z \mid X)$ etc., and $\mathrm{WMS}_\Delta(Z : X, Y) \equiv \Delta I(Z : X, Y) - \Delta I(Z : X) - \Delta I(Z : Y)$, then we would have

$$\mathrm{WMS}_\Delta(Z : X, Y) = \frac{c \left[c\,(a^2 + b^2) - 2ab\right]}{1 - c^2}, \qquad (22)$$

which gives the output of zero whenever the correlation between sources is zero. This is intuitive: the sum of the reductions in covariance of the target given each source individually equals the reduction in covariance of the target given both sources together, for the case of no correlation between sources. There is net synergy in the Shannon information provided by the sources about the target because this quantity is obtained by combining these reductions in covariance non-linearly via the concave logarithm function. This suggests that perhaps $\Delta I$ would actually be a better measure of information for Gaussian variables than Shannon information (although, unlike standard mutual information, $\Delta I$ is not symmetric). Note that Angelini et al. [12] proposed a version of Granger causality (which is a measure of information flow for variables that are at least approximately Gaussian [13]) based on a straightforward difference of variances without the usual logarithm, precisely so that for a linear system the Granger causality from a group of variables equals the sum of Granger causalities from members of the group (see Section 6 for a recap of the concept of Granger causality).
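
The contrast between the two measures can be checked numerically; the sketch below evaluates the expressions (20) and (22) for illustrative parameter values (not values used in the paper’s figures):

```python
import numpy as np

def wms_shannon(a, b, c):
    """Net synergy, Eq. (20), for the unit-variance system of Eq. (15):
    a = Corr(X,Z), b = Corr(Y,Z), c = Corr(X,Y)."""
    det = 1 + 2 * a * b * c - a**2 - b**2 - c**2   # determinant of Eq. (15)
    i_zx = -0.5 * np.log(1 - a**2)                  # Eq. (17)
    i_zy = -0.5 * np.log(1 - b**2)                  # Eq. (18)
    i_zxy = 0.5 * np.log((1 - c**2) / det)          # Eq. (19)
    return i_zxy - i_zx - i_zy                      # Eq. (4)

def wms_variance(a, b, c):
    """Net synergy with information taken as reduction in variance, Eq. (22)."""
    det = 1 + 2 * a * b * c - a**2 - b**2 - c**2
    return (1 - det / (1 - c**2)) - a**2 - b**2

print(wms_shannon(0.5, 0.5, 0.0))   # > 0: net synergy despite uncorrelated sources
print(wms_variance(0.5, 0.5, 0.0))  # = 0: no net synergy in the variance measure
```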

Figure 3: Illustrative examples of net synergy (WMS) and synergy between Gaussian variables. (a) Net synergy in the Shannon information that sources $X$ and $Y$ share about the target $Z$, as a function of the correlation $c$ between $X$ and $Y$, for (black) correlations between $X$ and $Z$ and between $Y$ and $Z$ equal and both positive; (grey) correlations between $X$ and $Z$ and between $Y$ and $Z$ equal in magnitude and opposite in sign. (b) The same as (a) but using information defined as reduction in variance instead of reduction in Shannon entropy. (c) Synergy according to the MMI PID for the same parameters as (a). Here the dashed line shows redundancy according to the MMI PID, which does not depend on the correlation between $X$ and $Y$. (d) Example of net synergy as a function of the correlation $c$ between $X$ and $Y$ for (black) correlations between $X$ and $Z$ and between $Y$ and $Z$ unequal and both positive; (grey) correlations between $X$ and $Z$ and between $Y$ and $Z$ unequal and of opposite sign. (e) The same as (d) but using information defined as reduction in variance instead of reduction in Shannon entropy. (f) Synergy according to the MMI PID for the same parameters as (d). Here the dashed line shows redundancy according to the MMI PID, which does not depend on the correlation between $X$ and $Y$. See text for full details of the parameters. In all panels dotted vertical lines indicate boundaries of the allowed parameter space, at which the measures go to infinity, and horizontal dotted lines indicate zero.

Second, we consider the case $b = 0$, i.e. in which there is no correlation between the target and the second source [see Figure 2(b)]. In this case we have

$$\mathrm{WMS}(Z : X, Y) = \frac{1}{2} \log\left[\frac{(1 - a^2)(1 - c^2)}{1 - a^2 - c^2}\right] \ge 0. \qquad (23)$$

Hence, the two sources $X$ and $Y$ exhibit synergistic information about the target even though $Y$ and $Z$ are uncorrelated, and this is modulated by the correlation $c$ between the sources $X$ and $Y$. Although this is perhaps counter-intuitive from a naive point of view, it can be explained by thinking of $Y$ as providing information about why $X$ has taken the value it has; from this one can narrow down the range of values for $Z$, beyond what was already known about $Z$ just from knowing $X$. Note that in this case there would be net synergy even if one quantified information as reduction in covariance via $\mathrm{WMS}_\Delta$ defined above.

Fig. 3(a,b,d,e) shows more generally how net synergy depends on the correlation between the source variables $X$ and $Y$. For correlations $a$ and $b$ between the two sources and the target being equal and positive, net synergy is a decreasing function of the correlation between the sources, while for $a$ and $b$ being equal in magnitude but opposite in sign net synergy is an increasing function of the correlation between sources [Fig. 3(a)]. Net synergy diverges to infinity as the correlation values approach limits at which the covariance matrix becomes singular. This makes sense because in those limits $Z$ becomes completely determined by $X$ and $Y$. More generally, when $a$ and $b$ are unequal, net synergy is a U-shaped function of the correlation between sources [Fig. 3(d)]. In Fig. 3(b,e) the alternative measure, $\mathrm{WMS}_\Delta$, of net synergy based on information as reduction in variance is plotted. As described above, this measure behaves more elegantly, always taking the value 0 when the correlation between sources is zero. Taken together these plots show that net redundancy (negative net synergy) does not necessarily indicate a high degree of correlation between source variables.

This exploration of net synergy demonstrates that it would be useful to obtain explicit measures of synergy and redundancy for Gaussian variables. As mentioned in the Introduction, several measures have been proposed for discrete variables [5]–[10]. In the next section we will see that, for a broad class of jointly Gaussian systems, these all reduce essentially to redundancy being the minimum of $I(Z : X)$ and $I(Z : Y)$.

4 Partial information decomposition on Gaussian systems

In this section we first revise the definitions of three previously proposed PIDs. We note that all of them have the property that redundant and unique information depend only on the pair of marginal distributions of each individual source with the target, i.e. those of $(X, Z)$ and $(Y, Z)$, while only the synergy depends on the full joint distribution of all three variables $(X, Y, Z)$. Bertschinger et al. [10] have argued for this property by considering unique information from a game-theoretic viewpoint. We then prove our key result, namely that any PID satisfying this property reduces, for a jointly Gaussian system with a univariate target and sources of arbitrary dimension, to simply taking redundancy as the minimum of the mutual informations $I(Z : \mathbf{X})$ and $I(Z : \mathbf{Y})$, and letting the other quantities follow from (1)–(3). We term this PID the MMI (minimum mutual information) PID, and give full formulae for it for the general fully univariate case considered in Section 3. In Section 5 we go on to apply the MMI PID to dynamical Gaussian systems.

4.1 Definitions of previously proposed PIDs

Williams and Beer’s proposed PID uses a definition of redundancy as the minimum information that either source provides about each outcome of the target, averaged over all possible outcomes [5]. This is obtained via a quantity called the specific information. The specific information of outcome $Z = z$ given the random variable $X$ is the average reduction in surprise of the outcome $z$ given knowledge of $X$:

$$I(Z = z : X) = \int p(x \mid z) \left[\log \frac{1}{p(z)} - \log \frac{1}{p(z \mid x)}\right] \mathrm{d}x. \qquad (24)$$

The mutual information $I(Z : X)$ is recovered from the specific information by integrating it over all values of $z$. Redundancy is then the expected value over all $z$ of the minimum specific information that $X$ and $Y$ provide about the outcome $z$:

$$R(Z : X, Y) = \int p(z) \min_{A \in \{X, Y\}} I(Z = z : A)\, \mathrm{d}z. \qquad (25)$$
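
Since this redundancy was originally formulated for discrete variables, a small illustrative sketch of Eqs. (24)–(25) in that setting may be helpful (the XOR example and all names here are illustrative, not taken from the paper):

```python
import numpy as np

def specific_information(p_zxy, source_axis):
    """I(Z=z : A), Eq. (24), for a discrete joint pmf p[z, x, y];
    source_axis = 1 selects A = X, source_axis = 2 selects A = Y."""
    p_z = p_zxy.sum(axis=(1, 2))
    p_za = p_zxy.sum(axis=2) if source_axis == 1 else p_zxy.sum(axis=1)
    p_a = p_za.sum(axis=0)
    spec = np.zeros_like(p_z)
    for z in range(p_za.shape[0]):
        for a in range(p_za.shape[1]):
            if p_za[z, a] > 0:
                p_a_given_z = p_za[z, a] / p_z[z]
                p_z_given_a = p_za[z, a] / p_a[a]
                spec[z] += p_a_given_z * (np.log(1 / p_z[z]) - np.log(1 / p_z_given_a))
    return spec

def williams_beer_redundancy(p_zxy):
    """Redundancy as the expected minimum specific information, Eq. (25)."""
    p_z = p_zxy.sum(axis=(1, 2))
    spec = np.minimum(specific_information(p_zxy, 1), specific_information(p_zxy, 2))
    return float(p_z @ spec)

# Toy check: Z = XOR(X, Y) with independent fair-coin inputs; redundancy is 0.
p = np.zeros((2, 2, 2))
for x in (0, 1):
    for y in (0, 1):
        p[x ^ y, x, y] = 0.25
print(williams_beer_redundancy(p))
```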

Griffith et al. [6, 9] consider synergy to arise from information that is not necessarily present given the marginal distributions of source one and target, $(X, Z)$, and source two and target, $(Y, Z)$. Thus

$$S(Z : X, Y) = I(Z : X, Y) - I_\cup(Z : X, Y), \qquad (26)$$

where

$$I_\cup(Z : X, Y) \equiv \min_{X', Y'} I(Z : X', Y'), \qquad (27)$$

and $X'$, $Y'$ and $Z$ are subject to the constraints $p(x', z) = p(x, z)$ and $p(y', z) = p(y, z)$. The quantity $I_\cup(Z : X, Y)$ is referred to as the ‘union information’ since it constitutes the whole information minus the synergy. Expressed alternatively, $I_\cup(Z : X, Y)$ is the minimum joint information provided about $Z$ by an alternative $X'$ and $Y'$ with the same relations with $Z$ but different relations to each other. Bertschinger et al. [10] independently introduced the identical PID, but starting from the equation

$$U_X = \min_{X', Y'} I(Z : X' \mid Y'), \qquad (28)$$
with the minimum taken over the same constrained set of alternative pairs $(X', Y')$ as in (27).

They then derive (27) via the conditional mutual information chain rule (11) and the basic PID formulae (1) and (3).

Harder, Salge and Polani’s PID [7] defines redundancy via the divergence of the conditional probability distribution for $Z$ given an outcome for $X$ from linear combinations of the conditional probability distributions for $Z$ given an outcome for $Y$. Thus, the following quantity is defined:

$$p^{(x)}_{X \searrow Y}(Z) \equiv \operatorname{arg\,min}_{r \,\in\, \mathcal{C}\left(\{p(Z \mid y)\}\right)} D_{\mathrm{KL}}\!\left[\,p(Z \mid x) \,\big\|\, r\,\right], \qquad (29)$$
in which $\mathcal{C}\left(\{p(Z \mid y)\}\right)$ denotes the set of linear (convex) combinations of the conditional distributions of $Z$ given outcomes of $Y$,

where $D_{\mathrm{KL}}\left[p \,\|\, q\right]$ is the Kullback–Leibler divergence, defined for continuous probability density functions $p$ and $q$ by

$$D_{\mathrm{KL}}\!\left[\,p \,\|\, q\,\right] = \int p(z) \log \frac{p(z)}{q(z)}\, \mathrm{d}z. \qquad (30)$$

Then the projected information $I_Z^\pi(X \searrow Y)$ is defined as:

$$I_Z^\pi(X \searrow Y) = \int p(x) \left\{ D_{\mathrm{KL}}\!\left[\,p(Z \mid x) \,\|\, p(Z)\,\right] - D_{\mathrm{KL}}\!\left[\,p(Z \mid x) \,\|\, p^{(x)}_{X \searrow Y}(Z)\,\right] \right\} \mathrm{d}x, \qquad (31)$$

and the redundancy is given by

$$R(Z : X, Y) = \min\left\{ I_Z^\pi(X \searrow Y),\; I_Z^\pi(Y \searrow X) \right\}. \qquad (32)$$

Thus, broadly, the closer the conditional distribution of $Z$ given $X$ is to the conditional distribution of $Z$ given $Y$, the greater the redundancy.
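
For jointly Gaussian variables the Kullback–Leibler divergence appearing in (30) has a standard closed form, which is useful when evaluating such divergence-based quantities numerically; a minimal sketch (the formula is the standard Gaussian KL divergence, not a construction specific to [7]):

```python
import numpy as np

def gaussian_kl(mu_p, sigma_p, mu_q, sigma_q):
    """D_KL[p || q] between two multivariate Gaussians, in nats."""
    mu_p, mu_q = np.atleast_1d(mu_p), np.atleast_1d(mu_q)
    sigma_p, sigma_q = np.atleast_2d(sigma_p), np.atleast_2d(sigma_q)
    k = mu_p.shape[0]
    diff = mu_q - mu_p
    sigma_q_inv = np.linalg.inv(sigma_q)
    return 0.5 * (np.trace(sigma_q_inv @ sigma_p)
                  + diff @ sigma_q_inv @ diff
                  - k
                  + np.linalg.slogdet(sigma_q)[1]
                  - np.linalg.slogdet(sigma_p)[1])

# Illustrative usage with univariate Gaussians:
print(gaussian_kl(0.0, 1.0, 0.5, 2.0))
```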

4.2 The common PID for Gaussians

While the general definitions of the previously proposed PIDs are quite distinct, one can note that for all of them the redundant and unique informations depend only on the pair of marginal distributions of each individual source with the target, i.e. those of $(\mathbf{X}, Z)$ and $(\mathbf{Y}, Z)$. Here we derive our key result, namely the following. Let $\mathbf{X}$, $\mathbf{Y}$ and $Z$ be jointly multivariate Gaussian, with $Z$ univariate and $\mathbf{X}$ and $\mathbf{Y}$ of arbitrary dimensions $m$ and $n$. Then there is a unique PID of $I(Z : \mathbf{X}, \mathbf{Y})$ such that the redundant and unique informations $R$, $U_{\mathbf{X}}$ and $U_{\mathbf{Y}}$ depend only on the marginal distributions of $(\mathbf{X}, Z)$ and $(\mathbf{Y}, Z)$. The redundancy according to this PID is given by

$$R_{\mathrm{MMI}}(Z : \mathbf{X}, \mathbf{Y}) = \min\left\{ I(Z : \mathbf{X}),\; I(Z : \mathbf{Y}) \right\}. \qquad (33)$$

The other quantities follow from (1)–(3), assigning zero unique information to the source providing least information about the target, and synergy as the extra information contributed by the weaker source when the stronger source is known. We term this common PID the MMI (minimum mutual information) PID. It follows that all of the previously proposed PIDs reduce to the MMI PID for this Gaussian case.
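
Putting the result into practice is simple: the sketch below computes the MMI PID of Eq. (33) and (1)–(3) directly from a joint covariance matrix, allowing the sources to be multivariate while the target is univariate (all names and the example values are illustrative):

```python
import numpy as np

def gaussian_mi(sigma, idx_target, idx_source):
    """I(target : source) from a joint covariance matrix, via Eq. (14)."""
    tt = sigma[np.ix_(idx_target, idx_target)]
    ts = sigma[np.ix_(idx_target, idx_source)]
    ss = sigma[np.ix_(idx_source, idx_source)]
    tt_given_s = tt - ts @ np.linalg.solve(ss, ts.T)
    return 0.5 * (np.linalg.slogdet(tt)[1] - np.linalg.slogdet(tt_given_s)[1])

def mmi_pid(sigma, idx_z, idx_x, idx_y):
    """MMI PID for a jointly Gaussian system: redundancy as in Eq. (33),
    with the remaining quantities following from Eqs. (1)-(3)."""
    i_zx = gaussian_mi(sigma, idx_z, idx_x)
    i_zy = gaussian_mi(sigma, idx_z, idx_y)
    i_zxy = gaussian_mi(sigma, idx_z, idx_x + idx_y)
    redundancy = min(i_zx, i_zy)
    return {"R": redundancy,
            "UX": i_zx - redundancy,
            "UY": i_zy - redundancy,
            "S": i_zxy - i_zx - i_zy + redundancy}

# Illustrative usage with the univariate covariance matrix of Eq. (15),
# variables ordered (X, Y, Z) and example correlations a=0.7, b=0.3, c=0.2:
a, b, c = 0.7, 0.3, 0.2
sigma = np.array([[1, c, a], [c, 1, b], [a, b, 1]], dtype=float)
print(mmi_pid(sigma, idx_z=[2], idx_x=[0], idx_y=[1]))
```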

Proof: We first show that the PID of Griffith et al. [6, 9] (equivalent to that of Bertschinger et al. [10]) reduces to the MMI PID. Without loss, we can rotate and normalise the components of $\mathbf{X}$, $\mathbf{Y}$ and $Z$ such that the general case is specified by the block covariance matrix

$$\Sigma(\mathbf{X}, \mathbf{Y}, Z) = \begin{pmatrix} I_m & \Sigma(\mathbf{X}, \mathbf{Y}) & \mathbf{a} \\ \Sigma(\mathbf{Y}, \mathbf{X}) & I_n & \mathbf{b} \\ \mathbf{a}^{\mathrm{T}} & \mathbf{b}^{\mathrm{T}} & 1 \end{pmatrix}, \qquad (34)$$

where $I_m$ and $I_n$ are respectively the $m$- and $n$-dimensional identity matrices, and $\mathbf{a} \equiv \Sigma(\mathbf{X}, Z)$ and $\mathbf{b} \equiv \Sigma(\mathbf{Y}, Z)$ are column vectors. We can also without loss just consider the case $I(Z : \mathbf{X}) \le I(Z : \mathbf{Y})$. From (5) we have

$$\Sigma(Z \mid \mathbf{X}) = 1 - \mathbf{a}^{\mathrm{T}} \mathbf{a}, \qquad (35)$$
$$\Sigma(Z \mid \mathbf{Y}) = 1 - \mathbf{b}^{\mathrm{T}} \mathbf{b}, \qquad (36)$$

and hence, given our case assumption, $|\mathbf{a}| \le |\mathbf{b}|$. Note then that for any alternative $(\mathbf{X}', \mathbf{Y}', Z)$ subject to $\Sigma(\mathbf{X}', Z) = \mathbf{a}$ and $\Sigma(\mathbf{Y}', Z) = \mathbf{b}$,

$$I(Z : \mathbf{X}', \mathbf{Y}') \ge \max\left\{ I(Z : \mathbf{X}),\; I(Z : \mathbf{Y}) \right\} = I(Z : \mathbf{Y}). \qquad (37)$$

Now the covariance matrix of a candidate $(\mathbf{X}', \mathbf{Y}', Z)$ is given by

$$\Sigma(\mathbf{X}', \mathbf{Y}', Z) = \begin{pmatrix} I_m & C & \mathbf{a} \\ C^{\mathrm{T}} & I_n & \mathbf{b} \\ \mathbf{a}^{\mathrm{T}} & \mathbf{b}^{\mathrm{T}} & 1 \end{pmatrix}, \qquad (38)$$

where $C$ is an $m \times n$ matrix. The residual (partial) covariance of $Z$ given $\mathbf{X}'$ and $\mathbf{Y}'$ can thus be calculated using (5) as

$$\Sigma(Z \mid \mathbf{X}', \mathbf{Y}') = 1 - \begin{pmatrix} \mathbf{a}^{\mathrm{T}} & \mathbf{b}^{\mathrm{T}} \end{pmatrix} \begin{pmatrix} I_m & C \\ C^{\mathrm{T}} & I_n \end{pmatrix}^{-1} \begin{pmatrix} \mathbf{a} \\ \mathbf{b} \end{pmatrix} \qquad (39)$$
$$= 1 - \mathbf{b}^{\mathrm{T}} \mathbf{b} - (\mathbf{a} - C\mathbf{b})^{\mathrm{T}} \left(I_m - C C^{\mathrm{T}}\right)^{-1} (\mathbf{a} - C\mathbf{b}). \qquad (40)$$

It follows from (40) and (36) that if we could find a $C$ that satisfied $\mathbf{a} = C\mathbf{b}$, and for which the corresponding $\Sigma(\mathbf{X}', \mathbf{Y}', Z)$ were a valid covariance matrix, then $\Sigma(Z \mid \mathbf{X}', \mathbf{Y}')$ would reduce to $\Sigma(Z \mid \mathbf{Y})$ and hence we would have $I(Z : \mathbf{X}', \mathbf{Y}') = I(Z : \mathbf{Y})$, and thus we would have

$$I_\cup(Z : \mathbf{X}, \mathbf{Y}) = \max\left\{ I(Z : \mathbf{X}),\; I(Z : \mathbf{Y}) \right\} = I(Z : \mathbf{Y}) \qquad (41)$$

by (37) and the definition (27) of $I_\cup$.

We now demonstrate that there does indeed exist a $C$ satisfying $\mathbf{a} = C\mathbf{b}$ and for which the corresponding $\Sigma(\mathbf{X}', \mathbf{Y}', Z)$ is positive definite and hence a valid covariance matrix. First note that since $|\mathbf{a}| \le |\mathbf{b}|$ there exists a $C$ satisfying $\mathbf{a} = C\mathbf{b}$ for which $|C\mathbf{v}| \le |\mathbf{v}|$ for all $\mathbf{v}$. Suppose we have such a $C$. Then the matrix

(42)

is positive definite: For any ,

(43)
(44)

Since it is also symmetric, it therefore has a Cholesky decomposition:

(45)

where is lower triangular. Hence, from equating blocks (2,2) on each side of this equation, we deduce that there exists a lower triangular matrix satisfying

(46)

We use this to demonstrate that the corresponding is positive definite by constructing the Cholesky decomposition for a rotated version of it. Rotating leads to the candidate covariance matrix becoming

(47)

The Cholesky decomposition would then take the form

(48)

where is a lower triangular matrix, is a scalar and is a vector satisfying

(49)
(50)
(51)

these equations coming respectively from equating blocks (2,2), (2,3) and (3,3) in (47) and (48) (the other block equations are satisfied trivially and don’t constrain , and ). There exists a to satisfy the first equation since by virtue of it being (36) and the original being a valid covariance matrix. The second equation is satisfied by since . And finally, the third equation is then satisfied by , where is that of (46). It follows that the Cholesky decomposition exists, and hence is a valid covariance matrix, and thus (41) holds.

Now, given the definition (26) for the union information and our expression (41) for it we have

$$S(Z : \mathbf{X}, \mathbf{Y}) = I(Z : \mathbf{X}, \mathbf{Y}) - \max\left\{ I(Z : \mathbf{X}),\; I(Z : \mathbf{Y}) \right\}. \qquad (52)$$

Thus by the expression (4) for synergy minus redundancy in terms of mutual information we have

$$R(Z : \mathbf{X}, \mathbf{Y}) = S(Z : \mathbf{X}, \mathbf{Y}) - I(Z : \mathbf{X}, \mathbf{Y}) + I(Z : \mathbf{X}) + I(Z : \mathbf{Y}) \qquad (53)$$
$$= I(Z : \mathbf{X}) + I(Z : \mathbf{Y}) - \max\left\{ I(Z : \mathbf{X}),\; I(Z : \mathbf{Y}) \right\} \qquad (54)$$
$$= \min\left\{ I(Z : \mathbf{X}),\; I(Z : \mathbf{Y}) \right\}, \qquad (55)$$

and hence we have reduced this PID to the MMI PID.

Now to show that this is the only PID for this Gaussian case satisfying the given conditions on the marginals of and we invoke Lemma 3 in Ref. [10]. In Bertschinger et al.’s notation [10], the specific PID that we have been considering is denoted with tildes, while possible alternatives are written without tildes. It follows from (55) that the source that shares the smaller amount of mutual information with the target has zero unique information. But according to the Lemma this provides an upper bound on the unique information provided by that source on alternative PIDs. Thus alternative PIDs give the same zero unique information between this source and the target. But according to the Lemma if the unique informations are the same, then the whole PID is the same. Hence, there is no alternative PID. QED.

Note that this common PID does not extend to the case of a multivariate target. For a target $\mathbf{Z}$ with dimension greater than 1, the vectors $\mathbf{a}$ and $\mathbf{b}$ above are replaced with matrices $A$ and $B$ with more than one column (these being respectively $\Sigma(\mathbf{X}, \mathbf{Z})$ and $\Sigma(\mathbf{Y}, \mathbf{Z})$). Then to satisfy (41) one would need to find a $C$ satisfying $A = CB$, which does not in general exist. We leave consideration of this more general case to future work.

4.3 The MMI PID for the univariate jointly Gaussian case

It is straightforward to write down the MMI PID for the univariate jointly Gaussian case with covariance matrix given by (15). Taking $b^2 \le a^2$, i.e. $I(Z : Y) \le I(Z : X)$, without loss of generality, we have from (17)–(19) and (33):

$$R_{\mathrm{MMI}}(Z : X, Y) = \frac{1}{2} \log\left[\frac{1}{1 - b^2}\right], \qquad (56)$$
$$U_X = \frac{1}{2} \log\left[\frac{1 - b^2}{1 - a^2}\right], \qquad (57)$$
$$U_Y = 0, \qquad (58)$$
$$S_{\mathrm{MMI}}(Z : X, Y) = \frac{1}{2} \log\left[\frac{(1 - a^2)(1 - c^2)}{1 + 2abc - a^2 - b^2 - c^2}\right]. \qquad (59)$$

It can then be shown that $S_{\mathrm{MMI}} \to \infty$ (and also $\mathrm{WMS} \to \infty$) at the singular limits $c = ab \pm \sqrt{(1 - a^2)(1 - b^2)}$, and also that, at $c = b/a$, $S_{\mathrm{MMI}}$ reaches the minimum value of 0. For all in-between values there is positive synergy. It is intuitive that synergy should grow largest as one approaches the singular limit, because in that limit $Z$ is completely determined by $X$ and $Y$. Under this PID, plots of synergy against the correlation between sources take the same shape as plots of net synergy against the correlation between sources, because redundancy is independent of the correlation between sources [Fig. 3(c,f)]. Thus, for equal (same-sign) $a$ and $b$, $S_{\mathrm{MMI}}$ decreases with the correlation between sources; for equal-magnitude but opposite-sign $a$ and $b$, $S_{\mathrm{MMI}}$ increases with the correlation between sources; and for $a$ and $b$ of unequal magnitude, $S_{\mathrm{MMI}}$ has a U-shaped dependence on the correlation between sources.
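
A quick numerical check of the formulae (56)–(59) and of the zero-synergy point $c = b/a$ (the parameter values below are illustrative):

```python
import numpy as np

def mmi_pid_univariate(a, b, c):
    """MMI PID of Eqs. (56)-(59) for the system of Eq. (15), assuming b**2 <= a**2."""
    det = 1 + 2 * a * b * c - a**2 - b**2 - c**2
    redundancy = -0.5 * np.log(1 - b**2)                     # Eq. (56)
    unique_x = 0.5 * np.log((1 - b**2) / (1 - a**2))         # Eq. (57)
    unique_y = 0.0                                           # Eq. (58)
    synergy = 0.5 * np.log((1 - c**2) * (1 - a**2) / det)    # Eq. (59)
    return redundancy, unique_x, unique_y, synergy

a, b = 0.8, 0.4
print(mmi_pid_univariate(a, b, b / a))   # synergy is zero at c = b/a
print(mmi_pid_univariate(a, b, 0.0))     # positive synergy for uncorrelated sources
```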

5 Dynamical systems

In this section we explore synergy and redundancy in some example dynamical Gaussian systems, specifically multivariate autoregressive (MVAR) processes, i.e., discrete-time systems in which the present state is given by a linear combination of past states plus noise. (These are the standard stationary dynamical Gaussian systems; in fact they are the only stationary dynamical Gaussian systems if one assumes that the present state is a continuous function of the past state [14].) Having demonstrated (Section 4) that the MMI PID is valid for multivariate sources, we are able to derive valid expressions for redundancy and synergy in the information that arbitrary-length histories of sources contain about the present state of a target. We also compute the more straightforward net synergy.

Figure 4: Connectivity diagrams for example dynamical systems. Variables are shown as circles, and directed interactions as arrows. The systems are animated as Gaussian MVAR processes of order 1. (a) Example 1. In this system $X$ receives inputs from its own past and from the past of $Y$. There is positive net synergy between the information that the immediate pasts of $X$ and $Y$ provide about the future of $X$, but zero net synergy between the information provided by the infinite pasts of $X$ and $Y$ about the future of $X$. (b) Example 2. In this system there is bidirectional connectivity between $X$ and $Y$. There is zero net synergy between the information provided by the immediate pasts of $X$ and $Y$ about the future of $X$, and negative net synergy (i.e. positive net redundancy) between the information provided by the infinite pasts of $X$ and $Y$ about the future of $X$. (c) Example 3. Here $X$ and $Y$ are sources that influence the future of $Z$. Depending on the correlation between $X$ and $Y$, there can be synergy between the information provided by the pasts of $X$ and $Y$ about the future of $Z$ (independent of the length of history considered).

5.1 Example 1: Synergistic two-variable system

The first example we consider is a two-variable MVAR process consisting of two variables $X$ and $Y$, with $X$ receiving equal inputs from its own past and from the past of $Y$ [see Fig. 4(a)]. The dynamics are given by the following equations:

(60)
(61)

where the $\epsilon$’s are all independent identically distributed Gaussian variables of mean 0 and variance 1. The variables $X$ and $Y$ have a stationary probability distribution as long as the eigenvalues of the connectivity matrix lie inside the unit circle. The information between the immediate pasts of $X$ and $Y$ and the present of $X$ can be computed analytically as follows. First, the stationary covariance matrix $\Sigma$ satisfies

$$\Sigma = A\, \Sigma\, A^{\mathrm{T}} + I, \qquad (62)$$

where $I$ is the two-dimensional identity matrix and $A$ is the connectivity matrix,

(63)

This is obtained by taking the covariance matrix of both sides of (60) and (61). Hence

(64)

The one-lag covariance matrix is given by

$$\Gamma \equiv \mathrm{cov}\!\left[(X_t, Y_t),\, (X_{t-1}, Y_{t-1})\right] = A\, \Sigma. \qquad (65)$$

From these quantities we can obtain the following variances:

(66)
(67)
(68)
(69)

Then from these we can compute the mutual informations between the present of $X$ and the immediate pasts of $X$ and $Y$:

(70)
(71)
(72)

And thus from these we see that there is net synergy between the immediate pasts of $X$ and $Y$ in information about the present of $X$:

(73)
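
The general recipe described above (stationary covariance from Eq. (62), one-lag covariances from Eq. (65), then the mutual informations and the net synergy (4)) can be coded compactly for any bivariate MVAR(1) process. The sketch below uses an illustrative connectivity matrix in which the first variable is driven equally by both pasts; it is not necessarily the specific parameterisation used in Example 1.

```python
import numpy as np

def one_lag_quantities(A):
    """One-lag mutual informations and net synergy for W_t = A W_{t-1} + eps_t,
    eps_t ~ N(0, I), with the target taken as the present of the first component."""
    n = A.shape[0]
    # Stationary covariance: Sigma = A Sigma A^T + I (Eq. 62), via vectorisation.
    sigma = np.linalg.solve(np.eye(n * n) - np.kron(A, A),
                            np.eye(n).flatten()).reshape(n, n)
    gamma = A @ sigma                       # one-lag covariance, Eq. (65)
    var_target = sigma[0, 0]
    cov_target_past = gamma[0, :]           # Cov(X_t, (X_{t-1}, Y_{t-1}))

    def mi(idx):
        c = cov_target_past[idx]
        s = sigma[np.ix_(idx, idx)]
        resid = var_target - c @ np.linalg.solve(s, c)
        return 0.5 * np.log(var_target / resid)

    i_x, i_y, i_xy = mi([0]), mi([1]), mi([0, 1])
    return i_x, i_y, i_xy, i_xy - i_x - i_y   # last entry: net synergy, Eq. (4)

A = np.array([[0.4, 0.4],
              [0.0, 0.0]])   # illustrative: X driven equally by the pasts of X and Y
print(one_lag_quantities(A))
```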

The infinite pasts of $X$ and $Y$ do not however exhibit net synergistic information about the present of $X$. While $I(X_t : X^-_t, Y^-_t) = I(X_t : X_{t-1}, Y_{t-1})$ (the complete model being of order 1) and $I(X_t : Y^-_t) \ge I(X_t : Y_{t-1})$, we have $I(X_t : X^-_t) > I(X_t : X_{t-1})$. This is because the restricted regression of $X$ on the past of $X$ is of infinite order:

(74)

Hence,

(75)

Therefore

(76)
(77)
(78)

and

(79)

Thus the synergy equals the redundancy between the infinite pasts of $X$ and $Y$ in providing information about the present state of $X$.

According to the MMI PID, at infinite lags synergy is the same compared to for one lag, but redundancy is less. We have the following expressions for redundancy and synergy:

(80)
(81)
(82)

5.2 Example 2: An MVAR model with no net synergy

Not all MVAR models exhibit positive net synergy. The following for example (see Fig. 4(b)):

(83)
(84)

where again the $\epsilon$’s are all independent identically distributed random variables of mean 0 and variance 1, and the coupling parameters are chosen such that the process is stationary. A similar calculation to that for Example 1 shows that the one-lag mutual informations satisfy

(85)
(86)
(87)

and thus synergy and redundancy are the same for one-lag mutual information:

(88)
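
These one-lag relations can also be checked empirically from simulated time series, which is how such quantities would typically be estimated in practice for continuous data. The sketch below simulates an illustrative bidirectionally coupled MVAR(1) process (the coefficients are illustrative and not necessarily those of Example 2) and estimates the mutual informations from sample covariances via Eq. (14):

```python
import numpy as np

rng = np.random.default_rng(0)
a_coef, b_coef, T = 0.5, 0.7, 100_000   # illustrative coupling strengths
x, y = np.zeros(T), np.zeros(T)
for t in range(1, T):
    x[t] = a_coef * y[t - 1] + rng.standard_normal()
    y[t] = b_coef * x[t - 1] + rng.standard_normal()

def sample_gaussian_mi(target, sources):
    """Gaussian mutual information estimated from sample covariances, Eq. (14)."""
    cov = np.cov(np.column_stack([target] + sources), rowvar=False)
    resid = cov[0, 0] - cov[0, 1:] @ np.linalg.solve(cov[1:, 1:], cov[1:, 0])
    return 0.5 * np.log(cov[0, 0] / resid)

xt, xp, yp = x[1:], x[:-1], y[:-1]
i_x = sample_gaussian_mi(xt, [xp])        # I(X_t : X_{t-1}), close to zero here
i_y = sample_gaussian_mi(xt, [yp])        # I(X_t : Y_{t-1})
i_xy = sample_gaussian_mi(xt, [xp, yp])   # I(X_t : X_{t-1}, Y_{t-1})
print(i_x, i_y, i_xy, i_xy - i_x - i_y)   # net synergy close to zero at one lag
```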

For infinite lags one has:

(89)
(90)
(91)

and thus

(92)

so there is greater redundancy than synergy.

For the MMI decomposition we have for 1-lag

(93)

while for infinite lags

(94)
(95)

It is intuitive that for this example there should be zero synergy. All the information contributed by the past of $X$ to the present of $X$ is mediated via the interaction with $Y$, so no extra information about the present of $X$ is gained from knowing the past of $X$ given knowledge of the past of $Y$.

It is interesting to note that for both this example and Example 1 above,

$$\mathrm{WMS}(X_t : X^-_t, Y^-_t) \le \mathrm{WMS}(X_t : X_{t-1}, Y_{t-1}). \qquad (96)$$

That is, there is less synergy relative to redundancy when one considers information from the infinite past than when one considers information from the immediate past of the system. This can be understood as follows. The complete MVAR model is of order 1 in each example (that is, the current state of the system depends only on the immediate past), so