
# Measuring Integrated Information: Comparison of Candidate Measures in Theory and Simulation

## Abstract

Integrated Information Theory (IIT) is a prominent theory of consciousness that has at its centre measures that quantify the extent to which a system generates more information than the sum of its parts. While several candidate measures of integrated information (‘Φ’) now exist, little is known about how they compare, especially in terms of their behaviour on non-trivial network models. In this article we provide clear and intuitive descriptions of six distinct candidate measures. We then explore the properties of each of these measures in simulation on networks consisting of eight interacting nodes, animated with Gaussian linear autoregressive dynamics. We find a striking diversity in the behaviour of these measures – no two measures show consistent agreement across all analyses. Further, only a subset of the measures appear to genuinely reflect some form of dynamical complexity, in the sense of simultaneous segregation and integration between system components. Our results help guide the operationalisation of IIT and advance the development of measures of integrated information that may have more general applicability.

## 1 Introduction

Since the seminal work of Tononi, Sporns and Edelman [45], and more recently, of Balduzzi and Tononi [5], there have been many valuable contributions in neuroscience towards understanding and quantifying the dynamical complexity of a wide variety of systems. A system is said to be dynamically complex if it shows a balance between two competing tendencies, namely

• integration, i.e. the system behaves as one; and

• segregation, i.e. the parts of the system behave independently.

The notion of dynamical complexity has also been variously described as a balance between order and disorder, or between chaos and synchrony, and has been related to criticality and metastability [31]. Many quantitative measures of dynamical complexity have been proposed, but a theoretically-principled, one-size-fits-all measure remains elusive.

A prominent framework highlighting the extent of simultaneous integration and segregation is Integrated Information Theory (IIT), which studies dynamical complexity from information-theoretic principles. Measures of integrated information attempt to quantify the extent to which the whole system generates more information than the ‘sum of its parts’. The information to be quantified is typically the information that the current state contains about a past state (for information integrated over a time window τ, the past state considered is that at time τ before the present). The partitioning is done such that one considers the parts with the weakest links between them; in other words, the partition across which integrated information is computed is the ‘minimum information partition’. There are many ways one can operationalise this concept of integrated information. Consequently, there now exists a range of distinct integrated information measures.

Proponents of IIT claim that measures of integrated information potentially relate to the quantity of consciousness generated by any physical system [34]. This is however controversial, and empirical evidence of a relationship between any particular measure of integrated information and consciousness remains scarce [15]. Here, we do not focus on the connections of IIT to consciousness, although we do comment on the application of IIT to neural data (see Discussion). We instead consider measures of integrated information more generally as useful operationalisations of notions of dynamical complexity.

We have two goals. First, to provide a unified source of explanation of the principles and practicalities of the various candidate measures of integrated information. Second, to examine the behaviour of candidate measures on non-trivial network models, in order to shed light on their comparative practical utility.

In a recent related paper, Tegmark [41] developed a theoretical taxonomy of all integrated information measures that can be written as a distance between a probability distribution pertaining to the whole and that obtained from the product of probability distributions pertaining to the parts. Here we review in detail five distinct and prominent proposed measures of integrated information, including two that were not covered in Tegmark’s taxonomy. These are: whole-minus-sum integrated information, Φ [5]; integrated stochastic interaction, ~Φ [11]; integrated synergy, ψ [19]; decoder-based integrated information, Φ∗ [35]; and geometric integrated information, ΦG [37]. We also consider, for comparison, the measure causal density (CD) [39], which can be considered as the sum of independent information transfers in the system (without reference to a minimum information partition). This measure has previously been discussed in conjunction with integrated information measures [40, 39].

All of the measures have the potential to behave in ways that are not obvious a priori, and that are difficult to characterise analytically. While simulations of some of the measures (Φ, ~Φ and CD) on networks have been performed [11, 39], others have not previously been computed on any model consisting of more than two components. This paper provides a comparison of the full suite of measures on non-trivial network models. We consider eight-node networks with a range of different architectures, animated with basic noisy vector autoregressive dynamics. We examine how network topology, coupling strength and the correlation of noise inputs affect each measure. We also plot the relation between each measure and the global correlation (a simple dynamical control). Based on these comparisons we discuss the extent to which each measure appears genuinely to capture the co-existence of integration and segregation that is central to the concepts of dynamical complexity and integrated information.

After covering the necessary preliminaries in Section 2, Section 3 sets out the intuition behind the measures and summarises the mathematics behind the definition of each measure. In Section 4 we present the simulations, and in Section 5 we discuss the results. In Appendix A.1 we derive new formulae for computing the decoder-based integrated information Φ∗ for Gaussian systems, correcting the previous formulae in Ref. [35]. The other Appendices contain further derivations of mathematical properties of the measures.

## 2 Notation, convention and preliminaries

In this section we review the fundamental concepts needed to define and discuss the candidate measures of integrated information. In general, we denote random variables with uppercase letters (e.g. X, Y) and particular instantiations with the corresponding lowercase letters (e.g. x, y). Variables can be either continuous or discrete: we assume that a continuous variable can take any value in ℝ, and that a discrete variable takes values in a finite alphabet. Whenever a sum involves a discrete variable, the sum runs over all possible values of that variable (i.e. its whole alphabet). A partition P = {M1, …, Mr} divides the elements of a system X into r non-overlapping, non-empty sub-systems (or parts), such that the union of all parts recovers X and no two parts share an element. We denote each variable in X as Xi, and the total number of variables in X as n. When dealing with time series, time is indexed with a subscript, e.g. Xt.

Entropy quantifies the uncertainty associated with a random variable X – i.e. the higher H(X), the harder it is to make predictions about X – and is defined as

$$H(X) := -\sum_x p(x)\log p(x). \tag{1}$$

In many scenarios, a discrete set of states is insufficient to represent a process or time series. This is the case, for example, with brain recordings, which come as real-valued time series with no a priori discretisation scheme. In these cases, for a continuous variable X with density p we can similarly define the differential entropy,

$$H[p] := -\int p(x)\log p(x)\,dx. \tag{2}$$

However, differential entropy is not as interpretable and well-behaved as its discrete-variable counterpart. For example, differential entropy is not invariant to rescaling or other transformations of X. Moreover, it is only defined if X has a density with respect to the Lebesgue measure; this assumption will be upheld throughout this paper. We can also define the conditional and joint entropies as

$$H(X|Y) := \sum_y p(y)\,H(X|Y=y) = -\sum_y p(y)\sum_x p(x|y)\log p(x|y), \tag{3}$$
$$H(X,Y) := -\sum_{x,y} p(x,y)\log p(x,y), \tag{4}$$

respectively. Conditional and joint entropies can be analogously defined for continuous variables by appropriately replacing sums with integrals.

The Kullback-Leibler (KL) divergence quantifies the dissimilarity between two probability distributions p and q:

$$D_{\mathrm{KL}}(p\,\|\,q) := \sum_x p(x)\log\frac{p(x)}{q(x)}. \tag{5}$$

The KL divergence represents a notion of (non-symmetric) distance between two probability distributions. It plays an important role in information geometry, which deals with the geometric structure of manifolds of probability distributions.

Finally, mutual information quantifies the interdependence between two random variables X and Y. It is the KL divergence between the full joint distribution and the product of marginals, but it can also be expressed as the average reduction in uncertainty about X when Y is given:

$$I(X;Y) := D_{\mathrm{KL}}\big(p(X,Y)\,\|\,p(X)p(Y)\big) = H(X)+H(Y)-H(X,Y) = H(X)-H(X|Y). \tag{6}$$

Mutual information is symmetric in its two arguments X and Y. We make use of the following properties of mutual information:

1. I(X;Y) ≥ 0, with equality if and only if X and Y are statistically independent;

2. I(X;X) = H(X); and

3. I(f(X); g(Y)) = I(X;Y) for any injective functions f, g.

We highlight one further consequence of these properties: I(X;Y) is upper-bounded by the entropy of both X and Y. This means that the entropy of a random variable is the maximum amount of information it can have about any other variable (or another variable can have about it).

Mutual information is defined analogously for continuous variables and, unlike differential entropy, it retains its interpretability in the continuous case. Furthermore, one can track how much information a system preserves during its temporal evolution by computing the time-delayed mutual information (TDMI), I(Xt−τ; Xt).

Next, we introduce notation and several useful identities to handle Gaussian variables. Given an n-dimensional real-valued system X, we denote its covariance matrix as Σ(X). Similarly, cross-covariance matrices are denoted as Σ(X,Y). We will make use of the conditional (or partial) covariance formula,

$$\Sigma(X|Y) := \Sigma(X) - \Sigma(X,Y)\,\Sigma(Y)^{-1}\,\Sigma(Y,X). \tag{7}$$

For Gaussian variables,

$$H(X) = \tfrac{1}{2}\log\det\Sigma(X) + \tfrac{n}{2}\log(2\pi e), \tag{8}$$
$$H(X|Y=y) = \tfrac{1}{2}\log\det\Sigma(X|Y) + \tfrac{n}{2}\log(2\pi e) \quad \forall y, \tag{9}$$
$$I(X;Y) = \tfrac{1}{2}\log\frac{\det\Sigma(X)}{\det\Sigma(X|Y)}. \tag{10}$$

All systems we deal with in this article are stationary and ergodic, so throughout the paper p(Xt) = p(Xt′) for any t, t′.
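As a concrete illustration of Eqs. (7)–(10), the following minimal sketch (function names are ours, not from the paper) computes Gaussian differential entropy and mutual information directly from covariance matrices. For two unit-variance variables with correlation ρ, Eq. (10) reduces to the textbook value −½ log(1 − ρ²).

```python
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy (in nats) of a Gaussian with covariance `cov`, Eq. (8)."""
    cov = np.atleast_2d(cov)
    n = cov.shape[0]
    return 0.5 * np.log(np.linalg.det(cov)) + 0.5 * n * np.log(2 * np.pi * np.e)

def conditional_cov(sxx, sxy, syy):
    """Partial covariance Sigma(X|Y), Eq. (7); sxy = Sigma(X, Y)."""
    return sxx - sxy @ np.linalg.solve(syy, sxy.T)

def gaussian_mi(sxx, sxy, syy):
    """I(X;Y) for jointly Gaussian X and Y, Eq. (10)."""
    return 0.5 * np.log(np.linalg.det(sxx) / np.linalg.det(conditional_cov(sxx, sxy, syy)))

# Two unit-variance variables with correlation rho;
# analytic value of I(X;Y) is -0.5 * log(1 - rho**2).
rho = 0.8
mi = gaussian_mi(np.array([[1.0]]), np.array([[rho]]), np.array([[1.0]]))
```

The same three helpers suffice for all the Gaussian formulae in Section 3, since each measure there is built from determinants of (conditional) covariance matrices.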

## 3 Integrated information measures

### 3.1 Overview

In this section we review the theoretical underpinnings and practical considerations of several proposed measures of integrated information, and in particular how they relate to intuitions about segregation, integration and complexity. These measures are:

• Whole-minus-sum integrated information, Φ;

• Integrated stochastic interaction, ~Φ;

• Integrated synergy, ψ;

• Decoder-based integrated information, Φ∗;

• Geometric integrated information, ΦG; and

• Causal density, CD.

All of these measures (besides CD) were inspired by the measure proposed by Balduzzi and Tononi in [5], which we refer to simply as Balduzzi and Tononi’s Φ. That measure was based on the information the current state contains about a hypothetical maximum-entropy past state. In practice, this results in measures that are applicable only to discrete Markovian systems [11]. For broader applicability, it is more practical to build measures based on the ongoing spontaneous information dynamics – that is, based on the stationary distribution p(Xt−τ, Xt) without applying a perturbation to the system. Measures are then well-defined for any stochastic system (with a well-defined Lebesgue measure across the states), and can be estimated for real data using empirical distributions if stationarity can be assumed. All of the measures we consider in this paper are based on a system’s spontaneous information dynamics.

Table 1 contains a brief description of each measure and a reference to the original publication that introduced it. We refer the reader to the original publications for more detailed descriptions of each measure. Table 2 contains a summary of properties of the measures considered, proven for the case in which the system is ergodic and stationary, and the spontaneous distribution is used.

### 3.2 Minimum information partition

Key to all measures of integrated information is the notion of splitting or partitioning the system, to quantify the effect of such a split on the system as a whole. In that spirit, integrated information measures are defined through some measure of effective information, which operationalises the concept of “information beyond a partition”. This typically involves splitting the system according to a partition P and computing some form of information loss, via (for example) mutual information (Φ), conditional entropy (~Φ), or decoding accuracy (Φ∗) (see Table 1). Integrated information is then the effective information with respect to the partition that identifies the “weakest link” in the system, i.e. the partition across which the parts are least integrated. Formally, integrated information is the effective information beyond the minimum information partition (MIP), which, given an effective information measure f, is defined as

$$\mathcal{P}^{\mathrm{MIP}} = \arg\min_{\mathcal{P}} \frac{f[X;\tau,\mathcal{P}]}{K(\mathcal{P})}, \tag{11}$$

where K(P) is a normalisation coefficient. In other words, the MIP is the partition across which the (normalised) effective information is minimal, and integrated information is the (unnormalised) effective information beyond the MIP. The purpose of the normalisation coefficient is to avoid biasing the minimisation towards unbalanced bipartitions (recall that the extent of information sharing between parts is bounded by the entropy of the smaller part). Balduzzi and Tononi [5] suggest the form

$$K(\mathcal{P}) = (r-1)\,\min_k H(M^k_t). \tag{12}$$

However, not all contributions to IIT have followed Balduzzi and Tononi’s treatment of the MIP. Of the measures listed above, Φ and ~Φ share this partition scheme, Φ∗ defines the MIP through an unnormalised effective information, and ψ, ΦG and CD are defined via the atomic partition without any reference to the MIP. These differences are a confounding factor when comparing measures – it becomes difficult to ascertain whether differences in the behaviour of various measures are due to their definitions of effective information, to their normalisation factors (or lack thereof), or to their partition schemes. We return to this discussion in Sec. 5.1.

In the following we present all measures as they were introduced in their original papers (see Table 1), although it is trivial to combine different effective information measures with different partition optimisation schemes. However, all results presented in Sec. 4 are calculated by minimising each unnormalised effective information measure over even-sized bipartitions – i.e. bipartitions in which both parts have the same number of components. This is to avoid conflating the effect of the partition scan method with the effect of the integrated information measure itself.
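As an illustration of the partition scan used in our results, the sketch below (a hypothetical helper, not from the paper) enumerates the even-sized bipartitions of an n-node system. For the eight-node networks of Sec. 4 there are C(8,4)/2 = 35 such bipartitions to scan per measure.

```python
from itertools import combinations

def even_bipartitions(n):
    """All bipartitions of {0, ..., n-1} into two parts of equal size.

    Each unordered bipartition {A, B} is returned exactly once, by keeping
    only the half that contains element 0.
    """
    elements = set(range(n))
    parts = []
    for half in combinations(range(n), n // 2):
        if 0 in half:  # avoid counting {A, B} and {B, A} twice
            parts.append((tuple(sorted(half)), tuple(sorted(elements - set(half)))))
    return parts

bips = even_bipartitions(8)
```

Each candidate bipartition would then be fed to the chosen (unnormalised) effective information measure, and the minimum over `bips` reported as the integrated information.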

### 3.3 Whole-minus-sum integrated information Φ

We next turn to the different measures of integrated information. As highlighted above, a primary difference among them is how they define the effective information beyond a given partition. Since most measures were inspired by Balduzzi and Tononi’s Φ, we start there.

For Balduzzi and Tononi’s Φ, the effective information is given by the KL divergence between p(X0|X1) and its partitioned counterpart ∏k p(Mk0|Mk1), where these conditional distributions are computed under a perturbation of the system at time 0 into all states with equal probability – i.e. given that the joint distribution is pu(X0, X1) = p(X1|X0)u(X0), where u is the uniform (maximum entropy) distribution.

Averaging over all states x1, the result can be expressed as either

$$I(X_0;X_1) - \sum_{k=1}^{r} I(M^k_0;M^k_1), \tag{13}$$

or

$$-H(X_0|X_1) + \sum_{k=1}^{r} H(M^k_0|M^k_1). \tag{14}$$

These two expressions are equivalent under the uniform perturbation, since they differ only by the term H(X0) − Σk H(Mk0), which vanishes when every part is independently and uniformly distributed. However, they are not equivalent if the spontaneous distribution of the system is used instead – i.e. if p(X0) is used instead of u(X0). This means that for application to spontaneous dynamics (i.e. without perturbation) we have two alternatives, giving rise to two measures that are both equally valid analogues of Balduzzi and Tononi’s Φ.

We call the first alternative whole-minus-sum integrated information, Φ (denoted ΦE in [11]). The effective information is defined as the difference in time-delayed mutual information between the whole system and the parts. The effective information of the system beyond a partition P is

$$\varphi[X;\tau,\mathcal{P}] := I(X_{t-\tau};X_t) - \sum_{k=1}^{r} I(M^k_{t-\tau};M^k_t). \tag{15}$$

We can interpret the TDMI I(Xt−τ;Xt) as quantifying how good the system is at predicting its own future or decoding its own past. Then φ can be seen as the loss in predictive power incurred by splitting the system according to P. The details of the calculation of Φ (and the MIP) are shown in Box 3.1.

Φ is often regarded as a poor measure of integrated information because it can be negative [35]. This is indeed conceptually awkward if Φ is seen as an absolute measure of integration between the parts of a system, though it is a reasonable property if Φ is interpreted as a “net synergy” measure [9] – quantifying the extent to which the parts have shared or complementary information about the future state. That is, a positive Φ implies that the whole is better than the parts at predicting the future (it is a sufficient condition for this), but a negative or zero Φ does not imply the opposite. Therefore, from an IIT perspective a negative Φ can lead to the understandably confusing interpretation of a system having “negative integration”, but through a different lens (net synergy) it can be more easily interpreted as indicating overall redundancy in the evolution of the system. See Section 3.5 and Ref. [9] for further discussion of whole-minus-sum measures.

**Box 3.1: Calculating whole-minus-sum integrated information Φ**

$$\Phi[X;\tau] = \varphi[X;\tau,\mathcal{B}^{\mathrm{MIB}}], \tag{16a}$$
$$\mathcal{B}^{\mathrm{MIB}} = \arg\min_{\mathcal{B}} \frac{\varphi[X;\tau,\mathcal{B}]}{K(\mathcal{B})}, \tag{16b}$$
$$\varphi[X;\tau,\mathcal{B}] = I(X_{t-\tau};X_t) - \sum_{k=1}^{2} I(M^k_{t-\tau};M^k_t), \tag{16c}$$
$$K(\mathcal{B}) = \min\{H(M^1_t),\, H(M^2_t)\}. \tag{16d}$$
1. For discrete variables:

$$I(X_{t-\tau};X_t) = \sum_{x,x'} p(X_{t-\tau}=x, X_t=x')\,\log\frac{p(X_{t-\tau}=x, X_t=x')}{p(X_{t-\tau}=x)\,p(X_t=x')}$$
2. For continuous, linear-Gaussian variables:

$$I(X_{t-\tau};X_t) = \frac{1}{2}\log\frac{\det\Sigma(X_t)}{\det\Sigma(X_t|X_{t-\tau})}$$
3. For continuous variables with an arbitrary distribution, we must resort to the nearest-neighbour methods introduced by [25]. See reference for details.
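As a worked sketch of item 2 above (helper names are ours; it assumes an AR(1) process of the kind used in Sec. 4, with τ = 1), the code below computes φ of Eq. (15) from stationary covariances. The stationary covariance is obtained by iterating Σ ← AΣAᵀ + Σ(ε) to convergence.

```python
import numpy as np

def stationary_cov(A, noise_cov, iters=1000):
    """Fixed-point iteration for the stationary covariance Sigma = A Sigma A^T + Sigma_eps."""
    sigma = np.array(noise_cov, dtype=float)
    for _ in range(iters):
        sigma = A @ sigma @ A.T + noise_cov
    return sigma

def gaussian_mi(sxx, sxy, syy):
    """I(X;Y) for jointly Gaussian X, Y, via Eqs. (7) and (10)."""
    cond = sxx - sxy @ np.linalg.solve(syy, sxy.T)
    return 0.5 * np.log(np.linalg.det(sxx) / np.linalg.det(cond))

def phi_wms(A, noise_cov, parts):
    """Whole-minus-sum effective information, Eq. (15), for a stationary
    Gaussian AR(1) process X_{t+1} = A X_t + eps, with tau = 1.
    `parts` is a list of integer index arrays defining the partition."""
    sigma = stationary_cov(A, noise_cov)   # Sigma(X)
    lagged = sigma @ A.T                   # Sigma(X_{t-1}, X_t)
    whole = gaussian_mi(sigma, lagged, sigma)
    part_sum = 0.0
    for p in parts:
        ix = np.ix_(p, p)
        part_sum += gaussian_mi(sigma[ix], lagged[ix], sigma[ix])
    return whole - part_sum

# Hypothetical two-node example: symmetric coupling, independent noise
A = np.array([[0.4, 0.2], [0.2, 0.4]])
phi = phi_wms(A, np.eye(2), [np.array([0]), np.array([1])])
```

For a fully disconnected system (diagonal A, independent noise) the whole TDMI equals the sum of the parts’ TDMIs, and φ is exactly zero.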

### 3.4 Integrated stochastic interaction ~Φ

We next consider the second alternative analogue of Φ for spontaneous information dynamics: integrated stochastic interaction, ~Φ. Also introduced in Barrett and Seth [11], this measure embodies similar concepts to Φ, with the main difference being that ~Φ utilises a definition of effective information in terms of an increase in uncertainty, instead of a loss of information.

~Φ is based on stochastic interaction, introduced by Ay [4]. Akin to Eq. (15), we define stochastic interaction beyond partition P as

$$\tilde{\varphi}[X;\tau,\mathcal{P}] := \sum_{k=1}^{r} H(M^k_{t-\tau}|M^k_t) - H(X_{t-\tau}|X_t). \tag{17}$$

Stochastic interaction quantifies the extent to which uncertainty about the past is increased when the system is split into parts, compared with considering the system as a whole. The details of the calculation of ~Φ are similar to those of Φ and are described in Box 3.2.

The most notable advantage of ~Φ over Φ as a measure of integrated information is that ~Φ is guaranteed to be non-negative. In fact, as mentioned above, ~φ and φ are related through the equation

$$\tilde{\varphi}[X;\tau,\mathcal{P}] = \varphi[X;\tau,\mathcal{P}] + I(M^1_t;M^2_t;\ldots;M^r_t), \tag{18}$$

where

$$I(M^1_t;M^2_t;\ldots;M^r_t) = \sum_{k=1}^{r} H(M^k_t) - H(X_t). \tag{19}$$

This measure is also linked to information destruction, as presented in Wiesner et al. [48]. The conditional entropy H(Xt−τ|Xt) measures the amount of irreversibly destroyed information: when it is positive, more than one possible past trajectory of the system converges on the same present state, making the system irreversible and entailing a loss of information about past states. From this perspective, ~φ can be understood as the difference between the information considered destroyed when the system is split into parts and when it is observed as a whole. Note, however, that this measure is time-symmetric when applied to a stationary system; for stationary systems total instantaneous entropy does not increase with time.

**Box 3.2: Calculating integrated stochastic interaction ~Φ**

$$\tilde{\Phi}[X;\tau] = \tilde{\varphi}[X;\tau,\mathcal{B}^{\mathrm{MIB}}], \tag{20a}$$
$$\mathcal{B}^{\mathrm{MIB}} = \arg\min_{\mathcal{B}} \frac{\tilde{\varphi}[X;\tau,\mathcal{B}]}{K(\mathcal{B})}, \tag{20b}$$
$$\tilde{\varphi}[X;\tau,\mathcal{B}] = \sum_{k=1}^{2} H(M^k_{t-\tau}|M^k_t) - H(X_{t-\tau}|X_t), \tag{20c}$$
$$K(\mathcal{B}) = \min\{H(M^1_t),\, H(M^2_t)\}. \tag{20d}$$
1. For discrete variables:

$$H(X_{t-\tau}|X_t) = -\sum_{x,x'} p(X_{t-\tau}=x, X_t=x')\,\log\frac{p(X_{t-\tau}=x, X_t=x')}{p(X_t=x')}$$
2. For continuous, linear-Gaussian variables:

$$H(X_{t-\tau}|X_t) = \frac{1}{2}\log\det\Sigma(X_{t-\tau}|X_t) + \frac{n}{2}\log(2\pi e)$$
3. For continuous variables with an arbitrary distribution, we must resort to the nearest-neighbour methods introduced by [25]. See reference for details.
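Item 2 above can be sketched in the same style as the previous box (helper names are ours; assumes an AR(1) process with τ = 1). By construction ~φ is non-negative, and it vanishes for a fully disconnected system.

```python
import numpy as np

def stationary_cov(A, noise_cov, iters=1000):
    """Iterate Sigma <- A Sigma A^T + Sigma_eps to the stationary covariance."""
    sigma = np.array(noise_cov, dtype=float)
    for _ in range(iters):
        sigma = A @ sigma @ A.T + noise_cov
    return sigma

def gaussian_cond_entropy(sxx, sxy, syy):
    """H(X|Y) in nats for jointly Gaussian X, Y (Box 3.2, item 2)."""
    cond = sxx - sxy @ np.linalg.solve(syy, sxy.T)
    n = sxx.shape[0]
    return 0.5 * np.log(np.linalg.det(cond)) + 0.5 * n * np.log(2 * np.pi * np.e)

def phi_tilde(A, noise_cov, parts):
    """Integrated stochastic interaction, Eq. (17), for a Gaussian AR(1)
    process with tau = 1; `parts` is a list of index arrays."""
    sigma = stationary_cov(A, noise_cov)
    lagged = sigma @ A.T                                  # Sigma(X_{t-1}, X_t)
    whole = gaussian_cond_entropy(sigma, lagged, sigma)   # H(X_{t-1}|X_t)
    part_sum = 0.0
    for p in parts:
        ix = np.ix_(p, p)
        part_sum += gaussian_cond_entropy(sigma[ix], lagged[ix], sigma[ix])
    return part_sum - whole

A = np.array([[0.4, 0.2], [0.2, 0.4]])
pt = phi_tilde(A, np.eye(2), [np.array([0]), np.array([1])])
```

Comparing `pt` with `phi` from the previous sketch on the same system illustrates Eq. (18): the two differ exactly by the instantaneous multi-information of Eq. (19).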

### 3.5 Integrated synergy ψ

Originally designed as a “more principled” integrated information measure [19], ψ shares some features with Φ and ~Φ but is grounded in a different branch of information theory, namely the Partial Information Decomposition (PID) framework, as described by Williams and Beer [49]. In the PID, the information that two (source) variables provide about a third (target) variable is decomposed into four non-negative terms as

$$I(X,Y;Z) = U_X(X;Z) + U_Y(Y;Z) + R(X,Y;Z) + S(X,Y;Z),$$

where U_X (resp. U_Y) is the unique information of source X (resp. Y), R is the redundancy between the two sources, and S is their synergy. Figure 1 illustrates the quantities involved in a Venn diagram.

Integrated synergy ψ is the information that the parts provide about the future of the system that is exclusively synergistic – i.e. that cannot be provided by any combination of the parts independently:

$$\psi[X;\tau,\mathcal{P}] := I(X_{t-\tau};X_t) - \max_{\mathcal{P}} I_{\cup}(M^1_{t-\tau}, M^2_{t-\tau}, \ldots, M^r_{t-\tau};X_t), \tag{21}$$

where

$$I_{\cup}(M^1_{t-\tau},\ldots,M^r_{t-\tau};X_t) := \sum_{S\subseteq\{M^1,\ldots,M^r\}} (-1)^{|S|+1}\, I_{\cap}(S^1_{t-\tau},\ldots,S^{|S|}_{t-\tau};X_t), \tag{22}$$

and I∩ denotes the redundant information a set of sources has about the target Xt. The main problem of PID is that it is underdetermined. For example, for the case of two sources, Shannon’s information theory specifies three quantities (I(X;Z), I(Y;Z) and I(X,Y;Z)) whereas the PID comprises four (U_X, U_Y, R, S). Therefore, a complete operational definition of ψ requires a definition of redundancy from which to construct the partial information components [49]. In this sense, the main shortcoming of ψ, inherited from PID, is that there is no agreed consensus on a definition of redundancy [9, 12].

Here, we take Griffith’s conceptual definition of ψ and complement it with available definitions of redundancy. For the linear-Gaussian systems we study in Sec. 4, we use the minimum mutual information (MMI) PID presented in [9]. Although we do not show any discrete examples here, for completeness we provide complete formulae to calculate ψ for discrete variables using Griffith and Koch’s redundancy measure [20]. Note that alternatives are available for both discrete and linear-Gaussian systems [38, 23, 49, 13, 24].

**Box 3.3: Calculating integrated synergy ψ**

$$\psi[X;\tau,\mathcal{P}] = I(X_{t-\tau};X_t) - \max_{\mathcal{P}} I_{\cup}(M^1_{t-\tau},\ldots,M^r_{t-\tau};X_t) \tag{23}$$
1. For discrete variables: (following Griffith and Koch’s [20] PID scheme)

$$I_{\cup}(M^1_{t-\tau},\ldots,M^r_{t-\tau};X_t) = \min_q \sum_{x,x'} q(x,x')\log\frac{q(x,x')}{q(x)\,q(x')} \quad \text{s.t. } q(M^i_{t-\tau},X_t) = p(M^i_{t-\tau},X_t)\ \forall i$$
2. For continuous, linear-Gaussian variables:

$$I_{\cup}(M^1_{t-\tau},\ldots,M^r_{t-\tau};X_t) = \max_k I(M^k_{t-\tau};X_t)$$
3. For continuous variables with an arbitrary distribution: unknown.
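For the linear-Gaussian case, item 2 above makes ψ straightforward to compute: the union information is just the largest mutual information any single part has about the whole future state. The sketch below (helper names are ours; assumes an AR(1) process with τ = 1 and the MMI union information of item 2) illustrates this.

```python
import numpy as np

def stationary_cov(A, noise_cov, iters=1000):
    """Iterate Sigma <- A Sigma A^T + Sigma_eps to the stationary covariance."""
    sigma = np.array(noise_cov, dtype=float)
    for _ in range(iters):
        sigma = A @ sigma @ A.T + noise_cov
    return sigma

def gaussian_mi(sxx, sxy, syy):
    """I(X;Y) for jointly Gaussian X, Y, via Eqs. (7) and (10)."""
    cond = sxx - sxy @ np.linalg.solve(syy, sxy.T)
    return 0.5 * np.log(np.linalg.det(sxx) / np.linalg.det(cond))

def psi_mmi(A, noise_cov, parts):
    """Integrated synergy, Eq. (21), with the MMI union information
    max_k I(M^k_{t-1}; X_t) of Box 3.3 item 2; tau = 1."""
    sigma = stationary_cov(A, noise_cov)
    lagged = sigma @ A.T                       # Sigma(X_{t-1}, X_t)
    whole = gaussian_mi(sigma, lagged, sigma)  # I(X_{t-1}; X_t)
    union = max(
        gaussian_mi(sigma[np.ix_(p, p)], lagged[p], sigma)  # I(M^k_{t-1}; X_t)
        for p in parts)
    return whole - union

A = np.array([[0.4, 0.2], [0.2, 0.4]])
psi = psi_mmi(A, np.eye(2), [np.array([0]), np.array([1])])
```

Since I(Mk_{t−τ};Xt) ≤ I(Xt−τ;Xt) for every part, this quantity is non-negative by construction. Note that under the MMI redundancy it need not vanish for a disconnected system, which is one illustration of how the choice of redundancy function matters.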

### 3.6 Decoder-based integrated information Φ∗

Introduced by Oizumi et al. in Ref. [35], decoder-based integrated information Φ∗ takes a different approach from the previous measures. In general, Φ∗ is given by

$$\Phi^*[X;\tau,\mathcal{P}] := I(X_{t-\tau};X_t) - I^*[X;\tau,\mathcal{P}], \tag{25}$$

where I∗ is known as the mismatched decoding information, which quantifies how much information can be extracted from a variable if the receiver uses a suboptimal (or mismatched) decoding distribution [27, 33]. Mismatched information has been used in neuroscience to quantify the contribution of neural correlations to stimulus coding [36], and can similarly be used to measure the contribution of inter-partition correlations to predictive information.

To calculate I∗ we formulate a restricted model in which the correlations between partitions are ignored,

$$q(X_t|X_{t-\tau}) = \prod_i p(M^i_t|M^i_{t-\tau}), \tag{26}$$

and we calculate I∗ for the case where the sender uses the full model as an encoder and the receiver uses the restricted model as a decoder. The details of the calculation of I∗ and Φ∗ are shown in Box 3.4. Unlike the previous measures in this section, Φ∗ does not have an interpretable formulation in terms of simpler information-theoretic functionals like entropy and mutual information.

Calculating I∗ involves a one-dimensional optimisation over a parameter β, which is straightforwardly solvable if the optimised quantity, ~I(β), has a closed-form expression [27]. For systems with continuous variables, I∗ is in general very hard to estimate. However, ~I(β) has an analytic closed form as a function of β for continuous linear-Gaussian systems if the covariance of the system is known, and for discrete systems if the joint probability table is known. In Appendix A we derive the formulae. (Note that the version written down in [35] is incorrect, although their simulations match our results; we checked results from our derived version of the formulae against results obtained from numerical integration, and confirmed that our derived formulae are correct.) Conveniently, in both the discrete and the linear-Gaussian case ~I(β) is concave in β (proofs in [27] and in Appendix A, respectively), which makes the optimisation significantly easier.

**Box 3.4: Calculating decoder-based integrated information Φ∗**

$$\Phi^*[X;\tau,\mathcal{P}] = I(X_{t-\tau};X_t) - I^*[X;\tau,\mathcal{P}], \tag{27a}$$
$$I^*[X;\tau,\mathcal{P}] = \max_{\beta} \tilde{I}(\beta;X,\tau,\mathcal{P}). \tag{27b}$$
1. For discrete variables:

$$\tilde{I}(\beta;X,\tau,\mathcal{P}) = -\sum_{x'} p(X_t=x')\log\sum_{x} p(X_{t-\tau}=x)\,q(X_t=x'|X_{t-\tau}=x)^{\beta} + \sum_{x,x'} p(X_{t-\tau}=x,X_t=x')\log q(X_t=x'|X_{t-\tau}=x)^{\beta}$$
2. For continuous, linear-Gaussian variables: (see appendix for details)

$$\tilde{I}(\beta;X,\tau,\mathcal{P}) = \frac{1}{2}\log\frac{|Q|}{|\Sigma_x|} + \frac{1}{2}\operatorname{tr}(\Sigma_x R) + \beta\operatorname{tr}\!\left(\Pi_{x|\tilde{x}}^{-1}\,\Pi_{x\tilde{x}}\,\Pi_x^{-1}\,\Sigma_{\tilde{x}x}\right)$$
3. For continuous variables with an arbitrary distribution: unknown.

### 3.7 Geometric integrated information ΦG

In [37], Oizumi et al. approach the notion of dynamical complexity via yet another formalism: information geometry [2, 1], the application of differential geometry to the structure of probability distributions. The objects of study in information geometry are spaces of families of probability distributions, considered as differentiable (smooth) manifolds. The natural metric is the Fisher information metric, and the KL divergence provides a natural measure of (asymmetric) distance between probability distributions.

To quantify integrated information, Oizumi et al. [37] consider the divergence between the complete model of the system under study and a restricted model in which the links between the parts of the system have been severed. This is known as the M-projection of the system onto the manifold Q of restricted models, and

$$\Phi^G[X;\tau,\mathcal{P}] := \min_{q\in Q} D_{\mathrm{KL}}\big(p(X_{t-\tau},X_t)\,\|\,q(X_{t-\tau},X_t)\big). \tag{28}$$

Key to this measure is that in the partitioned system only the connections between parts are cut; correlations between the parts are still allowed. Although conceptually simple, ΦG is very hard to calculate compared with all the other measures we consider here (see Box 3.5). There is no known closed-form solution for any system, and we can only find approximate numerical estimates for some systems. In particular, for discrete and linear-Gaussian variables we can formulate ΦG as the solution of a constrained multivariate optimisation problem, with the advantage that the optimisation objective is differentiable and convex [14].

**Box 3.5: Calculating geometric integrated information ΦG**

$$\Phi^G[X;\tau,\mathcal{P}] = \min_{q} D_{\mathrm{KL}}(p\,\|\,q) \tag{29a}$$
$$\text{s.t. } q(M^i_{t+\tau}|X_t) = q(M^i_{t+\tau}|M^i_t) \tag{29b}$$
1. For discrete variables: numerically optimise the objective subject to the constraints

2. For continuous, linear-Gaussian variables: numerically optimise the objective

$$\Phi^G[X;\tau,\mathcal{P}] = \min_{\Sigma(E)'} \frac{1}{2}\log\frac{|\Sigma(E)'|}{|\Sigma(E)|},$$

where A is the regression matrix of the full model, A′ that of the restricted model, and Σ(E), Σ(E)′ the corresponding residual covariances, subject to the constraints

$$\Sigma(E)' = \Sigma(E) + (A-A')\,\Sigma(X)\,(A-A')^{\mathsf{T}} \quad\text{and}\quad \left(\Sigma(X)\,(A-A')\,\Sigma(E)'^{-1}\right)_{ii} = 0$$
3. For continuous variables with an arbitrary distribution: unknown.

### 3.8 Causal density

Causal density (CD) is somewhat distinct from the other measures considered so far, in the sense that it is a sum of information transfers rather than a direct measure of the extent to which the whole is greater than the parts. Nevertheless, we include it here because of its relevance and use in the dynamical complexity literature.

CD was originally defined in terms of Granger causality [18], but here we write it in terms of transfer entropy (TE), which provides a more general information-theoretic formulation [6]. The conditional transfer entropy from X to Y conditioned on Z is defined as

$$\mathrm{TE}_{\tau}(X\to Y\,|\,Z) := I(X_t;\,Y_{t+\tau}\,|\,Z_t, Y_t). \tag{30}$$

With this definition of TE, we define CD as the average pairwise conditional TE between all parts of X,

$$\mathrm{CD}[X;\tau,\mathcal{P}] := \frac{1}{r(r-1)}\sum_{i\neq j} \mathrm{TE}_{\tau}(M^i\to M^j\,|\,M^{[ij]}), \tag{31}$$

where M[ij] is the subsystem formed by all variables in X except those in parts Mi and Mj.

In a practical sense, CD has many advantages. It has been thoroughly studied in theory [7] and applied in practice, with application domains ranging from complex systems to neuroscience [28, 29, 32]. Furthermore, off-the-shelf algorithms exist to calculate TE in both discrete and continuous systems [8]. For details of the calculation of CD see Box 3.6.

Causal density is a principled measure of dynamical complexity, as it vanishes for purely segregated and for purely integrated systems: in a highly segregated system there is no information transfer at all, while in a highly integrated system no variable transfers information to another beyond what is already accounted for by the rest of the system [39]. Furthermore, CD is non-negative and upper-bounded by the total time-delayed mutual information (proof in Appendix B), thereby satisfying what other authors consider an essential requirement for a measure of integrated information [37].

**Box 3.6: Calculating causal density CD**

$$\mathrm{CD}[X;\tau,\mathcal{P}] = \frac{1}{r(r-1)}\sum_{i\neq j}\mathrm{TE}_{\tau}(M^i\to M^j\,|\,M^{[ij]}) \tag{32}$$
1. For discrete variables:

$$\mathrm{TE}_{\tau}(X^i\to X^j\,|\,X^{[ij]}) = \sum_{x,x'} p(X^j_{t+\tau}=x'_j, X_t=x)\,\log\frac{p(X^j_{t+\tau}=x'_j\,|\,X_t=x)}{p(X^j_{t+\tau}=x'_j\,|\,X^j_t=x_j,\,X^{[ij]}_t=x_{[ij]})}$$
2. For continuous, linear-Gaussian variables:

$$\mathrm{TE}_{\tau}(X^i\to X^j\,|\,X^{[ij]}) = \frac{1}{2}\log\frac{\det\Sigma(X^j_{t+\tau}\,|\,X^j_t\oplus X^{[ij]}_t)}{\det\Sigma(X^j_{t+\tau}\,|\,X_t)}$$
3. For continuous variables with an arbitrary distribution, we must resort to the nearest-neighbour methods introduced by [25]. See reference for details.
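Item 2 above can also be implemented directly from stationary covariances. The sketch below (helper names are ours; assumes an AR(1) process with τ = 1 and single-node parts) computes Eq. (31): for each ordered pair of parts, the numerator conditions the target’s future on everything except the source’s past, and the denominator conditions on the full past.

```python
import numpy as np

def stationary_cov(A, noise_cov, iters=1000):
    """Iterate Sigma <- A Sigma A^T + Sigma_eps to the stationary covariance."""
    sigma = np.array(noise_cov, dtype=float)
    for _ in range(iters):
        sigma = A @ sigma @ A.T + noise_cov
    return sigma

def cond_cov(sxx, sxy, syy):
    """Partial covariance Sigma(X|Y), Eq. (7); sxy = Sigma(X, Y)."""
    return sxx - sxy @ np.linalg.solve(syy, sxy.T)

def causal_density(A, noise_cov, parts):
    """Average pairwise conditional transfer entropy, Eq. (31), tau = 1."""
    n = A.shape[0]
    sigma = stationary_cov(A, noise_cov)
    lagged = sigma @ A.T                      # Sigma(X_t, X_{t+1})
    r = len(parts)
    total = 0.0
    for i in range(r):
        rest = np.concatenate([parts[k] for k in range(r) if k != i])
        for j in range(r):
            if i == j:
                continue
            tgt = parts[j]
            # Sigma(X^j_{t+1} | past of all parts except the source M^i)
            num = cond_cov(sigma[np.ix_(tgt, tgt)],
                           lagged[np.ix_(rest, tgt)].T,
                           sigma[np.ix_(rest, rest)])
            # Sigma(X^j_{t+1} | X_t): residual noise of part j
            den = cond_cov(sigma[np.ix_(tgt, tgt)],
                           lagged[np.ix_(np.arange(n), tgt)].T,
                           sigma)
            total += 0.5 * np.log(np.linalg.det(num) / np.linalg.det(den))
    return total / (r * (r - 1))

# Hypothetical two-node example: node 1 drives node 0, independent noise
A = np.array([[0.4, 0.2], [0.0, 0.4]])
cd = causal_density(A, np.eye(2), [np.array([0]), np.array([1])])
```

For this unidirectional coupling only the 1 → 0 transfer is non-zero, so CD is small but positive; for a diagonal coupling matrix with independent noise every TE term, and hence CD, vanishes.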

### 3.9 Other measures

As already mentioned, all the measures reviewed here (besides CD) were inspired by the Φ measure that arose from the version of IIT laid out in Ref. [5]. The most recent version of IIT [34] is conceptually distinct, and the associated measure “Φ-3.0” is consequently different from the measures we consider here. Its consideration of perturbations of the system, as well as of all of its subsets, in both the past and the future renders Φ-3.0 considerably more computationally expensive than the other measures. We do not attempt here to construct an analogue of Φ-3.0 for spontaneous information dynamics; such an undertaking lies beyond the scope of this paper.

Recently, Tegmark [41] developed a comprehensive taxonomy of all integrated information measures that can be written as a distance between a probability distribution pertaining to the whole and one obtained as a product of probability distributions pertaining to the parts. Tegmark further identified a shortlist of candidate measures, based on a set of explicit desiderata. This shortlist overlaps with the measures we consider here, and also contains other measures that are minor variants: one shortlisted measure is equivalent to ~Φ under the system’s spontaneous distribution, another is its state-resolved version, a third is transfer entropy (which we cover here through CD), and a fourth is not defined for continuous variables. The remaining measures we consider fall outside Tegmark’s classification scheme.

## 4 Results

All of the measures of integrated information that we have described have the potential to behave in ways that are not obvious a priori, and that are difficult to characterise analytically. While some simulations of Φ, Φ̃ and CD on networks have been performed [11, 39], Φ* and ΦG have not previously been computed on models consisting of more than two components, and ψ has not previously been explored at all on systems with continuous variables. In this section we study all the measures together on small networks, compare their behaviour, and assess the extent to which each measure genuinely captures dynamical complexity.

To recap, we consider the following six measures:

• Whole-minus-sum integrated information, Φ.

• Integrated stochastic interaction, Φ̃.

• Decoder-based integrated information, Φ*.

• Geometric integrated information, ΦG.

• Integrated synergy, ψ.

• Causal density, CD.

We use models based on stochastic linear autoregressive (AR) processes with Gaussian variables. These constitute appropriate models for testing the measures of integrated information: they are straightforward to parameterise and simulate, and are amenable to the formulae presented in Section 3. Mathematically, we define an AR process (of order 1) by the update equation

 Xt+1 = A Xt + εt , (33)

where εt is a serially independent random sample from a zero-mean Gaussian distribution with given covariance Σ(ε), usually referred to as the noise or error term. A particular AR process is completely specified by the coupling matrix (or network) A and the noise covariance matrix Σ(ε). An AR process is stable, and stationary, if the spectral radius of the coupling matrix is less than 1 [30]. (The spectral radius of a matrix is the largest of the absolute values of its eigenvalues.) All the example systems we consider are calibrated to be stable, so the measures can be computed from their stationary statistics.
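As a concrete illustration, the update equation (33) and the spectral-radius stability check can be implemented as follows. This is a minimal Python sketch, not code from the paper; the parameter values are arbitrary stable choices.

```python
import numpy as np

def simulate_ar(A, Seps, T=10000, seed=0):
    """Simulate X_{t+1} = A X_t + eps_t, with eps_t ~ N(0, Seps), i.i.d. over time."""
    rho = np.max(np.abs(np.linalg.eigvals(A)))   # spectral radius
    assert rho < 1, "process is unstable / non-stationary"
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(Seps)                 # gives noise with covariance Seps
    X = np.zeros((T, A.shape[0]))
    for t in range(T - 1):
        X[t + 1] = A @ X[t] + L @ rng.standard_normal(A.shape[0])
    return X

# two coupled nodes, moderate coupling and noise correlation
A = np.array([[0.4, 0.4], [0.4, 0.4]])           # spectral radius 0.8 < 1
Seps = np.array([[1.0, 0.5], [0.5, 1.0]])
X = simulate_ar(A, Seps)
```

The assertion enforces the stationarity condition quoted above; with a spectral radius of 0.8 the simulated series settles into a stationary regime whose statistics match the analytical expressions derived below.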

We shall consider how the measures vary with respect to: (i) the strength of connections, i.e. the magnitude of the non-zero terms in the coupling matrix; (ii) the topology of the network, i.e. the arrangement of the non-zero terms in the coupling matrix; (iii) the density of connections, i.e. the density of non-zero terms in the coupling matrix; and (iv) the correlation between noise inputs to different system components, i.e. the off-diagonal terms in Σ(ε). The strength and density of connections can be thought of as reflecting, in different ways, the level of integration in the network. The correlation between noise inputs reflects (inversely) the level of segregation, in some sense. We also, in each case, compute the control measures

• Time-delayed mutual information (TDMI), I(Xt−τ ; Xt); and

• Average absolute correlation, r̄, defined as the average absolute value of the off-diagonal entries of the system’s correlation matrix.

These simple measures quantify straightforwardly the level of interdependence between elements of the system, across time and space respectively. TDMI captures the total information generated as the system transitions from one time-step to the next, and is another basic measure of the level of integration.

We report the unnormalised measures minimised over even-sized bipartitions – i.e. bipartitions in which both parts have the same number of components. In doing this we avoid conflating the effects of the choice of definition of effective information with those of the choice of partition search (see Sec. 3.2). See Discussion (Sec. 5.1) for more on this.

### 4.1 Key quantities for computing the integrated information measures

To compute the integrated information measures, the stationary covariance and lagged partial covariance matrices are required. Multiplying Eq. (33) by its transpose and taking expectations, given that εt is white noise, uncorrelated in time, one obtains that the stationary covariance matrix is given by the solution of the discrete-time Lyapunov equation,

 Σ(Xt) = A Σ(Xt) Aᵀ + Σ(ε) . (34)

This can easily be solved numerically, for example in Matlab via the dlyap command. The lagged covariance can also be calculated from the parameters of the AR process as

 Σ(Xt−1, Xt) = ⟨Xt (A Xt + εt)ᵀ⟩ = Σ(Xt) Aᵀ , (35)

and partial covariances can be obtained by applying Eq. (7). Finally, we obtain the analogous quantities for the partitions via the marginalisation properties of the Gaussian distribution. Given a bipartition of the system into parts M and M′, with corresponding index sets m and n, we write the covariance and lagged covariance matrices in block form as

$$\Sigma(X_t) = \begin{pmatrix} \Sigma(X_t)_{mm} & \Sigma(X_t)_{mn} \\ \Sigma(X_t)_{nm} & \Sigma(X_t)_{nn} \end{pmatrix}, \qquad \Sigma(X_{t-1},X_t) = \begin{pmatrix} \Sigma(X_{t-1},X_t)_{mm} & \Sigma(X_{t-1},X_t)_{mn} \\ \Sigma(X_{t-1},X_t)_{nm} & \Sigma(X_{t-1},X_t)_{nn} \end{pmatrix}, \qquad (36)$$

and simply read off the partition covariance matrices as

 Σ(Mt) = Σ(Xt)mm , Σ(Mt−1, Mt) = Σ(Xt−1, Xt)mm . (37)
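The pipeline of Eqs. (34)-(37) is short enough to verify numerically. A Python sketch follows, assuming SciPy's `solve_discrete_lyapunov`, which solves exactly the equation form (34); the example system is our own illustrative choice.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

A = np.array([[0.0, 0.4],
              [0.4, 0.0]])          # stable: spectral radius 0.4
Seps = np.eye(2)

# Eq. (34): stationary covariance from the discrete-time Lyapunov equation
S0 = solve_discrete_lyapunov(A, Seps)
assert np.allclose(A @ S0 @ A.T + Seps, S0)

# Eq. (35): lagged covariance Sigma(X_{t-1}, X_t)
Slag = S0 @ A.T

# Eq. (37): covariances for one part of a bipartition, by marginalisation
m = [0]                              # index set of part M
S0_M = S0[np.ix_(m, m)]
Slag_M = Slag[np.ix_(m, m)]
```

For this symmetric two-node system the Lyapunov solution can also be found by hand (stationary variance 1/(1 − 0.4²) per node, zero instantaneous cross-covariance), which makes it a convenient sanity check for any implementation.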

### 4.2 Two-node network

We begin with the simplest non-trivial AR process,

$$A = \begin{pmatrix} a & a \\ a & a \end{pmatrix}, \quad (38a) \qquad \Sigma(\varepsilon) = \begin{pmatrix} 1 & c \\ c & 1 \end{pmatrix}. \quad (38b)$$

With a suitable setting of the parameters, this is the same model as depicted in Fig. 3 of Ref. [35]. We simulate the AR process at different levels of noise correlation c, and show results for all the measures in Fig. 2. Note that as c approaches 1 the system becomes degenerate, so some matrix determinants in the formulae become zero, causing some measures to diverge.

Inspection of Figure 2 immediately reveals wide variability among the measures, in both value and trend, even for this minimally simple model. Nevertheless, some patterns emerge. TDMI and ΦG are unaffected by noise correlation, whereas Φ̃ grows monotonically with c, and indeed diverges to infinity as c → 1. The measures Φ*, ψ and CD decrease monotonically to 0, reflecting that at high noise correlation the effect of the coupling cannot be distinguished from the noise. Φ also decreases monotonically, but becomes negative for large enough c.

In Fig. 3 we analyse the same system, but now varying both the noise correlation c and the coupling strength a. As per the stability condition presented above, any value of a ≥ 1/2 makes the system’s spectral radius (here equal to 2a) greater than or equal to 1, so that the system becomes non-stationary and the variances diverge. Hence in these plots we evaluate all measures for values of a below the limit a = 1/2.
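The stability limit quoted here follows from a one-line eigenvalue computation for the coupling matrix of Eq. (38a):

```latex
\det(A - \lambda I)
  = \det\begin{pmatrix} a-\lambda & a \\ a & a-\lambda \end{pmatrix}
  = (a-\lambda)^2 - a^2
  = \lambda(\lambda - 2a),
```

so the eigenvalues are 0 and 2a, the spectral radius is 2|a|, and stationarity requires |a| < 1/2.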

Again, the measures behave very differently. In this case TDMI and ΦG remain unaffected by noise correlation, and grow with increasing coupling strength, as expected. In contrast, Φ̃ increases with both a and c, while Φ decreases with c but shows non-monotonic behaviour with a. Of all the measures, Φ*, ψ and CD show desirable properties consistent with capturing conjoined segregation and integration – they monotonically decrease with noise correlation and increase with coupling strength.

### 4.3 Eight-node networks

We now turn to networks with eight nodes, enabling examination of a richer space of dynamics and topologies.

We first analyse a network optimised with a genetic algorithm to yield high Φ [11]. The noise covariance matrix has ones on the diagonal and c everywhere else, and a is now a global factor applied to all edges of the network. The adjacency matrix is scaled such that its spectral radius equals 1 when a = 1. As in the previous section, we evaluate all measures for multiple values of a and c and show the results in Fig. 4.

Moving to a larger network mostly preserves the features highlighted above. TDMI is unaffected by noise correlation; Φ̃ again grows with c and diverges as c approaches 1; and Φ*, ψ and CD show the same trends as before, although the decrease with c is now less pronounced. Interestingly, Φ does not show the instability and negative values seen in Fig. 3. Overall, in this more complex network the effect of increasing noise correlation on Φ*, ψ and CD is not as pronounced as in simpler networks, where these measures decrease rapidly towards zero with increasing c.

Thus far we have studied the effect of AR dynamics on integrated information measures, keeping the topology of the network fixed and changing only global parameters. We next examine the effect of network topology on a set of six networks:

A

A fully connected network without self-loops.

B

The Φ-optimal binary network presented in [11].

C

The Φ-optimal weighted network presented in [11].

D

A bidirectional ring network.

E

A “small-world” network, formed by introducing two long-range connections to a bidirectional ring network.

F

A unidirectional ring network.

In each network the adjacency matrix has been normalised to a common fixed spectral radius below 1. As before, we simulate the system following Eq. (33), here setting noise input correlations to zero, so that the noise covariance matrix is simply the identity. Figure 5 shows connectivity diagrams of the networks for visual comparison, and Fig. 6 shows the values of all integrated information measures evaluated on all networks.
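For concreteness, the ring topologies (networks D and F) and the spectral-radius normalisation can be constructed as below. This is a Python sketch; the target radius of 0.9 is an illustrative stable value, since the text fixes only that it be below 1.

```python
import numpy as np

def ring(n, bidirectional=True):
    """Adjacency matrix of a ring of n nodes (networks D and F)."""
    A = np.zeros((n, n))
    for i in range(n):
        A[(i + 1) % n, i] = 1.0          # edge i -> i+1
        if bidirectional:
            A[i, (i + 1) % n] = 1.0      # edge i+1 -> i
    return A

def set_spectral_radius(A, rho=0.9):
    """Rescale A so that its spectral radius equals rho."""
    return rho * A / np.max(np.abs(np.linalg.eigvals(A)))

A_uni = set_spectral_radius(ring(8, bidirectional=False))  # network F
A_bi = set_spectral_radius(ring(8, bidirectional=True))    # network D
```

Note that the unnormalised unidirectional ring is a permutation matrix (spectral radius 1), while the bidirectional ring has spectral radius 2, so the rescaling makes their overall coupling strengths directly comparable.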

As before, there is substantial variability in the behaviour of the measures, but some general patterns are apparent. Intriguingly, the unidirectional ring network F is consistently judged by almost all of the measures as the most complex, followed in most cases by the weighted Φ-optimal network C. At the other end of the spectrum, the fully connected network A is consistently judged the least complex, which is explained by the large correlation between its nodes, as shown by r̄.

The results here can be summarised by comparing the relative complexity assigned to the networks by each measure – that is, by asking to what extent the measures agree on which network is more complex than which. For convenience, the measure-dependent ranking of network complexity is shown in Table 3.

Inspecting this table reveals a remarkable alignment between TDMI and most of the integrated information measures, especially given how much their behaviour diverges when varying a and c. Although the particular values differ, the measures largely agree on the ranking of the networks by integrated information. This consistency of ranking is initially encouraging with regard to empirical application. However, the ranking is not what might be expected from topological complexity measures from network theory: if we ranked these networks by, e.g., small-world index, we would expect networks B, C, and E at the top and networks A, D, and F at the bottom – very different from any of the rankings in Table 3. In fact, the Spearman correlation between the ranking by small-world index and the rankings produced by the measures is strongly negative, leading to the counterintuitive conclusion that more complex networks in fact integrate less information. We note that these rankings are very robust to noise correlation (results not shown) for all measures except Φ. Indeed, across all simulations in this study the behaviour of Φ is erratic, undermining prospects for empirical application. (This behaviour is even more prevalent if Φ is optimised over all bipartitions, as opposed to over even bipartitions only.)

### 4.4 Random networks

We next perform a more general analysis of the behaviour of the measures of integrated information, using Erdős-Rényi random networks. These are parametrised by two numbers, both in the interval [0, 1]: the edge density γ of the network, and the noise correlation c (defined as above). To sample a network with a given γ, we generate a matrix in which each possible edge is present with probability γ, and then remove self-loops. The stochasticity in the construction of the Erdős-Rényi network induces fluctuations in the integrated information measures, so for each (γ, c) pair we calculate the mean and variance of each measure.

First, we generate 50 networks for each combination of edge density and noise correlation, and take the mean of each integrated information measure evaluated over those 50 networks. As before, the adjacency matrices are normalised to a fixed spectral radius below 1. Results are shown in Fig. 7.
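The sampling scheme can be sketched as follows (Python; the function and variable names are ours). The noise covariance with uniform pairwise correlation c is positive semi-definite for any c in [0, 1], so the whole parameter plane is admissible.

```python
import numpy as np

rng = np.random.default_rng(42)

def erdos_renyi_coupling(n, density, rho=0.9):
    """Directed Erdos-Renyi coupling matrix: each possible edge is present with
    probability `density`, self-loops are removed, and the result is rescaled
    to spectral radius rho (an illustrative stable value)."""
    A = (rng.random((n, n)) < density).astype(float)
    np.fill_diagonal(A, 0.0)
    r = np.max(np.abs(np.linalg.eigvals(A)))
    return rho * A / r if r > 0 else A      # acyclic draws have radius 0

def uniform_noise_cov(n, c):
    """Unit-variance noise covariance with correlation c between every pair."""
    return (1.0 - c) * np.eye(n) + c * np.ones((n, n))

A = erdos_renyi_coupling(8, density=0.5)
Seps = uniform_noise_cov(8, c=0.6)
```

Averaging a measure over many such draws at fixed (density, c), as described above, then gives the mean surfaces plotted in Fig. 7.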

Φ̃ increases markedly with noise correlation and moderately with edge density, and r̄ increases sharply with both. The remaining measures divide into two groups: Φ*, ψ and CD decrease with noise correlation, while TDMI, Φ and ΦG increase with it. Notably, all integrated information measures except Φ̃ show a band of high value at an intermediate edge density, demonstrating their sensitivity to the level of integration. The decrease when the density is increased beyond a certain point is due to the weakening of the individual connections (a consequence of the fixed overall coupling strength, as quantified by the spectral radius).

Secondly, in Fig. 8 we plot each measure against the average correlation r̄ of each network, following the rationale that a good complexity index should peak at an intermediate value of r̄ – i.e. it should reach its maximum in the middle range of r̄. To obtain this figure we sampled a large number of Erdős-Rényi networks with random edge density and noise correlation, and evaluated all the integrated information measures, as well as the average correlation r̄.

Fig. 8 shows that some of the measures do have this intermediate peak, in particular Φ*, ψ and CD. Although it also shows a modest intermediate peak, Φ̃ has a stronger overall positive trend with r̄, while Φ has an overall negative trend. These analyses further support Φ*, ψ and CD as valid complexity measures, although the relation between them remains unclear and is not always consistent across scenarios.

One might worry that these peaks could be due to biased sampling of the r̄ axis – if our sampling scheme obtained many more samples in some range of r̄, then the high values we see in that range could be explained by the high-value tails of the distribution being sampled better there than elsewhere along the axis. However, the histogram at the bottom of Fig. 8 shows this is not the case – on the contrary, the samples are spread relatively uniformly along the r̄ axis. Therefore, the peaks shown by Φ*, ψ and CD are not sampling artefacts.

## 5 Discussion

In this study we compared several candidate measures of integrated information in terms of their theoretical construction and their behaviour when applied to the dynamics generated by a range of non-trivial network architectures. We found that no two measures had precisely the same basic mathematical properties (see Table 2). Empirically, we found striking variability in behaviour among the measures, even for simple systems (see Table 4 for a summary). Of the measures we have considered, Φ*, ψ and CD best capture conjoined segregation and integration on small networks, when animated with Gaussian linear AR dynamics (Fig. 2). These measures decrease with increasing noise input correlation and increase with increasing coupling strength (Fig. 4). Further, on random networks with fixed overall coupling strength (as quantified by spectral radius), they achieve their highest scores when an intermediate number of connections is present (Fig. 7). They also obtain their highest scores when the average correlation between components takes an intermediate value (Fig. 8).

In terms of network topology, none of the measures strongly reflects the complexity of the network structure in a graph-theoretic sense. At fixed overall coupling strength, a simple ring structure (Fig. 5) leads in most cases to the highest scores. Among the other measures: Φ̃ is largely determined by the level of correlation amongst the noise inputs, and is not very sensitive to changes in coupling strength; ΦG depends mainly on the overall coupling strength, and is not very sensitive to changes in noise input correlation; and Φ generally behaves erratically.

Considered together, our results motivate the continued development of Φ*, ψ and CD as theoretically sound and empirically adequate measures of integrated information.

### 5.1 Partition selection

Integrated information is typically defined as the effective information beyond the minimum information partition [5, 44]. However, when a particular measure of integrated information is first introduced, it is often accompanied by a new operationalisation of both effective information and the minimum information partition. In this paper we have restricted attention to comparing different choices of measure of effective information, while keeping the same partition selection scheme across all measures. Specifically, we restricted the partition search to even-sized bipartitions, which has the advantage of obviating the need to introduce a normalisation factor when comparing bipartitions of different sizes (see Section 3.2). For uneven partitions, normalisation factors are required to compensate for the fact that there is less capacity for information sharing than with even partitions. However, such factors are known to introduce instabilities, both under continuous parameter changes and in terms of numerical errors [11]. Further research is needed to compare different approaches to defining the minimum information partition, or to finding an approximation to it in reasonable computation time [42].

In terms of computation time, performing the most thorough search, through all partitions, as in the early formulation of Φ by Balduzzi and Tononi [5], requires a number of evaluations given by the Bell number Bn, which grows super-exponentially with the system size n. Restricting attention to bipartitions reduces this to 2^(n−1) − 1 evaluations, whilst restricting to even-sized bipartitions reduces this further to C(n, n/2)/2. These observations highlight a trade-off between computation time and comprehensive consideration of possible partitions. Future comparisons of integrated information measures may benefit from more advanced methods for searching among a restricted set of partitions to obtain a good approximation to the minimum information partition. For example, Toker and Sommer use graph modularity, stochastic block models, or spectral clustering as informed heuristics to suggest a small number of partitions likely to be close to the MIP, and then take the minimum over those. With these approximations they are able to calculate the MIP of networks with hundreds of nodes [42, 43]. Alternatively, Hidaka and Oizumi make use of the submodularity of mutual information to perform efficient optimisation and find the bipartition across which there is the least instantaneous mutual information in the system [21]. Presently, however, their method is valid only for instantaneous mutual information, and is therefore not applicable to finding the bipartition that minimises any form of normalised effective information as described in Section 3.2.
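The partition counts behind these scalings are easy to check for the eight-node systems used here (a Python sketch; the Bell numbers are computed from the standard recurrence B(m+1) = Σ_k C(m, k) B(k)):

```python
from math import comb

def bell(n):
    """Bell number B_n: the number of partitions of an n-element set."""
    B = [1]                                   # B_0
    while len(B) <= n:
        m = len(B) - 1
        B.append(sum(comb(m, k) * B[k] for k in range(m + 1)))
    return B[n]

n = 8
all_partitions = bell(n)                      # 4140 for n = 8
bipartitions = 2 ** (n - 1) - 1               # unordered, both parts non-empty: 127
even_bipartitions = comb(n, n // 2) // 2      # both parts of size n/2: 35
```

So for n = 8 the even-bipartition restriction reduces the search from 4140 candidate partitions to 35, which is what makes the exhaustive minimisation used in this paper tractable.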

Further, each measure carries special considerations regarding the partition search. For example, for Φ̃, taking the minimum across all partitions is equivalent to taking it across bipartitions only, thanks to the properties of the measure [49, 9, 38]. Arsiwalla and Verschure [3] used Φ̃, and suggested always using the atomic partition on the basis that it is fast, well-defined, and, for Φ̃ specifically, provably the partition of maximum information; it thus provides a quickly computable upper bound for the measure.

### 5.2 Continuous variables and the linear Gaussian assumption

We have compared the various integrated information measures only on systems whose states are given by continuous variables with a Gaussian distribution. This choice is motivated by the fact that, in many domains of potential application, measurement variables are best characterised as continuous. Future research should continue the comparison of these measures on a test-bed of systems with discrete variables. Moreover, non-Gaussian continuous systems should also be considered, because the Gaussian approximation is not always a good fit to real data. For example, the spiking activity of populations of neurons typically exhibits exponentially distributed dynamics [17]. Systems with discrete variables are in principle straightforward to deal with, since calculating probabilities (following the most brute-force approach) amounts simply to counting occurrences of states. General continuous systems, however, are less straightforward: estimating generic probability densities in a continuous domain is challenging, and calculating information-theoretic quantities on them is difficult [25, 46]. The AR systems we have studied here are a rare exception, in the sense that their probability density can be calculated and all relevant information-theoretic quantities have analytical expressions. Nevertheless, the Gaussian assumption is common in biology, and knowing how these measures behave on Gaussian systems will inform their further development and motivate their broader application.

### 5.3 Empirical as opposed to maximum entropy distribution

We have considered versions of each measure that quantify information with respect to the empirical, or spontaneous, stationary distribution for the state of the system. This constitutes a significant divergence from the supposedly fundamental measures of intrinsic integrated information of IIT versions 2 and 3 [5, 34]. Those measures are based on information gained about a hypothetical past moment in which the system was equally likely to be in any one of its possible states (the ‘maximum entropy’ distribution). However, as pointed out previously [11], it is not possible to extend those measures, developed for discrete Markovian systems, to continuous systems. This is because there is no uniquely defined maximum entropy distribution for a continuous random variable (unless it has hard-bounds, i.e. a closed and bounded set of states). Hence, quantification of information with respect to the empirical distribution is the pragmatic choice for construction of an integrated information measure applicable to continuous time-series data.

The consideration of information with respect to the empirical, as opposed to maximum entropy, distribution does however have an effect on the concept underlying the measure of integrated information – it results in a measure not of mechanism, but of dynamics [10]. That is, what is measured is not information about what the possible mechanistic causes of the current state could be, but rather what the likely preceding states actually are, on average, statistically; see [11] for further discussion. Given the diversity of behaviour of the various integrated information measures considered here even on small networks with linear dynamics, one must remain cautious about considering them as generalisations or approximations of the proposed ‘fundamental’ measures of IIT versions 2 or 3 [5, 34].

A remaining important challenge, in many practical scenarios, is the identification of stationary epochs. For a relatively long data segment, it can be unrealistic to assume that all the statistics are constant throughout. For shorter data segments, one cannot be confident that the system has explored all the states that it potentially would have, given enough time.

## 6 Final remarks

The further development and empirical application of Integrated Information Theory require a satisfactory informational measure of dynamical complexity. In recent years several measures have been proposed, but their behaviour in any but the simplest cases has not been extensively characterised or compared. In this study, we have reviewed several candidate measures of integrated information and provided a comparative analysis on simulated data, generated by simple Gaussian dynamics applied to a range of network topologies.

Assessing the degree of dynamical complexity, integrated information, or co-existing integration and segregation exhibited by a system remains an important outstanding challenge. Progress meeting this challenge will have implications not only for theories of consciousness, such as Integrated Information Theory, but more generally in situations where relations between local and global dynamics are of interest. The review presented here identifies promising theoretical approaches for designing adequate measures of integrated information. Further, our simulations demonstrate the need for empirical investigation of such measures, since measures that share similar theoretical properties can behave in substantially different ways, even on simple systems.

## Acknowledgements

The authors would like to thank Michael Schartner for advice. ABB is funded by EPSRC grant EP/L005131/1. ABB and AKS are grateful to the Dr. Mortimer and Theresa Sackler Foundation, which supports the Sackler Centre for Consciousness Science. AKS is additionally grateful to the CIFAR Azrieli programme on Mind, Brain, and Consciousness.

## Appendix A Derivation and concavity proof of I∗

### a.1 Derivation of I∗ in Gaussian systems

Here we provide a closed-form expression for the mismatched decoding information I∗ in a Gaussian dynamical system; see Section 3.6 for more information. For clarity, we omit the arguments of Ĩ and write it as a function of β only. The formula for Ĩ(β) for a stationary continuous random process is

$$\tilde{I}(\beta) = -\int \mathrm{d}x\, p(x) \log \int \mathrm{d}\tilde{x}\, p(\tilde{x})\, q(x|\tilde{x})^{\beta} + \int \mathrm{d}\tilde{x} \int \mathrm{d}x\, p(x,\tilde{x}) \log q(x|\tilde{x})^{\beta}, \qquad (39)$$

where p(x) is the distribution of x, p(x, x̃) is the joint distribution of (x, x̃), and q(x|x̃) is the conditional distribution of x given x̃ under the partitioning in question. The function Ĩ also depends on X, τ and the partition, but for the sake of clarity we omit all arguments except β, which is the parameter of interest here. When X is Gaussian with covariance matrix ΣX (and mean 0, without loss of generality), we have

$$p(x) = (2\pi)^{-n/2}\, |\Sigma_X|^{-1/2} \exp\left[-\tfrac{1}{2}\psi\big(x, \Sigma_X^{-1}\big)\right], \qquad (40)$$

where we define

$$\psi(x, M) := x^{\mathsf{T}} M x \qquad (41)$$

for a vector x and a matrix M. Further,

 q(x|~x) = (2π)−n/2|ΠX|~X|−1/2exp[−12ψ(x−ΠX~XΠ−1X~x,Π