The Design of Global Correlation Quantifiers and Continuous Notions of Statistical Sufficiency


Using first principles from inference, we design a set of functionals for the purpose of ranking joint probability distributions with respect to their correlations. Starting with a general functional, we impose its desired behaviour through the Principle of Constant Correlations (PCC), which constrains the correlation functional to behave in a consistent way under statistically independent inferential transformations. The PCC guides us in choosing the appropriate design criteria for constructing the desired functionals. Since the derivations depend on a choice of partitioning the variable space into $N$ disjoint subspaces, the general functional we design is the $N$-partite information (NPI), of which the total correlation and mutual information are special cases. Thus, these functionals are found to be uniquely capable of determining whether a certain class of inferential transformations preserve, destroy or create correlations. This provides conceptual clarity by ruling out other possible global correlation quantifiers. Finally, the derivation and results allow us to quantify non-binary notions of statistical sufficiency. Our results express what percentage of the correlations are preserved under a given inferential transformation or variable mapping.

1 Introduction

The goal of this paper is to quantify the notion of global correlations as it pertains to inductive inference. This is achieved by designing a set of functionals from first principles to rank entire probability distributions according to their correlations. Because correlations are relationships defined between different subspaces of propositions (variables), the ranking of any distribution, and hence the type of correlation functional one arrives at, depends on the particular choice of “split” or partitioning of the variable space. Each choice of “split” produces a unique functional for quantifying global correlations, which we call the $N$-partite information (NPI).

The term correlation may be defined colloquially as being a relation between two or more “things”. While we have a sense of what correlations are, how do we quantify this notion more precisely? If correlations have to do with “things” in the real world, are correlations themselves “real?” Can correlations be “physical?” One is forced to address similar questions in the context of designing the relative entropy as a tool for updating probability distributions in the presence of new information (e.g. “What is information?”) [8]. In the context of inference, correlations are broadly defined as being statistical relationships between propositions. In this paper we adopt the view that whatever correlations may be, their effect is to influence our beliefs about the natural world. Thus, the natural setting for the discussion is that of inductive inference.

When one has incomplete information, the tools one must use for reasoning objectively are probabilities [12, 8]. The relationships between different propositions $x$ and $y$ are quantified by a joint probability density, $p(x,y)$, where the conditional distribution $p(y|x)$ quantifies what one should believe about $y$ given information about $x$, and vice-versa for $p(x|y)$. Intuitively, correlations should have something to do with these conditional dependencies.

In this paper, we seek to quantify a global amount of correlation for an entire probability distribution. That is, we desire a scalar functional for the purpose of ranking distributions according to their correlations. Such functionals are not unique since many examples, e.g. covariance, correlation coefficient [28], distance correlation [1], mutual information [11], total correlation [47], etc., measure correlations in different ways. What we desire is a principled approach to designing a family of measures according to specific design criteria [33, 34, 6].

The idea of designing a functional for ranking probability distributions was first discussed in Skilling [34]. In his paper, Skilling designs the relative entropy as a tool for ranking posterior distributions with respect to a prior in the presence of new information that comes in the form of constraints. The ability of the relative entropy to provide a ranking of posterior distributions allows one to choose the posterior that is closest to the prior while still incorporating the new information that is provided by the constraints. Thus, one can choose to update the prior in the most minimalist way possible. This feature is part of the overall objectivity that is incorporated into the design of relative entropy and in later versions is stated as the guiding principle [7, 41, 42].

As with relative entropy, we desire a method for ranking joint distributions with respect to their correlations. Whatever value our desired quantifier assigns to a particular distribution, we expect that if we change the distribution, the quantifier also changes, and that this change reflects the change in the correlations, i.e. if the distribution changes in a way that increases the correlations, then the quantifier should also increase. Thus, our quantifier should be an increasing functional of the correlations, i.e. it should provide a ranking of distributions.

The type of correlation functional one arrives at depends on a choice of the splits within the proposition space, and thus the functional we seek depends on both the distribution and the split. For example, if one has a proposition space consisting of $N$ variables, then one must specify which correlations the functional should quantify. Do we wish to quantify how one variable is correlated with the other $N-1$ variables? Or do we want to study the correlations between all of the variables? In our design derivation, these two questions represent the extremal cases of the family of quantifiers, the former being a bi-partite correlation (or mutual information) functional and the latter being a total correlation functional.

In the main design derivation we will focus on the case of total correlation, which is designed to quantify the correlations between every variable subspace in a set of variables. We suggest a set of design criteria (DC) for the purpose of designing such a tool. These DC are guided by the Principle of Constant Correlations (PCC), which states that the amount of correlations in a distribution should not change unless required by the transformation. This implies that our design derivation requires us to study equivalence classes of distributions within statistical manifolds under the various transformations of distributions that are typically performed in inference tasks. We will find, according to our design criteria, that the global quantifier of correlations we desire in this special case is equivalent to the total correlation (TC) [47].

Once one arrives at the TC as the solution to the design problem in this article, one can then form special cases such as the mutual information [11] or, as we will call it, any $N$-partite information (NPI), which measures the correlations shared between generic $N$-partitions of the proposition space. The NPI can also be derived using the same principles as the TC except with one modification, as we will discuss in Section V.

The special case of NPI when $N = 2$ is the bipartite (or mutual) information, which quantifies the amount of correlations present between two subsets of some proposition space. Mutual information (MI) as a measure of correlation has a long history, beginning with Shannon’s seminal work on communication theory [32] in which he first defines it. While Shannon provided arguments for the functional form of his entropy [32], he did not provide a derivation of MI. To date, there has been no principled approach to the design of MI or of the TC.

The idea of designing a tool for the purpose of inference and information theory is not new. Beginning in [12], Cox showed that probabilities are the functions that are designed to quantify “reasonable expectation” [13], a notion that Jaynes [20] and Caticha [6] have since refined as “degrees of rational belief”. Inspired by the method of maximum entropy [20, 21, 22], there have been many improvements on the derivation of entropy as a tool designed for the purpose of updating probability distributions in the decades since Shannon [32]. Most notable are those by Shore and Johnson [33], Skilling [34], Caticha [7], and Vanslette [41, 42]. The entropy functionals in [7, 41, 42] are designed to follow the Principle of Minimal Updating (PMU), which states, for the purpose of enforcing objectivity, that “a probability distribution should only be updated to the extent required by the new information.” In these articles, information is defined operationally as that which induces the updating of the probability distributions.

An important consequence of deriving the various NPI as tools for ranking is their immediate application to the notion of statistical sufficiency. Sufficiency is a concept that dates back to Fisher, and some would argue Laplace [35], both of whom were interested in finding statistics that contained all relevant information about a sample. Such statistics are called sufficient; however, this notion is only a binary label, so it does not quantify an amount of sufficiency. Using the result of our design derivation, we can propose a new definition of sufficiency in terms of a normalized NPI. Such a quantity gives a sense of how close a set of functions is to being sufficient statistics. This topic will be discussed in Section VI.

In Section II we will lay out some mathematical preliminaries and discuss the general transformations in statistical manifolds we are interested in. Then in Section III, we will state and discuss the design criteria used to derive the functional form of TC and the NPI in general. In Section IV we will complete the proof of the results from Section III. In Section V we discuss the $N$-partite (NPI) special cases of TC, of which the bipartite case is the mutual information. In Section VI we will discuss sufficiency and its relation to the Neyman-Pearson lemma [26]. It should be noted that throughout this article we will be using a probabilistic framework, in which variables denote propositions of a probability distribution, rather than a statistical framework, in which they denote random numbers.

2 Mathematical preliminaries

The arena of any inference task consists of two ingredients, the first of which is the subject matter, or what is often called the universe of discourse. This refers to the actual propositions that one is interested in making inferences about. Propositions tend to come in two classes, either discrete or continuous. Discrete proposition spaces will be denoted by calligraphic uppercase Latin letters, and the individual propositions by indexed lowercase Latin letters, where the index runs over the distinct propositions in the space. In this paper we will mostly work in the context of continuous propositions, whose spaces will be denoted by bold-faced uppercase Latin letters and whose elements will simply be lowercase Latin letters with no indices. Continuous proposition spaces have a much richer structure than discrete spaces and help to generalize concepts such as relative entropy and information geometry [6]1.
The second ingredient that one needs to define for general inference tasks is the space of models, or the space of probability distributions which one wishes to assign to the underlying proposition space. These spaces can often be given the structure of a manifold, which in the literature is called a statistical manifold [6]. A statistical manifold is a manifold in which each point is an entire probability distribution, i.e. it is a space of maps from subsets of the proposition space to the interval $[0,1]$. The collection of all such subsets is the power set of the proposition space, which for a discrete space with $n$ propositions has cardinality $2^n$.
In the simplest cases, when the underlying propositions are discrete, the manifold is finite dimensional. A common example that is used in the literature is the three-sided die, whose distribution is determined by three probability values. Due to positivity and the normalization constraint, the point representing the distribution lives in the 2-simplex. Likewise, a generic discrete statistical manifold with $n$ possible states is an $(n-1)$-simplex. In the continuum limit, which is often the case explored in physics, the statistical manifold becomes infinite dimensional and is defined as2,


When the statistical manifold is parameterized by the densities , the zeroes always lie on the boundary of the simplex. In this representation the statistical manifolds have a trivial topology; they are all simply connected. Without loss of generality, we assume that the statistical manifolds we are interested in can be represented as (1), so that is simply connected and does not contain any holes. The space in this representation is also smooth.
A point in the statistical manifold defines what we call a state of knowledge about the underlying propositions. It is, in essence, the quantification of our degrees of belief about each of the possible propositions [12]. The correlations present in any distribution necessarily depend on the conditional relationships between various propositions. For instance, consider the binary case of just two proposition spaces, so that the joint distribution factors,


The correlations present in the joint distribution will necessarily depend on the form of the two conditional distributions, since the conditional relationships tell us how one variable is statistically dependent on the other. As we will see, the correlations defined in the above equation are quantified by the mutual information. For situations with many variables, however, the global correlations are defined by the total correlation, which we will design first. All other measures which break up the joint space into conditional distributions (including (2)) are special cases of the total correlation.
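To make the two-variable case concrete, the following is a minimal sketch of the standard mutual information for a discrete joint distribution. The dictionary representation and the two example distributions are assumptions chosen for illustration; the paper itself works mainly with continuous densities.

```python
import math

def mutual_information(joint):
    """Mutual information (in nats) of a discrete joint given as {(x, y): p}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p   # marginal of the first variable
        py[y] = py.get(y, 0.0) + p   # marginal of the second variable
    return sum(p * math.log(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# Hypothetical joints over two binary propositions.
independent = {(x, y): p_x * p_y
               for x, p_x in ((0, 0.3), (1, 0.7))
               for y, p_y in ((0, 0.6), (1, 0.4))}
correlated = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

mi_independent = mutual_information(independent)  # vanishes up to rounding
mi_correlated = mutual_information(correlated)    # strictly positive
```

When the conditional distributions carry no dependence on the other variable, the joint is a product of its marginals and the quantity vanishes; any conditional dependence makes it positive.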

2.1 Some classes of inferential transformations

There are four main types of transformations we will consider that one can enact on a state of knowledge. They are: coordinate transformations, entropic updating3, marginalization, and products. This set of transformations is not necessarily exhaustive, but is sufficient for our discussion in this paper. We will indicate whether each of these types of transformations can change the amount of global correlations by evaluating the response of the statistical manifold under these transformations. Our inability to describe how much the amount of correlations changes under these transformations motivates the design of such an objective global quantifier.

The types of transformations we will explore can be identified either with maps from a particular statistical manifold to itself, (type I), to a subset of the original manifold (type II), or from one statistical manifold to another, (type III and IV).

Type I: Coordinate transformations

Type I transformations are coordinate transformations. A coordinate transformation , is a special type of transformation of the proposition space that respects certain properties. It is essentially a continuous version of a reparameterization4. For one, each proposition must be identified with one and only one proposition and vice versa. This means that coordinate transformations must be bijections on proposition space. The reason for this is simply by design, i.e. we would like to study the transformations that leave the proposition space invariant. A general transformation of type I on which takes to , is met with the following transformation of the densities,


As already mentioned, the coordinate transforming function must be a bijection in order for (3) to hold, i.e. the map must have an inverse. While the densities before and after the transformation are not necessarily equal, the probabilities defined in (3) must be (according to the rules of probability theory, see Appendix B). This indicates that the distribution is in the same location in the statistical manifold. That is, the global state of knowledge has not changed; what has changed is the way in which the local information has been expressed, which must be invertible in general.
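As a numerical illustration of this point, the sketch below checks that a bijective change of coordinates rescales the density by the Jacobian factor while leaving the probability of a region unchanged. The unit-exponential density and the map $y = x^2$ on $x > 0$ are hypothetical choices, not taken from the paper.

```python
import math

def integrate(f, a, b, n=20000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

def p(x):
    """Hypothetical density on x > 0 (unit exponential)."""
    return math.exp(-x)

def p_prime(y):
    """Density in the new coordinate y = x**2: p(x) divided by |dy/dx| = 2x."""
    x = math.sqrt(y)
    return p(x) / (2.0 * x)

prob_x = integrate(p, 1.0, 2.0)         # probability of 1 <= x <= 2
prob_y = integrate(p_prime, 1.0, 4.0)   # the same region expressed in y
density_ratio = p_prime(4.0) / p(2.0)   # densities differ by the Jacobian factor
```

The two probabilities agree to numerical precision even though the densities at corresponding points differ by the factor $1/(2x)$.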
For a coordinate transformation (3) involving two variables, type I transformations also give,


A few general properties of these type I transformations are as follows: First, the density is expressed in terms of the density ,


where is the Jacobian [6] that defines the transformation,


For a finite number of variables , the general type I transformations are written,


and the Jacobian becomes,


One can also express the density in terms of the original density by using the inverse transform,


Split Invariant Coordinate Transformations Consider a class of coordinate transformations that result in a diagonal Jacobian matrix, i.e.,


These transformations act within each of the variable spaces independently, and hence they are guaranteed to preserve the definition of the split between any $N$-partitions of the propositions. Because they are coordinate transformations, they are also invertible and do not change our state of knowledge. We call such special types of transformations (10) split invariant coordinate transformations. The marginal distributions are preserved under split invariant coordinate transformations,


If one allows generic coordinate transformations of the joint space, then the marginal distributions may depend on variables outside of their original split. Thus, if one redefines the split after a coordinate transformation to new variables , the original problem statement changes as to what variables we are considering correlations between and thus eq. (11) no longer holds. This is apparent in the case of two variables , where , since,


which depends on . Redefining the split after this coordinate transformation breaks the original independence since a distribution which originally factors, , would be made to have conditional dependence in the new coordinates, i.e. if and , then,


So, even though the above transformation satisfies (3), this type of transformation may change the correlations in by allowing for the potential redefinition of the split . Hence, when designing our functional, we identify split invariant coordinate transformations as those which preserve correlations. These restricted coordinate transformations help isolate a single functional form for our global correlation quantifier.
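A discrete analogue may help fix ideas. In the sketch below, relabelling each variable separately plays the role of a split invariant transformation, while a bijection of the joint space that mixes the two variables, followed by redefining the split in the new labels, creates dependence. The distributions and maps are hypothetical, and the standard mutual information is used only as a diagnostic; the paper's argument concerns continuous variables and Jacobians.

```python
import math

def mutual_information(joint):
    """Mutual information (in nats) of a discrete joint given as {(a, b): p}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * math.log(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

def push_forward(joint, f):
    """Transport a joint distribution through a map f(a, b) -> (a', b')."""
    out = {}
    for (a, b), p in joint.items():
        key = f(a, b)
        out[key] = out.get(key, 0.0) + p
    return out

# Two independent, non-uniform bits (hypothetical example).
independent = {(a, b): p_a * p_b
               for a, p_a in ((0, 0.7), (1, 0.3))
               for b, p_b in ((0, 0.7), (1, 0.3))}
correlated = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# Acts on the first variable alone: analogue of a split invariant map.
relabelled = push_forward(correlated, lambda a, b: (1 - a, b))
# Bijection of the joint space that mixes the variables, split redefined afterwards.
mixed = push_forward(independent, lambda a, b: (a, a ^ b))

mi_before = mutual_information(correlated)
mi_relabelled = mutual_information(relabelled)
mi_mixed = mutual_information(mixed)
```

Here `mi_relabelled` equals `mi_before`, whereas `mi_mixed` is positive even though the original bits were independent.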

Type II: Entropic updating

Type II transformations are those induced by updating [6], in which one maximizes the relative entropy,


subject to constraints and relative to the prior, . Constraints often come in the form of expectation values [21, 22, 20, 6],


A special case of these transformations is Bayes’ rule [10, 17],


In (14) and throughout the rest of the paper we will use base $e$ (natural log) for all logarithms, although the results are perfectly well defined for any base (the resulting quantities will simply differ by an overall scale factor when using different bases). Maximizing (14) with respect to constraints such as (15) induces a jump in the statistical manifold.
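For a discrete proposition space and a single expectation-value constraint, this maximization has the well-known exponential-family solution $p_i \propto q_i \, e^{\lambda f_i}$, with the multiplier $\lambda$ fixed by the constraint. The sketch below solves for $\lambda$ by bisection; the six-sided die, the uniform prior, the target mean of 4.5, and the function name `maxent_update` are assumptions made for illustration.

```python
import math

def maxent_update(prior, f, target, lo=-50.0, hi=50.0, iters=200):
    """Posterior p_i proportional to q_i * exp(lam * f_i) with <f> = target."""
    def posterior(lam):
        weights = [q * math.exp(lam * fi) for q, fi in zip(prior, f)]
        z = sum(weights)
        return [w / z for w in weights]

    def mean(lam):
        return sum(p * fi for p, fi in zip(posterior(lam), f))

    # Bisection on the multiplier: the constrained mean is increasing in lam.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean(mid) < target:
            lo = mid
        else:
            hi = mid
    return posterior(0.5 * (lo + hi))

faces = [1, 2, 3, 4, 5, 6]      # hypothetical six-sided die
prior = [1.0 / 6.0] * 6         # uniform prior
posterior = maxent_update(prior, faces, target=4.5)
posterior_mean = sum(p * x for p, x in zip(posterior, faces))
```

The updated distribution differs from the prior, which is the jump in the statistical manifold described above.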

Type II transformations, while well defined, are not necessarily continuous, since in general one can map nearby points to disjoint subsets of the manifold. Type II transformations will also change the distribution in general as it jumps within the statistical manifold. This means, because different distributions may have different correlations, that type II transformations can increase, decrease, or leave the correlations invariant.

Type III: Marginalization

Type III transformations are induced by marginalization,


which is effectively a quotienting of the statistical manifold, , i.e. for any point , we equivocate all values of . Since the distribution changes under type III transformations, , the amount of correlations can change.
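A small discrete example shows how marginalization can change the amount of correlations. The three-bit distribution below, in which the third bit is the XOR of two independent fair bits, is a hypothetical choice, and the standard total correlation is used as the diagnostic.

```python
import math

def marginal(joint, keep):
    """Marginal of a joint {(x_1, ..., x_N): p} over the indices in keep."""
    out = {}
    for xs, p in joint.items():
        key = tuple(xs[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

def total_correlation(joint):
    """Sum of p * log(p / product of single-variable marginals), in nats."""
    n = len(next(iter(joint)))
    singles = [marginal(joint, (i,)) for i in range(n)]
    return sum(p * math.log(p / math.prod(singles[i][(xs[i],)] for i in range(n)))
               for xs, p in joint.items() if p > 0)

# Third bit is the XOR of two independent fair bits (hypothetical example).
joint3 = {(a, b, a ^ b): 0.25 for a in (0, 1) for b in (0, 1)}
joint2 = marginal(joint3, (0, 1))   # marginalize over the third variable

tc_before = total_correlation(joint3)
tc_after = total_correlation(joint2)
```

Before marginalization the total correlation is $\log 2$; afterwards the remaining two bits are independent and it vanishes.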

Type IV: Products

Type IV transformations are created by products,


which are a kind of inverse transformation of type III, i.e. the set of propositions becomes a product of proposition spaces. There are many different situations that can arise from this type, the most trivial one being an embedding,


which can be useful in many applications. The function in the above equation is the Dirac delta function [14] which has the following properties,


We will denote such a transformation as type IVa. Another trivial example of type IV is,


which we will call type IVb. Like type II, generic transformations of type IV can potentially create correlations, since again we are changing the underlying distribution.
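The embedding case can be sketched in a discrete setting, with a Kronecker delta standing in for the Dirac delta. The uniform distribution and the function `x % 2` are hypothetical. Because the new variable is fully determined by the old one, the standard mutual information between them equals the entropy of the new variable.

```python
import math

def entropy(dist):
    """Shannon entropy (in nats) of a discrete distribution {value: p}."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def embed(p_x, f):
    """Discrete analogue of p(x) -> p(x) * delta(theta - f(x))."""
    return {(x, f(x)): p for x, p in p_x.items()}

def mutual_information(joint):
    """Mutual information (in nats) of a discrete joint given as {(x, theta): p}."""
    px, pt = {}, {}
    for (x, t), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        pt[t] = pt.get(t, 0.0) + p
    return sum(p * math.log(p / (px[x] * pt[t]))
               for (x, t), p in joint.items() if p > 0)

p_x = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}   # hypothetical distribution
joint = embed(p_x, lambda x: x % 2)          # theta = parity of x

p_theta = {}
for (x, theta), p in joint.items():
    p_theta[theta] = p_theta.get(theta, 0.0) + p

mi = mutual_information(joint)
h_theta = entropy(p_theta)
```

The original distribution over a single variable had no split to speak of; after the embedding the joint carries $\log 2$ nats of correlation.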

Remarks on inferential transformations There are many practical applications in inference which make use of the above transformations by combining them in a particular order. For example, in machine learning and dimensionality reduction, the task is often to find a low-dimensional representation of some proposition space, which is done by combining types I, III and IVa in the order IVa, I, III. Neural networks are a prime example of this sequence of transformations [5]. Another example of IV, I, III transformations is the convolution of probability distributions, which takes two proposition spaces and combines them into a new one [11].

In Appendix D we discuss how our resulting design functionals behave under the aforementioned transformations.

3 Designing a global correlation quantifier

In this section we seek to achieve our design goal for the special case of the total correlation,

Design Goal: Given a space of variables and a statistical manifold , we seek to design a functional 5 which ranks distributions according to their total amount of correlations.  

Unlike deriving a functional, designing a functional is done through the process of eliminative induction. Derivations are simply a means of showing consistency with a proposed solution whereas design is much deeper. In designing a functional, the solution is not assumed but rather achieved by specifying design criteria that restrict the functional form in a way that leads to a unique or optimal solution. One can then interpret the solution in terms of the original design goal. Thus, by looking at the “nail”, we design a “hammer”, and conclude that hammers are designed to knock in and remove nails. We will show that there are several paths to the solution of our design criteria.

Our design goal requires that the functional be scalar valued so that we can rank the distributions according to their correlations. Considering a continuous space of variables, the general form of the functional is,


which depends on each of the possible probability values for every .

Given the types of transformations that may be enacted on , we state the main guiding principle we will use to meet our design goal,

Principle of Constant Correlations (PCC): The amount of correlations in should not change unless required by the transformation, .  

While simple, the PCC is incredibly constraining. By stating when one should not change the correlations, i.e. , it is operationally unique (i.e. that you don’t do it) rather than stating how one is required to change them, , of which there are infinitely many choices. The PCC therefore imposes an element of objectivity into . If we are able to complete our design goal, then we will be able to uniquely quantify how transformations of type I-IV affect the amount of correlations in .

The discussion of type I transformations indicates that split invariant coordinate transformations do not change the state of knowledge. This is because we want to maintain not only the relationship among the joint distribution (3), but also the relationships among the marginal spaces,


Only then are the relationships between the $N$-partitions guaranteed to remain fixed, and hence the distribution remains in the same location in the statistical manifold. When a coordinate transformation of this type is made, because it does not change the distribution, we are not explicitly required to change the amount of correlations, so by the PCC we impose that it does not change.

The PCC together with the design goal implies that,

Corollary 1 (Split Coordinate Invariance).

The coordinate systems within a particular split are no more informative about the amount of correlations than any other coordinate system for a given .

  This expression is somewhat analogous to the statement that “coordinates carry no information”, which is usually stated as a design criterion for relative entropy [33, 7, 34]6.

To specify the functional form of further, we will appeal to special cases in which it is apparent that the PCC should be imposed [34]. The first involves local, subdomain, transformations of . If a subdomain of is transformed then one may be required to change its amount of correlations by some specified amount. Through the PCC however, there is no explicit requirement to change the amount of correlations outside of this domain, hence we impose that those correlations outside are not changed. The second special case involves transformations of an independent subsystem. If a transformation is made on an independent subsystem then again by the PCC, because there is no explicit reason to change the amount of correlations in the other subsystem, we impose that they are not changed. We denote these two types of transformation independences as our two design criteria (DC).

Surprisingly, the PCC and the DC are enough to find a general form for the functional (up to an irrelevant scale constant). As we previously stated, the first design criterion concerns local changes in the probability distribution.

Design Criteria 1 (Locality).

Local transformations of contribute locally to the total amount of correlations.

  Essentially, if new information does not require us to change the correlations in a particular subdomain , then we don’t change the probabilities over that subdomain. While simple, this criterion is incredibly constraining and leads (22) to the functional form,


where is some undetermined function of the probabilities and possibly the coordinates. We have used to denote the measure for brevity. To constrain further, we first use the corollary of split coordinate invariance (1) among the subspaces and then apply special cases of particular coordinate transformations. This leads to the following functional form,


which demonstrates that the integrand is independent of the actual coordinates themselves. Like coordinate invariance, the axiom DC1 also appears in the design derivations of relative entropy [33, 7, 34, 41, 42]7.
This leaves the function in the integrand to be determined, which can be done by imposing an additional design criterion.  

Design Criteria 2 (Subsystem Independence).

Transformations of in one independent subsystem can only change the amount of correlations in that subsystem.

  The consequence of DC2 concerns independence among subspaces of . Given two subsystems which are independent, the joint distribution factors,


We will see that this leads to the global correlations being additive over each subsystem,


Like locality (DC1), the design criterion concerning subsystem independence appears in all four approaches to relative entropy [7, 34, 33, 41, 42]8; however, due to the difference in the design goal here, we end up imposing DC2 in a manner closer to that of [41, 42], as we do not explicitly have the Lagrange multiplier structure in our design space.
Imposing DC2 leads to the final functional form of ,


with being the split dependent marginals. This functional is what is typically referred to as the total correlation9 and is the unique result obtained from imposing the PCC and the corresponding design criteria.
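The additivity over independent subsystems can be checked numerically with the standard definition of the total correlation, the expectation of $\log$ of the joint over the product of single-variable marginals. The sketch below uses hypothetical discrete subsystems; the paper's functional is defined over continuous variables.

```python
import math
from itertools import product

def marginal(joint, keep):
    """Marginal of a joint {(x_1, ..., x_N): p} over the indices in keep."""
    out = {}
    for xs, p in joint.items():
        key = tuple(xs[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

def total_correlation(joint):
    """Sum of p * log(p / product of single-variable marginals), in nats."""
    n = len(next(iter(joint)))
    singles = [marginal(joint, (i,)) for i in range(n)]
    return sum(p * math.log(p / math.prod(singles[i][(xs[i],)] for i in range(n)))
               for xs, p in joint.items() if p > 0)

# Two hypothetical subsystems, each internally correlated.
sub_a = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
sub_b = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

# Independent subsystems: the four-variable joint is the product of the two.
joint = {xa + xb: pa * pb
         for (xa, pa), (xb, pb) in product(sub_a.items(), sub_b.items())}

tc_a = total_correlation(sub_a)
tc_b = total_correlation(sub_b)
tc_joint = total_correlation(joint)   # equals tc_a + tc_b up to rounding
```

Changing the probabilities in `sub_b` alone changes `tc_b` and `tc_joint` by the same amount and leaves `tc_a` untouched, which is the behaviour DC2 asks for.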
As was mentioned throughout, these results are usually implemented as design criteria for relative entropy as well. Shore and Johnson’s approach [33] presents four axioms, of which III and IV are subsystem and subset independence. Subset independence in their framework corresponds to eq. (24) and to the Locality axiom of Caticha [7]. It also appears as an axiom in the approaches by Skilling [34] and Vanslette [41, 42]. Subsystem independence is given by axiom three in Caticha’s work [7], axiom two in Vanslette’s [41, 42] and axiom three in Skilling’s [34]. While coordinate invariance was invoked in the approaches by Skilling, Shore and Johnson and Caticha, it was later found to be unnecessary in the work by Vanslette [41, 42] who only required two axioms. Likewise, we find that it is an obvious consequence of the PCC and does not need to be stated as a separate axiom in our derivation of the total correlation.

4 Proof of the main result

We will prove the results summarized in the previous section. Let a proposition of interest be represented by – an dimensional coordinate that lives somewhere in the discrete and fixed proposition space , with being the cardinality of (i.e. the number of possible combinations). The joint probability distribution at this generic location is and the entire distribution is the set of joint probabilities defined over the space , i.e., .

4.1 Locality - DC1

We begin by imposing DC1 on . Consider changes in induced by some transformation , where the change to the state of knowledge is,


for some arbitrary change in that is required by some new information. This implies that the global correlation function must also change according to (22),


where is the change to induced by (29). To impose DC1, consider that the new information requires us to change the distribution in one subdomain , , that may change the correlations, while leaving the probabilities in the complement domain fixed, .10 Let the subset of the propositions in be relabeled as . Then the variations in with respect to the changes of in the subdomain are,


for small changes . In general the derivative,


could potentially depend on the entire distribution . We impose DC1 by constraining (32) to only depend on the probabilities within the subdomain since the variation (32) should not cause changes to the amount of correlations in the complement , i.e.,


This condition must also hold for arbitrary choices of subdomains; thus, further imposing DC1 in the most restrictive case of local changes, in which the subdomain is a single proposition,


guarantees that it will hold in the general case. In this most restrictive case of local changes, the functional has vanishing mixed derivatives,


Integrating (34) leads to,


where are undetermined functions of the probabilities. As this functional is designed for ranking, nothing prevents us from setting the irrelevant constant to zero, which we do. Extending to the continuum, we find eq. (24),


where for brevity we have also condensed the notation for the continuous dimensional variables . It should be noted that has the capacity to express a large variety of potential measures of correlation including Pearson’s [28] and Szekely’s [1] correlation coefficients. Our new objective is to use eliminative induction until only a unique functional form for remains.
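The chain of steps in this subsection can be summarized schematically as follows. The notation is assumed for illustration, since the symbols are not fixed by the surrounding text: $p_i$ denotes the discrete joint probabilities, $I[\rho]$ the correlation functional, and $f_i$, $F_i$ the undetermined functions.

```latex
% Assumed notation: p_i are the discrete joint probabilities,
% I[\rho] is the correlation functional, f_i and F_i are undetermined functions.
\begin{align}
\frac{\partial^2 I[\rho]}{\partial p_j \, \partial p_i} &= 0 \quad (i \neq j)
  && \text{locality: } \partial I / \partial p_i \text{ depends on } p_i \text{ only} \\
\Rightarrow \quad \frac{\partial I[\rho]}{\partial p_i} &= f_i(p_i) \\
\Rightarrow \quad I[\rho] &= \sum_i F_i(p_i) + \text{const},
  \qquad F_i(p_i) = \int^{p_i} f_i(p) \, dp \\
\xrightarrow{\ \text{continuum}\ } \quad I[\rho] &= \int dx \, F\big(p(x), x\big)
\end{align}
```

The additive constant does not affect a ranking, which is why it can be set to zero.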

Split coordinate invariance – PCC

The PCC and the corollary (1) state that , and thus , should be independent of transformations that keep fixed. As discussed, split invariant coordinate transformations (10) satisfy this property. We will further restrict the functional so that it obeys these types of transformations.

We can always rewrite the expression (37) by introducing densities and so that,


Then, instead of dealing with the function directly, we can instead deal with a new definition ,


where is defined as,


Now we further restrict the functional form of by appealing to the PCC. Consider the functional under a split invariant coordinate transformation,


which amounts to sending to,


where is the Jacobian for the transformation from to . Consider the special case in which the Jacobian . Then due to the PCC we must have,


However this would suggest that since correlations could be changed under the influence of the new variables . Thus in order to maintain the global correlations the function must be independent of the coordinates,


To constrain the form of further, we can again appeal to split coordinate invariance but now with arbitrary Jacobian , which causes to transform as,


But this must hold for arbitrary split invariant coordinate transformations, for when the Jacobian factor . Hence, the function must also be independent of the second and third argument,


We then have that the split coordinate invariance suggested by the PCC together with DC1 gives,


This is similar to the steps found in the relative entropy derivation [33, 7], but differs from the steps in [41, 42].

– Design Goal and PCC

Split coordinate invariance, as realized in eq. (47), provides an even stronger restriction on which we can find by appealing to a special case. Since all distributions with the same correlations should have the same value of by the Design Goal and PCC, then all independent joint distributions will also have the same value, which by design takes a unique minimum value,


Inserting this into (47) we find,


But this expression must be independent of the underlying distribution , since all independent distributions, regardless of the joint space , must give the same value . Thus we conclude that the density must be the product marginal ,


so it is guaranteed that,


Thus, by design, expression (47) becomes (25),


4.2 Subsystem Independence – DC2

In the following subsections we will consider two approaches for imposing subsystem independence via the PCC and DC2. Both lead to identical functional expressions for . The analytic approach assumes the functional form of may be expressed as a Taylor series. The algebraic approach reaches the same conclusion without this assumption.

Analytical Approach

Let us assume that the function is analytic, so that it can be Taylor expanded. Since the argument, is defined over , we can consider the expansion over some open set of for any particular value as,


where are real coefficients. For in the neighborhood of , the series (53) converges to . The Taylor expansion of about when its propositions are nearly independent, i.e. , is




The 0th term is by definition of the design goal, which leaves,


where the in refers to .

Consider the independent subsystem special case in which is factorizable into , for all . We can represent with an analogous two-dimensional Taylor expansion in and , which is,


where the mixed derivative term is,


Since transformations of one independent subsystem, or , must leave the other invariant by the PCC and subsystem independence, then imposing that the mixed derivatives should necessarily be set to zero, , imposes DC2. This gives a functional equation for ,


Including from the and cases we have in total that,


To determine the solution of this equation we can appeal to the special case in which both subsystems are independent,