# Characterising Probability Distributions via Entropies

Satyajit Thakor, Terence Chan and Alex Grant
Indian Institute of Technology Mandi
University of South Australia
Myriota Pty Ltd
###### Abstract

Characterising the capacity region for a network can be extremely difficult, especially when the sources are dependent. Most existing computable outer bounds are relaxations of the Linear Programming bound. A main challenge in extending linear programming bounds to the case of correlated sources is the difficulty (or impossibility) of characterising arbitrary dependencies via entropy functions. This paper tackles the problem by addressing how to use entropy functions to characterise correlation among sources.

## I Introduction

This paper begins with a very simple and well-known result. Consider a binary random variable $X$ such that

$$p_X(0) = p \quad \text{and} \quad p_X(1) = 1 - p.$$

While the entropy of $X$ does not determine exactly what the probabilities of $X$ are, it essentially determines the probability distribution (up to relabelling). To be precise, let $0 \le q \le 1/2$ be such that $H(X) = h_b(q)$, where

$$h_b(q) \triangleq -q \log q - (1 - q) \log (1 - q).$$

Then either $p = q$ or $p = 1 - q$. Furthermore, the two possible distributions can be obtained from each other by renaming the random variable outcomes appropriately. In other words, there is a one-to-one correspondence between entropies and distributions (up to relabelling) when the random variable is binary.
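The correspondence above can be checked numerically. The following is a minimal sketch, not part of the original development: the function names `binary_entropy` and `invert_binary_entropy` are illustrative, entropies are measured in bits, and the inversion relies only on the fact that the binary entropy function is strictly increasing on the interval from 0 to 1/2.

```python
import math

def binary_entropy(q: float) -> float:
    """Binary entropy h_b(q) in bits, with the convention 0 log 0 = 0."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log2(q) - (1.0 - q) * math.log2(1.0 - q)

def invert_binary_entropy(h: float, tol: float = 1e-12) -> float:
    """Return the q in [0, 1/2] with h_b(q) = h, located by bisection."""
    if not 0.0 <= h <= 1.0:
        raise ValueError("entropy of a binary variable must lie in [0, 1] bits")
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if binary_entropy(mid) < h:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

p = 0.8
q = invert_binary_entropy(binary_entropy(p))
# The entropy pins down q (approximately 0.2 here), so p is either q or 1 - q.
print(round(q, 6), round(1 - q, 6))
```

The entropy alone cannot say which of the two outcomes carries the mass q; that ambiguity is exactly the relabelling mentioned above.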

The basic question now is: how accurately can entropies specify the distribution of random variables? When $X$ is not binary, the entropy alone is not sufficient to characterise the probability distribution of $X$. In , it was proved that if $X$ is a random scalar variable, its distribution can still be determined by using auxiliary random variables subject to an alphabet cardinality constraint. The results can also be extended to random vectors if the distribution is positive. However, the proposed approach cannot be generalised to the case when the distribution is not positive. In this paper, we take a different approach and generalise the result to arbitrary random vectors. Before we continue answering the question, we briefly describe an application (based on network coding problems) of characterising distributions (and correlations) among random variables by using entropies.

Let the directed acyclic graph serve as a simplified model of a communication network with error-free point-to-point communication links. Edges $e$ have finite capacity $C_e$. Let be an index set for a number of multicast sessions, and be the set of source random variables. These sources are available at the nodes identified by the mapping $a$ (a source may be available at multiple nodes). Similarly, each source may be demanded by multiple sink nodes, identified by the mapping $b$. For all assume that . Each edge $e$ carries a random variable $U_e$ which is a function of incident edge random variables and source random variables.

Sources are i.i.d. sequences Hence, each has the same joint distribution, and is independent across different . For notation simplicity, we will use to denote a generic copy of the sources at any particular time instance. However, within the same “time” instance , the random variables may be correlated. We assume that the distribution of is known.

Roughly speaking, a link capacity tuple is achievable if one can design a network coding solution to transmit the sources to their respective destinations such that 1) the probability of decoding error is vanishing (as goes to infinity), and 2) the number of bits transmitted on the link is at most . The set of all achievable link capacity tuples is denoted by .

###### Theorem 1 (Outer bound )

For a given network, consider the set of correlated sources with underlying probability distribution . Construct any auxiliary random variables by choosing a conditional probability distribution function . Let be the set of all link capacity tuples such that there exists a polymatroid satisfying the following constraints

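\begin{align}
{} &= 0 \tag{1}\\
h(U_e \mid X_s : a(s) \to e,\; U_f : f \to e) &= 0 \tag{2}\\
h(Y_s : u \in b(s) \mid X_{s'} : u \in a(s'),\; U_e : e \to u) &= 0 \tag{3}\\
C_e - h(U_e) &\ge 0 \tag{4}
\end{align}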

for all and . Then

$$R \subseteq R' \tag{5}$$

where the notation means is incident to and can be an edge or a node.

###### Remark 1

The region will depend on how we choose the auxiliary random variables . In the following, we give an example to illustrate this fact.

Consider the following network coding problem depicted in Figure 1, in which three correlated sources are available at node 1 and are demanded at nodes respectively. Here, are defined such that , and for some independent and uniformly distributed binary random variables . Furthermore, the edges from node to nodes have sufficient capacity to carry the random variable available at node 2.

We consider two outer bounds obtained from Theorem 1 for the above network coding problem. In the first scenario, we use no auxiliary random variables, while in the second scenario, we use three auxiliary random variables such that

$$K_0 = b_0, \quad K_1 = b_1, \quad K_2 = b_2.$$

Let be respectively the outer bounds for the two scenarios. Then is a proper subset of . In particular, the link capacity tuple is in the region . This example shows that by properly choosing auxiliary random variables, one can better capture the correlations among the sources, leading to a strictly tighter/better outer bound for network coding. Construction of auxiliary random variables from source correlation was also considered in  to improve cut-set bounds.

## II Main Results

In this section, we will show that by using auxiliary random variables, the probability distribution of a set of random variables (or a random vector) can be uniquely characterised from the entropies of these variables.

### II-A Random Scalar Case

Consider any ternary random variable $X$. Clearly, entropies of $X$ and probability distributions are not in one-to-one correspondence. In , auxiliary random variables are used in order to exactly characterise the distribution.

Suppose is ternary, taking values from the set . Suppose also that for all . Define random variables , and such that

$$A_i = \begin{cases} 1 & \text{if } X = i \\ 0 & \text{otherwise.} \end{cases} \tag{6}$$

Clearly,

\begin{align}
H(A_i \mid X) &= 0, \tag{7}\\
H(A_i) &= h_b(p_X(i)). \tag{8}
\end{align}

Let us further assume that $p_X(i) \le 1/2$ for all $i$. Then by (8) and the strict monotonicity of $h_b$ on the interval $[0, 1/2]$, it seems at first glance that the distribution of $X$ is uniquely specified by the entropies of the auxiliary random variables.

However, there is a catch in the argument: the auxiliary random variables chosen are not arbitrary. When we “compute” the probabilities of $X$ from the entropies of the auxiliary random variables, we implicitly assume that we know how these random variables are constructed. Without knowing the construction, it is unclear how to find the probabilities of $X$ from entropies.

More precisely, suppose we only know that there exist auxiliary random variables such that (7) and (8) hold (without knowing that the random variables are specified by (6)). Then we cannot determine precisely what the distribution of $X$ is. Despite this complication, [1, 2] gave a construction of auxiliary random variables from which the probability distribution can be characterised via entropies. These results are briefly restated below, as they are a necessary prerequisite for the vector case.
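To make the catch concrete, here is a small illustrative counterexample of our own (it does not appear in the original text, and the distributions are hypothetical). A uniform ternary variable and a non-uniform ternary variable both admit three binary auxiliary variables that are functions of the underlying variable and have identical entropies, so conditions of the form (7) and (8) alone do not pin down the distribution.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a probability vector."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def indicator_entropy(dist, element):
    """Entropy of the binary variable indicating whether the outcome equals `element`."""
    p = dist[element]
    return entropy([p, 1 - p])

# X is uniform on {0, 1, 2}; A_i indicates the event X = i, as in (6).
p_X = {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}
H_A = [indicator_entropy(p_X, i) for i in range(3)]

# Hypothetical X* with a different distribution. Take B_0 = B_1 = B_2, all equal
# to the indicator of the event X* = 0, so every B_i is a function of X*.
p_X_star = {0: 1 / 3, 1: 1 / 2, 2: 1 / 6}
H_B = [indicator_entropy(p_X_star, 0) for _ in range(3)]

# The individual entropies agree, yet the two distributions are not relabellings.
print(H_A == H_B, sorted(p_X.values()) == sorted(p_X_star.values()))
```

The ambiguity arises because only the individual entropies are matched here; matching joint entropies of all partition variables, as required later in (10), is a much stronger condition.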

Let be a random variable with support and be the set of all nonempty binary partitions of . In other words, is the collection of all sets such that , and both and are nonzero. We will use to denote the set . To simplify notations, we may assume without loss of generality that is a subset of . Clearly, . Unless explicitly stated otherwise, we may assume without loss of generality that the probability that (denoted by ) is monotonic decreasing. In other words,

$$p_1 \ge \dots \ge p_n > 0.$$

###### Definition 1 (Partition Random Variables)

A random variable with support induces random variables for such that

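$$A_{\langle\alpha\rangle} \triangleq \begin{cases} \alpha & \text{if } X \in \alpha \\ \alpha^c & \text{otherwise.} \end{cases} \tag{9}$$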

We call $\{A_{\langle\alpha\rangle} : \langle\alpha\rangle \in \Omega\}$ the collection of binary partition random variables of $X$.

###### Remark 2

If or , then there exists an element such that if and only if . Hence, is essentially a binary variable indicating/detecting whether or not. As such, we call an indicator variable. Furthermore, when , there are exactly indicator variables, one for each element in .
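The binary partitions and the indicator variables among them can be enumerated directly. The sketch below is illustrative only (the helper names `binary_partitions` and `is_indicator` are ours); it uses the standard fact that a set of $n$ elements has $2^{n-1} - 1$ unordered partitions into two nonempty blocks.

```python
from itertools import combinations

def binary_partitions(support):
    """Yield each unordered partition {alpha, alpha^c} of `support` exactly once."""
    elements = sorted(support)
    anchor, rest = elements[0], elements[1:]
    # Keep `anchor` in alpha so that a partition and its mirror are not both listed.
    for size in range(len(rest)):  # alpha^c must keep at least one element
        for chosen in combinations(rest, size):
            alpha = frozenset((anchor,) + chosen)
            yield alpha, frozenset(elements) - alpha

def is_indicator(alpha, alpha_c):
    """A partition yields an indicator variable when one block is a singleton."""
    return len(alpha) == 1 or len(alpha_c) == 1

support = {1, 2, 3, 4}
partitions = list(binary_partitions(support))
print(len(partitions))                                 # number of binary partitions
print(sum(is_indicator(a, c) for a, c in partitions))  # number of indicator variables
```

The count grows exponentially with the support size, which is why the number of auxiliary variables in this construction is exponential.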

###### Theorem 2 (Random Scalar Case)

Suppose $X$ is a random variable with support $\mathcal{X}$. For any $\langle\alpha\rangle \in \Omega$, let $A_{\langle\alpha\rangle}$ be the corresponding binary partition random variables. Now, suppose $X^*$ is another random variable such that 1) the size of its support is at most that of $X$, and 2) there exist random variables $B_{\langle\alpha\rangle}$ satisfying the following conditions:

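\begin{align}
H(B_{\langle\alpha\rangle}, \alpha \in \Delta) &= H(A_{\langle\alpha\rangle}, \alpha \in \Delta) \tag{10}\\
H(B_{\langle\alpha\rangle} \mid X^*) &= 0 \tag{11}
\end{align}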

for all . Then there is a mapping

$$\sigma : \mathcal{N}_n \to \mathcal{X}^*$$

such that In other words, the probability distributions of and are essentially the same (via renaming outcomes).

###### Proof:

A sketch of the proof is shown in Appendix A. ∎

### II-B Random Vector Case

Extension of Theorem 2 to the case of random vector has also been considered briefly in our previous work . However, the extension is fairly limited in that work – the random vector must have a positive probability distribution and each individual random variable must take at least three possible values. In this paper, we overcome these restrictions and fully generalise Theorem 2 to the random vector case.

###### Example 1

Consider two random vectors and with probability distributions given in Table I.

If we compare the joint probability distributions of and , they are different from each other. Yet, if we treat and as scalars (by properly renaming), then they indeed have the same distribution (both uniformly distributed over a support of size 8). This example shows that we cannot directly apply Theorem 2 to the random vector case, by simply mapping a vector into a scalar.
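The phenomenon in Example 1 is easy to reproduce. Since the entries of Table I are not reproduced here, the sketch below uses two hypothetical stand-in distributions of our own choosing, both uniform over a support of size 8. As scalars they are indistinguishable, but their coordinate marginals differ, so no coordinate-wise relabelling can map one to the other.

```python
import math
from collections import Counter

def entropy(dist):
    """Shannon entropy in bits of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def marginal(joint, coordinate):
    """Marginal distribution of one coordinate of a joint distribution over tuples."""
    out = Counter()
    for outcome, p in joint.items():
        out[outcome[coordinate]] += p
    return dict(out)

# Hypothetical stand-ins for Table I (NOT the paper's values).
# X is uniform over {0, 1} x {0, 1, 2, 3}.
X = {(a, b): 1 / 8 for a in range(2) for b in range(4)}
# X* is uniform over the 8 points (a, a mod 2) with a in {0, ..., 7}.
X_star = {(a, a % 2): 1 / 8 for a in range(8)}

# Treated as scalars, both are uniform over 8 outcomes.
print(entropy(X), entropy(X_star))
# The first coordinates, however, have different entropies.
print(entropy(marginal(X, 0)), entropy(marginal(X_star, 0)))
```

This is why the vector case must also match entropies involving the individual coordinates, not just the vector as a whole.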

###### Theorem 3 (Random Vector)

Suppose is a random vector with support of size at least 3. Again, let be the set of all nonempty binary partitions of and be the binary partition random variable of such that

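$$A_{\langle\alpha\rangle} = \begin{cases} \alpha & \text{if } X \in \alpha \\ \alpha^c & \text{otherwise} \end{cases} \tag{12}$$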

for all .

Now, suppose $X^*$ is another random vector for which there exist random variables

$$(B_{\langle\alpha\rangle},\ \langle\alpha\rangle \in \Omega)$$

such that for any subset of and ,

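$$H(B_{\langle\alpha\rangle}, \langle\alpha\rangle \in \Delta,\; X^*_j, j \in \tau) = H(A_{\langle\alpha\rangle}, \langle\alpha\rangle \in \Delta,\; X_j, j \in \tau). \tag{13}$$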

Then the joint probability distributions of $X$ and $X^*$ are essentially the same. More precisely, there exist bijective mappings $\sigma_m$ for $m = 1, \dots, M$ such that

$$\Pr(X = (x_1, \dots, x_M)) = \Pr(X^* = (\sigma_1(x_1), \dots, \sigma_M(x_M))). \tag{14}$$

###### Proof:

See Appendix B. ∎
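The notion of equivalence in (14) can be tested by brute force on small examples. The following sketch is illustrative only and is not the paper's method: the function names are ours, each joint distribution is assumed to be a dict that lists only its positive-probability outcomes, and the search over all per-coordinate bijections is exponential, so it is suitable for toy cases only.

```python
from itertools import permutations, product

def coordinate_alphabets(joint):
    """Sorted list of the values taken by each coordinate on the support."""
    dims = len(next(iter(joint)))
    return [sorted({x[m] for x in joint}) for m in range(dims)]

def same_up_to_relabelling(joint_x, joint_y, tol=1e-12):
    """Search for per-coordinate bijections sigma_m satisfying (14)."""
    alph_x, alph_y = coordinate_alphabets(joint_x), coordinate_alphabets(joint_y)
    if [len(a) for a in alph_x] != [len(a) for a in alph_y]:
        return False
    # For each coordinate, list every bijection between the two alphabets.
    candidates = [
        [dict(zip(ax, perm)) for perm in permutations(ay)]
        for ax, ay in zip(alph_x, alph_y)
    ]
    for sigmas in product(*candidates):
        mapped = {tuple(s[v] for s, v in zip(sigmas, x)): p for x, p in joint_x.items()}
        if mapped.keys() == joint_y.keys() and all(
            abs(mapped[y] - joint_y[y]) <= tol for y in joint_y
        ):
            return True
    return False

X = {(0, 0): 0.5, (0, 1): 0.25, (1, 1): 0.25}
Y = {("b", "v"): 0.5, ("b", "u"): 0.25, ("a", "u"): 0.25}  # X with renamed symbols
Z = {(0, 0): 0.5, (1, 1): 0.25, (1, 0): 0.25}              # X with coordinates swapped
print(same_up_to_relabelling(X, Y), same_up_to_relabelling(X, Z))
```

Note that swapping coordinates, as in `Z`, is not a coordinate-wise relabelling, so the check treats it as a different joint distribution.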

### II-C Applications: Network coding outer bounds

Together with Theorem 1 and the characterisation of random variable using entropies, we obtain the following outer bound on the set of achievable capacity tuples.

###### Corollary 1

For any given network, consider the set of correlated sources with underlying probability distribution . From this distribution, construct binary partition random variables for every subset as described in Theorem 2 (for scalar subsets) and Theorem 3 (for vector subsets). Let be the set of all link capacity tuples such that there exists an almost entropic function satisfying the constraints (2)-(4) and

$$h(X_W, B^W_{\langle\alpha\rangle}) - H(Y_W, A^W_{\langle\alpha\rangle}) = 0 \tag{15}$$

for every and . Then . Replacing by , we obtain an explicitly computable outer bound .

## III Conclusion

In this paper, we showed that by using auxiliary random variables, entropies are sufficient to uniquely characterise the probability distribution of a random vector (up to outcome relabelling). Yet many open questions remain. For example, the number of auxiliary random variables used is exponential in the size of the support. Can we reduce the number of auxiliary random variables? What is the tradeoff between the number of auxiliary variables used and how well entropies can characterise the distribution? In the extreme case, if only one auxiliary random variable can be used, how should one pick the variable to best describe the distribution?

## References

[1] S. Thakor, T. Chan, and A. Grant, “Characterising correlation via entropy functions,” in Information Theory Workshop (ITW), 2013 IEEE, pp. 1–2, Sept. 2013.
[2] S. Thakor, T. Chan, and A. Grant, “Bounds for network information flow with correlated sources,” in Australian Communications Theory Workshop (AusCTW), (Melbourne, Australia), pp. 43–48, Feb. 2011.
[3] A. Gohari, S. Yang, and S. Jaggi, “Beyond the cut-set bound: Uncertainty computations in network coding with correlated sources,” IEEE Trans. Inform. Theory, vol. 59, pp. 5708–5722, Sept. 2013.

## Appendix A Scalar case

The main ingredients in the proofs for Theorems 2 and 3 are the properties of the partition random variables, which will be reviewed as follows. By understanding the properties, we can better understand the logic behind Theorem 2.

###### Lemma 1 (Properties)

Let be a random variable with support , and be its induced binary partition random variables. Then the following properties hold:

1. (Distinctness) For any ,

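\begin{align}
H(A_{\langle\alpha\rangle} \mid A_{\langle\beta\rangle}) &> 0, \tag{16}\\
H(A_{\langle\beta\rangle} \mid A_{\langle\alpha\rangle}) &> 0. \tag{17}
\end{align}

2. (Completeness) Let be a binary random variable such that and . Then there exists such that

$$H(A^* \mid A_{\langle\alpha\rangle}) = H(A_{\langle\alpha\rangle} \mid A^*) = 0. \tag{18}$$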

In other words, and are essentially the same.

3. (Basis) Let . Then there exists

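$$\langle\beta_1\rangle, \dots, \langle\beta_{n-2}\rangle \in \Omega$$

such that

$$H(A_{\langle\beta_k\rangle} \mid A_{\langle\alpha\rangle}, A_{\langle\beta_1\rangle}, \dots, A_{\langle\beta_{k-1}\rangle}) > 0 \tag{19}$$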

for all .
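The distinctness property in (16) and (17) can be checked numerically for a small example. The sketch below is illustrative only: the distribution `p` is hypothetical, the helper names are ours, and entropies are in bits. It computes each conditional entropy as a difference of a joint entropy and a marginal entropy, for every ordered pair of distinct binary partitions of a four-element support.

```python
import math
from itertools import combinations

def entropy_of_labels(p, labels):
    """Entropy (bits) of the variable mapping outcome x to labels[x], where X ~ p."""
    mass = {}
    for x, px in p.items():
        mass[labels[x]] = mass.get(labels[x], 0.0) + px
    return -sum(m * math.log2(m) for m in mass.values() if m > 0)

def conditional_entropy(p, alpha, beta):
    """H(A_alpha | A_beta) = H(A_alpha, A_beta) - H(A_beta) for blocks alpha, beta."""
    joint = {x: (x in alpha, x in beta) for x in p}
    given = {x: (x in beta) for x in p}
    return entropy_of_labels(p, joint) - entropy_of_labels(p, given)

# Hypothetical distribution with full support {1, 2, 3, 4}.
p = {1: 0.4, 2: 0.3, 3: 0.2, 4: 0.1}

# Represent each binary partition by its block containing element 1.
blocks = [frozenset((1,) + rest) for k in range(3) for rest in combinations((2, 3, 4), k)]

# Smallest conditional entropy over all ordered pairs of distinct partitions.
smallest = min(conditional_entropy(p, a, b) for a in blocks for b in blocks if a != b)
print(smallest > 0)
```

Because every outcome has positive probability, no partition variable is a function of a different one, which is what the strict inequalities express.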

Among all binary partition random variables, we are particularly interested in the indicator random variables. The following proposition can be interpreted as an “entropic characterisation” of the indicator random variables.

###### Proposition 1 (Characterising indicators)

Let be a random variable of support where . Consider the binary partition random variables induced by . Then for all ,

1. , and

2. For all such that ,

$$H(A_{\langle i \rangle}) \le H(A_{\langle\alpha\rangle}). \tag{20}$$

3. Equality holds in (20) if and only if is an indicator random variable detecting an element such that

$$p_\ell = p_i.$$

4. Let . The indicator random variable is the only binary partition variable of such that

$$H(A_{\langle\alpha\rangle} \mid A_{\langle j \rangle}, j \in \beta) > 0$$

for all proper subset of .

###### Proof of Theorem 2 (sketch):

Let be a random scalar and for are its induced partition random variables. Suppose is another random variable such that 1) the size of its support is at most the same as that of , and 2) there exists random variables satisfying (10) and (11).

Roughly speaking, (10) and (11) mean that the random variables $B_{\langle\alpha\rangle}$ satisfy most of the properties of ordinary partition random variables. To prove the theorem, our first goal is to prove that those random variables are indeed binary partition random variables. In particular, we can prove that

1. (Distinctness) All the random variables for are distinct and have non-zero entropies.

2. (Basis) Let . Then there exists

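$$\langle\beta_1\rangle, \dots, \langle\beta_{n-2}\rangle \in \Omega$$

such that

$$H(B_{\langle\beta_k\rangle} \mid B_{\langle\alpha\rangle}, B_{\langle\beta_1\rangle}, \dots, B_{\langle\beta_{k-1}\rangle}) > 0 \tag{21}$$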

for all .

3. (Binary properties) For any , is a binary partition random variable of . In this case, we may assume without loss of generality that there exists such that

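$$B_{\langle\alpha\rangle} = \begin{cases} \omega_{\langle\alpha\rangle} & \text{if } X^* \in \omega_{\langle\alpha\rangle} \\ \omega^c_{\langle\alpha\rangle} & \text{otherwise.} \end{cases} \tag{22}$$

4. (Completeness) Let be a binary partition random variable of with non-zero entropy. Then there exists such that

$$H(B^* \mid B_{\langle\alpha\rangle}) = H(B_{\langle\alpha\rangle} \mid B^*) = 0. \tag{23}$$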

Then by (10)–(11) and Proposition 1, we show that satisfies all properties which are only satisfied by the indicator random variables. Thus, we prove that is an indicator variable if . Finally, once we have determined which are the indicator variables, we can immediately determine the probability distribution. As for all , the distribution of $X^*$ is indeed the same as that of $X$ (subject to relabelling). ∎

## Appendix B Vector case

In this appendix, we will sketch the proof for Theorem 3, which extends Theorem 2 to the random vector case.

Consider a random vector

$$X = (X_m : m \in \mathcal{N}_M). \tag{24}$$

We will only consider the general case where the support size of $X$ is at least 3. (In the special case when the support size of $X$ is less than 3, the theorem can be proved directly.)

Let $\mathcal{X}$ be the support of $X$. Hence, elements of $\mathcal{X}$ are of the form $x = (x_1, \dots, x_M)$ such that

$$\Pr(X_m = x_m, m \in \mathcal{N}_M) > 0$$

if and only if .

The collection of binary partition random variables induced by the random vector is again indexed by As before, we may assume without loss of generality that

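$$A_{\langle\alpha\rangle} = \begin{cases} \alpha & \text{if } X \in \alpha \\ \alpha^c & \text{otherwise.} \end{cases} \tag{25}$$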

Now, suppose is a set of random variables satisfying the properties as specified in Theorem 3. Invoking Theorem 2 (by treating the random vector as one discrete variable), we can prove the following.

1. The supports of $X$ and $X^*$ have the same size.

2. is a binary partition variable for all .

3. The set of variables contains all distinct binary partition random variables induced by .

4. is an indicator variable for all .

According to definition, is defined as an indicator variable for detecting . However, while is an indicator variable, the subscript in is only an index. The element detected by can be any element in the support of , which can be completely different from . To highlight the difference, we define the mapping such that for any , is the element in the support of that is detected by . In other words

$$A^*_{\langle\sigma(x)\rangle} = B_{\langle x \rangle}. \tag{26}$$

The following lemma follows from Theorem 2.

###### Lemma 2

For all ,

$$\Pr(X = x) = \Pr(X^* = \sigma(x)).$$

Let be the support of . We similarly define as the collection of all sets of the form where is a subset of and the sizes of and are non-zero. Again, we will use to denote the set and define

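$$A^*_{\langle\gamma\rangle} = \begin{cases} \gamma & \text{if } X^* \in \gamma \\ \gamma^c & \text{otherwise.} \end{cases} \tag{27}$$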

For any , is a binary partition random variable of . Hence, we may assume without loss of generality that there exists such that For notational simplicity, we may further extend the mapping such that for all . (Strictly speaking, is not precisely defined. As , can either be or . Yet the precise choice of does not have any effect on the proof. We only require that when is a singleton, should also be a singleton.)

###### Proposition 2

Let . Suppose satisfies the following properties:

1. For any , if and only if .

2. For any , if and only if .

Then .

###### Proof:

Direct verification. ∎

By definition of and Proposition 2, we have the following result.

###### Proposition 3

Let . Then is the only binary partition variable of such that

1. For any , if and only if .

2. For any , if and only if .

###### Proposition 4

Let . Then , where .

###### Proof:

By Proposition 3, is the only variable such that

1. For any , if and only if .

2. For any , if and only if .

The above two properties can then be rephrased as

1. For any ,

$$H(A^*_{\langle\sigma(\alpha)\rangle} \mid A^*_{\langle\sigma(x)\rangle}, \sigma(x) \in \delta(\gamma)) = 0$$

if and only if

2. For any ,

$$H(A^*_{\langle\sigma(\alpha)\rangle} \mid A^*_{\langle\sigma(x)\rangle}, \sigma(x) \in \delta(\gamma)) = 0$$

if and only if .

Now, we can invoke Proposition 2 and prove that or equivalently, . The proposition then follows. ∎

###### Proposition 5

Consider two distinct elements and in . Let

\begin{align}
\sigma(x) &= y = (y_1, \dots, y_M) \tag{28}\\
\sigma(x') &= y' = (y'_1, \dots, y'_M). \tag{29}
\end{align}

Then if and only if .

###### Proof:

First, we will prove the only-if statement. Suppose . Consider the following two sets

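\begin{align}
\Delta &= \{x'' = (x''_1, \dots, x''_M) \in \mathcal{X} : x''_m \ne x_m\}, \tag{30}\\
\Delta^c &= \{x'' = (x''_1, \dots, x''_M) \in \mathcal{X} : x''_m = x_m\}. \tag{31}
\end{align}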

It is obvious that By (10)-(11), we have . Hence, Since , this implies .

Now, notice that and . By Proposition 4, . Therefore, and Together with the fact that , we can then prove that

$$y'_m \ne y''_m.$$

Next, we prove the if-statement. Suppose such that . There exist and such that (28) and (29) hold. Again, define

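\begin{align}
\Lambda &\triangleq \{y'' = (y''_1, \dots, y''_M) \in \mathcal{X}^* : y''_m \ne y_m\}, \tag{32}\\
\Lambda^c &\triangleq \{y'' = (y''_1, \dots, y''_M) \in \mathcal{X}^* : y''_m = y_m\}. \tag{33}
\end{align}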

Then . Let By definition and Proposition 4, Hence, we have and consequently . On the other hand, it can be verified from definition that and . Together with that , we prove that . The proposition then follows. \qed

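###### Proof of Theorem 3:

A direct consequence of Proposition 5 is that there exist bijective mappings $\sigma_1, \dots, \sigma_M$ such that $\sigma(x) = (\sigma_1(x_1), \dots, \sigma_M(x_M))$. On the other hand, Theorem 2 proved that $\Pr(X = x) = \Pr(X^* = \sigma(x))$. Consequently,

$$\Pr(X_1 = x_1, \dots, X_M = x_M) = \Pr(X^*_1 = \sigma_1(x_1), \dots, X^*_M = \sigma_M(x_M)). \tag{34}$$

Therefore, the joint distributions of $X$ and $X^*$ are essentially the same (by renaming $x_m$ as $\sigma_m(x_m)$). ∎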
