# Graph-Based Lossless Markov Lumpings

## Abstract

We use results from zero-error information theory to determine the set of non-injective functions through which a Markov chain can be projected without losing information. These lumping functions can be found by clique partitioning of a graph related to the Markov chain. Lossless lumping is made possible by exploiting the (sufficiently sparse) temporal structure of the Markov chain. Eliminating edges in the transition graph of the Markov chain trades off the required output alphabet size against information loss, for which we present bounds.

## 1 Introduction

Large Markov models, common in many scientific disciplines, present a challenge for analysis, model parameter learning, and simulation: language $n$-gram models [1, Ch. 6] and models in computational chemistry and systems biology [2], for example, belong to this category. For these models, efficient simulation methods are as important as ways to represent the model with fewer parameters. A popular approach for the latter is lumping, i.e., replacing the alphabet of the Markov chain by a smaller one via partitioning. This partition induces a non-injective lumping function from the large to the small alphabet. While, in general, the lumped process has a lower entropy rate than the original chain, in [3] we presented conditions for lossless lumpings, i.e., those for which the original Markov chain and the lumped process have equal entropy rates. Specifically, the single entry property we define in [3, Def. 3] holds if, given the previous state of the Markov chain, only a single state in the preimage of the current lumped state is realizable, i.e., has positive probability (see Fig. 1).

The emphasis on whether a state is realizable, rather than on its probability, is also common in zero-error information theory. Typical problems in zero-error information theory are error-free communication [4] (rather than communication with small error probabilities) and lossless source coding with side information [5]. Both problems admit elegant graph-theoretic approaches which we recapitulate in Section 2.

In Section 3, we use these graph-theoretic approaches to find lossless lumpings for a given Markov chain. While the current state of the Markov chain cannot be inferred from its lumped image only, we require that it can be reconstructed by using the previous state of the Markov chain as side information (cf. Fig. 1). The lumpings fulfilling this requirement correspond to the possible clique partitions of a graph derived from the Markov chain. The method is universal in the sense that it only depends on the presence, but not the precise magnitude, of state transitions of the Markov chain. In Section 4, we relax the problem and reduce the output alphabet size of the lumping function by accepting that the lumped process has an entropy rate smaller than the original chain. We furthermore present bounds on the difference between these entropy rates.

By design, lossless lumpings are not efficient source codes. Thus, it cannot be assumed that the reduced output alphabet size is related to the Markov chain’s entropy rate. Nevertheless, in Section 5, we evaluate our lossless lumping method from a source coding perspective by applying it to length-$K$ sequences of the original Markov chain. We show that the required size of the output alphabet never exceeds (and asymptotically approaches) the number of realizable length-$K$ sequences. Our lossless lumping method is thus an asymptotically optimal fixed-length, lossless source code.

Future work shall apply the presented lumping methods to practical examples from, e.g., chemical reaction networks or natural language processing. Furthermore, while the connection between lossless lumpings and zero-error information theory is interesting and revealing, searching for lossless lumping functions via clique partitioning can be computationally expensive. We have reason to believe that the search for lumping functions can be cast as a constrained optimization problem whose properties are currently under investigation. Finally, we believe that the results presented in this work can contribute to zero-error source coding of processes with memory, complementing available results on zero-error coding for channels with memory (see [6] and the references therein). Section 7 hints at first results.

## 2 Preliminaries from Zero-Error Information Theory

Throughout this work, $\log$ denotes the natural logarithm, i.e., entropies and entropy rates are measured in nats.

Let $\mathbf{X}=(X_n)_{n\in\mathbb{N}}$ be an irreducible, aperiodic, stationary Markov chain with finite alphabet $\mathcal{X}$, transition probability matrix $P$, and invariant distribution vector $\mu$. The adjacency matrix $A$ is defined by $A_{x,x'}:=\lceil P_{x,x'}\rceil$. We say a state $x$ can access another state $x'$, if $A_{x,x'}=1$ ($x\to x'$). We abbreviate $X_1^n:=(X_1,\dots,X_n)$. The $K$-fold blocked process $\mathbf{X}^{(K)}$ given by $X^{(K)}_n:=X_{(n-1)K+1}^{nK}$ is also Markov. Every length-$K$ sequence of $\mathbf{X}$ is a state of $\mathbf{X}^{(K)}$.

Let $G=(V,E)$ be a graph with vertices $V$ and edges $E\subseteq[V]^2$, where $[V]^2$ is the set of two-element subsets of $V$. A set $C\subseteq V$ is a clique, if $[C]^2\subseteq E$, and an independent set, if $[C]^2\cap E=\emptyset$. The clique number $\omega(G)$ and independence number $\alpha(G)$ are the sizes of $G$’s largest clique and largest independent set, respectively. A clique partition of $G$ is a partition of $V$ into cliques of $G$. The clique partition number $\bar\chi(G)$ is the size of the smallest clique partition of $G$. The chromatic number $\chi(G)$ is the minimum number of colours needed to paint $V$ without having same-coloured neighbours.

The complement graph $\bar G$ has vertex set $V$ and edge set $[V]^2\setminus E$. Edge-duality identifies cliques of $G$ with independent sets of $\bar G$ and vice versa, whence $\omega(G)=\alpha(\bar G)$ and $\bar\chi(G)=\chi(\bar G)$. For further details on graph theory see [7].
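For small graphs, these quantities can be checked by brute force. The following Python sketch (illustrative only; the 5-cycle is a hypothetical example, not taken from this paper) verifies the edge-duality identity $\omega(G)=\alpha(\bar G)$:

```python
from itertools import combinations

def complement_edges(vertices, edges):
    """Edge set of the complement graph: all vertex pairs not in E."""
    return {frozenset(p) for p in combinations(vertices, 2)} - edges

def clique_number(vertices, edges):
    """omega(G): size of the largest subset whose pairs are all edges."""
    best = 1
    for k in range(2, len(vertices) + 1):
        for s in combinations(vertices, k):
            if all(frozenset(p) in edges for p in combinations(s, 2)):
                best = k
    return best

def independence_number(vertices, edges):
    """alpha(G), computed as the clique number of the complement graph."""
    return clique_number(vertices, complement_edges(vertices, edges))

# A 5-cycle: the largest clique is an edge, the largest independent set
# also has two vertices (the complement of a 5-cycle is again a 5-cycle).
V = [0, 1, 2, 3, 4]
E = {frozenset({i, (i + 1) % 5}) for i in V}
print(clique_number(V, E), independence_number(V, E))  # prints: 2 2
```

The brute-force search is exponential in $|V|$; it is meant only to make the definitions concrete, not as a practical algorithm.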

Let $\mathcal{Y}$ be a finite output alphabet. We consider a discrete, memoryless channel (DMC) with input alphabet $\mathcal{X}$ and output alphabet $\mathcal{Y}$, defined by the transition probability matrix $W$, where $W_{x,y}:=\Pr(Y=y|X=x)$. In the case of a deterministic channel, i.e., one in which every entry $W_{x,y}\in\{0,1\}$ and every row of $W$ contains exactly one non-zero entry, we can describe the channel by a lumping function $g\colon\mathcal{X}\to\mathcal{Y}$ and call $\mathbf{Y}$ defined by $Y_n:=g(X_n)$ the lumped process.

###### Definition 1.

Let $G_W=(\mathcal{X},E_W)$ be the (channel) confusion graph, where

$$\{x_1,x_2\}\in E_W \Leftrightarrow \exists y\in\mathcal{Y}\colon\ \lceil W_{x_1,y}\rceil\cdot\lceil W_{x_2,y}\rceil = 1. \tag{1}$$

In the case of a deterministic channel, i.e., a lumping $g$, denote the confusion graph by $G_g$.

The confusion graph connects two vertices if the channel confuses them with positive probability, i.e., if there exists at least one element in the output alphabet to which both inputs can be mapped. If the channel is deterministic, then the confusion graph has a simple structure.

###### Lemma 1.

The confusion graph $G_g$ consists of isolated cliques induced by the preimages of the lumping function $g$. Hence,

$$E_g = \bigcup_{y\in\mathcal{Y}} \left[g^{-1}(y)\right]^2. \tag{2}$$

The confusion graph is exactly the graph used in Shannon’s original paper [4] and the complement of the graph in [8, Sec. III]. The confusion graph determines the zero-error capacity of the channel. The number of messages that can be transmitted reliably via one channel use is the independence number $\alpha(G_W)$ of its confusion graph $G_W$. For $n$ channel uses, one requires the $n$-fold normal product of $G_W$ with itself: $G_W^{\boxtimes n}=(\mathcal{X}^n,E_W^{\boxtimes n})$, where $\{x_1^n,\tilde x_1^n\}\in E_W^{\boxtimes n}$, if $x_i=\tilde x_i$ or $\{x_i,\tilde x_i\}\in E_W$ for all $i$. In the limit, one has the zero-error capacity $C_0=\lim_{n\to\infty}\frac{1}{n}\log\alpha(G_W^{\boxtimes n})$. In the case of a deterministic channel, the number of messages that can be transmitted reliably in one channel use is $M:=|g(\mathcal{X})|$, the cardinality of the set $g(\mathcal{X})$. Since the normal product of a graph of isolated cliques is again a graph of isolated cliques, one has $C_0=\log M$, cf. [8, p. 2209]. For such channels, separating source and channel coding is optimal [9, Prop. 1].
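To make Lemma 1 concrete, the following Python sketch (with a hypothetical three-input channel, not from this paper) builds the confusion graph once from the channel matrix via (1) and once from the preimages of the lumping via (2), and confirms that the two edge sets coincide for a deterministic channel:

```python
from itertools import combinations

def confusion_edges(W):
    """Edge set (1): inputs x1 and x2 are confusable if some output y
    has positive probability under both of them."""
    X, Y = range(len(W)), range(len(W[0]))
    return {frozenset({x1, x2})
            for x1, x2 in combinations(X, 2)
            if any(W[x1][y] > 0 and W[x2][y] > 0 for y in Y)}

def lumping_confusion_edges(g, outputs):
    """Edge set (2): the union of complete graphs on the preimages g^{-1}(y)."""
    edges = set()
    for y in outputs:
        pre = [x for x in range(len(g)) if g[x] == y]
        edges |= {frozenset(p) for p in combinations(pre, 2)}
    return edges

# Deterministic channel given by the lumping g(0) = g(1) = 0, g(2) = 1.
g = [0, 0, 1]
W = [[1, 0], [1, 0], [0, 1]]
print(confusion_edges(W) == lumping_confusion_edges(g, {0, 1}))  # prints: True
```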

Let $X$ and $Z$ be two RVs with a joint distribution $Q$ having support in $\mathcal{X}\times\mathcal{Z}$.

###### Definition 2.

Let $G_{(X,Z)}=(\mathcal{X},E_{(X,Z)})$ be the characteristic graph of $(X,Z)$, where $\{x,x'\}\in E_{(X,Z)}$, if

$$\forall z\in\mathcal{Z}\colon\ \Pr(X=x,Z=z)\Pr(X=x',Z=z)=0, \tag{3}$$

i.e., if there is no $z\in\mathcal{Z}$ such that $\Pr(X=x,Z=z)>0$ and $\Pr(X=x',Z=z)>0$.

In other words, the characteristic graph connects two vertices, if the side information distinguishes between them. The characteristic graph is the complement of the graph defined by Witsenhausen [5]. It determines the smallest number of messages that the transmitter must send to the receiver, such that the latter can reconstruct $X$ with the help of the side information $Z$. For a single transmission, the required number of messages is the clique partition number $\bar\chi(G_{(X,Z)})$. For $K$ independent instances of $(X,Z)$, one requires the $K$-fold co-normal product of $G_{(X,Z)}$ with itself: $G_{(X,Z)}^{\vee K}=(\mathcal{X}^K,E_{(X,Z)}^{\vee K})$, where $\{x_1^K,\tilde x_1^K\}\in E_{(X,Z)}^{\vee K}$, if $\{x_i,\tilde x_i\}\in E_{(X,Z)}$ for at least one $i$. In particular, $G_{(X,Z)}^{\vee K}=G_{(X,Z)^{(K)}}$, the characteristic graph of the $K$-fold blocked process. The number of messages required to convey $K$ instances is thus $\bar\chi(G_{(X,Z)}^{\vee K})$.

The characteristic graph depends only on the source and connects messages that the channel may confuse, given that the receiver has side information $Z$. The confusion graph depends only on the channel and connects messages that the channel confuses. If the edge set of the latter is a subset of the edge set of the former, the channel confuses only messages that can be distinguished by incorporating the side information. This is the statement of

###### Proposition 1.

$H(X|Y,Z)=0$ if and only if $E_W\subseteq E_{(X,Z)}$.

Proposition 1, proved in Section 6.1, generalizes easily to multiple channel uses by considering the corresponding graph products.

## 3 Graph-Based Lossless Markov Lumpings

We use results from zero-error information theory to construct a lumping of a Markov chain such that the original Markov chain can be recovered without error. To this end, we assume that, for the reconstruction of $X_n$, the receiver has the previous state $X_{n-1}$ as side information. This temporal side information determines the characteristic graph. A clique partition of this graph defines a lumping function $g$, whose confusion graph $G_g$ (Definition 1) is a subgraph of the Markov chain’s characteristic graph. Then, Proposition 1 guarantees that the original chain can be perfectly reconstructed from its initial state and the lumped process. The remainder of this section makes these statements precise.

###### Definition 3.

Let $G_{\mathbf{X}}=(\mathcal{X},E_{\mathbf{X}})$ be the characteristic graph of $\mathbf{X}$, where

$$\{x_1,x_2\}\in E_{\mathbf{X}} \Leftrightarrow \forall x\in\mathcal{X}\colon\ A_{x,x_1}A_{x,x_2}=0. \tag{4}$$

In other words, the characteristic graph of a Markov chain connects two states, if every state can only access one of them. Since the Markov chains considered in this work are irreducible, the invariant distribution vector $\mu$ is positive and Definition 3 coincides with Definition 2 for the source $X=X_n$ with side information $Z=X_{n-1}$.
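Definition 3 is straightforward to evaluate mechanically. The following Python sketch builds $E_{\mathbf{X}}$ from an adjacency matrix via (4); the four-state deterministic cycle is a hypothetical example, not the chain of Fig. 1:

```python
from itertools import combinations

def characteristic_edges(A):
    """Edge set (4): two states are connected iff no state can access both."""
    n = len(A)
    return {frozenset({x1, x2})
            for x1, x2 in combinations(range(n), 2)
            if all(A[x][x1] * A[x][x2] == 0 for x in range(n))}

# Deterministic 4-cycle 0 -> 1 -> 2 -> 3 -> 0: every state has a unique
# predecessor, so no two states share one; the characteristic graph is complete.
A = [[0, 1, 0, 0],
     [0, 0, 1, 0],
     [0, 0, 0, 1],
     [1, 0, 0, 0]]
print(len(characteristic_edges(A)))  # prints: 6 (all pairs of 4 states)
```

Since this characteristic graph is complete, the whole alphabet is a single clique: given the initial state, the deterministic cycle needs no further information, consistent with the construction below.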

###### Example 1.

Consider the Markov chain in Fig. 1. Its characteristic graph has edge set . Both edges are cliques, and together they partition .

Choose an arbitrary clique partition of $G_{\mathbf{X}}$, enumerate the cliques, and define $g$ such that it maps each vertex in $\mathcal{X}$ to the index of its containing clique. This way, $g$ assigns different values to vertices within different cliques. According to Lemma 1, the confusion graph $G_g$ of $g$ consists exactly of the cliques of the chosen clique partition of $G_{\mathbf{X}}$, only that these cliques are isolated in $G_g$. This ensures that $E_g\subseteq E_{\mathbf{X}}$. Let $Y_n:=g(X_n)$ define the lumped process $\mathbf{Y}$. Hence, by Proposition 1, we have

$$H(X_n|Y_n,X_{n-1})=0. \tag{5}$$

Let $\bar H(\mathbf{X})$ and $\bar H(\mathbf{Y})$ be the entropy rates of $\mathbf{X}$ and $\mathbf{Y}$, respectively. It is easy to see that the tuple $(\mathbf{X},g)$ fulfils the single entry property [3, Def. 10]. Thus, the lumping is lossless in the sense of a vanishing information loss rate, i.e.,

$$\bar H(\mathbf{X}|\mathbf{Y}) := \lim_{n\to\infty}\frac{1}{n}H(X_1^n|Y_1^n) = \bar H(\mathbf{X})-\bar H(\mathbf{Y}) = 0. \tag{6}$$

This follows from the chain rule, the fact that conditioning reduces entropy, and the stationarity of $(\mathbf{X},\mathbf{Y})$:

$$\bar H(\mathbf{X}|\mathbf{Y}) \stackrel{(a)}{=} \lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n} H(X_i|Y_1^n,X_1^{i-1}) \tag{7a}$$
$$\stackrel{(b)}{\le} \lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n} H(X_i|Y_i,X_{i-1}) \tag{7b}$$
$$\stackrel{(c)}{=} H(X_2|Y_2,X_1). \tag{7c}$$

The last term vanishes because $g$ is such that (5) holds for all $n$. With this we have proven

###### Corollary 1.

If, for a given Markov chain $\mathbf{X}$, the lumping function $g$ satisfies $E_g\subseteq E_{\mathbf{X}}$, then the lumping is lossless, i.e., $\bar H(\mathbf{X}|\mathbf{Y})=0$.

Not only is the proposed lumping method lossless in the sense of Corollary 1, the original Markov chain can also be perfectly reconstructed from its initial state $X_1$ and from the lumped process $\mathbf{Y}$. The initial state $X_1$ and the state $Y_2$ of the lumped process together determine the state $X_2$ of the original Markov chain. Then, $X_2$ acts as side information to reconstruct $X_3$ from $Y_3$, etc.
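The sequential reconstruction just described can be sketched in Python. The four-state cycle and the two-clique partition below are hypothetical, chosen only so that $E_g\subseteq E_{\mathbf{X}}$ holds:

```python
def lumping_from_partition(partition):
    """Map each state to the index of its containing clique."""
    return {x: idx for idx, clique in enumerate(partition) for x in clique}

def reconstruct(x1, lumped, A, partition):
    """Recover the chain from its initial state and the lumped process:
    given the previous state, exactly one accessible state lies in the
    preimage of the current lumped symbol (this is what E_g <= E_X buys)."""
    states = [x1]
    for y in lumped[1:]:
        candidates = [x for x in partition[y] if A[states[-1]][x] == 1]
        assert len(candidates) == 1
        states.append(candidates[0])
    return states

# Hypothetical 4-cycle 0 -> 1 -> 2 -> 3 -> 0; its characteristic graph is
# complete, so the partition {0, 2}, {1, 3} consists of cliques.
A = [[0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 0]]
partition = [[0, 2], [1, 3]]
g = lumping_from_partition(partition)

path = [0, 1, 2, 3, 0]
lumped = [g[x] for x in path]                              # [0, 1, 0, 1, 0]
print(reconstruct(path[0], lumped, A, partition) == path)  # prints: True
```

Here four states are represented by two output symbols, yet no information is lost: the decoder resolves each lumped symbol using the previously reconstructed state.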

We investigate the size of the output alphabet required for the lumping to be lossless. An optimal lumping function induces the smallest possible partition of $\mathcal{X}$, i.e., one with $M=\bar\chi(G_{\mathbf{X}})$ cliques. From Definition 3 it follows that no two states accessible from a given state can be connected in $G_{\mathbf{X}}$. Hence, if $d_{\max}$ is the maximum out-degree of the transition graph associated with $A$, i.e.,

$$d_{\max} := \max_{x\in\mathcal{X}} \sum_{x'\in\mathcal{X}} A_{x,x'}, \tag{8}$$

then every clique partition of $G_{\mathbf{X}}$ contains at least $d_{\max}$ cliques. We recover

###### Proposition 2 ([10, Prop. 3]).

$\bar\chi(G_{\mathbf{X}}) \ge d_{\max}$.

Witsenhausen [5, Prop. 1] showed that this lower bound can be achieved using the side information, which is available at both ends. The achievable scheme requires that, for every state of the side information $X_{n-1}$, a separate lumping function is used. Our restriction to a single lumping function leads to an output alphabet size generally larger than $d_{\max}$. However, if $A$ is sufficiently sparse, then the presence of side information at the receiver helps to make the output alphabet size still strictly smaller than $|\mathcal{X}|$.

###### Example 2.

Consider the Markov chain in Fig. 1 and assume that all transitions have probability 0.5. By symmetry, it follows that and . The output alphabet size is optimal in terms of Proposition 2: .

The proposed lumping method depends only on the location of zeros in the adjacency matrix $A$. It follows that the method is universal in the sense that the obtained lumping function $g$ is lossless for every Markov chain with adjacency matrix $A$. Moreover, $g$ is lossless for every stationary process for which the non-zero one-step transition probabilities are modelled by $A$. Equations (7) do not require Markovity of $\mathbf{X}$, whence Corollary 1 remains valid. However, our lumping method is only useful for Markov chains (or stationary processes) with a deterministic temporal structure, i.e., for sparse matrices $A$.

###### Example 3.

Suppose that $P$ is a positive matrix, collecting the conditional probability distributions of two consecutive samples of $\mathbf{X}$. Hence $A$ is a matrix of ones, and the edge set of the characteristic graph $G_{\mathbf{X}}$ is empty. Thus, $\bar\chi(G_{\mathbf{X}})=|\mathcal{X}|$. The only lossless lumping functions are permutations, hence lumping does not reduce the alphabet size.

Note finally that instead of defining $g$ via a clique partition of $G_{\mathbf{X}}$, one can also define a stochastic lumping via a clique covering of $G_{\mathbf{X}}$. This still ensures that $E_W\subseteq E_{\mathbf{X}}$ holds and that the statement of Corollary 1 remains valid. While clique covering leads to additional freedom in the design of the lumping, it does not reduce the required output alphabet size compared to clique partitioning: if two cliques $C_1$ and $C_2$ cover a subset of the vertices, then the two cliques $C_1$ and $C_2\setminus C_1$ partition it.

## 4 Graph-Based Lossy Markov Lumpings

We generalize the characteristic graph of the Markov chain by eliminating edges from its transition graph (i.e., ones in its adjacency matrix $A$) if the transition probabilities fall below a certain threshold:

###### Definition 4.

For $\varepsilon\ge 0$, the $\varepsilon$-characteristic graph of $\mathbf{X}$ is the graph $G_\varepsilon=(\mathcal{X},E_\varepsilon)$, where

$$\{x_1,x_2\}\in E_\varepsilon \Leftrightarrow \forall x\in\mathcal{X}\colon\ \lceil P_{x,x_1}-\varepsilon\rceil\cdot\lceil P_{x,x_2}-\varepsilon\rceil = 0. \tag{9}$$
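A Python sketch of Definition 4 (the transition matrix below is hypothetical): transitions with probability at most $\varepsilon$ are treated as absent when building the graph, so edges can only be gained as $\varepsilon$ grows.

```python
from itertools import combinations

def eps_characteristic_edges(P, eps):
    """Edge set (9): ceil(P[x][x1] - eps) equals 1 exactly when
    P[x][x1] > eps, so weak transitions are ignored."""
    n = len(P)
    A = [[1 if P[x][xp] > eps else 0 for xp in range(n)] for x in range(n)]
    return {frozenset({x1, x2})
            for x1, x2 in combinations(range(n), 2)
            if all(A[x][x1] * A[x][x2] == 0 for x in range(n))}

P = [[0.9, 0.1],
     [0.1, 0.9]]
print(len(eps_characteristic_edges(P, 0.0)))  # prints: 0 (no lossless lumping)
print(len(eps_characteristic_edges(P, 0.1)))  # prints: 1 (states now mergeable)
```

With the threshold at $0.1$ the two states become connected and can be lumped together, at the price of an information loss bounded by Proposition 3 below.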

Definition 4 is equivalent to Definition 3, if $A$ is defined by $A_{x,x'}:=\lceil P_{x,x'}-\varepsilon\rceil$. Decreasing the number of ones in $A$ can only increase the number of edges in the characteristic graph, which in turn can only make the cliques larger and the clique partition number smaller. Hence, $E_{\mathbf{X}}\subseteq E_\varepsilon$ and $\bar\chi(G_\varepsilon)\le\bar\chi(G_{\mathbf{X}})$. By eliminating edges, one may trade information loss for alphabet size. For the former, in Section 6.2, we prove a bound depending on $\varepsilon$, the number of vertices $N:=|\mathcal{X}|$, and the cardinality $M$ of the output alphabet:

###### Proposition 3.

Take $\varepsilon\le 1/2$ and a lumping function $g$ with $E_g\subseteq E_\varepsilon$. Then

$$\bar H(\mathbf{X}|\mathbf{Y}) \le (N-M)\,\varepsilon(1-\log\varepsilon) \le N H_2(\varepsilon), \tag{10}$$

where $H_2(\varepsilon):=-\varepsilon\log\varepsilon-(1-\varepsilon)\log(1-\varepsilon)$. The first inequality already holds for every $\varepsilon\in(0,1)$.

Applying Proposition 3 with $\varepsilon=0$ recovers Corollary 1. The following example illustrates that if the entropy rate of $\mathbf{X}$ falls below the bound in Proposition 3, the lumped process can become trivial.

###### Example 4.

Suppose that

$$P = \begin{pmatrix} 1-\varepsilon & \varepsilon \\ \varepsilon & 1-\varepsilon \end{pmatrix}. \tag{11}$$

It follows that $\bar H(\mathbf{X})=H_2(\varepsilon)$ and that $E_\varepsilon=[\mathcal{X}]^2$. Moreover, as $G_\varepsilon$ is fully connected, $g$ is constant with $M=1$. Thus, $\mathbf{Y}$ is a constant process and $\bar H(\mathbf{X}|\mathbf{Y})=\bar H(\mathbf{X})=H_2(\varepsilon)$.

Reconstructing $\mathbf{X}$ from $\mathbf{Y}$ (with small probability of error) requires reconstruction methods more sophisticated than those for the lossless lumping method introduced in Section 3. Given knowledge of the previous state $X_{n-1}$ and the current lumped state $Y_n$, the current state $X_n$ cannot be reconstructed without error. Hence, the side information used for reconstructing the next state might not be correct, which leads to error propagation.

## 5 A Source Coding Perspective on Lossless Markov Lumpings

The intended application of the lumping method introduced in Section 3 – model reduction in speech/language processing [1] or systems biology [2] – imposes several restrictions. The lumping is a time-invariant, preferably deterministic mapping from the large alphabet $\mathcal{X}$ to a smaller alphabet $\mathcal{Y}$ and operates on a symbol-by-symbol basis in order to represent a partition of the original alphabet. These restrictions – stateless, fixed-length, and symbol-by-symbol – make our proposed method an inefficient source code. Despite this apparent incompatibility, we critically evaluate our lossless lumping method from a source coding perspective.

First, our lumping method can be used as a (universal) pre-processing step, after which more sophisticated compression schemes follow. For example, it can be easily extended to a variable-length symbol-by-symbol scheme by, e.g., optimal Huffman coding of the lumped states.

Second, we may still require the lumping to be stateless and fixed-length, but define the lumping function $g_K$ on the $K$-fold Cartesian product $\mathcal{X}^K$. Hence, $g_K$ lumps sequences of length $K$ rather than single states. Due to the deterministic temporal structure of $\mathbf{X}$, the alphabet size for lumping these length-$K$ sequences is not larger than the number of realizable sequences of this length. In other words, our scheme is at least as good as, and asymptotically equivalent to, any fixed-length, lossless coding scheme that en-/decodes sequences independently of each other. To show this, let $\lambda$ be the largest eigenvalue of the adjacency matrix $A$. If the Markov chain has adjacency matrix $A$, then the logarithm of $\lambda$ bounds the entropy rate of $\mathbf{X}$ from above, i.e., $\bar H(\mathbf{X})\le\log\lambda$ [11, 12]. In Section 6.3, we prove

###### Proposition 4.

For each $K$, let $g_K$ be an optimal lumping function for the Markov chain $\mathbf{X}^{(K)}$, i.e., one inducing the smallest clique partition of its characteristic graph $G_{\mathbf{X}^{(K)}}$. Let $M_K:=|g_K(\mathcal{X}^K)|$. Let the set of realizable states of $\mathbf{X}^{(K)}$ be

$$S_K := \{x\in\mathcal{X}^K\colon\ \Pr(X^{(K)}_n=x)>0\}. \tag{12}$$

Then, $M_K\le|S_K|$ and

$$\lim_{K\to\infty}\frac{\log M_K}{K}=\log\lambda. \tag{13}$$
###### Example 5.

If , then . If , then .

While always $M_K\le|S_K|$, the inequality may be strict, especially for small $K$ and sparse $A$. This advantage disappears for increasing $K$ due to the Markov property, and the required alphabet size approaches the number of realizable length-$K$ sequences, which for large $K$ behaves like $\lambda^K$ [12]. Thus, while our lossless lumping method is asymptotically optimal in the sense of Proposition 4, for the intended application of reducing the alphabet it seems to be most efficient when applied symbol-by-symbol.
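The asymptotics of Proposition 4 can be checked numerically. The following Python sketch uses a hypothetical two-state chain (a "golden mean" constraint in which state 1 cannot follow itself; this is not the chain of Fig. 1), counts realizable length-$K$ sequences via powers of $A$, and compares the growth rate against $\log\lambda$:

```python
from math import log, sqrt

# Hypothetical sparse adjacency matrix: state 1 cannot follow itself.
A = [[1, 1],
     [1, 0]]

def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def num_realizable(A, K):
    """|S_K|: realizable length-K sequences are walks with K-1 steps,
    i.e. the sum of the entries of A^(K-1)."""
    M = [[1, 0], [0, 1]]  # identity
    for _ in range(K - 1):
        M = matmul(M, A)
    return sum(map(sum, M))

lam = (1 + sqrt(5)) / 2  # Perron eigenvalue of this A (the golden ratio)
for K in (5, 20, 80):
    # the ratio approaches log(lam) ~ 0.4812 as K grows
    print(K, log(num_realizable(A, K)) / K)
```

For this chain $|S_K|$ grows like a Fibonacci sequence, and the per-symbol rate $\frac{1}{K}\log|S_K|$ visibly converges to $\log\lambda$, matching (13).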

## 6 Proofs

### 6.1 Proof of Proposition 1

Fix $Q$ and $W$. First, assume that $H(X|Y,Z)>0$. Then there exist triples $(x,y,z)$ and $(x',y,z)$ with $x\ne x'$ such that

$$\Pr(X=x,Y=y,Z=z)=Q_{x,z}W_{x,y}>0 \tag{14a}$$

and

$$\Pr(X=x',Y=y,Z=z)=Q_{x',z}W_{x',y}>0. \tag{14b}$$

Hence, each term of the products on the right-hand sides above must be positive, from which $W_{x,y}W_{x',y}>0$ and $Q_{x,z}Q_{x',z}>0$ follow. As a consequence, by the definitions of the channel confusion graph and the characteristic graph of $(X,Z)$, we have $\{x,x'\}\in E_W$ and $\{x,x'\}\notin E_{(X,Z)}$. Thus, $E_W\not\subseteq E_{(X,Z)}$.

Second, assume that $E_W\not\subseteq E_{(X,Z)}$. Then, there exists $\{x,x'\}$ such that $\{x,x'\}\in E_W$ and $\{x,x'\}\notin E_{(X,Z)}$. It follows that there exists at least one $y$ such that $W_{x,y}W_{x',y}>0$, and at least one $z$ such that $Q_{x,z}Q_{x',z}>0$. Hence, the two probabilities in equations (14) are positive for these $y$ and $z$. Thus, $H(X|Y,Z)>0$.∎

### 6.2 Proof of Proposition 3

That $\bar H(\mathbf{X}|\mathbf{Y})\le H(X_2|Y_2,X_1)$ follows from (7). If we define $R_{x,y}:=\sum_{x'\in g^{-1}(y)} P_{x,x'}$, then we get

$$H(X_2|Y_2,X_1=x) = -\sum_{y\in\mathcal{Y}}\sum_{x'\in g^{-1}(y)} P_{x,x'}\log\frac{P_{x,x'}}{R_{x,y}}. \tag{15}$$

The assumption $E_g\subseteq E_\varepsilon$ implies that $g^{-1}(y)$ is a clique in $G_\varepsilon$, whence each $x$ can access at most one element in $g^{-1}(y)$ with a probability larger than $\varepsilon$. Hence, let $\hat x\in g^{-1}(y)$ be such that, for all other $x'\in g^{-1}(y)$, $P_{x,x'}\le\varepsilon$. Thus,

$$R_{x,y} \le P_{x,\hat x} + \varepsilon\left(\left|g^{-1}(y)\right|-1\right). \tag{16}$$

We derive the first inequality in (10):

$$\begin{aligned}
H(X_2|Y_2,X_1=x) &= \sum_{y\in\mathcal{Y}} P_{x,\hat x}\log\frac{R_{x,y}}{P_{x,\hat x}} - \sum_{y\in\mathcal{Y}}\sum_{x'\in g^{-1}(y)\setminus\{\hat x\}} P_{x,x'}\log\frac{P_{x,x'}}{R_{x,y}} \\
&\stackrel{(a)}{\le} \sum_{y\in\mathcal{Y}} \left(R_{x,y}-P_{x,\hat x}\right) - \sum_{y\in\mathcal{Y}}\sum_{x'\in g^{-1}(y)\setminus\{\hat x\}} \varepsilon\log\frac{\varepsilon}{R_{x,y}} \\
&\stackrel{(b)}{\le} \sum_{y\in\mathcal{Y}} \left(R_{x,y}-P_{x,\hat x}\right) - \sum_{y\in\mathcal{Y}}\sum_{x'\in g^{-1}(y)\setminus\{\hat x\}} \varepsilon\log\varepsilon \\
&\stackrel{(c)}{\le} \sum_{y\in\mathcal{Y}} \left(\varepsilon-\varepsilon\log\varepsilon\right)\left(\left|g^{-1}(y)\right|-1\right) \\
&= (N-M)\,\varepsilon(1-\log\varepsilon),
\end{aligned}$$

where (a) is because $\log t\le t-1$ for $t>0$ and because $-t\log(t/R_{x,y})$ increases on $(0,\varepsilon]$, (b) follows because $R_{x,y}\le 1$, and (c) is due to (16).

For the second inequality in (10), note that $-\log\varepsilon\ge 0$ and that $\varepsilon\le-\log(1-\varepsilon)$, for all $\varepsilon\in(0,1)$, and that $N-M\le N(1-\varepsilon)$. Thus,

$$\begin{aligned}
(N-M)\,\varepsilon(1-\log\varepsilon) &= (N-M)\varepsilon - (N-M)\varepsilon\log\varepsilon \\
&\le N\varepsilon(1-\varepsilon) - N\varepsilon\log\varepsilon \\
&\le -N(1-\varepsilon)\log(1-\varepsilon) - N\varepsilon\log\varepsilon \\
&= N H_2(\varepsilon). \qquad\square
\end{aligned}$$

### 6.3 Proof of Proposition 4

The set $\mathcal{X}^K\setminus S_K$ of all unrealizable length-$K$ sequences is a clique in $G_{\mathbf{X}^{(K)}}$, and every vertex in this clique is connected to every vertex outside of it. To see this fact, take $x\in\mathcal{X}^K$ such that $\Pr(X^{(K)}_n=x)=0$. Since this state can not be accessed, w.l.o.g. the column of the adjacency matrix of $\mathbf{X}^{(K)}$ corresponding to $x$ is zero. This means that, for every $x'\in\mathcal{X}^K$, realizable or not, $\{x,x'\}\in E_{\mathbf{X}^{(K)}}$.

Since $\mathcal{X}^K\setminus S_K$ is a clique, and since every state in this clique is connected to an arbitrary $x^\ast\in S_K$, also $(\mathcal{X}^K\setminus S_K)\cup\{x^\ast\}$ is a clique. A trivial clique partition thus consists of this clique and all the trivial single-vertex cliques of the remaining vertices in $S_K$. This clique partition has size $|S_K|$. Since this clique partition may not be optimal, we get $M_K\le|S_K|$.

For the asymptotic result, note that $\frac{1}{K}\log M_K$ cannot be smaller than $\bar H(\mathbf{X})$. But since $\bar H(\mathbf{X})=\log\lambda$ is achievable by a Markov chain with adjacency matrix $A$ [11, 12], we have $\liminf_{K\to\infty}\frac{1}{K}\log M_K\ge\log\lambda$. Furthermore, the number of realizable length-$K$ sequences of a Markov chain behaves like $\lambda^K$ as $K$ increases. Specifically, $\lim_{K\to\infty}\frac{1}{K}\log|S_K|=\log\lambda$ [12]. Together with $M_K\le|S_K|$, this establishes (13).∎

## 7 Zero-Error Source Coding of Stationary Processes

Based on the classic papers [4] and [5], most results in zero-error information theory concern memoryless channels and sources. While there exist extensions to channels with memory, see [6] and the references therein, to the best of the authors’ knowledge sources with memory have not been dealt with yet. We believe that applying zero-error information theory to Markov chains motivates such an extension. This section presents a first result.

Assume the source produces two jointly stationary random processes $\mathbf{X}$ and $\mathbf{Z}$, and assume that the support of the marginal distribution of $(X_n,Z_n)$ is contained in $\mathcal{X}\times\mathcal{Z}$. Furthermore, let $S_K$ be the support of the joint distribution of $K$ samples of $\mathbf{X}$, i.e., of the distribution of $X_1^K$. Clearly, $S_K\subseteq\mathcal{X}^K$. We already mentioned that the $K$-fold co-normal product of $G_{(X,Z)}$ is the characteristic graph of the $K$-fold blocked source, assuming that the source is iid [8]. We claim that independence is not necessary, but that a memoryless side information channel suffices. As soon as $S_K\ne\mathcal{X}^K$, the edge set of $G_{(X,Z)^{(K)}}$ may become a strict superset of the edge set of $G_{(X,Z)}^{\vee K}$: only deterministic dependence, where not all sequences are realizable, can reduce the required alphabet size as compared to the iid assumption.

If the receiver obtains the side information via a discrete, memoryless channel, then we get

###### Proposition 5.

Let $\mathbf{X}$ be a stationary stochastic process with the support $S_K$ of the distribution of $X_1^K$ given as in Proposition 4, and let the side information $\mathbf{Z}$ be given via a DMC $W$, i.e.,

$$\Pr(X_1^K=x_1^K, Z_1^K=z_1^K) = \Pr(X_1^K=x_1^K)\prod_{i=1}^K W_{x_i,z_i}. \tag{17}$$

Then, the characteristic graph $G_{(X,Z)^{(K)}}$ has edge set

$$E_{(X,Z)^{(K)}} = E_{(X,Z)}^{\vee K} \cup \left\{\{x,x'\}\colon\ x\in\mathcal{X}^K,\ x'\in\mathcal{X}^K\setminus S_K\right\}. \tag{18}$$

Proposition 5 states that a deterministic temporal structure of the source can only decrease the clique partition number, making compression more efficient. If $\bar\chi(G_{(X,Z)^{(K)}})=1$, for some $K$, then no information needs to be transmitted because all information about $X_1^K$ is already contained in the side information $Z_1^K$. We believe that this analysis can be extended to more general side information structures and to variable-length zero-error source codes as in [13, 14].

###### Proof.

By Definition 2, $\{x,x'\}\in E_{(X,Z)^{(K)}}$, iff, for all $z\in\mathcal{Z}^K$,

$$\Pr(X_1^K=x, Z_1^K=z)\Pr(X_1^K=x', Z_1^K=z)=0. \tag{19}$$

With $x_i$ denoting the $i$-th coordinate of $x$, we write

$$\Pr(X_1^K=x, Z_1^K=z)=\Pr(X_1^K=x)\prod_{i=1}^K W_{x_i,z_i} \tag{20}$$

and see that (19) holds for all $z\in\mathcal{Z}^K$, iff at least one of the following conditions holds:

$$\Pr(X_1^K=x)=0, \tag{21a}$$
$$\Pr(X_1^K=x')=0, \tag{21b}$$
$$\forall z\in\mathcal{Z}^K\colon\ \prod_{i=1}^K W_{x_i,z_i}\prod_{j=1}^K W_{x'_j,z_j}=0. \tag{21c}$$

Equation (21a) (and, similarly, equation (21b)) implies that if a sequence $x$ is not realizable, then (19) holds for every $x'$ and all $z$. Hence, in $G_{(X,Z)^{(K)}}$, each unrealizable state is connected to every other state. With $S_K$ being the set of realizable sequences, we get $\{\{x,x'\}\colon\ x\in\mathcal{X}^K,\ x'\in\mathcal{X}^K\setminus S_K\}\subseteq E_{(X,Z)^{(K)}}$.

We may assume w.l.o.g. that $S_K=\mathcal{X}^K$, i.e., that all states are realizable. Then, since the $K$-fold co-normal product of $G_{(X,Z)}$ is the characteristic graph of a source emitting $(X,Z)$ iid, we have $\{x,x'\}\in E_{(X,Z)}^{\vee K}$, iff, for all $z\in\mathcal{Z}^K$,

$$\prod_{i=1}^K \mu_{x_i}W_{x_i,z_i}\prod_{j=1}^K \mu_{x'_j}W_{x'_j,z_j}=0. \tag{22}$$

Since we assume that all states are realizable, i.e., that $\mu$ is positive, this is equivalent to (21c). Hence, also $\{x,x'\}\in E_{(X,Z)^{(K)}}$ iff $\{x,x'\}\in E_{(X,Z)}^{\vee K}$. This covers all cases of (21). ∎

## Acknowledgments

The authors thank Ali Amjad, Andrei Nedelcu, and Wolfgang Utschick for fruitful discussions. The work of Bernhard C. Geiger was partially funded by the Erwin Schrödinger Fellowship J 3765 of the Austrian Science Fund.

### References

1. C. D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing, 2nd ed.   Cambridge, MA: MIT Press, 2000.
2. D. Wilkinson, Stochastic Modelling for Systems Biology, ser. Chapman & Hall/CRC Mathematical & Computational Biology.   Boca Raton, FL: Taylor & Francis, 2006.
3. B. C. Geiger and C. Temmel, “Lumpings of Markov chains, entropy rate preservation, and higher-order lumpability,” J. Appl. Probab., vol. 51, no. 4, pp. 1114–1132, Dec. 2014, extended version available: arXiv:1212.4375 [cs.IT].
4. C. E. Shannon, “The zero error capacity of a noisy channel,” IEEE Trans. Inf. Theory, vol. 2, no. 3, pp. 8–19, Sep. 1956.
5. H. Witsenhausen, “The zero-error side information problem and chromatic numbers (corresp.),” IEEE Trans. Inf. Theory, vol. 22, no. 5, pp. 592–593, Sep. 1976.
6. G. Cohen, E. Fachini, and J. Körner, “Zero-error capacity of binary channels with memory,” IEEE Trans. Inf. Theory, vol. 62, no. 1, pp. 3–7, Jan. 2016.
7. R. Diestel, Graph theory, 3rd ed., ser. Graduate Texts in Mathematics.   Berlin: Springer-Verlag, 2005, vol. 173.
8. J. Körner and A. Orlitsky, “Zero-error information theory,” IEEE Trans. Inf. Theory, vol. 44, no. 6, pp. 2207–2229, Oct. 1998.
9. J. Nayak, E. Tuncel, and K. Rose, “Zero-error source-channel coding with side information,” IEEE Trans. Inf. Theory, vol. 52, pp. 4626–4629, Oct. 2006.
10. B. C. Geiger and C. Temmel, “Information-preserving Markov aggregation,” in Proc. IEEE Information Theory Workshop (ITW), Seville, Sep. 2013, pp. 258–262, extended version: arXiv:1304.0920 [cs.IT].
11. J.-C. Delvenne and A.-S. Libert, “Centrality measures and thermodynamic formalism for complex networks,” Phys. Rev. E, vol. 83, pp. 046 117–1–046 117–7, Apr. 2011.
12. Z. Burda, J. Duda, J. Luck, and B. Waclaw, “Localization of the maximal entropy random walk,” Physical Review Letters, vol. 102, no. 16, pp. 160 602–1–160 602–4, Apr. 2009.
13. P. Koulgi, E. Tuncel, S. L. Regunathan, and K. Rose, “On zero-error source coding with decoder side information,” IEEE Trans. Inf. Theory, vol. 49, pp. 99–111, Jan. 2003.
14. N. Alon and A. Orlitsky, “Source coding and graph entropies,” IEEE Trans. Inf. Theory, vol. 42, pp. 1329–1339, Sep. 1996.