
# Kolmogorov complexity version of Slepian-Wolf coding

Marius Zimand Department of Computer and Information Sciences, Towson University, Baltimore, MD. http://triton.towson.edu/~mzimand
###### Abstract

Alice and Bob are given two correlated n-bit strings x1 and, respectively, x2, which they want to losslessly compress and send to Zack. They can either collaborate by sharing their strings, or work separately. We show that there is no disadvantage in the second scenario: Alice and Bob, without knowing the other party’s string, can compress their strings to almost minimal description length in the sense of Kolmogorov complexity. Furthermore, compression takes polynomial time and can be made at any combination of lengths that satisfy some necessary conditions (modulo additive polylogarithmic terms). More precisely, there exist probabilistic algorithms E1, E2, and a deterministic algorithm D, with E1 and E2 running in polynomial time, having the following behavior: if n1, n2 are two integers satisfying n1 ≥ C(x1 ∣ x2), n2 ≥ C(x2 ∣ x1) and n1 + n2 ≥ C(x1, x2), then, for i ∈ {1, 2}, Ei on input xi and ni outputs a string of length ni + polylog(n) such that D on input (E1(x1), E2(x2)) reconstructs (x1, x2) with high probability (where C(x) denotes the plain Kolmogorov complexity of x, and C(x ∣ y) is the complexity of x conditioned by y). Our main result is more general, as it deals with the compression of any constant number of correlated strings. It is an analog in the framework of algorithmic information theory of the classic Slepian-Wolf Theorem, a fundamental result in network information theory, in which x1 and x2 are realizations of two discrete random variables representing n independent draws from a joint distribution. In the classical result, the decompressor needs to know the joint distribution of the sources. In our result no type of independence is assumed and the decompressor does not have any prior information about the sources that are compressed.

## 1 Introduction

The Slepian-Wolf Theorem [SW73] is the analog of Shannon’s Source Coding Theorem for the case of distributed correlated sources. It characterizes the compression rates for such sources. To illustrate the theorem, let us consider a data transmission scheme with two senders, Alice and Bob, and one receiver, Zack (see Figure 1). Alice has as input an n-bit string x, Bob has an n-bit string y. Alice uses the encoding function E1 to compress her n-bit string to length n1, and sends E1(x) to Zack. Bob, separately, uses the encoding function E2 to compress his n-bit string to length n2 and sends E2(y) to Zack. We assume that the communication channels Alice → Zack and Bob → Zack are noise-free, and that there is no communication between Alice and Bob. Zack is using a decoding function D and the common goal of all parties is that D(E1(x), E2(y)) = (x, y), for all (x, y) in the domain of interest (which is defined by the actual model or by the application). In a randomized setting, we allow the previous equality to fail with some small error probability ε. Of course, Alice can send the entire x and Bob can send the entire y, but this seems to be wasteful if x and y are correlated. We are interested to find what values n1 and n2 can take so that the goal is achieved, when the strings x and y are jointly correlated.

The Slepian-Wolf theorem takes the standard stance in information theory, which assumes that x and y are realizations of some random variables X and, respectively, Y. Furthermore, as is common in information theory, (X, Y) are assumed to be 2-Discrete Memoryless Sources (2-DMS), which means that X = X1 … Xn and Y = Y1 … Yn, where the Xi’s are i.i.d. Bernoulli random variables, the Yi’s are also i.i.d. Bernoulli random variables, and each pair (Xi, Yi) is drawn according to a joint distribution pX,Y. In other words, (X, Y) consists of n independent draws from a joint distribution on pairs of bits. Given the joint distribution pX,Y and X and Y of the specified type, the problem amounts to finding the set of values n1 = R1·n and n2 = R2·n such that there exist E1, E2 and D as above with D(E1(X), E2(Y)) = (X, Y) with probability converging to 1 as n grows. In information theory parlance, we want to determine the set of achievable transmission rates (R1, R2). By the Source Coding Theorem, it is not difficult to see that it is necessary that R1 ≥ H(X1 ∣ Y1), R2 ≥ H(Y1 ∣ X1) and R1 + R2 ≥ H(X1, Y1), where H is the Shannon entropy function. The Slepian-Wolf theorem states that these relations are essentially sufficient, in the sense that any pair (R1, R2) satisfying strictly the above three inequalities is a pair of achievable rates, if n is sufficiently large (“strictly” means that “>” replaces “≥”; see, for example, [CT06] for the exact statement).

What is surprising is that these optimal achievable rates can be realized with Alice and Bob doing their encoding separately. For example, if H(X1) = H(Y1), then any pair (R1, R2) with R1 = R2 > H(X1, Y1)/2 is a pair of achievable rates, which means that Alice can compress her n-bit realization of X to approximately n·H(X1, Y1)/2 bits, without knowing Bob’s realization of Y, and Bob can do the same. They cannot do better even if they collaborate!
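Readers who want to experiment with the rate region can compute it directly from a joint distribution. The following sketch (the joint distribution and the rate values are made up for illustration; they do not come from the text) computes H(X∣Y), H(Y∣X) and H(X,Y) and tests membership in the Slepian-Wolf region:

```python
from math import log2

# Hypothetical 2-DMS: each (X_i, Y_i) is an independent draw from this
# joint distribution on pairs of bits (values chosen only for illustration).
joint = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}

def H(dist):
    """Shannon entropy (in bits) of a distribution given as {outcome: prob}."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

# Marginals of X and Y.
pX, pY = {}, {}
for (x, y), p in joint.items():
    pX[x] = pX.get(x, 0) + p
    pY[y] = pY.get(y, 0) + p

HXY = H(joint)
HX_given_Y = HXY - H(pY)   # H(X|Y) = H(X,Y) - H(Y)
HY_given_X = HXY - H(pX)   # H(Y|X) = H(X,Y) - H(X)

def achievable(R1, R2, slack=1e-9):
    """Slepian-Wolf: (R1, R2) is achievable iff R1 > H(X|Y),
    R2 > H(Y|X) and R1 + R2 > H(X,Y) (strict inequalities)."""
    return (R1 > HX_given_Y + slack and R2 > HY_given_X + slack
            and R1 + R2 > HXY + slack)
```

For the symmetric distribution above, H(X) = H(Y) = 1, so both parties can compress to any rate above H(X,Y)/2, matching the example in the text.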

The Slepian-Wolf theorem completely characterizes the set of achievable rates for distributed lossless compression for the case of 2-DMS, and the result actually holds for an arbitrary number of senders (Theorem 15.4.2, [CT06]). However, the type of correlation between X and Y given by the 2-DMS model is rather simple. In many applications the pairs (Xi, Yi) quantify some stochastic process at different times i, and it is not realistic to assume independence between the values at different i’s. The Slepian-Wolf theorem has been extended to sources that are stationary and ergodic [Cov75], but these also capture relatively simple correlations.

Distributed correlated sources can be alternatively studied using algorithmic information theory, also known as Kolmogorov complexity, which works for individual strings without any type of independence assumption, and in fact without assuming any generative model that produces the strings. We recall that C(x ∣ y) is the Kolmogorov complexity of x conditioned by y, i.e., the length of a shortest program that computes x given y in a fixed universal programming system. C(x ∣ y) is also called the minimum description length of x given y. If y is the empty string, we simply write C(x) instead of C(x ∣ y). One remarkable result in this framework is Muchnik’s theorem [Muc02], which states that there exist algorithms E and D such that for all n and for all n-bit strings x and y, E on input x, C(x ∣ y) and O(log n) help bits outputs a string p of length C(x ∣ y) + O(log n), and D on input p, y, and O(log n) help bits reconstructs x. Muchnik’s theorem relates to the asymmetric version of the above distributed transmission problem in which only Alice compresses her x while Bob sends the entire y (or, in an equivalent scenario, Zack already knows y). It says that, given C(x ∣ y), Alice can compute from her string x and only O(log n) additional help bits a string p of minimum description length such that Zack, using y, p and O(log n) help bits, can reconstruct x. Muchnik’s theorem has been strengthened in several ways. Musatov, Romashchenko and Shen [MRS11] have obtained a version of Muchnik’s theorem for space-bounded Kolmogorov complexity, in which both compression and decompression are space-efficient. Romashchenko [Rom05] has extended Muchnik’s theorem to the general (i.e., non-asymmetric) case. His result is valid for any constant number of senders, but, for simplicity, we present it for the case of two senders: For any two n-bit strings x and y and any two numbers n1 and n2 such that n1 ≥ C(x ∣ y), n2 ≥ C(y ∣ x), and n1 + n2 ≥ C(x, y), there exist two strings p and q, of lengths n1 + O(log n) and, respectively, n2 + O(log n), such that C(p ∣ x) = O(log n), C(q ∣ y) = O(log n) and C(x, y ∣ p, q) = O(log n).
In words, for any n1 and n2 satisfying the necessary conditions, Alice can compress x to a string p of length just slightly larger than n1, and Bob can compress y to a string q of length just slightly larger than n2, such that Zack can reconstruct (x, y) from (p, q), provided all the parties use a few help bits. These results raise the following questions: (a) can the help bits be eliminated? (Footnote 1: In Muchnik’s theorem, Alice computes a program p of minimum description length such that U(p, y) = x from x, C(x ∣ y) and the help bits, where U is the universal Turing machine underlying Kolmogorov complexity. One can hope to eliminate the help bits (as we ask in question (a)), but not the C(x ∣ y) component. This is not possible even when y is the empty string. Indeed, it is known that for some strings x, the computation of C(x) from x, and therefore also the computation of a short program for x, requires that some information of size log n bits is available [BS14, Gác74].) and (b) is it possible to implement the protocol efficiently, i.e., in polynomial time?

Bauwens et al. [BMVZ13], Teutsch [Teu14] and Zimand [Zim14] have obtained versions of Muchnik’s theorem with polynomial-time compression, but in which the help bits are still present. In fact, their results are stronger in that the compression procedure on input x outputs a polynomial-size list of strings guaranteed to contain a short program for x given y. This is called list approximation. Note that using O(log n) help bits, the decoding procedure can pick the right element from the list, re-obtaining Muchnik’s theorem. The gain is that this decoding procedure halts even with incorrect help bits, even though the result may not be the desired x. Next, Bauwens and Zimand [BZ14] have eliminated the help bits in Muchnik’s theorem, at the cost of introducing a small error probability. Their result can be reformulated as follows. (Footnote 2: In Theorem 3.2 in [BZ14], y is the empty string, but the proof works without modifications for any y.)

###### Theorem 1.1 ([Bz14]).

There exist a probabilistic algorithm E and a deterministic algorithm D such that E runs in polynomial time, and for all n-bit strings x and y and for every rational number ε > 0,

1. E on input x, C(x ∣ y) and ε outputs a string p of length C(x ∣ y) + polylog(n/ε),

2. D on input p and y outputs x, with probability 1 − ε.

Thus, in the asymmetric case, Alice can compress her input string in polynomial time to a length that is close to the minimum description length C(x ∣ y) (closeness is within a polylog additive term). The decoding algorithm does not run in polynomial time, and this is unavoidable if compression is done at this level of optimality, because there exist so-called deep strings (these are strings that have short descriptions, but their decompression from a short description takes longer than, say, polynomial time).

In this paper, we prove the analog of Theorem 1.1 for the general non-asymmetric case, i.e., the case in which the number of senders is an arbitrary constant and all senders can compress their inputs. For simplicity, let us consider again the case with two senders, Alice and Bob, and one receiver, Zack. Alice and Bob are using probabilistic encoding algorithms E1 and, respectively, E2, Zack is using the decoding algorithm D, and they want that for all ε > 0 and for all n-bit strings x and y, D(E1(x), E2(y)) = (x, y) with probability 1 − ε. We denote by |E1(x)| the length of x’s encoding, and by |E2(y)| the length of y’s encoding. How short can these lengths be? By counting arguments, one can see that

 |E1(x)| ≥ C(x ∣ y) + log(1 − ε) − O(1)
 |E2(y)| ≥ C(y ∣ x) + log(1 − ε) − O(1)
 |E1(x)| + |E2(y)| ≥ C(x, y) + log(1 − ε) − O(1).

Our result implies that the above requirements are also sufficient, except for a small overhead of polylog size. Namely, for any two integers n1 and n2 such that n1 ≥ C(x ∣ y), n2 ≥ C(y ∣ x) and n1 + n2 ≥ C(x, y), it is possible to achieve |E1(x)| ≤ n1 + polylog(n), |E2(y)| ≤ n2 + polylog(n). Moreover, E1 and E2 are polynomial-time probabilistic algorithms. If we do not insist on E1 and E2 running in polynomial time, the overhead can be reduced to O(log n).

For the general case, we need to introduce some notation. Let ℓ be the number of senders. For any integers k and ℓ, the set {1, 2, …, ℓ} is denoted [ℓ], and the set {k, k+1, …, ℓ} is denoted [k..ℓ] (if k > ℓ, this set is empty). If we have an ℓ-tuple of strings (x_1, …, x_ℓ), and V ⊆ [ℓ], then the |V|-tuple (x_i)_{i∈V} is denoted x_V.

###### Theorem 1.2.

(Main Result) There exist probabilistic algorithms E_1, …, E_ℓ, a deterministic algorithm D, and a function τ(n, ε) = (log(n/ε))^{O_ℓ(1)} such that E_1, …, E_ℓ run in polynomial time, and for every rational number ε > 0, for every ℓ-tuple of integers (n_1, …, n_ℓ), and for every ℓ-tuple of n-bit strings (x_1, …, x_ℓ), if

 C(x_V ∣ x_{[ℓ]−V}) ≤ ∑_{i∈V} n_i, for all V ⊆ [ℓ], (1)

then

• For all i ∈ [ℓ], E_i on input x_i, n_i and ε outputs a string p_i of length at most n_i + τ(n, ε),

• D on input (p_1, …, p_ℓ) outputs (x_1, …, x_ℓ), with probability 1 − ε.

Notes

The constraints (1) are necessary up to negligible terms. For example, if there are ℓ = 2 senders, having, respectively, the n-bit strings x1 and x2, and they compress them, respectively, to lengths n1 and n2 such that decompression succeeds with probability 1 − ε, then it is necessary that n1 ≥ C(x1 ∣ x2) + log(1 − ε) − O(1) and n1 + n2 ≥ C(x1, x2) + log(1 − ε) − O(1).

Compared to Romashchenko’s result from [Rom05], we have eliminated the help bits, and thus our encoding and decoding are effective. Moreover, encoding is done in polynomial time (however, as in Theorem 1.1 and for the same reason, decoding cannot be done in polynomial time). The cost is that the encoding procedure is probabilistic and thus there is a small error probability. The proof of Theorem 1.2 is inspired by Romashchenko’s approach, but the technique is quite different.

The models in the classical Slepian-Wolf theorem and in Theorem 1.2 are different, and therefore, strictly speaking, the results are not directly comparable. However, there is a relation between the Shannon entropy of DMS random variables and the Kolmogorov complexity of the elements in their support. Namely, if X = X1 … Xn is a DMS, that is, it consists of n independent copies of a {0,1}-valued random variable with distribution p, then, for every ε > 0, there exists a constant c such that |C(X) − n·H(p)| ≤ c·√n with probability 1 − ε. Using this relation, the classical theorem can be obtained from the Kolmogorov complexity version.
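This relation can be illustrated empirically, using a practical compressor as a crude, non-optimal stand-in for Kolmogorov complexity (which is, of course, uncomputable); the parameters below are arbitrary:

```python
import random
import zlib
from math import log2

def H(p):
    """Binary entropy of a Bernoulli(p) bit, in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

rng = random.Random(3)
n, p = 8000, 0.1                      # a memoryless biased-coin source

# One realization of the source, as a '0'/'1' text string.
bits = ''.join('1' if rng.random() < p else '0' for _ in range(n))

# Compressed length in bits; zlib overshoots n*H(p) but shows the trend:
# the string is highly compressible, far below its raw length.
compressed_bits = 8 * len(zlib.compress(bits.encode(), 9))
```

For p = 0.1 the entropy rate is H(0.1) ≈ 0.469 bits per symbol, so the compressed length sits well below the raw 8·n bits of the text encoding, though zlib does not reach the information-theoretic limit.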

Here are two shortcomings of the classical Slepian-Wolf Theorem: (a) it assumes strong independence properties of the sources (i.e., the memoryless property), and (b) decompression requires the knowledge of the distribution of the sources. There are versions of this theorem which improve either (a) or (b), but not both. For example, Csiszár [Csi82] has shown source coding theorems with universal coding, which means that the same compression and decompression algorithms work for a large class of sources, without “knowing” their distributions. But the proof relies on the memoryless property. Miyake and Kanaya [MK95] have obtained a version of the Slepian-Wolf theorem for general random variables, using information-spectrum methods introduced by Han and Verdú [HV93]. But their proof does not seem to allow universal coding and, moreover, it has an intrinsic asymptotic nature. Theorem 1.2 does not require any type of independence; in fact, it does not assume any generative model. Also, the same compression and decompression algorithms work for all strings satisfying the necessary bounds (1), i.e., there is universal coding.

In the classical Slepian-Wolf theorem, the senders and the receiver share a public string of exponential length. In Theorem 1.2, the parties do not share any information.

Theorem 1.2 is interesting even for the case of single-source compression (i.e., ℓ = 1). Note that, by performing an exhaustive search, we obtain a procedure that on input x and C(x) outputs a shortest program for x. However, any such procedure runs in time larger than any computable function [BZ14]. In contrast, Bauwens and Zimand (see Theorem 1.1) have shown that if we use randomization, one can find a short program for x in polynomial time, starting with input (x, C(x)). Thus, computing a short program for x from x and C(x) is an interesting example of a task that probabilistically can be done in polynomial time, but deterministically requires time larger than any computable function. However, the requirement that C(x) is known exactly is quite demanding. The following corollary, which is just Theorem 1.2 with ℓ = 1, shows that in fact it is sufficient to have an upper bound n1 ≥ C(x), which makes the result more amenable to applications. This solves an open question from [TZ16].

###### Corollary 1.3.

There exist a probabilistic algorithm E and a deterministic algorithm D such that E runs in polynomial time, and for every n, for every n-bit string x, every positive rational number ε, and for every integer n1 ≥ C(x),

• E on input x, n1 and ε outputs a string p of length at most n1 + (log(n/ε))^{O(1)},

• D on input p outputs x with probability 1 − ε.

## 2 Proof of Theorem 1.2

### 2.1 Combinatorial tool: graphs with the rich owner property

The key tool in the proof is a certain type of bipartite graph, which we call a graph with the rich owner property. Similar graphs, bearing the same name, were used in [BZ14], but the graphs in this paper have a stronger property. We recall that in a bipartite graph, the nodes are partitioned into two sets, L (the left nodes) and R (the right nodes), and all edges connect a left node to a right node. We allow multiple edges between two nodes. In all the graphs in this paper, all the left nodes have the same degree, called the left degree. Specifically, we use bipartite graphs with L = {0,1}^n, R = {0,1}^m, and with left degree D = 2^d. We label the edges outgoing from x ∈ L with strings z ∈ {0,1}^d. We typically work with a family of graphs indexed on n, and such a family of graphs is computable if there is an algorithm that on input (x, z), where x ∈ {0,1}^n and z ∈ {0,1}^d, outputs the z-th neighbor of x. Some of the graphs also depend on a rational number δ > 0. A computable family of graphs is explicit if the above algorithm runs in time poly(n).

We now introduce informally the notions of a rich owner and of a graph with the rich owner property. Let B ⊆ L. The B-degree of a right node is the number of its neighbors that are in B. Roughly speaking, a left node is a rich owner with respect to B if most of its right neighbors are “well-behaved,” in the sense that their B-degree is not much larger than |B|·D/|R|, the average right degree when the left side is restricted to B. One particularly interesting case, which is used many times in this paper, is when most of the neighbors of a left node x have B-degree 1, i.e., when x “owns” most of its right neighbors. A graph has the rich owner property if, for every B ⊆ L, most of the left nodes in B are rich owners with respect to B. In the formal definition, we replace the average right degree with an arbitrary value, but since in applications this value is approximately equal to the average right degree, the above intuition should be helpful.
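The notions of B-degree and ownership can be made concrete on a toy graph (this is only an illustration of the definitions; the random graph below has no relation to the extractor-based construction of Theorem 2.3, and all sizes are made up):

```python
import random

random.seed(0)

# Toy bipartite graph: left nodes 0..L-1, right nodes 0..R-1, left degree D.
# Neighbor lists are random; multi-edges are allowed, as in the text.
L, R, D = 256, 64, 8
nbrs = {x: [random.randrange(R) for _ in range(D)] for x in range(L)}

def owned_fraction(x, B):
    """Fraction of x's edges that go to right nodes of B-degree 1,
    i.e., right nodes that x does not share with any other node of B
    (and that x itself reaches by a single edge)."""
    bdeg = {}
    for u in B:
        for r in nbrs[u]:
            bdeg[r] = bdeg.get(r, 0) + 1
    return sum(1 for r in nbrs[x] if bdeg[r] == 1) / D
```

A left node x ∈ B would be a rich owner (small regime) when `owned_fraction(x, B)` is at least 1 − δ; in the real construction this holds for almost all of B.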

The precise definition of a (k, δ)-rich owner with respect to B is as follows. There are two regimes of interest, depending on the size of B.

###### Definition 2.1.

Let G be a bipartite graph as above and let B be a subset of L. We say that x ∈ B is a (k, δ)-rich owner with respect to B if the following holds:

• small regime case: If |B| ≤ 2^k, then at least a (1 − δ) fraction of x’s neighbors have B-degree equal to 1, that is, they are not shared with any other nodes in B. We also say that x owns p with respect to B if p is a neighbor of x and the B-degree of p is 1.

• large regime case: If |B| > 2^k, then at least a (1 − δ) fraction of x’s neighbors have B-degree at most (2/δ²)·|B|/2^k.

If x is not a (k, δ)-rich owner with respect to B, then it is said to be a (k, δ)-poor owner with respect to B.

###### Definition 2.2.

A bipartite graph G has the (k, δ)-rich owner property if for every set B ⊆ L, all nodes in B, except at most δ·|B| of them, are (k, δ)-rich owners with respect to B.

There are several notions in the literature which are related to our Definition 2.2, the main difference being that they require some non-congestion property similar to rich ownership to hold only for some subsets B. Reingold and Raz [RR99] define extractor-condenser pairs, in which only subsets B of size approximately 2^k matter. As already mentioned, Bauwens and Zimand [BZ14] use a type of graph also called graphs with the rich owner property, which are close to the extractor-condenser pairs from [RR99]. Capalbo et al. [CRVW02] construct lossless expanders, which only consider the subsets in the small regime case. In our application, we need to consider subsets B of any size, and this leads to Definition 2.2 and the distinction between the small regime case and the large regime case.

The following theorem provides the type of graph that we use. The proof relies on the extractor from [RRV99] and uses a combination of techniques from [RR99], [CRVW02], and [BZ14]. It is presented in Section 4.

###### Theorem 2.3.

For all natural numbers n and k ≤ n and for every rational number δ > 0, there exists an explicit bipartite graph G that has the (k, δ)-rich owner property with the following parameters:

1. L = {0,1}^n,

2. R = {0,1}^{k+η}, with η = 2γ + log(2/δ²),

3. left degree D = 2^γ,

where γ = O(log³(n/δ)).

### 2.2 Proof overview

For this proof sketch, we consider the case with ℓ = 2 senders, which have as input the n-bit strings x1 and, respectively, x2. By hypothesis, the compression lengths n1 and n2 satisfy

 n1 ≥ C(x1 ∣ x2), n2 ≥ C(x2 ∣ x1), n1 + n2 ≥ C(x1, x2).

The two senders use graphs G1 and, respectively, G2, with the (n1, δ)- and, respectively, (n2, δ)-rich owner property, obtained from Theorem 2.3. The left nodes in both graphs are the set of n-bit strings, the right nodes in G1 are the binary strings of length n1 + η, and the right nodes in G2 are the binary strings of length n2 + η, where η is the polylogarithmic overhead from Theorem 2.3. Sender 1 picks p1, a random neighbor of x1 (viewed as a left node) in G1, and sender 2 picks p2, a random neighbor of x2 (viewed as a left node) in G2.

We need to explain how the receiver can reconstruct x1 and x2 from p1 and p2. Most of the statements below hold with probability 1 − O(δ). For conciseness, when this is clear, we omit mentioning this fact. We first assume that the decompression procedure knows C(x2), C(x1 ∣ x2), C(x2 ∣ x1) and C(x1, x2) (this is usually called the complexity profile of x1 and x2). We will see later how to eliminate this assumption.

The first case to analyze is when C(x2) ≤ n2. Then x2 can be constructed as follows. Let B = {u ∈ {0,1}^n : C(u) ≤ C(x2)}. This is a subset of the left nodes in G2 that contains x2, and is in the small regime case (because C(x2) ≤ n2). The set of poor owners in G2 w.r.t. B has size at most δ·|B|. Since the set of poor owners w.r.t. B can be effectively enumerated given C(x2), we derive that every poor owner has complexity less than C(x2) − log(1/δ) + O(log n) < C(x2). So, x2 is a rich owner, which implies that with probability 1 − δ, x2 does not share p2 with any other nodes in B. It follows that x2 can be constructed from p2 by enumerating B till we encounter a neighbor of p2. As we have seen, with probability 1 − δ, this neighbor is x2. Next, we take B′ = {u ∈ {0,1}^n : C(u ∣ x2) ≤ C(x1 ∣ x2)}, and in a similar way we show that B′ is in the small regime case in G1, and x1 is a rich owner w.r.t. B′. Therefore, with probability 1 − δ, x1 owns p1. Thus, if we enumerate B′ till we encounter a neighbor of p1, we obtain x1.
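The small-regime decoding step can be mimicked on a toy graph. In the sketch below the set B is simply given as an explicit list (a stand-in for the enumeration of {u : C(u) ≤ C(x2)}, which is not something one can actually compute), and the graph is random rather than extractor-based:

```python
import random

random.seed(1)

# Toy bipartite graph with many right nodes, so that sharing is unlikely.
L, R, D = 64, 4096, 4
nbrs = {x: [random.randrange(R) for _ in range(D)] for x in range(L)}

def encode(x):
    """Sender: a uniformly random right neighbor of x."""
    return random.choice(nbrs[x])

def decode(p, B):
    """Receiver: enumerate B until some element has p as a neighbor.
    This returns the sender's input whenever that input owns p w.r.t. B."""
    for u in B:
        if p in nbrs[u]:
            return u
    return None

B = list(range(16))   # stands in for the enumerable set containing the input
x = 7                 # the sender's input, a member of B
p = encode(x)
```

In the actual proof the graph guarantees that, with probability 1 − δ over the choice of p, no other element of B is a neighbor of p, so the first hit of the enumeration is the sender's input.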

The other case is when C(x2) > n2. We can show that, with high probability,

 C(p2) =* n2, (2)

where =* means that the equality holds up to poly-logarithmic terms; we use ≤* and ≥* in a similar way. For that, again we consider B = {u ∈ {0,1}^n : C(u) ≤ C(x2)}. This is a subset of the left nodes of G2 that is now in the large regime case. In the same way as above, x2 is a rich owner in G2 w.r.t. B, which implies that with probability 1 − δ, it shares p2 with at most (2/δ²)·|B|/2^{n2} other nodes in B. Taking into account that B can be enumerated given C(x2), it follows that x2 can be constructed from p2, C(x2), and x2’s rank among p2’s neighbors in B, which implies that C(x2) ≤* C(p2) + (C(x2) − n2). So, C(p2) ≥* n2. Since the length of p2 is n2 + polylog(n), we derive that C(p2) =* n2.

The next observation is that, given x1 and p2, the receiver can construct x2. At this moment, the receiver does not have x1, so actually x2 will be constructed later, after the receiver has x1. However, the observation is helpful even at this stage. Let us first see why the observation is true. Consider B = {u ∈ {0,1}^n : C(u ∣ x1) ≤ C(x2 ∣ x1)}. This is a subset of left nodes in G2 that contains x2, and is in the small regime case (because C(x2 ∣ x1) ≤ n2). Similarly to the argument used earlier, x2 is a rich owner w.r.t. B. So, x2 owns p2 w.r.t. B, which implies that x2 can be obtained by enumerating the elements of B till we encounter one that is a neighbor of p2.

The observation implies that C(x2 ∣ x1, p2) =* 0. Since it also holds that C(p2 ∣ x2) =* 0 (because p2 can be obtained from x2 and its index among x2’s neighbors in G2, which takes polylog bits to describe), we have

 C(x2, x1) =* C(p2, x1). (3)

Then, by (2) and (3),

 C(x1 ∣ p2) =* C(x1, p2) − C(p2) =* C(x1, x2) − n2, (4)

where the first =* follows from the chain rule. The last estimation allows the receiver to reconstruct x1 from p1 and p2. For that, consider B = {u ∈ {0,1}^n : C(u ∣ p2) ≤ C(x1, x2) − n2}. Our estimation (4) of C(x1 ∣ p2) implies that x1 is in B (in this proof sketch we ignore the * in equation (4)). Next, by the same argument as above, the poor owners in G1 have complexity conditioned by p2 less than C(x1, x2) − n2, and this implies that x1 is not a poor owner. Since C(x1, x2) − n2 ≤ n1, B is in the small regime case. This implies that with high probability x1 owns p1 in G1 w.r.t. B. So, if we enumerate B till we encounter a neighbor of p1, we obtain x1.

With x1 in hand, the receiver constructs x2, using the earlier observation.

Decompression without knowledge of the input’s complexity profile. As promised, we show how to eliminate the assumption that the decompressor knows the complexity profile of x1 and x2. The idea is to let D run the above procedure for all possibilities of the complexity profile and use hashing to isolate the correct run (or some run that produces the same output). Since x1 and x2 are n-bit strings, there are only poly(n) possibilities for the triplet (C(x1 ∣ x2), C(x2 ∣ x1), C(x1, x2)), and hashing will add only O(log n) bits. For hashing we use the following result. Alternatively, it is possible to use the almost universal hash functions of Naor and Naor [NN93], or Krawczyk [Kra94].

###### Lemma 2.4 ( [Bz14]).

Let x_1, …, x_s be distinct n-bit strings, which we view in some canonical way as integers at most 2^n.

Let q_j denote the j-th prime number and let T = (1/ε)·s·n, where ε > 0 is a rational number.

For every i ∈ [s], for at least a (1 − ε) fraction of the j in [T], the value of x_i mod q_j is unique in the sequence (x_1 mod q_j, …, x_s mod q_j).

For i ∈ {1, 2}, Sender i, who has input xi, will send, in addition to pi (a random neighbor of xi in Gi, as we have seen above), also the string hash_i, which is computed as follows. Taking into account that for any n-bit string u, C(u) ≤ n + O(1), we let s = poly(n) be an upper bound for the number of all triplets (C(x1 ∣ x2), C(x2 ∣ x1), C(x1, x2)), where x1 and x2 are n-bit strings, and let T = (1/ε)·s·n. Now, hash_i = (j, xi mod q_j), where q_j is a prime number chosen at random from the first T prime numbers. The decompressor runs in parallel the procedure presented above for all guesses of the complexity profile and halts when the first of the parallel runs outputs a pair (x̂1, x̂2) that matches both received hashes. Note that some of the parallel runs may not halt, but the run corresponding to the correct guess of the complexity profile halts and yields, as we have seen, (x1, x2) with high probability. By Lemma 2.4, the probability that a run halts with x̂1 ≠ x1 or x̂2 ≠ x2 but with both hashes matching is at most O(ε). Consequently, this procedure reconstructs (x1, x2) correctly with high probability. Since the T-th prime number is bounded by poly(n/ε) and can be found in time polynomial in n/ε, the length of each of the compressed strings increases with only O(log(n/ε)) bits, and the running time of compression is still polynomial.
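The fingerprinting step can be sketched as follows (toy sizes; the helper names `first_primes`, `fingerprint`, `matches` and the sample values are ours, not from the paper):

```python
import random

def first_primes(T):
    """The first T prime numbers, by trial division (fine at toy sizes)."""
    primes = []
    c = 2
    while len(primes) < T:
        if all(c % q for q in primes if q * q <= c):
            primes.append(c)
        c += 1
    return primes

def fingerprint(x, T, rng):
    """Pick a random prime among the first T and return (index, x mod p)."""
    primes = first_primes(T)
    j = rng.randrange(T)
    return j, x % primes[j]

def matches(x, fp, T):
    """Check a candidate x against a received fingerprint."""
    j, h = fp
    return x % first_primes(T)[j] == h

# Distinct candidate outputs x_1..x_s; with T ~ (1/eps)*s*n primes, a random
# prime separates the true value from every other candidate w.h.p., because
# a difference of two n-bit numbers has at most n prime factors.
rng = random.Random(2)
xs = [0b101101, 0b110011, 0b100001, 0b111111]
T = 50
fp = fingerprint(xs[0], T, rng)
```

The decompressor would accept the first halting run whose output satisfies `matches`; the true output always matches its own fingerprint, and false matches are rare by the prime-counting argument above.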

If the number of senders ℓ is larger than 2, several technical complications arise. In the case ℓ = 2, sketched above, the decoding algorithm needs to know the complexity profile and to be able to enumerate the various sets B. As we have seen, we can assume that the receiver knows the complexity profile of the input strings, and therefore the decoding algorithm has these values. When ℓ > 2, the various sets B are defined in terms of complexities involving certain combinations of the input strings x_i’s and of the randomly picked right neighbors, the p_i’s. To give just one example, the complexity C(x_k ∣ x_{[k−1]}, p_{[k+1..ℓ]}) is required at some point. The decoding algorithm needs to obtain, with high probability, good approximations of such complexities from the complexity profile of the input strings (see Lemma 2.7). Another technical aspect is that the approximation slacks (hidden above in the =* notation, and also those arising in the estimations of the complexities of “combined” tuples of x_i’s and p_i’s) cannot be ignored as we did in this proof sketch. To handle this, the senders use graphs with different δ’s and increasing overhead in the length of the right neighbors. More precisely, sender k (for every k ∈ [ℓ]) uses a graph G_k with the (n_k, δ_k)-rich owner property, in which the right nodes have length n_k + η_k, where the additional η_k is needed to handle the effect of approximations. The overhead η_k is bounded by (log n)^{O_ℓ(1)}, where O_ℓ(1) denotes a constant that depends on ℓ. In spite of these technicalities, the core ideas of the proof are those presented in the above sketch.

### 2.3 Parameters

We fix n, the length of the input strings x_1, …, x_ℓ.

We use a constant a that will take a large enough value so that the estimations done in this proof are all valid. The construction uses parameters δ_k, γ_k, and η_k, for k ∈ [ℓ], that are all functions of n and are defined as follows.

For all k ∈ [ℓ], γ_k is defined in terms of δ_k, according to the relation given in Theorem 2.3: γ_k = O(log³(n/δ_k)). We also define γ_{ℓ+1} = 0.

The parameters δ_k are defined recursively in descending order as follows: δ_k is given by log(1/δ_k) = a(γ_{k+1} + log n), for k = ℓ, ℓ−1, …, 1. Note that for all k ∈ [ℓ], γ_k = (log n)^{O_ℓ(1)} and log(1/δ_k) = (log n)^{O_ℓ(1)}, where O_ℓ(1) denotes a constant that depends on ℓ. We will use the fact that for any constant a, the following inequalities hold provided n is large enough:

 γ_k ≥ a(γ_{k+1} + log(1/δ_k²) + log n), (5)

and

 log(1/δ_k) > 16·γ_{k+1} + a·log n. (6)

We next define, for all k ∈ [ℓ],

 η_k = 2γ_k + log(2/δ_k²). (7)

Note that for all k ∈ [ℓ], η_k = (log n)^{O_ℓ(1)}.

We denote .

The sequence δ̂_1, …, δ̂_{ℓ+1} is defined recursively (in descending order) as follows: δ̂_{ℓ+1} = 0 and

 δ̂_k = 2·δ̂_{k+1} + δ_k.

It can be checked that δ̂_k ≤ 2^{ℓ−k+1}·δ_ℓ, for all k ∈ [ℓ].

### 2.4 Handling the input complexity profile

As we did in Section 2.2, Proof overview, we first assume that the decompressor knows the complexity profile of the input strings x_1, …, x_ℓ, which is the tuple (C(x_V ∣ x_{[ℓ]−V}))_{V⊆[ℓ]}. This assumption can be eliminated in the same way as we did in the proof overview.

### 2.5 Encoding

Each sender k, k ∈ [ℓ], has as input the n-bit string x_k and uses the graph G_k promised by Theorem 2.3, instantiated so that G_k has the (n_k, δ_k)-rich owner property. The left nodes are n-bit strings, and in this way the input string x_k is a left node in G_k. Sender k picks p_k uniformly at random among the right neighbors of x_k in the graph G_k, and sends p_k to the receiver. The length of p_k is n_k + η_k. (If the length of the input strings is not known by the receiver, Sender k also sends the length of x_k. Note that the algorithms work even if the strings have different lengths, in which case n in the proof is the maximum of these lengths.)

### 2.6 Decoding

We first state some technical lemmas that play an important role in the decoding procedure. They are proved in Section 3. The first two lemmas estimate how the complexity of p_k is related to the complexity of x_k, for k ∈ [ℓ]. There are two regimes to analyze, depending on whether the complexity of x_k is low or high. We analyze the respective complexities conditioned by some string z, which for now is an arbitrary string, but later, when we apply these lemmas for p_k and x_k, z will be instantiated with the previous inputs x_{[k−1]} and the nodes p_{[k+1..ℓ]}.

###### Lemma 2.5.

(low complexity case) Let z be an arbitrary string and suppose C(x_k ∣ z) ≤ n_k.

1. There exists an algorithm that on input p_k, C(x_k ∣ z) and z outputs x_k with probability 1 − O(δ_k) (over the random choice of p_k).

2. With probability 1 − O(δ_k), C(p_k ∣ z) ≥ C(x_k ∣ z) − O(γ_k + log n).

###### Lemma 2.6.

(high complexity case) Let z be an arbitrary string and suppose C(x_k ∣ z) > n_k.

1. There exists an algorithm that on input p_k, z, C(x_k ∣ z) and some string of length C(x_k ∣ z) − n_k + O(γ_k + log n), outputs x_k with probability 1 − O(δ_k) (over the random choice of p_k).

2. With probability 1 − O(δ_k), C(p_k ∣ z) ≥ n_k − O(γ_k + log n) and C(x_k ∣ p_k, z) ≤ C(x_k ∣ z) − n_k + O(γ_k + log n).

The decoding procedure needs good estimations of the complexities of the form C(x_k ∣ x_{[k−1]}, p_{[k+1..ℓ]}). The following lemma shows that it is possible to effectively approximate them with precision (log n)^{O_ℓ(1)}. The inductive proof requires the approximation of more general complexities of the form C(x_V, p_{[k+1..ℓ]}) for all k ∈ [ℓ] and for all non-empty V ⊆ [k].

###### Lemma 2.7.

There is an algorithm with the following behaviour.

For all k ∈ [ℓ], the algorithm, on input n, the complexity profile of (x_1, …, x_ℓ), and p_{[k+1..ℓ]}, outputs, for all non-empty V ⊆ [k], an integer A(x_V, p_{[k+1..ℓ]}) such that, with probability 1 − O(δ̂_{k+1}),

 |C(x_V, p_{[k+1..ℓ]}) − A(x_V, p_{[k+1..ℓ]})| ≤ 4γ_{k+1} = (log n)^{O_ℓ(1)}.

The next lemma shows that the constraints (1) remain roughly valid if we replace the left nodes x_{k+1}, …, x_ℓ with the corresponding right nodes p_{k+1}, …, p_ℓ.

###### Lemma 2.8.

For all k ∈ [ℓ], for all non-empty V ⊆ [k], the following inequality holds with probability 1 − O(δ̂_{k+1}):

 C(x_V ∣ x_{[k]−V}, p_{[k+1..ℓ]}) ≤ ∑_{j∈V} n_j + O(ℓ−k)·log n.

Decoding algorithm

Some of the estimations below hold with error probability bounded by O(δ_k) or O(δ̂_k), for various k ∈ [ℓ], and all these values are bounded by O(δ̂_1) (the probability is on the random choices of p_1, …, p_ℓ). There are “bad” events when the estimations are violated. By taking the constant a sufficiently large, the union of all “bad” events has probability at most ε. The following arguments are done conditioned on the event that none of the “bad” events happened.

First, using the algorithm A from Lemma 2.7, the values A(x_k ∣ x_{[k−1]}, p_{[k+1..ℓ]}) are calculated by the formula

 A(x_k ∣ x_{[k−1]}, p_{[k+1..ℓ]}) = A(x_k, x_{[k−1]}, p_{[k+1..ℓ]}) − A(x_{[k−1]}, p_{[k+1..ℓ]}).

By the chain rule and the bounds on approximation error established in Lemma 2.7, it holds that

 |C(x_k ∣ x_{[k−1]}, p_{[k+1..ℓ]}) − A(x_k ∣ x_{[k−1]}, p_{[k+1..ℓ]})| ≤ 8γ_{k+1} + O(log n).