On the Separation of Lossy Source-Network Coding and Channel Coding in Wireline Networks
Abstract
This paper proves the separation between source-network coding and channel coding in networks of noisy, discrete, memoryless channels. We show that the set of achievable distortion matrices in delivering a family of dependent sources across such a network equals the set of achievable distortion matrices for delivering the same sources across a distinct network which is built by replacing each channel by a noiseless, point-to-point bit-pipe of the corresponding capacity. Thus a code that applies source-network coding across links that are made almost lossless through the application of independent channel coding across each link asymptotically achieves the optimal performance across the network as a whole.
I Introduction
In his seminal work [1], Shannon separates the problem of communicating a memoryless source across a single noisy, memoryless channel into separate lossless source coding and channel coding problems. The corresponding result for lossy coding in point-to-point channels is almost immediate since lossy coding in a point-to-point channel is equivalent to lossless coding of the codeword indices, and it appears in the same work [1]. For a single point-to-point channel, separation holds under a wide variety of source and channel distributions (see, for example, [2] and the references therein). Unfortunately, separation does not necessarily hold in network systems. Even in very small networks like the multiple access channel [3], separation can fail when statistical dependencies between the sources at different network locations are useful for increasing the rate across the channel. Since source codes tend to destroy such dependencies, joint source-channel codes can achieve better performance than separate source and channel codes in these scenarios.
This paper proves the separation between source-network coding and channel coding in networks of independent noisy, discrete, memoryless channels (DMCs). Roughly, we show that the vector of achievable distortions in delivering a family of dependent sources across such a network $\mathcal{N}$ equals the vector of achievable distortions for delivering the same sources across a distinct network $\hat{\mathcal{N}}$. Network $\hat{\mathcal{N}}$ is built by replacing each channel in $\mathcal{N}$ by a noiseless, point-to-point bit-pipe of the corresponding capacity. Thus a code that applies source-network coding across links that are made almost lossless through the application of independent channel coding across each link asymptotically achieves the optimal performance across the network as a whole. Note that the operations of network source coding and network coding are not separable, as shown in [4] and [5] for non-multicast and multicast lossless source coding, respectively. As a result, a joint source-network code is required, and only the channel code can be separated. While the achievability of a separated strategy is straightforward, the converse is more difficult since preserving statistical dependence between codewords transmitted across distinct edges of a network of noisy links improves the end-to-end network performance in some networks [6].
The results derived here give a partial generalization of [7, 8] and [6], which prove the separation between network coding and channel coding for multicast [7, 8] and general demands [6], respectively, under the assumption that messages transmitted to different subsets of users are independent and uniformly distributed. The shift here is from independent sources to dependent sources, from lossless to lossy data description, and from memoryless to non-memoryless sources.
The remainder of the paper is organized as follows. Sections II and III describe the notation and problem setup, respectively. Section IV describes a tool called a stacked network that allows us to employ typicality across copies of a network rather than typicality across time in the arguments that follow. Section V gives our main results for both memoryless sources and sources with memory.
II Notation
Calligraphic letters, like $\mathcal{X}$, $\mathcal{Y}$, and $\mathcal{Z}$, refer to sets, and the size of a set $\mathcal{A}$ is denoted by $|\mathcal{A}|$. For a random variable $X$, its alphabet set is represented by $\mathcal{X}$.
While a random variable is denoted by $X$, $X^n$ represents a random vector $(X_1, \ldots, X_n)$. The length of a vector is implied in the context, and its $i$-th element is denoted by $X_i$.
For two vectors $x^n$ and $y^n$ of the same length $n$, $d_H(x^n, y^n)$ denotes the Hamming distance between the two vectors, defined as $d_H(x^n, y^n) = \sum_{i=1}^{n} \mathbf{1}[x_i \neq y_i]$. If $p$ and $q$ represent probability distributions, i.e., $p_i, q_i \geq 0$ and $\sum_i p_i = \sum_i q_i = 1$, then the total variation distance between $p$ and $q$ is defined as $\|p - q\|_{\mathrm{TV}} = \frac{1}{2} \sum_i |p_i - q_i|$.
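As a concrete check of these two definitions, the following sketch (function names are ours, not the paper's) computes the Hamming distance between two equal-length vectors and the total variation distance between two probability vectors, using the standard convention that total variation is half the L1 distance:

```python
# Minimal sketch of the two distances defined above (names are ours).

def hamming_distance(x, y):
    """Number of positions at which two equal-length vectors differ."""
    assert len(x) == len(y)
    return sum(1 for xi, yi in zip(x, y) if xi != yi)

def total_variation(p, q):
    """Total variation distance between two probability vectors,
    i.e., half the L1 distance between them."""
    assert len(p) == len(q)
    assert abs(sum(p) - 1.0) < 1e-9 and abs(sum(q) - 1.0) < 1e-9
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

print(hamming_distance([0, 1, 1, 0], [0, 1, 0, 1]))  # 2
print(total_variation([0.5, 0.5], [0.9, 0.1]))       # 0.4
```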
Unlike [6], this paper uses strong typicality arguments to demonstrate the equivalence between noisy channels and noiseless bit-pipes of the same capacity. We therefore assume that the channel input and output alphabets are finite. The alphabets for the sources described across the channel may be discrete or continuous.
III The problem setup
Consider a multiterminal network $\mathcal{N}$ consisting of $m$ nodes interconnected via some point-to-point, independent DMCs. The network structure is represented by a directed graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with node set $\mathcal{V}$ and edge set $\mathcal{E}$. Each directed edge $e = (v_1, v_2) \in \mathcal{E}$ implies a point-to-point DMC between nodes $v_1$ (input) and $v_2$ (output). Each node $v \in \mathcal{V}$ observes some source process $U_v$, and is interested in reconstructing a subset of the processes observed by the other nodes. The alphabet of source $U_v$, $\mathcal{U}_v$, can be either scalar- or vector-valued. This allows node $v$ to have a vector of sources. For achieving this goal in a block coding framework, source output symbols are divided into non-overlapping blocks of length $k$. Each block is described separately. At the beginning of the coding period, each node $v$ has observed a length-$k$ block of the process $U_v$, i.e., $U_v^k$. The blocks observed at different nodes are described over the network in $n$ uses of the network (the rate $\kappa = k/n$ is a parameter of the code). For those $n$ time steps, at each step $t$, each node $v$ generates its next channel inputs as a function of $U_v^k$ and its channels' outputs up to time $t-1$, here denoted by $Y_v(1), \ldots, Y_v(t-1)$, according to
$$X_v(t) = f_{v,t}\big(U_v^k, Y_v(1), \ldots, Y_v(t-1)\big). \qquad (1)$$
Note that each node might be the input to more than one channel and/or the output of more than one channel. Hence, both $X_v(t)$ and $Y_v(t)$ might be vectors, depending on the in-degree and out-degree of node $v$. The reconstruction at node $v'$ of the block $U_v^k$ observed at node $v$ is denoted by $\hat{U}_{v \to v'}^k$. This reconstruction is a function of the source observed at node $v'$ and node $v'$'s channel outputs, i.e., $U_{v'}^k$ and $Y_{v'}(1), \ldots, Y_{v'}(n)$, where
$$\hat{U}_{v \to v'}^k = g_{v \to v'}\big(U_{v'}^k, Y_{v'}(1), \ldots, Y_{v'}(n)\big). \qquad (2)$$
The performance criterion for a coding scheme is its induced expected average distortion between sources and reconstruction blocks, i.e., for all $(v, v') \in \mathcal{V}^2$,
$$\mathrm{E}\Big[\frac{1}{k} \sum_{i=1}^{k} d_{v \to v'}\big(U_v(i), \hat{U}_{v \to v'}(i)\big)\Big],$$
where $d_{v \to v'}$ is a per-letter distortion measure. As mentioned before, $U_v$ and $\hat{U}_{v \to v'}$ are either scalar- or vector-valued. This allows the case where node $v$ observes multiple sources and node $v'$ is interested in reconstructing a subset of them. Let $D = [D_{v \to v'}]_{(v, v') \in \mathcal{V}^2}$ denote a matrix of target distortions. If node $v'$ is not interested in reconstructing the source observed at node $v$, then $d_{v \to v'} \equiv 0$.
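For concreteness, the per-pair expected average distortion described above can be estimated block by block; the following minimal sketch (helper names are ours) computes the average distortion of one length-$k$ block under Hamming per-letter distortion:

```python
# Illustrative computation of the average distortion inside the
# expectation above, for one (source, reconstruction) block pair,
# using Hamming per-letter distortion.  Helper names are ours.

def per_letter_hamming(u, u_hat):
    """Per-letter distortion: 0 on a match, 1 on a mismatch."""
    return 0.0 if u == u_hat else 1.0

def average_distortion(block, block_hat, d=per_letter_hamming):
    """(1/k) * sum_i d(U(i), Uhat(i)) for one length-k block."""
    k = len(block)
    return sum(d(u, uh) for u, uh in zip(block, block_hat)) / k

u = [0, 1, 1, 0, 1]
u_hat = [0, 1, 0, 0, 0]
print(average_distortion(u, u_hat))  # 0.4  (2 mismatches out of 5)
```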
The distortion matrix $D$ is said to be achievable at a rate $\kappa$ in a network $\mathcal{N}$ if, for any $\epsilon > 0$, there exist a pair $(k, n)$ with $k/n = \kappa$ and a block-length-$k$ coding scheme such that
$$\mathrm{E}\Big[\frac{1}{k} \sum_{i=1}^{k} d_{v \to v'}\big(U_v(i), \hat{U}_{v \to v'}(i)\big)\Big] \leq D_{v \to v'} + \epsilon \qquad (3)$$
for any $(v, v') \in \mathcal{V}^2$.
IV Stacked network
For a given network $\mathcal{N}$, the corresponding $N$-fold stacked network $\underline{\mathcal{N}}$ is defined as $N$ copies of the original network [6]. That is, for each node and each edge in $\mathcal{N}$, there are $N$ copies of the same node or same edge in $\underline{\mathcal{N}}$. At each time instance, each node has access to the data available at the nodes which are its copies, and potentially uses this extra information in generating the channel inputs of the future time instances. Likewise, in decoding, all copies of a node can collaborate in reconstructing the signals. This is made more precise in the following two definitions:
$$\underline{X}_v(t) = \underline{f}_{v,t}\big(\underline{U}_v^k, \underline{Y}_v(1), \ldots, \underline{Y}_v(t-1)\big) \qquad (4)$$
and
$$\underline{\hat{U}}_{v \to v'}^k = \underline{g}_{v \to v'}\big(\underline{U}_{v'}^k, \underline{Y}_{v'}(1), \ldots, \underline{Y}_{v'}(n)\big) \qquad (5)$$
which correspond to (1) and (2) in the original network. In (4) and (5), all the underlined vectors are of length $N$, with one entry per layer.
In an $N$-layered network, the distortion between the source observed at node $v$ and its reconstruction at node $v'$ is defined as
$$\underline{D}_{v \to v'} = \mathrm{E}\Big[\frac{1}{Nk} \sum_{\ell=1}^{N} \sum_{i=1}^{k} d_{v \to v'}\big(U_v^{(\ell)}(i), \hat{U}_{v \to v'}^{(\ell)}(i)\big)\Big], \qquad (6)$$
where the superscript $(\ell)$ indexes the layer,
for any $(v, v') \in \mathcal{V}^2$.
A distortion matrix $D$ is said to be achievable in the stacked network at some rate $\kappa$ if, for any given $\epsilon > 0$, there exist $k$ and $n$ with $k/n = \kappa$ and $N$ large enough, such that $\underline{D}_{v \to v'} \leq D_{v \to v'} + \epsilon$ for all $(v, v') \in \mathcal{V}^2$. Note that the dimension of the distortion matrices in both single-layer and multi-layer networks is $m \times m$. Let $\mathcal{D}(\mathcal{N}, \kappa)$ and $\mathcal{D}(\underline{\mathcal{N}}, \kappa)$ denote the closures of the sets of achievable distortion matrices at some rate $\kappa$ in a network $\mathcal{N}$ and its stacked version $\underline{\mathcal{N}}$, respectively. The following theorem establishes the relationship between the two sets.
Theorem 1
At any rate $\kappa$,
$$\mathcal{D}(\mathcal{N}, \kappa) = \mathcal{D}(\underline{\mathcal{N}}, \kappa). \qquad (7)$$

Proof of $\mathcal{D}(\mathcal{N}, \kappa) \subseteq \mathcal{D}(\underline{\mathcal{N}}, \kappa)$: Consider any $D \in \mathcal{D}(\mathcal{N}, \kappa)$. Then for any $\epsilon > 0$, there exists a coding scheme operating at rate $\kappa$ on $\mathcal{N}$ such that (3) is satisfied. For any $N$, a stacked network that uses this same coding strategy independently in each layer achieves
$$\underline{D}_{v \to v'} \leq D_{v \to v'} + \epsilon. \qquad (8)$$
Proof of $\mathcal{D}(\underline{\mathcal{N}}, \kappa) \subseteq \mathcal{D}(\mathcal{N}, \kappa)$: Let $D \in \mathcal{D}(\underline{\mathcal{N}}, \kappa)$. Since $D$ is achievable on the stacked network, for any $\epsilon > 0$, there exist integers $k$, $n$, and $N$ such that a stacked network consisting of $N$ layers along with a block-length-$k$ coding scheme for $k$ source symbols on this stacked network achieves $\underline{D}_{v \to v'} \leq D_{v \to v'} + \epsilon$ for all $(v, v') \in \mathcal{V}^2$. The same coding scheme can be used in a single-layer network as follows. Consider a single-layer network where each node observes a length-$Nk$ block of source symbols and describes the block in the next $Nn$ time steps. At times $t = 1, \ldots, N$, each node $v$ sends what would have been sent at time 1 by node $v$ in layer $t$ of the stacked network. After that, having collected the output of the previous $N$ time steps, at times $t = N+1, \ldots, 2N$, node $v$ sends the outputs of the same node at time 2 in layer $t - N$. (Note that in the first $N$ time steps, node $v$'s output is only a function of its own source, not the channels' outputs. It only collects the channel outputs in order to use them during the next $N$ time steps.) The same strategy is used in $n$ time intervals, each comprising $N$ network uses. During each period, the new channel outputs observed by node $v$ are recorded to be used in the future periods, but do not affect the inputs generated by that node during that time period. Using this strategy, at the end of $Nn$ channel uses, each node's observation has exactly the same distribution as the collection of observations of its copies in the stacked network. Therefore, applying the same decoding rule will result in the same performance. Hence, $D \in \mathcal{D}(\mathcal{N}, \kappa)$.
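The unstacking argument above is a purely deterministic re-indexing of channel uses. The following sketch (ours, not from the paper) makes the schedule explicit: single-network time slots are grouped into $n$ rounds of $N$ uses, and slot $(s-1)N + \ell$ emulates stacked-network time $s$ in layer $\ell$:

```python
# Sketch of the time-sharing schedule used to run an N-layer stacked
# code on a single-layer network: N consecutive channel uses emulate
# stacked-time 1 (one use per layer), the next N uses emulate
# stacked-time 2, and so on.  Names are ours.

def schedule(N, n):
    """Return a list of length N*n; entry t holds the
    (stacked_time, layer) pair emulated at single-network time t+1."""
    return [(s, l) for s in range(1, n + 1) for l in range(1, N + 1)]

sched = schedule(N=3, n=2)
print(sched)  # [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3)]
# Times 1..3 emulate stacked-time 1 in layers 1..3;
# times 4..6 emulate stacked-time 2 in layers 1..3.
```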
V Replacing a noisy channel with a bit-pipe
V-A Memoryless sources
In this section we assume all sources are jointly i.i.d., i.e., for any $i \geq 1$, $(U_v(i) : v \in \mathcal{V}) \sim p_U$, where $p_U$ does not depend on $i$. Note that at each time instant the sources might be correlated with each other.
In the described network $\mathcal{N}$, for some $v_1, v_2 \in \mathcal{V}$ such that $(v_1, v_2) \in \mathcal{E}$, consider the noisy channel connecting these two nodes. The channel is described by its transition probabilities $p(y|x)$, and has some finite capacity $C$. Now consider a network $\hat{\mathcal{N}}$ which is identical to $\mathcal{N}$ except for the noisy channel between $v_1$ and $v_2$, which is replaced by a bit-pipe of capacity $C$.
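The capacity of a finite-alphabet DMC can be computed numerically, for instance with the Blahut-Arimoto algorithm. The sketch below (ours; the paper only assumes the capacity exists) computes the capacity of a binary symmetric channel and recovers the familiar $1 - h(\epsilon)$ value:

```python
import math

# Numerical capacity of a DMC via the Blahut-Arimoto algorithm
# (our illustrative sketch, not part of the paper's argument).

def blahut_arimoto(W, iters=2000):
    """W[x][y] = p(y|x).  Returns the channel capacity in bits."""
    nx, ny = len(W), len(W[0])
    p = [1.0 / nx] * nx                          # input distribution
    for _ in range(iters):
        q = [sum(p[x] * W[x][y] for x in range(nx))
             for y in range(ny)]                 # induced output dist.
        # c[x] = exp( D( W(.|x) || q ) ), computed in nats
        c = [math.exp(sum(W[x][y] * math.log(W[x][y] / q[y])
                          for y in range(ny) if W[x][y] > 0))
             for x in range(nx)]
        Z = sum(p[x] * c[x] for x in range(nx))
        p = [p[x] * c[x] / Z for x in range(nx)] # multiplicative update
    return math.log2(Z)                          # capacity at convergence

# Binary symmetric channel with crossover 0.1: C = 1 - h(0.1) ~ 0.531 bits.
eps = 0.1
bsc = [[1 - eps, eps], [eps, 1 - eps]]
print(round(blahut_arimoto(bsc), 3))  # 0.531
```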
Theorem 2
For any $\kappa$,
$$\mathcal{D}(\mathcal{N}, \kappa) = \mathcal{D}(\hat{\mathcal{N}}, \kappa). \qquad (9)$$
[Proof outline] By Theorem 1, the achievable region of a network is equal to the achievable region of its stacked version. Hence, it suffices to prove that $\mathcal{D}(\underline{\mathcal{N}}, \kappa) = \mathcal{D}(\underline{\hat{\mathcal{N}}}, \kappa)$.

$\mathcal{D}(\underline{\hat{\mathcal{N}}}, \kappa) \subseteq \mathcal{D}(\underline{\mathcal{N}}, \kappa)$: Let $D \in \mathcal{D}(\underline{\hat{\mathcal{N}}}, \kappa)$. We need to show that $D \in \mathcal{D}(\underline{\mathcal{N}}, \kappa)$ as well. Note that $\mathcal{N}$ and $\hat{\mathcal{N}}$ are identical except for the DMC connecting nodes $v_1$ and $v_2$ in $\mathcal{N}$, which is replaced by a bit-pipe of capacity $C$ in $\hat{\mathcal{N}}$. We next show that any code for $\underline{\hat{\mathcal{N}}}$ can be operated on $\underline{\mathcal{N}}$ with a similar expected distortion. Let the number of layers in both networks be $N$. Given the capacity of the bit-pipes, the number of bits that can be carried from $v_1$ to $v_2$ in $\underline{\hat{\mathcal{N}}}$ per network use is at most $\lfloor NC \rfloor$, where $C$ is the capacity of the bit-pipe. Hence, if $N$ is large enough, the same information can be transmitted from $v_1$ to $v_2$ in $\underline{\mathcal{N}}$ by doing appropriate channel coding across the layers over the noisy channel and its copies connecting $v_1$ and $v_2$ in $\underline{\mathcal{N}}$. Let $P_e$ denote the probability of error of the channel code operating over the channel corresponding to the edge $(v_1, v_2)$ and its copies in $\underline{\mathcal{N}}$, and let $d_{\max}$ denote the maximum per-letter distortion. Then the extra expected distortion introduced at each reconstruction point is bounded above by $P_e d_{\max}$ and can be made arbitrarily small.
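The mechanism behind this bound is that coding across the $N$ layers drives the error probability, and hence the distortion penalty, to zero. As a toy illustration (ours, and deliberately crude: a repetition code wastes rate, whereas the proof uses codes at rates close to capacity), the exact error probability of an $N$-fold repetition code on a BSC decays with $N$, so the extra distortion bound $P_e \cdot d_{\max}$ vanishes:

```python
from math import comb

# Toy illustration of the vanishing distortion penalty: the extra
# distortion from channel coding across the N layers is at most
# P_e * d_max.  Here P_e is computed exactly for an N-fold repetition
# code over a BSC(eps) with majority decoding (N odd).  This is NOT
# the capacity-achieving code of the proof, just the simplest example.

def repetition_error_prob(N, eps):
    """P(majority of N i.i.d. BSC(eps) bits are flipped), N odd."""
    return sum(comb(N, k) * eps**k * (1 - eps)**(N - k)
               for k in range((N + 1) // 2, N + 1))

d_max = 1.0
for N in (1, 11, 101):
    pe = repetition_error_prob(N, eps=0.1)
    print(N, pe, pe * d_max)   # the distortion bound shrinks with N
```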

$\mathcal{D}(\underline{\mathcal{N}}, \kappa) \subseteq \mathcal{D}(\underline{\hat{\mathcal{N}}}, \kappa)$: Let $D \in \mathcal{D}(\underline{\mathcal{N}}, \kappa)$. We prove that $D \in \mathcal{D}(\underline{\hat{\mathcal{N}}}, \kappa)$. Consider a code defined on $\mathcal{N}$ that achieves distortions within $\epsilon$ of $D$, and consider the $N$-fold stacked version of $\mathcal{N}$, $\underline{\mathcal{N}}$. Assume that the same code is applied independently in each layer. We first show that, when all sources are memoryless and uniformly distributed, the performance of the code given the realization of the inputs to the copies of the channel only depends on the empirical distribution of those inputs, defined as
$$\hat{p}_{X_1}(x) = \frac{1}{N} \sum_{\ell=1}^{N} \mathbf{1}\big[X_1^{(\ell)} = x\big] \qquad (10)$$
for all $x \in \mathcal{X}$, where $X_1^{(\ell)}$ denotes the channel input in layer $\ell$. Here the subscript refers to time $1$. After establishing this, we use the result proved in [9] and show that at time $1$ we can simulate the performance of the noisy link by a bit-pipe of the same capacity. For the rest of the proof, let $U^k$ and $\hat{U}^k$ denote some i.i.d. source observed at some node in $\mathcal{N}$ and its reconstruction at some other node in $\mathcal{N}$.
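The empirical distribution in (10) is simply the fraction of layers in which each input symbol occurs at the given time; the following sketch (names are ours) computes it:

```python
from collections import Counter

# Empirical distribution (10) of a channel input across the N layers
# of the stacked network at a fixed time.  Names are ours.

def empirical_distribution(layer_symbols, alphabet):
    """Fraction of layers in which each alphabet symbol appears."""
    N = len(layer_symbols)
    counts = Counter(layer_symbols)
    return {a: counts[a] / N for a in alphabet}

x_layers = [0, 1, 1, 0, 1, 1, 0, 1]  # the channel input in 8 layers
print(empirical_distribution(x_layers, alphabet=[0, 1]))
# {0: 0.375, 1: 0.625}
```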
In the original network,
(11)
On the other hand, in the $N$-fold stacked network,
(12)
Comparing (11) and (12) reveals that the desired result will follow if we can find a coding scheme for which
(13)
can be made arbitrarily small.
To prove this, consider a channel with input drawn i.i.d. from some distribution $p_X$. The encoder observes $n$ source symbols $X^n$ and sends a message of $nR$ bits to the decoder. The decoder converts these bits into a reconstruction block $Y^n$. The empirical joint distribution between the channel input and channel output induced by the bit-pipe is defined in the classical sense as follows:
$$\hat{p}_{X^n Y^n}(x, y) = \frac{1}{n} \big|\{i : (X_i, Y_i) = (x, y)\}\big|.$$
Consider a DMC described by transition probabilities $p_{Y|X}$ whose input is an i.i.d. process distributed according to some distribution $p_X$. In [9], it is shown that, as long as $R > I(X; Y)$, any such channel can be simulated by a bit-pipe of rate at most $R$ such that the total variation between $\hat{p}_{X^n Y^n}$ and $p_X p_{Y|X}$ can be made arbitrarily small for large enough block lengths. In other words, there exists a sequence of coding schemes over the bit-pipe such that
$$\big\| \hat{p}_{X^n Y^n} - p_X p_{Y|X} \big\|_{\mathrm{TV}} \to 0 \qquad (14)$$
(where $\hat{p}_{X^n Y^n}$ and $p_X p_{Y|X}$ are vectors describing the empirical and true joint distributions of $(X, Y)$, respectively).
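The criterion in (14) compares an empirical joint type against a true joint distribution in total variation. The sketch below (ours; it illustrates only the "real channel" side of the statement, not the bit-pipe coding scheme of [9]) passes an i.i.d. input through a BSC and measures how far the empirical joint type is from $p_X p_{Y|X}$:

```python
import random

# Empirical joint type of (X^n, Y^n) versus the true joint p_X * p_{Y|X},
# for a BSC(eps) with uniform i.i.d. input.  Our illustrative sketch of
# the total-variation criterion in (14), not the scheme of [9].

def empirical_joint(xs, ys):
    """Empirical joint type over the binary alphabet pair."""
    n = len(xs)
    joint = {(a, b): 0.0 for a in (0, 1) for b in (0, 1)}
    for x, y in zip(xs, ys):
        joint[(x, y)] += 1.0 / n
    return joint

def tv_to_true_joint(n, eps, seed=0):
    """TV distance between the empirical joint type of n channel uses
    and the true joint distribution of a BSC(eps) with uniform input."""
    rng = random.Random(seed)
    xs = [rng.randint(0, 1) for _ in range(n)]       # X ~ Bernoulli(1/2)
    ys = [x ^ (rng.random() < eps) for x in xs]      # Y = BSC(eps) output
    true = {(a, b): 0.5 * ((1 - eps) if a == b else eps)
            for a in (0, 1) for b in (0, 1)}
    emp = empirical_joint(xs, ys)
    return 0.5 * sum(abs(emp[k] - true[k]) for k in true)

# The deviation shrinks as the block length grows.
print(tv_to_true_joint(100, 0.2), tv_to_true_joint(100000, 0.2))
```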
Combining this result with our initial claim yields the desired result, i.e., at time $1$, we can replace the noisy link by a bit-pipe. To extend this result to the next time steps, we use induction. Note that in the original network
(15)
On the other hand, using the same analysis used in deriving (12), in the $N$-fold stacked network,
(16)
Therefore, we need to show that, by appropriate coding over the bit-pipes,
(17)
can be made arbitrarily small. Note that
(18)
and
(19)
We have already proved that by appropriate coding, we can make the first term in (19) converge to the first term in (18) with probability one. By induction, we can prove that the same result holds for every other term in (19) and its corresponding term in (18). After proving this, since all the terms in (19), and as a result their product, are positive and upper-bounded by $1$, we can use the Dominated Convergence Theorem (see, for example, [10]) to show that (17) can be made arbitrarily small.
To apply induction, assume there exist some coding schemes by which we make the first $t$ terms in (19) each converge to the corresponding term in (18) almost surely. Using this assumption, we prove that the same is true for the $(t+1)$-st term as well.
Note that when the first $t$ terms are very close, the frequency of occurrence of each pattern across the layers in $\underline{\hat{\mathcal{N}}}$ is very close to the pattern's probability. Since the two networks perform the same except for link $(v_1, v_2)$, the network guarantees that the frequency of each pattern is also close to its probability in $\underline{\mathcal{N}}$. In order to finish the proof, we use Lemma 1, proved in Appendix A.
Lemma 1
If we choose the random codes used at times $t$ and $t+1$ independently, then
(20)
where the expectation is with respect to both the network and the code selections.
V-B Sources with memory
Assume that the sources are no longer memoryless but mixing. That is, for any integers $i$ and $g$, the dependence between the past samples $(U_v(j) : j \leq i, v \in \mathcal{V})$ and the future samples $(U_v(j) : j \geq i + g, v \in \mathcal{V})$ goes to $0$ as the gap $g$ approaches $\infty$. In the proof of Theorem 2, we used the fact that the sources are correlated and jointly i.i.d. to conclude that the inputs to the copies of a channel in the stacked network are i.i.d. If the sources have memory, this no longer holds. But if we assume that the sources are mixing, then for block length $k$ large enough, the blocks observed in the even-numbered layers and those observed in the odd-numbered layers look like two i.i.d. sequences. Therefore, in the stacked network, if we code the even-numbered layers together and the odd-numbered ones together, each separately from the other, we get back to the i.i.d. regime and can prove a similar result.
Appendix A: Proof of Lemma 1
Acknowledgments
SJ is supported by the Center for Mathematics of Information at Caltech, and ME is supported by the DARPA ITMANET program under grant number W911NF-07-1-0029.
References
 C. E. Shannon, "A mathematical theory of communications: Parts I and II," Bell Syst. Tech. J., vol. 27, pp. 379–423, 623–656, 1948.
 S. Vembu, S. Verdú, and Y. Steinberg, "The source-channel separation theorem revisited," IEEE Trans. Inform. Theory, vol. 41, no. 1, pp. 44–54, Jan. 1995.
 A. El Gamal and T. M. Cover, "Multiple user information theory," Proc. IEEE, vol. 68, pp. 1466–1483, Dec. 1980.
 M. Effros, M. Médard, T. Ho, S. Ray, D. Karger, and R. Koetter, "Linear network codes: a unified framework for source, channel, and network coding," Proc. DIMACS Workshop on Network Information Theory, Piscataway, NJ, March 2003.
 A. Ramamoorthy, K. Jain, P. A. Chou, and M. Effros, "Separating distributed source coding from network coding," IEEE Trans. Inform. Theory, vol. 52, pp. 2785–2795, June 2006.
 R. Koetter, M. Effros, and M. Médard, "On the theory of network equivalence," Proc. IEEE Information Theory Workshop (ITW), 2009.
 S. Borade, "Network information flow: limits and achievability," Proc. IEEE Int. Symp. Inform. Theory (ISIT), Lausanne, Switzerland, 2002.
 L. Song, R. W. Yeung, and N. Cai, "A separation theorem for single-source network coding," IEEE Trans. Inform. Theory, vol. 52, pp. 1861–1871, May 2006.
 P. Cuff, H. Permuter, and T. M. Cover, "Coordination capacity," submitted to IEEE Trans. Inform. Theory, Aug. 2009 (available at arxiv.org/abs/0909.2408).
 R. Durrett, Probability: Theory and Examples, Wadsworth & Brooks/Cole, Pacific Grove, CA, 1991.