A Formula for the Capacity of the General Gel’fand-Pinsker Channel


We consider the Gel’fand-Pinsker problem in which the channel and state are general, i.e., possibly non-stationary, non-memoryless and non-ergodic. Using the information spectrum method and a non-trivial modification of the piggyback coding lemma by Wyner, we prove that the capacity can be expressed as an optimization over the difference of a spectral inf- and a spectral sup-mutual information rate. We consider various specializations including the case where the channel and state are memoryless but not necessarily stationary.


Gel’fand-Pinsker, Information spectrum, General channels, General sources

1 Introduction

In this paper, we consider the classical problem of channel coding with noncausal state information at the encoder, also known as the Gel’fand-Pinsker problem [1]. In this problem, we would like to send a uniformly distributed message over a state-dependent channel $W^n : \mathcal{X}^n \times \mathcal{S}^n \to \mathcal{Y}^n$, where $\mathcal{S}$, $\mathcal{X}$ and $\mathcal{Y}$ are the state, input and output alphabets respectively. The random state sequence $S^n$ is available noncausally at the encoder but not at the decoder. See Fig. 1. The Gel’fand-Pinsker problem consists in finding the maximum rate for which there exists a reliable code. Assuming that the channel and state sequence are stationary and memoryless, Gel’fand and Pinsker [1] showed that this maximum message rate or capacity is given by

$$C = \max_{P_{U|S},\; g : \mathcal{U} \times \mathcal{S} \to \mathcal{X}} I(U;Y) - I(U;S). \tag{1}$$

The coding scheme involves a covering step at the encoder to reduce the uncertainty due to the random state sequence and a packing step to decode the message [2, Chapter 7]. Thus, we observe the covering rate $I(U;S)$ and the packing rate $I(U;Y)$ in (1). A weak converse can be proved by using the Csiszár-sum-identity [2, Chapter 7]. A strong converse was proved by Tyagi and Narayan [3] using entropy and image-size characterizations [4, Chapter 15]. The Gel’fand-Pinsker problem has numerous applications, particularly in information hiding and watermarking [5].
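As a minimal numerical sketch of the two rates just described, the following Python snippet evaluates the difference I(U;Y) − I(U;S) for one fixed choice of the conditional law of U given S and of the encoding function g on a finite-alphabet toy channel. The function names, the dictionary-based data layout and the binary example Y = X ⊕ S ⊕ Z are illustrative assumptions and are not taken from the paper; the snippet evaluates a single candidate and does not carry out the maximization.

```python
import math
from itertools import product


def mutual_information(joint):
    """Mutual information I(A;B) in nats from a dict {(a, b): probability}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * math.log(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)


def gp_objective(p_s, p_u_given_s, g, w):
    """Return I(U;Y) - I(U;S) when X = g(U, S) and the channel is w[(x, s)][y]."""
    joint_us, joint_uy = {}, {}
    for s, ps in p_s.items():
        for u, pu in p_u_given_s[s].items():
            joint_us[(u, s)] = joint_us.get((u, s), 0.0) + ps * pu
            for y, py in w[(g(u, s), s)].items():
                joint_uy[(u, y)] = joint_uy.get((u, y), 0.0) + ps * pu * py
    return mutual_information(joint_uy) - mutual_information(joint_us)


# Hypothetical toy channel: Y = X xor S xor Z with Z ~ Bernoulli(p), no input cost.
p = 0.1
p_s = {0: 0.5, 1: 0.5}
p_u_given_s = {s: {0: 0.5, 1: 0.5} for s in p_s}  # U uniform and independent of S
g = lambda u, s: u ^ s                             # the encoder pre-cancels the state
w = {(x, s): {x ^ s: 1 - p, x ^ s ^ 1: p}
     for x, s in product((0, 1), repeat=2)}

rate = gp_objective(p_s, p_u_given_s, g, w)
print(f"I(U;Y) - I(U;S) = {rate:.4f} nats")
```

In this toy example the encoder can cancel the state exactly, so the covering rate is zero and the value equals log 2 − h(p) nats, where h is the binary entropy function.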

In this paper, we revisit the Gel’fand-Pinsker problem. Instead of assuming stationarity and memorylessness on the channel and state sequence, we let the channel be a general one in the sense of Verdú and Han [6, 7]. That is, $\mathbf{W} = \{W^n\}_{n=1}^\infty$ is an arbitrary sequence of stochastic mappings from $\mathcal{X}^n \times \mathcal{S}^n$ to $\mathcal{Y}^n$. We also model the state distribution as a general one $\mathbf{S} = \{P_{S^n}\}_{n=1}^\infty$. Such generality allows us to understand optimal coding schemes for general systems. We prove an analogue of the Gel’fand-Pinsker capacity in (1) by using information spectrum analysis [7]. Our result is expressed in terms of the limit superior and limit inferior in probability operations [7]. For the direct part, we leverage a technique used by Iwata and Muramatsu [8] for the general Wyner-Ziv problem [2, Chapter 11]. Our proof technique involves a non-trivial modification of Wyner’s piggyback coding lemma (PBL) [9, Lemma 4.3]. We also find the capacity region for the case where rate-limited coded state information is available at the decoder. This setting, shown in Fig. 1, was studied by Steinberg [10], but we consider the scenario in which the state and channel are general.

Figure 1: The Gel’fand-Pinsker problem with rate-limited coded state information available at the decoder [11, 10]. The message is uniformly distributed in and independent of . The compression index is rate-limited and takes values in . The canonical Gel’fand-Pinsker problem [1] is a special case in which the output of the state encoder is a deterministic quantity.

1.1 Main Contributions

There are two main contributions in this work.

First, by developing a non-asymptotic upper bound on the average error probability for any Gel’fand-Pinsker problem (Lemma 8), we prove that the general capacity is

$$C = \sup\; \underline{I}(\mathbf{U};\mathbf{Y}) - \overline{I}(\mathbf{U};\mathbf{S}), \tag{2}$$

where the supremum is over all conditional probability laws $\{P_{X^n,U^n|S^n}\}_{n=1}^\infty$ (or equivalently over Markov chains $U^n - (X^n,S^n) - Y^n$ given the channel law $W^n$) and $\underline{I}(\mathbf{U};\mathbf{Y})$ (resp. $\overline{I}(\mathbf{U};\mathbf{S})$) is the spectral inf-mutual information rate (resp. the spectral sup-mutual information rate) [7]. The expression in (2) reflects the fact that there are two distinct steps: a covering step and a packing step. To cover successfully, we need to expend a rate of $\overline{I}(\mathbf{U};\mathbf{S}) + \gamma$ (for any $\gamma > 0$) as stipulated by general fixed-length rate-distortion theory [12, Section VI]. Thus, each subcodebook has to have at least $\exp(n(\overline{I}(\mathbf{U};\mathbf{S}) + \gamma))$ sequences. To decode the codeword’s subcodebook index correctly, we can have at most $\exp(n(\underline{I}(\mathbf{U};\mathbf{Y}) - \gamma))$ codewords by the general channel coding result of Verdú and Han [6]. We can specialize the general result in (2) to the following scenarios: (i) no state information, thus recovering the result by Verdú and Han [6], (ii) common state information is available at the encoder and the decoder, (iii) the channel and state are memoryless (but not necessarily stationary) and (iv) mixed channels and states [7].

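Second, we extend the above result to the case where coded state information is available at the decoder. This problem was first studied by Heegard and El Gamal [11] and later by Steinberg [10]. In this case, we combine our coding scheme with that of Iwata and Muramatsu for the general Wyner-Ziv problem [8] to obtain the tradeoff between $R_{\mathrm{d}}$, the rate of the compressed state information that is available at the decoder, and $R$, the message rate. We show that the tradeoff (or capacity region) is the set of rate pairs $(R, R_{\mathrm{d}})$ satisfying

$$R_{\mathrm{d}} \ge \overline{I}(\mathbf{V};\mathbf{S}) - \underline{I}(\mathbf{V};\mathbf{Y}), \tag{3}$$
$$R \le \underline{I}(\mathbf{U};\mathbf{Y}|\mathbf{V}) - \overline{I}(\mathbf{U};\mathbf{S}|\mathbf{V}), \tag{4}$$

for some $(\mathbf{U},\mathbf{V}) - (\mathbf{X},\mathbf{S}) - \mathbf{Y}$. This general result can be specialized to the stationary, memoryless setting [10].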

1.2 Related Work

The study of general channels started with the seminal work by Verdú and Han [6] in which the authors characterized the capacity in terms of the limit inferior in probability of a sequence of information densities. See Han’s book [7] for a comprehensive exposition on the information spectrum method. This line of analysis provides deep insights into the fundamental limits of the transmission of information over general channels and the compressibility of general sources that may not be stationary, memoryless or ergodic. Information spectrum analysis has been used for rate-distortion [12], the Wyner-Ziv problem [8], the Heegard-Berger problem [13], the Wyner-Ahlswede-Körner (WAK) problem [14] and the wiretap channel [15, 16]. The Wyner-Ziv and wiretap problems are the most closely related to the problem we solve in this paper. In particular, they involve differences of mutual informations akin to the Gel’fand-Pinsker problem. Even though it is not in the spirit of information spectrum methods, the work of Yu et al. [17] deals with the Gel’fand-Pinsker problem for non-stationary, non-ergodic Gaussian noise and state (also called “writing on colored paper”). We contrast our work to [17] in Section 3.5.

It is also worth mentioning that bounds on the reliability function (error exponent) for the Gel’fand-Pinsker problem have been derived by Tyagi and Narayan [3] (upper bounds) and Moulin and Wang [18] (lower bounds).

1.3 Paper Organization

The rest of this paper is structured as follows. In Section 2, we state our notation and various other definitions. In Section 3, we state all information spectrum-based results and their specializations. We conclude our discussion in Section 4. The proofs of all results are provided in the Appendices.

2 System Model and Main Definitions

In this section, we state our notation and the definitions of the various problems that we consider in this paper.

2.1 Notation

Random variables (e.g., ) and their realizations (e.g., ) are denoted by upper case and lower case serif font respectively. Sets are denoted in calligraphic font (e.g., the alphabet of is ). We use the notation to mean a vector of random variables . In addition, denotes a general source in the sense that each member of the sequence is a random vector. The same holds for a general channel , which is simply a sequence of stochastic mappings from to . The set of all probability distributions with support on an alphabet is denoted as . We use the notation to mean that the distribution of is . The joint distribution induced by the marginal and the conditional is denoted as . Information-theoretic quantities are denoted using the usual notations [7, 4] (e.g., for mutual information or if we want to make the dependence on distributions explicit). All logarithms are to the base . We also use the discrete interval [2] notation and, for the most part, ignore integer requirements.

We recall the following probabilistic limit operations [7].

Definition 1.

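Let $\mathbf{U} := \{U_n\}_{n=1}^\infty$ be a sequence of real-valued random variables. The limsup in probability of $\mathbf{U}$ is an extended real number defined as

$$\operatorname*{p-lim\,sup}_{n\to\infty} U_n := \inf\Big\{\alpha : \lim_{n\to\infty} \Pr(U_n > \alpha) = 0\Big\}. \tag{5}$$

The liminf in probability of $\mathbf{U}$ is defined as

$$\operatorname*{p-lim\,inf}_{n\to\infty} U_n := \sup\Big\{\beta : \lim_{n\to\infty} \Pr(U_n < \beta) = 0\Big\}. \tag{6}$$
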
We also recall the following definitions from Han [7]. These definitions play a prominent role in the following.

Definition 2.

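Given a pair of stochastic processes $(\mathbf{X},\mathbf{Y}) = \{X^n, Y^n\}_{n=1}^\infty$ with joint distributions $\{P_{X^n,Y^n}\}_{n=1}^\infty$, the spectral sup-mutual information rate is defined as

$$\overline{I}(\mathbf{X};\mathbf{Y}) := \operatorname*{p-lim\,sup}_{n\to\infty} \frac{1}{n}\log\frac{P_{Y^n|X^n}(Y^n|X^n)}{P_{Y^n}(Y^n)}. \tag{7}$$

The spectral inf-mutual information rate $\underline{I}(\mathbf{X};\mathbf{Y})$ is defined as in (7) with p-lim inf in place of p-lim sup. The spectral sup- and inf-conditional mutual information rates are defined similarly.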
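To make the definition concrete, the sketch below computes the exact distribution (the information spectrum) of the normalized information density for i.i.d. uniform inputs over a binary symmetric channel. This is a hypothetical example, chosen because the density then depends only on the number of bit flips; the function names and parameter values are assumptions made for illustration. As n grows, the probability mass concentrates around the mutual information, so in this case the spectral sup- and inf-mutual information rates coincide.

```python
import math


def bsc_information_spectrum(n, p):
    """Exact pmf of (1/n) * log[W^n(Y^n|X^n) / P_{Y^n}(Y^n)] for a BSC(p)
    driven by i.i.d. uniform bits.

    Returns a list of (value, probability) pairs, one per number of flips k.
    """
    no_flip = math.log(2 * (1 - p))  # per-symbol density when y == x
    flip = math.log(2 * p)           # per-symbol density when y != x
    spectrum = []
    for k in range(n + 1):
        # Binomial(n, p) probability of k flips, computed in the log domain.
        log_prob = (math.lgamma(n + 1) - math.lgamma(k + 1)
                    - math.lgamma(n - k + 1)
                    + k * math.log(p) + (n - k) * math.log(1 - p))
        value = ((n - k) * no_flip + k * flip) / n
        spectrum.append((value, math.exp(log_prob)))
    return spectrum


def mass_near(spectrum, center, eps):
    """Probability that the normalized density lies within eps of center."""
    return sum(prob for value, prob in spectrum if abs(value - center) <= eps)


p = 0.11
mutual_info = math.log(2) + p * math.log(p) + (1 - p) * math.log(1 - p)
for n in (50, 500, 5000):
    mass = mass_near(bsc_information_spectrum(n, p), mutual_info, 0.05)
    print(n, round(mass, 4))
```

For a source or channel that is not information stable, the same computation would show mass persisting at several distinct values, and the sup- and inf-rates would then differ.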

2.2 The Gel’fand-Pinsker Problem

We now recall the definition of the Gel’fand-Pinsker problem. The setup is shown in Fig. 1 with .

Definition 3.

An code for the Gel’fand-Pinsker problem with channel and state distribution consists of

  • An encoder (possibly randomized)

  • A decoder

such that the average error probability in decoding the message does not exceed , i.e.,


where and .

We assume that the message is uniformly distributed in the message set and that it is independent of the state sequence . Here we remark that in (8) (and everywhere else in the paper), we use the notation even though may not be countable. This is the convention we adopt throughout the paper even though integrating against the measure as in would be more precise.

Definition 4.

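We say that the nonnegative number $R$ is an $\epsilon$-achievable rate if there exists a sequence of $(n, M_n, \epsilon_n)$ codes for which

$$\liminf_{n\to\infty} \frac{1}{n}\log M_n \ge R, \qquad \limsup_{n\to\infty} \epsilon_n \le \epsilon. \tag{9}$$

The $\epsilon$-capacity $C_\epsilon$ is the supremum of all $\epsilon$-achievable rates. The capacity is $C := \lim_{\epsilon \to 0} C_\epsilon$.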

The capacity of the general Gel’fand-Pinsker channel is stated in Section 3.1. This generalizes the result in (1), which is the capacity for the Gel’fand-Pinsker channel when the channel and state are stationary and memoryless.

2.3 The Gel’fand-Pinsker Problem With Coded State Information at the Decoder

In fact, the general information spectrum techniques allow us to solve a related problem, which was first considered by Heegard and El Gamal [11] and subsequently solved by Steinberg [10]. The setup is shown in Fig. 1.

Definition 5.

An code for the Gel’fand-Pinsker problem with channel and state distribution and with coded state information at the decoder consists of

  • A state encoder:

  • An encoder: (possibly randomized)

  • A decoder:

such that the average error probability in decoding the message is no larger than , i.e.,


where .

Definition 6.

We say that the pair of nonnegative numbers is an achievable rate pair if there exists a sequence of codes such that in addition to (9), the following holds


The set of all achievable rate pairs is known as the capacity region .

Heegard and El Gamal [11] (achievability) and Steinberg [10] (converse) showed for the discrete memoryless channel and discrete memoryless state that the capacity region is the set of rate pairs such that


for some Markov chain . Furthermore, can be taken to be a deterministic function of , and . The constraint in (12) is obtained using Wyner-Ziv coding with “source” and “side-information” . The constraint in (13) is analogous to the Gel’fand-Pinsker capacity where is common to both encoder and decoder. A weak converse was proven using repeated applications of the Csiszár-sum-identity. We generalize Steinberg’s region for the general source and general channel using information spectrum techniques in Section 3.6.

3 Information Spectrum Characterization of the General Gel’fand-Pinsker problem

In this section, we first present the main result concerning the capacity of the general Gel’fand-Pinsker problem (Definition 3) in Section 3.1. This result is derived using the information spectrum method. We then derive the capacity for various special cases of the Gel’fand-Pinsker problem in Section 3.2 (two-sided common state information) and Section 3.3 (memoryless channels and state). We consider mixed states and mixed channels in Section 3.4. The main ideas in the proof are discussed in Section 3.5. Finally, in Section 3.6, we extend our result to the general Gel’fand-Pinsker problem with coded state information at the decoder (Definition 5).

Figure 2: Illustration of Theorem 1 where and similarly for . The capacity is the difference between and evaluated at the optimal processes. The stationary, memoryless case (Corollary 4) corresponds to the situation in which and so the information spectra are point masses at the mutual informations.

3.1 Main Result and Remarks

We now state our main result followed by some simple remarks. The proof can be found in Appendices .1 and .2.

Theorem 1 (General Gel’fand-Pinsker Capacity).

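The capacity of the general Gel’fand-Pinsker channel with general states (see Definition 4) is

$$C = \sup_{\mathbf{U} - (\mathbf{X},\mathbf{S}) - \mathbf{Y}} \underline{I}(\mathbf{U};\mathbf{Y}) - \overline{I}(\mathbf{U};\mathbf{S}), \tag{14}$$

where the maximization is over all sequences of random variables $(\mathbf{U},\mathbf{X},\mathbf{S},\mathbf{Y}) = \{U^n, X^n, S^n, Y^n\}_{n=1}^\infty$ forming the requisite Markov chain, having the state distribution coinciding with $\mathbf{S}$ and having conditional distribution of $\mathbf{Y}$ given $(\mathbf{X},\mathbf{S})$ equal to the general channel $\mathbf{W}$.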

See Fig. 2 for an illustration of Theorem 1.

Remark 1.

The general formula in (14) is the dual of that in the Wyner-Ziv case [8]. However, the proofs, and in particular, the constructions of the codebooks, the notions of typicality and the application of Wyner’s PBL, are subtly different from [14, 8, 19]. We discuss these issues in Section 3.5. Another problem which involves difference of mutual information quantities is the wiretap channel [2, Chapter 22]. General formulas for the secrecy capacity using channel resolvability theory [7, Chapter 5] were provided by Hayashi [15] and Bloch and Laneman [16]. They also involve the difference between spectral inf-mutual information rate (of the input and the legitimate receiver) and sup-mutual information rate (of the input and the eavesdropper).

Remark 2.

Unlike the usual Gel’fand-Pinsker formula for stationary and memoryless channels and states in (1), we cannot conclude that the conditional distribution we are optimizing over in (14) can be decomposed into the conditional distribution of given (i.e., ) and a deterministic function (i.e., ).

Remark 3.

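If there is no state, i.e., $\mathcal{S} = \emptyset$ in (14), then we recover the general formula for channel capacity by Verdú and Han (VH) [6]

$$C_{\mathrm{VH}} = \sup_{\mathbf{X}} \underline{I}(\mathbf{X};\mathbf{Y}). \tag{15}$$

The achievability follows by setting $\mathbf{U} = \mathbf{X}$. The converse follows by noting that $\underline{I}(\mathbf{U};\mathbf{Y}) \le \underline{I}(\mathbf{X};\mathbf{Y})$ because $\mathbf{U} - \mathbf{X} - \mathbf{Y}$ [6, Theorem 9]. This is the analogue of the data processing inequality for the spectral inf-mutual information rate.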

Remark 4.

The general formula in (14) can be slightly generalized to the Cover-Chiang (CC) setting [20] in which (i) the channel depends on two state sequences (in addition to ), (ii) partial channel state information is available noncausally at the encoder and (iii) partial channel state information is available at the decoder. In this case, replacing with and with in (14) yields


where the supremum is over all processes such that coincides with the state distributions and given coincides with the sequence of channels . Hence the optimization in (16) is over the conditionals .

3.2 Two-sided Common State Information

Specializing (16) to the case where , i.e., the same side information is available to both encoder and decoder (ED), does not appear to be straightforward without further assumptions. Recall that in the usual scenario [20, Case 4 in Corollary 1], we use the identification and chain rule for mutual information to assert that evaluated at the optimal is the capacity. However, in information spectrum analysis, the chain rule does not hold for the operation. In fact, is superadditive [7]. Nevertheless, under the assumption that a sequence of information densities converges in probability, we can derive the capacity of the general channel with general common state available at both terminals using Theorem 1.

Corollary 2 (General Channel Capacity with State at ED).

Consider the problem


where the supremum is over all such that coincides with the given state distributions and given coincides with the given channels . Assume that the maximizer of (17) exists and denote it by . Let the distribution of given be . If


then the capacity of the state-dependent channel with state available at both encoder and decoder is in (17).

The proof is provided in Appendix .3. If the joint process satisfies (18) (and and are finite sets), it is called information stable [21]. In other words, the limit distribution of (where ) in Fig. 2 concentrates at a single point. We remark that a different achievability proof technique (that does not use Theorem 1) would allow us to dispense with the information stability assumption. We can simply develop a conditional version of Feinstein’s lemma [7, Lemma 3.4.1] to prove the direct part of (17). However, we choose to start from Theorem 1, which is the most general capacity result for the Gel’fand-Pinsker problem. Note that the converse of Corollary 2 does not require (18).

3.3 Memoryless Channels and Memoryless States

To see how we can use Theorem 1 concretely, we specialize it to the memoryless (but not necessarily stationary) setting and we provide some interesting examples. In the memoryless setting, the sequence of channels and the sequence of state distributions are such that for every , we have , and for some and some .

Corollary 3 (Memoryless Gel’fand-Pinsker Channel Capacity).

Assume that and are finite sets and the Gel’fand-Pinsker channel is memoryless and characterized by and . Define . Let the maximizers to the optimization problems indexed by


be denoted as . Let be the joint distribution induced by . Assume that either of the two limits


exist. Then the capacity of the memoryless Gel’fand-Pinsker channel is


The proof of Corollary 3 is detailed in Appendix .4. The Cesàro summability assumption in (20) is only required for achievability. We illustrate the assumption in (20) with two examples in the sequel. The proof of the direct part of Corollary 3 follows by taking the optimization in the general result (14) to be over memoryless conditional distributions. The converse follows by repeated applications of the Csiszár-sum-identity [2, Chapter 2]. If, in addition to being memoryless, the channels and states are stationary (i.e., each and each is equal to and respectively), then both limits in (20) exist since is the same for each .

Corollary 4 (Stationary, Memoryless Gel’fand-Pinsker Channel Capacity).

Assume that is a finite set. In the stationary, memoryless case, the capacity of the Gel’fand-Pinsker channel is given by $C$ in (1).

We omit the proof because it is a straightforward consequence of Corollary 3. Close examination of the proof of Corollary 3 shows that only the converse of Corollary 4 requires the assumption that . The achievability of Corollary 4 follows easily from Khintchine’s law of large numbers [7, Lemma 1.3.2] (for abstract alphabets).
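The law-of-large-numbers step can be sketched as follows, under the assumption (made here for illustration) that the auxiliary process is chosen i.i.d., i.e., the triples $(U_i, S_i, Y_i)$ are independent copies of $(U, S, Y)$ with finite mutual informations. The normalized information densities are then sample averages of i.i.d. terms:

```latex
\begin{align*}
\frac{1}{n}\log\frac{P_{Y^n|U^n}(Y^n|U^n)}{P_{Y^n}(Y^n)}
  &= \frac{1}{n}\sum_{i=1}^{n}\log\frac{P_{Y|U}(Y_i|U_i)}{P_{Y}(Y_i)}
  \;\xrightarrow{\ \mathrm{p}\ }\; I(U;Y), \\
\frac{1}{n}\log\frac{P_{U^n|S^n}(U^n|S^n)}{P_{U^n}(U^n)}
  &= \frac{1}{n}\sum_{i=1}^{n}\log\frac{P_{U|S}(U_i|S_i)}{P_{U}(U_i)}
  \;\xrightarrow{\ \mathrm{p}\ }\; I(U;S).
\end{align*}
```

Both information spectra therefore collapse to point masses, so that $\underline{I}(\mathbf{U};\mathbf{Y}) = I(U;Y)$ and $\overline{I}(\mathbf{U};\mathbf{S}) = I(U;S)$, and the general objective evaluated at this process equals $I(U;Y) - I(U;S)$.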

To gain a better understanding of the assumption in (20) in Corollary 3, we now present a couple of (pathological) examples which are inspired by [7, Remark 3.2.3].

Example 1.

Let . Consider a discrete, nonstationary, memoryless channel satisfying


where are two distinct channels. Let be the -fold extension of some . Let be the -marginal of the maximizer of (19) when the channel is . In general, . Because , the first limit in (20) does not exist. Similarly, the second limit does not exist in general and Corollary 3 cannot be applied.
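The failure of Cesàro convergence can be illustrated numerically. The sketch below assumes, purely for illustration, that the index set consists of dyadic blocks, $\{i : 2^{2k-1} \le i < 2^{2k} \text{ for some integer } k \ge 1\}$; this specific set, the function names and the values 1.0 and 0.0 standing in for two distinct per-letter mutual informations are hypothetical choices and are not taken from the paper.

```python
def in_dyadic_set(i):
    """True when 2**(2k-1) <= i < 2**(2k) for some integer k >= 1."""
    return i.bit_length() % 2 == 0


def cesaro_mean(n, value_in, value_out):
    """(1/n) * sum_{i=1}^{n} c_i, where c_i = value_in if i is in the
    dyadic set and c_i = value_out otherwise."""
    hits = sum(1 for i in range(1, n + 1) if in_dyadic_set(i))
    return (hits * value_in + (n - hits) * value_out) / n


# Running averages sampled just before each block boundary n = 2**m - 1.
for m in range(6, 15):
    n = 2 ** m - 1
    print(n, round(cesaro_mean(n, 1.0, 0.0), 4))
```

Along even m the running average stays near 2/3, while along odd m it stays near 1/3, so the sequence of averages has no limit whenever the two values differ.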

Example 2.

Let be as in Example 1 and let the set of even and odd positive integers be and respectively. Let . Consider a binary, nonstationary, memoryless channel satisfying


where . Also consider a binary, nonstationary, memoryless state satisfying


where . In addition, assume that for are binary symmetric channels with arbitrary crossover probabilities . Let be the -marginal of the maximizer in (19) when the channel is and the state distribution is . For (odd blocklengths), due to the symmetry of the channels the optimal is Bernoulli and independent of  [2, Problem 7.12(c)]. Thus, for all odd blocklengths, the mutual informations in the first limit in (20) are equal to zero. Clearly, the first limit in (20) exists, equalling (contributed by the even blocklengths). Therefore, Corollary 3 applies and we can show that the Gel’fand-Pinsker capacity is


where in (19) is given explicitly in [2, Problem 7.12(c)] and is defined as


See Appendix .5 for the derivation of (25). The expression in (25) implies that the capacity consists of two parts: represents the performance of the system at even blocklengths, while represents the non-ergodic behavior of the channel at odd blocklengths with state distribution ; cf. [7, Remark 3.2.3]. In the special case that (e.g., ), then the capacity is the average of and .

3.4 Mixed Channels and Mixed States

Now we use Theorem 1 to compute the capacity of the Gel’fand-Pinsker channel when the channel and state sequence are mixed. More precisely, we assume that


Note that we require . In fact, let and . Note that if is a stationary and memoryless source, the composite source given by (28), is a canonical example of a non-ergodic and stationary source. By (27), the channel can be regarded as an average channel given by the convex combination of constituent channels. It is stationary but non-ergodic and non-memoryless. Given , define the following random variables which are indexed by and :

Corollary 5 (Mixed Channels and Mixed States).

The capacity of the general mixed Gel’fand-Pinsker channel with general mixed state as in (27)–(29) is


where the maximization is over all sequences of random variables with state distribution coinciding with in (28) and having conditional distribution of given equal to the general channel in (27). Furthermore, if each general state sequence and each general channel is stationary and memoryless, the capacity is lower bounded as


where and the maximization is over all joint distributions satisfying for some .

Corollary 5 is proved in Appendix .6 and it basically applies [7, Lemma 3.3.2] to the mixture with components in (29). Different from existing analyses for mixed channels and sources [7, 8], here there are two independent mixtures—that of the channel and the state. Hence, we need to minimize over two indices for the first term in (30). If instead of the countable number of terms in the sums in (27) and (28), the number of mixture components (of either the source or channel) is uncountable, Corollary 5 no longer applies and a corresponding result has to involve the assumptions that the alphabets are finite and the constituent channels are memoryless. See [7, Theorem 3.3.6].

The corollary says that the Gel’fand-Pinsker capacity is governed by two elements: (i) the “worst-case” virtual channel (from to ), i.e., the one with the smallest packing rate and (ii) the “worst-case” state distribution, i.e., the one that results in the largest covering rate . Unfortunately, obtaining a converse result for the stationary, memoryless case from (30) does not appear to be straightforward. The same issue was also encountered for the mixed wiretap channel [16].
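The "worst-case" behavior can be visualized in a simplified setting without state. The sketch below assumes a mixture of two binary symmetric channels driven by i.i.d. uniform inputs (a hypothetical example; the crossover probabilities, mixing weight and function names are illustrative assumptions). It computes the exact finite-n information spectrum and reports the points below which, and above which, at most a fraction delta of the probability mass lies; these serve as finite-n proxies for the liminf and limsup in probability.

```python
import math


def log_binom_pmf(n, k, p):
    """Log of the Binomial(n, p) probability of k successes."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))


def logsumexp2(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def mixed_bsc_spectrum(n, p1, p2, alpha):
    """pmf of (1/n) log[W^n(Y^n|X^n) / P_{Y^n}(Y^n)] for the mixed channel
    W^n = alpha * BSC(p1)^n + (1 - alpha) * BSC(p2)^n with uniform inputs."""
    spectrum = []
    for k in range(n + 1):  # k = number of flipped positions
        log_w1 = math.log(alpha) + k * math.log(p1) + (n - k) * math.log(1 - p1)
        log_w2 = math.log(1 - alpha) + k * math.log(p2) + (n - k) * math.log(1 - p2)
        # With uniform inputs the output is uniform, so -log P_{Y^n} = n log 2.
        value = math.log(2) + logsumexp2(log_w1, log_w2) / n
        log_pk = logsumexp2(math.log(alpha) + log_binom_pmf(n, k, p1),
                            math.log(1 - alpha) + log_binom_pmf(n, k, p2))
        spectrum.append((value, math.exp(log_pk)))
    return spectrum


def thresholds(spectrum, delta):
    """Return (lower, upper): the largest support point b with P(V < b) <= delta
    and the smallest support point a with P(V > a) <= delta."""
    pts = sorted(spectrum)
    lower, head = pts[0][0], 0.0
    for value, prob in pts:
        if head > delta:
            break
        lower, head = value, head + prob
    upper, tail = pts[-1][0], 0.0
    for value, prob in reversed(pts):
        if tail > delta:
            break
        upper, tail = value, tail + prob
    return lower, upper


def bsc_capacity(p):
    """Capacity of a BSC(p) in nats."""
    return math.log(2) + p * math.log(p) + (1 - p) * math.log(1 - p)


lower, upper = thresholds(mixed_bsc_spectrum(2000, 0.05, 0.2, 0.5), 0.01)
print(round(lower, 3), round(bsc_capacity(0.2), 3))
print(round(upper, 3), round(bsc_capacity(0.05), 3))
```

In this toy case the spectrum has two clusters, one near each constituent capacity, and the lower threshold tracks the noisier component, which is the one that limits reliable communication over the mixture.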

3.5 Proof Idea of Theorem 1

Direct part

The high-level idea in the achievability proof is similar to the usual Gel’fand-Pinsker coding scheme [1] which involves a covering step to reduce the uncertainty due to the random state sequence and a packing step to decode the transmitted codeword. However, to use the information spectrum method on the general channel and general state, the definitions of “typicality” have to be restated in terms of information densities. See the definitions in Appendix .1. The main burden for the proof is to show that the probability that the transmitted codeword is not “typical” with the channel output vanishes. In regular Gel’fand-Pinsker coding, one appeals to the conditional typicality lemma [1, Lemma 2] [2, Chapter 2] (which holds for “strongly typical sets”) to assert that this error probability is small. But the “typical sets” used in information spectrum analysis do not allow us to apply the conditional typicality lemma in a straightforward manner. For example, our decoder is a threshold test involving the information density statistic . It is not clear in the event that there is no covering error that the transmitted codeword passes the threshold test (i.e., exceeds a certain threshold) with high probability.

To get around this problem, we modify Wyner’s PBL [9, Lemma 4.3] [22, Lemma A.1] accordingly. Wyner essentially derived an analog of the Markov lemma [2, Chapter 12] without strong typicality by introducing a new “typical set” defined in terms of conditional probabilities. This new definition is particularly useful for problems that involve covering and packing as well as having some Markov structure. Our analysis is somewhat similar to the analyses of the general Wyner-Ziv problem in [8] and the WAK problem in [14, 19]. This is unsurprising given that the Wyner-Ziv and Gel’fand-Pinsker problems are duals [20]. However, unlike in [8], we construct random subcodebooks and use them in subsequent steps rather than asserting the existence of a single codebook via random selection and subsequently regarding it as deterministic. This is because unlike Wyner-Ziv, we need to construct exponentially many subcodebooks each of size and indexing a message in . We also require each of these subcodebooks to be different and identifiable based on the channel output. Also, our analogue of Wyner’s “typical set” is different from previous works.

We also point out that Yu et al. [17] considered the Gaussian Gel’fand-Pinsker problem for non-stationary and non-ergodic channel and state. However, the notion of typicality used is “weak typicality”, which means that the sample entropy is close to the entropy rate. This notion does not generalize well for obtaining the capacity expression in (14), which involves limits in probability of information densities. Furthermore, Gaussianity is a crucial hypothesis in the proof of the asymptotic equipartition property in [17].

Converse part

For the converse, we use the Verdú-Han converse [7, Lemma 3.2.2] and the fact that the message is independent of the state sequence. Essentially, we emulate the steps for the converse of the general wiretap channel presented by Bloch and Laneman [16, Lemma 3].

3.6 Coded State Information at Decoder

We now state the capacity region of the coded state information problem (Definition 5).

Theorem 6 (Coded State Information at Decoder).

The capacity region of the Gel’fand-Pinsker problem with coded state information at the decoder (see Definition 6) is given by the set of pairs satisfying


for satisfying , having the state distribution coinciding with and having conditional distribution of given equal to the general channel .

A proof sketch is provided in Appendix .7. For the direct part, we combine Wyner-Ziv and Gel’fand-Pinsker coding to obtain the two constraints in Theorem 6. To prove the converse, we exploit the independence of the message and the state, the Verdú-Han lemma [7, Lemma 3.2.2] and the proof technique for the converse of the general rate-distortion problem [7, Section 5.4]. Because the proof of Theorem 6 is very similar to that of Theorem 1, we only provide a sketch. We note that similar ideas can be easily employed to find the general capacity region for the problem of coded state information at the encoder (and full state information at the decoder) [23]. In analogy to Corollary 4, we can use Theorem 6 to recover Steinberg’s result [10] for the stationary, memoryless case. See Appendix .8 for the proof.

Corollary 7 (Coded State Information at Decoder for Stationary, Memoryless Channels and States).

Assume that is a finite set. The capacity of the Gel’fand-Pinsker channel with coded state information at the decoder in the stationary, memoryless case is given in (12) and (13).

4 Conclusion

In this work, we derived the capacity of the general Gel’fand-Pinsker channel with general state distribution using the information spectrum method. We also extended the analysis to the case where coded state information is available at the decoder.

.1 Proof of Theorem 1

Basic definitions: Fix and some conditional distribution . Define the sets


where the random variables . We define the probabilities


where and and where these joint distributions are computed with respect to . Note that and are information spectrum [7] quantities.


We begin with achievability. We show that the rate


is achievable. The next lemma provides an upper bound on the error probability in terms of the above quantities.

Lemma 8 (Nonasymptotic upper bound on error probability for Gel’fand-Pinsker).

Fix a sequence of conditional distributions . This specifies and