A Formula for the Capacity of the General Gel’fand–Pinsker Channel
Abstract
We consider the Gel’fand–Pinsker problem in which the channel and state are general, i.e., possibly nonstationary, nonmemoryless and nonergodic. Using the information spectrum method and a nontrivial modification of the piggyback coding lemma by Wyner, we prove that the capacity can be expressed as an optimization over the difference of a spectral inf-mutual information rate and a spectral sup-mutual information rate. We consider various specializations including the case where the channel and state are memoryless but not necessarily stationary.
Gel’fand–Pinsker, Information spectrum, General channels, General sources
I Introduction
In this paper, we consider the classical problem of channel coding with noncausal state information at the encoder, also known as the Gel’fand–Pinsker problem [1]. In this problem, we would like to send a uniformly distributed message $M$ over a state-dependent channel $W^n : \mathcal{X}^n \times \mathcal{S}^n \to \mathcal{Y}^n$, where $\mathcal{S}$, $\mathcal{X}$ and $\mathcal{Y}$ are the state, input and output alphabets respectively. The random state sequence $S^n \sim P_{S^n}$ is available noncausally at the encoder but not at the decoder. See Fig. 1. The Gel’fand–Pinsker problem consists in finding the maximum rate for which there exists a reliable code. Assuming that the channel and state sequence are stationary and memoryless, Gel’fand and Pinsker [1] showed that this maximum message rate or capacity is given by
$C = \max_{P_{U|S},\, f : \mathcal{U} \times \mathcal{S} \to \mathcal{X}} \left[ I(U;Y) - I(U;S) \right].$   (1)
The coding scheme involves a covering step at the encoder to reduce the uncertainty due to the random state sequence and a packing step to decode the message [2, Chapter 7]. Thus, we observe the covering rate $I(U;S)$ and the packing rate $I(U;Y)$ in (1). A weak converse can be proved by using the Csiszár sum identity [2, Chapter 7]. A strong converse was proved by Tyagi and Narayan [3] using entropy and image-size characterizations [4, Chapter 15]. The Gel’fand–Pinsker problem has numerous applications, particularly in information hiding and watermarking [5].
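As a numerical sanity check of the formula in (1) (our illustration, not part of the original development), consider the classical memory-with-stuck-at-faults example: a fraction $p$ of memory cells is stuck at 0 or 1 (equally likely), and the defect pattern is known to the writer. The sketch below evaluates the covering and packing rates for the standard choice of auxiliary variable and recovers the known capacity $1-p$; all numbers are illustrative.

```python
import numpy as np

def mutual_information(joint):
    """I(A;B) in bits from a joint pmf (rows index A, columns index B)."""
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float((joint[mask] * np.log2(joint[mask] / (pa * pb)[mask])).sum())

p = 0.3  # fraction of defective (stuck) memory cells, known to the writer

# States: 0 = stuck-at-0, 1 = stuck-at-1, 2 = free (writable)
P_S = np.array([p / 2, p / 2, 1 - p])

# Auxiliary U is the bit the cell will actually hold:
# if stuck at b, set U = b; if free, draw U ~ Bernoulli(1/2) and write X = U.
P_U_given_S = np.array([[1.0, 0.0],
                        [0.0, 1.0],
                        [0.5, 0.5]])

joint_SU = P_S[:, None] * P_U_given_S  # P(s, u)
I_US = mutual_information(joint_SU)    # covering rate I(U;S)

# With this choice the output always equals U, so the packing rate is H(U).
P_U = joint_SU.sum(axis=0)
H_U = float(-(P_U * np.log2(P_U)).sum())

rate = H_U - I_US
print(round(rate, 10))  # 0.7, matching the known capacity 1 - p
```

Here $I(U;Y) = H(U) = 1$ and $I(U;S) = p$, so the difference in (1) evaluates to $1 - p$.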
In this paper, we revisit the Gel’fand–Pinsker problem. Instead of assuming stationarity and memorylessness of the channel and state sequence, we let the channel be a general one in the sense of Verdú and Han [6, 7]. That is, the channel $\mathbf{W} = \{W^n : \mathcal{X}^n \times \mathcal{S}^n \to \mathcal{Y}^n\}_{n=1}^{\infty}$ is an arbitrary sequence of stochastic mappings. We also model the state distribution as a general one $\mathbf{S} = \{S^n\}_{n=1}^{\infty}$. Such generality allows us to understand optimal coding schemes for general systems. We prove an analogue of the Gel’fand–Pinsker capacity in (1) by using information spectrum analysis [7]. Our result is expressed in terms of the limit superior and limit inferior in probability operations [7]. For the direct part, we leverage a technique used by Iwata and Muramatsu [8] for the general Wyner–Ziv problem [2, Chapter 11]. Our proof technique involves a nontrivial modification of Wyner’s piggyback coding lemma (PBL) [9, Lemma 4.3]. We also find the capacity region for the case where rate-limited coded state information is available at the decoder. This setting, shown in Fig. 1, was studied by Steinberg [10], but we consider the scenario in which the state and channel are general.
I-A Main Contributions
There are two main contributions in this work.
First, by developing a non-asymptotic upper bound on the average error probability for any Gel’fand–Pinsker code (Lemma 8), we prove that the general capacity is
$C = \sup_{\mathbf{U}} \left[ \underline{I}(\mathbf{U};\mathbf{Y}) - \overline{I}(\mathbf{U};\mathbf{S}) \right],$   (2)
where the supremum is over all conditional probability laws $\{P_{U^n, X^n | S^n}\}_{n=1}^{\infty}$ (or equivalently over Markov chains $\mathbf{U} \to (\mathbf{X}, \mathbf{S}) \to \mathbf{Y}$ given the channel law $\mathbf{W}$) and $\underline{I}(\mathbf{U};\mathbf{Y})$ (resp. $\overline{I}(\mathbf{U};\mathbf{S})$) is the spectral inf-mutual information rate (resp. the spectral sup-mutual information rate) [7]. The expression in (2) reflects the fact that there are two distinct steps: a covering step and a packing step. To cover successfully, we need to expend a rate of $\overline{I}(\mathbf{U};\mathbf{S}) + \gamma$ (for any $\gamma > 0$) as stipulated by general fixed-length rate-distortion theory [12, Section VI]. Thus, each subcodebook has to have at least $2^{n(\overline{I}(\mathbf{U};\mathbf{S}) + \gamma)}$ sequences. To decode the codeword’s subcodebook index correctly, we can have at most $2^{n(\underline{I}(\mathbf{U};\mathbf{Y}) - \gamma)}$ codewords by the general channel coding result of Verdú and Han [6]. We can specialize the general result in (2) to the following scenarios: (i) no state information, thus recovering the result by Verdú and Han [6], (ii) common state information available at the encoder and the decoder, (iii) memoryless (but not necessarily stationary) channel and state, and (iv) mixed channels and states [7].
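The covering/packing accounting above can be summarized in a few lines of arithmetic. In the sketch below, all numbers are assumed for illustration: `I_pack` and `I_cover` stand in for the spectral inf- and sup-mutual information rates, and `gamma` for the slack.

```python
# Rate accounting behind the covering/packing interpretation of the general
# formula (illustrative, assumed numbers).
I_pack, I_cover, gamma = 0.8, 0.3, 0.01

R = I_pack - I_cover - 2 * gamma   # message rate targeted by the scheme
subcode_exp = I_cover + gamma      # normalized log-size of each subcodebook
total_exp = R + subcode_exp        # normalized log-size of the whole codebook

# The full codebook exactly exhausts the packing budget I_pack - gamma.
print(abs(total_exp - (I_pack - gamma)) < 1e-12)  # True
```

In other words, $2^{nR}$ subcodebooks of $2^{n(\text{cover}+\gamma)}$ sequences each fit inside the $2^{n(\text{pack}-\gamma)}$ codewords that the channel can support, which is exactly the tradeoff expressed in (2).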
Second, we extend the above result to the case where coded state information is available at the decoder. This problem was first studied by Heegard and El Gamal [11] and later by Steinberg [10]. In this case, we combine our coding scheme with that of Iwata and Muramatsu for the general Wyner–Ziv problem [8] to obtain the tradeoff between $R_{\mathrm{d}}$, the rate of the compressed state information that is available at the decoder, and $R$, the message rate. We show that the tradeoff (or capacity region) is the set of rate pairs $(R, R_{\mathrm{d}})$ satisfying
$R_{\mathrm{d}} \ge \overline{I}(\mathbf{V};\mathbf{S}) - \underline{I}(\mathbf{V};\mathbf{Y}),$   (3)
$R \le \underline{I}(\mathbf{U};\mathbf{Y}|\mathbf{V}) - \overline{I}(\mathbf{U};\mathbf{S}|\mathbf{V})$   (4)
for some processes $(\mathbf{U}, \mathbf{V}, \mathbf{X}, \mathbf{S}, \mathbf{Y})$. This general result can be specialized to the stationary, memoryless setting [10].
I-B Related Work
The study of general channels started with the seminal work of Verdú and Han [6] in which the authors characterized the capacity in terms of the limit inferior in probability of a sequence of information densities. See Han’s book [7] for a comprehensive exposition on the information spectrum method. This line of analysis provides deep insights into the fundamental limits of the transmission of information over general channels and the compressibility of general sources that may not be stationary, memoryless or ergodic. Information spectrum analysis has been used for rate-distortion [12], the Wyner–Ziv problem [8], the Heegard–Berger problem [13], the Wyner–Ahlswede–Körner (WAK) problem [14] and the wiretap channel [15, 16]. The Wyner–Ziv and wiretap problems are the most closely related to the problem we solve in this paper. In particular, they involve differences of mutual informations akin to the Gel’fand–Pinsker problem. Even though it is not in the spirit of information spectrum methods, the work of Yu et al. [17] deals with the Gel’fand–Pinsker problem for nonstationary, nonergodic Gaussian noise and state (also called “writing on colored paper”). We contrast our work to [17] in Section III-E.
I-C Paper Organization
The rest of this paper is structured as follows. In Section II, we state our notation and various other definitions. In Section III, we state all information spectrum-based results and their specializations. We conclude our discussion in Section IV. The proofs of all results are provided in the appendices.
II System Model and Main Definitions
In this section, we state our notation and the definitions of the various problems that we consider in this paper.
II-A Notation
Random variables (e.g., $X$) and their realizations (e.g., $x$) are denoted by upper case and lower case serif font respectively. Sets are denoted in calligraphic font (e.g., the alphabet of $X$ is $\mathcal{X}$). We use the notation $X^n := (X_1, \ldots, X_n)$ to mean a vector of random variables. In addition, $\mathbf{X} = \{X^n\}_{n=1}^{\infty}$ denotes a general source in the sense that each member of the sequence is a random vector. The same holds for a general channel $\mathbf{W} = \{W^n\}_{n=1}^{\infty}$, which is simply a sequence of stochastic mappings from $\mathcal{X}^n$ to $\mathcal{Y}^n$. The set of all probability distributions with support on an alphabet $\mathcal{X}$ is denoted as $\mathcal{P}(\mathcal{X})$. We use the notation $X \sim P_X$ to mean that the distribution of $X$ is $P_X$. The joint distribution induced by the marginal $P_X$ and the conditional $P_{Y|X}$ is denoted as $P_X \times P_{Y|X}$. Information-theoretic quantities are denoted using the usual notations [7, 4] (e.g., $I(X;Y)$ for mutual information, or $I(P_X, P_{Y|X})$ if we want to make the dependence on distributions explicit). All logarithms are to the base 2. We also use the discrete interval [2] notation $[i:j] := \{i, i+1, \ldots, j\}$ and, for the most part, ignore integer requirements.
We recall the following probabilistic limit operations [7].
Definition 1.
Let $\{Z_n\}_{n=1}^{\infty}$ be a sequence of real-valued random variables. The limsup in probability of $\{Z_n\}$ is an extended real number defined as
$\operatorname{p\text{-}limsup}_{n \to \infty} Z_n := \inf\left\{ \alpha \in \mathbb{R} : \lim_{n \to \infty} \Pr[Z_n > \alpha] = 0 \right\}.$   (5)
The liminf in probability of $\{Z_n\}$ is defined as
$\operatorname{p\text{-}liminf}_{n \to \infty} Z_n := \sup\left\{ \beta \in \mathbb{R} : \lim_{n \to \infty} \Pr[Z_n < \beta] = 0 \right\}.$   (6)
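These two operations can be illustrated empirically. The Monte Carlo sketch below is our own illustration (the two-mode construction and all numbers are invented): it samples a statistic whose limit distribution places mass near 0.2 and near 0.8, so the liminf and limsup in probability are 0.2 and 0.8 respectively, even though the ordinary limit does not exist.

```python
import numpy as np

rng = np.random.default_rng(0)

def p_limsup_est(samples, eps=1e-3):
    """Smallest alpha with P[Z > alpha] <= eps, estimated by a quantile."""
    return float(np.quantile(samples, 1.0 - eps))

def p_liminf_est(samples, eps=1e-3):
    """Largest beta with P[Z < beta] <= eps, estimated by a quantile."""
    return float(np.quantile(samples, eps))

# Z is a sample mean whose limit law has two atoms (a "mixed" behaviour):
# with prob 1/2 the mean is 0.2, with prob 1/2 it is 0.8.
n, trials = 10_000, 5_000
modes = rng.random(trials) < 0.5
Z = rng.normal(loc=np.where(modes, 0.2, 0.8), scale=1.0 / np.sqrt(n))

lo, hi = p_liminf_est(Z), p_limsup_est(Z)
print(lo, hi)  # close to 0.2 and 0.8 respectively
```

The quantile-based estimators are only heuristics for finite samples, but they capture the definitions in (5) and (6): the thresholds beyond which the limiting probability mass vanishes.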
We also recall the following definitions from Han [7]. These definitions play a prominent role in the following.
Definition 2.
Given a pair of stochastic processes $(\mathbf{X}, \mathbf{Y}) = \{(X^n, Y^n)\}_{n=1}^{\infty}$ with joint distributions $\{P_{X^n, Y^n}\}_{n=1}^{\infty}$, the spectral sup-mutual information rate is defined as
$\overline{I}(\mathbf{X};\mathbf{Y}) := \operatorname{p\text{-}limsup}_{n \to \infty} \frac{1}{n} \log \frac{P_{Y^n | X^n}(Y^n | X^n)}{P_{Y^n}(Y^n)}.$   (7)
The spectral inf-mutual information rate $\underline{I}(\mathbf{X};\mathbf{Y})$ is defined as in (7) with $\operatorname{p\text{-}liminf}$ in place of $\operatorname{p\text{-}limsup}$. The spectral sup- and inf-conditional mutual information rates are defined similarly.
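For a stationary, memoryless pair, the normalized information density concentrates, so both spectral rates coincide with the usual mutual information. The sketch below (our illustration; the binary symmetric channel, crossover 0.1 and the sample sizes are all assumptions) checks this numerically for a uniform input.

```python
import numpy as np

rng = np.random.default_rng(1)

n, trials, delta = 2_000, 2_000, 0.1   # BSC(delta) with a uniform input

# Per-symbol information density log2( P(y|x) / P(y) ), with P(y) = 1/2.
flips = rng.random((trials, n)) < delta
dens = np.where(flips, np.log2(2 * delta), np.log2(2 * (1 - delta)))
Zn = dens.mean(axis=1)                 # normalized information density

spec_inf = float(np.quantile(Zn, 0.001))  # crude inf-rate estimate
I_XY = 1 + delta * np.log2(delta) + (1 - delta) * np.log2(1 - delta)
print(abs(spec_inf - I_XY) < 0.1)  # True: both are about 0.531 bits
```

For nonergodic (e.g., mixed) pairs the empirical distribution of `Zn` would instead have several atoms, and the inf- and sup-rates would differ; this is the situation exploited in Section III-D.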
II-B The Gel’fand–Pinsker Problem
We now recall the definition of the Gel’fand–Pinsker problem. The setup is shown in Fig. 1 with $R_{\mathrm{d}} = 0$, i.e., without coded state information at the decoder.
Definition 3.
An $(n, M_n, \varepsilon_n)$ code for the Gel’fand–Pinsker problem with channel $\mathbf{W}$ and state distribution $\{P_{S^n}\}_{n=1}^{\infty}$ consists of

An encoder $f_n : [1:M_n] \times \mathcal{S}^n \to \mathcal{X}^n$ (possibly randomized)

A decoder $\varphi_n : \mathcal{Y}^n \to [1:M_n]$
such that the average error probability in decoding the message does not exceed $\varepsilon_n$, i.e.,
$\Pr[\hat{M} \ne M] \le \varepsilon_n,$   (8)
where $X^n = f_n(M, S^n)$ and $\hat{M} := \varphi_n(Y^n)$.
We assume that the message $M$ is uniformly distributed in the message set $[1:M_n]$ and that it is independent of the state sequence $S^n$. Here we remark that in (8) (and everywhere else in the paper), we use the summation notation $\sum_{s^n}$ even though $\mathcal{S}$ may not be countable. This is the convention we adopt throughout the paper even though integrating against the measure $P_{S^n}$ would be more precise.
Definition 4.
We say that the nonnegative number $R$ is an achievable rate if there exists a sequence of $(n, M_n, \varepsilon_n)$ codes for which
$\liminf_{n \to \infty} \frac{1}{n} \log M_n \ge R, \qquad \lim_{n \to \infty} \varepsilon_n = 0.$   (9)
The capacity $C$ is the supremum of all achievable rates.
II-C The Gel’fand–Pinsker Problem With Coded State Information at the Decoder
In fact, the general information spectrum techniques allow us to solve a related problem, first considered by Heegard and El Gamal [11] and subsequently solved by Steinberg [10]. The setup is shown in Fig. 1.
Definition 5.
An $(n, M_n, M_{\mathrm{d},n}, \varepsilon_n)$ code for the Gel’fand–Pinsker problem with channel $\mathbf{W}$, state distribution $\{P_{S^n}\}_{n=1}^{\infty}$ and coded state information at the decoder consists of

A state encoder $f_{\mathrm{d},n} : \mathcal{S}^n \to [1:M_{\mathrm{d},n}]$

An encoder $f_n : [1:M_n] \times \mathcal{S}^n \to \mathcal{X}^n$ (possibly randomized)

A decoder $\varphi_n : \mathcal{Y}^n \times [1:M_{\mathrm{d},n}] \to [1:M_n]$
such that the average error probability in decoding the message is no larger than $\varepsilon_n$, i.e.,
$\Pr[\hat{M} \ne M] \le \varepsilon_n,$   (10)
where $\hat{M} := \varphi_n(Y^n, f_{\mathrm{d},n}(S^n))$.
Definition 6.
We say that the pair of nonnegative numbers $(R, R_{\mathrm{d}})$ is an achievable rate pair if there exists a sequence of $(n, M_n, M_{\mathrm{d},n}, \varepsilon_n)$ codes such that, in addition to (9), the following holds:
$\limsup_{n \to \infty} \frac{1}{n} \log M_{\mathrm{d},n} \le R_{\mathrm{d}}.$   (11)
The set of all achievable rate pairs is known as the capacity region $\mathcal{C}$.
Heegard and El Gamal [11] (achievability) and Steinberg [10] (converse) showed, for a discrete memoryless channel and discrete memoryless state, that the capacity region is the set of rate pairs $(R, R_{\mathrm{d}})$ such that
$R_{\mathrm{d}} \ge I(V;S) - I(V;Y),$   (12)
$R \le I(U;Y|V) - I(U;S|V)$   (13)
for some Markov chain $(U,V) \to (X,S) \to Y$. Furthermore, $X$ can be taken to be a deterministic function of $U$, $V$ and $S$. The constraint in (12) is obtained using Wyner–Ziv coding with “source” $S$ and “side-information” $Y$. The constraint in (13) is analogous to the Gel’fand–Pinsker capacity in (1), where $V$ is common to both encoder and decoder. A weak converse was proven using repeated applications of the Csiszár sum identity. We generalize Steinberg’s region to the general source and general channel using information spectrum techniques in Section III-F.
III Information Spectrum Characterization of the General Gel’fand–Pinsker Problem
In this section, we first present the main result concerning the capacity of the general Gel’fand–Pinsker problem (Definition 3) in Section III-A. These results are derived using the information spectrum method. We then derive the capacity for various special cases of the Gel’fand–Pinsker problem in Section III-B (two-sided common state information) and Section III-C (memoryless channels and states). We consider mixed states and mixed channels in Section III-D. The main ideas in the proof are discussed in Section III-E. Finally, in Section III-F, we extend our result to the general Gel’fand–Pinsker problem with coded state information at the decoder (Definition 5).
III-A Main Result and Remarks
We now state our main result followed by some simple remarks. The proof can be found in Appendices A and B.
Theorem 1 (General Gel’fand–Pinsker Capacity).
The capacity of the general Gel’fand–Pinsker channel with general states (see Definition 4) is
$C = \sup_{\mathbf{U}} \left[ \underline{I}(\mathbf{U};\mathbf{Y}) - \overline{I}(\mathbf{U};\mathbf{S}) \right],$   (14)
where the maximization is over all sequences of random variables $(\mathbf{U}, \mathbf{X}, \mathbf{S}, \mathbf{Y})$ forming the requisite Markov chain $\mathbf{U} \to (\mathbf{X}, \mathbf{S}) \to \mathbf{Y}$,^1 having the state distribution coinciding with $\{P_{S^n}\}_{n=1}^{\infty}$ and having the conditional distribution of $\mathbf{Y}$ given $(\mathbf{X}, \mathbf{S})$ equal to the general channel $\mathbf{W}$. (^1 For three processes $(\mathbf{U}, \mathbf{X}, \mathbf{Y})$, we say that $\mathbf{U} \to \mathbf{X} \to \mathbf{Y}$ forms a Markov chain if $U^n \to X^n \to Y^n$ for all $n$.)
Remark 1.
The general formula in (14) is the dual of that in the Wyner–Ziv case [8]. However, the proofs, and in particular the constructions of the codebooks, the notions of typicality and the application of Wyner’s PBL, are subtly different from [14, 8, 19]. We discuss these issues in Section III-E. Another problem which involves a difference of mutual information quantities is the wiretap channel [2, Chapter 22]. General formulas for the secrecy capacity using channel resolvability theory [7, Chapter 5] were provided by Hayashi [15] and Bloch and Laneman [16]. They also involve the difference between a spectral inf-mutual information rate (of the input and the legitimate receiver) and a sup-mutual information rate (of the input and the eavesdropper).
Remark 2.
Remark 3.
If there is no state, i.e., $\mathbf{S} = \varnothing$ in (14), then we recover the general formula for channel capacity by Verdú and Han (VH) [6]:
$C = \sup_{\mathbf{X}} \underline{I}(\mathbf{X};\mathbf{Y}).$   (15)
The achievability follows by setting $\mathbf{U} = \mathbf{X}$. The converse follows by noting that $\underline{I}(\mathbf{U};\mathbf{Y}) \le \underline{I}(\mathbf{X};\mathbf{Y})$ because $\mathbf{U} \to \mathbf{X} \to \mathbf{Y}$ [6, Theorem 9]. This is the analogue of the data processing inequality for the spectral inf-mutual information rate.
Remark 4.
The general formula in (14) can be slightly generalized to the Cover–Chiang (CC) setting [20] in which (i) the channel depends on two state sequences $(\mathbf{S}_1, \mathbf{S}_2)$ (in addition to $\mathbf{X}$), (ii) partial channel state information $\mathbf{S}_1$ is available noncausally at the encoder and (iii) partial channel state information $\mathbf{S}_2$ is available at the decoder. In this case, replacing $\mathbf{S}$ with $\mathbf{S}_1$ and $\mathbf{Y}$ with $(\mathbf{Y}, \mathbf{S}_2)$ in (14) yields
$C = \sup_{\mathbf{U}} \left[ \underline{I}(\mathbf{U};\mathbf{Y},\mathbf{S}_2) - \overline{I}(\mathbf{U};\mathbf{S}_1) \right],$   (16)
where the supremum is over all processes $(\mathbf{U}, \mathbf{X}, \mathbf{S}_1, \mathbf{S}_2, \mathbf{Y})$ such that $(\mathbf{S}_1, \mathbf{S}_2)$ coincides with the given state distributions and $\mathbf{Y}$ given $(\mathbf{X}, \mathbf{S}_1, \mathbf{S}_2)$ coincides with the given sequence of channels. Hence the optimization in (16) is over the conditionals $\{P_{U^n, X^n | S_1^n}\}_{n=1}^{\infty}$.
III-B Two-sided Common State Information
Specializing (16) to the case where $\mathbf{S}_1 = \mathbf{S}_2 = \mathbf{S}$, i.e., the same side information is available to both encoder and decoder (ED), does not appear to be straightforward without further assumptions. Recall that in the usual scenario [20, Case 4 in Corollary 1], we use the identification $\mathbf{U} = \mathbf{X}$ and the chain rule for mutual information to assert that $I(X;Y|S)$ evaluated at the optimal $P_{X|S}$ is the capacity. However, in information spectrum analysis, the chain rule does not hold for the $\operatorname{p\text{-}liminf}$ operation. In fact, $\underline{I}(\,\cdot\,;\,\cdot\,)$ is superadditive [7]. Nevertheless, under the assumption that a sequence of information densities converges in probability, we can derive the capacity of the general channel with general common state available at both terminals using Theorem 1.
Corollary 2 (General Channel Capacity with State at ED).
Consider the problem
(17) 
where the supremum is over all $(\mathbf{X}, \mathbf{S}, \mathbf{Y})$ such that $\mathbf{S}$ coincides with the given state distributions and $\mathbf{Y}$ given $(\mathbf{X}, \mathbf{S})$ coincides with the given channels. Assume that the maximizer of (17) exists and denote it by $\mathbf{X}^*$. Let the distribution of $\mathbf{Y}$ given $\mathbf{S}$ induced by $\mathbf{X}^*$ be the corresponding output law. If
(18) 
then the capacity of the state-dependent channel with state available at both encoder and decoder is given by (17).
The proof is provided in Appendix C. If the joint process $(\mathbf{X}, \mathbf{S}, \mathbf{Y})$ satisfies (18) (and $\mathcal{X}$ and $\mathcal{Y}$ are finite sets), it is called information stable [21]. In other words, the limit distribution of the normalized conditional information density in Fig. 2 concentrates at a single point. We remark that a different achievability proof technique (one that does not use Theorem 1) would allow us to dispense with the information stability assumption: we could simply develop a conditional version of Feinstein’s lemma [7, Lemma 3.4.1] to prove the direct part of (17). However, we choose to start from Theorem 1, which is the most general capacity result for the Gel’fand–Pinsker problem. Note that the converse of Corollary 2 does not require (18).
III-C Memoryless Channels and Memoryless States
To see how Theorem 1 can be used concretely, we specialize it to the memoryless (but not necessarily stationary) setting and provide some interesting examples. In the memoryless setting, the sequence of channels $\{W^n\}_{n=1}^{\infty}$ and the sequence of state distributions $\{P_{S^n}\}_{n=1}^{\infty}$ are such that for every $n$, we have $W^n = \prod_{i=1}^{n} W_i$ and $P_{S^n} = \prod_{i=1}^{n} P_{S_i}$ for some $\{W_i\}_{i=1}^{\infty}$ and some $\{P_{S_i}\}_{i=1}^{\infty}$.
Corollary 3 (Memoryless Gel’fand–Pinsker Channel Capacity).
Assume that the alphabets are finite sets and the Gel’fand–Pinsker channel is memoryless and characterized by $\{W_i\}_{i=1}^{\infty}$ and $\{P_{S_i}\}_{i=1}^{\infty}$. Let the maximizers to the per-letter optimization problems indexed by $i \in \mathbb{N}$,
$C_i := \max_{P_{U_i|S_i},\, x_i(\cdot,\cdot)} \left[ I(U_i; Y_i) - I(U_i; S_i) \right],$   (19)
be denoted as $(P_{U_i|S_i}^*, x_i^*)$. Let $P_i^*$ be the joint distribution induced by $(P_{U_i|S_i}^*, x_i^*)$. Assume that either of the two limits
$\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} I(U_i^*; Y_i^*), \qquad \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} I(U_i^*; S_i)$   (20)
exists. Then the capacity of the memoryless Gel’fand–Pinsker channel is
$C = \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \left[ I(U_i^*; Y_i^*) - I(U_i^*; S_i) \right].$   (21)
The proof of Corollary 3 is detailed in Appendix D. The Cesàro summability assumption in (20) is only required for achievability. We illustrate the assumption in (20) with two examples in the sequel. The proof of the direct part of Corollary 3 follows by taking the optimization in the general result (14) to be over memoryless conditional distributions. The converse follows by repeated applications of the Csiszár sum identity [2, Chapter 2]. If, in addition to being memoryless, the channels and states are stationary (i.e., each $W_i$ and each $P_{S_i}$ is equal to some $W$ and $P_S$ respectively), then both limits in (20) exist since each summand is the same for every $i$.
Corollary 4 (Stationary, Memoryless Gel’fand–Pinsker Channel Capacity).
Assume that $\mathcal{S}$ is a finite set. In the stationary, memoryless case, the capacity of the Gel’fand–Pinsker channel is given by $C$ in (1).
We omit the proof because it is a straightforward consequence of Corollary 3. Close examination of the proof of Corollary 3 shows that only the converse of Corollary 4 requires the finiteness assumption. The achievability of Corollary 4 follows easily from Khintchine’s law of large numbers [7, Lemma 1.3.2] (for abstract alphabets).
To gain a better understanding of the assumption in (20) in Corollary 3, we now present a couple of (pathological) examples which are inspired by [7, Remark 3.2.3].
Example 1.
Consider a discrete, nonstationary, memoryless channel satisfying
(22) 
where $W^{(1)}$ and $W^{(2)}$ are two distinct channels. Let $P_{S^n}$ be the $n$-fold extension of some $P_S$. Let $P_{U_i}^*$ be the marginal of the maximizer of (19) when the channel is $W_i$. In general, the two constituent channels induce different per-letter mutual informations, so the Cesàro means oscillate and the first limit in (20) does not exist. Similarly, the second limit does not exist in general, and Corollary 3 cannot be applied.
Example 2.
Let the setting be as in Example 1 and let the sets of even and odd positive integers be $\mathcal{E}$ and $\mathcal{O}$ respectively. Consider a binary, nonstationary, memoryless channel satisfying
(23) 
where . Also consider a binary, nonstationary, memoryless state satisfying
(24) 
In addition, assume that the channels $W_i$ for odd $i$ are binary symmetric channels with arbitrary crossover probabilities. Let $P_{U_i}^*$ be the marginal of the maximizer in (19) when the channel is $W_i$ and the state distribution is $P_{S_i}$. For odd blocklengths, due to the symmetry of the channels, the optimal $U_i$ is Bernoulli and independent of $S_i$ [2, Problem 7.12(c)]. Thus, for all odd blocklengths, the mutual informations in the first limit in (20) are equal to zero. Clearly, the first limit in (20) exists, being contributed entirely by the even blocklengths. Therefore, Corollary 3 applies and we can show that the Gel’fand–Pinsker capacity is
(25) 
where the maximizer in (19) is given explicitly in [2, Problem 7.12(c)] and the second quantity is defined as
(26) 
See Appendix E for the derivation of (25). The expression in (25) implies that the capacity consists of two parts: the first represents the performance of the system at even blocklengths, while the second represents the nonergodic behavior of the channel at odd blocklengths with the given state distribution; cf. [7, Remark 3.2.3]. In this special case, the capacity is the average of the two parts.
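The Cesàro averaging that makes Example 2 work can be checked in a couple of lines. In the sketch below (our illustration, with an invented even-index value `c_e`), the per-letter quantity is zero at odd indices, mirroring the symmetric channels of the example, and constant at even indices; the running average then converges to half the even-index value.

```python
# Cesàro average of alternating per-letter values, as in Example 2:
# 0 at odd indices (symmetric channels) and c_e at even indices.
c_e = 0.5    # assumed, illustrative per-letter value at even indices
N = 100_000  # blocklength over which we average (N even)

avg = sum(c_e if i % 2 == 0 else 0.0 for i in range(1, N + 1)) / N
print(avg)  # 0.25, i.e. c_e / 2: the Cesàro limit exists
```

Contrast this with Example 1, where the constituent channels alternate in a way that makes the running averages oscillate, so no Cesàro limit exists and Corollary 3 cannot be invoked.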
III-D Mixed Channels and Mixed States
Now we use Theorem 1 to compute the capacity of the Gel’fandPinsker channel when the channel and state sequence are mixed. More precisely, we assume that
$W^n(y^n | x^n, s^n) = \sum_{k=1}^{\infty} \alpha_k \, W_k^n(y^n | x^n, s^n),$   (27)
$P_{S^n}(s^n) = \sum_{l=1}^{\infty} \beta_l \, P_{S_l^n}(s^n).$   (28)
Note that we require $\sum_{k=1}^{\infty} \alpha_k = \sum_{l=1}^{\infty} \beta_l = 1$. Let $\mathbf{S}_l := \{S_l^n\}_{n=1}^{\infty}$ and $\mathbf{W}_k := \{W_k^n\}_{n=1}^{\infty}$. Note that if each $\mathbf{S}_l$ is a stationary and memoryless source, the composite source given by (28) is a canonical example of a nonergodic and stationary source. By (27), the channel can be regarded as an average channel given by the convex combination of the constituent channels. It is stationary but nonergodic and nonmemoryless. Given $\mathbf{U}$, define the following random variables, which are indexed by $k$ and $l$:
(29) 
Corollary 5 (Mixed Channels and Mixed States).
The capacity of the general mixed Gel’fand–Pinsker channel with general mixed state as in (27)–(29) is
$C = \sup_{\mathbf{U}} \left[ \inf_{k, l \,:\, \alpha_k \beta_l > 0} \underline{I}(\mathbf{U}; \mathbf{Y}_{k,l}) - \sup_{l \,:\, \beta_l > 0} \overline{I}(\mathbf{U}; \mathbf{S}_l) \right],$   (30)
where the maximization is over all sequences of random variables $(\mathbf{U}, \mathbf{X}, \mathbf{S}, \mathbf{Y})$ with state distribution coinciding with (28) and with the conditional distribution of $\mathbf{Y}$ given $(\mathbf{X}, \mathbf{S})$ equal to the general channel in (27). Furthermore, if each general state sequence $\mathbf{S}_l$ and each general channel $\mathbf{W}_k$ is stationary and memoryless, the capacity is lower bounded as
(31) 
where and the maximization is over all joint distributions satisfying for some .
Corollary 5 is proved in Appendix F; it basically applies [7, Lemma 3.3.2] to the mixture with components in (29). In contrast to existing analyses for mixed channels and sources [7, 8], here there are two independent mixtures: that of the channel and that of the state. Hence, we need to minimize over two indices for the first term in (30). If, instead of the countable number of terms in the sums in (27) and (28), the number of mixture components (of either the source or the channel) is uncountable, Corollary 5 no longer applies and a corresponding result has to involve the assumptions that the alphabets are finite and the constituent channels are memoryless. See [7, Theorem 3.3.6].
The corollary says that the Gel’fand–Pinsker capacity is governed by two elements: (i) the “worst-case” virtual channel (from $\mathbf{U}$ to $\mathbf{Y}$), i.e., the one with the smallest packing rate, and (ii) the “worst-case” state distribution, i.e., the one that results in the largest covering rate. Unfortunately, obtaining a converse result for the stationary, memoryless case from (30) does not appear to be straightforward. The same issue was also encountered for the mixed wiretap channel [16].
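The worst-case bookkeeping described above can be made concrete with a toy computation. In the sketch below, every rate is hypothetical (invented for illustration): the packing term is the smallest spectral inf-rate over the component pairs $(k,l)$, while the covering term is the largest sup-rate over the state components $l$.

```python
# Worst-case bookkeeping in the mixed setting (all rates hypothetical):
# keys (k, l) index (channel component, state component) pairs.
I_pack = {(1, 1): 0.9, (1, 2): 0.85, (2, 1): 0.7, (2, 2): 0.6}
I_cover = {1: 0.1, 2: 0.3}

rate = min(I_pack.values()) - max(I_cover.values())
print(round(rate, 10))  # 0.3: dominated by the worst channel and worst state
```

With these numbers the capacity expression is driven by component pair $(2,2)$ on the packing side and state component $2$ on the covering side, exactly the "worst-case" behavior the corollary describes.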
III-E Proof Idea of Theorem 1
III-E.1 Direct Part
The high-level idea in the achievability proof is similar to the usual Gel’fand–Pinsker coding scheme [1], which involves a covering step to reduce the uncertainty due to the random state sequence and a packing step to decode the transmitted codeword. However, to use the information spectrum method on the general channel and general state, the definitions of “typicality” have to be restated in terms of information densities. See the definitions in Appendix A. The main burden of the proof is to show that the probability that the transmitted codeword is not “typical” with the channel output vanishes. In regular Gel’fand–Pinsker coding, one appeals to the conditional typicality lemma [1, Lemma 2] [2, Chapter 2] (which holds for “strongly typical sets”) to assert that this error probability is small. But the “typical sets” used in information spectrum analysis do not allow us to apply the conditional typicality lemma in a straightforward manner. For example, our decoder is a threshold test involving an information density statistic, and it is not clear that, in the event that there is no covering error, the transmitted codeword passes the threshold test (i.e., the statistic exceeds a certain threshold) with high probability.
To get around this problem, we modify Wyner’s PBL^2 [9, Lemma 4.3] [22, Lemma A.1] accordingly. Wyner essentially derived an analogue of the Markov lemma [2, Chapter 12] without strong typicality by introducing a new “typical set” defined in terms of conditional probabilities. This new definition is particularly useful for problems that involve covering and packing as well as some Markov structure. Our analysis is somewhat similar to the analyses of the general Wyner–Ziv problem in [8] and the WAK problem in [14, 19]. This is unsurprising given that the Wyner–Ziv and Gel’fand–Pinsker problems are duals [20]. However, unlike in [8], we construct random subcodebooks and use them in subsequent steps, rather than asserting the existence of a single codebook via random selection and subsequently regarding it as deterministic. This is because, unlike in the Wyner–Ziv problem, we need to construct exponentially many subcodebooks, each of exponential size, with each subcodebook indexing a message. We also require each of these subcodebooks to be different and identifiable based on the channel output. Also, our analogue of Wyner’s “typical set” is different from previous works. (^2 One version of the piggyback coding lemma (PBL), given in [22, Lemma A.1], can be stated as follows: if random variables form a Markov chain and a bounded function of them, usually the indicator that they are jointly typical, has expectation close to one, then for any given tolerance and all sufficiently large blocklengths there exists a mapping of bounded rate such that the expectation of the function, with the mapping substituted in, remains close to one.)
We also point out that Yu et al. [17] considered the Gaussian Gel’fandPinsker problem for nonstationary and nonergodic channel and state. However, the notion of typicality used is “weak typicality”, which means that the sample entropy is close to the entropy rate. This notion does not generalize well for obtaining the capacity expression in (14), which involves limits in probability of information densities. Furthermore, Gaussianity is a crucial hypothesis in the proof of the asymptotic equipartition property in [17].
III-E.2 Converse Part
III-F Coded State Information at the Decoder
We now state the capacity region of the coded state information problem (Definition 5).
Theorem 6 (Coded State Information at Decoder).
The capacity region of the Gel’fand–Pinsker problem with coded state information at the decoder (see Definition 6) is given by the set of pairs $(R, R_{\mathrm{d}})$ satisfying
$R_{\mathrm{d}} \ge \overline{I}(\mathbf{V};\mathbf{S}) - \underline{I}(\mathbf{V};\mathbf{Y}),$   (32)
$R \le \underline{I}(\mathbf{U};\mathbf{Y}|\mathbf{V}) - \overline{I}(\mathbf{U};\mathbf{S}|\mathbf{V})$   (33)
for some $(\mathbf{U}, \mathbf{V}, \mathbf{X}, \mathbf{S}, \mathbf{Y})$ satisfying the requisite Markov conditions, having the state distribution coinciding with $\{P_{S^n}\}_{n=1}^{\infty}$ and having the conditional distribution of $\mathbf{Y}$ given $(\mathbf{X}, \mathbf{S})$ equal to the general channel $\mathbf{W}$.
A proof sketch is provided in Appendix G. For the direct part, we combine Wyner–Ziv and Gel’fand–Pinsker coding to obtain the two constraints in Theorem 6. To prove the converse, we exploit the independence of the message and the state, the Verdú–Han lemma [7, Lemma 3.2.2] and the proof technique for the converse of the general rate-distortion problem [7, Section 5.4]. Because the proof of Theorem 6 is very similar to that of Theorem 1, we only provide a sketch. We note that similar ideas can easily be employed to find the general capacity region for the problem of coded state information at the encoder (and full state information at the decoder) [23]. In analogy to Corollary 4, we can use Theorem 6 to recover Steinberg’s result [10] for the stationary, memoryless case. See Appendix H for the proof.
IV Conclusion
In this work, we derived the capacity of the general Gel’fand–Pinsker channel with general state distribution using the information spectrum method. We also extended the analysis to the case where coded state information is available at the decoder.
A Proof of Theorem 1
Basic definitions: Fix $\gamma > 0$ and some conditional distribution $P_{U^n | S^n}$. Define the sets
(34)  
(35) 
where the random variables are distributed according to the induced joint law. We define the probabilities
(36)  
(37) 
where these joint distributions are computed with respect to $P_{U^n | S^n}$, the state distribution and the channel. Note that the quantities in (36) and (37) are information spectrum [7] quantities.
Proof.
We begin with achievability. We show that the rate