On the BICM Capacity
Optimal binary labelings, input distributions, and input alphabets are analyzed for the so-called bit-interleaved coded modulation (BICM) capacity, paying special attention to the low signal-to-noise ratio (SNR) regime. For 8-ary pulse amplitude modulation (PAM) and for 0.75 bit/symbol, the folded binary code results in a higher capacity than the binary reflected Gray code (BRGC) and the natural binary code (NBC). The 1 dB gap between the additive white Gaussian noise (AWGN) capacity and the BICM capacity with the BRGC can be almost completely removed if the input symbol distribution is properly selected. First-order asymptotics of the BICM capacity for arbitrary input alphabets and distributions, dimensions, mean, variance, and binary labeling are developed. These asymptotics are used to define first-order optimal (FOO) constellations for BICM, i.e., constellations that make BICM achieve the Shannon limit of $-1.59$ dB. It is shown that the $E_b/N_0$ required for reliable transmission at asymptotically low rates in BICM can be as high as infinity, that for uniform input distributions and 8-PAM there are only 72 classes of binary labelings with a different first-order asymptotic behavior, and that this number is reduced to only 26 for 8-ary phase shift keying (PSK). A general answer to the question of FOO constellations for BICM is also given: using the Hadamard transform, it is found that for uniform input distributions, a constellation for BICM is FOO if and only if it is a linear projection of a hypercube. A constellation based on PAM or quadrature amplitude modulation input alphabets is FOO if and only if it is labeled by the NBC; if the constellation is based on PSK input alphabets instead, it can never be FOO if the input alphabet has more than four points, regardless of the labeling.
The problem of reliable transmission of digital information through a noisy channel dates back to the works of Nyquist [1, 2] and Hartley almost 90 years ago. Their efforts were capitalized on by C. E. Shannon, who formulated a unified mathematical theory of communication in 1948 [4, 5]. (An excellent summary of the contributions that influenced Shannon’s work can be found in [6, Sec. I].) After he introduced the famous capacity formula for the additive white Gaussian noise (AWGN) channel, the problem of designing a system that operates close to that limit has been one of the most important and challenging problems in information/communication theory. While low spectral efficiencies can be obtained by combining binary signaling and a channel encoder, high spectral efficiencies are usually obtained by using a coded modulation (CM) scheme based on a multilevel modulator.
In 1974, Massey proposed the idea of jointly designing the channel encoder and modulator, which inspired Ungerboeck’s trellis-coded modulation (TCM) [8, 9] and Imai and Hirakawa’s multilevel coding (MLC) [10, 11]. Since both TCM and MLC aim to maximize a Euclidean distance measure, they perform very well over the AWGN channel. However, their performance over fading channels is rather poor. The next breakthrough came in 1992, when Zehavi introduced the so-called bit-interleaved coded modulation (BICM)  (later analyzed in ), which is a serial concatenation of a binary channel encoder, a bit-level interleaver, and a memoryless mapper. BICM aims to increase the code diversity, the key performance measure in fading channels, and therefore outperforms TCM in this scenario [13, Table III]. BICM is very attractive from an implementation point of view because of its flexibility, i.e., the channel encoder and the modulator can be selected independently, somehow breaking Massey’s joint design paradigm. BICM is nowadays a de facto standard, and it is used in most of the existing wireless systems, e.g., HSPA (HSDPA and HSUPA) [15, Ch. 12], IEEE 802.11a/g, IEEE 802.11n [17, Sec. 20.3.3], and the latest DVB standards (DVB-T2, DVB-S2, and DVB-C2).
Plots of the BICM capacity vs. $E_b/N_0$ reveal that BICM does not always achieve the Shannon limit (SL) of $-1.59$ dB. This can be explained based on first-order asymptotics of the BICM capacity, which were recently developed by Martinez et al. for uniform input distributions and one- and two-dimensional input alphabets [21, 22]. It was shown that there is a bounded loss between the BICM capacity and the SL when pulse amplitude modulation (PAM) input alphabets labeled by the binary reflected Gray code (BRGC) are used. Recently, Stierstorfer and Fischer showed in  that this is caused by the selection of the binary labeling and that equally spaced PAM and quadrature amplitude modulation (QAM) input alphabets with uniform input distributions labeled by the natural binary code (NBC) achieve the SL. Moreover, the same authors showed in  that for low to medium signal-to-noise ratios (SNR), the NBC results in a higher capacity than the BRGC for PAM and QAM input alphabets and uniform input distributions.
The fact that the BICM capacity does not always achieve the SL raises the fundamental question about first-order optimal (FOO) constellations for BICM, i.e., constellations that make BICM achieve the SL. In this paper, we generalize the first-order asymptotics of the BICM capacity presented in  to input alphabets with arbitrary dimensions, input distributions, mean, variance, and binary labelings. Based on this model, we present asymptotic results for PAM and phase shift keying (PSK) input alphabets with uniform input distribution and different binary labelings. Our analysis is based on the so-called Hadamard transform [25, pp. 53–54], which allows us to fully characterize FOO constellations for BICM with uniform input distributions for fading and nonfading channels. A complete answer to the question about FOO constellations for BICM with uniform input distributions is given: a constellation is FOO if and only if it is a linear projection of a hypercube. Furthermore, binary labelings for the traditional input alphabets PAM, QAM, and PSK are studied. In particular, it is proven that for PAM and QAM input alphabets, the NBC is the only binary labeling that results in an FOO constellation. It is also proven that PSK input alphabets with more than four points can never yield an FOO constellation, regardless of the binary labeling. When 8-PAM with a uniform input distribution is considered, the folded binary code (FBC) results in a higher capacity than the BRGC and the NBC. Moreover, it is shown how the BICM capacity can be increased by properly selecting the input distribution, i.e., by using so-called probabilistic shaping. In particular, probabilistic shaping is used to show that PAM input alphabets labeled by the BRGC or the FBC can also be FOO, and to show that the 1 dB gap between the AWGN capacity and the BICM capacity with the BRGC can be almost completely removed.
II-A Notation Convention
Hereafter we use lowercase letters to denote a scalar, boldface letters to denote a row vector of scalars, and underlined symbols to denote a sequence. Blackboard bold letters represent matrices and represents the entry of at row , column , where all the indices start at zero. The transpose of is denoted by , is the trace of , and is .
We denote random variables by capital letters , probabilities by , the probability mass function (pmf) of the random vector by , and the probability density function (pdf) of the random vector by . The joint pdf of the random vectors and is denoted by , and the conditional pdf of conditioned on is denoted by . The same notation applies to joint and conditional pmfs, i.e., and . The expectation of an arbitrary function over the joint pdf of and is denoted by , the expectation over the conditional pdf is denoted by , and is the covariance matrix of the random vector .
We denote the base-2 representation of the integer , where , by the vector , where is the most significant bit of and the least significant. To facilitate some of the developments in this paper, we also define the ordered direct product as
where for and . The ordered direct product in (1) is analogous to the Cartesian product except that it operates on vectors/matrices instead of sets.
II-B Binary Labelings
A binary labeling of order is defined using an matrix where each row corresponds to one of the length- distinct binary codewords, , where .
In order to recursively define some particular binary labelings, we first define expansions, repetitions, and reflections of binary labelings. To expand a labeling into a labeling , we repeat each binary codeword once to obtain a new matrix , and then we obtain by appending one extra column of length . To generate a labeling from a labeling by repetition, we repeat the labeling once to obtain a new matrix , and we add an extra column from the left, consisting of zeros followed by ones. Finally, to generate a labeling from a labeling by reflection, we join and a reversed version of to obtain a new matrix , and we add an extra column from the left, consisting of zeros followed by ones .
In this paper we are particularly interested in the binary reflected Gray code (BRGC) [28, 29], the natural binary code (NBC), and the folded binary code (FBC). The FBC was analyzed in  for uncoded transmission, and here we consider it, to our knowledge for the first time, for coded transmission. In Sec. III-D and Sec. V-C it is shown to yield a higher capacity than other labelings under some conditions. We also introduce a new binary labeling denoted binary semi-Gray code (BSGC). These binary labelings are generated as follows:
The BRGC of order is generated by recursive expansions of the trivial labeling , or, alternatively, by recursive reflections of .
The NBC of order is defined as the codewords that are the base-2 representations of the integers , i.e., . Alternatively, can be generated by recursive repetitions of the trivial labeling , or as ordered direct products of with itself.
The BSGC of order is generated by replacing the first column of by the modulo-2 sum of the first and last columns.
The FBC of order is generated by one reflection of .
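The four constructions above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the function names are hypothetical, codewords are stored as rows of bits with the most significant bit first, and two symbols that did not survive in this copy are filled in by assumption, namely that the BSGC is built from the BRGC of the same order and that the FBC is one reflection of the NBC of order one less.

```python
def reflect(labeling):
    """Reflection: stack the labeling on top of its reversed copy and
    prepend a column of zeros (upper half) followed by ones (lower half)."""
    return ([[0] + row for row in labeling]
            + [[1] + row for row in reversed(labeling)])


def repeat(labeling):
    """Repetition: stack the labeling on top of an identical copy and
    prepend a column of zeros (upper half) followed by ones (lower half)."""
    return ([[0] + row for row in labeling]
            + [[1] + row for row in labeling])


def brgc(m):
    """BRGC of order m via recursive reflections of the trivial labeling."""
    labeling = [[0], [1]]
    for _ in range(m - 1):
        labeling = reflect(labeling)
    return labeling


def nbc(m):
    """NBC of order m via recursive repetitions of the trivial labeling."""
    labeling = [[0], [1]]
    for _ in range(m - 1):
        labeling = repeat(labeling)
    return labeling


def fbc(m):
    """FBC of order m >= 2. Assumption: one reflection of the NBC of order m - 1."""
    return reflect(nbc(m - 1))


def bsgc(m):
    """BSGC of order m. Assumption: the first BRGC column is replaced by the
    modulo-2 sum of the first and last BRGC columns."""
    return [[row[0] ^ row[-1]] + row[1:] for row in brgc(m)]


if __name__ == "__main__":
    for name, build in (("BRGC", brgc), ("NBC", nbc), ("FBC", fbc), ("BSGC", bsgc)):
        print(name, ["".join(map(str, row)) for row in build(3)])
```

Each function returns a list of $2^m$ distinct codewords, so the result can be used directly as the rows of a labeling matrix.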
For any labeling matrix , where , we define a modified labeling matrix which is obtained by reversing the order of the columns and applying the mapping , i.e.,
with and .
Example 1 (Binary labelings of order )
II-C Constellations and Input Distributions
Throughout this paper, we use to represent the set of symbols used for transmission. Each element of is an -dimensional symbol , , where and . We define the input alphabet using an matrix which contains all the elements of .
For practical reasons, we are interested in well structured input alphabets. An -PAM input alphabet is defined by the column vector where with . An -PSK input alphabet is the matrix where with . Finally, a rectangular -QAM input alphabet is the matrix , where and are vectors of length and , respectively.
For a given input alphabet , the input distribution of the symbols is denoted by the pmf , which represents the probabilities of transmitting the symbols , i.e., . We define the matrix as an ordered list containing the probabilities of the symbols, i.e., . We use to denote the discrete uniform input distribution.
We define a constellation as the list of matrices , i.e., an input alphabet using a given labeling and input distribution. Finally, for a given pair , we denote with the set of indexes of the symbols with a binary label at bit position , i.e., .
II-D System Model
In this paper, we analyze coded modulation (CM) schemes such as the one shown in Fig. 1. Each of the possible messages is represented by the binary vector , where . The transmitter maps each message to a sequence , which corresponds to -dimensional symbols ( channel uses; a “channel use” corresponds to the transmission of one -dimensional symbol, i.e., it can be considered a “vectorial channel use”). The code is a subset of such that , which is used for transmission. The transmitter is then defined as a one-to-one function that assigns each information message to one of the possible sequences . The code rate in information bits per coded bit is then given by or, equivalently, information bits per channel use (information bits per symbol, or information bits per real dimensions). At the receiver’s side, based on the channel observations, a maximum likelihood sequence receiver generates an estimate of the information bits by selecting the most likely transmitted message.
We consider transmissions over a discrete-time memoryless fast fading channel
where the operator denotes the so-called Schur product (element-wise product) between two vectors, , , , and are the underlying random vectors for , , , and respectively, with being the discrete time index, and is a Gaussian noise with zero mean and variance in each dimension. The channel is represented by the -dimensional vector , and it contains real fading coefficients which are assumed to be random variables, possibly dependent, with same pdf . We assume that and are perfectly known at the receiver or can be perfectly estimated. Since the channel is memoryless, from now on we drop the index .
The conditional transition pdf of the channel in (3) is given by
We assume that both and have finite and nonzero second moments, that , , and are mutually independent, and that there exists a constant such that for all sufficiently large the vector satisfies
Each transmitted symbol conveys information bits and thus, the relation between the average symbol energy and the average information bit energy is given by . We define the average signal-to-noise ratio (SNR) as
The AWGN channel is obtained as a special case of (3) by taking as the all-one vector. Another particular case is obtained when , which particularizes to the Rayleigh fading channel when and are independent zero-mean Gaussian random variables. In this case, the instantaneous SNR defined by follows a chi-square distribution with one degree of freedom (an exponential distribution). Similarly, the Nakagami- fading channel is obtained when follows a Nakagami- distribution. It can be shown that the condition (5) is fulfilled in all the cases above.
In a BICM system [12, 13], the transmitter in Fig. 1 is realized using a serial concatenation of a binary encoder of rate , a bit level interleaver, and a memoryless mapper . The mapper is defined as a one-to-one mapping rule that maps the length- binary random vector to one symbol , i.e., . At the receiver’s side, the demapper computes soft information on the coded bits, which are then deinterleaved and passed to the channel decoder. The a posteriori L-values for the th bit in the symbol and for a given fading realization are given by
The max-log metric in (9) (already proposed in [12, 13]) is suboptimal; however, it is very popular in practical implementations because of its low complexity, e.g., in the 3rd generation partnership project (3GPP) working groups. It is also known that when Gray-labeled constellations are used, the use of this simplification results in a negligible impact on the receiver’s performance [33, Fig. 9], [34, Fig. 6]. The max-log approximation also allows BICM implementations which do not require the knowledge of , for example, when a Viterbi decoder is used, or when the demapper passes hard decisions to the decoder. Moreover, the use of the max-log approximation transforms the nonlinear relation in (8) into a piecewise linear relation. This has been used to develop expressions for the pdf of the L-values in (9) using arbitrary input alphabets (based on an algorithmic approach), closed-form expressions for QAM input alphabets labeled by the BRGC for the AWGN channel [36, 37], and for fading channels. Recently, closed-form approximations for the pdf of the L-values in (9) for arbitrary input alphabets and binary labeling in fading channels have been presented.
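To make the difference between the exact L-values and their max-log approximation concrete, the following sketch evaluates both for a one-dimensional real alphabet. It is only an illustration under stated assumptions: the L-value is taken as the log-ratio of the a posteriori probabilities of bit 1 over bit 0, the noise variance is $N_0/2$, the symbol probabilities are strictly positive, and the names `l_values`, `n0`, and `h` are hypothetical.

```python
import math


def _log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def l_values(y, points, labeling, n0, probs=None, h=1.0):
    """Return (exact, max_log) a posteriori L-values for every bit position.

    Assumptions: real scalar channel y = h * x + z with z ~ N(0, n0 / 2),
    sign convention log P(bit = 1 | y) / P(bit = 0 | y), and strictly
    positive symbol probabilities (uniform if probs is None).
    """
    if probs is None:
        probs = [1.0 / len(points)] * len(points)
    # Log of (prior times likelihood), up to a constant common to all symbols.
    metrics = [math.log(p) - (y - h * x) ** 2 / n0
               for x, p in zip(points, probs)]
    exact, max_log = [], []
    for k in range(len(labeling[0])):
        ones = [t for t, c in zip(metrics, labeling) if c[k] == 1]
        zeros = [t for t, c in zip(metrics, labeling) if c[k] == 0]
        exact.append(_log_sum_exp(ones) - _log_sum_exp(zeros))
        # Max-log: keep only the dominant term of each sum.
        max_log.append(max(ones) - max(zeros))
    return exact, max_log


if __name__ == "__main__":
    pam4 = [-3.0, -1.0, 1.0, 3.0]
    gray = [[0, 0], [0, 1], [1, 1], [1, 0]]
    print(l_values(0.4, pam4, gray, n0=1.0))
```

For a binary alphabet each sum has a single term, so the two metrics coincide; for larger alphabets the max-log value is a piecewise linear function of the observation.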
II-E The Hadamard Transform
The Hadamard transform (HT) is a discrete, linear, orthogonal transform, like, for example, the Fourier transform, but its coefficients take values in $\{+1,-1\}$ only. Among the different applications of the HT, one that is often overlooked is as an analysis tool for binary labelings [40, 41]. The HT is defined by means of an matrix, the Hadamard matrix, which is defined recursively as follows when is a power of two [25, pp. 53–54].
Example 2 (Hadamard matrix )
In the following, we will drop the index, letting represent a Hadamard matrix of any size . Hadamard matrices have the following appealing properties.
where is the th bit of the base-2 representation of the integer .
At this point it is interesting to note the close relation between the columns of the matrix in Example 1 and the columns of in (10) for . Its generalization is given by the following lemma, whose proof follows immediately from (2), the definition of the NBC in Sec. II-B, and (12).
Let be the modified labeling matrix for the NBC of order , and let be the Hadamard matrix. For any , and for and ,
The HT operates on a vector of length , for any integer , or in a more general case, on a matrix with rows. The transform of a matrix is denoted and has the same dimensions as . It is defined as
and the inverse transform is . Equivalently,
where from (11) we have that , and where we have introduced the row vectors and such that
Because of (12), the first element of the transform is simply .
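The properties used in this subsection can be checked numerically with a short sketch. Several details below are assumptions, because the corresponding expressions did not survive in this copy: the Hadamard matrix is built with the standard Sylvester recursion, the transform is normalized as $(1/M)$ times the Hadamard matrix times the input, and the modified labeling maps bit 0 to $+1$ and bit 1 to $-1$ after reversing the column order. The function names are hypothetical.

```python
def hadamard(m):
    """Sylvester recursion: H_1 = [1], H_2M = [[H_M, H_M], [H_M, -H_M]]."""
    H = [[1]]
    for _ in range(m):
        H = ([row + row for row in H]
             + [row + [-v for v in row] for row in H])
    return H


def hadamard_transform(X):
    """Transform of a matrix X with M = 2^m rows.
    Assumed normalization: (1 / M) * H * X, so the inverse is H * X_tilde."""
    M = len(X)
    H = hadamard(M.bit_length() - 1)
    return [[sum(H[i][j] * X[j][n] for j in range(M)) / M
             for n in range(len(X[0]))]
            for i in range(M)]


def inverse_hadamard_transform(X_tilde):
    """Inverse of hadamard_transform under the same assumed normalization."""
    M = len(X_tilde)
    H = hadamard(M.bit_length() - 1)
    return [[sum(H[i][j] * X_tilde[j][n] for j in range(M))
             for n in range(len(X_tilde[0]))]
            for i in range(M)]


def modified_nbc(m):
    """Modified NBC labeling: reverse the column order (least significant bit
    first) and map bit b to (-1)**b (assumed mapping 0 -> +1, 1 -> -1)."""
    return [[(-1) ** ((i >> k) & 1) for k in range(m)] for i in range(2 ** m)]


if __name__ == "__main__":
    m = 3
    H, Q = hadamard(m), modified_nbc(m)
    # Relation stated in the lemma: column k of Q equals column 2^k of H.
    print(all(Q[i][k] == H[i][2 ** k] for i in range(2 ** m) for k in range(m)))
    # Transform of an (unnormalized) equally spaced 8-PAM column vector.
    pam8 = [[2 * i - 7] for i in range(8)]
    print(hadamard_transform(pam8))
```

Under these assumptions the first element of the transform is the arithmetic mean of the rows, and for the equally spaced 8-PAM vector the only nonzero transform coefficients sit at the indices 1, 2, and 4, i.e., at powers of two.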
III Capacity of Coded Modulation Systems
In this section we analyze the capacity of CM schemes, i.e., the so-called CM and BICM capacities. We review their relation and we analyze how the selection of the constellation influences them. We pay special attention to the selection of the binary labeling and the use of probabilistic shaping for BICM.
III-A AMI and Channel Capacity
In this subsection, we assume the use of a continuous input alphabet, i.e., , which upperbounds the performance of finite input alphabets.
The average mutual information (AMI) in bits per channel use between the random vectors and when the channel is perfectly known at the receiver is defined as (throughout this paper, all AMIs are given in bits)
where we use as the index of to emphasize the fact that the AMI depends on the input pdf . (We note that the AMI with perfect channel state information is usually denoted by ; however, for notational simplicity, we use .) For an arbitrary channel parameter , the AMI in (17) can be expressed as
where is given by (4).
where the maximization is over all possible input distributions. The capacity in (20) has units of [bit/channel use] (or equivalently [bit/symbol]), and it is an upper bound on the number of bits per symbol that can be reliably transmitted through the channel, where a symbol consists of real dimensions. Shannon’s channel coding theorem states that it is not possible to transmit information reliably above this fundamental limit, i.e.,
This capacity is attained when are i.i.d. zero-mean Gaussian random variables with variance in each dimension and it follows from the fact that the noise is independent in each dimension, and thus, the transmission of can be considered as a transmission through parallel independent Gaussian channels.
We define the conditional AMI for discrete input alphabets as the AMI between and conditioned on the outcome of a third random variable , i.e.,
which is valid for any random .
III-B CM Capacity
The CM capacity is defined as the AMI between and for a given constellation , i.e.,
where to pass from (25) to (26), we used the fact that the mapping rule between and is one-to-one. To pass from (26) to (27) we have used the chain rule of mutual information [44, Sec. 2.5], where represents a bit level AMI which represents the maximum rate that can be used at the th bit position, given a perfect knowledge of the previous bits.
The CM capacity in (25) corresponds to the capacity of the memoryless “CM channel” in Fig. 1 for a given constellation . We note that different binary labelings will produce different values of in (27); however, the overall sum will remain constant, i.e., the CM capacity does not depend on the binary labeling. We use the name “CM capacity” for in (25) following the standard terminology used in the literature (cf. [13, 21, 48, 23, 49]; it is sometimes also called joint capacity, or (constellation) constrained capacity [46, 47]), although we recognize the misuse of the word capacity, since no optimization over the input distribution is performed (cf. (20)). Moreover, it is also possible to optimize the input alphabet in order to obtain an increase in the AMI (so-called signal shaping). Nevertheless, throughout this paper we will refer to the AMI for a given in (25) as the CM capacity.
In this paper we are interested in optimal constellations, and therefore, we define the maximum CM capacity as
As mentioned before, the CM capacity does not depend on the binary labeling, i.e., it does not depend on how the mapping rule is implemented, and therefore, in (29) we only show two optimization parameters: the input alphabet and the input distribution.
The CM capacity in (25) (for a given constellation ) is an upper bound on the number of bits per symbol that can be reliably transmitted using, for example, TCM or MLC with multistage decoding (MLC-MSD) [10, 51]. MLC-MSD is in fact a direct application of the summation in (27), i.e., parallel encoders are used, each of them having a rate . At the receiver’s side, the first bit level is decoded and the decisions are passed to the second decoder, which then passes the decisions to the third decoder, and so on. Other design rules can also be applied in MLC, cf. . The maximum CM capacity in (29) represents an upper bound on the number of bits per symbol that can be reliably transmitted using a fully optimized system, i.e., a system where for each SNR value , the input alphabet and the input distribution are selected in order to maximize the CM capacity.
III-C BICM with Arbitrary Input Distributions
It is commonly assumed that the sequence generated by the binary encoder in Fig. 1 is infinitely long and symmetric, and also that the interleaver () operates over this infinite sequence, simply permuting it in a random way. Under these standard assumptions, it follows that the input symbol distribution will always be . Since in this paper we are interested in analyzing a more general setup where the input symbol distribution can be modified, we develop a more general model in which we relax the equiprobable input distribution assumption.
Let be the binary random variable representing the bits at the th modulator’s input, where the pmf represents the probability of transmitting a bit at bit position . We assume that in general , i.e., the coded and interleaved sequence could have more zeros than ones (or vice versa). Note that since is a pmf, .
Let be the binary label of the symbol . We assume that the bits at the input of the modulator are independent, and therefore, the input symbol probabilities are
The independence condition on the coded bits that results in (30) can be obtained if the interleaver block in Fig. 1 completely breaks the temporal correlation of the coded bits. The condition that the coded and interleaved sequence could be asymmetric can be obtained for example by using an encoder with nonuniform outputs, or by a particular puncturing scheme applied to the coded bits. This can be combined with the use of multiple interleavers and multiplexing , which would allow . Examples of how to construct a BICM scheme where nonuniform input symbol distributions are obtained include the “shaping encoder” of [53, 54] and the nonuniform signaling scheme based on a Huffman code of .
For future use, we also define the conditional input symbol probabilities, conditioned on the th bit being , as
where is defined in Sec. II-C.
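The product form of the symbol probabilities and the conditional probabilities defined above can be illustrated with a small sketch. The function names and the parameter `p_one` (the probability that the bit at each position equals one) are hypothetical, and the conditional probability is taken to be the symbol probability divided by the probability of the conditioning bit value, restricted to the symbols whose label carries that value, which is how a conditional pmf is usually formed.

```python
def symbol_probs(labeling, p_one):
    """Symbol probabilities for independent bits: the product over bit
    positions k of P(C_k = c_{i,k}), with p_one[k] = P(C_k = 1)."""
    probs = []
    for codeword in labeling:
        p = 1.0
        for k, bit in enumerate(codeword):
            p *= p_one[k] if bit == 1 else 1.0 - p_one[k]
        probs.append(p)
    return probs


def conditional_symbol_probs(labeling, p_one, k, u):
    """Symbol probabilities conditioned on bit position k being u.
    Requires 0 < P(C_k = u); symbols outside the index set get probability 0."""
    p_bit = p_one[k] if u == 1 else 1.0 - p_one[k]
    return [p / p_bit if codeword[k] == u else 0.0
            for p, codeword in zip(symbol_probs(labeling, p_one), labeling)]


if __name__ == "__main__":
    # NBC of order 3, written out explicitly to keep the sketch self-contained.
    nbc3 = [[(i >> 2) & 1, (i >> 1) & 1, i & 1] for i in range(8)]
    print(symbol_probs(nbc3, [0.5, 0.5, 0.5]))   # uniform input distribution
    print(symbol_probs(nbc3, [0.3, 0.5, 0.5]))   # first bit biased towards zero
    print(conditional_symbol_probs(nbc3, [0.3, 0.5, 0.5], k=0, u=1))
```

With all bit probabilities equal to one half the sketch reduces to the uniform distribution, which is the standard BICM assumption that this subsection relaxes.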
III-D BICM Capacity
The “BICM channel” in Fig. 1 was introduced in  and it is what separates the encoder and decoder in a BICM system. The BICM capacity is then defined as the capacity of the BICM channel. Using the definitions in Sec. III-C and the equivalent channel model in [13, Fig. 6], which replaces the BICM channel by parallel binary-input continuous-output channels, the BICM capacity for a given constellation is defined as
where (35) follows from (34), (31), and the fact that the value of does not affect the conditional channel transition probability, i.e., . The BICM capacity in (35) is a general expression that depends on all the constellation parameters . This can be numerically implemented using Gauss–Hermite quadratures, or alternatively, by using a one-dimensional integration based on the pdf of the L-values developed in [35, 37, 38, 39].
The AMIs in (32) are, in contrast to the ones in (27), not conditioned on the previous bit values. Because of this, and unlike the CM capacity, the binary labeling strongly affects the BICM capacity in (32). Note that the BICM capacity is equivalent to the capacity achieved by MLC with (suboptimal) parallel decoding of the individual bit levels, because in BICM, the bits are treated as independent. The differences are that BICM uses only one encoder, and that in BICM the equivalent channels are not used in parallel, but time multiplexed. Again, following the standard terminology used in the literature (cf. [13, 21, 48, 23, 49]; it is also called parallel decoding capacity in , or receiver constrained capacity in ), we use the name “BICM capacity” even though no optimization over the input distribution is performed.
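The dependence of the BICM capacity on the labeling can be explored numerically. The sketch below is a minimal illustration, not the method used by the authors: it replaces Gauss–Hermite quadrature by a plain trapezoidal rule, and it assumes a real AWGN channel with noise variance $N_0/2$, a unit-energy equally spaced PAM alphabet, uniform inputs, and an SNR equal to $E_s/N_0$. It uses the identity that the AMI of one bit equals the AMI of the symbol minus the AMI of the symbol conditioned on that bit. All function names are hypothetical.

```python
import math


def pam(M):
    """Equally spaced, zero-mean M-PAM points scaled to unit average energy
    under a uniform input distribution."""
    raw = [2 * i - (M - 1) for i in range(M)]
    scale = math.sqrt(sum(v * v for v in raw) / M)
    return [v / scale for v in raw]


def ami(points, probs, n0, steps=2001):
    """I(X;Y) in bits for Y = X + Z, Z ~ N(0, n0 / 2), via the trapezoidal rule.
    All entries of probs are assumed strictly positive."""
    sigma = math.sqrt(n0 / 2)
    lo, hi = min(points) - 8 * sigma, max(points) + 8 * sigma
    dy = (hi - lo) / (steps - 1)
    total = 0.0
    for s in range(steps):
        y = lo + s * dy
        lik = [math.exp(-(y - x) ** 2 / n0) for x in points]  # unnormalized pdf
        py = sum(p * v for p, v in zip(probs, lik))
        weight = 0.5 if s in (0, steps - 1) else 1.0
        for p, v in zip(probs, lik):
            if v > 0.0:
                total += weight * p * v * math.log2(v / py)
    return total * dy / math.sqrt(math.pi * n0)


def cm_and_bicm_capacity(points, labeling, snr_db):
    """Return (CM capacity, BICM capacity) in bit/symbol for uniform inputs."""
    M, m = len(points), len(labeling[0])
    n0 = 10 ** (-snr_db / 10)  # unit symbol energy, so SNR = 1 / N0
    cm = ami(points, [1.0 / M] * M, n0)
    bicm = 0.0
    for k in range(m):
        conditional = 0.0
        for u in (0, 1):
            subset = [x for x, c in zip(points, labeling) if c[k] == u]
            conditional += 0.5 * ami(subset, [1.0 / len(subset)] * len(subset), n0)
        bicm += cm - conditional  # AMI of bit k without knowledge of other bits
    return cm, bicm


def nbc(m):
    """Natural binary code, most significant bit first."""
    return [[(i >> (m - 1 - k)) & 1 for k in range(m)] for i in range(2 ** m)]


def brgc(m):
    """Binary reflected Gray code, most significant bit first."""
    return [[((i ^ (i >> 1)) >> (m - 1 - k)) & 1 for k in range(m)]
            for i in range(2 ** m)]


if __name__ == "__main__":
    X = pam(8)
    for snr_db in (-10.0, 0.0, 5.0, 10.0, 15.0):
        cm, c_nbc = cm_and_bicm_capacity(X, nbc(3), snr_db)
        _, c_brgc = cm_and_bicm_capacity(X, brgc(3), snr_db)
        print(f"{snr_db:6.1f} dB  CM {cm:.4f}  NBC {c_nbc:.4f}  BRGC {c_brgc:.4f}")
```

Sweeping the SNR in this way gives a quick check of the qualitative behavior discussed in this subsection, namely that the ordering of the labelings changes between the low-SNR and high-SNR regimes while the CM capacity stays above both.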
One relevant question here is what the optimum labeling is from a capacity maximization point of view. Once this question is answered, approaching the fundamental limit will depend only on a good design of the channel encoder/decoder. Caire et al. conjectured the optimality of the BRGC, which, as the next example shows, is not correct at all SNRs. This was first disproved in  for PAM input alphabets based on an exhaustive search of binary labelings up to .
Example 3 (CM and BICM Capacities for the AWGN channel, -PAM, and )
In Fig. 2, we show the BICM capacity in (36) and the CM capacity in (25) for 8-PAM, , and the four binary labelings in Example 1. Fig. 2 (a) illustrates that the difference between the CM capacity and the BICM capacity is small if the binary labeling is properly selected. The best of the four binary labelings is the NBC for low SNR ( bit/symbol), the FBC for medium SNR ( bit/symbol), and the BRGC for high SNR ( bit/symbol). Hence, the BRGC is suboptimal in at least 36% of the range. The gap between the CM capacity and the BICM capacity for the BSGC is quite large at low to moderate SNR. The low-SNR behavior is better elucidated in Fig. 2 (b), where the same capacity curves are plotted versus $E_b/N_0$ instead of the SNR. Interestingly, the CM capacity and the BICM capacity using the NBC achieve the SL at asymptotically low rates; Gaussian inputs are not necessary, cf. [58, Sec. I].
Formally, is bounded from below by , where
This function always exists, because the capacity is a strictly increasing function of and thus invertible, while in contrast is in general not monotone. (From now on we refer to “capacity” using the notation in a broad sense: can be the AWGN capacity in (22), the CM capacity in (25), or the BICM capacity in (32). The strict monotonicity can be proved using the relation between the AMI and the minimum mean square error (MMSE) presented in , i.e., that the derivative of the AMI with respect to is proportional to the MMSE for any . Since the MMSE is a strictly decreasing function of , the AMI is a strictly increasing function of .) This is the reason why a given for some labelings maps to more than one capacity value, as shown in . The phenomenon can be understood by considering the function in a linear scale, instead of logarithmic as in Fig. 2 (a). If plotted, the function would pass through the origin for all labelings. Furthermore, any straight line through the origin represents a constant $E_b/N_0$ by (6), where the slope is determined by the value of $E_b/N_0$. Such a line cannot intersect more than once for , if is concave. This is the case for the BRGC, NBC, and FBC, and therefore the function exists, as illustrated for in Fig. 2 (b). However, for some labelings such as the BSGC (and many others shown in [49, Fig. 3.5]), is not concave and is not invertible. This phenomenon has also been observed for linear precoding for BICM with iterative demapping and decoding [60, Fig. 3], punctured turbo codes [61, Fig. 3], and incoherent -ary PSK [62, Figs. 2 and 5] and frequency-shift keying channels [63, Figs. 1 and 6].
Since analytical expressions for the inverse function of the capacity are usually not available, expressions for are rare in the literature. One well-known exception is the capacity of the Gaussian channel given by (22), for which
which results in the SL
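As a hedged reconstruction of the AWGN special case, assume the standard AWGN capacity for $N$ real dimensions and the relation $\gamma = R\,E_b/N_0$ between the SNR $\gamma$ and the rate $R$ in bit/symbol. The symbols $\gamma$, $R$, $N$, and $f^{\mathrm{AW}}$ are notation assumed here for illustration, since the original expressions did not survive in this copy. Inverting the capacity formula then gives the $E_b/N_0$ needed at rate $R$ and, in the limit of vanishing rate, the SL:

```latex
\begin{align}
R &= \frac{N}{2}\log_2\!\Bigl(1+\frac{2}{N}\gamma\Bigr)
   \quad\Longrightarrow\quad
   \gamma = \frac{N}{2}\bigl(2^{2R/N}-1\bigr), \\
f^{\mathrm{AW}}(R) &= \frac{\gamma}{R}
   = \frac{N}{2R}\bigl(2^{2R/N}-1\bigr), \\
\lim_{R\to 0^{+}} f^{\mathrm{AW}}(R) &= \log_e 2 \approx 0.693,
   \quad\text{i.e.,}\quad 10\log_{10}(\log_e 2) \approx -1.59~\text{dB}.
\end{align}
```

The limit follows from $2^{2R/N}-1 \approx (2R/N)\log_e 2$ for small $R$, and it does not depend on the number of dimensions $N$.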
Analogously, we will use the notation and when the capacity considered is the CM and the BICM capacity, respectively. (The same notation convention will be used for other functions that will be introduced later in the paper.)
The results in Fig. 2 (a)–(b) suggest a more general question: What are the optimal constellations for BICM at a given ? To formalize this question, and in analogy to the maximum CM capacity in (29), we define the maximum BICM capacity as
where the optimization is in this case over the three parameters defining . In analogy to the maximum CM capacity, the maximum BICM capacity represents an upper bound on the number of bits per symbol that can be reliably transmitted using a fully optimized BICM system, i.e., a system where for each , the constellation is selected to maximize the BICM capacity.
We conclude this subsection by expressing the BICM capacity as a difference of AMIs and conditional AMIs, which will facilitate the analysis in Sec. IV. The following result is a somewhat straightforward generalization of [21, Proposition 1], [47, eq. (65)] to -dimensional input alphabets, nonuniform input distributions, and fading channels.
The BICM capacity can be expressed as
III-E Minimum $E_b/N_0$ for Reliable Transmission
In this section, we determine the minimum $E_b/N_0$ that permits reliable transmission, for a given input alphabet and labeling. As observed in Fig. 2 (b), this minimum does not necessarily occur at rate .