On the Rényi Divergence, Joint Range of Relative Entropies, and a Channel Coding Theorem
Abstract
This paper starts by considering the minimization of the Rényi divergence subject to a constraint on the total variation distance. Based on the solution of this optimization problem, the exact locus of the points $(D(Q\|P_1), D(Q\|P_2))$ is determined when $P_1, P_2, Q$ are arbitrary probability measures which are mutually absolutely continuous, and the total variation distance between $P_1$ and $P_2$ is not below a given value. It is further shown that all the points of this convex region are attained by probability measures which are defined on a binary alphabet. This characterization yields a geometric interpretation of the minimal Chernoff information subject to a constraint on the variational distance.
This paper also derives an exponential upper bound on the performance of binary linear block codes (or code ensembles) under maximum-likelihood decoding. Its derivation relies on the Gallager bounding technique, and it reproduces the Shulman-Feder bound as a special case. The bound is expressed in terms of the Rényi divergence from the normalized distance spectrum of the code (or the average distance spectrum of the ensemble) to the binomially distributed distance spectrum of the capacity-achieving ensemble of random block codes. This exponential bound provides a quantitative measure of the degradation in performance of binary linear block codes (or code ensembles) as a function of the deviation of their distance spectra from the binomial distribution. An efficient use of this bound is considered.
Keywords: Chernoff information, distance spectrum, error exponent, maximum-likelihood decoding, relative entropy, Rényi divergence, total variation distance.
I Introduction
The Rényi divergence, introduced in [30], has been studied so far in various information-theoretic contexts (and it was actually used before it had a name [37]). These include generalized cutoff rates and error exponents for hypothesis testing ([1, 6, 38]), guessing moments ([2, 9]), source and channel coding error exponents ([2, 12, 22, 27, 37]), strong converse theorems for classes of networks [11], strong data processing theorems for discrete memoryless channels [28], bounds for joint source-channel coding [41], and one-shot bounds for information-theoretic problems [46].
In [14], Gilardoni derived a Pinsker-type lower bound on the Rényi divergence $D_\alpha(P\|Q)$ for $\alpha \in (0,1)$. In view of the fact that this lower bound is not tight, especially when the total variation distance is large, this paper starts by considering the minimization of the Rényi divergence $D_\alpha(P\|Q)$, for an arbitrary $\alpha > 0$, subject to a given (or minimal) value of the total variation distance. Note that the minimization here is taken over all pairs of probability measures whose total variation distance is not below a given value; this problem differs from the type of problems studied in [3] and [24], in connection to the minimization of the relative entropy $D(P\|Q)$ subject to a minimal value of the total variation distance with a fixed probability measure $Q$. This problem generalizes the minimization of the relative entropy subject to a given value of the total variation distance, which is the special case $\alpha = 1$ (see [10, 13, 29]).
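One possible way to deal with this problem stems from the fact that the Rényi divergence is a one-to-one transformation of the Hellinger divergence $\mathscr{H}_\alpha(P\|Q)$ where for $\alpha \in (0,1) \cup (1,\infty)$:
(1) $D_\alpha(P\|Q) = \frac{1}{\alpha-1} \, \log\bigl(1 + (\alpha-1) \, \mathscr{H}_\alpha(P\|Q)\bigr)$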
and $\mathscr{H}_\alpha(\cdot\|\cdot)$ is an $f$-divergence; since the total variation distance is also an $f$-divergence, this problem can be viewed as a minimization of an $f$-divergence subject to a constraint on another $f$-divergence. The numerical optimization of an $f$-divergence subject to simultaneous constraints on $f_i$-divergences $(i = 1, \ldots, L)$ was recently studied in [15], where it has been shown that it suffices to restrict attention to alphabets of cardinality $L+2$. In fact, as shown in [44, (22)], a binary alphabet suffices if there is a single constraint (i.e., $L=1$) which is on the total variation distance. In view of (1), the same conclusion also holds when minimizing the Rényi divergence subject to a constraint on the total variation distance. To set notation, the divergences $D(P\|Q)$, $|P-Q|$, $\mathscr{H}_\alpha(P\|Q)$ and $D_\alpha(P\|Q)$ are defined at the end of this section, being consistent with the notation in [35] and [45].
This paper treats this minimization problem of the Rényi divergence in a different way. We first generalize the analysis in [10], which was used for the minimization of the relative entropy subject to a constraint on the variational distance, to prove that it suffices to restrict attention to probability measures which are defined on a binary alphabet. Furthermore, the continuation of the analysis in this paper relies on Lagrange duality and on a solution of the Karush-Kuhn-Tucker (KKT) equations, while asserting strong duality for the studied problem. The use of Lagrange duality further simplifies the computational task of the studied minimization problem.
As complementary results to the minimization problem studied in this paper, the reader is referred to [35, Section 8], which provides upper bounds on the Rényi divergence $D_\alpha(P\|Q)$ for an arbitrary order $\alpha$ as a function of either the total variation distance or the relative entropy when the relative information is bounded.
The solution of the minimization problem of the Rényi divergence, subject to a constraint on the total variation distance, provides an elegant way to characterize the exact locus of the points $(D(Q\|P_1), D(Q\|P_2))$ where $P_1$ and $P_2$ are probability measures whose total variation distance is not below a given value $\varepsilon$, and $Q$ is an arbitrary probability measure. It is further shown in this paper that all the points of this convex region can be attained by a triple of probability measures $(P_1, P_2, Q)$ which are defined on a binary alphabet.
In view of the characterization of the exact locus of these points, a geometric interpretation is provided in this paper for the minimal Chernoff information between $P_1$ and $P_2$, denoted by $C(P_1, P_2)$, subject to an $\varepsilon$-separation constraint on the variational distance between $P_1$ and $P_2$. It is demonstrated in the following that the intersection point of the boundary of the locus of $(D(Q\|P_1), D(Q\|P_2))$ and the straight line $D(Q\|P_1) = D(Q\|P_2)$ is the point whose coordinates are equal to the minimal value of $C(P_1, P_2)$ under the constraint $|P_1 - P_2| \geq \varepsilon$. The reader is referred to [48], which relies on the closed-form expression in [31, Proposition 2] for the minimization of the constrained Chernoff information, and which analyzes the problem of channel-code detection by a third-party receiver via the likelihood ratio test. In the latter problem, a third-party receiver has to detect the channel code used by the transmitter by observing a large number of noise-affected codewords; this setup has applications in security or cognitive radios, or in link adaptation in some wireless technologies.
Since the Rényi divergence $D_\alpha(P\|Q)$ forms a generalization of the relative entropy $D(P\|Q)$, where the latter corresponds to $\alpha = 1$, the approach suggested in this paper for the characterization of the exact locus of pairs of relative entropies in view of a solution to a minimization problem of the Rényi divergence is analogous to the usefulness of complex analysis in solving real-valued problems. We consider the analysis of this problem as mathematically pleasing in its own right. Note, however, that a special point at the boundary of this locus has an operational meaning in view of [48] (see the previous paragraph). The problem considered here differs from the study in [17], which considered the joint range of $f$-divergences for pairs (rather than triplets) of probability measures.
The performance analysis of linear codes under maximum-likelihood (ML) decoding is of interest for studying the potential performance of these codes under optimal decoding, and for the evaluation of the degradation in performance that is incurred by the use of suboptimal and practical decoding algorithms. The reader is referred to [32] which is focused on this topic.
The second part of this paper derives an exponential upper bound on the performance of ML decoded binary linear block codes (or code ensembles). Its derivation relies on the Gallager bounding technique (see [32, Chapter 4], [36]), and it reproduces the Shulman-Feder bound [40] as a special case. The new exponential bound derived in this paper is expressed in terms of the Rényi divergence from the normalized distance spectrum of the code (or average distance spectrum of the ensemble) to the binomial distribution which characterizes the average distance spectrum of the capacity-achieving ensemble of fully random block codes. This exponential bound provides a quantitative measure of the degradation in performance of binary linear block codes (or code ensembles) as a function of the deviation of their (average) distance spectra from the binomial distribution, and its use is exemplified for an ensemble of turbo-block codes.
This paper is structured as follows: Section II solves the minimization problem for the Rényi divergence under a constraint on the total variation distance, and Section III uses the solution of this minimization problem to obtain an exact characterization of the joint range of the relative entropies in the setting considered above. Section IV provides a new exponential upper bound on the block error probability of ML decoded binary linear block codes, which is expressed in terms of the Rényi divergence, suggests an efficient way to apply the bound to the performance evaluation of binary linear block codes (or code ensembles), and exemplifies its use. Throughout this paper, logarithms are to the base $e$.
We end this section by introducing the definitions and notation used in this work, which are consistent with [35], [45], and are included here for the convenience of the reader.
Definitions and Notation
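We assume throughout that the probability measures $P$ and $Q$ are defined on a common measurable space $(\mathcal{A}, \mathscr{F})$, and $P \ll Q$ denotes that $P$ is absolutely continuous with respect to $Q$, namely there is no event $\mathcal{F} \in \mathscr{F}$ such that $P(\mathcal{F}) > 0 = Q(\mathcal{F})$. Let $\frac{\mathrm{d}P}{\mathrm{d}Q}$ denote the Radon-Nikodym derivative (or density) of $P$ with respect to $Q$.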
Definition 1 (Relative entropy)
The relative entropy is given by
(2) $D(P\|Q) = \int \mathrm{d}P \, \log \frac{\mathrm{d}P}{\mathrm{d}Q}$
Definition 2 (Total variation distance)
The total variation distance is given by
(3) $|P - Q| = 2 \sup_{\mathcal{F} \in \mathscr{F}} \bigl| P(\mathcal{F}) - Q(\mathcal{F}) \bigr|$
Definition 3 (Hellinger divergence)
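The Hellinger divergence of order $\alpha \in (0,1) \cup (1,\infty)$ is given by
(4) $\mathscr{H}_\alpha(P\|Q) = \frac{1}{\alpha-1} \left( \int \mathrm{d}Q \, \Bigl( \frac{\mathrm{d}P}{\mathrm{d}Q} \Bigr)^{\alpha} - 1 \right)$
The analytic extension of $\mathscr{H}_\alpha(P\|Q)$ at $\alpha = 1$ yields $\mathscr{H}_1(P\|Q) = D(P\|Q)$ (nats).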
Definition 4 (Rényi divergence)
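The Rényi divergence of order $\alpha$ is given as follows:
If $\alpha \in (0,1) \cup (1,\infty)$, then
(5) $D_\alpha(P\|Q) = \frac{1}{\alpha-1} \, \log \int \mathrm{d}Q \, \Bigl( \frac{\mathrm{d}P}{\mathrm{d}Q} \Bigr)^{\alpha}$
If $\alpha = 1$, then
(6) $D_1(P\|Q) = D(P\|Q)$
which is the analytic extension of $D_\alpha(P\|Q)$ at $\alpha = 1$.
If $\alpha = \infty$ then
(7) $D_\infty(P\|Q) = \log \Bigl( \operatorname{ess\,sup} \frac{\mathrm{d}P}{\mathrm{d}Q}(Y) \Bigr)$ with $Y \sim Q$.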
Definition 5 (Chernoff information)
The Chernoff information between probability measures $P$ and $Q$ is expressed as follows in terms of the Rényi divergence:
(8) $C(P, Q) = \sup_{\alpha \in (0,1)} \bigl\{ (1-\alpha) \, D_\alpha(P\|Q) \bigr\}$
and it is the best achievable exponent in the Bayesian probability of error for binary hypothesis testing (see, e.g., [5, Theorem 11.9.1]).
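For discrete distributions, the quantities in Definitions 1 to 4 reduce to finite sums, which makes relation (1) easy to check numerically. The following is a minimal sketch under stated assumptions: the function names and the two example PMFs are hypothetical, all probabilities are strictly positive, logarithms are natural, and the total variation distance is taken on the $[0, 2]$ scale (i.e., the $L_1$ distance between the PMFs).

```python
import math

def relative_entropy(p, q):
    """D(P||Q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def total_variation(p, q):
    """|P - Q| on the [0, 2] scale, i.e. the L1 distance between the PMFs."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

def _power_sum(p, q, alpha):
    """sum_x Q(x) * (P(x)/Q(x))**alpha, assuming strictly positive q."""
    return sum(qi * (pi / qi) ** alpha for pi, qi in zip(p, q))

def hellinger_divergence(p, q, alpha):
    """Hellinger divergence of order alpha != 1."""
    return (_power_sum(p, q, alpha) - 1.0) / (alpha - 1.0)

def renyi_divergence(p, q, alpha):
    """Rényi divergence of order alpha in (0, inf], in nats."""
    if alpha == 1:
        return relative_entropy(p, q)
    if math.isinf(alpha):
        return math.log(max(pi / qi for pi, qi in zip(p, q)))
    return math.log(_power_sum(p, q, alpha)) / (alpha - 1.0)

# Hypothetical example PMFs on a ternary alphabet.
P = [0.5, 0.3, 0.2]
Q = [0.2, 0.5, 0.3]
for a in (0.5, 2.0):
    # Right-hand side of (1): Rényi divergence recovered from the Hellinger divergence.
    via_hellinger = math.log(1 + (a - 1) * hellinger_divergence(P, Q, a)) / (a - 1)
    print(a, renyi_divergence(P, Q, a), via_hellinger)
```

The two printed columns should agree for each order, and evaluating `renyi_divergence` at an order close to 1 should approach `relative_entropy(P, Q)`.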
II Minimization of the Rényi Divergence with a Constrained Total Variation Distance
In this section, we derive a tight lower bound on the Rényi divergence $D_\alpha(P_1\|P_2)$ subject to an equality constraint on the total variation distance $|P_1 - P_2| = \varepsilon$ where $\varepsilon \in [0, 2)$ is fixed; alternatively, it can be regarded as a minimization problem under the inequality constraint $|P_1 - P_2| \geq \varepsilon$. It is first shown that this lower bound is attained for probability measures defined on a binary alphabet, and Lagrange duality is used to further simplify the computational task of this bound. The special case where $\alpha = 1$, which is specialized to the minimization of the relative entropy subject to a fixed total variation distance, has been studied extensively, and three equivalent forms of the solution to this optimization problem were derived in [10], [13], [29].
In [14, Corollaries 6 and 9], Gilardoni derived two Pinsker-type lower bounds on the Rényi divergence of order $\alpha \in (0,1)$, expressed in terms of the total variation distance. Among these two bounds, the improved lower bound is given (in nats) by
(9) 
where $\varepsilon$ denotes the total variation distance between $P$ and $Q$. Note that in the limit where $\varepsilon$ tends to 2 (from below), this lower bound converges to a finite value which is at most ; it is, however, an artifact of the lower bound in view of the next lemma.
Lemma 1
(10) 
See Appendix A-A.
In the following, we derive a tight lower bound which is shown to be achievable by a restriction of the probability measures to a binary alphabet. For , let
(11)  
(12) 
In the following, we evaluate the function $g_\alpha$. In view of [10, Section 2] which characterizes the minimum of the relative entropy in terms of the total variation distance, we first extend the argument in [10] to prove the next lemma.
Lemma 2
For an arbitrary $\alpha > 0$, the minimization in (11) is attained by probability measures which are defined on a binary alphabet.
See Appendix A-B.
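Lemma 2 reduces the minimization to pairs of Bernoulli distributions, so the constrained minimum can be approximated by a one-dimensional search. The sketch below is an illustration under stated assumptions, not the method of this paper (which proceeds via Lagrange duality): the function names are hypothetical, the total variation distance of two binary distributions $(p, 1-p)$ and $(q, 1-q)$ is taken as $2|p-q|$, and a plain grid search over $q$ with $p = q + \varepsilon/2$ is used. Restricting to $p > q$ loses no generality since replacing $(p, q)$ by $(1-p, 1-q)$ leaves the binary Rényi divergence unchanged.

```python
import math

def binary_renyi(p, q, alpha):
    """Rényi divergence of order alpha between Bernoulli(p) and Bernoulli(q), in nats.
    Assumes p and q lie strictly inside (0, 1)."""
    if alpha == 1:
        return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))
    s = p ** alpha * q ** (1 - alpha) + (1 - p) ** alpha * (1 - q) ** (1 - alpha)
    return math.log(s) / (alpha - 1)

def min_renyi_given_tv(alpha, eps, grid=4000):
    """Grid-search approximation of the minimal binary Rényi divergence
    subject to p - q = eps / 2 (total variation distance eps on the [0, 2] scale)."""
    delta = eps / 2.0
    best = math.inf
    for k in range(grid):
        q = (k + 0.5) / grid * (1.0 - delta)  # interior grid point in (0, 1 - delta)
        p = q + delta
        best = min(best, binary_renyi(p, q, alpha))
    return best

eps = 1.0
# For order 1/2 the minimum is known in closed form via the Bhattacharyya coefficient.
print(min_renyi_given_tv(0.5, eps), -math.log(1 - eps ** 2 / 4))
# For order 1 (relative entropy), Pinsker's inequality gives the lower bound eps^2 / 2.
print(min_renyi_given_tv(1, eps), eps ** 2 / 2)
```

For order $\frac{1}{2}$ the grid minimum should agree with $-\log(1 - \varepsilon^2/4)$, and for order 1 it should sit slightly above the Pinsker bound, which illustrates why a Pinsker-type bound is not tight for large $\varepsilon$.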
The following proposition enables the calculation of $g_\alpha(\varepsilon)$ for an arbitrary positive $\alpha$.
Proposition 1
This directly follows from Lemma 2.
Proposition 2
(15) 
and
(16) 
Furthermore, for and ,
(17) 
and
(18) 
where
(19) 
See Appendix B.
Remark 1
In the following, we use Lagrange duality to obtain an alternative expression as a solution of the minimization problem for $\alpha \in (0,1)$. Recall that Proposition 1 applies to every $\alpha > 0$. The following considerably simplifies the computational task of calculating $g_\alpha(\varepsilon)$ for $\alpha \in (0,1)$.
Lemma 3
Let and . The function
(20) 
is strictly monotonically increasing, positive, continuous, and
(21) 
See Appendix C.
Corollary 1
For and , the equation
(22) 
has a unique solution .
It follows from Lemma 3 and the intermediate value theorem for continuous functions.
Remark 2
An alternative simplified form for the optimization problem in Proposition 1 is next provided for orders $\alpha \in (0,1)$. Note that Proposition 1 applies to every $\alpha > 0$, whereas the following is restricted to $\alpha \in (0,1)$. This, however, proves to be very useful in the next section in terms of obtaining a significant reduction in the computational complexity of $g_\alpha(\varepsilon)$, where only $\alpha \in (0,1)$ is of interest there. (Footnote 1: This saving in computational complexity reduced the running time of the numerical calculations on our computer by two orders of magnitude.)
Proposition 3
See Appendix D.
III The Locus of $(D(Q\|P_1), D(Q\|P_2))$ With a Constrained Total Variation Distance
In this section, we address the following question:
Question 1
What is the locus of the points $(D(Q\|P_1), D(Q\|P_2))$ if $P_1, P_2, Q$ are arbitrary probability measures which are mutually absolutely continuous, and $|P_1 - P_2| \geq \varepsilon$ for a given value $\varepsilon$ (none of the three probability measures is fixed)?
The present section provides an exact characterization of this locus in view of the solution to the minimization problem in Section II, and the following lemma:
Lemma 4
Let be pairwise mutually absolutely continuous probability measures defined on a measurable space . Then, for ,
(23) 
where the probability measure is given by
(24) 
See Appendix E.
As a corollary of Lemma 4, the following tight inequality holds, which is attributed to van Erven [7, Lemma 6.6] and Shayevitz [39, Section IV.B.8]. It will be useful for the continuation of this section, jointly with the results of Section II.
Corollary 2
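Let $P_1, P_2, Q$ be mutually absolutely continuous discrete probability measures defined on a common set $\mathcal{X}$. If $\alpha \in (0,1)$ then
(25) $D_\alpha(P_1\|P_2) \leq \frac{\alpha}{1-\alpha} \, D(Q\|P_1) + D(Q\|P_2)$
with equality if and only if, for every $x \in \mathcal{X}$,
(26) $Q(x) = \frac{P_1^{\alpha}(x) \, P_2^{1-\alpha}(x)}{\sum_{u \in \mathcal{X}} P_1^{\alpha}(u) \, P_2^{1-\alpha}(u)}$
For $\alpha > 1$, inequality (25) is reversed with the same necessary and sufficient condition for an equality.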
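Corollary 2 can be checked numerically on a small alphabet. The sketch below assumes the inequality takes the standard variational form $D_\alpha(P_1\|P_2) \leq \frac{\alpha}{1-\alpha} D(Q\|P_1) + D(Q\|P_2)$ for $\alpha \in (0,1)$, with equality at the normalized geometric mixture $Q \propto P_1^{\alpha} P_2^{1-\alpha}$; the function names, the example PMFs, and the random test measures are hypothetical choices for illustration.

```python
import math
import random

def kl(q, p):
    """Relative entropy D(Q||P) in nats for strictly positive p."""
    return sum(a * math.log(a / b) for a, b in zip(q, p) if a > 0)

def renyi(p1, p2, alpha):
    """Rényi divergence D_alpha(P1||P2) in nats, alpha in (0,1) or (1, inf)."""
    s = sum(a ** alpha * b ** (1 - alpha) for a, b in zip(p1, p2))
    return math.log(s) / (alpha - 1)

def tilted(p1, p2, alpha):
    """Normalized geometric mixture Q(x) proportional to P1(x)^alpha * P2(x)^(1-alpha)."""
    w = [a ** alpha * b ** (1 - alpha) for a, b in zip(p1, p2)]
    z = sum(w)
    return [x / z for x in w]

def upper_side(q, p1, p2, alpha):
    """Right-hand side of the assumed inequality for a candidate Q."""
    return alpha / (1 - alpha) * kl(q, p1) + kl(q, p2)

def random_pmf(n, rng):
    """Random strictly positive PMF on n symbols."""
    w = [rng.random() + 1e-3 for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

rng = random.Random(0)
P1, P2, alpha = [0.6, 0.3, 0.1], [0.2, 0.3, 0.5], 0.4
q_star = tilted(P1, P2, alpha)
gap_at_q_star = upper_side(q_star, P1, P2, alpha) - renyi(P1, P2, alpha)
min_gap_random = min(
    upper_side(random_pmf(3, rng), P1, P2, alpha) - renyi(P1, P2, alpha)
    for _ in range(1000)
)
print(gap_at_q_star, min_gap_random)
```

The gap should vanish (up to rounding) at the tilted measure and stay positive for every other candidate $Q$, which is the tightness property that the locus characterization relies on.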
Remark 3
The knowledge of the maximizing probability measure in (26) is required for the characterization of the exact locus which is studied in this section.
The exact locus of the points $(D(Q\|P_1), D(Q\|P_2))$ is determined as follows: let $|P_1 - P_2| \geq \varepsilon$ for a fixed $\varepsilon$, and let $\alpha \in (0,1)$ be chosen arbitrarily. By the tight lower bound in Section II, we have
(27) 
where is expressed in (13). For and for a fixed value of , let and in be set to achieve the global minimum in (13) (note that, without loss of generality, one can assume that since if achieves the minimum in (13) then also achieves the same minimum). Consequently, the lower bound in (27) is attained by probability measures which are defined on a binary alphabet (see Lemma 2) with
(28) 
From Corollary 2 and (27), (28), it follows that for every $\alpha \in (0,1)$
(29) 
where equality in (29) holds if $P_1$ and $P_2$ are the probability measures in (28) which are defined on a binary alphabet, and $Q$ is the respective probability measure in (26) which is therefore also defined on a binary alphabet. Hence, there exists a triple of probability measures $(P_1, P_2, Q)$ which are defined on a binary alphabet and satisfy (29) with equality, and these probability measures are easy to calculate for every $\alpha \in (0,1)$ and $\varepsilon$.
Remark 4
Theorem 1
The exact locus of $(D(Q\|P_1), D(Q\|P_2))$ in the setting of Question 1 is the convex region whose boundary is the convex envelope of all the straight lines
(31) 
(i.e., the boundary is the pointwise maximum of the set of straight lines in (31) for $\alpha \in (0,1)$). Furthermore, all the points in this convex region, including its boundary, are attained by probability measures which are defined on a binary alphabet.
Let $P_1, P_2, Q$ be arbitrary probability measures which are mutually absolutely continuous and satisfy the $\varepsilon$-separation condition for $P_1$ and $P_2$ in total variation. In view of Corollary 2 and since by definition $D_\alpha(P_1\|P_2) \geq g_\alpha(\varepsilon)$, it follows that the point $(D(Q\|P_1), D(Q\|P_2))$ satisfies
(32) 
for every $\alpha \in (0,1)$; this implies that every such point is either on or above the convex envelope of the parameterized straight lines in (31).
We next prove that a point which is below the convex envelope of the lines in (31) cannot be achieved under the constraint $|P_1 - P_2| \geq \varepsilon$. The reason is that for such a point $(D(Q\|P_1), D(Q\|P_2))$, there is some $\alpha \in (0,1)$ for which
(33) 
Since under the $\varepsilon$-separation condition for $P_1$ and $P_2$ in total variation distance, $D_\alpha(P_1\|P_2) \geq g_\alpha(\varepsilon)$, then for such $\alpha$, inequality (25) is violated; in view of Corollary 2, this yields that the point $(D(Q\|P_1), D(Q\|P_2))$ is not achievable under the constraint $|P_1 - P_2| \geq \varepsilon$. As an interim conclusion, it follows that the exact locus of the achievable points is the set of all points in the plane which are on or above the convex envelope of the parameterized straight lines in (31) for $\alpha \in (0,1)$.
The next step aims to show that an arbitrary point which is located at the boundary of this region can be obtained by a triplet of probability measures $(P_1, P_2, Q)$ which are defined on a binary alphabet, and satisfy $|P_1 - P_2| \geq \varepsilon$. To that end, note that every point which is on the boundary of this region is a tangent point to one of the straight lines in (31) for some $\alpha \in (0,1)$. Accordingly, the proper probability measures $P_1$, $P_2$ and $Q$ can be determined as follows for a given $\varepsilon$:

Find the slope of the tangent line at the selected point on the boundary; in view of (31), yields .

In view of Proposition 3, determine such that and . Consequently, let and be the probability measures which are defined on the binary alphabet with and .

The respective probability measure is calculated from (26), and it is therefore also defined on the binary alphabet.
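The three steps above can be sketched numerically once the order $\alpha \in (0,1)$ has been read off from the slope of the tangent line. The code below is an illustration under stated assumptions rather than the procedure of Proposition 3: the minimizing binary pair is found by a ternary search over $q$ with $p = q + \varepsilon/2$ (relying on the convexity of the Rényi divergence of order $\alpha \in (0,1)$ in the pair of measures), the third measure is taken as the normalized geometric mixture $Q \propto P_1^{\alpha} P_2^{1-\alpha}$, and the tangent lines are assumed to have the form $\alpha x + (1-\alpha) y = (1-\alpha) g$ in the plane of $x = D(Q\|P_1)$ and $y = D(Q\|P_2)$. All function names are hypothetical.

```python
import math

def binary_renyi(p, q, alpha):
    """Rényi divergence of order alpha != 1 between Bernoulli(p) and Bernoulli(q), nats."""
    s = p ** alpha * q ** (1 - alpha) + (1 - p) ** alpha * (1 - q) ** (1 - alpha)
    return math.log(s) / (alpha - 1)

def binary_kl(a, b):
    """D(Bernoulli(a) || Bernoulli(b)) in nats, for a, b strictly inside (0, 1)."""
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

def boundary_point(alpha, eps, iters=200):
    """Return (D(Q||P1), D(Q||P2), g) for alpha in (0, 1) and eps in (0, 2),
    where g approximates the minimal binary Rényi divergence under p - q = eps / 2."""
    delta = eps / 2.0
    lo, hi = 0.0, 1.0 - delta
    # Ternary search: the objective is convex in q along the line p = q + delta.
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if binary_renyi(m1 + delta, m1, alpha) <= binary_renyi(m2 + delta, m2, alpha):
            hi = m2
        else:
            lo = m1
    q = 0.5 * (lo + hi)
    p = q + delta
    g = binary_renyi(p, q, alpha)
    # Third measure Q = (r, 1 - r): normalized geometric mixture of P1 and P2.
    w1 = p ** alpha * q ** (1 - alpha)
    w0 = (1 - p) ** alpha * (1 - q) ** (1 - alpha)
    r = w1 / (w1 + w0)
    return binary_kl(r, p), binary_kl(r, q), g

for alpha in (0.25, 0.5, 0.75):
    x, y, g = boundary_point(alpha, 1.0)
    # The point should satisfy alpha * x + (1 - alpha) * y = (1 - alpha) * g.
    print(alpha, x, y, alpha * x + (1 - alpha) * y - (1 - alpha) * g)
```

Sweeping `alpha` over $(0,1)$ traces candidate boundary points for a fixed $\varepsilon$; at `alpha = 0.5` the two coordinates should coincide, matching the symmetry of the region about the line $D(Q\|P_1) = D(Q\|P_2)$.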
Finally, we show that every interior point in the achievable region can be attained as well by a proper selection of $P_1$, $P_2$ and $Q$ which are defined on a binary alphabet. To that end, note that every such interior point is located at the boundary of the locus of $(D(Q\|P_1), D(Q\|P_2))$ under the constraint $|P_1 - P_2| \geq \varepsilon'$ with some $\varepsilon' \in (\varepsilon, 2)$; this follows from the fact that $g_\alpha(\varepsilon)$ is a strictly monotonically increasing and continuous function in $\varepsilon$, which tends to infinity as we let $\varepsilon$ tend to 2 (see Lemma 1). It therefore follows that the suitable triplet of probability measures can be obtained by the same algorithm used for points on the boundary of this region, except for replacing $\varepsilon$ by the larger value $\varepsilon'$.
This concludes the proof by first characterizing the exact locus of points, and then demonstrating that every point in this convex region (including its boundary) is attained by probability measures which are defined on the binary alphabet; the proof is also constructive in the sense of providing an algorithm to calculate such probability measures for an arbitrary point in this closed and convex region.
As it is shown in Figure 4, the boundaries of these regions become less curvy as .
A Geometric Interpretation of the Minimal Chernoff Information with a Constraint on the Variational Distance
Consider the point in Figure 4 which, in the plane of $(D(Q\|P_1), D(Q\|P_2))$, is the intersection of the straight line $D(Q\|P_1) = D(Q\|P_2)$ and the boundary of the convex region which is characterized in Theorem 1 for an arbitrary $\varepsilon$.
In view of the proof of Theorem 1, this intersection point satisfies for some , for which are probability measures defined on a binary alphabet with , and is given in (26). The equal coordinates of this intersection point are therefore equal to the Chernoff information (see [5, Section 11.9]). Due to the symmetry of this region with respect to the straight line (this follows from the symmetry property ), the slope of the tangent line to the boundary of the convex region at this intersection point is (see Figure 4). This yields that , and from Proposition 2, . Hence, from (31) with , the equal coordinates of this intersection point are given by
(34) 
Based on [31, Proposition 2], this value is equal to the minimum of the Chernoff information subject to an $\varepsilon$-separation constraint for $P_1$ and $P_2$ in total variation distance. We next calculate the probability measures $P_1$, $P_2$ and $Q$ which attain this intersection point. Eq. (13) with $\alpha = \frac{1}{2}$ yields
(35) 
such that and . A possible solution of this equation is and , so the respective probability measures which are defined on the binary alphabet satisfy and ; consequently, from (26), is the equiprobable distribution on the binary alphabet.
As a byproduct of the characterization of the convex region in Theorem 1, it follows that the straight line $D(Q\|P_1) = D(Q\|P_2)$ (in the plane of Figure 4) intersects the boundary of the convex region which is specified in Theorem 1 at the point whose coordinates are equal to the minimized Chernoff information subject to the constraint $|P_1 - P_2| \geq \varepsilon$. The equal coordinates of each of the 4 intersection points in Figure 4, which refer to , are equal to nats, respectively.
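The minimized Chernoff information can be evaluated directly for the symmetric binary pair. The sketch below is a numerical illustration under stated assumptions: the pair is taken as $P_1 = (p, 1-p)$ and $P_2 = (1-p, p)$ with $p = \frac{1}{2} + \frac{\varepsilon}{4}$, so that the total variation distance equals $\varepsilon$ on the $[0,2]$ scale; the Chernoff information is computed by a grid search over the order; the closed form $-\frac{1}{2} \log\bigl(1 - \frac{\varepsilon^2}{4}\bigr)$ is the value this pair yields and is assumed here to be the expression in (34); and the three values of $\varepsilon$ are arbitrary choices, not those of Figure 4.

```python
import math

def chernoff_information(p1, p2, steps=1000):
    """Chernoff information in nats: max over a in (0,1) of -log sum_x p1(x)^a p2(x)^(1-a),
    approximated on a uniform grid of orders."""
    best = 0.0
    for k in range(1, steps):
        a = k / steps
        s = sum(x ** a * y ** (1 - a) for x, y in zip(p1, p2))
        best = max(best, -math.log(s))
    return best

def symmetric_pair(eps):
    """Binary pair with total variation distance eps (on the [0, 2] scale)."""
    p = 0.5 + eps / 4.0
    return [p, 1 - p], [1 - p, p]

for eps in (0.5, 1.0, 1.5):  # arbitrary illustrative values
    P1, P2 = symmetric_pair(eps)
    closed_form = -0.5 * math.log(1 - eps ** 2 / 4)
    print(eps, chernoff_information(P1, P2), closed_form)
```

By symmetry the maximizing order is $\frac{1}{2}$, so the grid value and the closed form should coincide; any other pair with the same total variation distance should give a Chernoff information that is at least as large.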
IV A Performance Bound for Coded Communications via the Rényi Divergence
IV-A New Exponential Upper Bound
This section derives an exponential upper bound on the performance of binary linear block codes, expressed in terms of the Rényi divergence. Similarly to [19], [20], [21], [23], [25], [33, Section 3.B], [36], [40] and [43], the upper bound in the next theorem quantifies the degradation in the performance of block codes under ML decoding in terms of the deviation of their distance spectra from the binomially distributed (average) distance spectrum of the capacity-achieving ensemble of random block codes.
Theorem 2
Consider a binary linear block code of length and rate where designates the number of codewords. Let and, for , let be the number of nonzero codewords of Hamming weight . Assume that the transmission of the code takes place over a memoryless, binary-input and output-symmetric channel. Then, the block error probability under ML decoding satisfies
(36) 
where for (with the convention that for ), is the binomial distribution with parameter and independent trials (i.e., for ), is the PMF defined by for , is the Rényi divergence of order (i.e., where here), and designates the Gallager random coding error exponent in [12, Eq. (5.6.14)].
Before proving Theorem 2, we relate this exponential bound to previously reported bounds.
Remark 5
Note that the loosening of the bound by taking and, respectively, gives the upper bound
which coincides with the Shulman-Feder bound [40]. Equality (a) follows from the definition of the Gallager random coding exponent in [12, Eq. (5.6.16)] where the symmetric input distribution is the optimal input distribution for any memoryless, binary-input output-symmetric channel, equality (b) follows from the expression of the Rényi divergence of order infinity (see, e.g., [8, Theorem 6]), and equality (c) follows from the definition of the PMFs and in Theorem 2.
Remark 6
The proof of Theorem 2 is based on the framework of the Gallager bounds in [32, Chapter 4] and [36]. Specifically, it has an overlap with [36, Appendix A]. Unlike the analysis in [36, Appendix A], working with the Rényi divergence of order , instead of the relative entropy as a lower bound (see [36, Eq. (A19)]) reveals a need for an optimization of the error exponent, which leads to the error exponent in Theorem 2. Namely, if the value of is increased then the value of is decreased, and therefore is also decreased (unless it is zero, see [8, Theorem 3]; note that and do not depend on the parameters and , so they stay unaffected by varying the values of these parameters). The maximization of the error exponent in Theorem 2 aims at finding a proper balance between the two summands and on the right-hand side of (36), while also performing an optimization over the second dependent variable .
We proceed now with the proof of Theorem 2.
The proof of Theorem 2 is based on the framework of the Gallager bounds in [32, Chapter 4] and [36]. Specifically, it relies on [36, Appendix A]. We explain in the following how our proof differs from the analysis in [36, Appendix A]. From [36, Eq. (A17)], we have that for every
(37) 
From this point, we deviate from the analysis in [36, Appendix A]. Since where , we have
(38) 
where is the Rényi divergence of order from to . This makes it possible to work with the Rényi divergence of order , instead of lower bounding this quantity by the relative entropy and consequently loosening the bound (see [36, Eq. (A19)]). Note that since the Rényi divergence is monotonically increasing in its order (see, e.g., [8, Theorem 3]) and the Rényi divergence of order 1 is particularized to the relative entropy, the inequality holds. The combination of (37) and (38) gives
(39) 
A maximization of the error exponent in (39) with respect to the parameters and (recall that ) gives the upper bound in (36).
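Two properties of the Rényi divergence carry the argument above: it is nondecreasing in its order, and at order infinity it equals the logarithm of the largest probability ratio, which is the quantity appearing in the Shulman-Feder bound. The sketch below illustrates both on a toy case that is not taken from this paper: the $(7, 4)$ Hamming code, whose weight enumerator is $1 + 7x^3 + 7x^4 + x^7$. The normalization $A_l / (M - 1)$ of the nonzero-weight spectrum and the binomial reference $2^{-N} \binom{N}{l}$ are assumptions read from the verbal description in Theorem 2, and the variable names are hypothetical.

```python
import math

N = 7
weights = {3: 7, 4: 7, 7: 1}      # A_l of the (7,4) Hamming code, nonzero codewords only
M = 1 + sum(weights.values())     # total number of codewords, including the all-zero one

# Assumed normalized distance spectrum over l = 0..N (with zero mass at l = 0).
P = [weights.get(l, 0) / (M - 1) for l in range(N + 1)]
# Binomial(N, 1/2) reference distribution.
B = [math.comb(N, l) / 2 ** N for l in range(N + 1)]

def renyi(p, q, order):
    """Rényi divergence D_order(p||q) in nats for order in (0, inf], with q > 0."""
    if math.isinf(order):
        return math.log(max(a / b for a, b in zip(p, q)))
    if order == 1:
        return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)
    s = sum(a ** order * b ** (1 - order) for a, b in zip(p, q))
    return math.log(s) / (order - 1)

for order in (1, 2, 5, 50, math.inf):
    print(order, renyi(P, B, order))
```

The printed values should increase with the order and saturate at the order-infinity value, which shows how much exponent is given up when the bound is loosened to its Shulman-Feder form.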
IV-B Application of Theorem 2
An efficient use of Theorem 2 for the performance evaluation of binary linear block codes (or code ensembles) is suggested in the following by borrowing a concept of bounding from [23], which has been further studied, e.g., in [32], [33], [43], and combining it with the new bound in Theorem 2. In order to utilize the Shulman-Feder bound for binary linear block codes in a clever way, it has been suggested in [23] to partition the binary linear block code into two subcodes and where and is the all-zero codeword. The first subcode contains the all-zero codeword and all the codewords of whose Hamming weights belong to a subset , while contains the other codewords of which have Hamming weights of , together with the all-zero codeword. From the symmetry of the channel, where and designate the conditional ML decoding error probabilities of and , respectively, given that the all-zero codeword is transmitted. Note that although the code is linear, its two subcodes and are in general nonlinear. One can rely on different upper bounds on the conditional error probabilities and , i.e., we may bound by invoking Theorem 2, due to its tightening of the Shulman-Feder bound (see Remark 5), and also rely on an alternative approach for obtaining an upper bound on (e.g., it is possible to rely on the union bound with respect to the fixed composition codes of the subcode ). The idea behind this partitioning is to include in the subcode the codewords of all the Hamming weights whose distance spectrum is close enough to the binomial distribution (see Theorem 2) in the sense that the additional term in the exponent of (36) has a marginal effect on the conditional ML decoding error probability of the subcode .
Theorem 2 can be applied as well to ensembles of binary linear block codes. To verify this claim, let be an ensemble of binary linear block codes. The proof of Theorem 2 follows from the Duman and Salehi bounding technique [36] which leads to the derivation of [36, Eq. (A.11)]. By taking the expectation on the RHS of [36, Eq. (A.11)] with respect to the code ensemble and invoking Jensen's inequality, the same bound holds while , as it is defined in Theorem 2, is replaced by the expectation with respect to the code ensemble . This makes it possible to replace on the RHS of (36) with where
which therefore justifies the generalization of Theorem 2 to code ensembles of binary linear block codes.
As exemplified in Section IV-C, Theorem 2 can be efficiently applied to ensembles of turbo-like codes in the same way that it was demonstrated to be efficient in [43]. Similarly to Theorem 2, the bound in [43, Theorem 3.1] forms another refinement of the Shulman-Feder bound, and the novelty in the former bound is the obtained tightening of the Shulman-Feder bound via the use of the Rényi divergence.
IV-C An Example: Performance Bounds for an Ensemble of Turbo-Block Codes
We conclude this section with an example which applies this bounding technique to the ensemble of uniformly interleaved turbo codes whose two component codes are chosen uniformly at random from the ensemble of (1072, 1000) binary systematic linear block codes. The transmission of these codes takes place over an additive white Gaussian noise (AWGN) channel, and the codes are BPSK modulated and coherently detected. The calculation of the average distance spectrum of this ensemble has been performed in [43, Section 5.D], which is required for the calculation of the upper bound in (36) where the PD is replaced by its expected value over the ensemble (i.e., the normalization of the average distance spectrum by the number of codewords, as it is defined in Theorem 2). In the following, two upper bounds on the block error probability are compared under ML decoding: the first one is the tangential-sphere bound (TSB) of Herzberg and Poltyrev (see [18], [26], [32, Section 3.2.1]), and the second bound follows from the suggested combination of the union bound and Theorem 2. Note that an optimal partitioning has been performed, in a way which is conceptually similar to [43, Algorithm 1], to obtain the tightest bound from the combination of the union bound and Theorem 2.
A comparison of the two bounds shows an advantage of the latter combined bound over the TSB in a similar way to [43, upper plot of Fig. 8] (e.g., providing a gain of about 0.2 dB over the TSB for a block error probability of ). Note that the Shulman-Feder bound is rather loose in this case due to the significant deviation of the ensemble distance spectrum from the binomial distribution at low and high Hamming weights. Furthermore, we note that the advantage of the proposed bound over the TSB in this example is consistent with the analysis in [26] and [42], demonstrating a gap between the random coding error exponent of Gallager and the corresponding error exponents that follow from the TSB and some of its improved versions. Recall that the random coding error exponent of Gallager achieves the channel capacity, whereas the random coding error exponent that follows from the TSB (or some of its improved variants) does not achieve the capacity of a binary-input AWGN channel for BPSK modulated fully random block codes, where the gap to capacity is especially pronounced for high coding rates. In this example, the rate of the ensemble is 0.8741 bits per channel use.
Appendix A Proofs of Lemmas 1 and 2
A-A Proof of Lemma 1
For $\alpha = \frac{1}{2}$, $D_{1/2}(P\|Q) = -2 \log Z(P, Q)$ where $Z(P, Q)$ denotes the Bhattacharyya coefficient between the two PDs $P, Q$. We have
(I.1) 
where $\varepsilon = |P - Q|$ (see, e.g., [31, Proposition 1]; inequality (I.1) is known in quantum information theory with respect to the relation between the trace distance and fidelity [47, Section 9.3]). Hence, (I.1) implies that (10) holds for $\alpha = \frac{1}{2}$. Since $D_\alpha(P\|Q)$ is monotonically increasing in its order (see [8, Theorem 3]), it follows that (10) also holds for $\alpha \geq \frac{1}{2}$. Finally, due to the skew-symmetry property of $D_\alpha$ (see [8, Proposition 2]) where $D_\alpha(P\|Q) = \frac{\alpha}{1-\alpha} \, D_{1-\alpha}(Q\|P)$ for $\alpha \in (0,1)$, and since the total variation distance is a symmetric measure and