# Comparison of channels: criteria for domination by a symmetric channel

Anuran Makur and Yury Polyanskiy

A. Makur and Y. Polyanskiy are with the Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139, USA (e-mail: a_makur@mit.edu; yp@mit.edu). This research was supported in part by the National Science Foundation CAREER award under grant agreement CCF-12-53205, and in part by the Center for Science of Information (CSoI), an NSF Science and Technology Center, under grant agreement CCF-09-39370. This work was presented at the 2017 IEEE International Symposium on Information Theory (ISIT) [1].

###### Abstract

This paper studies the basic question of whether a given channel $V$ can be dominated (in the precise sense of being more noisy) by a $q$-ary symmetric channel. The concept of “less noisy” relation between channels originated in network information theory (broadcast channels) and is defined in terms of mutual information or Kullback-Leibler divergence. We provide an equivalent characterization in terms of $\chi^2$-divergence. Furthermore, we develop a simple criterion for domination by a $q$-ary symmetric channel in terms of the minimum entry of the stochastic matrix defining the channel $V$. The criterion is strengthened for the special case of additive noise channels over finite Abelian groups. Finally, it is shown that domination by a symmetric channel implies (via comparison of Dirichlet forms) a logarithmic Sobolev inequality for the original channel.

Index Terms: Less noisy, degradation, $q$-ary symmetric channel, additive noise channel, Dirichlet form, logarithmic Sobolev inequalities.

## I Introduction

For any Markov chain $U \to X \to Y$, it is well-known that the data processing inequality, $I(U;Y) \le I(U;X)$, holds. This result can be strengthened to [2]:

$$I(U;Y) \le \eta\, I(U;X) \tag{1}$$

where the contraction coefficient $\eta \in [0,1]$ only depends on the channel $P_{Y|X}$. Frequently, one gets $\eta < 1$ and the resulting inequality is called a strong data processing inequality (SDPI). Such inequalities have recently been rediscovered and applied simultaneously in several disciplines; see [3, Section 2] for a short survey. In [3, Section 6], it was noticed that the validity of (1) for all $P_{U,X}$ is equivalent to the statement that an erasure channel with erasure probability $1-\eta$ is less noisy than the given channel $P_{Y|X}$. In this way, the entire field of SDPIs is equivalent to determining whether a given channel is dominated by an erasure channel.

This paper initiates the study of a natural extension of the concept of SDPI by replacing the distinguished role played by erasure channels with $q$-ary symmetric channels. We give simple criteria for testing this type of domination and explain how the latter can be used to prove logarithmic Sobolev inequalities. In the next three subsections, we introduce some basic definitions and notation. We state and motivate our main question in subsection I-D, and present our main results in section II.

### I-A Preliminaries

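The following notation will be used in our ensuing discussion. Consider any $n, m \in \mathbb{N} \triangleq \{1, 2, 3, \dots\}$. We let $\mathbb{R}^{m \times n}$ (respectively $\mathbb{C}^{m \times n}$) denote the set of all real (respectively complex) $m \times n$ matrices. Furthermore, for any matrix $A \in \mathbb{R}^{m \times n}$, we let $A^T \in \mathbb{R}^{n \times m}$ denote the transpose of $A$, $A^{\dagger} \in \mathbb{R}^{n \times m}$ denote the Moore-Penrose pseudoinverse of $A$, $\mathcal{R}(A)$ denote the range (or column space) of $A$, and $\rho(A)$ denote the spectral radius of $A$ (which is the maximum of the absolute values of all complex eigenvalues of $A$) when $m = n$. We let $\mathbb{R}^{n \times n}_{\succeq 0} \subsetneq \mathbb{R}^{n \times n}_{\mathrm{sym}}$ denote the sets of positive semidefinite and symmetric matrices, respectively. In fact, $\mathbb{R}^{n \times n}_{\succeq 0}$ is a closed convex cone (with respect to the Frobenius norm). We also let $\succeq_{\mathrm{PSD}}$ denote the Löwner partial order over $\mathbb{R}^{n \times n}_{\mathrm{sym}}$: for any two matrices $A, B \in \mathbb{R}^{n \times n}_{\mathrm{sym}}$, we write $A \succeq_{\mathrm{PSD}} B$ (or equivalently, $A - B \succeq_{\mathrm{PSD}} 0$, where $0$ is the zero matrix) if and only if $A - B \in \mathbb{R}^{n \times n}_{\succeq 0}$. To work with probabilities, we let $\mathcal{P}_n \triangleq \{p = (p_1, \dots, p_n) \in \mathbb{R}^n : p_1, \dots, p_n \ge 0 \text{ and } p_1 + \cdots + p_n = 1\}$ be the probability simplex of row vectors in $\mathbb{R}^n$, $\mathcal{P}_n^{\circ}$ (the pmfs in $\mathcal{P}_n$ with strictly positive entries) be the relative interior of $\mathcal{P}_n$, and $\mathbb{R}^{m \times n}_{\mathrm{sto}}$ be the convex set of $m \times n$ row stochastic matrices (which have rows in $\mathcal{P}_n$). Finally, for any (row or column) vector $x = [x_1 \cdots x_n]^T \in \mathbb{R}^n$, we let $\mathrm{diag}(x) \in \mathbb{R}^{n \times n}$ denote the diagonal matrix with entries $[\mathrm{diag}(x)]_{i,i} = x_i$ for each $i \in \{1, \dots, n\}$, and for any set of vectors $\{x_1, \dots, x_k\} \subseteq \mathbb{R}^n$, we let $\mathrm{conv}(x_1, \dots, x_k)$ be the convex hull of the vectors in $\{x_1, \dots, x_k\}$.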

### I-B Channel preorders in information theory

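Since we will study preorders over discrete channels that capture various notions of relative “noisiness” between channels, we provide an overview of some well-known channel preorders in the literature. Consider an input random variable $X \in \mathcal{X}$ and an output random variable $Y \in \mathcal{Y}$, where the alphabets are $\mathcal{X} = [q] \triangleq \{0, 1, \dots, q-1\}$ and $\mathcal{Y} = [r] \triangleq \{0, 1, \dots, r-1\}$ for $q, r \in \mathbb{N}$ without loss of generality. We let $\mathcal{P}_q$ be the set of all probability mass functions (pmfs) of $X$, where every pmf $P_X = (P_X(0), \dots, P_X(q-1)) \in \mathcal{P}_q$ is perceived as a row vector. Likewise, we let $\mathcal{P}_r$ be the set of all pmfs of $Y$. A channel is the set of conditional distributions $W_{Y|X}$ that associates each $x \in \mathcal{X}$ with a conditional pmf $W_{Y|X}(\cdot|x) \in \mathcal{P}_r$. So, we represent each channel with a stochastic matrix $W \in \mathbb{R}^{q \times r}_{\mathrm{sto}}$ that is defined entry-wise as:

$$\forall x \in \mathcal{X},\ \forall y \in \mathcal{Y}, \quad [W]_{x+1,y+1} \triangleq W_{Y|X}(y|x) \tag{2}$$

where the $(x+1)$th row of $W$ corresponds to the conditional pmf $W_{Y|X}(\cdot|x) \in \mathcal{P}_r$, and each column of $W$ has at least one non-zero entry so that no output alphabet letters are redundant. Moreover, we think of such a channel as a (linear) map $W : \mathcal{P}_q \to \mathcal{P}_r$ that takes any row probability vector $P_X \in \mathcal{P}_q$ to the row probability vector $P_Y = P_X W \in \mathcal{P}_r$.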

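One of the earliest preorders over channels was the notion of channel inclusion proposed by Shannon in [4]. Given two channels $W \in \mathbb{R}^{q \times r}_{\mathrm{sto}}$ and $V \in \mathbb{R}^{s \times t}_{\mathrm{sto}}$ for some $q, r, s, t \in \mathbb{N}$, he stated that $W$ includes $V$, denoted $W \succeq_{\mathrm{inc}} V$, if there exist a pmf $g \in \mathcal{P}_m$ for some $m \in \mathbb{N}$, and two sets of channels $\{A_k \in \mathbb{R}^{r \times t}_{\mathrm{sto}} : k = 1, \dots, m\}$ and $\{B_k \in \mathbb{R}^{s \times q}_{\mathrm{sto}} : k = 1, \dots, m\}$, such that:

$$V = \sum_{k=1}^{m} g_k B_k W A_k. \tag{3}$$

Channel inclusion is preserved under channel addition and multiplication (which are defined in [5]), and the existence of a code for $V$ implies the existence of as good a code for $W$ in a probability of error sense [4]. The channel inclusion preorder includes the input-output degradation preorder, which can be found in [6], as a special case. Indeed, $V$ is an input-output degraded version of $W$, denoted $W \succeq_{\mathrm{iod}} V$, if there exist channels $A \in \mathbb{R}^{r \times t}_{\mathrm{sto}}$ and $B \in \mathbb{R}^{s \times q}_{\mathrm{sto}}$ such that $V = BWA$. We will study an even more specialized case of Shannon’s channel inclusion known as degradation [7, 8].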

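###### Definition 1 (Degradation Preorder).

A channel $V \in \mathbb{R}^{q \times s}_{\mathrm{sto}}$ is said to be a degraded version of a channel $W \in \mathbb{R}^{q \times r}_{\mathrm{sto}}$ with the same input alphabet, denoted $W \succeq_{\mathrm{deg}} V$, if $V = WA$ for some channel $A \in \mathbb{R}^{r \times s}_{\mathrm{sto}}$.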
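As a small illustration of degradation, the following sketch composes a channel $W$ with a second channel $A$ to obtain a degraded version $V = WA$. The binary symmetric channels and the crossover values 0.1 and 0.2 are hypothetical choices made only for this example, and the helper names `matmul` and `bsc` are not from the paper.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def bsc(delta):
    """Binary symmetric channel with crossover probability delta."""
    return [[1 - delta, delta], [delta, 1 - delta]]

W = bsc(0.1)        # the dominating channel (illustrative choice)
A = bsc(0.2)        # hypothetical degrading channel
V = matmul(W, A)    # V = WA is a degraded version of W

# V is again a BSC; its crossover is 0.9 * 0.2 + 0.1 * 0.8.
crossover = V[0][1]
```

Since the product of row stochastic matrices is row stochastic, `V` is a valid channel whenever `W` and `A` are.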

We note that when Definition 1 of degradation is applied to general matrices (rather than stochastic matrices), it is equivalent to Definition C.8 of matrix majorization in [9, Chapter 15]. Many other generalizations of the majorization preorder over vectors (briefly introduced in Appendix A) that apply to matrices are also presented in [9, Chapter 15].

Körner and Marton defined two other preorders over channels in [10] known as the more capable and less noisy preorders. While the original definitions of these preorders explicitly reflect their significance in channel coding, we will define them using equivalent mutual information characterizations proved in [10]. (See [11, Problems 6.16-6.18] for more on the relationship between channel coding and some of the aforementioned preorders.) We say a channel $W \in \mathbb{R}^{q \times r}_{\mathrm{sto}}$ is more capable than a channel $V \in \mathbb{R}^{q \times s}_{\mathrm{sto}}$ with the same input alphabet, denoted $W \succeq_{\mathrm{mc}} V$, if $I(P_X, W_{Y|X}) \ge I(P_X, V_{Y|X})$ for every input pmf $P_X \in \mathcal{P}_q$, where $I(P_X, W_{Y|X})$ denotes the mutual information of the joint pmf defined by $P_X$ and $W_{Y|X}$. The next definition presents the less noisy preorder, which will be a key player in our study.

###### Definition 2 (Less Noisy Preorder).

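Given two channels $W \in \mathbb{R}^{q \times r}_{\mathrm{sto}}$ and $V \in \mathbb{R}^{q \times s}_{\mathrm{sto}}$ with the same input alphabet, let $Y_W$ and $Y_V$ denote the output random variables of $W$ and $V$, respectively. Then, $W$ is less noisy than $V$, denoted $W \succeq_{\mathrm{ln}} V$, if $I(U;Y_W) \ge I(U;Y_V)$ for every joint distribution $P_{U,X}$, where the random variable $U \in \mathcal{U}$ has some arbitrary range $\mathcal{U}$, and $U \to X \to (Y_W, Y_V)$ forms a Markov chain.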

An analogous characterization of the less noisy preorder using Kullback-Leibler (KL) divergence or relative entropy is given in the next proposition.

###### Proposition 1 (KL Divergence Characterization of Less Noisy [10]).

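Given two channels $W \in \mathbb{R}^{q \times r}_{\mathrm{sto}}$ and $V \in \mathbb{R}^{q \times s}_{\mathrm{sto}}$ with the same input alphabet, $W \succeq_{\mathrm{ln}} V$ if and only if $D(P_X W \| Q_X W) \ge D(P_X V \| Q_X V)$ for every pair of input pmfs $P_X, Q_X \in \mathcal{P}_q$, where $D(\cdot \| \cdot)$ denotes the KL divergence. (Footnote 1: Throughout this paper, we will adhere to the convention that $\infty \ge \infty$ is true. So, $D(P_X W \| Q_X W) \ge D(P_X V \| Q_X V)$ is not violated when both KL divergences are infinity.)

We will primarily use this KL divergence characterization of $\succeq_{\mathrm{ln}}$ in our discourse because of its simplicity. Another well-known equivalent characterization of $\succeq_{\mathrm{ln}}$ due to van Dijk is presented below, cf. [12, Theorem 2]. We will derive some useful corollaries from it later in subsection IV-B.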
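The KL divergence characterization lends itself to a simple numerical sanity check. The sketch below searches a finite grid of binary input pmf pairs for a violation of $D(P_X W \| Q_X W) \ge D(P_X V \| Q_X V)$. Finding no violation on a grid is only evidence, not a proof, of less noisy domination; the grid size, tolerance, and the BSC example are assumptions made for illustration.

```python
import math

def kl(p, q):
    """KL divergence D(p||q) in nats for pmfs on a finite alphabet."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total

def apply_channel(p, W):
    """Output pmf pW for a row pmf p and a row stochastic matrix W."""
    return [sum(p[x] * W[x][y] for x in range(len(p))) for y in range(len(W[0]))]

def bsc(delta):
    return [[1 - delta, delta], [delta, 1 - delta]]

def finds_violation(W, V, grid=20, tol=1e-12):
    """Return True if some grid pair (p, q) has D(pW||qW) < D(pV||qV)."""
    for i in range(1, grid):
        for j in range(1, grid):
            p = [i / grid, 1 - i / grid]
            q = [j / grid, 1 - j / grid]
            lhs = kl(apply_channel(p, W), apply_channel(q, W))
            rhs = kl(apply_channel(p, V), apply_channel(q, V))
            if lhs < rhs - tol:
                return True
    return False

# BSC(0.26) is a degraded version of BSC(0.1), hence no violation is expected.
no_violation = not finds_violation(bsc(0.1), bsc(0.26))
reverse_violation = finds_violation(bsc(0.26), bsc(0.1))
```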

###### Proposition 2 (van Dijk Characterization of Less Noisy [12]).

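Given two channels $W \in \mathbb{R}^{q \times r}_{\mathrm{sto}}$ and $V \in \mathbb{R}^{q \times s}_{\mathrm{sto}}$ with the same input alphabet, consider the functional $F : \mathcal{P}_q \to \mathbb{R}$:

$$\forall P_X \in \mathcal{P}_q, \quad F(P_X) \triangleq I(P_X, W_{Y|X}) - I(P_X, V_{Y|X}).$$

Then, $W \succeq_{\mathrm{ln}} V$ if and only if $F$ is concave.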

The more capable and less noisy preorders have both been used to study the capacity regions of broadcast channels. We refer readers to [13, 14, 15], and the references therein for further details. We also remark that the more capable and less noisy preorders tensorize, as shown in [11, Problem 6.18] and [3, Proposition 16], [16, Proposition 5], respectively.

On the other hand, these preorders exhibit rather counter-intuitive behavior in the context of Bayesian networks (or directed graphical models). Consider a Bayesian network with “source” nodes (with no inbound edges) $X$ and “sink” nodes (with no outbound edges) $Y$. If we select a node $Z$ in this network and replace the channel from the parents of $Z$ to $Z$ with a less noisy channel, then we may reasonably conjecture that the channel from $X$ to $Y$ also becomes less noisy (motivated by the results in [3]). However, this conjecture is false. To see this, consider the Bayesian network in Figure 1 (inspired by the results in [17]), where the source nodes are $X_1 \sim \mathrm{Bernoulli}(1/2)$ and $X_2 = 1$ (almost surely), the node $Z$ is the output of a binary symmetric channel (BSC) with crossover probability $\delta \in [0,1]$, denoted $\mathrm{BSC}(\delta)$, and the sink node $Y$ is the output of a NOR gate. Let $I(\delta) = I(X_1, X_2; Y)$ be the end-to-end mutual information. Then, although $\mathrm{BSC}(0) \succeq_{\mathrm{ln}} \mathrm{BSC}(\delta)$ for $\delta \in (0,1)$, it is easy to verify that $I(\delta) > I(0) = 0$. So, when we replace the $\mathrm{BSC}(\delta)$ with a less noisy $\mathrm{BSC}(0)$, the end-to-end channel does not become less noisy (or more capable).

The next proposition illustrates certain well-known relationships between the various preorders discussed in this subsection.

###### Proposition 3 (Relations between Channel Preorders).

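Given two channels $W \in \mathbb{R}^{q \times r}_{\mathrm{sto}}$ and $V \in \mathbb{R}^{q \times s}_{\mathrm{sto}}$ with the same input alphabet, we have:

1. $W \succeq_{\mathrm{deg}} V \ \Rightarrow\ W \succeq_{\mathrm{iod}} V \ \Rightarrow\ W \succeq_{\mathrm{inc}} V$,

2. $W \succeq_{\mathrm{deg}} V \ \Rightarrow\ W \succeq_{\mathrm{ln}} V \ \Rightarrow\ W \succeq_{\mathrm{mc}} V$.

These observations follow in a straightforward manner from the definitions of the various preorders. Perhaps the only nontrivial implication is $W \succeq_{\mathrm{deg}} V \Rightarrow W \succeq_{\mathrm{ln}} V$, which can be proven using Proposition 1 and the data processing inequality.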

### I-C Symmetric channels and their properties

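We next formally define $q$-ary symmetric channels and convey some of their properties. To this end, we first introduce some properties of Abelian groups and define additive noise channels. Let us fix some $q \in \mathbb{N}$ with $q \ge 2$ and consider an Abelian group $(\mathcal{X}, \oplus)$ of order $q$ equipped with a binary “addition” operation denoted by $\oplus$. Without loss of generality, we let $\mathcal{X} = [q]$, and let $0$ denote the identity element. This endows the elements of $\mathcal{X}$ with an ordering. Each element $x \in \mathcal{X}$ permutes the entries of the row vector $(0, \dots, q-1)$ to $(\sigma_x(0), \dots, \sigma_x(q-1))$ by (left) addition in the Cayley table of the group, where $\sigma_x : [q] \to [q]$ denotes a permutation of $[q]$, and $\sigma_x(y) = x \oplus y$ for every $y \in \mathcal{X}$. So, corresponding to each $x \in \mathcal{X}$, we can define a permutation matrix $P_x \triangleq [e_{\sigma_x(0)} \cdots e_{\sigma_x(q-1)}] \in \mathbb{R}^{q \times q}$ such that:

$$[v_0 \cdots v_{q-1}]\, P_x = [v_{\sigma_x(0)} \cdots v_{\sigma_x(q-1)}] \tag{4}$$

for any $v_0, \dots, v_{q-1} \in \mathbb{R}$, where for each $i \in [q]$, $e_i \in \mathbb{R}^q$ is the $(i+1)$th standard basis column vector with unity in the $(i+1)$th position and zero elsewhere. The permutation matrices $\{P_x \in \mathbb{R}^{q \times q} : x \in \mathcal{X}\}$ (with the matrix multiplication operation) form a group that is isomorphic to $(\mathcal{X}, \oplus)$ (see Cayley’s theorem, and permutation and regular representations of groups in [18, Sections 6.11, 7.1, 10.6]). In particular, these matrices commute as $(\mathcal{X}, \oplus)$ is Abelian, and are jointly unitarily diagonalizable by a Fourier matrix of characters (using [19, Theorem 2.5.5]). We now recall that given a row vector $x = (x_0, \dots, x_{q-1}) \in \mathbb{R}^q$, we may define a corresponding $\mathcal{X}$-circulant matrix, $\mathrm{circ}_{\mathcal{X}}(x) \in \mathbb{R}^{q \times q}$, that is defined entry-wise as [20, Chapter 3E, Section 4]:

$$\forall a, b \in [q], \quad [\mathrm{circ}_{\mathcal{X}}(x)]_{a+1,b+1} \triangleq x_{-a \oplus b} \tag{5}$$

where $-a \in \mathcal{X}$ denotes the inverse of $a \in \mathcal{X}$. Moreover, we can decompose this $\mathcal{X}$-circulant matrix as:

$$\mathrm{circ}_{\mathcal{X}}(x) = \sum_{i=0}^{q-1} x_i P_i^T \tag{6}$$

since $\sum_{i=0}^{q-1} x_i [P_i^T]_{a+1,b+1} = x_{-a \oplus b}$ for every $a, b \in [q]$. Using similar reasoning, we can write:

$$\mathrm{circ}_{\mathcal{X}}(x) = [P_0 y \ \cdots \ P_{q-1} y] = [P_0 x^T \ \cdots \ P_{q-1} x^T]^T \tag{7}$$

where $y = [x_0 \ x_{-1} \cdots x_{-(q-1)}]^T \in \mathbb{R}^q$, and $P_0 = I_q \in \mathbb{R}^{q \times q}$ is the $q \times q$ identity matrix. Using (6), we see that $\mathcal{X}$-circulant matrices are normal, form a commutative algebra, and are jointly unitarily diagonalizable by a Fourier matrix. Furthermore, given two row vectors $x, y \in \mathbb{R}^q$, we can define $x\, \mathrm{circ}_{\mathcal{X}}(y) = y\, \mathrm{circ}_{\mathcal{X}}(x)$ as the $\mathcal{X}$-circular convolution of $x$ and $y$, where the commutativity of $\mathcal{X}$-circular convolution follows from the commutativity of $\mathcal{X}$-circulant matrices.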

A salient specialization of this discussion is the case where $\oplus$ is addition modulo $q$, and $(\mathcal{X}, \oplus)$ is the cyclic Abelian group $\mathbb{Z}/q\mathbb{Z}$. In this scenario, $\mathcal{X}$-circulant matrices correspond to the standard circulant matrices which are jointly unitarily diagonalized by the discrete Fourier transform (DFT) matrix. Furthermore, for each $x \in [q]$, the permutation matrix $P_x^T = P_q^x$, where $P_q \in \mathbb{R}^{q \times q}$ is the generator cyclic permutation matrix as presented in [19, Section 0.9.6]:

$$\forall a, b \in [q], \quad [P_q]_{a+1,b+1} \triangleq \Delta_{1,(b-a \ (\mathrm{mod}\ q))} \tag{8}$$

where $\Delta_{i,j}$ is the Kronecker delta function, which is unity if $i = j$ and zero otherwise. The matrix $P_q$ cyclically shifts any input row vector to the right once, i.e. $(x_1, x_2, \dots, x_q) P_q = (x_q, x_1, \dots, x_{q-1})$.
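For the cyclic group with addition modulo $q$, the constructions in (5) and (8) can be sketched in a few lines. This is a minimal illustration using 0-based indices; the function names are hypothetical and the test vector is arbitrary.

```python
def circulant(x):
    """Z_q-circulant matrix of x as in (5): entry (a, b) equals x[(b - a) mod q]."""
    q = len(x)
    return [[x[(b - a) % q] for b in range(q)] for a in range(q)]

def cyclic_shift_matrix(q):
    """Generator matrix as in (8): entry (a, b) is 1 exactly when b - a = 1 (mod q)."""
    return [[1 if (b - a) % q == 1 else 0 for b in range(q)] for a in range(q)]

def vecmat(v, M):
    """Row vector times matrix."""
    return [sum(v[a] * M[a][b] for a in range(len(v))) for b in range(len(M[0]))]

x = [1, 2, 3, 4]
P = cyclic_shift_matrix(4)
shifted = vecmat(x, P)    # one cyclic shift to the right
rows = circulant(x)       # row a is x shifted to the right a times
```

Each row of the circulant matrix is the previous row shifted once to the right, which matches the row-wise form of the decomposition in (7).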

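Let us now consider a channel with common input and output alphabet $\mathcal{X} = \mathcal{Y} = [q]$, where $(\mathcal{X}, \oplus)$ is an Abelian group. Such a channel operating on an Abelian group is called an additive noise channel when it is defined as:

$$Y = X \oplus Z \tag{9}$$

where $X \in \mathcal{X}$ is the input random variable, $Y \in \mathcal{X}$ is the output random variable, and $Z \in \mathcal{X}$ is the additive noise random variable that is independent of $X$ with pmf $P_Z = (P_Z(0), \dots, P_Z(q-1)) \in \mathcal{P}_q$. The channel transition probability matrix corresponding to (9) is the $\mathcal{X}$-circulant stochastic matrix $\mathrm{circ}_{\mathcal{X}}(P_Z) \in \mathbb{R}^{q \times q}_{\mathrm{sto}}$, which is also doubly stochastic (i.e. both $\mathrm{circ}_{\mathcal{X}}(P_Z), \mathrm{circ}_{\mathcal{X}}(P_Z)^T \in \mathbb{R}^{q \times q}_{\mathrm{sto}}$). Indeed, for an additive noise channel, it is well-known that the pmf of $Y$, $P_Y \in \mathcal{P}_q$, can be obtained from the pmf of $X$, $P_X \in \mathcal{P}_q$, by $\mathcal{X}$-circular convolution: $P_Y = P_X\, \mathrm{circ}_{\mathcal{X}}(P_Z)$. We remark that in the context of various channel symmetries in the literature (see [21, Section VI.B] for a discussion), additive noise channels correspond to “group-noise” channels, and are input symmetric, output symmetric, Dobrushin symmetric, and Gallager symmetric.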
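The relation between the additive noise model (9) and its circulant transition matrix can be checked numerically. The sketch below, which assumes the cyclic group with addition modulo $q$ and uses arbitrary illustrative pmfs, computes the output pmf once by enumerating the independent pair $(X, Z)$ and once by multiplying with the circulant matrix.

```python
def circulant(p_z):
    """Transition matrix of Y = X + Z (mod q): entry (a, b) equals P_Z(b - a)."""
    q = len(p_z)
    return [[p_z[(b - a) % q] for b in range(q)] for a in range(q)]

def output_pmf_by_enumeration(p_x, p_z):
    """Accumulate P(X = x) P(Z = z) on the outcome y = x + z (mod q)."""
    q = len(p_x)
    p_y = [0.0] * q
    for x in range(q):
        for z in range(q):
            p_y[(x + z) % q] += p_x[x] * p_z[z]
    return p_y

def output_pmf_by_matrix(p_x, p_z):
    """Row vector p_x times the circulant matrix of p_z."""
    C = circulant(p_z)
    q = len(p_x)
    return [sum(p_x[a] * C[a][b] for a in range(q)) for b in range(q)]

p_x = [0.5, 0.3, 0.2]   # illustrative input pmf
p_z = [0.7, 0.2, 0.1]   # illustrative noise pmf
p_y_enum = output_pmf_by_enumeration(p_x, p_z)
p_y_mat = output_pmf_by_matrix(p_x, p_z)
column_sums = [sum(row[b] for row in circulant(p_z)) for b in range(3)]
```

The column sums equal one, reflecting that the transition matrix is doubly stochastic.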

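The $q$-ary symmetric channel is an additive noise channel on the Abelian group $(\mathcal{X}, \oplus)$ with noise pmf $P_Z = w_\delta \triangleq (1-\delta, \delta/(q-1), \dots, \delta/(q-1)) \in \mathcal{P}_q$, where $\delta \in [0,1]$. Its channel transition probability matrix is denoted $W_\delta \in \mathbb{R}^{q \times q}_{\mathrm{sto}}$:

$$W_\delta \triangleq \mathrm{circ}_{\mathcal{X}}(w_\delta) = \left[ w_\delta^T \ \ P_q^T w_\delta^T \ \cdots \ (P_q^T)^{q-1} w_\delta^T \right]^T \tag{10}$$

which has $1-\delta$ in the principal diagonal entries and $\delta/(q-1)$ in all other entries regardless of the choice of group $(\mathcal{X}, \oplus)$. We may interpret $\delta$ as the total crossover probability of the symmetric channel. Indeed, when $q = 2$, $W_\delta$ represents a BSC with crossover probability $\delta$. Although $W_\delta$ is only stochastic when $\delta \in [0,1]$, we will refer to the parametrized convex set of matrices $\{W_\delta \in \mathbb{R}^{q \times q}_{\mathrm{sym}} : \delta \in \mathbb{R}\}$ with parameter $\delta$ as the “symmetric channel matrices,” where each $W_\delta$ has the form (10) such that every row and column sums to unity. We conclude this subsection with a list of properties of symmetric channel matrices.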

###### Proposition 4 (Properties of Symmetric Channel Matrices).

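The symmetric channel matrices, $\{W_\delta \in \mathbb{R}^{q \times q}_{\mathrm{sym}} : \delta \in \mathbb{R}\}$, satisfy the following properties:

1. For every $\delta \in \mathbb{R}$, $W_\delta$ is a symmetric circulant matrix.

2. The DFT matrix $F_q \in \mathbb{C}^{q \times q}$, which is defined entry-wise as $[F_q]_{j,k} = q^{-1/2} \omega^{(j-1)(k-1)}$ for $1 \le j, k \le q$ where $\omega = \exp(2\pi i/q)$ and $i = \sqrt{-1}$, jointly diagonalizes $W_\delta$ for every $\delta \in \mathbb{R}$. Moreover, the corresponding eigenvalues or Fourier coefficients, $\{\lambda_j(W_\delta) = [F_q^H W_\delta F_q]_{j,j} : j = 1, \dots, q\}$, are real:

$$\lambda_j(W_\delta) = \begin{cases} 1, & j = 1 \\ 1 - \delta - \frac{\delta}{q-1}, & j = 2, \dots, q \end{cases}$$

where $F_q^H$ denotes the Hermitian transpose of $F_q$.

3. For all $\delta \in [0,1]$, $W_\delta$ is a doubly stochastic matrix that has the uniform pmf $u \triangleq (1/q, \dots, 1/q)$ as its stationary distribution: $u W_\delta = u$.

4. For every $\delta \in \mathbb{R} \setminus \{\frac{q-1}{q}\}$, $W_\delta^{-1} = W_\tau$ with $\tau = -\delta / \left(1 - \delta - \frac{\delta}{q-1}\right)$, and for $\delta = \frac{q-1}{q}$, $W_\delta = \frac{1}{q} \mathbf{1}\mathbf{1}^T$ is unit rank and singular, where $\mathbf{1} = [1 \cdots 1]^T$.

5. The set $\{W_\delta \in \mathbb{R}^{q \times q}_{\mathrm{sym}} : \delta \in \mathbb{R} \setminus \{\frac{q-1}{q}\}\}$ with the operation of matrix multiplication is an Abelian group.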

###### Proof.

See Appendix B. ∎
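Some of the properties in Proposition 4 are easy to confirm numerically. The following sketch builds $W_\delta$, checks that a vector orthogonal to the all-ones vector is an eigenvector with eigenvalue $1 - \delta - \delta/(q-1)$, and checks that $W_\tau$ with $\tau = -\delta/(1 - \delta - \delta/(q-1))$ inverts $W_\delta$. The values $q = 4$ and $\delta = 0.3$ are arbitrary choices for illustration.

```python
def symmetric_channel(q, delta):
    """q-ary symmetric channel matrix: 1 - delta on the diagonal, delta/(q-1) elsewhere."""
    off = delta / (q - 1)
    return [[1 - delta if a == b else off for b in range(q)] for a in range(q)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def matvec(A, f):
    return [sum(A[i][j] * f[j] for j in range(len(f))) for i in range(len(A))]

q, delta = 4, 0.3
W = symmetric_channel(q, delta)

# Non-unit eigenvalue and the parameter of the inverse matrix.
lam = 1 - delta - delta / (q - 1)
tau = -delta / lam

f = [1.0, -1.0, 0.0, 0.0]              # orthogonal to the all-ones vector
Wf = matvec(W, f)                      # should equal lam * f
product = matmul(W, symmetric_channel(q, tau))   # should be the identity
```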

### I-D Main question and motivation

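As we mentioned at the outset, our work is partly motivated by [3, Section 6], where the authors demonstrate an intriguing relation between less noisy domination by an erasure channel and the contraction coefficient of the SDPI (1). For a common input alphabet $\mathcal{X} = [q]$, consider a channel $V \in \mathbb{R}^{q \times s}_{\mathrm{sto}}$ and a $q$-ary erasure channel $E_\epsilon \in \mathbb{R}^{q \times (q+1)}_{\mathrm{sto}}$ with erasure probability $\epsilon \in [0,1]$. Recall that given an input $x \in \mathcal{X}$, a $q$-ary erasure channel erases $x$ and outputs e (erasure symbol) with probability $\epsilon$, and outputs $x$ itself with probability $1 - \epsilon$; the output alphabet of the erasure channel is $\{\mathrm{e}\} \cup \mathcal{X}$. It is proved in [3, Proposition 15] that $E_\epsilon \succeq_{\mathrm{ln}} V$ if and only if $\eta_{\mathrm{KL}}(V) \le 1 - \epsilon$, where $\eta_{\mathrm{KL}}(V) \in [0,1]$ is the contraction coefficient for KL divergence:

$$\eta_{\mathrm{KL}}(V) \triangleq \sup_{\substack{P_X, Q_X \in \mathcal{P}_q \\ 0 < D(P_X \| Q_X) < +\infty}} \frac{D(P_X V \| Q_X V)}{D(P_X \| Q_X)} \tag{11}$$

which equals the best possible constant $\eta$ in the SDPI (1) when $V = P_{Y|X}$ (see [3, Theorem 4] and the references therein). This result illustrates that the $q$-ary erasure channel $E_\epsilon$ with the largest erasure probability $\epsilon \in [0,1]$ (or the smallest channel capacity) that is less noisy than $V$ has $\epsilon = 1 - \eta_{\mathrm{KL}}(V)$. (Footnote 2: A $q$-ary erasure channel $E_\epsilon$ with erasure probability $\epsilon \in [0,1]$ has channel capacity $C(\epsilon) = \log(q)(1 - \epsilon)$, which is linear and decreasing.) Furthermore, there are several simple upper bounds on $\eta_{\mathrm{KL}}$ that provide sufficient conditions for such less noisy domination. For example, if the $\ell^1$-distances between the rows of $V$ are bounded by $2(1 - \alpha)$ for some $\alpha \in [0,1]$, then $\eta_{\mathrm{KL}} \le 1 - \alpha$, cf. [22]. Another criterion follows from Doeblin minorization [23, Remark III.2]: if for some pmf $p \in \mathcal{P}_s$ and some $\alpha \in (0,1)$, $V \ge \alpha \mathbf{1} p$ entry-wise, then $E_\alpha \succeq_{\mathrm{deg}} V$ and $\eta_{\mathrm{KL}}(V) \le 1 - \alpha$.

To extend these ideas, we consider the following question: What is the $q$-ary symmetric channel $W_\delta$ with the largest value of $\delta \in [0, \frac{q-1}{q}]$ (or the smallest channel capacity) such that $W_\delta \succeq_{\mathrm{ln}} V$? (Footnote 3: A $q$-ary symmetric channel $W_\delta$ with total crossover probability $\delta \in [0, \frac{q-1}{q}]$ has channel capacity $C(\delta) = \log(q) - H(w_\delta)$, which is convex and decreasing. Here, $H(w_\delta)$ denotes the Shannon entropy of the pmf $w_\delta$.) Much like the bounds on $\eta_{\mathrm{KL}}$ in the erasure channel context, the goal of this paper is to address this question by establishing simple criteria for testing domination by a $q$-ary symmetric channel. We next provide several other reasons why determining whether a $q$-ary symmetric channel dominates a given channel $V$ is interesting.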
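To make the contraction coefficient concrete, the sketch below computes a grid-based lower bound on $\eta_{\mathrm{KL}}(V)$ for a binary-input channel by maximizing the ratio of output to input KL divergences over pairs of interior input pmfs. This is only a crude lower bound obtained by a hypothetical grid search, not an exact computation. For a BSC with crossover probability $\delta$, the well-known value $(1 - 2\delta)^2$ serves as a reference point.

```python
import math

def kl(p, q):
    """KL divergence D(p||q) in nats; assumes q has full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def apply_channel(p, V):
    return [sum(p[x] * V[x][y] for x in range(len(p))) for y in range(len(V[0]))]

def eta_kl_lower_bound(V, grid=100):
    """Largest ratio D(pV||qV) / D(p||q) over a grid of distinct binary pmf pairs."""
    best = 0.0
    for i in range(1, grid):
        for j in range(1, grid):
            if i == j:
                continue  # D(p||q) = 0 is excluded from the supremum
            p = [i / grid, 1 - i / grid]
            q = [j / grid, 1 - j / grid]
            ratio = kl(apply_channel(p, V), apply_channel(q, V)) / kl(p, q)
            best = max(best, ratio)
    return best

delta = 0.1
V = [[1 - delta, delta], [delta, 1 - delta]]
estimate = eta_kl_lower_bound(V)
reference = (1 - 2 * delta) ** 2   # known contraction coefficient of a BSC
```

The largest ratios occur for nearby pmfs close to uniform, so refining the grid moves the estimate towards the reference value from below.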

Firstly, if $W \succeq_{\mathrm{ln}} V$, then $W^{\otimes n} \succeq_{\mathrm{ln}} V^{\otimes n}$ (where $W^{\otimes n}$ is the $n$-fold tensor product of $W$) since $\succeq_{\mathrm{ln}}$ tensorizes, and $I(U; Y_W^n) \ge I(U; Y_V^n)$ for every Markov chain $U \to X^n \to (Y_W^n, Y_V^n)$ (see Definition 2). Thus, many impossibility results (in statistical decision theory for example) that are proven by exhibiting bounds on quantities such as $I(U; Y_W^n)$ transparently carry over to statistical experiments with observations on the basis of $V$. Since it is common to study the $q$-ary symmetric observation model (especially with $q = 2$), we can leverage its sample complexity lower bounds for other $V$.

Secondly, we present a self-contained information theoretic motivation. We have $W \succeq_{\mathrm{ln}} V$ if and only if $C_S = 0$, where $C_S$ is the secrecy capacity of the Wyner wiretap channel with $V$ as the main (legal receiver) channel and $W$ as the eavesdropper channel [24, Corollary 3], [11, Corollary 17.11]. Therefore, finding the maximally noisy $q$-ary symmetric channel that dominates $V$ establishes the minimal noise required on the eavesdropper link so that secret communication is feasible.

Thirdly, $\succeq_{\mathrm{ln}}$ domination turns out to entail a comparison of Dirichlet forms (see subsection II-D), and consequently, allows us to prove Poincaré and logarithmic Sobolev inequalities for $V$ from well-known results on $q$-ary symmetric channels. These inequalities are cornerstones of the modern approach to Markov chains and concentration of measure [25, 26].

## II Main results

In this section, we first delineate some guiding sub-questions of our study, indicate the main results that address them, and then present these results in the ensuing subsections. We will delve into the following four leading questions:

1. Can we test the less noisy preorder $\succeq_{\mathrm{ln}}$ without using KL divergence?
Yes, we can use $\chi^2$-divergence as shown in Theorem 1.

2. Given a channel $V$, is there a simple sufficient condition for less noisy domination by a $q$-ary symmetric channel $W_\delta \succeq_{\mathrm{ln}} V$?
Yes, a condition using degradation (which implies less noisy domination) is presented in Theorem 2.

3. Can we say anything stronger about less noisy domination by a $q$-ary symmetric channel when $V$ is an additive noise channel?
Yes, Theorem 3 outlines the structure of additive noise channels in this context (and Figure 2 depicts it).

4. Why are we interested in less noisy domination by $q$-ary symmetric channels?
Because this permits us to compare Dirichlet forms as portrayed in Theorem 4.

We next elaborate on these aforementioned theorems.

### II-A $\chi^2$-divergence characterization of the less noisy preorder

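Our most general result illustrates that although less noisy domination is a preorder defined using KL divergence, one can equivalently define it using $\chi^2$-divergence. Since we will prove this result for general measurable spaces, we introduce some notation pertinent only to this result. Let $(\mathcal{X}, \mathcal{F})$, $(\mathcal{Y}_1, \mathcal{H}_1)$, and $(\mathcal{Y}_2, \mathcal{H}_2)$ be three measurable spaces, and let $W : \mathcal{H}_1 \times \mathcal{X} \to [0,1]$ and $V : \mathcal{H}_2 \times \mathcal{X} \to [0,1]$ be two Markov kernels (or channels) acting on the same source space $(\mathcal{X}, \mathcal{F})$. Given any probability measure $P_X$ on $(\mathcal{X}, \mathcal{F})$, we denote by $P_X W$ the probability measure on $(\mathcal{Y}_1, \mathcal{H}_1)$ induced by the push-forward of $P_X$. (Footnote 4: Here, we can think of $X$ and $Y$ as random variables with codomains $\mathcal{X}$ and $\mathcal{Y}$, respectively. The Markov kernel $W$ behaves like the conditional distribution of $Y$ given $X$ (under regularity conditions). Moreover, when the distribution of $X$ is $P_X$, the corresponding distribution of $Y$ is $P_Y = P_X W$.) Recall that for any two probability measures $P_X$ and $Q_X$ on $(\mathcal{X}, \mathcal{F})$, their KL divergence is given by:

$$D(P_X \| Q_X) \triangleq \begin{cases} \displaystyle\int_{\mathcal{X}} \log\!\left(\frac{dP_X}{dQ_X}\right) dP_X, & \text{if } P_X \ll Q_X \\ +\infty, & \text{otherwise} \end{cases} \tag{12}$$

and their $\chi^2$-divergence is given by:

$$\chi^2(P_X \| Q_X) \triangleq \begin{cases} \displaystyle\int_{\mathcal{X}} \left(\frac{dP_X}{dQ_X}\right)^{2} dQ_X - 1, & \text{if } P_X \ll Q_X \\ +\infty, & \text{otherwise} \end{cases} \tag{13}$$

where $P_X \ll Q_X$ denotes that $P_X$ is absolutely continuous with respect to $Q_X$, $\frac{dP_X}{dQ_X}$ denotes the Radon-Nikodym derivative of $P_X$ with respect to $Q_X$, and $\log(\cdot)$ is the natural logarithm with base $e$ (throughout this paper). Furthermore, the characterization of $\succeq_{\mathrm{ln}}$ in Proposition 1 extends naturally to general Markov kernels; indeed, $W \succeq_{\mathrm{ln}} V$ if and only if $D(P_X W \| Q_X W) \ge D(P_X V \| Q_X V)$ for every pair of probability measures $P_X$ and $Q_X$ on $(\mathcal{X}, \mathcal{F})$. The next theorem presents the $\chi^2$-divergence characterization of $\succeq_{\mathrm{ln}}$.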

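###### Theorem 1 ($\chi^2$-Divergence Characterization of $\succeq_{\mathrm{ln}}$).

For any Markov kernels $W$ and $V$ acting on the same source space, $W \succeq_{\mathrm{ln}} V$ if and only if:

$$\chi^2(P_X W \| Q_X W) \ge \chi^2(P_X V \| Q_X V)$$

for every pair of probability measures $P_X$ and $Q_X$ on $(\mathcal{X}, \mathcal{F})$.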

Theorem 1 is proved in subsection IV-A.
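In the finite alphabet setting, the $\chi^2$-divergence reduces to a sum, and the criterion of Theorem 1 can be probed on a grid of input pmf pairs. The sketch below is a heuristic check on binary-input channels under assumed grid and tolerance values; passing it is evidence of less noisy domination rather than a proof.

```python
import math

def chi2(p, q):
    """Chi-squared divergence between pmfs p and q on a finite alphabet."""
    total = 0.0
    for pi, qi in zip(p, q):
        if qi == 0:
            if pi > 0:
                return math.inf   # p is not absolutely continuous w.r.t. q
            continue
        total += pi * pi / qi
    return total - 1.0

def apply_channel(p, W):
    return [sum(p[x] * W[x][y] for x in range(len(p))) for y in range(len(W[0]))]

def bsc(delta):
    return [[1 - delta, delta], [delta, 1 - delta]]

def chi2_order_holds(W, V, grid=20, tol=1e-12):
    """Check chi2(pW||qW) >= chi2(pV||qV) on a grid of binary pmf pairs."""
    for i in range(1, grid):
        for j in range(1, grid):
            p = [i / grid, 1 - i / grid]
            q = [j / grid, 1 - j / grid]
            lhs = chi2(apply_channel(p, W), apply_channel(q, W))
            rhs = chi2(apply_channel(p, V), apply_channel(q, V))
            if lhs < rhs - tol:
                return False
    return True

# BSC(0.26) equals BSC(0.1) followed by BSC(0.2), so the order should hold.
holds = chi2_order_holds(bsc(0.1), bsc(0.26))
reversed_holds = chi2_order_holds(bsc(0.26), bsc(0.1))
example = chi2([0.5, 0.5], [0.25, 0.75])
```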

### II-B Less noisy domination by symmetric channels

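Our remaining results are all concerned with less noisy (and degraded) domination by $q$-ary symmetric channels. Suppose we are given a $q$-ary symmetric channel $W_\delta \in \mathbb{R}^{q \times q}_{\mathrm{sto}}$ with $\delta \in [0,1]$, and another channel $V \in \mathbb{R}^{q \times q}_{\mathrm{sto}}$ with common input and output alphabets. Then, the next result provides a sufficient condition for when $W_\delta \succeq_{\mathrm{deg}} V$.

###### Theorem 2 (Sufficient Condition for Degradation by Symmetric Channels).

Given a channel $V \in \mathbb{R}^{q \times q}_{\mathrm{sto}}$ with $q \ge 2$ and minimum probability $\nu = \min\{[V]_{i,j} : 1 \le i, j \le q\}$, we have:

$$0 \le \delta \le \frac{\nu}{1 - (q-1)\nu + \frac{\nu}{q-1}} \ \Rightarrow\ W_\delta \succeq_{\mathrm{deg}} V.$$

Theorem 2 is proved in section VI. We note that the sufficient condition in Theorem 2 is tight as there exist channels $V$ that violate $W_\delta \succeq_{\mathrm{deg}} V$ when $\delta > \nu / \left(1 - (q-1)\nu + \frac{\nu}{q-1}\right)$. Furthermore, Theorem 2 also provides a sufficient condition for $W_\delta \succeq_{\mathrm{ln}} V$ due to Proposition 3.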
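The criterion of Theorem 2 is straightforward to evaluate. The sketch below computes the threshold from the minimum entry $\nu$ and then verifies degradation directly: when $W_\delta$ is invertible, the only matrix $A$ with $W_\delta A = V$ is $A = W_\delta^{-1} V$, so $W_\delta \succeq_{\mathrm{deg}} V$ holds exactly when this $A$ has nonnegative entries. The example channel is hypothetical, and this direct check is an illustration that is not necessarily the argument used in section VI.

```python
def symmetric_channel(q, delta):
    off = delta / (q - 1)
    return [[1 - delta if a == b else off for b in range(q)] for a in range(q)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def degradation_threshold(V):
    """Largest delta guaranteed by Theorem 2 for a q x q channel V."""
    q = len(V)
    nu = min(min(row) for row in V)
    return nu / (1 - (q - 1) * nu + nu / (q - 1))

def degrading_channel(V, delta):
    """Return A = W_delta^{-1} V, using W_delta^{-1} = W_tau from Proposition 4."""
    q = len(V)
    lam = 1 - delta - delta / (q - 1)   # non-unit eigenvalue, assumed nonzero
    tau = -delta / lam
    return matmul(symmetric_channel(q, tau), V)

# Hypothetical 3 x 3 channel with minimum entry nu = 0.1.
V = [[0.6, 0.3, 0.1],
     [0.2, 0.5, 0.3],
     [0.1, 0.2, 0.7]]
delta = degradation_threshold(V)
A = degrading_channel(V, delta)
is_channel = (all(a >= -1e-12 for row in A for a in row)
              and all(abs(sum(row) - 1.0) < 1e-9 for row in A))
reconstructed = matmul(symmetric_channel(3, delta), A)
```

For $q = 2$ the threshold simplifies to $\nu$ itself, so a BSC with crossover at most the minimum entry of $V$ always degrades to $V$.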

### II-C Structure of additive noise channels

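Our next major result is concerned with understanding when $q$-ary symmetric channels operating on an Abelian group $(\mathcal{X}, \oplus)$ dominate other additive noise channels on $(\mathcal{X}, \oplus)$, which are defined in (9), in the less noisy and degraded senses. Given a symmetric channel $W_\delta \in \mathbb{R}^{q \times q}_{\mathrm{sto}}$ with $\delta \in [0,1]$, we define the additive less noisy domination region of $W_\delta$ as:

$$\mathcal{L}^{\mathrm{add}}_{W_\delta} \triangleq \left\{ v \in \mathcal{P}_q : W_\delta \succeq_{\mathrm{ln}} \mathrm{circ}_{\mathcal{X}}(v) \right\} \tag{14}$$

which is the set of all noise pmfs whose corresponding channel transition probability matrices are dominated by $W_\delta$ in the less noisy sense. Likewise, we define the additive degradation region of $W_\delta$ as:

$$\mathcal{D}^{\mathrm{add}}_{W_\delta} \triangleq \left\{ v \in \mathcal{P}_q : W_\delta \succeq_{\mathrm{deg}} \mathrm{circ}_{\mathcal{X}}(v) \right\} \tag{15}$$

which is the set of all noise pmfs whose corresponding channel transition probability matrices are degraded versions of $W_\delta$. The next theorem exactly characterizes $\mathcal{D}^{\mathrm{add}}_{W_\delta}$, and “bounds” $\mathcal{L}^{\mathrm{add}}_{W_\delta}$ in a set theoretic sense.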

###### Theorem 3 (Additive Less Noisy Domination and Degradation Regions for Symmetric Channels).

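Given a symmetric channel $W_\delta \in \mathbb{R}^{q \times q}_{\mathrm{sto}}$ with $\delta \in [0, \frac{q-1}{q}]$ and $q \ge 2$, we have:

$$\mathcal{D}^{\mathrm{add}}_{W_\delta} = \mathrm{conv}\!\left(\left\{ w_\delta P_q^k : k \in [q] \right\}\right) \subseteq \mathrm{conv}\!\left(\left\{ w_\delta P_q^k : k \in [q] \right\} \cup \left\{ w_\gamma P_q^k : k \in [q] \right\}\right) \subseteq \mathcal{L}^{\mathrm{add}}_{W_\delta} \subseteq \left\{ v \in \mathcal{P}_q : \|v - u\|_{2} \le \|w_\delta - u\|_{2} \right\}$$

where the first set inclusion is strict for $\delta \in (0, \frac{q-1}{q})$ and $q \ge 3$, $P_q$ denotes the generator cyclic permutation matrix as defined in (8), $u$ denotes the uniform pmf, $\|\cdot\|_{2}$ is the Euclidean $\ell^2$-norm, and:

$$\gamma = \frac{1 - \delta}{1 - \delta + \frac{\delta}{(q-1)^2}}.$$

Furthermore, $\mathcal{L}^{\mathrm{add}}_{W_\delta}$ is a closed and convex set that is invariant under the permutations $\{P_x \in \mathbb{R}^{q \times q} : x \in \mathcal{X}\}$ defined in (4) corresponding to the underlying Abelian group $(\mathcal{X}, \oplus)$ (i.e. $v \in \mathcal{L}^{\mathrm{add}}_{W_\delta} \Rightarrow v P_x \in \mathcal{L}^{\mathrm{add}}_{W_\delta}$ for every $x \in \mathcal{X}$).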

Theorem 3 is a compilation of several results. As explained at the very end of subsection V-B, Proposition 6 (in subsection III-A), Corollary 1 (in subsection III-B), part 1 of Proposition 9 (in subsection V-A), and Proposition 11 (in subsection V-B) make up Theorem 3. We remark that according to numerical evidence, the second and third set inclusions in Theorem 3 appear to be strict, and $\mathcal{L}^{\mathrm{add}}_{W_\delta}$ seems to be a strictly convex set. The content of Theorem 3 and these observations are illustrated in Figure 2, which portrays the probability simplex of noise pmfs for $q = 3$ and the pertinent regions which capture less noisy domination and degradation by a $q$-ary symmetric channel.

### II-D Comparison of Dirichlet forms

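As mentioned in subsection I-D, one of the reasons we study $q$-ary symmetric channels and prove Theorems 2 and 3 is because less noisy domination implies useful bounds between Dirichlet forms. Recall that the $q$-ary symmetric channel $W_\delta$ with $\delta \in (0,1)$ has uniform stationary distribution $u \in \mathcal{P}_q$ (see part 3 of Proposition 4). For any channel $V \in \mathbb{R}^{q \times q}_{\mathrm{sto}}$ that is doubly stochastic and has uniform stationary distribution, we may define a corresponding Dirichlet form:

$$\forall f \in \mathbb{R}^q, \quad \mathcal{E}_V(f,f) = \frac{1}{q} f^T (I_q - V) f \tag{16}$$

where $f \in \mathbb{R}^q$ are column vectors, and $I_q \in \mathbb{R}^{q \times q}$ denotes the $q \times q$ identity matrix (as shown in [25] or [26]). Our final theorem portrays that $W_\delta \succeq_{\mathrm{ln}} V$ implies that the Dirichlet form corresponding to $V$ dominates the Dirichlet form corresponding to $W_\delta$ pointwise. The Dirichlet form corresponding to $W_\delta$ is in fact a scaled version of the so called standard Dirichlet form:

$$\forall f \in \mathbb{R}^q, \quad \mathcal{E}_{\mathrm{std}}(f,f) \triangleq \mathrm{VAR}_u(f) = \frac{1}{q} \sum_{k=1}^{q} f_k^2 - \left( \frac{1}{q} \sum_{k=1}^{q} f_k \right)^{2} \tag{17}$$

which is the Dirichlet form corresponding to the $q$-ary symmetric channel $W_{(q-1)/q} = \mathbf{1} u$ with all uniform conditional pmfs. Indeed, using $I_q - W_\delta = \frac{q\delta}{q-1} \left( I_q - \mathbf{1} u \right)$, we have:

$$\forall f \in \mathbb{R}^q, \quad \mathcal{E}_{W_\delta}(f,f) = \frac{q\delta}{q-1}\, \mathcal{E}_{\mathrm{std}}(f,f). \tag{18}$$

The standard Dirichlet form is the usual choice for Dirichlet form comparison because its logarithmic Sobolev constant has been precisely computed in [25, Appendix, Theorem A.1]. So, we present Theorem 4 using $\mathcal{E}_{\mathrm{std}}$ rather than $\mathcal{E}_{W_\delta}$.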
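The scaling relation between the Dirichlet form of a symmetric channel and the standard Dirichlet form can be verified numerically. The sketch below evaluates (16) and (17) directly for an arbitrary test vector; the values of $q$, $\delta$, and $f$ are assumptions chosen only for illustration.

```python
def symmetric_channel(q, delta):
    off = delta / (q - 1)
    return [[1 - delta if a == b else off for b in range(q)] for a in range(q)]

def dirichlet_form(V, f):
    """(1/q) f^T (I - V) f for a doubly stochastic V, as in (16)."""
    q = len(f)
    Vf = [sum(V[i][j] * f[j] for j in range(q)) for i in range(q)]
    return sum(f[i] * (f[i] - Vf[i]) for i in range(q)) / q

def standard_dirichlet_form(f):
    """Variance of f under the uniform pmf, as in (17)."""
    q = len(f)
    mean = sum(f) / q
    return sum(x * x for x in f) / q - mean * mean

q, delta = 3, 0.3
f = [1.0, -2.0, 0.5]
lhs = dirichlet_form(symmetric_channel(q, delta), f)
rhs = q * delta / (q - 1) * standard_dirichlet_form(f)
```

With $\delta = (q-1)/q$ the scaling factor is one, and the symmetric channel's Dirichlet form coincides with the variance under the uniform pmf.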

###### Theorem 4 (Domination of Dirichlet Forms).

Given the doubly stochastic channels $W_\delta \in \mathbb{R}^{q \times q}_{\mathrm{sto}}$ with $\delta \in [0, \frac{q-1}{q}]$ and $V \in \mathbb{R}^{q \times q}_{\mathrm{sto}}$, if $W_\delta \succeq_{\mathrm{ln}} V$, then:

$$\forall f \in \mathbb{R}^q, \quad \mathcal{E}_V(f,f) \ge \frac{q\delta}{q-1}\, \mathcal{E}_{\mathrm{std}}(f,f).$$

An extension of Theorem 4 is proved in section VII. The domination of Dirichlet forms shown in Theorem 4 has several useful consequences. A major consequence is that we can immediately establish Poincaré (spectral gap) inequalities and logarithmic Sobolev inequalities (LSIs) for the channel $V$ using the corresponding inequalities for $q$-ary symmetric channels. For example, the LSI for $W_\delta$ with $q > 2$ is:

$$D(f^2 u \,\|\, u) \le \frac{(q-1)\log(q-1)}{(q-2)\delta}\, \mathcal{E}_{W_\delta}(f,f) \tag{19}$$

for all $f \in \mathbb{R}^q$ such that $\sum_{k=1}^{q} f_k^2 = q$, where we use (54) and the logarithmic Sobolev constant computed in part 1 of Proposition 12. As shown in Appendix B, (19) is easily established using the known logarithmic Sobolev constant corresponding to the standard Dirichlet form. Using the LSI for $V$ that follows from (19) and Theorem 4, we immediately obtain guarantees on the convergence rate and hypercontractivity properties of the associated Markov semigroup $\{\exp(-t(I_q - V)) : t \ge 0\}$. We refer readers to [25] and [26] for comprehensive accounts of such topics.

### II-E Outline

We briefly outline the content of the ensuing sections. In section III, we study the structure of less noisy domination and degradation regions of channels. In section IV, we prove Theorem 1 and present some other equivalent characterizations of $\succeq_{\mathrm{ln}}$. We then derive several necessary and sufficient conditions for less noisy domination among additive noise channels in section V, which together with the results of section III, culminates in a proof of Theorem 3. Section VI provides a proof of Theorem 2, and section VII introduces LSIs and proves an extension of Theorem 4. Finally, we conclude our discussion in section VIII.

## III Less noisy domination and degradation regions

In this section, we focus on understanding the “geometric” aspects of less noisy domination and degradation by channels. We begin by deriving some simple characteristics of the sets of channels that are dominated by some fixed channel in the less noisy and degraded senses. We then specialize our results for additive noise channels, and this culminates in a complete characterization of $\mathcal{D}^{\mathrm{add}}_{W_\delta}$ and derivations of certain properties of $\mathcal{L}^{\mathrm{add}}_{W_\delta}$ presented in Theorem 3.

Let $W \in \mathbb{R}^{q \times r}_{\mathrm{sto}}$ be a fixed channel with $q, r \in \mathbb{N}$, and define its less noisy domination region:

$$\mathcal{L}_W \triangleq \left\{ V \in \mathbb{R}^{q \times r}_{\mathrm{sto}} : W \succeq_{\mathrm{ln}} V \right\} \tag{20}$$

as the set of all channels on the same input and output alphabets that are dominated by $W$ in the less noisy sense. Moreover, we define the degradation region of $W$:

$$\mathcal{D}_W \triangleq \left\{ V \in \mathbb{R}^{q \times r}_{\mathrm{sto}} : W \succeq_{\mathrm{deg}} V \right\}$$