
# Strong Data Processing Inequalities and Φ-Sobolev Inequalities for Discrete Channels

Maxim Raginsky

The author is with the Department of Electrical and Computer Engineering and the Coordinated Science Laboratory, University of Illinois, Urbana, IL 61801, USA. E-mail: maxim@illinois.edu. This work was supported in part by the NSF under CAREER award no. CCF-1254041 and by the Center for Science of Information (CSoI), an NSF Science and Technology Center, under grant agreement CCF-0939370. The material in this paper was presented in part at the 2013 IEEE International Symposium on Information Theory.

March 30, 2016

###### Abstract

The noisiness of a channel can be measured by comparing suitable functionals of the input and output distributions. For instance, the worst-case ratio of output relative entropy to input relative entropy for all possible pairs of input distributions is bounded from above by unity, by the data processing theorem. However, for a fixed reference input distribution, this quantity may be strictly smaller than one, giving so-called strong data processing inequalities (SDPIs). The same considerations apply to an arbitrary $\Phi$-divergence. This paper presents a systematic study of optimal constants in SDPIs for discrete channels, including their variational characterizations, upper and lower bounds, structural results for channels on product probability spaces, and the relationship between SDPIs and so-called $\Phi$-Sobolev inequalities (another class of inequalities that can be used to quantify the noisiness of a channel by controlling entropy-like functionals of the input distribution by suitable measures of input-output correlation). Several applications to information theory, discrete probability, and statistical physics are discussed.

## 1 Introduction

The well-known data processing inequality for the relative entropy states that, for any two probability distributions $\mu, \nu$ over an alphabet $\mathsf{X}$ and for any stochastic transformation (channel) $K$ with input alphabet $\mathsf{X}$ and output alphabet $\mathsf{Y}$,

$$D(\nu K \| \mu K) \le D(\nu \| \mu),$$

where $\mu K$ denotes the distribution at the output of $K$ when the input has distribution $\mu$ (and similarly for $\nu K$). However, if we fix the reference distribution $\mu$ and vary only $\nu$, then in many cases it is possible to show that $D(\nu K \| \mu K)$ is strictly smaller than $D(\nu \| \mu)$ unless $\nu = \mu$. To capture this effect, we define the quantity

$$\eta(\mu, K) \triangleq \sup_{\nu \neq \mu} \frac{D(\nu K \| \mu K)}{D(\nu \| \mu)},$$

and we say that the channel $K$ satisfies a strong data processing inequality (SDPI) at input distribution $\mu$ if $\eta(\mu, K) < 1$. In a remarkable paper [1], Ahlswede and Gács uncovered deep relationships between $\eta(\mu, K)$ and several other quantities, such as the maximal correlation (see [2] and references therein) and so-called hypercontractivity constants of a certain Markov operator associated to the pair $(\mu, K)$. For example, they showed that if $\mathsf{X} = \mathsf{Y} = \{0,1\}$, $\mu = \mathrm{Bern}(1/2)$, and $K = \mathrm{BSC}(\varepsilon)$, then $\eta(\mu, K) = (1 - 2\varepsilon)^2$, which is also equal to the squared maximal correlation in the joint distribution with $X \sim \mathrm{Bern}(1/2)$ and $P_{Y|X} = \mathrm{BSC}(\varepsilon)$, the so-called doubly symmetric binary source (DSBS) with parameter $\varepsilon$ [3].
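The binary symmetric channel example can be probed numerically. The following is a minimal sketch, not part of the original analysis: it assumes $\mu = \mathrm{Bern}(1/2)$, restricts $\nu$ to a finite grid of Bernoulli distributions, and uses natural logarithms. The helper names (`kl_bern`, `bsc_output`, `sdpi_ratio`) are illustrative choices.

```python
import math

def kl_bern(p, q):
    """Relative entropy D(Bern(p) || Bern(q)) in nats, for 0 < q < 1."""
    total = 0.0
    for a, b in ((p, q), (1 - p, 1 - q)):
        if a > 0:
            total += a * math.log(a / b)
    return total

def bsc_output(p, eps):
    """P(Y = 1) when X ~ Bern(p) is sent through BSC(eps)."""
    return p * (1 - eps) + (1 - p) * eps

def sdpi_ratio(q, eps):
    """D(nu K || mu K) / D(nu || mu) for mu = Bern(1/2), nu = Bern(q)."""
    return kl_bern(bsc_output(q, eps), 0.5) / kl_bern(q, 0.5)

eps = 0.1
# Grid over nu = Bern(q), skipping q = 1/2 where the ratio is 0/0.
grid = [i / 1000 for i in range(1, 1000) if i != 500]
best = max(sdpi_ratio(q, eps) for q in grid)
print(best, (1 - 2 * eps) ** 2)
```

On this grid the largest ratio occurs for $\nu$ close to $\mu$ and stays just below $(1-2\varepsilon)^2$, in line with the value quoted from [1]. A grid search only gives a lower bound on the supremum, so it illustrates the result rather than proving it.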

After the pioneering work of Ahlswede and Gács, the contraction properties of relative entropy (and other $\Phi$-divergences [4, 5]) under the action of stochastic transformations have been studied by several other authors [6, 7, 8, 9, 10]. In particular, Cohen et al. [6], who were the first to take up this subject after [1], showed that the SDPI constant of any channel $K$ with respect to any $\Phi$-divergence is always upper-bounded by the so-called Dobrushin contraction coefficient of $K$ [11, 12], another well-known numerical measure of the amount of noise introduced by a channel. (This result of Cohen et al. was rediscovered five years later in the machine learning community [13].) In the last couple of years, strong data processing inequalities have become the subject of intense interest in the information theory community [14, 15, 16, 17, 18, 19, 20, 21, 22] due to their apparent usefulness for establishing various converse results.

In this paper, we revisit the problem of characterizing the strong data processing constant $\eta(\mu, K)$ [and its generalizations for an arbitrary $\Phi$-divergence] and establish a number of new upper and lower bounds, as well as new structural results on SDPI constants in product probability spaces. We also address the relationship between strong data processing inequalities and so-called $\Phi$-Sobolev inequalities [23]. These inequalities also quantify the noisiness of a Markov operator (probability transition kernel) by relating certain “entropy-like” functionals of the input to the rate of increase of suitable “energy-like” quantities from the input to the output. (Logarithmic Sobolev inequalities, widely studied in the theory of probability and Markov chains [24, 25, 8, 26, 27], are a special case.) In particular, we show that the optimal constants in $\Phi$-Sobolev inequalities for a reversible Markov chain can be related to SDPI constants of certain factorizations of the transition kernel of the chain as a product of a forward channel and a backward channel. Such factorizations correspond to all possible realizations of the one-step transition of the chain as a two-component Gibbs sampler [28], which is a standard technique in Markov chain Monte Carlo [29, 30]. Conversely, for a fixed input distribution on $\mathsf{X}$, the SDPI constants of a given channel with input in $\mathsf{X}$ and output in $\mathsf{Y}$ are related to $\Phi$-Sobolev constants of the reversible Markov chain on $\mathsf{X}$ obtained by composing the forward channel with the backward channel determined via Bayes’ rule. To keep things simple, we focus on the discrete case, when both $\mathsf{X}$ and $\mathsf{Y}$ are finite, although some of our results generalize easily to the case of arbitrary Polish alphabets (see, e.g., [21]).

The remainder of the paper is organized as follows. After giving some necessary background on $\Phi$-entropies and $\Phi$-divergences in Section 2, we proceed to the study of strong data processing inequalities in Section 3. Next, in Section 4, we define the $\Phi$-Sobolev inequalities and characterize their relation with SDPIs. Several examples of applications are given in Section 5. Section 6 provides a summary of key contributions. A number of auxiliary technical results are stated and proved in the Appendices.

### 1.1 Notation

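We will denote by $\mathcal{P}(\mathsf{X})$ the set of all probability distributions on an alphabet $\mathsf{X}$ and by $\mathcal{P}_*(\mathsf{X})$ the subset of $\mathcal{P}(\mathsf{X})$ consisting of all strictly positive distributions. The set of all real-valued functions on $\mathsf{X}$ is denoted by $\mathcal{F}(\mathsf{X})$; $\mathcal{F}_*(\mathsf{X})$ and $\mathcal{F}^0_*(\mathsf{X})$ are the subsets of $\mathcal{F}(\mathsf{X})$ consisting of all strictly positive and nonnegative functions, respectively. Any channel $K$ (we will also use the terms “stochastic transformation” or “Markov kernel”) with input alphabet $\mathsf{X}$, output alphabet $\mathsf{Y}$, and transition probabilities $K(y|x)$ acts on probability distributions from the right by

$$\mu K(y) = \sum_{x \in \mathsf{X}} \mu(x) K(y|x), \qquad y \in \mathsf{Y}$$

or on functions from the left by

$$Kf(x) = \sum_{y \in \mathsf{Y}} K(y|x) f(y), \qquad x \in \mathsf{X}.$$

The set of all such channels will be denoted by $\mathcal{M}(\mathsf{Y}|\mathsf{X})$. The affine map $\mu \mapsto \mu K$ naturally extends to a linear map on the signed measures on $\mathsf{X}$, since any such measure $\tilde{\nu}$ can be uniquely represented as $\tilde{\nu} = c_+ \nu_+ - c_- \nu_-$ for some constants $c_\pm \ge 0$ and some $\nu_\pm \in \mathcal{P}(\mathsf{X})$; thus, we set $\tilde{\nu} K \triangleq c_+ \nu_+ K - c_- \nu_- K$. The linear map $f \mapsto Kf$ is positive [i.e., $Kf \ge 0$ whenever $f \ge 0$], and unital [i.e., $K\mathbf{1} = \mathbf{1}$, where $\mathbf{1}$ denotes the constant function that takes the value $1$ everywhere on its domain]. If $P$ denotes the distribution of a random pair $(X, Y)$ with $P_X = \mu$ and $P_{Y|X} = K$, then $Kf(x) = \mathbb{E}[f(Y) | X = x]$ for any $f \in \mathcal{F}(\mathsf{Y})$ and $x \in \mathsf{X}$.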
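For finite alphabets, the two actions of a channel are ordinary vector-matrix products. The sketch below is illustrative only: it assumes the channel is stored as a row-stochastic nested list `K[x][y]` holding the transition probability of output `y` given input `x`, and the helper names `push_forward` and `pull_back` are not from the paper.

```python
def push_forward(mu, K):
    """Output distribution mu K: (mu K)(y) = sum_x mu(x) K(y|x)."""
    ny = len(K[0])
    return [sum(mu[x] * K[x][y] for x in range(len(mu))) for y in range(ny)]

def pull_back(K, f):
    """Function K f on the input alphabet: (K f)(x) = sum_y K(y|x) f(y)."""
    return [sum(K[x][y] * f[y] for y in range(len(f))) for x in range(len(K))]

# A channel from a 2-letter input alphabet to a 3-letter output alphabet.
K = [[0.9, 0.1, 0.0],
     [0.2, 0.3, 0.5]]
mu = [0.4, 0.6]          # input distribution
f = [1.0, -2.0, 3.0]     # real-valued function on the output alphabet

muK = push_forward(mu, K)
Kf = pull_back(K, f)

# Both sides compute E[f(Y)]: once under mu K, once as E[Kf(X)] under mu.
lhs = sum(muK[y] * f[y] for y in range(3))
rhs = sum(mu[x] * Kf[x] for x in range(2))

# Unital property: K maps the constant function 1 to itself.
ones = pull_back(K, [1.0, 1.0, 1.0])
```

The equality of `lhs` and `rhs` is the conditional-expectation reading of $Kf$: averaging $f$ over the output distribution is the same as averaging $Kf$ over the input distribution.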

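We will say that a pair $(\mu, K) \in \mathcal{P}(\mathsf{X}) \times \mathcal{M}(\mathsf{Y}|\mathsf{X})$ is admissible if $\mu \in \mathcal{P}_*(\mathsf{X})$ and $\mu K \in \mathcal{P}_*(\mathsf{Y})$. For any such pair, there exists a unique channel $K^* \in \mathcal{M}(\mathsf{X}|\mathsf{Y})$ with the property that

$$\mathbb{E}[g(X)\, Kf(X)] = \mathbb{E}[K^* g(Y)\, f(Y)] \tag{1.1}$$

for all $f \in \mathcal{F}(\mathsf{Y})$ and $g \in \mathcal{F}(\mathsf{X})$. This backward or adjoint channel can be specified explicitly via the transition probabilities

$$K^*(x|y) = \frac{K(y|x)\mu(x)}{\mu K(y)}, \qquad (x, y) \in \mathsf{X} \times \mathsf{Y} \tag{1.2}$$

(this is simply an application of Bayes’ rule). If $(X, Y) \sim \mu \otimes K$, then $K^* = P_{X|Y}$, so in particular $K^* f(y) = \mathbb{E}[f(X) | Y = y]$ for any $f \in \mathcal{F}(\mathsf{X})$ and $y \in \mathsf{Y}$. Strictly speaking, $K^*$ depends on both $\mu$ and $K$, and we may occasionally indicate this fact by writing $K^*_\mu$ instead of $K^*$.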
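The Bayes-rule formula (1.2) is easy to check on a small example. The following sketch assumes finite alphabets and the nested-list layout `K[x][y]` for the forward channel; the function name `backward_channel` and the layout `Kstar[y][x]` are illustrative choices. It verifies the adjoint relation (1.1) with $Kf$ evaluated on the input alphabet under $\mu$ and $K^*g$ evaluated on the output alphabet under $\mu K$.

```python
def backward_channel(mu, K):
    """Return (mu K, Kstar) with Kstar[y][x] = K(y|x) mu(x) / (mu K)(y)."""
    nx, ny = len(K), len(K[0])
    muK = [sum(mu[x] * K[x][y] for x in range(nx)) for y in range(ny)]
    Kstar = [[K[x][y] * mu[x] / muK[y] for x in range(nx)] for y in range(ny)]
    return muK, Kstar

mu = [0.3, 0.7]
K = [[0.8, 0.15, 0.05],
     [0.1, 0.3, 0.6]]
muK, Kstar = backward_channel(mu, K)

f = [2.0, -1.0, 0.5]   # function on the output alphabet
g = [1.5, 4.0]         # function on the input alphabet

Kf = [sum(K[x][y] * f[y] for y in range(3)) for x in range(2)]
Kstar_g = [sum(Kstar[y][x] * g[x] for x in range(2)) for y in range(3)]

# Both sides equal E[g(X) f(Y)] under the joint law of (X, Y).
lhs = sum(mu[x] * g[x] * Kf[x] for x in range(2))
rhs = sum(muK[y] * Kstar_g[y] * f[y] for y in range(3))

# Each row of Kstar is a probability distribution on the input alphabet.
row_sums = [sum(row) for row in Kstar]
```

Admissibility of $(\mu, K)$ is what keeps the division by $\mu K(y)$ well defined here.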

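Given a number $a \in [0,1]$, we will often write $\bar{a}$ for $1 - a$. For $a, b \in [0,1]$, we let $a \star b \triangleq a\bar{b} + \bar{a}b$. Thus, if $X \sim \mathrm{Bern}(a)$ and $Y \sim \mathrm{Bern}(b)$ are independent random variables, then $X \oplus Y$ has distribution $\mathrm{Bern}(a \star b)$. For $a, b \in \mathbb{R}$, we let $a \wedge b \triangleq \min\{a, b\}$ and $a \vee b \triangleq \max\{a, b\}$. Other notation and definitions will be introduced in the sequel as needed.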

## 2 Background on Φ-entropies and Φ-divergences

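Let $\mathcal{C}$ denote the set of all convex functions $\Phi\colon \mathbb{R}_+ \to \mathbb{R}$. For any $\Phi \in \mathcal{C}$, the $\Phi$-entropy of a nonnegative real-valued random variable $U$ is defined by

$$\operatorname{Ent}_\Phi[U] \triangleq \mathbb{E}[\Phi(U)] - \Phi(\mathbb{E}U), \tag{2.1}$$

provided $\mathbb{E}|\Phi(U)| < \infty$ (see [23] and [31, Chap. 14]). For example, if $\Phi(u) = u^2$, then $\operatorname{Ent}_\Phi[U] = \operatorname{Var}[U]$; if $\Phi(u) = u \log u$, then

$$\operatorname{Ent}_\Phi[U] = \mathbb{E}[U \log U] - \mathbb{E}[U] \log \mathbb{E}[U].$$

The $\Phi$-entropy is nonnegative by Jensen’s inequality.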

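The $\Phi$-divergences between probability distributions [4, 5] arise as a special case of the above definition. (We use the term “$\Phi$-divergence” instead of the more common “$f$-divergence” because we reserve $f$ for real-valued functions on $\mathsf{X}$.) Fix some $\mu \in \mathcal{P}_*(\mathsf{X})$ (this restriction is sufficient for our purposes, and helps avoid certain technicalities involving division by zero). Then, for any $\Phi \in \mathcal{C}$, the $\Phi$-divergence between an arbitrary probability distribution $\nu \in \mathcal{P}(\mathsf{X})$ and $\mu$ is defined as

$$D_\Phi(\nu \| \mu) \triangleq \mathbb{E}_\mu\!\left[\Phi\!\left(\frac{\mathrm{d}\nu}{\mathrm{d}\mu}\right)\right] - \Phi(1).$$

Note that this differs from the usual definition by the subtraction of $\Phi(1)$. There are two reasons behind this modification: (a) $D_\Phi(\mu \| \mu) = 0$ for any $\mu$ (however, unless $\Phi$ is strictly convex at $1$, $D_\Phi(\nu \| \mu) = 0$ does not necessarily imply that $\nu = \mu$), and (b) any two $\Phi, \Phi' \in \mathcal{C}$ such that $\Phi - \Phi'$ is affine determine the same divergence. If we now consider a random variable $X$ with distribution $\mu$ and let $f = \mathrm{d}\nu/\mathrm{d}\mu$, then

$$D_\Phi(\nu \| \mu) = \operatorname{Ent}_\Phi[f(X)].$$

Moreover, if $\Phi(1) = 0$, we can write $D_\Phi(\nu \| \mu) = \mathbb{E}[\Phi(f(X))]$ since $\mathbb{E}[f(X)] = 1$. Here are some important examples of $\Phi$-divergences [5]: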

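1. The relative entropy

$$D(\nu \| \mu) = \mathbb{E}_\nu\!\left[\log \frac{\mathrm{d}\nu}{\mathrm{d}\mu}\right] = \mathbb{E}_\mu\!\left[\frac{\mathrm{d}\nu}{\mathrm{d}\mu} \log \frac{\mathrm{d}\nu}{\mathrm{d}\mu}\right]$$

is a $\Phi$-divergence with $\Phi(u) = u \log u$.

2. The total variation distance

$$\|\nu - \mu\|_{\rm TV} = \frac{1}{2}\, \mathbb{E}_\mu \left| \frac{\mathrm{d}\nu}{\mathrm{d}\mu} - 1 \right|$$

is a $\Phi$-divergence with $\Phi(u) = \frac{1}{2}|u - 1|$.

3. The $\chi^2$-divergence

$$\chi^2(\nu \| \mu) = \mathbb{E}_\mu\!\left[\left(\frac{\mathrm{d}\nu}{\mathrm{d}\mu} - 1\right)^2\right]$$

is a $\Phi$-divergence with $\Phi(u) = (u - 1)^2$ or $\Phi(u) = u^2 - 1$. This is a particular instance of the fact that any two $\Phi$ that differ by an affine function determine the same divergence.

4. The squared Hellinger distance

$$H^2(\nu, \mu) = \mathbb{E}_\mu\!\left[\left(\sqrt{\frac{\mathrm{d}\nu}{\mathrm{d}\mu}} - 1\right)^2\right]$$

is a $\Phi$-divergence with $\Phi(u) = (\sqrt{u} - 1)^2$ or $\Phi(u) = 2(1 - \sqrt{u})$.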
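The four examples can be evaluated from the single definition of $D_\Phi$ with the $\Phi(1)$ term subtracted. The following is a minimal sketch for finite alphabets with strictly positive $\mu$; the generator functions used here ($u\log u$, $\tfrac12|u-1|$, $(u-1)^2$, $u^2-1$, $(\sqrt u-1)^2$, $2(1-\sqrt u)$) are standard choices assumed for illustration, and the helper names are hypothetical.

```python
import math

def phi_divergence(phi, nu, mu):
    """D_Phi(nu || mu) = E_mu[Phi(dnu/dmu)] - Phi(1), mu strictly positive."""
    return sum(m * phi(n / m) for n, m in zip(nu, mu)) - phi(1.0)

def xlogx(u):
    # Convention 0 log 0 = 0.
    return u * math.log(u) if u > 0 else 0.0

nu = [0.2, 0.5, 0.3]
mu = [0.4, 0.4, 0.2]

kl = phi_divergence(xlogx, nu, mu)
tv = phi_divergence(lambda u: 0.5 * abs(u - 1), nu, mu)
chi2_a = phi_divergence(lambda u: (u - 1) ** 2, nu, mu)
chi2_b = phi_divergence(lambda u: u * u - 1, nu, mu)
hel_a = phi_divergence(lambda u: (math.sqrt(u) - 1) ** 2, nu, mu)
hel_b = phi_divergence(lambda u: 2 * (1 - math.sqrt(u)), nu, mu)

# Direct formula for total variation, for comparison.
tv_direct = 0.5 * sum(abs(n - m) for n, m in zip(nu, mu))
```

The pairs `chi2_a`, `chi2_b` and `hel_a`, `hel_b` agree because in each pair the two generators differ by an affine function of $u$, which contributes nothing once $\Phi(1)$ is subtracted and $\mathbb{E}_\mu[\mathrm{d}\nu/\mathrm{d}\mu] = 1$ is used.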

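An important class of $\Phi$-divergences arises in the context of Bayesian estimation. Given a parameter $\lambda \in [0,1]$, consider a random pair $(\Theta, X)$ with

$$\Theta \sim \mathrm{Bern}(\lambda) \qquad \text{and} \qquad P_{X|\Theta=\theta} = \begin{cases} \mu, & \text{if } \theta = 0 \\ \nu, & \text{if } \theta = 1. \end{cases}$$

Fix an action space $\mathsf{A}$ and a loss function $\ell\colon \{0,1\} \times \mathsf{A} \to \mathbb{R}$; in other words, if $\Theta = \theta$ and an action $a \in \mathsf{A}$ is selected, then we incur the loss of $\ell(\theta, a)$. Consider the problem of selecting an action in $\mathsf{A}$ based on some observation $Z$ related to $\Theta$ via the Markov chain $\Theta \to X \to Z$, i.e., $\Theta$ and $Z$ are conditionally independent given $X$. If the action is chosen as $\gamma(Z)$ for some function $\gamma$, then we incur the average loss

$$\mathbb{E}[\ell(\Theta, \gamma(Z))] = \bar{\lambda}\, \mathbb{E}[\ell(0, \gamma(Z)) \,|\, \Theta = 0] + \lambda\, \mathbb{E}[\ell(1, \gamma(Z)) \,|\, \Theta = 1].$$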

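The goal is to pick $\gamma$ to minimize this expected loss for a given observation channel $P_{Z|X}$. In the extreme case when $Z$ is independent of $X$, the best we can do is to take

$$a^* = \operatorname*{arg\,min}_{a \in \mathsf{A}} \left[\bar{\lambda}\ell(0, a) + \lambda\ell(1, a)\right],$$

giving us the average loss of

$$L^*_\lambda \triangleq \inf_{a \in \mathsf{A}} \left[\bar{\lambda}\ell(0, a) + \lambda\ell(1, a)\right].$$

On the other hand, if $Z = X$, then we can attain the minimum Bayes risk

$$L^*_\lambda(\nu, \mu) \triangleq \inf_\gamma \mathbb{E}[\ell(\Theta, \gamma(X))] = \inf_\gamma \left\{ \bar{\lambda} \int_{\mathsf{X}} \ell(0, \gamma(x))\, \mu(\mathrm{d}x) + \lambda \int_{\mathsf{X}} \ell(1, \gamma(x))\, \nu(\mathrm{d}x) \right\},$$

where the infimum is over all measurable functions $\gamma\colon \mathsf{X} \to \mathsf{A}$. The following result is well-known (see, e.g., [32, p. 882]), but the proof is so simple that we give it here: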

###### Proposition 2.1.

The quantity

$$D_{\ell,\lambda}(\nu \| \mu) \triangleq L^*_\lambda - L^*_\lambda(\nu, \mu)$$

is a $\Phi$-divergence.

###### Proof.

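Define the function

$$\Phi_{\ell,\lambda}(u) \triangleq \sup_{a \in \mathsf{A}} \left[ L^*_\lambda - \bar{\lambda}\ell(0, a) - \lambda\ell(1, a)u \right], \qquad u \ge 0.$$

Being a pointwise supremum of affine functions of $u$, it is convex. Moreover, $\Phi_{\ell,\lambda}(1) = 0$. With this, we can write

$$\begin{aligned}
L^*_\lambda - L^*_\lambda(\nu, \mu) &= \sup_\gamma \left( L^*_\lambda - \int_{\mathsf{X}} \mu(\mathrm{d}x) \left[ \bar{\lambda}\ell(0, \gamma(x)) + \lambda \frac{\mathrm{d}\nu}{\mathrm{d}\mu}(x)\, \ell(1, \gamma(x)) \right] \right) \\
&= \int_{\mathsf{X}} \mu(\mathrm{d}x) \sup_{a \in \mathsf{A}} \left[ L^*_\lambda - \bar{\lambda}\ell(0, a) - \lambda\ell(1, a) \frac{\mathrm{d}\nu}{\mathrm{d}\mu}(x) \right] \\
&= \mathbb{E}_\mu\!\left[ \Phi_{\ell,\lambda}\!\left( \frac{\mathrm{d}\nu}{\mathrm{d}\mu} \right) \right].
\end{aligned}$$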

We consider two particular cases:

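• $\mathsf{A} = \{0,1\}$, $\ell(\theta, a) = \mathbf{1}\{\theta \neq a\}$. An easy calculation shows that $L^*_\lambda = \lambda \wedge \bar{\lambda}$ and

$$\Phi_{\ell,\lambda}(u) = \left[\lambda \wedge \bar{\lambda} - \lambda u\right] \vee \left[\lambda \wedge \bar{\lambda} - \bar{\lambda}\right] = \lambda \wedge \bar{\lambda} - (\lambda u) \wedge \bar{\lambda}.$$

Alternatively, we can write

$$L^*_\lambda = \frac{1}{2} - \frac{1}{2} \left\| \mathrm{Bern}(\lambda) - \mathrm{Bern}(\bar{\lambda}) \right\|_{\rm TV} = \frac{1}{2} - \frac{1}{2}|1 - 2\lambda|$$

and

$$L^*_\lambda(\nu, \mu) = \frac{1}{2} - \frac{1}{2} \left\| \lambda\nu - \bar{\lambda}\mu \right\|_{\rm TV},$$

where the total variation norm of a signed measure $\nu$ on $\mathsf{X}$ is given by

$$\|\nu\|_{\rm TV} = \frac{1}{2} \sum_{x \in \mathsf{X}} |\nu(x)|.$$

The optimal decision function is

$$\gamma^*(x) = \mathbf{1}\left\{ \lambda \frac{\mathrm{d}\nu}{\mathrm{d}\mu}(x) \ge \bar{\lambda} \right\}.$$

The resulting divergence is known as the Bayes or statistical information [33]

$$B_\lambda(\nu \| \mu) = \frac{1}{2} \left\| \lambda\nu - \bar{\lambda}\mu \right\|_{\rm TV} - \frac{1}{2}|1 - 2\lambda|.$$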

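In fact, any $\Phi$-divergence can be expressed as an integral of statistical informations [5, Thm. 11]: for any $\Phi \in \mathcal{C}$, there exists a unique Borel measure $M_\Phi$ on $[0,1]$, such that

$$D_\Phi(\nu \| \mu) = \int_{[0,1]} B_\lambda(\nu \| \mu)\, M_\Phi(\mathrm{d}\lambda). \tag{2.2}$$

• $\mathsf{A} = [0,1]$, $\ell(\theta, a) = (\theta - a)^2$. Then $L^*_\lambda = \lambda\bar{\lambda}$ and

$$\Phi_{\ell,\lambda}(u) = \lambda\bar{\lambda} \left( 1 - \frac{u}{\lambda u + \bar{\lambda}} \right),$$

which gives

$$L^*_\lambda(\nu, \mu) = \lambda\bar{\lambda}\, \mathbb{E}_\mu\!\left[ \frac{\mathrm{d}\nu/\mathrm{d}\mu}{\lambda\, \mathrm{d}\nu/\mathrm{d}\mu + \bar{\lambda}} \right],$$

with the optimum decision function

$$\gamma^*(x) = \frac{\lambda \frac{\mathrm{d}\nu}{\mathrm{d}\mu}(x)}{\lambda \frac{\mathrm{d}\nu}{\mathrm{d}\mu}(x) + \bar{\lambda}}.$$

The corresponding divergence is then given by

$$\begin{aligned}
D_{\ell,\lambda}(\nu \| \mu) &= \lambda\bar{\lambda} \left( 1 - \mathbb{E}_\mu\!\left[ \frac{\mathrm{d}\nu/\mathrm{d}\mu}{\lambda\, \mathrm{d}\nu/\mathrm{d}\mu + \bar{\lambda}} \right] \right) \\
&= (\lambda\bar{\lambda})^2\, \mathbb{E}_\mu\!\left[ \frac{(\mathrm{d}\nu/\mathrm{d}\mu - 1)^2}{\lambda\, \mathrm{d}\nu/\mathrm{d}\mu + \bar{\lambda}} \right],
\end{aligned}$$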

where the second expression follows after some algebraic manipulations. Note that the functions for also belong to . The divergences generated by these functions (modulo multiplicative constants) have appeared throughout the statistical literature [34, 35]. In particular, Le Cam [34] considers the case with the above Bayesian hypothesis testing interpretation, while Györfi and Vajda [35] look at arbitrary (including the endpoints and ). For our purposes, it will be convenient to work with the function , which gives the Le Cam divergence with parameter :

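$$\mathrm{LC}_\lambda(\nu \| \mu) \triangleq \lambda\bar{\lambda}\, \mathbb{E}_\mu\!\left[ \frac{(\mathrm{d}\nu/\mathrm{d}\mu - 1)^2}{\lambda\, \mathrm{d}\nu/\mathrm{d}\mu + \bar{\lambda}} \right] \equiv \frac{1}{\lambda\bar{\lambda}} D_{\ell,\lambda}(\nu \| \mu). \tag{2.3}$$

The Le Cam divergences $\mathrm{LC}_0$ and $\mathrm{LC}_1$ are also well-defined and are identically zero.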
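The “algebraic manipulations” relating the two expressions for $D_{\ell,\lambda}$ under quadratic loss can be spot-checked numerically. The sketch below assumes finite alphabets and strictly positive $\mu$; the function name `quadratic_loss_divergence` is hypothetical. It evaluates both forms and then rescales by $\lambda\bar\lambda$ as in (2.3).

```python
def quadratic_loss_divergence(nu, mu, lam):
    """Return both closed forms of D_{l,lambda}(nu || mu) for quadratic loss."""
    lam_bar = 1.0 - lam
    ratio = [n / m for n, m in zip(nu, mu)]   # density d nu / d mu
    # First form: lam * lam_bar * (1 - E_mu[r / (lam r + lam_bar)])
    first = lam * lam_bar * (
        1.0 - sum(m * r / (lam * r + lam_bar) for m, r in zip(mu, ratio)))
    # Second form: (lam * lam_bar)^2 * E_mu[(r - 1)^2 / (lam r + lam_bar)]
    second = (lam * lam_bar) ** 2 * sum(
        m * (r - 1.0) ** 2 / (lam * r + lam_bar) for m, r in zip(mu, ratio))
    return first, second

nu = [0.1, 0.6, 0.3]
mu = [0.3, 0.3, 0.4]

gaps = []
le_cam = {}
for lam in (0.1, 0.25, 0.5, 0.9):
    first, second = quadratic_loss_divergence(nu, mu, lam)
    gaps.append(abs(first - second))
    # Le Cam divergence with parameter lam, following (2.3).
    le_cam[lam] = second / (lam * (1.0 - lam))
```

The agreement rests on the identity $\lambda u + \bar\lambda = 1 + \lambda(u - 1)$ together with $\mathbb{E}_\mu[\mathrm{d}\nu/\mathrm{d}\mu - 1] = 0$.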

More examples of $\Phi$-divergences, as well as a wide variety of inequalities between them, can be found in [36].

From now on, when dealing with quantities indexed by , we will often substitute with some mnemonic notation related to the corresponding -divergence, e.g., , , etc. Moreover, for the case of the relative entropy we will often omit the index altogether and write , , etc.

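Let $U$ and $Y$ be jointly distributed random variables, where $U$ takes nonnegative real values and $Y$ is arbitrary. Given a function $\Phi \in \mathcal{C}$, define the conditional $\Phi$-entropy of $U$ given $Y$:

$$\operatorname{Ent}_\Phi[U|Y] \triangleq \mathbb{E}[\Phi(U)|Y] - \Phi(\mathbb{E}[U|Y]). \tag{2.4}$$

This is a random variable, since it depends on $Y$. Combining (2.4) with (2.1) gives the following generalization of the law of total variance:

$$\operatorname{Ent}_\Phi[U] = \mathbb{E}\big[\operatorname{Ent}_\Phi[U|Y]\big] + \operatorname{Ent}_\Phi\big[\mathbb{E}[U|Y]\big] \tag{2.5}$$

(see [23, pp. 351–352]).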

###### Remark 2.1.

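We may think of

$$J_\Phi(U|Y) \triangleq \mathbb{E}\big[\operatorname{Ent}_\Phi[U|Y]\big]$$

as a kind of “Fisher $\Phi$-information” about $U$ contained in $Y$. (We are grateful to P. Tetali for suggesting this interpretation.) Indeed, let us consider the following special case: let $(X, Y)$ be an exchangeable pair on some space $\mathsf{X}$ (i.e., $\mathbb{P}[X = x, Y = y] = \mathbb{P}[X = y, Y = x]$ for all $x, y \in \mathsf{X}$), and let $U = f(X)$ for some $f \in \mathcal{F}^0_*(\mathsf{X})$. Let $K$ be the stochastic transformation $P_{Y|X}$. Then $X$ has the same distribution as $Y$, and

$$J_\Phi(U|Y) = \operatorname{Ent}_\Phi[U] - \operatorname{Ent}_\Phi\big[\mathbb{E}[U|Y]\big] = \operatorname{Ent}_\Phi[f(Y)] - \operatorname{Ent}_\Phi[K^* f(Y)].$$

By convexity of $\Phi$,

$$\Phi(u + v) \ge \Phi(u) + v\Phi'(u).$$

If we write $K^* = I + L$, where $I$ is the identity operator on $\mathcal{F}(\mathsf{X})$, then

$$J_\Phi(U|Y) = \operatorname{Ent}_\Phi[f(Y)] - \operatorname{Ent}_\Phi[f(Y) + Lf(Y)] \le -\mathbb{E}\big[\Phi'(f(Y))\, Lf(Y)\big].$$

Moreover, if we have a continuous-time reversible Markov chain $\{Y_t\}_{t \ge 0}$ on $\mathsf{X}$ with stationary distribution $\pi$ and with infinitesimal generator $L$, then $(Y_0, Y_t)$ is an exchangeable pair for each $t$, and

$$J_\Phi(f(Y_0)|Y_t) = \operatorname{Ent}_\Phi[f(Y_0)] - \operatorname{Ent}_\Phi[K^*_t f(Y_0)] = -t\, \mathbb{E}\big[\Phi'(f(Y_0))\, Lf(Y_0)\big] + o(t).$$

Dividing both sides by $t$ and taking the limit as $t \to 0$, we get

$$\frac{\mathrm{d}}{\mathrm{d}t} J_\Phi(f(Y_0)|Y_t) \Big|_{t=0} = \lim_{t \to 0} \frac{J_\Phi(f(Y_0)|Y_t)}{t} = -\mathbb{E}\big[\Phi'(f(Y_0))\, Lf(Y_0)\big],$$

which coincides with the $\Phi$-Fisher information functional of Chafaï [23, Eq. (1.14)].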

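We say that the $\Phi$-entropy is subadditive if the inequality

$$\operatorname{Ent}_\Phi[f(X^n)] \le \sum_{i=1}^n \mathbb{E}\Big[\operatorname{Ent}_\Phi\big[f(X^n) \,\big|\, X^{\setminus i}\big]\Big] \tag{2.6}$$

holds for any tuple $X^n = (X_1, \ldots, X_n)$ of independent random variables taking values in some spaces $\mathsf{X}_1, \ldots, \mathsf{X}_n$ and for any function $f\colon \mathsf{X}_1 \times \cdots \times \mathsf{X}_n \to \mathbb{R}_+$, such that $\mathbb{E}|\Phi(f(X^n))| < \infty$. Here, $X^{\setminus i}$ denotes the $(n-1)$-tuple obtained by deleting $X_i$ from $X^n$. We are interested in the following question: what conditions on $\Phi$ ensure that this subadditivity property holds?

For example, if $\Phi(u) = u^2$, then $\operatorname{Ent}_\Phi[U] = \operatorname{Var}[U]$, and in this case the subadditivity property (2.6) is the well-known Efron–Stein–Steele inequality [37, 38]

$$\operatorname{Var}[U] \le \sum_{i=1}^n \mathbb{E}\big[\operatorname{Var}[U \,|\, X^{\setminus i}]\big], \qquad U = f(X^n).$$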
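The variance case of subadditivity can be illustrated for $n = 2$ on finite alphabets. The following sketch uses made-up marginals `p1`, `p2` and an arbitrary function table `f[x1][x2]`; all names are illustrative. Conditioning on $X^{\setminus 1} = X_2$ means taking the variance over $X_1$ with $X_2$ held fixed, and vice versa.

```python
def mean(values, probs):
    return sum(v * p for v, p in zip(values, probs))

def var(values, probs):
    m = mean(values, probs)
    return sum(p * (v - m) ** 2 for v, p in zip(values, probs))

p1 = [0.2, 0.8]                 # law of X1 on a 2-letter alphabet
p2 = [0.5, 0.3, 0.2]            # law of X2 on a 3-letter alphabet
f = [[1.0, 4.0, 0.5],           # f[x1][x2], an arbitrary nonnegative function
     [2.0, 0.0, 3.0]]

# Var[U] under the product law of the independent pair (X1, X2).
flat_vals = [f[i][j] for i in range(2) for j in range(3)]
flat_probs = [p1[i] * p2[j] for i in range(2) for j in range(3)]
total_var = var(flat_vals, flat_probs)

# E[Var[U | X2]]: variance over X1 for each fixed x2, averaged over X2.
term_without_1 = sum(p2[j] * var([f[i][j] for i in range(2)], p1)
                     for j in range(3))
# E[Var[U | X1]]: variance over X2 for each fixed x1, averaged over X1.
term_without_2 = sum(p1[i] * var(f[i], p2) for i in range(2))

bound = term_without_1 + term_without_2
```

Here `total_var` never exceeds `bound`; equality holds when $f$ is a sum of a function of $x_1$ and a function of $x_2$.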

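It is also not hard to show that the “ordinary” entropy [i.e., the $\Phi$-entropy with $\Phi(u) = u \log u$] is subadditive. In general, an induction argument can be used to show that subadditivity is equivalent to the following convexity property [39]: for any two probability spaces $(\mathsf{X}_1, \nu_1)$ and $(\mathsf{X}_2, \nu_2)$ and any function $f\colon \mathsf{X}_1 \times \mathsf{X}_2 \to \mathbb{R}_+$,

$$\operatorname{Ent}_\Phi\!\left[ \int_{\mathsf{X}_2} f(X_1, x_2)\, \nu_2(\mathrm{d}x_2) \right] \le \int_{\mathsf{X}_2} \operatorname{Ent}_\Phi[f(X_1, x_2)]\, \nu_2(\mathrm{d}x_2), \tag{2.7}$$

where $X_1 \sim \nu_1$. The following criterion for subadditivity is useful [39, 40]: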

###### Proposition 2.2.

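Let $\mathcal{C}_1$ be the class of all convex functions $\Phi\colon \mathbb{R}_+ \to \mathbb{R}$ that are twice differentiable on $(0, \infty)$, and such that either $\Phi$ is affine or $\Phi'' > 0$ and $1/\Phi''$ is concave. Then the $\Phi$-entropy is subadditive for all $\Phi \in \mathcal{C}_1$. Conversely, if $\Phi \in \mathcal{C}$ is twice differentiable with $\Phi'' > 0$ and the $\Phi$-entropy is subadditive, then $1/\Phi''$ is concave.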

## 3 Strong data processing inequalities

We now turn to the main subject of the paper: strong data processing inequalities.

###### Definition 3.1.

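Given an admissible pair $(\mu, K) \in \mathcal{P}(\mathsf{X}) \times \mathcal{M}(\mathsf{Y}|\mathsf{X})$ and a function $\Phi \in \mathcal{C}$, we say that $K$ satisfies a $\Phi$-type strong data processing inequality (SDPI) at $\mu$ with constant $c \in [0,1)$, or $\mathrm{SDPI}_\Phi(\mu, c)$ for short, if

$$D_\Phi(\nu K \| \mu K) \le c\, D_\Phi(\nu \| \mu) \tag{3.1}$$

for all $\nu \in \mathcal{P}(\mathsf{X})$. We say that $K$ satisfies $\mathrm{SDPI}_\Phi(c)$ if it satisfies $\mathrm{SDPI}_\Phi(\mu, c)$ for all $\mu \in \mathcal{P}_*(\mathsf{X})$.

We are interested in the tightest constants in SDPIs; with that in mind, we define

$$\eta_\Phi(\mu, K) \triangleq \sup_{\nu \neq \mu} \frac{D_\Phi(\nu K \| \mu K)}{D_\Phi(\nu \| \mu)}, \qquad \eta_\Phi(K) \triangleq \sup_{\mu \in \mathcal{P}_*(\mathsf{X})} \eta_\Phi(\mu, K).$$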

For future reference, we record the following straightforward results:

###### Proposition 3.1 (Functional form of SDPI).

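Fix an admissible pair $(\mu, K)$ and let $(X, Y)$ be a random pair with probability law $\mu \otimes K$. Then $K$ satisfies $\mathrm{SDPI}_\Phi(\mu, c)$ if and only if the inequality

$$\operatorname{Ent}_\Phi[f(X)] \le \frac{1}{1 - c}\, \mathbb{E}\big[\operatorname{Ent}_\Phi[f(X)|Y]\big] \tag{3.2}$$

holds for all nonconstant $f \in \mathcal{F}^0_*(\mathsf{X})$ with $\mathbb{E}[f(X)] = 1$. Consequently,

$$\eta_\Phi(\mu, K) = \sup\left\{ \frac{\operatorname{Ent}_\Phi[K^* f(Y)]}{\operatorname{Ent}_\Phi[f(X)]} : f \in \mathcal{F}^0_*(\mathsf{X}),\ f \neq \mathrm{const},\ \mathbb{E}[f(X)] = 1 \right\} \tag{3.3}$$

$$\phantom{\eta_\Phi(\mu, K)} = 1 - \inf\left\{ \frac{\mathbb{E}\big[\operatorname{Ent}_\Phi[f(X)|Y]\big]}{\operatorname{Ent}_\Phi[f(X)]} : f \in \mathcal{F}^0_*(\mathsf{X}),\ f \neq \mathrm{const},\ \mathbb{E}[f(X)] = 1 \right\}. \tag{3.4}$$
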
###### Proof.

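Fix a probability distribution $\nu \in \mathcal{P}(\mathsf{X})$ and let $f = \mathrm{d}\nu/\mathrm{d}\mu$. Then $f \in \mathcal{F}^0_*(\mathsf{X})$, $\mathbb{E}[f(X)] = 1$, and

$$\frac{\mathrm{d}(\nu K)}{\mathrm{d}(\mu K)} = K^* f$$

by Lemma A.1 in the Appendix. Therefore,

$$D_\Phi(\nu \| \mu) = \operatorname{Ent}_\Phi\!\left[ \frac{\mathrm{d}\nu}{\mathrm{d}\mu}(X) \right] \qquad \text{and} \qquad D_\Phi(\nu K \| \mu K) = \operatorname{Ent}_\Phi\!\left[ \frac{\mathrm{d}(\nu K)}{\mathrm{d}(\mu K)}(Y) \right].$$

Conversely, for any nonconstant $f \in \mathcal{F}^0_*(\mathsf{X})$ with $\mathbb{E}[f(X)] = 1$ there exists a probability distribution $\nu \neq \mu$ such that $f = \mathrm{d}\nu/\mathrm{d}\mu$ and $K^* f = \mathrm{d}(\nu K)/\mathrm{d}(\mu K)$. In that case, the above formulas for the $\Phi$-entropies hold as well.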

Now, if , then (3.2) holds trivially, so assume . In that case, the result follows from Eq. (3.2) and the law of total -entropy, Eq. (2.5). ∎
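The ingredients of this proof can be checked on a toy example. The sketch below assumes finite alphabets, the choice $\Phi(u) = u\log u$, and an arbitrary admissible pair; the helper name `ent` and the specific numbers are illustrative. It verifies that the output density equals $K^*f$, that the law of total $\Phi$-entropy (2.5) holds for $U = f(X)$, and that the two ratio forms in (3.3) and (3.4) agree for this particular $f$.

```python
import math

def ent(phi, values, probs):
    """Phi-entropy E[Phi(U)] - Phi(E[U]) of a finitely supported U."""
    m = sum(v * p for v, p in zip(values, probs))
    return sum(p * phi(v) for v, p in zip(values, probs)) - phi(m)

def phi(u):
    return u * math.log(u) if u > 0 else 0.0

mu = [0.3, 0.7]
nu = [0.6, 0.4]
K = [[0.8, 0.15, 0.05],          # K[x][y] = K(y|x)
     [0.1, 0.3, 0.6]]
nx, ny = 2, 3

f = [nu[x] / mu[x] for x in range(nx)]                    # density d nu / d mu
muK = [sum(mu[x] * K[x][y] for x in range(nx)) for y in range(ny)]
nuK = [sum(nu[x] * K[x][y] for x in range(nx)) for y in range(ny)]
post = [[K[x][y] * mu[x] / muK[y] for x in range(nx)] for y in range(ny)]

Kstar_f = [sum(post[y][x] * f[x] for x in range(nx)) for y in range(ny)]
density_out = [nuK[y] / muK[y] for y in range(ny)]

ent_in = ent(phi, f, mu)                  # Ent[f(X)]      = D(nu || mu)
ent_out = ent(phi, Kstar_f, muK)          # Ent[K* f(Y)]   = D(nu K || mu K)
cond = sum(muK[y] * ent(phi, f, post[y]) for y in range(ny))  # E[Ent[f(X)|Y]]

ratio_33 = ent_out / ent_in               # ratio appearing in (3.3)
ratio_34 = 1.0 - cond / ent_in            # ratio appearing in (3.4)
```

For a single $f$ the ratio only lower-bounds $\eta_\Phi(\mu, K)$; the constant itself requires the supremum over all admissible $f$.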

###### Definition 3.2.

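We say that the $\Phi$-entropy is homogeneous if there exists some function $\kappa\colon (0, \infty) \to \mathbb{R}$, such that the equality

$$\operatorname{Ent}_\Phi[cU] = \kappa(c) \operatorname{Ent}_\Phi[U] \tag{3.5}$$

holds for any nonnegative random variable $U$ such that $\mathbb{E}|\Phi(U)| < \infty$ and for any positive real number $c$.

For example, $\Phi(u) = u \log u$ satisfies (3.5) with $\kappa(c) = c$, while $\Phi(u) = u^p$, $p > 1$, satisfies (3.5) with $\kappa(c) = c^p$.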
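Homogeneity is quick to confirm on a finitely supported random variable. In this sketch the two generators $u \log u$ and $u^p$ (with the arbitrary choice $p = 1.5$) and their scaling functions $c$ and $c^p$ are assumed as the illustrative cases; the helper name `ent` is hypothetical.

```python
import math

def ent(phi, values, probs):
    """Phi-entropy E[Phi(U)] - Phi(E[U]) of a finitely supported U."""
    m = sum(v * p for v, p in zip(values, probs))
    return sum(p * phi(v) for v, p in zip(values, probs)) - phi(m)

values = [0.5, 1.0, 3.0]        # support of U
probs = [0.2, 0.5, 0.3]
c = 2.5
p = 1.5
scaled = [c * v for v in values]

xlogx = lambda u: u * math.log(u)
power = lambda u: u ** p

# Ent[cU] versus kappa(c) * Ent[U] for the two generators.
lhs_log, rhs_log = ent(xlogx, scaled, probs), c * ent(xlogx, values, probs)
lhs_pow, rhs_pow = ent(power, scaled, probs), c ** p * ent(power, values, probs)
```

In the logarithmic case the extra term $c \log c\, \mathbb{E}[U]$ appears in both $\mathbb{E}[\Phi(cU)]$ and $\Phi(c\,\mathbb{E}U)$ and cancels, which is why the scaling is linear in $c$.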

###### Proposition 3.2.

Suppose that (3.5) holds. Then

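$$\begin{aligned}
\eta_\Phi(\mu, K) &= \sup\left\{ \frac{\operatorname{Ent}_\Phi[K^* f(Y)]}{\operatorname{Ent}_\Phi[f(X)]} : f \in \mathcal{F}^0_*(\mathsf{X}),\ f \neq \mathrm{const} \right\} \\
&= 1 - \inf\left\{ \frac{\mathbb{E}\big[\operatorname{Ent}_\Phi[f(X)|Y]\big]}{\operatorname{Ent}_\Phi[f(X)]} : f \in \mathcal{F}^0_*(\mathsf{X}),\ f \neq \mathrm{const} \right\}.
\end{aligned} \tag{3.6}$$

Moreover, if $\kappa$ is an invertible function, then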

 (3.7)

Again, $(X, Y)$ is a random pair with law $\mu \otimes K$.

###### Proof.

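Eq. (3.6) is obvious from homogeneity. To prove (3.7), pick an arbitrary nonconstant $f \in \mathcal{F}^0_*(\mathsf{X})$ and let

$$c = \kappa^{-1}\!\left( \frac{t}{\operatorname{Ent}_\Phi[f(X)]} \right).$$

Let $g = cf$. Then $\operatorname{Ent}_\Phi[g(X)] = \kappa(c) \operatorname{Ent}_\Phi[f(X)] = t$. Therefore,

$$\operatorname{Ent}_\Phi[K^* g(Y)] \le \eta_\Phi(\mu, K, t) \operatorname{Ent}_\Phi[g(X)].$$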

Since , and since by the properties of , we conclude that , which implies that . The reverse inequality, , is obvious. ∎

###### Proposition 3.3 (Convexity in the kernel).

For a given choice of , , and , the SDPI constants and are convex in .

###### Proof.

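For fixed $\nu, \mu$, the functional $K \mapsto D_\Phi(\nu K \| \mu K)$ is convex because of the joint convexity of $D_\Phi(\cdot \| \cdot)$ [41, Lemma 4.1]. (Joint convexity of $D_\Phi(\cdot \| \cdot)$ follows from the fact that, for any convex function $\Phi$, the perspective function $(u, v) \mapsto v\Phi(u/v)$ is jointly convex in $(u, v)$ [42, Prop. 2.2.1].) Now,

$$\eta_\Phi(\mu, K) = \sup_\nu \frac{D_\Phi(\nu K \| \mu K)}{D_\Phi(\nu \| \mu)} \qquad \text{and} \qquad \eta_\Phi(K) = \sup_\mu \sup_\nu \frac{D_\Phi(\nu K \| \mu K)}{D_\Phi(\nu \| \mu)}$$

are pointwise suprema of convex functionals of $K$, and therefore are convex in $K$. ∎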

### 3.1 A universal upper bound via Markov contraction

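A universal upper bound on $\eta_\Phi(K)$ was originally obtained by Cohen et al. [6] in the discrete case and subsequently extended by Del Moral et al. [10] to the general case. We state this bound and give a proof which is more information-theoretic in nature:

###### Theorem 3.1.

Define the Dobrushin contraction coefficient [11, 12] of a channel $K \in \mathcal{M}(\mathsf{Y}|\mathsf{X})$ by

$$\vartheta(K) \triangleq \max_{x, x' \in \mathsf{X}} \left\| K(\cdot|x) - K(\cdot|x') \right\|_{\rm TV}. \tag{3.8}$$

Then for any $\Phi \in \mathcal{C}$ we have

$$\eta_\Phi(K) \le \vartheta(K). \tag{3.9}$$

Moreover, $\eta_{\rm TV}(K) = \vartheta(K)$.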

###### Proof.

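By the integral representation (2.2), it suffices to show that (3.9) holds for the statistical informations $B_\lambda$, $\lambda \in [0,1]$. For that, we need the following strong Markov contraction lemma [6, Lemma 3.2]: for any signed measure $\tilde{\nu}$ on $\mathsf{X}$ and any Markov kernel $K \in \mathcal{M}(\mathsf{Y}|\mathsf{X})$,

$$\|\tilde{\nu} K\|_{\rm TV} \le \vartheta(K) \|\tilde{\nu}\|_{\rm TV} + \frac{1 - \vartheta(K)}{2} |\tilde{\nu}(\mathsf{X})|. \tag{3.10}$$

Let $\tilde{\nu} = \lambda\nu - \bar{\lambda}\mu$. Then $\tilde{\nu}(\mathsf{X}) = 2\lambda - 1$ and $\tilde{\nu} K = \lambda\nu K - \bar{\lambda}\mu K$. Thus, using (3.10), we get

$$\left\| \lambda\nu K - \bar{\lambda}\mu K \right\|_{\rm TV} \le \vartheta(K) \left\| \lambda\nu - \bar{\lambda}\mu \right\|_{\rm TV} + \frac{1 - \vartheta(K)}{2} |1 - 2\lambda|.$$

Therefore,

$$\begin{aligned}
B_\lambda(\nu K \| \mu K) &= \frac{1}{2} \left\| \lambda\nu K - \bar{\lambda}\mu K \right\|_{\rm TV} - \frac{1}{2}|1 - 2\lambda| \\
&\le \vartheta(K) \cdot \left( \frac{1}{2} \left\| \lambda\nu - \bar{\lambda}\mu \right\|_{\rm TV} - \frac{1}{2}|1 - 2\lambda| \right) \\
&= \vartheta(K) \cdot B_\lambda(\nu \| \mu).
\end{aligned}$$

This establishes the bound (3.9). It remains to show that this bound is achieved for $\eta_{\rm TV}(K)$.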

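To that end, let us first assume that $|\mathsf{X}| > 2$. Let $x_0, x_1 \in \mathsf{X}$ achieve the maximum in (3.8), pick some $\varepsilon_1, \varepsilon_2, \varepsilon \in (0,1)$ such that $\varepsilon_1 \neq \varepsilon_2$, $\varepsilon_1 + \varepsilon < 1$, $\varepsilon_2 + \varepsilon < 1$, and consider the following probability distributions:

• $\mu$ that puts the mass $\varepsilon_1$ on $x_0$, $1 - \varepsilon_1 - \varepsilon$ on $x_1$, and distributes the remaining mass of $\varepsilon$ evenly among the set $\mathsf{X} \setminus \{x_0, x_1\}$;

• $\nu$ that puts the mass $\varepsilon_2$ on $x_0$, $1 - \varepsilon_2 - \varepsilon$ on $x_1$, and distributes the remaining mass of $\varepsilon$ evenly among the set $\mathsf{X} \setminus \{x_0, x_1\}$.

Then a simple calculation gives

$$\begin{aligned}
\|\nu - \mu\|_{\rm TV} &= |\varepsilon_1 - \varepsilon_2|, \\
\|\nu K - \mu K\|_{\rm TV} &= |\varepsilon_1 - \varepsilon_2| \cdot \left\| K(\cdot|x_0) - K(\cdot|x_1) \right\|_{\rm TV} = \vartheta(K) \cdot \|\nu - \mu\|_{\rm TV}.
\end{aligned}$$

For $|\mathsf{X}| = 2$, the idea is the same, except that there is no need for the extra slack $\varepsilon$. ∎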
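The Dobrushin coefficient (3.8) and the total variation contraction it controls can be explored numerically. This is a minimal sketch under the assumption of finite alphabets and the nested-list layout `K[x][y]`; the helper names, the 3-by-3 example kernel, and the random search are illustrative and not part of the proof.

```python
import random

def tv(p, q):
    """Total variation distance between two distributions on a finite set."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def dobrushin(K):
    """Contraction coefficient: max TV distance between two rows of K."""
    return max(tv(row_a, row_b) for row_a in K for row_b in K)

def push_forward(mu, K):
    return [sum(mu[x] * K[x][y] for x in range(len(mu)))
            for y in range(len(K[0]))]

def random_dist(n, rng):
    """A strictly positive random distribution on n points."""
    w = [rng.random() + 1e-3 for _ in range(n)]
    s = sum(w)
    return [v / s for v in w]

# Binary symmetric channel: the two rows differ by |1 - 2 eps| in TV.
eps = 0.1
bsc = [[1 - eps, eps], [eps, 1 - eps]]
theta_bsc = dobrushin(bsc)

# Random search for the TV contraction ratio of a 3-by-3 kernel.
K = [[0.7, 0.2, 0.1],
     [0.1, 0.6, 0.3],
     [0.3, 0.3, 0.4]]
theta = dobrushin(K)
rng = random.Random(0)
worst = 0.0
for _ in range(2000):
    nu, mu = random_dist(3, rng), random_dist(3, rng)
    d_in = tv(nu, mu)
    if d_in > 1e-6:
        worst = max(worst, tv(push_forward(nu, K), push_forward(mu, K)) / d_in)
```

The random search never exceeds `theta`, and pairs of distributions that differ only on the two maximizing input letters, as in the construction above, attain it exactly.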

###### Remark 3.1.

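Theorem 3.1 says that any channel $K$ with $\vartheta(K) < 1$ satisfies an SDPI for any $\Phi$ at any reference input distribution $\mu$. However, the bounds it gives are generally loose. For example, for $K = \mathrm{BSC}(\varepsilon)$ with $\varepsilon \in [0,1]$, we have $\vartheta(K) = |1 - 2\varepsilon|$, so by Theorem 3.1

$$\eta_\Phi(\mu, K) \le |1 - 2\varepsilon|$$

for all $\Phi$ and all $\mu$. However, as we know from [1],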

 η(Bern