Strong Data Processing Inequalities
and Sobolev Inequalities for Discrete Channels
Abstract
The noisiness of a channel can be measured by comparing suitable functionals of the input and output distributions. For instance, the worst-case ratio of output relative entropy to input relative entropy over all possible pairs of input distributions is bounded from above by unity, by the data processing theorem. However, for a fixed reference input distribution, this quantity may be strictly smaller than one, giving so-called strong data processing inequalities (SDPIs). The same considerations apply to an arbitrary divergence. This paper presents a systematic study of optimal constants in SDPIs for discrete channels, including their variational characterizations, upper and lower bounds, structural results for channels on product probability spaces, and the relationship between SDPIs and so-called Sobolev inequalities (another class of inequalities that can be used to quantify the noisiness of a channel by controlling entropy-like functionals of the input distribution by suitable measures of input–output correlation). Several applications to information theory, discrete probability, and statistical physics are discussed.
Contents
 1 Introduction
 2 Background on entropies and divergences

3 Strong data processing inequalities
 3.1 A universal upper bound via Markov contraction
 3.2 Bounds via maximal correlation
 3.3 Upper bounds for operator convex functions
 3.4 Upper bounds via sub-Gaussian concentration and information-transportation inequalities
 3.5 Tensorization
 3.6 Mixtures of local channels
 3.7 Comparison of SDPI constants
 3.8 Extremal functions
 4 Connections with Sobolev inequalities
 5 Some applications
 6 Summary of contributions and concluding remarks
 A Miscellaneous lemmas
 B Proof of Proposition 4.1
1 Introduction
The well-known data processing inequality for the relative entropy states that, for any two probability distributions $P, Q$ over an alphabet $\mathsf{X}$ and for any stochastic transformation (channel) $W$ with input alphabet $\mathsf{X}$ and output alphabet $\mathsf{Y}$,
$$D(PW \,\|\, QW) \le D(P \,\|\, Q),$$
where $PW$ denotes the distribution at the output of $W$ when the input has distribution $P$ (and similarly for $QW$). However, if we fix the reference distribution $Q$ and vary only $P$, then in many cases it is possible to show that $D(PW \,\|\, QW)$ is strictly smaller than $D(P \,\|\, Q)$ unless $P = Q$. To capture this effect, we define the quantity
$$\eta(Q, W) := \sup_{P \colon P \ne Q} \frac{D(PW \,\|\, QW)}{D(P \,\|\, Q)},$$
and we say that the channel $W$ satisfies a strong data processing inequality (SDPI) at input distribution $Q$ if $\eta(Q, W) < 1$. In a remarkable paper [1], Ahlswede and Gács have uncovered deep relationships between $\eta(Q, W)$ and several other quantities, such as the maximal correlation (see [2] and references therein) and so-called hypercontractivity constants of a certain Markov operator associated to the pair $(Q, W)$. For example, they have shown that if $\mathsf{X} = \mathsf{Y} = \{0,1\}$, $W$ is the binary symmetric channel with crossover probability $\delta$, and $Q = \mathrm{Bernoulli}(1/2)$, then $\eta(Q, W) = (1 - 2\delta)^2$, which is also equal to the squared maximal correlation in the joint distribution of $(X, Y)$ with $X \sim \mathrm{Bernoulli}(1/2)$ and $Y = X \oplus Z$ for an independent $Z \sim \mathrm{Bernoulli}(\delta)$, the so-called doubly symmetric binary source (DSBS) with parameter $\delta$ [3].
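As a quick numerical sanity check of the DSBS example (a sketch of our own, not from the paper; the helper name is ours), the maximal correlation of a finite joint distribution can be computed as the second-largest singular value of the normalized joint-probability matrix:

```python
import numpy as np

def maximal_correlation(P):
    """Hirschfeld-Gebelein-Renyi maximal correlation of a joint pmf matrix P[x, y]:
    the second-largest singular value of B(x, y) = P(x, y) / sqrt(Px(x) Py(y))."""
    px, py = P.sum(axis=1), P.sum(axis=0)
    B = P / np.sqrt(np.outer(px, py))
    return np.linalg.svd(B, compute_uv=False)[1]

delta = 0.1
# DSBS(delta): X ~ Bernoulli(1/2), Y = X xor Z with Z ~ Bernoulli(delta)
P = 0.5 * np.array([[1 - delta, delta],
                    [delta, 1 - delta]])
rho_sq = maximal_correlation(P) ** 2   # should match (1 - 2*delta)**2
```

The largest singular value of the normalized matrix is always 1 (attained by constant functions), so the second singular value is the relevant one.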
After the pioneering work of Ahlswede and Gács, the contraction properties of relative entropy (and other divergences [4, 5]) under the action of stochastic transformations have been studied by several other authors [6, 7, 8, 9, 10]. In particular, Cohen et al. [6], who were the first to take up this subject after [1], showed that the SDPI constant of any channel with respect to any divergence is always upper-bounded by the so-called Dobrushin contraction coefficient of the channel [11, 12], another well-known numerical measure of the amount of noise introduced by a channel. (This result of Cohen et al. was rediscovered five years later in the machine learning community [13].) In the last couple of years, strong data processing inequalities have become the subject of intense interest in the information theory community [14, 15, 16, 17, 18, 19, 20, 21, 22] due to their apparent usefulness for establishing various converse results.
In this paper, we revisit the problem of characterizing the strong data processing constant (and its generalizations for an arbitrary divergence) and establish a number of new upper and lower bounds, as well as new structural results on SDPI constants in product probability spaces. We also address the relationship between strong data processing inequalities and so-called Sobolev inequalities [23]. These inequalities also quantify the noisiness of a Markov operator (probability transition kernel) by relating certain “entropy-like” functionals of the input to the rate of increase of suitable “energy-like” quantities from the input to the output. (Logarithmic Sobolev inequalities, widely studied in the theory of probability and Markov chains [24, 25, 8, 26, 27], are a special case.) In particular, we show that the optimal constants in Sobolev inequalities for a reversible Markov chain can be related to SDPI constants of certain factorizations of the transition kernel of the chain as a product of a forward channel and a backward channel. Such factorizations correspond to all possible realizations of the one-step transition of the chain as a two-component Gibbs sampler [28], which is a standard technique in Markov chain Monte Carlo [29, 30]. Conversely, for a fixed input distribution on , the SDPI constants of a given channel with input in and output in are related to Sobolev constants of the reversible Markov chain on obtained by composing the forward channel with the backward channel determined via Bayes’ rule. To keep things simple, we focus on the discrete case, where both and are finite, although some of our results generalize easily to the case of arbitrary Polish alphabets (see, e.g., [21]).
The remainder of the paper is organized as follows. After giving some necessary background on entropies and divergences in Section 2, we proceed to the study of strong data processing inequalities in Section 3. Next, in Section 4, we define the Sobolev inequalities and characterize their relation with SDPIs. Several examples of applications are given in Section 5. Section 6 provides a summary of key contributions. A number of auxiliary technical results are stated and proved in the Appendices.
1.1 Notation
We will denote by the set of all probability distributions on an alphabet and by the subset of consisting of all strictly positive distributions. The set of all real-valued functions on is denoted by ; and are the subsets of consisting of all strictly positive and nonnegative functions, respectively. Any channel (we will also use the terms “stochastic transformation” and “Markov kernel”) with input alphabet , output alphabet , and transition probabilities acts on probability distributions from the right by
or on functions from the left by
The set of all such channels will be denoted by . The affine map naturally extends to a linear map on the signed measures on , since any such measure can be uniquely represented as for some constants and some ; thus, we set . The linear map is positive [i.e., ], and unital [i.e., , where denotes the constant function that takes the value everywhere on its domain]. If denotes the distribution of a random pair with and , then for any and .
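In matrix form (an illustrative sketch with made-up numbers; none of the names below come from the paper), a discrete channel is a row-stochastic matrix, the action on distributions is a vector–matrix product, the action on functions is a matrix–vector product, and the two actions are adjoint to each other:

```python
import numpy as np

W = np.array([[0.8, 0.2],    # row-stochastic: W[x, y] = W(y | x)
              [0.3, 0.7]])
P = np.array([0.6, 0.4])     # input distribution
f = np.array([1.0, 5.0])     # function on the output alphabet

PW = P @ W                   # action on distributions (from the right)
Wf = W @ f                   # action on functions (from the left)
# Duality: E_{PW}[f(Y)] = E_P[(Wf)(X)], i.e. (P W) f = P (W f)
```

The positivity and unitality properties mentioned above are visible here: nonnegative `f` maps to nonnegative `Wf`, and each row of `W` sums to one, so `W` fixes constant functions.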
We will say that a pair is admissible if and . For any such pair, there exists a unique channel with the property that
(1.1) 
for all . This backward or adjoint channel can be specified explicitly via the transition probabilities
$$W^*(x \mid y) = \frac{Q(x)\, W(y \mid x)}{(QW)(y)} \qquad (1.2)$$
(this is simply an application of Bayes’ rule). If , then , so in particular for any and . Strictly speaking, depends on both and , and we may occasionally indicate this fact by writing instead of .
Given a number $a \in [0,1]$, we will often write $\bar{a}$ for $1 - a$. For $a, b \in [0,1]$, we let $a * b := a\bar{b} + \bar{a}b$. Thus, if $U \sim \mathrm{Bernoulli}(a)$ and $V \sim \mathrm{Bernoulli}(b)$ are independent random variables, then $U \oplus V$ has distribution $\mathrm{Bernoulli}(a * b)$. For , we let and . Other notation and definitions will be introduced in the sequel as needed.
2 Background on entropies and divergences
Let denote the set of all convex functions $\Phi : [0, \infty) \to \mathbb{R}$. For any such $\Phi$, the $\Phi$-entropy of a nonnegative real-valued random variable $U$ is defined by
$$\operatorname{Ent}_\Phi[U] := \mathbb{E}[\Phi(U)] - \Phi(\mathbb{E}[U]), \qquad (2.1)$$
provided $\mathbb{E}[\Phi(U)] < \infty$ (see [23] and [31, Chap. 14]). For example, if $\Phi(u) = u^2$, then $\operatorname{Ent}_\Phi[U] = \operatorname{Var}[U]$; if $\Phi(u) = u \log u$, then
$$\operatorname{Ent}_\Phi[U] = \mathbb{E}[U \log U] - \mathbb{E}[U] \log \mathbb{E}[U].$$
The $\Phi$-entropy is nonnegative by Jensen's inequality.
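For a discrete random variable, definition (2.1) is a two-term computation; the following sketch (our own helper, not from the paper) evaluates it for the two example choices of convex function, recovering the variance in the first case:

```python
import numpy as np

def ent_phi(values, pmf, Phi):
    """Ent_Phi[U] = E[Phi(U)] - Phi(E[U]) for a discrete U with the given pmf."""
    values, pmf = np.asarray(values, float), np.asarray(pmf, float)
    return pmf @ Phi(values) - Phi(pmf @ values)

u = [1.0, 2.0, 4.0]
p = [0.5, 0.25, 0.25]
var_like = ent_phi(u, p, lambda x: x ** 2)          # equals Var[U]
ulogu    = ent_phi(u, p, lambda x: x * np.log(x))   # the u log u entropy
# Both are nonnegative by Jensen's inequality.
```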
The $\varphi$-divergences (we use the term “$\varphi$-divergence” instead of the more common “$f$-divergence” because we reserve $f$ for real-valued functions on the underlying alphabet) between probability distributions [4, 5] arise as a special case of the above definition. Fix a strictly positive reference distribution $Q$ (this restriction is sufficient for our purposes, and helps avoid certain technicalities involving division by zero). Then, for any convex $\varphi$, the $\varphi$-divergence between an arbitrary probability distribution $P$ and $Q$ is defined as
$$D_\varphi(P \,\|\, Q) := \sum_{x} Q(x)\, \varphi\!\left(\frac{P(x)}{Q(x)}\right) - \varphi(1).$$
Note that this differs from the usual definition by the subtraction of $\varphi(1)$. There are two reasons behind this modification: (a) $D_\varphi(P \,\|\, P) = 0$ for any $P$ (however, unless $\varphi$ is strictly convex at $1$, $D_\varphi(P \,\|\, Q) = 0$ does not necessarily imply that $P = Q$), and (b) any two $\varphi$ whose difference is affine determine the same divergence. If we now consider a random variable $X$ with distribution $Q$ and let $U = P(X)/Q(X)$, then
Moreover, if , we can write since . Here are some important examples of divergences [5]:

The relative entropy
$$D(P \,\|\, Q) := \sum_x P(x) \log \frac{P(x)}{Q(x)}$$
is a $\varphi$-divergence with $\varphi(u) = u \log u$.

The total variation distance
$$\|P - Q\|_{\mathrm{TV}} := \frac{1}{2} \sum_x |P(x) - Q(x)|$$
is a $\varphi$-divergence with $\varphi(u) = \frac{1}{2}|u - 1|$.

The $\chi^2$ divergence
$$\chi^2(P \,\|\, Q) := \sum_x \frac{(P(x) - Q(x))^2}{Q(x)}$$
is a $\varphi$-divergence with $\varphi(u) = (u-1)^2$ or $\varphi(u) = u^2 - 1$. This is a particular instance of the fact that any two $\varphi$'s that differ by an affine function determine the same divergence.

The squared Hellinger distance
$$H^2(P, Q) := \sum_x \left( \sqrt{P(x)} - \sqrt{Q(x)} \right)^2$$
is a $\varphi$-divergence with $\varphi(u) = (\sqrt{u} - 1)^2$ or $\varphi(u) = 2(1 - \sqrt{u})$.
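All four examples share the template $\sum_x Q(x)\varphi(P(x)/Q(x)) - \varphi(1)$; a minimal sketch (our own helper; the $\tfrac12$-normalized total variation and Hellinger conventions above are assumed):

```python
import numpy as np

def phi_divergence(P, Q, phi):
    """D_phi(P || Q) = sum_x Q(x) phi(P(x)/Q(x)) - phi(1); assumes Q > 0."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    return Q @ phi(P / Q) - phi(1.0)

P = np.array([0.2, 0.5, 0.3])
Q = np.array([0.4, 0.4, 0.2])

kl   = phi_divergence(P, Q, lambda u: u * np.log(u))        # relative entropy
tv   = phi_divergence(P, Q, lambda u: 0.5 * np.abs(u - 1))  # total variation
chi2 = phi_divergence(P, Q, lambda u: (u - 1) ** 2)         # chi-squared
hel2 = phi_divergence(P, Q, lambda u: (np.sqrt(u) - 1)**2)  # squared Hellinger
```

Each of these example $\varphi$'s satisfies $\varphi(1) = 0$, so the subtraction is vacuous here; it matters only for $\varphi$'s such as $u^2$ that do not vanish at $1$.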
An important class of divergences arises in the context of Bayesian estimation. Given a parameter , consider a random pair with
Fix an action space and a loss function — in other words, if and an action is selected, then we incur the loss of . Consider the problem of selecting an action in based on some observation related to via the Markov chain — i.e., and are conditionally independent given . If for some function , then we incur the average loss
The goal is to pick to minimize this expected loss for a given observation channel . In the extreme case when is independent of , the best we can do is to take
giving us the average loss of
On the other hand, if , then we can attain the minimum Bayes risk
where the infimum is over all measurable functions . The following result is well-known (see, e.g., [32, p. 882]), but the proof is so simple that we give it here:
Proposition 2.1.
The quantity
is a divergence.
Proof.
Define the function
Being a pointwise supremum of affine functions of , it is convex. Moreover, . With this, we can write
∎
We consider two particular cases:

, . An easy calculation shows that and
Alternatively, we can write
and
where the total variation norm of a signed measure on is given by
The optimal decision function is
The resulting divergence is known as the Bayes or statistical information [33]
In fact, any divergence can be expressed as an integral of statistical informations [5, Thm. 11]: for any , there exists a unique Borel measure on , such that
(2.2) 
, . Then and
which gives
with the optimum decision function
The corresponding divergence is then given by
where the second expression follows after some algebraic manipulations. Note that the functions for also belong to . The divergences generated by these functions (modulo multiplicative constants) have appeared throughout the statistical literature [34, 35]. In particular, Le Cam [34] considers the case with the above Bayesian hypothesis testing interpretation, while Györfi and Vajda [35] look at arbitrary (including the endpoints and ). For our purposes, it will be convenient to work with the function , which gives the Le Cam divergence with parameter :
(2.3)
The Le Cam divergences and are also well-defined and are identically zero.
More examples of divergences, as well as a wide variety of inequalities between them, can be found in [36].
From now on, when dealing with quantities indexed by , we will often substitute with some mnemonic notation related to the corresponding divergence, e.g., , , etc. Moreover, for the case of the relative entropy we will often omit the index altogether and write , , etc.
2.1 Subadditivity of entropies
Let and be jointly distributed random variables, where takes nonnegative real values and is arbitrary. Given a function , define the conditional entropy of given :
(2.4) 
This is a random variable, since it depends on . Combining (2.4) with (2.1) gives the following generalization of the law of total variance:
(2.5) 
(see [23, pp. 351–352]).
Remark 2.1.
We may think of
as a kind of “Fisher information” about contained in (we are grateful to P. Tetali for suggesting this interpretation). Indeed, let us consider the following special case: let be an exchangeable pair on some space (i.e., for all ), and let for some . Let be the stochastic transformation . Then has the same distribution as , and
By convexity of ,
If we write , where is the identity operator on , then
Moreover, if we have a continuoustime reversible Markov chain on with stationary distribution and with infinitesimal generator , then is an exchangeable pair for each , and
Dividing both sides by and taking the limit as , we get
which coincides with the Fisher information functional of Chafaï [23, Eq. (1.14)].
We say that the entropy is subadditive if the inequality
(2.6) 
holds for any tuple of independent random variables taking values in some spaces and for any function , such that . Here, denotes the tuple obtained by deleting from . We are interested in the following question: what conditions on ensure that this subadditivity property holds?
For example, if , then , and in this case the subadditivity property (2.6) is the well-known Efron–Stein–Steele inequality [37, 38]
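The Efron–Stein–Steele inequality can be verified exactly on a small product space; the sketch below (the test function and the uniform Boolean cube are our own illustrative choices) enumerates the cube and compares the variance with the sum of expected conditional variances:

```python
import itertools
import numpy as np

n = 3
cube = list(itertools.product([0.0, 1.0], repeat=n))
f = lambda x: x[0] * x[1] + x[2]           # an arbitrary test function

vals = np.array([f(x) for x in cube])
lhs = vals.var()                           # Var f(X_1, ..., X_n), X_i uniform iid

# rhs = sum_i E[ Var( f | X_{\i} ) ], conditioning on all coordinates but i
rhs = 0.0
for i in range(n):
    for rest in itertools.product([0.0, 1.0], repeat=n - 1):
        pair = [f(rest[:i] + (b,) + rest[i:]) for b in (0.0, 1.0)]
        rhs += np.var(pair) / 2 ** (n - 1)

# Efron-Stein: lhs <= rhs
```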
It is also not hard to show that the “ordinary” entropy [i.e., the entropy with ] is subadditive. In general, an induction argument can be used to show that subadditivity is equivalent to the following convexity property [39]: for any two probability spaces and and any function ,
(2.7) 
where . The following criterion for subadditivity is useful [39, 40]:
Proposition 2.2.
Let be the class of all convex functions that are twice differentiable on , and such that either is affine or and is concave. Then the entropy is subadditive for all . Conversely, if is twice differentiable with and the entropy is subadditive, then is concave.
3 Strong data processing inequalities
We now turn to the main subject of the paper: strong data processing inequalities.
Definition 3.1.
Given an admissible pair and a function , we say that satisfies a type strong data processing inequality (SDPI) at with constant , or for short, if
(3.1) 
for all . We say that satisfies if it satisfies for all .
We are interested in the tightest constants in SDPIs; with that in mind, we define
For future reference, we record the following straightforward results:
Proposition 3.1 (Functional form of SDPI).
Fix an admissible pair and let be a random pair with probability law . Then if and only if the inequality
(3.2) 
holds for all nonconstant with . Consequently,
(3.3)  
(3.4) 
Proof.
Fix a probability distribution and let . Then , , and
by Lemma A.1 in the Appendix. Therefore,
Conversely, for any nonconstant with there exists a probability distribution such that and . In that case, the above formulas for the entropies hold as well.
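The supremum defining the SDPI constant can be probed numerically by random search over input distributions. This is only a crude lower-bound sketch (our own code, for the relative-entropy case; the convergence to the true constant is not claimed), but for a binary symmetric channel with crossover probability 0.1 and uniform input it comes close to the Ahlswede–Gács value $(1 - 2\delta)^2$:

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(P, Q):
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

def eta_lower_bound(Q, W, trials=5000):
    """Random-search lower bound on sup_P D(PW || QW) / D(P || Q)."""
    QW, best = Q @ W, 0.0
    for _ in range(trials):
        P = rng.dirichlet(np.ones_like(Q))   # random point on the simplex
        d = kl(P, Q)
        if d > 1e-9:                         # avoid the 0/0 limit at P = Q
            best = max(best, kl(P @ W, QW) / d)
    return best

delta = 0.1
W = np.array([[1 - delta, delta], [delta, 1 - delta]])
Q = np.array([0.5, 0.5])
lb = eta_lower_bound(Q, W)   # approaches (1 - 2*delta)**2 from below
```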
Definition 3.2.
We say that the entropy is homogeneous if there exists some function , such that the equality
(3.5) 
holds for any nonnegative random variable such that and for any positive real number .
Proposition 3.2.
Suppose that (3.5) holds. Then
(3.6) 
Moreover, if is an invertible function, then
(3.7) 
Again, is a random pair with law .
Proof.
Proposition 3.3 (Convexity in the kernel).
For a given choice of , , and , the SDPI constants and are convex in .
Proof.
For fixed , the functional is convex because of the joint convexity of [41, Lemma 4.1]. (Joint convexity follows from the fact that, for any convex function, the associated perspective function is jointly convex [42, Prop. 2.2.1].) Now,
are pointwise suprema of convex functionals of , and therefore are convex in . ∎
3.1 A universal upper bound via Markov contraction
A universal upper bound on was originally obtained by Cohen et al. [6] in the discrete case and subsequently extended by Del Moral et al. [10] to the general case. We state this bound and give a proof which is more information-theoretic in nature:
Theorem 3.1.
Define the Dobrushin contraction coefficient of the channel $W$ by
$$\theta(W) := \max_{x, x'} \bigl\| W(\cdot \mid x) - W(\cdot \mid x') \bigr\|_{\mathrm{TV}}. \qquad (3.8)$$
Then, for every admissible pair $(Q, W)$ and every $\varphi$,
$$\eta_\varphi(Q, W) \le \theta(W). \qquad (3.9)$$
Moreover, the bound (3.9) is achieved in the case of the total variation distance: $\eta_{\mathrm{TV}}(Q, W) = \theta(W)$.
Proof.
By the integral representation (2.2), it suffices to show that (3.9) holds for the statistical informations , . For that, we need the following strong Markov contraction lemma [6, Lemma 3.2]: for any signed measure on and any Markov kernel ,
(3.10) 
Let . Then and . Thus, using (3.10), we get
Therefore,
This establishes the bound (3.9). It remains to show that this bound is achieved for .
To that end, let us first assume that . Let achieve the maximum in (3.8), pick some such that , , , and consider the following probability distributions:

that puts the mass on , on , and distributes the remaining mass of evenly among the set ;

that puts the mass on , on , and distributes the remaining mass of evenly among the set .
Then a simple calculation gives
For , the idea is the same, except that there is no need for the extra slack . ∎
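The Dobrushin coefficient appearing in the bound above is directly computable from the transition matrix; a small sketch (our own helper, assuming the $\tfrac12$-normalized total variation norm):

```python
import numpy as np

def dobrushin(W):
    """theta(W) = max over input pairs (x, x') of ||W(.|x) - W(.|x')||_TV."""
    n = W.shape[0]
    return max(0.5 * np.abs(W[i] - W[j]).sum()
               for i in range(n) for j in range(n))

delta = 0.1
BSC = np.array([[1 - delta, delta], [delta, 1 - delta]])
theta = dobrushin(BSC)       # equals 1 - 2*delta for a binary symmetric channel

K = np.array([[0.5, 0.5], [0.5, 0.5]])   # completely noisy channel
# dobrushin(K) == 0: the output distribution ignores the input
```

A coefficient of 1 (e.g., for the identity channel) makes the SDPI bound vacuous, while a coefficient of 0 means the channel destroys all information about its input.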