# Entropy Jumps for Radially Symmetric Random Vectors

Department of Electrical Engineering and Computer Sciences
University of California, Berkeley
###### Abstract

We establish a quantitative bound on the entropy jump associated to the sum of independent, identically distributed (IID) radially symmetric random vectors having dimension greater than one. Following the usual approach, we first consider the analogous problem of Fisher information dissipation, and then integrate along the Ornstein-Uhlenbeck semigroup to obtain an entropic inequality. In a departure from previous work, we appeal to a result by Desvillettes and Villani on entropy production associated to the Landau equation. This obviates strong regularity assumptions, such as presence of a spectral gap and log-concavity of densities, but comes at the expense of radial symmetry. As an application, we give a quantitative estimate of the deficit in the Gaussian logarithmic Sobolev inequality for radially symmetric functions.

## 1 Introduction

Let $X$ be a random vector on $\mathbb{R}^d$ with density $f$. The entropy associated to $X$ is defined by

$$h(X) = -\int_{\mathbb{R}^d} f \log f, \tag{1}$$

provided the integral exists. The non-Gaussianness of $X$, denoted by $D(X)$, is given by

$$D(X) = h(G_X) - h(X), \tag{2}$$

where $G_X$ denotes a Gaussian random vector with the same covariance as $X$. Evidently, $D(X)$ is the relative entropy of $X$ with respect to $G_X$, and is therefore nonnegative. Moreover, $D(X) = 0$ if and only if $X$ is Gaussian.
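As a concrete illustration of definition (2), the following minimal sketch evaluates $D(X)$ in closed form for a simple radially symmetric example. The choice of example (the uniform distribution on the unit ball of $\mathbb{R}^d$) and the function name are assumptions made here for illustration and do not appear in the paper. For this example $h(X)$ is the logarithm of the ball's volume and $\mathbb{E}|X|^2 = d/(d+2)$, so $G_X$ has per-coordinate variance $1/(d+2)$.

```python
import math

def non_gaussianness_uniform_ball(d: int) -> float:
    """D(X) = h(G_X) - h(X) in nats, for X uniform on the unit ball of R^d.

    Illustrative helper (not from the paper).
    """
    # h(X) = log(volume of unit ball) = (d/2) log(pi) - log Gamma(d/2 + 1)
    h_x = 0.5 * d * math.log(math.pi) - math.lgamma(0.5 * d + 1.0)
    # E|X|^2 = d/(d+2), so each coordinate of X has variance 1/(d+2)
    var = 1.0 / (d + 2)
    # Entropy of an isotropic Gaussian with the same covariance as X
    h_g = 0.5 * d * math.log(2.0 * math.pi * math.e * var)
    return h_g - h_x

for d in (1, 2, 3, 10):
    print(d, non_gaussianness_uniform_ball(d))
```

For $d = 2$ the value reduces to $1 - \log 2$, and the output is strictly positive in every dimension, as expected for a non-Gaussian density.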

Our main result may be informally stated as follows: Let $X, X_*$ be IID radially symmetric random vectors on $\mathbb{R}^d$, $d \ge 2$, with sufficiently regular density $f$. For any $\varepsilon > 0$,

$$h\big(\tfrac{1}{\sqrt{2}}(X+X_*)\big) - h(X) \ge C_\varepsilon(X)\, D(X)^{1+\varepsilon}, \tag{3}$$

where $C_\varepsilon(X)$ is an explicit function depending on $\varepsilon$, the regularity of $f$, and a finite number of moments of $X$. In particular, if for some , then . A precise statement can be found in Section 3, along with an analogous result for Fisher information and a related estimate that imposes no regularity conditions. In interpreting the result, it is important to note that, although a radially symmetric density has a one-dimensional parameterization, the convolution $f * f$ is inherently a $d$-dimensional operation unless $f$ is Gaussian. Thus, it does not appear that (3) can be easily reduced to a one-dimensional problem.

The quantity $h\big(\tfrac{1}{\sqrt{2}}(X+X_*)\big) - h(X)$ characterizes the entropy production (or, entropy jump) associated to $X$ under rescaled convolution. Similarly, letting $J(X)$ denote the Fisher information of $X$, the quantity $J(X) - J\big(\tfrac{1}{\sqrt{2}}(X+X_*)\big)$ characterizes the dissipation of Fisher information. By the convolution inequalities of Shannon [1], Blachman [2] and Stam [3] for entropy power and Fisher information, it follows that both the production of entropy and the dissipation of Fisher information under rescaled convolution are nonnegative. Moreover, both quantities are identically zero if and only if $X$ is Gaussian.

The fundamental problem of bounding entropy production (and dissipation of Fisher information) has received considerable attention, yet quantitative bounds are few. In particular, the entropy power inequality establishes that entropy production is strictly greater than zero unless $X$ is Gaussian, but gives no indication of how entropy production behaves as, say, a function of $D(X)$ when $X$ is non-Gaussian. As a consequence, basic stability properties of the entropy power inequality remain elusive, despite it being a central inequality in information theory. For instance, a satisfactory answer to the following question is still out of reach: If the entropy power inequality is nearly saturated, are the random summands involved quantifiably close to Gaussian?

Perhaps the first major result to address the question of entropy production in this setting is due to Carlen and Soffer [4], who showed for each random vector $X$ with and , there exists a nonnegative function $\Theta_{\psi,\chi}$ on $[0,\infty)$, strictly increasing from 0, and depending only on the two auxiliary functions

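$$\psi(R) = \mathbb{E}\big[|X|^2 \mathbf{1}_{\{|X| \ge R\}}\big], \tag{4}$$

and

$$\chi(t) = h(X_t) - h(X), \qquad X_t := e^{-t}X + (1-e^{-2t})^{1/2} G_X, \tag{5}$$

such that

$$h\big(\tfrac{1}{\sqrt{2}}(X+X_*)\big) - h(X) \ge \Theta_{\psi,\chi}(D(X)). \tag{6}$$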

Moreover, the function $\Theta_{\psi,\chi}$ will do for any density with and . Hence, this provides a nonlinear estimate of entropy production in terms of $D(X)$ that holds uniformly over all probability densities that exhibit the same decay and smoothness properties (appropriately defined). Unfortunately, the proof establishing existence of $\Theta_{\psi,\chi}$ relies on a compactness argument, and therefore falls short of giving satisfactory quantitative bounds.

A random vector $X$ with density $f$ has spectral gap $c > 0$ (equivalently, finite Poincaré constant $1/c$) if, for all smooth functions $g$ with $\int_{\mathbb{R}^d} f g = 0$,

$$c\int_{\mathbb{R}^d} f g^2 \le \int_{\mathbb{R}^d} f |\nabla g|^2. \tag{7}$$

Generally speaking, a non-zero spectral gap is a very strong regularity condition on the density $f$ (e.g., it implies $X$ has finite moments of all orders).

In dimension $d = 1$, if $X$ has spectral gap $c$, then [5, 6] established the linear bound

$$h\big(\tfrac{1}{\sqrt{2}}(X+X_*)\big) - h(X) \ge \frac{c}{2+2c}\, D(X). \tag{8}$$

Technically speaking, Barron and Johnson establish a slightly different inequality. However, a modification of their argument gives the same result as Ball, Barthe and Naor. See the discussion surrounding [7, Theorem 2.4].

In dimension $d \ge 2$, Ball and Nguyen [8] recently established an essentially identical result under the additional assumption that $X$ is isotropic (i.e., $\operatorname{Cov}(X) = I$) with log-concave density $f$. Along these lines, Toscani has a strengthened EPI for log-concave densities [9], but the deficit is qualitative in nature in contrast to the quantitative estimate obtained by Ball and Nguyen.

Clearly, entropy production and Fisher information dissipation are closely related to convergence rates in the entropic and Fisher information central limit theorems. Generally speaking though, bounds in the spirit of (8) are unnecessarily strong for establishing entropic central limit theorems of the form $D(S_n) \to 0$, where $S_n$ denotes the normalized sum of $n$ IID copies of $X$. Indeed, it was long conjectured that $D(S_n) = O(1/n)$ under moment conditions. This was positively resolved by Bobkov, Chistyakov and Götze [10, 11] using Edgeworth expansions and local limit theorems. By Pinsker’s inequality, we know that $D(S_n)$ dominates squared total-variation distance, so $D(S_n) = O(1/n)$ is interpreted as a version of the Berry-Esseen theorem for the entropic CLT with the optimal convergence rate. However, while the results of Bobkov et al. give good long-range estimates of the form $D(S_n) = O(1/n)$ (with explicit constants depending on the cumulants of $X$), the smaller-order terms propagate from local limit theorems for Edgeworth expansions and are non-explicit. Thus, explicit bounds for the initial entropy jump cannot be readily obtained.

Along these lines, we remark that Ledoux, Nourdin and Peccati [12] recently established the weaker convergence rate $D(S_n) = O\big(\tfrac{\log n}{n}\big)$ via the explicit bound

$$D(S_n) \le \frac{S^2(X|G)}{2n} \log\left(1 + n\,\frac{I(X|G)}{S^2(X|G)}\right), \tag{9}$$

where $I(X|G)$ is the relative Fisher information of $X$ with respect to the standard normal $G$, and $S(X|G)$ denotes the Stein discrepancy of $X$ with respect to $G$, which is defined when a Stein kernel exists (see [12] for definitions). In principle, this has potential to give an explicit bound on the entropy production in terms of $D(X)$ by considering $n = 2$. Unfortunately, a Stein kernel may not always exist; even if it does, further relationships between $S(X|G)$ and $I(X|G)$ or $D(X)$ would need to be developed to ultimately yield a bound like (3).

Another related line of work in statistical physics considers quantitative bounds on entropy production in the Boltzmann equation (see the review [13] for an overview). The two problems are not the same, but there is a strong analogy between entropy production in the Boltzmann equation and entropy jumps associated to rescaled convolution, as can be seen by comparing [14] to [4]. The details of this rich subject are tangential to the present discussion, but we remark that a major milestone in this area was achieved when the entropy production in the Boltzmann equation was bounded from below by an explicit function of $D(X)$ (and various norms of $f$), where $X$ models the velocity of a particle in a rarefied gas [15, 16]. A key ingredient used to prove this bound was an earlier result by Desvillettes and Villani that controls relative Fisher information via entropy production in the Landau equation:

###### Lemma 1.

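[17] Let $X$ be a random vector on $\mathbb{R}^d$, satisfying and having density $f$. Then,

$$\frac{1}{2}\iint |x-x_*|^2 f(x) f(x_*) \left| \Pi(x-x_*)\left[\frac{\nabla f}{f}(x) - \frac{\nabla f}{f}(x_*)\right]\right|^2 dx\, dx_* \ge \lambda\,(d-1)\, I(X|G), \tag{10}$$

where $\lambda$ is the minimum eigenvalue of the covariance matrix associated to $X$, and $\Pi(x-x_*)$ is the orthogonal projection onto the subspace orthogonal to $x-x_*$.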

Our proof of (3) follows a program similar to [15, 16], and is conceptually straightforward after the correct ingredients are assembled. In particular, we begin by recognizing that the LHS of (10) resembles dissipation of Fisher information when written in the context of projections (cf. [6, Lemma 3.1]). Using the radial symmetry assumption, we are able to bound the Fisher information dissipation from below by error terms plus entropy production in the Landau equation, which is subsequently bounded by relative Fisher information using Lemma 1. Care must be exercised in order to control error terms (this is where our regularity assumptions enter), but the final result (3) closely parallels that proved in [15] for the Boltzmann equation. We remark that the assumption of a non-vanishing Boltzmann collision kernel in [15] has a symmetrizing effect on the particle density functions involved; the rough analog in the present paper is the radial symmetry assumption.

### Organization

The rest of this paper is organized as follows. Section 2 briefly introduces notation and definitions that are used throughout. Main results are stated and proved in Section 3, followed by a brief discussion on potential extensions to non-symmetric distributions. Section 4 gives an application of the results to bounding the deficit in the Gaussian logarithmic Sobolev inequality.

## 2 Notation and Definitions

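For a vector $v \in \mathbb{R}^d$, we let $|v|$ denote its Euclidean norm. For a random variable $X$ on $\mathbb{R}$ and $p \ge 1$, we write $\|X\|_p = (\mathbb{E}|X|^p)^{1/p}$ for the usual $L^p$-norm of $X$. It will be convenient to use the same notation for $0 < p < 1$, with the understanding that $\|\cdot\|_p$ is not a norm in this case.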

Throughout, denotes a standard Gaussian random vector on ; the dimension will be clear from context. For a random vector on , we let be a normalized Gaussian vector, so that . For , we denote the coordinates of a random vector on as . Thus, for example, is a zero-mean Gaussian random variable with variance .

For a random vector $X$ with smooth density $f$ (all densities are with respect to Lebesgue measure), we define the Fisher information

$$J(X) = 4\int \big|\nabla \sqrt{f}\big|^2 = \int_{f>0} \frac{|\nabla f|^2}{f} \tag{11}$$

and the entropy

$$h(X) = -\int f \log f, \tag{12}$$

where ‘$\log$’ denotes the natural logarithm throughout. For random vectors $X, Q$ with respective densities $f, g$, the relative Fisher information is defined by

$$I(X|Q) = 4\int g\, \big|\nabla \sqrt{f/g}\big|^2 \tag{13}$$

and the relative entropy is defined by

$$D(X|Q) = \int f \log \frac{f}{g}. \tag{14}$$

Evidently, both quantities are nonnegative and

$$I(X) := I(X|G_X) = J(X) - J(G_X), \qquad D(X) := D(X|G_X) = h(G_X) - h(X). \tag{15}$$

Finally, we recall two basic inequalities that will be taken for granted several times without explicit reference: for real-valued $a, b$ we have $(a+b)^2 \le 2a^2 + 2b^2$, and for random variables $X, Y$, we have Minkowski’s inequality: $\|X+Y\|_p \le \|X\|_p + \|Y\|_p$ when $p \ge 1$.

###### Definition 1.

A random vector $X$ with density $f$ is radially symmetric if $f(x) = \phi(|x|)$ for some function $\phi : [0,\infty) \to [0,\infty)$.

We primarily concern ourselves with random vectors that satisfy certain mild regularity conditions. In particular, it is sufficient to control $|\nabla \log f(x)|$ pointwise in terms of $|x|$.

###### Definition 2.

A random vector $X$ on $\mathbb{R}^d$ with smooth density $f$ is $c$-regular if, for all $x \in \mathbb{R}^d$,

$$|\nabla \log f(x)| \le c\,(|x| + \mathbb{E}|X|). \tag{16}$$

We remark that the smoothness requirement of $f$ in the definition of $c$-regularity is stronger than generally required for our purposes. However, it allows us to avoid further qualifications; for instance, the identities (11) hold for any $c$-regular function. Moreover, since for smooth , we have for any $c$-regular with .

Evidently, $c$-regularity quantifies the smoothness of a density function. The following important example shows that any density can be mollified to make it $c$-regular.

###### Proposition 1.

[18] Let and be independent, where . Then is -regular with .

Observe that, in the notation of the above proposition, if is radially symmetric then so is . Therefore, Proposition 1 provides a convenient means to construct radially symmetric random vectors that are -regular. Indeed, we have the following useful corollaries (proofs are found in the appendix).

###### Proposition 2.

Let $X$ be a random vector on $\mathbb{R}^d$, and let $X_t = e^{-t}X + (1-e^{-2t})^{1/2}G$ for $t \ge 0$. If $X$ is $c$-regular, then $X_t$ is $(5ce^{2t})$-regular.

###### Proposition 3.

Let be a non-negative random variable with and distribution function . For any and , there exists a -regular radially symmetric random vector on with and satisfying

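$$F_{R_0}\!\left(\frac{r - \sqrt{(t+1)\varepsilon}}{\sqrt{1-\varepsilon}}\right) - e^{-dt^2/8} \le F_R(r) \le F_{R_0}\!\left(\frac{r + \sqrt{(t+1)\varepsilon}}{\sqrt{1-\varepsilon}}\right) + e^{-dt^2/8}, \tag{17}$$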

where is the distribution function of .

## 3 Main Results

In this section, we establish quantitative estimates on entropy production and Fisher information dissipation under rescaled convolution. As can be expected, we begin with an inequality for Fisher information, and then obtain a corresponding entropy jump inequality by integrating along the Ornstein-Uhlenbeck semigroup.

### 3.1 Dissipation of Fisher Information under Rescaled Convolution

###### Theorem 1.

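Let $X, X_*$ be IID radially symmetric random vectors on $\mathbb{R}^d$, $d \ge 2$, with $c$-regular density $f$. For any $\varepsilon > 0$,

$$J(X) - J\big(\tfrac{1}{\sqrt{2}}(X+X_*)\big) \ge K_\varepsilon(X)\, I(X)^{1+\varepsilon}, \tag{18}$$

where

$$K_\varepsilon(X) = \frac{(\varepsilon/8)^{\varepsilon}}{(8(1+\varepsilon))^{1+\varepsilon}} \cdot \frac{\big\||X|^2\big\|_1^{1+\varepsilon}}{c^{2\varepsilon}\, \big\||X|^2\big\|_{2+1/\varepsilon}^{1+2\varepsilon}}. \tag{19}$$

###### Remark 1.

We have made no attempt to optimize the constant $K_\varepsilon(X)$.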

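A few comments are in order. First, we note that inequality (18) is invariant to scaling $X \mapsto tX$ for $t > 0$. Indeed, if $X$ is $c$-regular, then a change of variables shows that $tX$ is $(c/t^2)$-regular. So, using homogeneity of the norms, we find that

$$K_\varepsilon(tX) = t^{2\varepsilon} K_\varepsilon(X). \tag{20}$$

Combined with the property that $I(tX|G_{tX}) = t^{-2} I(X|G_X)$, we have

$$K_\varepsilon(tX)\, I(tX|G_{tX})^{1+\varepsilon} = t^{-2} K_\varepsilon(X)\, I(X|G_X)^{1+\varepsilon}, \tag{21}$$

which has the same scaling behavior as the LHS of (18). That is,

$$J(tX) - J\big(\tfrac{1}{\sqrt{2}}(tX+tX_*)\big) = t^{-2}\Big(J(X) - J\big(\tfrac{1}{\sqrt{2}}(X+X_*)\big)\Big). \tag{22}$$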
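The scaling relation (20) can be checked numerically. The sketch below evaluates the constant in (19) from its inputs and applies the substitutions described above: the regularity parameter becomes $c/t^2$ and both norms of $|tX|^2$ pick up a factor $t^2$. The function names and the numerical inputs are hypothetical placeholders chosen for illustration; they do not correspond to any particular distribution.

```python
import math

def k_eps(eps, c, m1, mp):
    """Constant K_eps(X) from (19).

    m1 stands for || |X|^2 ||_1 and mp for || |X|^2 ||_{2 + 1/eps}.
    Both are supplied by the caller (illustrative inputs).
    """
    prefactor = (eps / 8.0) ** eps / (8.0 * (1.0 + eps)) ** (1.0 + eps)
    return prefactor * m1 ** (1.0 + eps) / (c ** (2.0 * eps) * mp ** (1.0 + 2.0 * eps))

def k_eps_scaled(eps, c, m1, mp, t):
    """K_eps(tX): tX is (c / t^2)-regular and each norm of |tX|^2 scales by t^2."""
    return k_eps(eps, c / t**2, t**2 * m1, t**2 * mp)

# Hypothetical inputs, used only to exercise the scaling identity.
eps, c, m1, mp, t = 0.5, 2.0, 3.0, 5.0, 1.7
print(k_eps_scaled(eps, c, m1, mp, t) / k_eps(eps, c, m1, mp), t ** (2 * eps))
```

The two printed numbers agree up to floating-point error, matching the factor $t^{2\varepsilon}$ in (20).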

Second, inequality (18) does not contain any terms that explicitly depend on dimension. However, it is impossible to say that inequality (18) is dimension-free in the usual sense that both sides scale linearly in dimension when considering product distributions. Indeed, the product of two identical radially symmetric densities is again radially symmetric if and only if the original densities were Gaussian themselves, which corresponds to the degenerate case when the dissipation of Fisher information is identically zero. However, inequality (18) does exhibit dimension-free behavior in the following sense: Suppose for simplicity that $X$ is normalized so that $\mathbb{E}|X|^2 = d$. Since $X$ is radially symmetric, it can be expressed as the product of independent random variables $X = \sqrt{d}\,RU$, where $U$ is uniform on the $(d-1)$-dimensional sphere $\mathbb{S}^{d-1}$ and $R$ is a nonnegative real-valued random variable satisfying $\mathbb{E}R^2 = 1$. Now, by the log Sobolev inequality and Talagrand’s inequality, we have

$$I(X|G) \ge 2D(X|G) \ge W_2^2\big(\sqrt{d}\,RU,\, G\big) = d\, W_2^2\big(R, \tfrac{1}{\sqrt{d}}|G|\big). \tag{23}$$

The equality follows since, for any vectors $x, y \in \mathbb{R}^d$, we have $|x-y| \ge \big||x|-|y|\big|$. However, this can be achieved with equality by the coupling $G = |G|\,U$. Thus, we have

\begin{align}
K_\varepsilon(X)\, I(X|G)^{1+\varepsilon} &\ge \frac{(\varepsilon/8)^{\varepsilon}\, d^{1+2\varepsilon}\, W_2^{2\varepsilon}\big(R, \tfrac{1}{\sqrt{d}}|G|\big)}{c^{2\varepsilon}(8(1+\varepsilon))^{1+\varepsilon}\, \big\||X|^2\big\|_{2+1/\varepsilon}^{1+2\varepsilon}}\, I(X|G) \tag{24}\\
&= \frac{(\varepsilon/8)^{\varepsilon}\, W_2^{2\varepsilon}\big(R, \tfrac{1}{\sqrt{d}}|G|\big)}{c^{2\varepsilon}(8(1+\varepsilon))^{1+\varepsilon}\, \big\|R^2\big\|_{2+1/\varepsilon}^{1+2\varepsilon}}\, I(X|G). \tag{25}
\end{align}

Now, we note that , so we have a bound of the form

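$$J(X) - J\big(\tfrac{1}{\sqrt{2}}(X+X_*)\big) \ge \tilde{K}_\varepsilon(X)\, I(X|G_X), \tag{26}$$

where the function $\tilde{K}_\varepsilon(X)$ is effectively dimension-free in that it only depends on the (one-dimensional) quadratic Wasserstein distance between $R$ and $\tfrac{1}{\sqrt{d}}|G|$. For $d \to \infty$, the law of large numbers implies that $\tfrac{1}{\sqrt{d}}|G| \to 1$ a.s. Therefore, $W_2\big(R, \tfrac{1}{\sqrt{d}}|G|\big)$ behaves similarly to $\|R-1\|_2$ in high dimensions. Indeed, by the triangle inequality applied to $W_2$,

$$\Big| W_2\big(R, \tfrac{1}{\sqrt{d}}|G|\big) - \|R-1\|_2 \Big| \le \Big\| \tfrac{1}{\sqrt{d}}|G| - 1 \Big\|_2 = O\big(\tfrac{1}{\sqrt{d}}\big). \tag{27}$$

So, we see that (18) depends very weakly on $d$ when the marginal distribution of $R$ is preserved and dimension varies.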
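The $O(1/\sqrt{d})$ rate on the right-hand side of (27) can be made concrete with a short computation. The sketch below uses two standard facts about the chi distribution, which are assumptions of this sketch rather than statements from the paper: $\mathbb{E}|G|^2 = d$ and $\mathbb{E}|G| = \sqrt{2}\,\Gamma(\tfrac{d+1}{2})/\Gamma(\tfrac{d}{2})$. The function name is chosen here for illustration.

```python
import math

def l2_distance_to_one(d: int) -> float:
    """|| |G|/sqrt(d) - 1 ||_2 for a standard Gaussian vector G on R^d."""
    # Mean of the chi distribution with d degrees of freedom
    mean_norm = math.sqrt(2.0) * math.exp(
        math.lgamma(0.5 * (d + 1)) - math.lgamma(0.5 * d)
    )
    # E(|G|/sqrt(d) - 1)^2 = E|G|^2/d - 2 E|G|/sqrt(d) + 1 = 2 - 2 E|G|/sqrt(d)
    second_moment = 2.0 - 2.0 * mean_norm / math.sqrt(d)
    return math.sqrt(max(second_moment, 0.0))

for d in (1, 2, 10, 100, 1000):
    # The rescaled value sqrt(d) * distance stays bounded as d grows
    print(d, l2_distance_to_one(d), math.sqrt(d) * l2_distance_to_one(d))
```

The rescaled quantity in the last column stays between 0.5 and 1 over this range, which is consistent with the stated rate.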

One important question remains: As dimension $d \to \infty$, do there exist random vectors $X$ on $\mathbb{R}^d$ with sufficient regularity for which the associated random variable $R$ is not necessarily concentrated around 1? The answer to this is affirmative in the sense of Proposition 3: we may approximate any distribution function $F_{R_0}$ to within arbitrary accuracy, at the (potential) expense of increasing the regularity parameter $c$.

###### Proof of Theorem 1.

As remarked above, inequality (18) is invariant to scaling. Hence, there is no loss of generality in assuming that $X$ is normalized according to $\mathbb{E}|X|^2 = d$. Also, since $X_*$ is radially symmetric, $X + X_*$ is equal to $X - X_*$ in distribution, therefore we seek to lower bound the quantity

$$J(X) - J\big(\tfrac{1}{\sqrt{2}}(X+X_*)\big) = J(X) - 2J(X-X_*). \tag{28}$$

Toward this end, define $W = X - X_*$, and denote its density by $f_W$. By the projection property of the score function of sums of independent random variables, the following identity holds (e.g., [7, Lemma 3.4]):

$$2\big(J(X) - 2J(X-X_*)\big) = \mathbb{E}\big|2\rho_W(W) - (\rho(X) - \rho(X_*))\big|^2, \tag{29}$$

where $\rho = \nabla \log f$ is the score function of $X$ and $\rho_W = \nabla \log f_W$ is the score function of $W$.

For $v \in \mathbb{R}^d$, let $\Pi(v)$ denote the orthogonal projection onto the subspace orthogonal to $v$. Now, we have

\begin{align}
2J(X) - 4J(X-X_*) &= \mathbb{E}\big|2\rho_W(W) - (\rho(X) - \rho(X_*))\big|^2 \tag{30}\\
&\ge \mathbb{E}\big|2\Pi(W)\rho_W(W) - \Pi(X-X_*)(\rho(X) - \rho(X_*))\big|^2 \tag{31}\\
&= \mathbb{E}\big|\Pi(X-X_*)(\rho(X) - \rho(X_*))\big|^2. \tag{32}
\end{align}

The inequality follows since $W = X - X_*$ by definition, and since $\Pi(W)$ is an orthogonal projection. The last equality follows since $\Pi(W)\rho_W(W) = 0$ due to the fact that $\Pi(w)\nabla f_W(w)$ is the tangential gradient of $f_W$, which is identically zero due to radial symmetry of $f_W$.

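Next, for any $R > 0$, use the inequality

$$1 \ge \frac{|x-x_*|^2}{R^2} - \frac{|x-x_*|^2}{R^2}\,\mathbf{1}_{\{|x-x_*| > R\}} \tag{33}$$

to conclude that

\begin{align}
2J(X) - 4J(X-X_*) \ge{}& \frac{1}{R^2}\,\mathbb{E}\Big[|X-X_*|^2\, \big|\Pi(X-X_*)(\rho(X)-\rho(X_*))\big|^2\Big] \notag\\
&- \frac{1}{R^2}\,\mathbb{E}\Big[|X-X_*|^2\, \big|\Pi(X-X_*)(\rho(X)-\rho(X_*))\big|^2\, \mathbf{1}_{\{|X-X_*| > R\}}\Big]. \tag{34}
\end{align}

We bound the second term first. By $c$-regularity and the triangle inequality, we have

$$\big|\Pi(x-x_*)(\rho(x)-\rho(x_*))\big| \le |\rho(x)-\rho(x_*)| \le c\,(|x|+|x_*|) + 2c\,\mathbb{E}|X|. \tag{35}$$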

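So, noting the inclusion (which holds because $|x-x_*| > R$ forces $\max\{|x|, |x_*|\} > R/2$)

$$\{|x-x_*| > R\} \subseteq \{|x| \ge R/2,\ |x_*| \le |x|\} \cup \{|x_*| \ge R/2,\ |x| \le |x_*|\}, \tag{36}$$

we have the pointwise inequality

\begin{align}
&\mathbf{1}_{\{|x-x_*| > R\}}\, |x-x_*|^2\, \big|\Pi(x-x_*)(\rho(x)-\rho(x_*))\big|^2 \tag{37}\\
&\quad\le \mathbf{1}_{\{|x| \ge R/2,\ |x_*| \le |x|\} \cup \{|x_*| \ge R/2,\ |x| \le |x_*|\}}\, |x-x_*|^2\, |\rho(x)-\rho(x_*)|^2 \tag{38}\\
&\quad\le \mathbf{1}_{\{|x| \ge R/2\}}\, 4|x|^2 \big(2c|x| + 2c\,\mathbb{E}|X|\big)^2 + \mathbf{1}_{\{|x_*| \ge R/2\}}\, 4|x_*|^2 \big(2c|x_*| + 2c\,\mathbb{E}|X|\big)^2. \tag{39}
\end{align}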

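Taking expectations and using the fact that $X, X_*$ are IID, we have for any conjugate exponents $p, q \ge 1$ and $\beta > 0$,

\begin{align}
&\mathbb{E}\Big[|X-X_*|^2\, \big|\Pi(X-X_*)(\rho(X)-\rho(X_*))\big|^2\, \mathbf{1}_{\{|X-X_*| > R\}}\Big] \notag\\
&\quad\le 16c^2\, \mathbb{E}\Big[|X|^2 (|X| + \mathbb{E}|X|)^2\, \mathbf{1}_{\{|X| \ge R/2\}}\Big] \tag{40}\\
&\quad\le 32c^2\, \mathbb{E}\Big[|X|^4\, \mathbf{1}_{\{|X| \ge R/2\}}\Big] + 32c^2 (\mathbb{E}|X|)^2\, \mathbb{E}\Big[|X|^2\, \mathbf{1}_{\{|X| \ge R/2\}}\Big] \tag{41}\\
&\quad\le 32c^2 \big\||X|^2\big\|_{2p}^2 \big(\Pr\{|X| \ge R/2\}\big)^{1/q} + 32c^2 (\mathbb{E}|X|)^2 \big\||X|^2\big\|_{p} \big(\Pr\{|X| \ge R/2\}\big)^{1/q} \tag{42}\\
&\quad\le 32c^2 \big\||X|^2\big\|_{2p}^2 \left(\frac{\mathbb{E}|X|^\beta}{(R/2)^\beta}\right)^{1/q} + 32c^2 (\mathbb{E}|X|)^2 \big\||X|^2\big\|_{p} \left(\frac{\mathbb{E}|X|^\beta}{(R/2)^\beta}\right)^{1/q} \tag{43}\\
&\quad= \frac{32 \cdot 2^{\beta/q} c^2}{R^{\beta/q}} \Big( \big\||X|^2\big\|_{2p}^2 + \big\||X|^2\big\|_{1/2} \big\||X|^2\big\|_{p} \Big) \big(\mathbb{E}|X|^\beta\big)^{1/q} \tag{44}\\
&\quad\le \frac{64 \cdot 2^{\beta/q} c^2}{R^{\beta/q}} \big\||X|^2\big\|_{2p}^2\, \big\||X|^2\big\|_{\beta/2}^{\beta/(2q)}. \tag{45}
\end{align}

Since $\mathbb{E}|X|^2 = d$, radial symmetry implies $\operatorname{Cov}(X) = I$. Therefore, by Lemma 1, we have

$$\mathbb{E}\Big[|X-X_*|^2\, \big|\Pi(X-X_*)(\rho(X)-\rho(X_*))\big|^2\Big] \ge 2(d-1)\, I(X|G). \tag{46}$$

Continuing from above, we have proved that

$$J(X) - 2J(X-X_*) \ge \frac{d-1}{R^2}\, I(X|G) - \frac{32 \cdot 2^{\beta/q} c^2}{R^{2+\beta/q}} \big\||X|^2\big\|_{2p}^2\, \big\||X|^2\big\|_{\beta/2}^{\beta/(2q)}. \tag{47}$$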

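For any $a, b, s > 0$, taking $R = \big(b(1+s/2)/a\big)^{1/s}$ yields the identity

$$\frac{a}{R^2} - \frac{b}{R^{2+s}} = \frac{1}{1+2/s} \left(\frac{2/s}{b\,(1+2/s)}\right)^{2/s} a^{1+2/s}. \tag{48}$$

So, putting $p = 1 + \tfrac{1}{2\varepsilon}$, $q = 1+2\varepsilon$ and $\beta = 4 + \tfrac{2}{\varepsilon}$ (so that $s = \beta/q = 2/\varepsilon$ and $2p = \beta/2 = 2 + \tfrac{1}{\varepsilon}$), and simplifying, we obtain

\begin{align}
J(X) - 2J(X-X_*) &\ge \frac{(\varepsilon/8)^{\varepsilon}}{c^{2\varepsilon}(8(1+\varepsilon))^{1+\varepsilon}} \left(\frac{1}{\big\||X|^2\big\|_{2+\varepsilon^{-1}}}\right)^{1+2\varepsilon} \big(d\, I(X|G)\big)^{1+\varepsilon} \tag{51}\\
&= \frac{(\varepsilon/8)^{\varepsilon}\, \big\||X|^2\big\|_1^{1+\varepsilon}}{c^{2\varepsilon}(8(1+\varepsilon))^{1+\varepsilon}\, \big\||X|^2\big\|_{2+1/\varepsilon}^{1+2\varepsilon}}\, I(X|G)^{1+\varepsilon}, \tag{52}
\end{align}

where we have made use of the crude bound $d - 1 \ge d/2$ and substituted $d = \big\||X|^2\big\|_1$. ∎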
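The optimization step behind identity (48) is easy to verify numerically. The sketch below evaluates both sides of (48) at the stationary point $R = (b(1+s/2)/a)^{1/s}$ of $R \mapsto a/R^2 - b/R^{2+s}$, which is the maximizer because the function tends to $-\infty$ as $R \to 0$ and to $0$ from above as $R \to \infty$. The function name and the numerical values of $a$, $b$, $s$ are arbitrary choices made for illustration.

```python
import math

def optimal_value(a: float, b: float, s: float):
    """Return (R*, LHS of (48) at R*, closed-form RHS of (48))."""
    # Stationary point of a/R^2 - b/R^(2+s): R^s = b (1 + s/2) / a
    r_star = (b * (1.0 + s / 2.0) / a) ** (1.0 / s)
    lhs = a / r_star**2 - b / r_star ** (2.0 + s)
    rhs = (
        (1.0 / (1.0 + 2.0 / s))
        * ((2.0 / s) / (b * (1.0 + 2.0 / s))) ** (2.0 / s)
        * a ** (1.0 + 2.0 / s)
    )
    return r_star, lhs, rhs

# Arbitrary illustrative values
r_star, lhs, rhs = optimal_value(3.0, 2.0, 1.5)
print(r_star, lhs, rhs)
```

The second and third printed values coincide up to rounding, and evaluating the objective at other radii gives smaller values.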

### 3.2 Entropy Production under Rescaled Convolution

As one would expect, we may ‘integrate up’ in Theorem 1 to obtain an entropic version. A precise version of the result stated in Section 1 is given as follows:

###### Theorem 2.

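Let $X, X_*$ be IID radially symmetric random vectors on $\mathbb{R}^d$, $d \ge 2$, with $c$-regular density $f$. For any $\varepsilon > 0$,

$$h\big(\tfrac{1}{\sqrt{2}}(X+X_*)\big) - h(X) \ge C_\varepsilon(X)\, D(X)^{1+\varepsilon}, \tag{53}$$

where

$$C_\varepsilon(X) = \left(\frac{d\varepsilon}{1+(d+2)\varepsilon}\right)^{1+2\varepsilon} \frac{2^4\, (d/100)^{\varepsilon}}{\big(2^8 (1+\varepsilon)(1+2\varepsilon)\big)^{1+\varepsilon}} \cdot \frac{\big\||X|^2\big\|_1}{c^{2\varepsilon}\, \big\||X|^2\big\|_{2+1/\varepsilon}^{1+2\varepsilon}}. \tag{54}$$

###### Remark 2.

Although the constant $C_\varepsilon(X)$ appears to grow favorably with dimension $d$, this dimension-dependent growth can cancel to give a bound that is effectively dimension-free. An illustrative example follows the proof.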

###### Proof.

Similar to before, the inequality (53) is scale-invariant. Indeed, all relative entropy terms are invariant to scaling $X \mapsto tX$, and we also have $C_\varepsilon(tX) = C_\varepsilon(X)$ due to $tX$ being $(c/t^2)$-regular if $X$ is $c$-regular and homogeneity of the norms. Thus, we may assume without loss of generality that $X$ is normalized so that $\mathbb{E}|X|^2 = d$. Next, define $W = \tfrac{1}{\sqrt{2}}(X+X_*)$, and let $X_t, W_t$ denote the Ornstein-Uhlenbeck evolutes of $X$ and $W$, respectively. That is, for $t \ge 0$,

$$X_t = e^{-t}X + (1-e^{-2t})^{1/2} G, \qquad W_t = e^{-t}W + (1-e^{-2t})^{1/2} G. \tag{55}$$

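By Proposition 2, $X_t$ is $(5ce^{2t})$-regular for all $t \ge 0$. Noting that , an application of Theorem 1 gives

\begin{align}
I(X_t|G) - I(W_t|G) &\ge \frac{(\varepsilon/8)^{\varepsilon}\, \big\||X_t|^2\big\|_1^{1+\varepsilon}}{(5c)^{2\varepsilon}(8(1+\varepsilon))^{1+\varepsilon}\, \big\||X_t|^2\big\|_{2+1/\varepsilon}^{1+2\varepsilon}}\, e^{-4\varepsilon t}\, I(X_t|G)^{1+\varepsilon} \tag{56}\\
&\ge \frac{(\varepsilon/8)^{\varepsilon}\, \big\||X|^2\big\|_1^{1+\varepsilon}}{(5c)^{2\varepsilon}(8(1+\varepsilon))^{1+\varepsilon} \left(\frac{2(1+(2+d)\varepsilon)}{d\varepsilon}\right)^{1+2\varepsilon} \big\||X|^2\big\|_{2+1/\varepsilon}^{1+2\varepsilon}}\, e^{-4\varepsilon t}\, I(X_t|G)^{1+\varepsilon} \tag{57}\\
&= \left(\frac{d\varepsilon}{2+(2d+4)\varepsilon}\right)^{1+2\varepsilon} \frac{(\varepsilon/8)^{\varepsilon}\, \big\||X|^2\big\|_1^{1+\varepsilon}}{(5c)^{2\varepsilon}(8(1+\varepsilon))^{1+\varepsilon}\, \big\||X|^2\big\|_{2+1/\varepsilon}^{1+2\varepsilon}}\, e^{-4\varepsilon t}\, I(X_t|G)^{1+\varepsilon}, \notag
\end{align}

where (57) holds since, for $p \ge 1$,

\begin{align}
\big\||X_t|^2\big\|_p = \big(\mathbb{E}|X_t|^{2p}\big)^{1/p} &\le 2\Big(\mathbb{E}\big(e^{-2t}|X|^2 + (1-e^{-2t})|G|^2\big)^p\Big)^{1/p} \tag{58}\\
&= 2\,\big\|e^{-2t}|X|^2 + (1-e^{-2t})|G|^2\big\|_p \tag{59}\\
&\le 2\Big(e^{-2t}\big\||X|^2\big\|_p + (1-e^{-2t})\big\||G|^2\big\|_p\Big) \tag{60}\\
&\le 2\Big(1+\frac{p}{d}\Big)\big\||X|^2\big\|_p. \tag{61}
\end{align}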

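The bound (61) uses the fact that $|G|^2$ is a chi-squared random variable with $d$ degrees of freedom, and hence (using $\mathbb{E}|X|^2 = d$):

\begin{align}
\big\||G|^2\big\|_p = \left(\frac{2^p\, \Gamma\big(p+\tfrac{d}{2}\big)}{\Gamma\big(\tfrac{d}{2}\big)}\right)^{1/p} &= \mathbb{E}|X|^2 \left(\frac{\Gamma\big(p+\tfrac{d}{2}\big)}{\Gamma\big(\tfrac{d}{2}\big)\big(\tfrac{d}{2}\big)^p}\right)^{1/p} \tag{62}\\
&\le \mathbb{E}|X|^2 \Big(1+\frac{p}{d}\Big) \tag{63}\\
&\le \big\||X|^2\big\|_p \Big(1+\frac{p}{d}\Big). \tag{64}
\end{align}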
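The Gamma-function estimate used between (62) and (63) can be spot-checked numerically. The sketch below computes the ratio $\||G|^2\|_p / d$ from the chi-squared moment formula in (62) using log-Gamma values for stability, and compares it with $1 + p/d$ on a small grid. The grid of $(p, d)$ values and the function name are illustrative choices, and a finite grid check is of course no substitute for the analytic bound.

```python
import math

def chi_square_norm_ratio(p: float, d: int) -> float:
    """|| |G|^2 ||_p / d for a standard Gaussian vector G on R^d.

    Equals (Gamma(p + d/2) / (Gamma(d/2) * (d/2)^p))^(1/p), as in (62).
    """
    log_ratio = (
        math.lgamma(p + 0.5 * d) - math.lgamma(0.5 * d) - p * math.log(0.5 * d)
    )
    return math.exp(log_ratio / p)

for p in (1.0, 2.0, 3.5, 10.0):
    for d in (2, 5, 100):
        ratio = chi_square_norm_ratio(p, d)
        print(p, d, ratio, 1.0 + p / d, ratio <= 1.0 + p / d)
```

For $p = 1$ the ratio equals 1, reflecting $\mathbb{E}|G|^2 = d$, and on this grid the ratio never exceeds $1 + p/d$.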

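Now, the claim will follow by integrating both sides. Indeed, by the classical de Bruijn identity, we have

$$\int_0^\infty \big(I(X_t|G) - I(W_t|G)\big)\, dt = D(X|G) - D(W|G) = h\big(\tfrac{1}{\sqrt{2}}(X+X_*)\big) - h(X). \tag{65}$$

By Jensen’s inequality,

\begin{align}
\int_0^\infty e^{-4\varepsilon t}\, I(X_t|G)^{1+\varepsilon}\, dt &\ge \frac{1}{(4\varepsilon)^{\varepsilon}} \left(\int_0^\infty e^{-4\varepsilon t}\, I(X_t|G)\, dt\right)^{1+\varepsilon} \tag{66}\\
&\ge \frac{1}{(4\varepsilon)^{\varepsilon}} \left(\int_0^\infty I(X_{t+2\varepsilon t}|G)\, dt\right)^{1+\varepsilon} \tag{67}\\
&= \frac{1}{(4\varepsilon)^{\varepsilon}(1+2\varepsilon)^{1+\varepsilon}}\, D(X|G)^{1+\varepsilon}, \tag{68}
\end{align}

where we used the bound $I(X_{t+s}|G) \le e^{-2s} I(X_t|G)$ due to exponential decay of information along the semigroup (e.g., [19]), a change of variables, and the identity $\int_0^\infty I(X_t|G)\, dt = D(X|G)$. Thus, we have proved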

 h(1√2(X+X∗))−h(X) ≥(dε2+(2d+4)ε)1+2ε(ε/8)ε∥|X|2∥1+ε1(5c)2ε(8(1+ε))1+ε∥∥|X|2∥∥1+2ε2+1/ε⋅1(4ε)ε(1+2ε)1+ε<