# Uniform Hanson-Wright type concentration inequalities for unbounded entries via the entropy method

Yegor Klochkov
Humboldt Universität zu Berlin
klochkoy@hu-berlin.de
Financial support from the German Research Foundation (DFG) via the International Research Training Group 1792 “High Dimensional Nonstationary Time Series” in Humboldt-Universität zu Berlin is gratefully acknowledged.
Nikita Zhivotovskiy
HSE, IITP RAS
nikita.zhivotovskiy@phystech.edu
Parts of this work were done while the author was a postdoctoral fellow at Technion. Nikita Zhivotovskiy was supported by RSF grant No. 18-11-00132.
###### Abstract

This paper is devoted to uniform versions of the Hanson-Wright inequality for a random vector $X$ with independent subgaussian components. The core technique of the paper is based on the entropy method combined with truncations of both the gradients of the functions of interest and the coordinates of the vector $X$ itself. Our results recover, in particular, the classic uniform bound of Talagrand (1996) for Rademacher chaoses and the more recent uniform result of Adamczak (2015), which holds under certain rather strong assumptions on the distribution of $X$. We provide several applications of our techniques: we establish a version of the standard Hanson-Wright inequality, which is tighter in some regimes. Extending our results, we show a version of the dimension-free matrix Bernstein inequality that holds for random matrices with a subexponential spectral norm. We apply the derived inequality to the problem of covariance estimation with missing observations and prove an almost optimal high probability version of the recent result of Lounici (2014). Finally, we show a uniform Hanson-Wright type inequality in the Ising model under Dobrushin’s condition. A closely related question was posed by Marton (2003).

Keywords: concentration inequalities, modified logarithmic Sobolev inequalities, uniform Hanson-Wright inequalities, Rademacher chaos, matrix Bernstein inequality

## 1 Introduction

The concentration of quadratic forms of random variables is a classic topic in probability. A well-known result is due to Hanson and Wright (we refer to the form of this inequality presented in Rudelson and Vershynin (2013)), which claims that if $A$ is an $n \times n$ real matrix and $X = (X_1, \dots, X_n)$ is a random vector in $\mathbb{R}^n$ with independent centered coordinates satisfying $\max_i \|X_i\|_{\psi_2} \le K$ (we will recall the definition of $\|\cdot\|_{\psi_2}$ below), then for all $t \ge 0$

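$$\mathbb{P}\left(|X^\top A X - \mathbb{E} X^\top A X| \ge t\right) \le 2\exp\left(-c\min\left(\frac{t^2}{K^4\|A\|_{HS}^2}, \frac{t}{K^2\|A\|}\right)\right) \tag{1.1}$$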

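for some absolute constant $c > 0$, where $\|A\|_{HS} = \big(\sum_{i,j} A_{ij}^2\big)^{1/2}$ is the Hilbert-Schmidt norm and $\|A\|$ is the operator norm of $A$. An important extension of this result arises when, instead of a single matrix, we have a family of matrices $\mathcal{A}$ and want to understand the behaviour of the random quadratic forms simultaneously for all matrices in the family. As a concrete example we consider an order-2 Rademacher chaos: given a family $\mathcal{A}$ of $n \times n$ real symmetric matrices with zero diagonal, that is, $A_{ii} = 0$ for all $A \in \mathcal{A}$ and all $i = 1, \dots, n$, one wants to study the following random variable

$$Z = \sup_{A \in \mathcal{A}} \sum_{i,j=1}^{n} A_{ij}\varepsilon_i\varepsilon_j = \sup_{A \in \mathcal{A}} \varepsilon^\top A \varepsilon,$$

where $\varepsilon = (\varepsilon_1, \dots, \varepsilon_n)$ is a sequence of independent Rademacher signs, taking values $\pm 1$ with equal probabilities. In the celebrated paper Talagrand (1996) it was shown, in particular, that there is an absolute constant $c > 0$ such that for any $t \ge 0$

$$\mathbb{P}(|Z - \mathbb{E}Z| \ge t) \le 2\exp\left(-c\min\left(\frac{t^2}{\big(\mathbb{E}\sup_{A \in \mathcal{A}}\|AX\|\big)^2}, \frac{t}{\sup_{A \in \mathcal{A}}\|A\|}\right)\right). \tag{1.2}$$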

Apart from the new techniques, the significance of this result is that previously (see, for example, Ledoux and Talagrand (2013)) similar bounds were one-sided and had a multiplicative constant greater than $1$ before $\mathbb{E}Z$. These results are sometimes called deviation inequalities, in contrast to the concentration bounds of the form (1.2) that will be studied below. A simplified proof of the upper tail of (1.2) appeared later in Boucheron et al. (2003). Similar inequalities in the Gaussian case follow from the results in Borell (1984) and Arcones and Gine (1993).
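For small $n$ the random variable $Z$ can be evaluated exactly by enumerating all sign vectors. The following is a minimal sketch of that computation; the matrices `A1`, `A2` and all helper names are illustrative assumptions and do not come from the paper. It confirms that each single quadratic form with zero diagonal has mean zero, while the supremum over the family has nonnegative mean.

```python
from itertools import product

def quad_form(A, x):
    """Return x^T A x for a square matrix A given as a list of rows."""
    n = len(x)
    return sum(A[i][j] * x[i] * x[j] for i in range(n) for j in range(n))

def chaos_sup(family, eps):
    """Z(eps) = sup over the family of eps^T A eps."""
    return max(quad_form(A, eps) for A in family)

# Hypothetical toy family of symmetric matrices with zero diagonal (n = 3).
A1 = [[0, 1, 0], [1, 0, 2], [0, 2, 0]]
A2 = [[0, -1, 1], [-1, 0, 0], [1, 0, 0]]
family = [A1, A2]

# All 2^n sign vectors; each has probability 2^{-n} under the Rademacher law.
signs = list(product([-1, 1], repeat=3))

# Mean of each individual quadratic form (zero because the diagonal vanishes).
mean_forms = [sum(quad_form(A, e) for e in signs) / len(signs) for A in family]

# Mean of the supremum; it dominates the mean of every individual form.
mean_Z = sum(chaos_sup(family, e) for e in signs) / len(signs)
```

The enumeration costs $2^n$ evaluations, so it is only meant as a check of the definitions on toy examples.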

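Observe that when the diagonal elements are zero, for each $A \in \mathcal{A}$ the corresponding quadratic form is centered, $\mathbb{E}\varepsilon^\top A \varepsilon = 0$. In a general situation we will be interested in the analysis of

$$Z = \sup_{A \in \mathcal{A}}\left(X^\top A X - \mathbb{E}X^\top A X\right), \tag{1.3}$$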

for a random vector $X$ taking its values in $\mathbb{R}^n$. The analysis of both the expectation and the concentration properties of this random variable has appeared recently in many papers. To name a few: Kramer et al. (2014) study $\mathbb{E}Z$ and deviations of $Z$ for classes of positive semidefinite matrices with applications to compressive sensing, and Dicker and Erdogdu (2017) prove deviation inequalities for $Z$ and subgaussian vectors under some extra assumptions. Additionally, the recent paper Adamczak et al. (2018b) studies deviation bounds for $Z$ with Banach space-valued matrices $A$ and Gaussian variables, providing upper and lower bounds for the moments. Finally, it was shown in Adamczak (2015) that if $X$ satisfies the so-called concentration property with constant $K$, that is, for every $1$-Lipschitz function $\varphi : \mathbb{R}^n \to \mathbb{R}$ and any $t \ge 0$ it holds $\mathbb{E}|\varphi(X)| < \infty$ and

$$\mathbb{P}(|\varphi(X) - \mathbb{E}\varphi(X)| \ge t) \le 2\exp(-t^2/2K^2), \tag{1.4}$$

then the following bound (similar to (1.2)) holds for every $t \ge 0$

$$\mathbb{P}(|Z - \mathbb{E}Z| \ge t) \le 2\exp\left(-c\min\left(\frac{t^2}{K^2\big(\mathbb{E}\sup_{A \in \mathcal{A}}\|AX\|\big)^2}, \frac{t}{K^2\sup_{A \in \mathcal{A}}\|A\|}\right)\right). \tag{1.5}$$

This result has an application in covariance estimation and recovers another recent concentration result of Koltchinskii and Lounici (2017); we will discuss this in what follows. The drawback of (1.5) is that the concentration property is quite restrictive: it holds when $X$ has the standard Gaussian distribution and for some log-concave distributions (see Ledoux (2001)), but it does not hold for general subgaussian entries, not even in the simplest case where $X$ is a Rademacher random vector.

In this paper we extend the mentioned results in two directions. On the one hand, we revisit the result of Boucheron et al. (2003) for bounded variables, allowing non-zero diagonal values of the matrices; on the other hand, we allow unbounded subgaussian variables $X_i$. First, let us recall the following definition. For $\alpha > 0$ denote the $\psi_\alpha$-norm of a random variable $Y$ by

$$\|Y\|_{\psi_\alpha} = \inf\left\{t \ge 0 : \mathbb{E}\exp\left(\frac{|Y|^\alpha}{t^\alpha}\right) \le 2\right\},$$

which is a proper norm whenever $\alpha \ge 1$. A random variable $Y$ with $\|Y\|_{\psi_1} < \infty$ will be referred to as subexponential, and one with $\|Y\|_{\psi_2} < \infty$ will be referred to as subgaussian; the corresponding norm is usually called the subgaussian norm. We also use the $L_p$ norm: for $p \ge 1$ we set $\|Y\|_{L_p} = (\mathbb{E}|Y|^p)^{1/p}$. One of our main contributions is the following upper-tail bound.
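The $\psi_\alpha$-norm is defined only implicitly, so a small numerical illustration may help. The sketch below computes it by bisection for a random variable with finitely many atoms; the function name `psi_norm`, its parameters, and the overflow guard are assumptions made for this example. For a Rademacher sign the defining condition reads $\exp(1/t^2) \le 2$, so the $\psi_2$-norm equals $1/\sqrt{\log 2}$.

```python
import math

def psi_norm(values, probs, alpha=2.0, iters=200):
    """Bisection for inf{t > 0 : E exp(|Y|^alpha / t^alpha) <= 2}, Y discrete."""
    atoms = [(abs(v), p) for v, p in zip(values, probs) if p > 0]
    top = max(a for a, _ in atoms)
    if top == 0:
        return 0.0  # Y = 0 almost surely

    def moment(t):
        total = 0.0
        for a, p in atoms:
            z = (a / t) ** alpha
            # Guard against overflow in exp; such t are far too small anyway.
            total += p * (math.exp(z) if z < 700 else math.inf)
        return total

    hi = top
    while moment(hi) > 2:      # grow until the constraint is satisfied
        hi *= 2
    lo = 0.0
    for _ in range(iters):     # moment(t) is decreasing in t, so bisection applies
        mid = (lo + hi) / 2
        if moment(mid) <= 2:
            hi = mid
        else:
            lo = mid
    return hi

rademacher_psi2 = psi_norm([-1, 1], [0.5, 0.5], alpha=2.0)
rademacher_psi1 = psi_norm([-1, 1], [0.5, 0.5], alpha=1.0)
scaled_psi2 = psi_norm([-3, 3], [0.5, 0.5], alpha=2.0)
```

The last value illustrates homogeneity: scaling the variable by $3$ scales its norm by $3$.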

###### Theorem 1.1.

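Suppose that the components of $X = (X_1, \dots, X_n)$ are independent centered random variables and $\mathcal{A}$ is a finite family of $n \times n$ real symmetric matrices. Denote $M = \|\max_i |X_i|\|_{\psi_2}$. Then, for any it holds

$$\mathbb{P}(Z - \mathbb{E}Z \ge t) \le \exp\left(-c\min\left(\frac{t^2}{M^2\big(\mathbb{E}\sup_{A \in \mathcal{A}}\|AX\|\big)^2}, \frac{t}{M^2\sup_{A \in \mathcal{A}}\|A\|}\right)\right),$$

where $c > 0$ is an absolute constant and $Z$ is defined by (1.3).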

###### Remark 1.1.

In Theorem 1.1 and below we assume that every $A \in \mathcal{A}$ is symmetric. This is done only for convenience of presentation; in fact, the analysis may be performed for general square matrices. The only difference will be that in many places should be replaced by .

In particular, Theorem 1.1 recovers the right tail of the result of Talagrand (1.2) up to absolute constants, since in this case we obviously have $\|\max_i |\varepsilon_i|\|_{\psi_2} \lesssim 1$. Furthermore, the result of Theorem 1.1 works without the assumption used in Talagrand (1996) and Boucheron et al. (2003) that the diagonals of all matrices in $\mathcal{A}$ are zero. Moreover, it is also applicable in some situations when the concentration property (1.4) holds: indeed, if $X$ is a standard normal vector in $\mathbb{R}^n$ then it is well known (see Ledoux and Talagrand (2013)) that $\|\max_i |X_i|\|_{\psi_2} \sim \sqrt{\log n}$ and at the same time if the identity matrix then . Therefore, in this case the factor $M$ is only of at most logarithmic order when compared to $K$.

In the special case when $\mathcal{A}$ consists of just one matrix, our bound is similar to the original Hanson-Wright inequality. On the one hand, our bound may have an extra logarithmic factor that depends on the dimension $n$. On the other hand, the original term $\|A\|_{HS}$ is replaced by the better term $\mathbb{E}\|AX\|$. We will discuss this phenomenon below. The core of the proof of the Hanson-Wright inequality in Rudelson and Vershynin (2013) is based on the decoupling technique, which may be used (at least in a straightforward way) to prove the deviation inequality, but not the concentration inequality, for $Z$ in the case when $\mathcal{A}$ consists of more than one matrix.

A natural question to ask is whether one may improve Theorem 1.1 and replace $\|\max_i |X_i|\|_{\psi_2}$ by $\max_i \|X_i\|_{\psi_2}$. In what follows we show that in the deviation version of Theorem 1.1 this replacement is not possible in some cases. This is quite unexpected in light of the fact that $\|\max_i |X_i|\|_{\psi_2}$ does not appear in the original Hanson-Wright inequality. Therefore, we believe that the form of our result is close to optimal. We also provide the following extension of Theorem 1.1, which may be better in some cases.

###### Proposition 1.2.

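Suppose that the components of $X = (X_1, \dots, X_n)$ are independent centered random variables. Suppose also that the variables $X_i$ have symmetric distributions ($X_i$ has the same distribution as $-X_i$). Let $\mathcal{A}$ be a finite family of $n \times n$ real symmetric matrices. Denote $M = \|\max_i |X_i|\|_{\psi_2}$ and $K = \max_i \|X_i\|_{\psi_2}$ and let $G$ be a standard Gaussian vector in $\mathbb{R}^n$. Then, for any it holds

$$\mathbb{P}(Z - \mathbb{E}Z \ge t) \le \exp\left(-c\min\left(\frac{t^2}{M^2K^2\big(\mathbb{E}\sup_{A \in \mathcal{A}}\|AG\|\big)^2}, \frac{t}{MK\sup_{A \in \mathcal{A}}\|A\|}\right)\right),$$

where the constants are absolute and $Z$ is defined by (1.3).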

###### Remark 1.2.

Proposition 1.2 is closer to the standard Hanson-Wright inequality (1.1). Indeed, in the case when $\mathcal{A} = \{A\}$ we have $\mathbb{E}\|AG\| \sim \|A\|_{HS}$. The difference is that $K^4$ and $K^2$ are replaced by $M^2K^2$ and $MK$ respectively.

We proceed with some notation that will be used below. For a non-negative random variable $Y$, define its entropy as

$$\operatorname{Ent}(Y) = \mathbb{E}Y\log Y - \mathbb{E}Y\log\mathbb{E}Y.$$

Instead of the concentration property (1.4) we also discuss the following property:

###### Assumption 1.

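We say that the random vector $X$ taking its values in $\mathbb{R}^n$ satisfies the logarithmic Sobolev inequality with constant $K > 0$ if for any continuously differentiable function $f : \mathbb{R}^n \to \mathbb{R}$ it holds

$$\operatorname{Ent}(f^2) \le 2K^2\,\mathbb{E}\|\nabla f(X)\|^2, \tag{1.6}$$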

whenever both sides of the inequality are not infinite.

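To show that the logarithmic Sobolev property is closely related to the concentration property, we recall (Theorem 5.3 in Ledoux (2001)) that Assumption 1 implies the concentration property (1.4). The proof of this fact is based essentially on taking $f = \exp(\lambda\varphi/2)$ for a $1$-Lipschitz function $\varphi$ and $\lambda \in \mathbb{R}$, which implies

$$\operatorname{Ent}\big(\exp(\lambda(\varphi(X) - \mathbb{E}\varphi(X)))\big) \le \frac{K^2\lambda^2}{2}\,\mathbb{E}\exp(\lambda(\varphi(X) - \mathbb{E}\varphi(X))).$$

This is known to imply (1.4) through the Herbst argument, see Boucheron et al. (2013). Moreover, the last inequality is equivalent to the concentration property up to absolute constants. Indeed, from the concentration property we know that $\|\varphi(X) - \mathbb{E}\varphi(X)\|_{\psi_2} \lesssim K$ and this implies (see van Handel (2016)) that for all $\lambda \in \mathbb{R}$

$$\operatorname{Ent}\big(\exp(\lambda(\varphi(X) - \mathbb{E}\varphi(X)))\big) \lesssim K^2\lambda^2\,\mathbb{E}\exp(\lambda(\varphi(X) - \mathbb{E}\varphi(X))).$$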

One of the technical contributions of the paper is that we use a similar scheme to prove Theorem 1.1 and to recover (1.5) under the logarithmic Sobolev Assumption 1. The application of logarithmic Sobolev inequalities requires computation of the gradient of the function of interest, which in our case is the gradient of $Z$. It turns out that in the analysis we need to control the behaviour of $\sup_{A \in \mathcal{A}}\|AX\|$ (or its analogs) and, as in Boucheron et al. (2003) and Adamczak (2015), we will use a truncation argument to do so. However, in both cases our proofs will pass through the entropy variational formula of Boucheron et al. (2013), which states that for random variables $Y, W$ with $\mathbb{E}\exp(W) < \infty$ it holds

$$\mathbb{E}\big(W\exp(\lambda Y)\big) \le \mathbb{E}\exp(\lambda Y)\,\log\big(\mathbb{E}\exp(W)\big) + \operatorname{Ent}\big(\exp(\lambda Y)\big). \tag{1.7}$$

This will allow us to shorten the proofs and avoid some technicalities appearing in previous papers. Finally, to prove Theorem 1.1 we use a second truncation argument, which is based on the Hoffmann-Jørgensen inequality (see Ledoux and Talagrand (2013)). We also present two lemmas, which will be used several times in the text. Both results have short proofs and may be of independent interest.
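The variational formula (1.7) can be sanity-checked on a finite probability space. The following sketch evaluates the difference between its right-hand and left-hand sides; the three-point distribution, the values of `Y` and `W`, and the function names are assumptions chosen for illustration. The gap is nonnegative for any `W`, and it vanishes for the choice $W = \lambda Y$, which follows by direct substitution into (1.7).

```python
import math

def entropy(values, probs):
    """Ent(V) = E[V log V] - E[V] log E[V] for a discrete V > 0."""
    mean = sum(p * v for v, p in zip(values, probs))
    return sum(p * v * math.log(v) for v, p in zip(values, probs)) - mean * math.log(mean)

def variational_gap(probs, Y, W, lam):
    """Right-hand side of (1.7) minus its left-hand side."""
    V = [math.exp(lam * y) for y in Y]                     # V = exp(lam * Y)
    mean_V = sum(p * v for p, v in zip(probs, V))
    lhs = sum(p * w * v for p, w, v in zip(probs, W, V))   # E[W exp(lam Y)]
    log_mgf_W = math.log(sum(p * math.exp(w) for p, w in zip(probs, W)))
    return mean_V * log_mgf_W + entropy(V, probs) - lhs

# Hypothetical three-point probability space.
probs = [0.2, 0.5, 0.3]
Y = [-1.0, 0.0, 2.0]
lam = 0.7

gap_generic = variational_gap(probs, Y, [1.0, -2.0, 0.5], lam)
gap_extremal = variational_gap(probs, Y, [lam * y for y in Y], lam)
```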

###### Lemma 1.3.

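Suppose that for random variables $Z$, $W$ and any it holds

$$\operatorname{Ent}(e^{\lambda Z}) \le \lambda^2\,\mathbb{E}\,We^{\lambda Z} \quad\text{and}\quad \mathbb{P}(W > L + \theta t) \le e^{-t}, \tag{1.8}$$

where $L, \theta$ are positive constants. Then the following concentration result holds

$$\mathbb{P}(Z - \mathbb{E}Z > t) \le \exp\left(-c\min\left\{\frac{t^2}{L + \theta}, \frac{t}{\sqrt{\theta}}\right\}\right), \tag{1.9}$$

where $c > 0$ is an absolute constant. Moreover, if (1.8) holds as well for $-Z$, we have

$$\mathbb{P}(|Z - \mathbb{E}Z| > t) \le 2\exp\left(-c\min\left\{\frac{t^2}{L + \theta}, \frac{t}{\sqrt{\theta}}\right\}\right).$$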

The second technical result is a version of the convex concentration inequality of Talagrand (1996), which does not require the boundedness of the components of $X$.

###### Lemma 1.4.

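Let $f : \mathbb{R}^n \to \mathbb{R}$ be a convex, $L$-Lipschitz function with respect to the Euclidean norm in $\mathbb{R}^n$ and $X = (X_1, \dots, X_n)$ be a random vector with independent components. Then, it holds for any

$$\mathbb{P}(|f(X) - \mathbb{E}f(X)| > t) \le \exp\left(-\frac{ct^2}{L^2\,\|\max_i |X_i|\|_{\psi_2}^2}\right),$$

where the constants are absolute.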

We discuss the optimality of this result in what follows. Finally, we sum up the structure of the paper and outline the main contributions:

• Section 2 is devoted to applications and discussions and consists of several parts. First, we give a simple proof of the uniform bound of Adamczak (2015) under the logarithmic Sobolev assumption. The second paragraph is devoted to improvements in the non-uniform Hanson-Wright inequality (1.1) in the subgaussian regime. Furthermore, we apply our techniques to obtain a uniform concentration result similar to Theorem 1.1 in a particular case of non-independent components. We consider the Ising model under Dobrushin’s condition, which has attracted some attention recently (see Adamczak et al. (2018a) and Götze et al. (2018)). The question we study was raised by Marton (2003) in a closely related scenario. Finally, we show that it is not possible in general to replace $\|\max_i |X_i|\|_{\psi_2}$ with $\max_i \|X_i\|_{\psi_2}$ in Theorem 1.1 by providing an appropriate counterexample.

• In Section 3 we present the proof of Theorem 1.1. Along the way, we prove Lemma 1.3 and Lemma 1.4. Finally, we give a proof of Proposition 1.2.

• In Section 4 we prove a dimension-free matrix Bernstein inequality that holds for random matrices with the subexponential spectral norm. The proof is based on the same truncation approach as in the proof of Theorem 1.1. We demonstrate how our Bernstein inequality can be used in the context of covariance estimation for subgaussian observations, improving the state-of-the-art result of Lounici (2014) for covariance estimation with missing observations.

## 2 Some applications and discussions

We begin with some notation that will be used throughout the paper. For a random vector $X$ taking its values in $\mathbb{R}^n$ let $X_1, \dots, X_n$ denote its components. In the case when all the components of $X$ are independent let $X_i'$ denote an independent copy of the component $X_i$. The symbol $\sim$ denotes equivalence up to absolute constants and $\lesssim$ denotes an inequality up to some absolute constant. Throughout the paper $c, C > 0$ are absolute constants which may change from line to line.

#### A uniform Hanson-Wright inequality under the logarithmic Sobolev condition

In this paragraph we recover the result of Adamczak (2015) under Assumption 1. Consider the random variable $Z$ defined by (1.3) as a function of $X$, where $X$ satisfies the logarithmic Sobolev assumption (1.6).

Following Adamczak (2015), we assume without loss of generality that $\mathcal{A}$ is a finite set of matrices. Then $Z$ is Lebesgue-a.e. differentiable and

$$\|\nabla Z(X)\| \le 2\sup_{A \in \mathcal{A}}\|AX\|,$$

so the gradient norm is bounded by a Lipschitz function of $X$ with good concentration properties.
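The gradient bound can be checked numerically at a point where the maximiser is unique, since there $\nabla Z(x) = 2\tilde{A}x$ for the maximising matrix $\tilde{A}$. The sketch below compares a central finite difference with $2\sup_A\|Ax\|$; the two matrices, the centering constants (traces, which equal $\mathbb{E}X^\top AX$ only under the assumption of unit-variance coordinates) and the point `x` are hypothetical.

```python
import math

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def norm(v):
    return math.sqrt(sum(c * c for c in v))

def sup_form(family, centers, x):
    """Z(x) = max_A (x^T A x - c_A), with c_A standing in for E X^T A X."""
    return max(sum(xi * yi for xi, yi in zip(x, matvec(A, x))) - c
               for A, c in zip(family, centers))

def numerical_gradient(f, x, h=1e-6):
    """Central finite differences, coordinate by coordinate."""
    grad = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + h] + x[i + 1:]
        down = x[:i] + [x[i] - h] + x[i + 1:]
        grad.append((f(up) - f(down)) / (2 * h))
    return grad

# Hypothetical symmetric matrices and a hypothetical evaluation point.
family = [[[1.0, 2.0], [2.0, -1.0]], [[0.0, 1.0], [1.0, 3.0]]]
centers = [0.0, 3.0]   # traces of the two matrices (unit-variance assumption)
x = [0.3, -1.2]

grad = numerical_gradient(lambda y: sup_form(family, centers, y), x)
bound = 2 * max(norm(matvec(A, x)) for A in family)
```

At this point the second matrix attains the maximum, so the gradient is $2A_2x = (-2.4, -6.6)$ and the bound is attained with equality.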

###### Remark 2.1.

Note that Assumption 1 applies only to smooth functions, so a standard smoothing argument should be used (see e.g. Ledoux (2001)). For the sake of completeness we recall this argument in Section A. In the rest of this section we assume that none of these potential technical problems arise.

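In particular, since $X$ satisfies the log-Sobolev condition with constant $K$, we have (Theorem 5.3 in Ledoux (2001))

$$\mathbb{P}\Big(\sup_{A}\|AX\| \ge \mathbb{E}\sup_{A}\|AX\| + K\sqrt{t}\,\sup_{A}\|A\|\Big) \le e^{-t}.$$

Taking squares and using $(a + b)^2 \le 2a^2 + 2b^2$, we get

$$\mathbb{P}\Big(\sup_{A}\|AX\|^2 \ge 2\big(\mathbb{E}\sup_{A}\|AX\|\big)^2 + 2K^2\sup_{A}\|A\|^2\,t\Big) \le e^{-t}.$$

Furthermore, the logarithmic Sobolev condition implies for any $\lambda \in \mathbb{R}$

$$\operatorname{Ent}(e^{\lambda Z}) \le 4K^2\lambda^2\,\mathbb{E}\Big[\sup_{A}\|AX\|^2\,e^{\lambda Z}\Big].$$

Therefore, by Lemma 1.3 it holds for any $t > 0$,

$$\mathbb{P}\Big(|Z - \mathbb{E}Z| > C\big(K\,\mathbb{E}\sup_{A}\|AX\|\sqrt{t} + K^2\sup_{A}\|A\|\,t\big)\Big) \le 2e^{-t},$$

which coincides with (1.5) for $K$-concentrated vectors up to absolute constant factors.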

###### Remark 2.2.

This result may be used directly to prove the concentration for $\|\hat{\Sigma} - \Sigma\|$, where $\hat{\Sigma}$ is the sample covariance of centered Gaussian vectors with the covariance matrix $\Sigma$ (see Theorem 4.1 in Adamczak (2015)). We return to the covariance estimation problem in Section 4.

#### Improving Hanson-Wright inequality in the subgaussian regime

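Our analysis implies, in particular, an improved version of the Hanson-Wright inequality (1.1) in some cases. We consider a centered random vector $X$ with independent subgaussian components and set $K = \max_i \|X_i\|_{\psi_2}$, $M = \|\max_i |X_i|\|_{\psi_2}$. In this case (1.1) implies that with probability at least $1 - e^{-t}$ it holds

$$X^\top A X - \mathbb{E}X^\top A X \lesssim K^2\big(\|A\|_{HS}\sqrt{t} + \|A\|\,t\big). \tag{2.1}$$

At the same time, Theorem 1.1 for a single matrix implies with the same probability

$$X^\top A X - \mathbb{E}X^\top A X \lesssim M\,\mathbb{E}\|AX\|\sqrt{t} + M^2\|A\|\,t. \tag{2.2}$$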

Observe that when almost surely for each , we have . The following example illustrates the difference between these two bounds.

###### Example 2.1.

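Assume $\delta_1, \dots, \delta_n$ is a sequence of independent Bernoulli random variables with mean $\delta$ and let $X_i = \delta_i - \delta$. For we easily get

$$\mathbb{E}\|AX\| \le \sqrt{\mathbb{E}X^\top A^2 X} \le \sqrt{\delta}\,\|A\|_{HS}.$$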

On the other hand, for it holds

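$$\begin{aligned}
\|X_1\|_{\psi_2}^2 &= \|\delta_1 - \delta\|_{\psi_2}^2 \sim \sup_{\lambda \in \mathbb{R}}\frac{\log\big(\mathbb{E}\exp(\lambda(\delta_1 - \delta))\big)}{\lambda^2}\\
&= \sup_{\lambda \in \mathbb{R}}\frac{\log\big(\delta\exp(\lambda(1 - \delta)) + (1 - \delta)\exp(-\lambda\delta)\big)}{\lambda^2} = \frac{1 - 2\delta}{4\log((1 - \delta)/\delta)} \sim \frac{1}{|\log\delta|},
\end{aligned}$$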

where the last line follows directly from Theorem 1.1 in Schlemm (2016). Therefore, the standard Hanson-Wright inequality implies that with probability at least $1 - e^{-t}$ it holds,

$$X^\top A X - \mathbb{E}X^\top A X \lesssim \frac{1}{|\log\delta|}\big(\|A\|_{HS}\sqrt{t} + \|A\|\,t\big),$$

while (2.2) and imply that for and it holds with probability at least

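$$X^\top A X - \mathbb{E}X^\top A X \lesssim \min\left\{\sqrt{\frac{\delta\log n}{|\log\delta|}}, \sqrt{\delta}\right\}\|A\|_{HS}\sqrt{t} + \min\left\{\frac{\log n}{|\log\delta|}, 1\right\}\|A\|\,t. \tag{2.3}$$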

It is easy to verify that , thus the inequality (2.3) is better than the Hanson-Wright inequality for this in the subgaussian regime (when the $t$-term is dominated by the $\sqrt{t}$-term).
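The closed form for the supremum used in Example 2.1 can be checked numerically. The sketch below is an illustration under assumed values (the choice $\delta = 0.01$, the grid, and the function names are not from the paper). It uses the fact, verifiable by direct substitution, that the ratio equals the closed form at $\lambda^* = 2\log((1 - \delta)/\delta)$, and compares this with a grid search. It also shows that the resulting quantity is much larger than the variance proxy $\delta(1 - \delta)/2$ obtained as $\lambda \to 0$, which is the reason the $\psi_2$-norm is pessimistic for sparse Bernoulli variables.

```python
import math

def log_mgf_ratio(delta, lam):
    """log E exp(lam * (B - delta)) / lam^2 for B ~ Bernoulli(delta)."""
    mgf = delta * math.exp(lam * (1 - delta)) + (1 - delta) * math.exp(-lam * delta)
    return math.log(mgf) / lam ** 2

def closed_form(delta):
    """(1 - 2 delta) / (4 log((1 - delta) / delta)), as in Example 2.1."""
    return (1 - 2 * delta) / (4 * math.log((1 - delta) / delta))

delta = 0.01                                      # assumed sparsity level
lam_star = 2 * math.log((1 - delta) / delta)      # point where the sup is attained

# Grid search over lambda in [-30, 30] with step 0.01, excluding lambda = 0.
grid = [k / 100 for k in range(-3000, 3001) if k != 0]
grid_sup = max(log_mgf_ratio(delta, lam) for lam in grid)

value_at_star = log_mgf_ratio(delta, lam_star)
small_lambda_limit = delta * (1 - delta) / 2      # limit of the ratio as lambda -> 0
```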

#### Uniform concentration results in the Ising model

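Suppose we have a discrete random vector $\sigma \in \{-1, 1\}^n$ with the distribution defined by

$$\pi(\sigma) = \frac{1}{Z'}\exp\left(\sum_{i,j=1}^{n} J_{ij}\sigma_i\sigma_j - \sum_{i=1}^{n} h_i\sigma_i\right),$$

where $Z'$ is a normalizing factor. This distribution defines the Ising model with parameters $J = (J_{ij})_{i,j=1}^{n}$ and $h = (h_i)_{i=1}^{n}$.

For an arbitrary function $f$ on $\{-1, 1\}^n$ define a difference operator,

$$|df|^2(\sigma) = \frac{1}{2}\sum_{i=1}^{n}\big(f(\sigma) - f(T_i\sigma)\big)^2\,\pi(-\sigma_i \mid \sigma_1, \dots, \sigma_{i-1}, \sigma_{i+1}, \dots),$$

where the operator $T_i$ flips the sign of the $i$th coordinate, and $\pi(\cdot \mid \sigma_1, \dots, \sigma_{i-1}, \sigma_{i+1}, \dots)$ is the conditional distribution of the $i$th coordinate given the rest of the elements. The following recent result provides a log-Sobolev inequality for the vector $\sigma$ under Dobrushin-type conditions.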

###### Theorem 2.1 (Proposition 1.1, Götze et al. (2018)).

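Suppose $\max_i |h_i| \le \alpha$ and $J$ satisfies $J_{ii} = 0$ and

$$\|J\|_{1 \mapsto 1} = \max_{i = 1, \dots, n}\sum_{j=1}^{n}|J_{ij}| \le 1 - \rho. \tag{2.4}$$

There is a constant $C = C(\alpha, \rho)$ such that for an arbitrary function $f$ on $\{-1, 1\}^n$ it holds,

$$\operatorname{Ent}(f^2) \le 2C\,\mathbb{E}|df|^2.$$

###### Remark 2.3.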

Following Götze et al. (2018) the condition (2.4) will be called Dobrushin’s condition.
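For very small $n$ the objects above can be computed by brute force. The following sketch builds the Ising distribution, evaluates the difference operator $|df|^2$, and computes the norm $\|J\|_{1 \mapsto 1}$ from (2.4); the interaction matrix `J`, the field `h`, and all function names are hypothetical choices for illustration.

```python
import math
from itertools import product

def ising_pmf(J, h):
    """Return {sigma: pi(sigma)} with pi proportional to
    exp(sum_ij J_ij s_i s_j - sum_i h_i s_i)."""
    n = len(h)
    weights = {}
    for s in product([-1, 1], repeat=n):
        energy = sum(J[i][j] * s[i] * s[j] for i in range(n) for j in range(n))
        energy -= sum(h[i] * s[i] for i in range(n))
        weights[s] = math.exp(energy)
    total = sum(weights.values())          # the normalizing factor Z'
    return {s: w / total for s, w in weights.items()}

def flip(s, i):
    """T_i sigma: flip the sign of coordinate i."""
    return s[:i] + (-s[i],) + s[i + 1:]

def df_squared(f, pi, s):
    """|df|^2(sigma) = 1/2 sum_i (f(s) - f(T_i s))^2 * pi(-s_i | rest)."""
    total = 0.0
    for i in range(len(s)):
        t = flip(s, i)
        cond_flip = pi[t] / (pi[s] + pi[t])   # conditional law of coordinate i
        total += 0.5 * (f(s) - f(t)) ** 2 * cond_flip
    return total

def dobrushin_norm(J):
    """The 1 -> 1 norm from (2.4): maximal absolute row sum."""
    return max(sum(abs(x) for x in row) for row in J)

# Hypothetical parameters with zero diagonal and small interactions.
J = [[0.0, 0.2, 0.1], [0.2, 0.0, 0.1], [0.1, 0.1, 0.0]]
h = [0.1, -0.2, 0.0]
pi = ising_pmf(J, h)
first_coordinate = lambda s: float(s[0])
```

With these parameters $\|J\|_{1 \mapsto 1} = 0.3$, so condition (2.4) holds with $\rho = 0.7$.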

We may obtain the following uniform concentration result which is a simple outcome of our Lemma 1.3 and Theorem 2.1.

###### Proposition 2.2.

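Let $\mathcal{A}$ be a finite set of symmetric matrices with zero diagonal. It holds in the Ising model under Dobrushin’s condition and $\max_i |h_i| \le \alpha$ that for any

$$\mathbb{P}\Big(\sup_{A \in \mathcal{A}}\sigma^\top A\sigma - \mathbb{E}\sup_{A \in \mathcal{A}}\sigma^\top A\sigma \ge t\Big) \le \exp\left(-c\min\left(\frac{t^2}{\big(\mathbb{E}\sup_{A \in \mathcal{A}}\|A\sigma\| + \sup_{A \in \mathcal{A}}\|A\|\big)^2}, \frac{t}{\sup_{A \in \mathcal{A}}\|A\|}\right)\right), \tag{2.5}$$

where $c > 0$ depends only on $\alpha$ and $\rho$.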

###### Proof.

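Let $\sigma'_i$ be such that, given all but the $i$-th element, the variables $\sigma_i$ and $\sigma'_i$ are independent and are distributed according to $\pi(\cdot \mid \sigma_1, \dots, \sigma_{i-1}, \sigma_{i+1}, \dots)$; we write $\sigma'^{(i)}$ for the vector $\sigma$ with its $i$-th coordinate replaced by $\sigma'_i$. Obviously, we may have all $\sigma$ and $\sigma'_i$ defined on the same discrete probability space, and thus we will use the notation $\pi(\cdot)$ and $\pi(\cdot \mid \cdot)$ for the distribution and the conditional distribution. Then, we have

$$\begin{aligned}
\mathbb{E}|df|^2(\sigma) &= \frac{1}{2}\sum_{i=1}^{n}\mathbb{E}\big(f(\sigma) - f(T_i\sigma)\big)^2\,\pi(-\sigma_i \mid \sigma_1, \dots, \sigma_{i-1}, \sigma_{i+1}, \dots)\\
&= \sum_{i=1}^{n}\sum_{\sigma \in \{-1,1\}^n}\pi(\sigma)\sum_{\sigma'_i \in \{-1,1\}}\big(f(\sigma) - f(\sigma'^{(i)})\big)_+^2\,\pi(\sigma'_i \mid \sigma_1, \dots, \sigma_{i-1}, \sigma_{i+1}, \dots),
\end{aligned}$$

where we switched from $\frac{1}{2}(\cdot)^2$ to $(\cdot)_+^2$ due to the symmetry between $\sigma_i$ and $\sigma'_i$.

Observe that, denoting $\sigma_{-i} = (\sigma_1, \dots, \sigma_{i-1}, \sigma_{i+1}, \dots, \sigma_n)$ for short and using the independence of $\sigma_i$ and $\sigma'_i$ given $\sigma_{-i}$, we have $\pi(\sigma_i \mid \sigma_{-i})\,\pi(\sigma'_i \mid \sigma_{-i}) = \pi(\sigma_i, \sigma'_i \mid \sigma_{-i})$, and therefore by the chain rule,

$$\begin{aligned}
\pi(\sigma)\,\pi(\sigma'_i \mid \sigma_1, \dots, \sigma_{i-1}, \sigma_{i+1}, \dots) &= \pi(\sigma_{-i})\,\pi(\sigma_i \mid \sigma_{-i})\,\pi(\sigma'_i \mid \sigma_{-i})\\
&= \pi(\sigma_{-i})\,\pi(\sigma_i, \sigma'_i \mid \sigma_{-i}) = \pi(\sigma'_i, \sigma_i, \sigma_{-i}).
\end{aligned}$$

Finally, we get

$$\mathbb{E}|df|^2(\sigma) = \sum_{i=1}^{n}\sum_{(\sigma, \sigma'_i) \in \{-1,1\}^{n+1}}\big(f(\sigma) - f(\sigma'^{(i)})\big)_+^2\,\pi(\sigma, \sigma'_i) = \sum_{i=1}^{n}\mathbb{E}\big(f(\sigma) - f(\sigma'^{(i)})\big)_+^2.$$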

Now we want to consider the function

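$$Z = \sup_{A \in \mathcal{A}}\sigma^\top A\sigma, \tag{2.6}$$

where $\mathcal{A}$ is a given set of symmetric matrices with zero diagonal (the diagonal is not important here, since $\sigma_i^2 = 1$). Applying Theorem 2.1 to $f = e^{\lambda Z/2}$ with $\lambda > 0$, we have

$$\begin{aligned}
\mathbb{E}|df|^2(\sigma) &= \mathbb{E}\sum_{i=1}^{n}\big(f(\sigma) - f(\sigma'^{(i)})\big)_+^2 = \mathbb{E}\,e^{\lambda Z}\sum_{i=1}^{n}\Big(1 - e^{-\lambda(Z(\sigma) - Z(\sigma'^{(i)}))/2}\Big)_+^2\\
&\le \frac{\lambda^2}{4}\,\mathbb{E}\,e^{\lambda Z}\sum_{i=1}^{n}\big(Z - Z(\sigma'^{(i)})\big)_+^2,
\end{aligned}$$

where for $\tilde{A}$ being the maximizer of (2.6) we have,

$$\begin{aligned}
\sum_{i=1}^{n}\big(Z - Z(\sigma'^{(i)})\big)_+^2 &\le \sum_{i=1}^{n}\Big(\sigma^\top\tilde{A}\sigma - [\sigma'^{(i)}]^\top\tilde{A}\sigma'^{(i)}\Big)_+^2 = \sum_{i=1}^{n}\Big(2(\sigma_i - \sigma'_i)\sum_{j=1}^{n}\tilde{A}_{ij}\sigma_j\Big)_+^2\\
&\le 16\sup_{A \in \mathcal{A}}\|A\sigma\|^2.
\end{aligned}$$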
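The single-coordinate identity used in the last display can be verified exhaustively for a small matrix. The sketch below checks that $\sigma^\top A\sigma - [\sigma'^{(i)}]^\top A\sigma'^{(i)} = 2(\sigma_i - \sigma'_i)(A\sigma)_i$ for a symmetric matrix with zero diagonal, and that the resulting sum is at most $16\|A\sigma\|^2$ in the worst case over $\sigma'_i$. The matrix `A` and the function names are hypothetical; integer entries keep the comparison exact.

```python
from itertools import product

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def quad_form(A, x):
    return sum(xi * yi for xi, yi in zip(x, matvec(A, x)))

def flip_identity_holds(A):
    """Check s^T A s - s'^T A s' == 2 (s_i - s'_i) (A s)_i
    whenever s' differs from s only in coordinate i."""
    n = len(A)
    for s in product([-1, 1], repeat=n):
        As = matvec(A, s)
        for i in range(n):
            for new in (-1, 1):
                s_new = s[:i] + (new,) + s[i + 1:]
                lhs = quad_form(A, s) - quad_form(A, s_new)
                rhs = 2 * (s[i] - new) * As[i]
                if lhs != rhs:
                    return False
    return True

def worst_case_sum(A, s):
    """sum_i of the positive part squared, maximised over s'_i in {-1, 1}."""
    As = matvec(A, s)
    return sum(max(max(2 * (s[i] - new) * As[i], 0) ** 2 for new in (-1, 1))
               for i in range(len(s)))

# Hypothetical symmetric integer matrix with zero diagonal.
A = [[0, 2, -1], [2, 0, 3], [-1, 3, 0]]
all_signs = list(product([-1, 1], repeat=3))
```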

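Note that concentration for $\sup_{A \in \mathcal{A}}\|A\sigma\|$ is implied by the same result. Indeed, we have

$$\begin{aligned}
\sum_{i=1}^{n}\Big(\sup_{A \in \mathcal{A},\,\gamma \in S^{n-1}}\gamma^\top A\sigma - \sup_{A \in \mathcal{A},\,\gamma \in S^{n-1}}\gamma^\top A\sigma'^{(i)}\Big)_+^2 &\le \sum_{i=1}^{n}\big(\tilde{w}^\top\sigma - \tilde{w}^\top\sigma'^{(i)}\big)_+^2\\
&= \sum_{i=1}^{n}\big(\tilde{w}_i(\sigma_i - \sigma'_i)\big)_+^2 \le 4\sup_{A}\|A\|^2,
\end{aligned}$$

where $\tilde{w}$ is such that $\sup_{A \in \mathcal{A},\,\gamma \in S^{n-1}}\gamma^\top A\sigma = \tilde{w}^\top\sigma$. Thus, the expectation of the corresponding difference operator is bounded by $4\sup_{A}\|A\|^2$, so that, due to the standard Herbst argument, Theorem 2.1 implies that $\sup_{A \in \mathcal{A}}\|A\sigma\|$ concentrates around its expectation with subgaussian deviations of order $\sup_{A}\|A\|$.

To sum up, by Theorem 2.1 it holds, with $C$ the constant from that theorem,

$$\operatorname{Ent}(e^{\lambda Z}) \le 8C\lambda^2\,\mathbb{E}\Big[\sup_{A \in \mathcal{A}}\|A\sigma\|^2\,e^{\lambda Z}\Big].$$

It is left to apply Lemma 1.3, which brings us to a uniform Hanson-Wright-type concentration bound for the Ising model

$$\mathbb{P}\Big(\sup_{A}\sigma^\top A\sigma - \mathbb{E}\sup_{A}\sigma^\top A\sigma > C\big(\sqrt{t}\,\mathbb{E}\sup_{A}\|A\sigma\| + (\sqrt{t} + t)\sup_{A}\|A\|\big)\Big) \le e^{-t}, \tag{2.7}$$

where $C$ only depends on the constant from Theorem 2.1. The claim follows. ∎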

###### Remark 2.4.

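In the case when $\mathcal{A} = \{A\}$ our result implies the upper tail of the recent concentration inequality proved in Adamczak et al. (2018a) (see Theorem 2.2 and Example 2.5). To show this fact (denoting $\bar{\sigma} = \sigma - \mathbb{E}\sigma$) we observe that

$$\mathbb{E}\|A\sigma\| \le \mathbb{E}\|A\bar{\sigma}\| + \|A\,\mathbb{E}\sigma\| = \mathbb{E}\|A\bar{\sigma}\| + \left(\sum_{i=1}^{n}\Big(\sum_{j=1}^{n}A_{i,j}\,\mathbb{E}\sigma_j\Big)^2\right)^{1/2}.$$

Now, it is well known that the logarithmic Sobolev inequality implies Poincaré’s inequality and therefore,

$$\|\mathbb{E}\bar{\sigma}\bar{\sigma}^\top\| = \sup_{u \in S^{n-1}}\operatorname{Var}(u^\top\bar{\sigma}) \le \big(c(\alpha, \rho)/2\big)\sup_{u \in S^{n-1}}4\|u\|^2 = 2c(\alpha, \rho).$$

This implies,

$$\mathbb{E}\|A\bar{\sigma}\|^2 = \mathbb{E}\operatorname{Tr}(A^2\bar{\sigma}\bar{\sigma}^\top) \le \|A\|_{HS}^2\,\|\mathbb{E}\bar{\sigma}\bar{\sigma}^\top\| \le 2c(\alpha, \rho)\|A\|_{HS}^2,$$

where we used that $\operatorname{Tr}(BC) \le \operatorname{Tr}(B)\|C\|$, which holds for any symmetric and nonnegative definite matrices $B$ and $C$. Finally,

$$\mathbb{E}\|A\sigma\| \le C(\alpha, \rho)\|A\|_{HS} + \left(\sum_{i=1}^{n}\Big(\sum_{j=1}^{n}A_{i,j}\,\mathbb{E}\sigma_j\Big)^2\right)^{1/2}.$$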

The right-hand side term appears instead of in Example 2.5 mentioned above.

#### Replacing $\|\max_i |X_i|\|_{\psi_2}$ with $\max_i \|X_i\|_{\psi_2}$ in Theorem 1.1

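Here we show that it is essentially not possible in general to substitute $\|\max_i |X_i|\|_{\psi_2}$ with $\max_i \|X_i\|_{\psi_2}$ in Theorem 1.1 by presenting a concrete counterexample, which was kindly suggested by Radosław Adamczak. Suppose the opposite, that there is an absolute constant $C$ such that for any set of matrices $\mathcal{A}$ and any subgaussian random variables $X_1, \dots, X_n$ it holds with probability at least $1 - e^{-t}$,

$$Z \le C\Big(\mathbb{E}Z + \max_i\|X_i\|_{\psi_2}\sqrt{t}\,\mathbb{E}\sup_{A}\|AX\| + \max_i\|X_i\|_{\psi_2}^2\sup_{A}\|A\|\,t\Big), \tag{2.8}$$

which implies with some other constant $C'$

$$\mathbb{E}^{1/2}Z^2 \le C'\Big(\mathbb{E}Z + \max_i\|X_i\|_{\psi_2}\,\mathbb{E}\sup_{A}\|AX\| + \max_i\|X_i\|_{\psi_2}^2\sup_{A}\|A\|\Big).$$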

Notice, that here we also allow a constant in front of the expectation.

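Let us take $\mathcal{A} = \{A_1, \dots, A_n\}$ with $A_i$ having only one nonzero element $(A_i)_{ii} = 1$. For simplicity take i.i.d. $X_1, \dots, X_n$ with $\mathbb{E}X_i^2 = 1$, so that

$$Z = \max_{i \le n}(X_i^2 - 1), \qquad \sup_{A}\|AX\| = \max_{i \le n}|X_i|, \qquad \sup_{A}\|A\| = 1.$$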
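The three identities in the last display can be confirmed on a concrete vector. The following sketch builds the family of single-entry matrices and evaluates both suprema; the vector `x` and the helper names are assumptions for illustration, and the centering constant $1$ corresponds to the normalisation $\mathbb{E}X_i^2 = 1$.

```python
import math

def single_entry_family(n):
    """A_i has exactly one nonzero entry, (A_i)_{ii} = 1."""
    return [[[1.0 if (r == i and c == i) else 0.0 for c in range(n)]
             for r in range(n)] for i in range(n)]

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def centered_sup(family, x):
    """sup_A (x^T A x - 1); here E X^T A_i X = E X_i^2 = 1 by assumption."""
    return max(sum(xi * yi for xi, yi in zip(x, matvec(A, x))) - 1.0 for A in family)

def sup_image_norm(family, x):
    """sup_A ||A x|| in the Euclidean norm."""
    return max(math.sqrt(sum(v * v for v in matvec(A, x))) for A in family)

x = [0.5, -2.0, 1.5]                  # hypothetical realisation of X
family = single_entry_family(len(x))

Z_value = centered_sup(family, x)     # equals max_i (x_i^2 - 1)
sup_norm = sup_image_norm(family, x)  # equals max_i |x_i|
```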

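Then, assuming, say, $\max_i\|X_i\|_{\psi_2} \le 4$, we have

$$\Big\|\max_i X_i^2 - 1\Big\|_{L_2} \le C'\Big(\mathbb{E}\max_i(X_i^2 - 1) + 4\,\mathbb{E}\max_i|X_i| + 16\Big),$$

which since $\|\max_i X_i^2\|_{L_1} \ge 1$ implies

$$\Big\|\max_i X_i^2\Big\|_{L_2} \le 1 + C'\Big(\Big\|\max_i X_i^2\Big\|_{L_1} + 4\,\mathbb{E}\max_i|X_i| + 15\Big) \le (1 + 20C')\Big\|\max_i X_i^2\Big\|_{L_1}.$$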

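Note that this inequality also holds if we rescale $X_i$ by an arbitrary positive constant. Therefore, if we have a moment equivalence $\|X_1\|_{\psi_2} \le 4\|X_1\|_{L_2}$, we can always rescale to have $\mathbb{E}X_1^2 = 1$ and $\|X_1\|_{\psi_2} \le 4$, so that the above inequality holds.

Taking the latter into account, we conclude that there is a constant $D$ such that if a centred random variable $X_1$ satisfies $\|X_1\|_{\psi_2} \le 4\|X_1\|_{L_2}$, then for any $n$ and i.i.d. copies $X_1, \dots, X_n$ the following holds,

$$\Big\|\max_{i \le n}X_i^2\Big\|_{L_2} \le D\,\Big\|\max_{i \le n}X_i^2\Big\|_{L_1}. \tag{2.9}$$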

It is known that such hypercontractivity of maxima implies certain regularity of tails of the distribution of . In this case by Theorem 4.6 in Hitczenko et al. (1998) for any there is another constant such that for all it holds,

$$A^q\,\mathbb{P}(X_1^2 > At) \le \varepsilon\,\mathbb{P}(X_1^2 > t),$$

so that in our case of and and taking , there is such that for all it holds

$$\mathbb{P}(X_1^2 > At) \le \frac{1}{A}\,\mathbb{P}(X_1^2 > t). \tag{2.10}$$

The latter does not have to hold for any subgaussian random variable . For instance, taking a symmetric random variable with and for we have which implies . Moreover, for we also have thus and the conditions of (2.9) are satisfied. But for large enough for , we have

$$\mathbb{P}(X_1^2 > At) = \mathbb{P}(X_1^2 > t) = e^{-r},$$

therefore breaking the tail regularity (2.10). Thus, it is impossible to establish an inequality of the form (2.8). We also note that one can prove via direct computations that (2.9) may fail for the random variable defined above.

By the same reason it is not possible to replace