
# Introduction to the non-asymptotic analysis of random matrices

Roman Vershynin (partially supported by NSF grants FRG DMS 0918623, DMS 1001829)
University of Michigan
romanv@umich.edu
Chapter 5 of: Compressed Sensing, Theory and Applications. Edited by Y. Eldar and G. Kutyniok. Cambridge University Press, 2012. pp. 210–268.

August 11, 2010; final revision November 23, 2011


This is a tutorial on some basic non-asymptotic methods and concepts in random matrix theory. The reader will learn several tools for the analysis of the extreme singular values of random matrices with independent rows or columns. Many of these methods sprung off from the development of geometric functional analysis since the 1970’s. They have applications in several fields, most notably in theoretical computer science, statistics and signal processing. A few basic applications are covered in this text, particularly for the problem of estimating covariance matrices in statistics and for validating probabilistic constructions of measurement matrices in compressed sensing. These notes are written particularly for graduate students and beginning researchers in different areas, including functional analysts, probabilists, theoretical statisticians, electrical engineers, and theoretical computer scientists.

## 5.1 Introduction

#### Asymptotic and non-asymptotic regimes

Random matrix theory studies properties of N × n matrices A chosen from some distribution on the set of all matrices. As the dimensions N and n grow to infinity, one observes that the spectrum of A tends to stabilize. This is manifested in several limit laws, which may be regarded as random matrix versions of the central limit theorem. Among them is Wigner’s semicircle law for the eigenvalues of symmetric Gaussian matrices, the circular law for Gaussian matrices, the Marchenko-Pastur law for Wishart matrices A∗A where A is a Gaussian matrix, and the Bai-Yin and Tracy-Widom laws for the extreme eigenvalues of Wishart matrices A∗A. The books [51, 5, 23, 6] offer a thorough introduction to the classical problems of random matrix theory and its fascinating connections.

The asymptotic regime, where the dimensions N, n → ∞, is well suited for the purposes of statistical physics, e.g. when random matrices serve as finite-dimensional models of infinite-dimensional operators. But in some other areas including statistics, geometric functional analysis, and compressed sensing, the limiting regime may not be very useful [69]. Suppose, for example, that we ask about the largest singular value smax(A) (i.e. the square root of the largest eigenvalue of A∗A); to be specific, assume that A is an n × n matrix whose entries are independent standard normal random variables. The asymptotic random matrix theory answers this question as follows: the Bai-Yin law (see Theorem 5.31) states that

 smax(A)/2√n → 1  almost surely

as the dimension n → ∞. Moreover, the limiting distribution of smax(A) is known to be the Tracy-Widom law (see [71, 27]). In contrast to this, a non-asymptotic answer to the same question is the following: in every dimension n, one has

 smax(A) ≤ C√n  with probability at least 1 − e−n,

where C is an absolute constant (see Theorems 5.32 and 5.39). The latter answer is less precise (because of the absolute constant C) but more quantitative because for fixed dimensions n it gives an exponential probability of success. (For this specific model of Gaussian matrices, Theorems 5.32 and 5.35 even give a sharp absolute constant here. But the result mentioned here is much more general, as we will see later; it only requires independence of the rows or columns of A.) This is the kind of answer we will seek in this text – guarantees up to absolute constants in all dimensions, and with large probability.

#### Tall matrices are approximate isometries

The following heuristic will be our guideline: tall random matrices should act as approximate isometries. So, an N × n random matrix A with N ≫ n should act almost like an isometric embedding of ℓn2 into ℓN2:

 (1 − δ)K∥x∥2 ≤ ∥Ax∥2 ≤ (1 + δ)K∥x∥2  for all x ∈ Rn

where K is an appropriate normalization factor and δ is small. Equivalently, this says that all the singular values of A are close to each other:

 (1−δ)K≤smin(A)≤smax(A)≤(1+δ)K,

where smin(A) and smax(A) denote the smallest and the largest singular values of A. Yet equivalently, this means that tall matrices are well conditioned: the condition number of A is κ(A) = smax(A)/smin(A) ≤ (1 + δ)/(1 − δ) ≈ 1.

In the asymptotic regime and for random matrices with independent entries, our heuristic is justified by the Bai-Yin law, which is Theorem 5.31 below. Loosely speaking, it states that as the dimensions N and n increase to infinity while the aspect ratio n/N is fixed, we have

 √N − √n ≈ smin(A) ≤ smax(A) ≈ √N + √n. (5.1)

In these notes, we study random matrices with independent rows or independent columns, but not necessarily independent entries. We develop non-asymptotic versions of (5.1) for such matrices, which should hold for all dimensions N and n. The desired results should have the form

 √N − C√n ≤ smin(A) ≤ smax(A) ≤ √N + C√n (5.2)

with large probability, e.g. 1 − e−N, where C is an absolute constant. (More accurately, we should expect C to depend on easily computable quantities of the distribution, such as its moments.) For tall matrices, where N ≫ n, both sides of this inequality would be close to each other, which would guarantee that A is an approximate isometry.
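
As a quick numerical sanity check of (5.2) (a sketch only; the matrix size, seed, and slack constants below are arbitrary choices, not part of the theory), one can sample a tall Gaussian matrix and compare its extreme singular values with √N ± √n:

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 2000, 50  # tall matrix: N >> n
A = rng.standard_normal((N, n))

# Extreme singular values of the sample
s = np.linalg.svd(A, compute_uv=False)
s_max, s_min = s[0], s[-1]

# Bai-Yin-type prediction: sqrt(N) - sqrt(n) <~ s_min <= s_max <~ sqrt(N) + sqrt(n)
lo, hi = np.sqrt(N) - np.sqrt(n), np.sqrt(N) + np.sqrt(n)
print(f"s_min = {s_min:.2f}, s_max = {s_max:.2f}, predicted range [{lo:.2f}, {hi:.2f}]")
```

Since N/n = 40 here, the ratio smax/smin comes out close to 1, illustrating that a tall Gaussian matrix is an approximate isometry after dividing by √N.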

#### Models and methods

We shall study quite general models of random matrices – those with independent rows or independent columns that are sampled from high-dimensional distributions. We will place either strong moment assumptions on the distribution (sub-gaussian growth of moments), or no moment assumptions at all (except finite variance). This leads us to four types of main results:

1. Matrices with independent sub-gaussian rows: Theorem 5.39

2. Matrices with independent heavy-tailed rows: Theorem 5.41

3. Matrices with independent sub-gaussian columns: Theorem 5.58

4. Matrices with independent heavy-tailed columns: Theorem 5.62

These four models cover many natural classes of random matrices that occur in applications, including random matrices with independent entries (Gaussian and Bernoulli in particular) and random sub-matrices of orthogonal matrices (random Fourier matrices in particular).

The analysis of these four models is based on a variety of tools of probability theory and geometric functional analysis, most of which have not been covered in the texts on the “classical” random matrix theory. The reader will learn basics on sub-gaussian and sub-exponential random variables, isotropic random vectors, large deviation inequalities for sums of independent random variables, extensions of these inequalities to random matrices, and several basic methods of high-dimensional probability such as symmetrization, decoupling, and covering (ε-net) arguments.

#### Applications

In these notes we shall emphasize two applications, one in statistics and one in compressed sensing. Our analysis of random matrices with independent rows immediately applies to a basic problem in statistics – estimating covariance matrices of high-dimensional distributions. If a random matrix A has i.i.d. rows Ai, then A∗A/N = (1/N) ∑i Ai ⊗ Ai is the sample covariance matrix. If A has independent columns Aj, then A∗A = (⟨Aj, Ak⟩) is the Gram matrix. Thus our analysis of the row-independent and column-independent models can be interpreted as a study of sample covariance matrices and Gram matrices of high-dimensional distributions. We will see in Section 5.4.3 that for a general distribution in Rn, its covariance matrix can be estimated from a sample of size N = O(n log n) drawn from the distribution. Moreover, for sub-gaussian distributions we have an even better bound N = O(n). For low-dimensional distributions, much fewer samples are needed – if a distribution lies close to a subspace of dimension r in Rn, then a sample of size N = O(r log n) is sufficient for covariance estimation.

In compressed sensing, the best known measurement matrices are random. A sufficient condition for a matrix to succeed for the purposes of compressed sensing is given by the restricted isometry property. Loosely speaking, this property demands that all sub-matrices of given size be well-conditioned. This fits well in the circle of problems of the non-asymptotic random matrix theory. Indeed, we will see in Section 5.6 that all basic models of random matrices are nice restricted isometries. These include Gaussian and Bernoulli matrices, more generally all matrices with sub-gaussian independent entries, and even more generally all matrices with sub-gaussian independent rows or columns. Also, the class of restricted isometries includes random Fourier matrices, more generally random sub-matrices of bounded orthogonal matrices, and even more generally matrices whose rows are independent samples from an isotropic distribution with uniformly bounded coordinates.

#### Related sources

This text is a tutorial rather than a survey, so we focus on explaining methods rather than results. This forces us to make some concessions in our choice of the subjects. Concentration of measure and its applications to random matrix theory are only briefly mentioned. For an introduction into concentration of measure suitable for a beginner, see [9] and [49, Chapter 14]; for a thorough exposition see [56, 43]; for connections with random matrices see [21, 44]. The monograph [45] also offers an introduction into concentration of measure and related probabilistic methods in analysis and geometry, some of which we shall use in these notes.

We completely avoid the important (but more difficult) model of symmetric random matrices with independent entries on and above the diagonal. Starting from the work of Füredi and Komlos [29], the largest singular value (the spectral norm) of symmetric random matrices has been a subject of study in many works; see e.g. [50, 83, 58] and the references therein.

We also did not even attempt to discuss sharp small deviation inequalities (of Tracy-Widom type) for the extreme eigenvalues. Both these topics and much more are discussed in the surveys [21, 44, 69], which serve as bridges between asymptotic and non-asymptotic problems in random matrix theory.

Because of the absolute constant C in (5.2), our analysis of the smallest singular value (the “hard edge”) will only be useful for sufficiently tall matrices, where N ≫ n. For square and almost square matrices, the hard edge problem will be only briefly mentioned in Section 5.3. The surveys [76, 69] discuss this problem at length, and they offer a glimpse of connections to other problems of random matrix theory and additive combinatorics.

Many of the results and methods presented in these notes are known in one form or another. Some of them are published while some others belong to the folklore of probability in Banach spaces, geometric functional analysis, and related areas. When available, historic references are given in Section 5.7.

#### Acknowledgements

The author is grateful to the colleagues who made a number of improving suggestions for the earlier versions of the manuscript, in particular to Richard Chen, Subhroshekhar Ghosh, Alexander Litvak, Deanna Needell, Holger Rauhut, S V N Vishwanathan and the anonymous referees. Special thanks are due to Ulas Ayaz and Felix Krahmer who thoroughly read the entire text, and whose numerous comments led to significant improvements of this tutorial.

## 5.2 Preliminaries

### 5.2.1 Matrices and their singular values

The main object of our study will be an N × n matrix A with real or complex entries. We shall state all results in the real case; the reader will be able to adjust them to the complex case as well. Usually but not always one should think of tall matrices A, those for which N ≥ n. By passing to the adjoint matrix A∗, many results can be carried over to “flat” matrices, those for which N ≤ n.

It is often convenient to study A through the n × n symmetric positive-semidefinite matrix A∗A. The eigenvalues of |A| := √(A∗A) are therefore non-negative real numbers. Arranged in a non-increasing order, they are called the singular values (in the literature, singular values are also called s-numbers) of A and denoted s1(A) ≥ ⋯ ≥ sn(A) ≥ 0. Many applications require estimates on the extreme singular values

 smax(A) := s1(A),  smin(A) := sn(A).

The smallest singular value is only of interest for tall matrices, since for N < n one automatically has smin(A) = 0.

Equivalently, smax(A) and smin(A) are respectively the smallest number M and the largest number m such that

 m∥x∥2 ≤ ∥Ax∥2 ≤ M∥x∥2  for all x ∈ Rn. (5.3)

In order to interpret this definition geometrically, we look at A as a linear operator from Rn into RN. The Euclidean distance between any two points in Rn can increase by at most the factor smax(A) and decrease by at most the factor smin(A) under the action of A. Therefore, the extreme singular values control the distortion of the Euclidean geometry under the action of A. If smax(A) ≈ smin(A) ≈ 1 then A acts as an approximate isometry, or more accurately an approximate isometric embedding of ℓn2 into ℓN2.

The extreme singular values can also be described in terms of the spectral norm of A, which is by definition

 ∥A∥ = ∥A∥ℓn2→ℓN2 = sup_{x∈Rn∖{0}} ∥Ax∥2/∥x∥2 = sup_{x∈Sn−1} ∥Ax∥2. (5.4)

(5.3) gives a link between the extreme singular values and the spectral norm:

 smax(A) = ∥A∥,  smin(A) = 1/∥A†∥

where A† denotes the pseudoinverse of A; if A is invertible then A† = A−1.
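
Both identities are straightforward to verify numerically (a minimal sketch with an arbitrary Gaussian test matrix; numpy's `norm(A, 2)` returns the spectral norm):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 5))  # a small tall matrix

s = np.linalg.svd(A, compute_uv=False)  # singular values, non-increasing
s_max, s_min = s[0], s[-1]

# s_max(A) equals the spectral norm ||A||
print(np.isclose(s_max, np.linalg.norm(A, 2)))  # True

# s_min(A) equals 1/||A_dagger||, with A_dagger the Moore-Penrose pseudoinverse
A_dag = np.linalg.pinv(A)
print(np.isclose(s_min, 1 / np.linalg.norm(A_dag, 2)))  # True
```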

### 5.2.2 Nets

Nets are a convenient means to discretize compact sets. In our study we will mostly need to discretize the unit Euclidean sphere Sn−1 in the definition of the spectral norm (5.4). Let us first recall a general definition of an ε-net.

###### Definition 5.1 (Nets, covering numbers).

Let (X, d) be a metric space and let ε > 0. A subset Nε of X is called an ε-net of X if every point x ∈ X can be approximated to within ε by some point y ∈ Nε, i.e. so that d(x, y) ≤ ε. The minimal cardinality of an ε-net of X, if finite, is denoted N(X, ε) and is called the covering number of X (at scale ε). (Equivalently, N(X, ε) is the minimal number of balls with radii ε and with centers in X needed to cover X.)

From a characterization of compactness we remember that X is compact if and only if N(X, ε) < ∞ for each ε > 0. A quantitative estimate on N(X, ε) would give us a quantitative version of compactness of X. (In statistical learning theory and geometric functional analysis, log N(X, ε) is called the metric entropy of X. In some sense it measures the “complexity” of the metric space X.) Let us therefore take a simple example of a metric space, the unit Euclidean sphere Sn−1 equipped with the Euclidean metric d(x, y) = ∥x − y∥2 (a similar result holds for the geodesic metric on the sphere, since for small ε these two distances are equivalent), and estimate its covering numbers.

###### Lemma 5.2 (Covering numbers of the sphere).

The unit Euclidean sphere Sn−1 equipped with the Euclidean metric satisfies for every ε > 0 that

 N(Sn−1, ε) ≤ (1 + 2/ε)n.
###### Proof.

This is a simple volume argument. Let us fix ε > 0 and choose Nε to be a maximal ε-separated subset of Sn−1. In other words, Nε is such that d(x, y) ≥ ε for all x, y ∈ Nε, x ≠ y, and no subset of Sn−1 containing Nε has this property. (One can in fact construct Nε inductively by first selecting an arbitrary point on the sphere, and at each next step selecting a point that is at distance at least ε from those already selected. By compactness, this algorithm will terminate after finitely many steps and it will yield a set Nε as we required.)

The maximality property implies that Nε is an ε-net of Sn−1. Indeed, otherwise there would exist x ∈ Sn−1 that is at least ε-far from all points in Nε. So Nε ∪ {x} would still be an ε-separated set, contradicting the maximality property.

Moreover, the separation property implies via the triangle inequality that the balls of radii ε/2 centered at the points in Nε are disjoint. On the other hand, all such balls lie in (1 + ε/2)Bn2, where Bn2 denotes the unit Euclidean ball centered at the origin. Comparing the volumes gives |Nε| · (ε/2)n ≤ (1 + ε/2)n. Since vol(rBn2) = rn vol(Bn2) for all r > 0, we conclude that |Nε| ≤ (1 + ε/2)n/(ε/2)n = (1 + 2/ε)n as required. ∎
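
The greedy construction from the footnote is easy to run numerically. The sketch below (sample size, dimension, and ε are arbitrary choices) builds a maximal ε-separated subset of a finite sample of points on S² and checks both conclusions: the selected set is an ε-net of the sample, and its cardinality respects the bound (1 + 2/ε)ⁿ:

```python
import numpy as np

rng = np.random.default_rng(2)
n, eps = 3, 0.5

# Sample many points on S^{n-1} as stand-ins for the whole sphere.
pts = rng.standard_normal((5000, n))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)

# Greedy construction: keep a point if it is eps-separated
# from all points selected so far.
net = []
for p in pts:
    if all(np.linalg.norm(p - q) >= eps for q in net):
        net.append(p)
net = np.array(net)

# Maximality => every sampled point is within eps of the net,
# and the volume bound gives |net| <= (1 + 2/eps)^n.
dists = np.linalg.norm(pts[:, None, :] - net[None, :, :], axis=2).min(axis=1)
print(len(net), dists.max() < eps, len(net) <= (1 + 2 / eps) ** n)
```

The last two printed values are True: every point that was *not* selected was rejected precisely because it lies within ε of an earlier selection.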

Nets allow us to reduce the complexity of computations with linear operators. One such example is the computation of the spectral norm. To evaluate the spectral norm by definition (5.4) one needs to take the supremum over the whole sphere . However, one can essentially replace the sphere by its -net:

###### Lemma 5.3 (Computing the spectral norm on a net).

Let A be an N × n matrix, and let Nε be an ε-net of Sn−1 for some ε ∈ [0, 1). Then

 max_{x∈Nε} ∥Ax∥2 ≤ ∥A∥ ≤ (1 − ε)−1 max_{x∈Nε} ∥Ax∥2
###### Proof.

The lower bound in the conclusion follows from the definition. To prove the upper bound let us fix x ∈ Sn−1 for which ∥A∥ = ∥Ax∥2, and choose y ∈ Nε which approximates x as ∥x − y∥2 ≤ ε. By the triangle inequality we have ∥Ax − Ay∥2 ≤ ∥A∥∥x − y∥2 ≤ ε∥A∥. It follows that

 ∥Ay∥2≥∥Ax∥2−∥Ax−Ay∥2≥∥A∥−ε∥A∥=(1−ε)∥A∥.

Taking the maximum over all y ∈ Nε in this inequality, we complete the proof. ∎
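
Lemma 5.3 can be checked directly in the plane, where an ε-net of the circle S¹ is just a set of equally spaced points (a sketch; the matrix and ε are arbitrary, and the net property follows since chord length is at most arc length):

```python
import numpy as np

rng = np.random.default_rng(3)
N, n, eps = 6, 2, 0.25
A = rng.standard_normal((N, n))
true_norm = np.linalg.norm(A, 2)  # spectral norm ||A||

# An eps-net of S^1: points with angular spacing at most eps
# (so every point of the circle is within chord distance < eps of the net).
m = int(np.ceil(2 * np.pi / eps))
angles = 2 * np.pi * np.arange(m) / m
net = np.stack([np.cos(angles), np.sin(angles)], axis=1)

net_max = max(np.linalg.norm(A @ x) for x in net)

# Lemma 5.3: net_max <= ||A|| <= (1 - eps)^{-1} * net_max
print(net_max <= true_norm + 1e-12, true_norm <= net_max / (1 - eps))
```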

A similar result holds for symmetric n × n matrices A, whose spectral norm can be computed via the associated quadratic form: ∥A∥ = sup_{x∈Sn−1} |⟨Ax, x⟩|. Again, one can essentially replace the sphere by its ε-net:

###### Lemma 5.4 (Computing the spectral norm on a net).

Let A be a symmetric n × n matrix, and let Nε be an ε-net of Sn−1 for some ε ∈ [0, 1/2). Then

 ∥A∥ = sup_{x∈Sn−1} |⟨Ax, x⟩| ≤ (1 − 2ε)−1 sup_{x∈Nε} |⟨Ax, x⟩|.
###### Proof.

Let us choose x ∈ Sn−1 for which ∥A∥ = |⟨Ax, x⟩|, and choose y ∈ Nε which approximates x as ∥x − y∥2 ≤ ε. By the triangle inequality we have

 |⟨Ax, x⟩ − ⟨Ay, y⟩| = |⟨Ax, x − y⟩ + ⟨A(x − y), y⟩| ≤ ∥A∥∥x∥2∥x − y∥2 + ∥A∥∥x − y∥2∥y∥2 ≤ 2ε∥A∥.

It follows that |⟨Ay, y⟩| ≥ |⟨Ax, x⟩| − 2ε∥A∥ = (1 − 2ε)∥A∥. Taking the maximum over all y ∈ Nε in this inequality completes the proof. ∎

### 5.2.3 Sub-gaussian random variables

In this section we introduce the class of sub-gaussian random variables, those whose distributions are dominated by the distribution of a centered gaussian random variable. (It would be more rigorous to say that we study sub-gaussian probability distributions. The same concerns some other properties of random variables and random vectors we study later in this text. However, it is convenient for us to focus on random variables and vectors because we will form random matrices out of them.) This is a convenient and quite wide class, which contains in particular the standard normal and all bounded random variables.

Let us briefly recall some of the well known properties of the standard normal random variable X. Its distribution has density f(x) = (1/√2π) e−x²/2 and is denoted N(0, 1). Estimating the integral of this density between t and ∞, one checks that the tail of a standard normal random variable decays super-exponentially:

 P{|X| > t} = (2/√2π) ∫t∞ e−x²/2 dx ≤ 2e−t²/2,  t ≥ 1, (5.5)

see e.g. [26, Theorem 1.4] for a more precise two-sided inequality. The absolute moments of X can be computed as

 (E|X|p)1/p = √2 [Γ((1 + p)/2)/Γ(1/2)]1/p = O(√p),  p ≥ 1. (5.6)

The moment generating function of X equals

 E exp(tX) = et²/2,  t ∈ R. (5.7)

Now let X be a general random variable. We observe that these three properties are equivalent – a super-exponential tail decay like in (5.5), the moment growth (5.6), and the growth of the moment generating function like in (5.7). We will then focus on the class of random variables that satisfy these properties, which we shall call sub-gaussian random variables.

###### Lemma 5.5 (Equivalence of sub-gaussian properties).

Let X be a random variable. Then the following properties are equivalent with parameters Ki > 0 differing from each other by at most an absolute constant factor. (The precise meaning of this equivalence is the following: there exists an absolute constant C such that property i implies property j with parameter Kj ≤ CKi for any two properties i, j = 1, 2, 3.)

1. Tails: P{|X| > t} ≤ exp(1 − t²/K1²) for all t ≥ 0;

2. Moments: (E|X|p)1/p ≤ K2√p for all p ≥ 1;

3. Super-exponential moment: E exp(X²/K3²) ≤ e.

Moreover, if EX = 0 then properties 1–3 are also equivalent to the following one:

4. Moment generating function: E exp(tX) ≤ exp(t²K4²) for all t ∈ R.

###### Proof.

1 ⇒ 2. Assume property 1 holds. By homogeneity, rescaling X to X/K1 we can assume that K1 = 1. Recall that for every non-negative random variable Z, integration by parts yields the identity EZ = ∫0∞ P{Z ≥ u} du. We apply this for Z = |X|p. After the change of variables u = tp, we obtain using property 1 that

 E|X|p = ∫0∞ P{|X| ≥ t} ptp−1 dt ≤ ∫0∞ e1−t² ptp−1 dt = (ep/2) Γ(p/2) ≤ (ep/2)(p/2)p/2.

Taking the p-th root yields property 2 with a suitable absolute constant K2.

2 ⇒ 3. Assume property 2 holds. As before, by homogeneity we may assume that K2 = 1. Let c > 0 be a sufficiently small absolute constant. Writing the Taylor series of the exponential function, we obtain

 E exp(cX²) = 1 + ∑p=1∞ cp E(X2p)/p! ≤ 1 + ∑p=1∞ cp (2p)p/p! ≤ 1 + ∑p=1∞ (2ce)p.

The first inequality follows from property 2; in the second one we use p! ≥ (p/e)p. For small c the geometric series is bounded by e − 1, which gives E exp(cX²) ≤ e, i.e. property 3 with K3 = c−1/2.

3 ⇒ 1. Assume property 3 holds. As before we may assume that K3 = 1. Exponentiating and using Markov’s inequality (this simple argument is sometimes called the exponential Markov inequality) and then property 3, we have

 P{|X| > t} = P{eX² ≥ et²} ≤ e−t² EeX² ≤ e1−t².

This proves property 1 with K1 = 1.

2 ⇒ 4. Let us now assume that EX = 0 and property 2 holds; as usual we can assume that K2 = 1. We will prove that property 4 holds with an appropriately large absolute constant C = K4². This will follow by estimating the Taylor series of the exponential function:

 E exp(tX) = 1 + tEX + ∑p=2∞ tp EXp/p! ≤ 1 + ∑p=2∞ tp pp/2/p! ≤ 1 + ∑p=2∞ (e|t|/√p)p. (5.8)

The first inequality here follows from EX = 0 and property 2; the second one holds since p! ≥ (p/e)p. We compare this with the Taylor series for

 exp(Ct²) = 1 + ∑k=1∞ (Ct²)k/k! ≥ 1 + ∑k=1∞ (Ct²/k)k = 1 + ∑p∈2N (√(2C)|t|/√p)p. (5.9)

The first inequality here holds because k! ≤ kk; the second one is obtained by the substitution p = 2k. One can show that the series in (5.8) is bounded by the series in (5.9) with a large absolute constant C. We conclude that E exp(tX) ≤ exp(Ct²), which proves property 4.

4 ⇒ 1. Assume property 4 holds; we can also assume that K4 = 1. Let λ > 0 be a parameter to be chosen later. By the exponential Markov inequality, and using the bound on the moment generating function given in property 4, we obtain

 P{X ≥ t} = P{eλX ≥ eλt} ≤ e−λt EeλX ≤ e−λt+λ².

Optimizing in λ and thus choosing λ = t/2, we conclude that P{X ≥ t} ≤ e−t²/4. Repeating this argument for −X, we also obtain P{X ≤ −t} ≤ e−t²/4. Combining these two bounds we conclude that P{|X| > t} ≤ 2e−t²/4 ≤ e1−t²/4. Thus property 1 holds with K1 = 2. The lemma is proved. ∎

###### Remark 5.6.
1. The constants 1 and e in properties 1 and 3 respectively are chosen for convenience. Thus the value 1 can be replaced by any positive number and the value e can be replaced by any number greater than 1.

2. The assumption EX = 0 is only needed to prove the necessity of property 4; the sufficiency holds without this assumption.

###### Definition 5.7 (Sub-gaussian random variables).

A random variable X that satisfies one of the equivalent properties 1–3 in Lemma 5.5 is called a sub-gaussian random variable. The sub-gaussian norm of X, denoted ∥X∥ψ2, is defined to be the smallest K2 in property 2 (the sub-gaussian norm is also called the ψ2 norm in the literature). In other words,

 ∥X∥ψ2 = sup_{p≥1} p−1/2 (E|X|p)1/p.

The class of sub-gaussian random variables on a given probability space is thus a normed space. By Lemma 5.5, every sub-gaussian random variable X satisfies:

 P{|X| > t} ≤ exp(1 − ct²/∥X∥ψ2²)  for all t ≥ 0; (5.10)
 (E|X|p)1/p ≤ ∥X∥ψ2 √p  for all p ≥ 1; (5.11)
 E exp(cX²/∥X∥ψ2²) ≤ e;
 if EX = 0 then E exp(tX) ≤ exp(Ct²∥X∥ψ2²)  for all t ∈ R, (5.12)

where C, c > 0 are absolute constants. Moreover, up to absolute constant factors, ∥X∥ψ2 is the smallest possible number in each of these inequalities.

###### Example 5.8.

Classical examples of sub-gaussian random variables are Gaussian, Bernoulli and all bounded random variables.

1. (Gaussian): A standard normal random variable X is sub-gaussian with ∥X∥ψ2 ≤ C where C is an absolute constant. This follows from (5.6). More generally, if X is a centered normal random variable with variance σ², then X is sub-gaussian with ∥X∥ψ2 ≤ Cσ.

2. (Bernoulli): Consider a random variable X with distribution P{X = −1} = P{X = 1} = 1/2. We call X a symmetric Bernoulli random variable. Since |X| = 1, it follows that X is a sub-gaussian random variable with ∥X∥ψ2 = 1.

3. (Bounded): More generally, consider any bounded random variable X, thus |X| ≤ M almost surely for some M. Then X is a sub-gaussian random variable with ∥X∥ψ2 ≤ M. We can write this more compactly as ∥X∥ψ2 ≤ ∥X∥∞.

A remarkable property of the normal distribution is rotation invariance. Given a finite number of independent centered normal random variables Xi, their sum ∑i Xi is also a centered normal random variable, obviously with Var(∑i Xi) = ∑i Var(Xi). Rotation invariance passes onto sub-gaussian random variables, although approximately:

###### Lemma 5.9 (Rotation invariance).

Consider a finite number of independent centered sub-gaussian random variables Xi. Then ∑i Xi is also a centered sub-gaussian random variable. Moreover,

 ∥∑i Xi∥ψ2² ≤ C ∑i ∥Xi∥ψ2²

where C is an absolute constant.

###### Proof.

The argument is based on estimating the moment generating function. Using independence and (5.12), we have for every t ∈ R:

 E exp(t ∑i Xi) = E ∏i exp(tXi) = ∏i E exp(tXi) ≤ ∏i exp(Ct²∥Xi∥ψ2²) = exp(t²K²),  where K² = C ∑i ∥Xi∥ψ2².

Using the equivalence of properties 2 and 4 in Lemma 5.5, we conclude that ∥∑i Xi∥ψ2 ≤ C1K where C1 is an absolute constant. The proof is complete. ∎

The rotation invariance immediately yields a large deviation inequality for sums of independent sub-gaussian random variables:

###### Proposition 5.10 (Hoeffding-type inequality).

Let X1, …, XN be independent centered sub-gaussian random variables, and let K = maxi ∥Xi∥ψ2. Then for every a = (a1, …, aN) ∈ RN and every t ≥ 0, we have

 P{|∑i=1N ai Xi| ≥ t} ≤ e · exp(−ct²/(K²∥a∥2²))

where c > 0 is an absolute constant.

###### Proof.

The rotation invariance (Lemma 5.9) implies the bound ∥∑i ai Xi∥ψ2² ≤ C ∑i ai²∥Xi∥ψ2² ≤ CK²∥a∥2². Property (5.10) yields the required tail decay. ∎

###### Remark 5.11.

One can interpret these results (Lemma 5.9 and Proposition 5.10) as one-sided non-asymptotic manifestations of the central limit theorem. For example, consider the normalized sum SN = N−1/2(X1 + ⋯ + XN) of independent symmetric Bernoulli random variables Xi. Proposition 5.10 yields the tail bounds P{|SN| ≥ t} ≤ e · e−ct² for any number of terms N. Up to the absolute constants e and c, these tails coincide with those of the standard normal random variable (5.5).
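
This comparison can be simulated (a sketch; the number of terms, trials, and threshold t are arbitrary choices): the empirical tail of the normalized Bernoulli sum is dominated by the gaussian-type bound 2e^(−t²/2) of (5.5):

```python
import numpy as np

rng = np.random.default_rng(4)
N, trials, t = 100, 50_000, 2.0

# Normalized sums of symmetric Bernoulli variables: S_N = N^{-1/2} sum X_i
X = rng.choice([-1.0, 1.0], size=(trials, N))
S = X.sum(axis=1) / np.sqrt(N)

emp_tail = np.mean(np.abs(S) > t)
gauss_tail = 2 * np.exp(-t**2 / 2)  # the gaussian tail bound from (5.5)

print(emp_tail, gauss_tail, emp_tail <= gauss_tail)
```

For any fixed N the empirical tail stays below the gaussian-type bound, as the proposition predicts.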

Using moment growth (5.11) instead of the tail decay (5.10), we immediately obtain from Lemma 5.9 a general form of the well known Khintchine inequality:

###### Corollary 5.12 (Khintchine inequality).

Let Xi be a finite number of independent sub-gaussian random variables with zero mean, unit variance, and ∥Xi∥ψ2 ≤ K. Then, for every sequence of coefficients ai and every exponent p ≥ 2 we have

 (∑i ai²)1/2 ≤ (E|∑i ai Xi|p)1/p ≤ CK√p (∑i ai²)1/2

where C is an absolute constant.

###### Proof.

The lower bound follows by independence and Hölder’s inequality: indeed, (E|∑i ai Xi|p)1/p ≥ (E|∑i ai Xi|²)1/2 = (∑i ai²)1/2. For the upper bound, we argue as in Proposition 5.10, but use property (5.11). ∎

### 5.2.4 Sub-exponential random variables

Although the class of sub-gaussian random variables is natural and quite wide, it leaves out some useful random variables which have tails heavier than gaussian. One such example is a standard exponential random variable X – a non-negative random variable with exponential tail decay

 P{X≥t}=e−t,t≥0. (5.13)

To cover such examples, we consider the class of sub-exponential random variables, those with at least an exponential tail decay. With appropriate modifications, the basic properties of sub-gaussian random variables hold for sub-exponentials. In particular, a version of Lemma 5.5 holds with a similar proof for the sub-exponential properties, except for property 4 of the moment generating function. Thus for a random variable X the following properties are equivalent with parameters Ki > 0 differing from each other by at most an absolute constant factor:

 P{|X| > t} ≤ exp(1 − t/K1)  for all t ≥ 0; (5.14)
 (E|X|p)1/p ≤ K2 p  for all p ≥ 1; (5.15)
 E exp(X/K3) ≤ e. (5.16)
###### Definition 5.13 (Sub-exponential random variables).

A random variable X that satisfies one of the equivalent properties (5.14)–(5.16) is called a sub-exponential random variable. The sub-exponential norm of X, denoted ∥X∥ψ1, is defined to be the smallest parameter K2 in (5.15). In other words,

 ∥X∥ψ1 = sup_{p≥1} p−1 (E|X|p)1/p.
###### Lemma 5.14 (Sub-exponential is sub-gaussian squared).

A random variable X is sub-gaussian if and only if X² is sub-exponential. Moreover,

 ∥X∥2ψ2≤∥X2∥ψ1≤2∥X∥2ψ2.
###### Proof.

This follows easily from the definition. ∎
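
For completeness, here is the computation behind this proof, written out directly from the definitions of the two norms (the factor 2 arises from the change of index q = 2p):

```latex
\[
\|X^2\|_{\psi_1}
  = \sup_{p \ge 1} p^{-1} \left( \mathbb{E}|X|^{2p} \right)^{1/p}
  = 2 \sup_{q \ge 2} \left[ q^{-1/2} \left( \mathbb{E}|X|^{q} \right)^{1/q} \right]^2
  \le 2 \|X\|_{\psi_2}^2 ,
\]
```

where the middle equality is the substitution q = 2p. Conversely, for 1 ≤ q ≤ 2, monotonicity of Lq-norms gives q−1/2(E|X|q)1/q ≤ (E|X|²)1/2 = √2 · [2−1/2(E|X|²)1/2], so the supremum over q ≥ 1 defining ∥X∥ψ2 is at most √2 times the supremum over q ≥ 2, whence ∥X∥ψ2² ≤ 2 sup_{q≥2}[q−1/2(E|X|q)1/q]² = ∥X²∥ψ1.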

The moment generating function of a sub-exponential random variable has a similar upper bound as in the sub-gaussian case (property 4 in Lemma 5.5). The only real difference is that the bound only holds in a neighborhood of zero rather than on the whole real line. This is inevitable, as the moment generating function of the exponential random variable (5.13) does not exist for t ≥ 1.

###### Lemma 5.15 (Mgf of sub-exponential random variables).

Let X be a centered sub-exponential random variable. Then, for t such that |t| ≤ c/∥X∥ψ1, one has

 E exp(tX) ≤ exp(Ct²∥X∥ψ1²)

where C, c > 0 are absolute constants.

###### Proof.

The argument is similar to the sub-gaussian case. We can assume that ∥X∥ψ1 = 1 by replacing X with X/∥X∥ψ1 and t with t∥X∥ψ1. Repeating the proof of the implication 2 ⇒ 4 of Lemma 5.5 and using E|X|p ≤ pp this time, we obtain that E exp(tX) ≤ 1 + ∑p=2∞ (e|t|)p. If |t| ≤ 1/(2e) then the right hand side is bounded by 1 + 2e²t² ≤ exp(2e²t²). This completes the proof. ∎

Sub-exponential random variables satisfy a large deviation inequality similar to the one for sub-gaussians (Proposition 5.10). The only significant difference is that two tails have to appear here – a gaussian tail responsible for the central limit theorem, and an exponential tail coming from the tails of each term.

###### Proposition 5.16 (Bernstein-type inequality).

Let X1, …, XN be independent centered sub-exponential random variables, and K = maxi ∥Xi∥ψ1. Then for every a = (a1, …, aN) ∈ RN and every t ≥ 0, we have

 P{|∑i=1N ai Xi| ≥ t} ≤ 2 exp[−c min(t²/(K²∥a∥2²), t/(K∥a∥∞))]

where c > 0 is an absolute constant.

###### Proof.

Without loss of generality, we assume that K = 1 by replacing Xi with Xi/K and t with t/K. We use the exponential Markov inequality for the sum S = ∑i ai Xi with a parameter λ > 0:

 P{S≥t}=P{eλS≥eλt}≤e−λtEeλS=e−λt∏iEexp(λaiXi).

If λ ≤ c/∥a∥∞ then λ|ai| ≤ c for all i, so Lemma 5.15 yields

 P{S≥t}≤e−λt∏iexp(Cλ2a2i)=exp(−λt+Cλ2∥a∥22).

Choosing λ = min(t/(2C∥a∥2²), c/∥a∥∞), we obtain that

 P{S ≥ t} ≤ exp[−min(t²/(4C∥a∥2²), ct/(2∥a∥∞))].

Repeating this argument for −Xi instead of Xi, we obtain the same bound for P{−S ≥ t}. A combination of these two bounds completes the proof. ∎

###### Corollary 5.17.

Let X1, …, XN be independent centered sub-exponential random variables, and let K = maxi ∥Xi∥ψ1. Then, for every ε ≥ 0, we have

 P{|∑i=1N Xi| ≥ εN} ≤ 2 exp[−c min(ε²/K², ε/K) N]

where c > 0 is an absolute constant.

###### Proof.

This follows from Proposition 5.16 for ai = 1 and t = εN. ∎

###### Remark 5.18 (Centering).

The definitions of sub-gaussian and sub-exponential random variables do not require them to be centered. In any case, one can always center using the simple fact that if X is sub-gaussian (or sub-exponential), then so is X − EX. Moreover,

 ∥X−EX∥ψ2≤2∥X∥ψ2,∥X−EX∥ψ1≤2∥X∥ψ1.

This follows by the triangle inequality ∥X − EX∥ψ2 ≤ ∥X∥ψ2 + ∥EX∥ψ2 along with ∥EX∥ψ2 = |EX| ≤ E|X| ≤ ∥X∥ψ2, and similarly for the sub-exponential norm.

### 5.2.5 Isotropic random vectors

Now we carry our work over to higher dimensions. We will thus be working with random vectors in , or equivalently probability distributions in .

While the concept of the mean of a random variable remains the same in higher dimensions, the second moment EX² is replaced by the second moment matrix of a random vector X, defined as

 Σ = Σ(X) = E X ⊗ X = E XXT

where x ⊗ y denotes the outer product xyT of vectors x, y in Rn. Similarly, the concept of variance of a random variable is replaced in higher dimensions with the covariance matrix of a random vector X, defined as

 Cov(X) = E(X − μ) ⊗ (X − μ) = E X ⊗ X − μ ⊗ μ

where μ = EX. By translation, many questions can be reduced to the case of centered random vectors, for which μ = 0 and Cov(X) = Σ(X). We will also need a higher-dimensional version of unit variance:

###### Definition 5.19 (Isotropic random vectors).

A random vector X in Rn is called isotropic if Σ(X) = I. Equivalently, X is isotropic if

 E⟨X, x⟩² = ∥x∥2²  for all x ∈ Rn. (5.17)

Suppose Σ(X) is an invertible matrix, which means that the distribution of X is not essentially supported on any proper subspace of Rn. Then Σ(X)−1/2 X is an isotropic random vector in Rn. Thus every non-degenerate random vector can be made isotropic by an appropriate linear transformation. (This transformation, usually preceded by centering, is a higher-dimensional version of standardizing random variables, which enforces zero mean and unit variance.) This allows us to mostly focus on studying isotropic random vectors in the future.

###### Lemma 5.20.

Let X, Y be independent isotropic random vectors in Rn. Then E∥X∥2² = n and E⟨X, Y⟩² = n.

###### Proof.

The first part follows from E∥X∥2² = E tr(X ⊗ X) = tr(Σ(X)) = tr(I) = n. The second part follows by conditioning on Y, using isotropy of X, and using the first part for Y: this way we obtain E⟨X, Y⟩² = E∥Y∥2² = n. ∎
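
A quick Monte Carlo check of both identities for Gaussian isotropic vectors (a sketch; dimension, sample size, and tolerances are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(5)
n, samples = 10, 100_000

# Independent isotropic Gaussian vectors X, Y in R^n
X = rng.standard_normal((samples, n))
Y = rng.standard_normal((samples, n))

mean_sq_norm = np.mean(np.einsum('ij,ij->i', X, X))      # estimates E ||X||_2^2
mean_sq_inner = np.mean(np.einsum('ij,ij->i', X, Y)**2)  # estimates E <X, Y>^2

print(mean_sq_norm, mean_sq_inner)  # both close to n = 10
```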

###### Example 5.21.
1. (Gaussian): The (standard) Gaussian random vector X in Rn chosen according to the standard normal distribution N(0, I) is isotropic. The coordinates of X are independent standard normal random variables.

2. (Bernoulli): A similar example of a discrete isotropic distribution is given by a Bernoulli random vector X in Rn whose coordinates are independent symmetric Bernoulli random variables.

3. (Product distributions): More generally, consider a random vector X in Rn whose coordinates are independent random variables with zero mean and unit variance. Then clearly X is an isotropic vector in Rn.

4. (Coordinate): Consider a coordinate random vector X, which is uniformly distributed in the set {√n ei}i=1n, where {ei} is the canonical basis of Rn. Clearly X is an isotropic random vector in Rn. (The examples of Gaussian and coordinate random vectors are somewhat opposite – one is very continuous and the other is very discrete. They may be used as test cases in our study of random matrices.)

5. (Frame): This is a more general version of the coordinate random vector. A frame is a set of vectors {uᵢ}ᵢ₌₁ᴹ in ℝⁿ which obeys an approximate Parseval’s identity, i.e. there exist numbers A ≤ B called frame bounds such that

 A∥x∥₂² ≤ ∑ᵢ₌₁ᴹ ⟨uᵢ,x⟩² ≤ B∥x∥₂²  for all x ∈ ℝⁿ.

If A = B, the set is called a tight frame. Thus, tight frames are generalizations of orthogonal bases without linear independence. Given a tight frame with bounds A = B = M, the random vector X uniformly distributed in the set {uᵢ}ᵢ₌₁ᴹ is clearly isotropic in ℝⁿ.¹⁵ There is clearly a reverse implication, too, which shows that the class of tight frames can be identified with the class of discrete isotropic random vectors.

6. (Spherical): Consider a random vector X uniformly distributed on the Euclidean sphere in ℝⁿ with center at the origin and radius √n. Then X is isotropic. Indeed, by rotation invariance E⟨X,x⟩² is proportional to ∥x∥₂²; the correct normalization √n is derived from Lemma 5.20.

7. (Uniform on a convex set): In convex geometry, a convex set K in ℝⁿ is called isotropic if a random vector X chosen uniformly from K according to the volume is isotropic. As we noted, every full-dimensional convex set can be made into an isotropic one by an affine transformation. Isotropic convex sets look “well conditioned”, which is advantageous in geometric algorithms (e.g. volume computations).

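The examples above can be probed numerically (again a sketch of my own): for each distribution, the empirical second moment matrix should be close to the identity. Here are the coordinate and spherical examples, the two that are least obvious to simulate.

```python
import numpy as np

rng = np.random.default_rng(2)
n, N = 4, 200_000

def second_moment(samples):
    """Empirical second moment matrix (1/N) sum of X Xᵀ over the sample."""
    return samples.T @ samples / len(samples)

# Coordinate vector: uniform on the n points sqrt(n) * e_i.
idx = rng.integers(n, size=N)
coord = np.sqrt(n) * np.eye(n)[idx]

# Spherical vector: uniform on the sphere of radius sqrt(n)
# (normalize a Gaussian vector, then rescale).
g = rng.standard_normal((N, n))
sphere = np.sqrt(n) * g / np.linalg.norm(g, axis=1, keepdims=True)

for X in (coord, sphere):
    print(np.round(second_moment(X), 2))  # ≈ identity in both cases
```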
We generalize the concept of sub-gaussian random variables to higher dimensions using one-dimensional marginals.

###### Definition 5.22 (Sub-gaussian random vectors).

We say that a random vector X in ℝⁿ is sub-gaussian if the one-dimensional marginals ⟨X,x⟩ are sub-gaussian random variables for all x ∈ ℝⁿ. The sub-gaussian norm of X is defined as

 ∥X∥_{ψ₂} = sup_{x∈Sⁿ⁻¹} ∥⟨X,x⟩∥_{ψ₂}.
###### Remark 5.23 (Properties of high-dimensional distributions).

The definitions of isotropic and sub-gaussian distributions suggest that more generally, natural properties of high-dimensional distributions may be defined via one-dimensional marginals. This is a natural way to generalize properties of random variables to random vectors. For example, we shall call a random vector sub-exponential if all of its one-dimensional marginals are sub-exponential random variables, etc.

One simple way to create sub-gaussian distributions in is by taking a product of sub-gaussian distributions on the line:

###### Lemma 5.24 (Product of sub-gaussian distributions).

Let X₁,…,Xₙ be independent centered sub-gaussian random variables. Then X = (X₁,…,Xₙ) is a centered sub-gaussian random vector in ℝⁿ, and

 ∥X∥_{ψ₂} ≤ C maxᵢ≤ₙ ∥Xᵢ∥_{ψ₂}

where C is an absolute constant.

###### Proof.

This is a direct consequence of the rotation invariance principle, Lemma 5.9. Indeed, for every x = (x₁,…,xₙ) ∈ Sⁿ⁻¹ we have

 ∥⟨X,x⟩∥²_{ψ₂} = ∥∑ᵢ₌₁ⁿ xᵢXᵢ∥²_{ψ₂} ≤ C ∑ᵢ₌₁ⁿ xᵢ² ∥Xᵢ∥²_{ψ₂} ≤ C maxᵢ≤ₙ ∥Xᵢ∥²_{ψ₂}

where we used that ∑ᵢ₌₁ⁿ xᵢ² = 1. This completes the proof. ∎

###### Example 5.25.

Let us analyze the basic examples of random vectors introduced earlier in Example 5.21.

1. (Gaussian, Bernoulli): Gaussian and Bernoulli random vectors are sub-gaussian; their sub-gaussian norms are bounded by an absolute constant. These are particular cases of Lemma 5.24.

2. (Spherical): A spherical random vector X is also sub-gaussian; its sub-gaussian norm is bounded by an absolute constant. Unfortunately, this does not follow from Lemma 5.24 because the coordinates of the spherical vector are not independent. Instead, by rotation invariance, the claim clearly follows from the following geometric fact: for every t ≥ 0, the spherical cap {x ∈ Sⁿ⁻¹ : x₁ ≥ t} makes up at most e^{−t²n/2} proportion of the total area on the sphere.¹⁶ This fact about spherical caps may seem counter-intuitive. For example, for a small fixed t the cap looks similar to a hemisphere, but the proportion of its area goes to zero very fast as the dimension n increases. This is a starting point of the study of the concentration of measure phenomenon, see [43]. This can be proved directly by integration, and also by elementary geometric considerations [9, Lemma 2.2].

3. (Coordinate): Although the coordinate random vector X is formally sub-gaussian as its support is finite, its sub-gaussian norm is too big: ∥X∥_{ψ₂} ≍ (n/log n)^{1/2} ≫ 1. So we would not think of X as a sub-gaussian random vector.

4. (Uniform on a convex set): For many isotropic convex sets K (called ψ₂ bodies), a random vector X uniformly distributed in K is sub-gaussian with ∥X∥_{ψ₂} = O(1). For example, the appropriately normalized cube is a ψ₂ body by Lemma 5.24, while the appropriately normalized cross-polytope is not. Nevertheless, Borell’s lemma (which is a consequence of the Brunn-Minkowski inequality) implies a weaker property, that X is always sub-exponential, and sup_{x∈Sⁿ⁻¹} ∥⟨X,x⟩∥_{ψ₁} is bounded by an absolute constant. See [33, Section 2.2.b] for a proof and discussion of these ideas.

### 5.2.6 Sums of independent random matrices

In this section, we mention without proof some results of classical probability theory in which scalars can be replaced by matrices. Such results are useful in particular for problems on random matrices, since we can view a random matrix as a generalization of a random variable. One such remarkable generalization is valid for the Khintchine inequality, Corollary 5.12. The scalars can be replaced by matrices, and the absolute value by the Schatten norm. Recall that for 1 ≤ p ≤ ∞, the p-Schatten norm of an n×n matrix A is defined as the ℓp norm of the sequence of its singular values:

 ∥A∥_{C^n_p} = ∥(sᵢ(A))ᵢ₌₁ⁿ∥_p = (∑ᵢ₌₁ⁿ sᵢ(A)^p)^{1/p}.

For p = ∞, the Schatten norm equals the spectral norm ∥A∥ = maxᵢ sᵢ(A). Using this, one can quickly check that already for p = log n the Schatten and spectral norms are equivalent: ∥A∥ ≤ ∥A∥_{C^n_{log n}} ≤ e∥A∥.
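The equivalence for p = log n is easy to verify numerically (an illustrative sketch): the Schatten norm is computed straight from the singular values.

```python
import numpy as np

def schatten_norm(A, p):
    """p-Schatten norm: the l_p norm of the singular values of A."""
    s = np.linalg.svd(A, compute_uv=False)
    return np.sum(s**p) ** (1.0 / p)

rng = np.random.default_rng(3)
n = 50
A = rng.standard_normal((n, n))

spec = np.linalg.norm(A, 2)  # spectral norm = largest singular value
p = np.log(n)
sch = schatten_norm(A, p)

# ||A|| <= ||A||_{C^n_p} <= e ||A|| for p = log n
print(spec <= sch <= np.e * spec)  # True
```

The upper bound follows from ∑ᵢ sᵢ^p ≤ n·s₁^p and n^{1/log n} = e; the lower bound is immediate since s₁ is one of the terms.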

###### Theorem 5.26 (Non-commutative Khintchine inequality, see [61] Section 9.8).

Let A₁,…,A_N be self-adjoint n×n matrices and ε₁,…,ε_N be independent symmetric Bernoulli random variables. Then, for every 2 ≤ p < ∞, we have

 (E ∥∑ᵢ₌₁ᴺ εᵢAᵢ∥^p_{C^n_p})^{1/p} ≤ C√p ∥(∑ᵢ₌₁ᴺ Aᵢ²)^{1/2}∥_{C^n_p}

where C is an absolute constant.

###### Remark 5.27.
1. The scalar case of this result, for n = 1, recovers the classical Khintchine inequality, Corollary 5.12, for p ≥ 2.

2. By the equivalence of Schatten and spectral norms for p = log n, a version of the non-commutative Khintchine inequality holds for the spectral norm:

 E ∥∑ᵢ₌₁ᴺ εᵢAᵢ∥ ≤ C₁ √(log n) ∥(∑ᵢ₌₁ᴺ Aᵢ²)^{1/2}∥ (5.18)

where C₁ is an absolute constant. The logarithmic factor is unfortunately essential; its role will become clear when we discuss applications of this result to random matrices in the next sections.
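A Monte Carlo sketch (my own illustration, not from the text) compares the two sides of (5.18) for fixed random self-adjoint matrices; since the constant C₁ is unspecified, we only observe that the ratio stays bounded.

```python
import numpy as np

rng = np.random.default_rng(4)
n, N, trials = 20, 30, 500

# Fixed self-adjoint matrices A_i (symmetrized Gaussian matrices).
As = [(B + B.T) / 2 for B in rng.standard_normal((N, n, n))]

# Right-hand side of (5.18) without the constant:
# sqrt(log n) * || (sum A_i^2)^{1/2} ||, and ||S^{1/2}|| = sqrt(||S||).
S = sum(A @ A for A in As)
rhs = np.sqrt(np.log(n)) * np.sqrt(np.linalg.norm(S, 2))

# Left-hand side: E || sum eps_i A_i || over random sign vectors.
lhs = np.mean([
    np.linalg.norm(sum(e * A for e, A in zip(rng.choice([-1.0, 1.0], N), As)), 2)
    for _ in range(trials)
])

print(lhs / rhs)  # stays below a modest constant, consistent with (5.18)
```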

###### Corollary 5.28 (Rudelson’s inequality [65]).

Let x₁,…,x_N be vectors in ℝⁿ and ε₁,…,ε_N be independent symmetric Bernoulli random variables. Then

 E∥∥N∑i=1εix