Introduction to the nonasymptotic analysis of random matrices
August 11, 2010; final revision November 23, 2011
This is a tutorial on some basic nonasymptotic methods and concepts in random matrix theory. The reader will learn several tools for the analysis of the extreme singular values of random matrices with independent rows or columns. Many of these methods sprang from the development of geometric functional analysis since the 1970s. They have applications in several fields, most notably in theoretical computer science, statistics and signal processing. A few basic applications are covered in this text, particularly for the problem of estimating covariance matrices in statistics and for validating probabilistic constructions of measurement matrices in compressed sensing. These notes are written particularly for graduate students and beginning researchers in different areas, including functional analysts, probabilists, theoretical statisticians, electrical engineers, and theoretical computer scientists.
5.1 Introduction
Asymptotic and nonasymptotic regimes
Random matrix theory studies properties of $N \times n$ matrices $A$ chosen from some distribution on the set of all matrices. As the dimensions $N$ and $n$ grow to infinity, one observes that the spectrum of $A$ tends to stabilize. This is manifested in several limit laws, which may be regarded as random matrix versions of the central limit theorem. Among them is Wigner's semicircle law for the eigenvalues of symmetric Gaussian matrices, the circular law for Gaussian matrices, the Marchenko-Pastur law for Wishart matrices $W = A^*A$ where $A$ is a Gaussian matrix, and the Bai-Yin and Tracy-Widom laws for the extreme eigenvalues of Wishart matrices $W$. The books [51, 5, 23, 6] offer a thorough introduction to the classical problems of random matrix theory and its fascinating connections.
The asymptotic regime where the dimensions $N, n \to \infty$ is well suited for the purposes of statistical physics, e.g. when random matrices serve as finite-dimensional models of infinite-dimensional operators. But in some other areas, including statistics, geometric functional analysis, and compressed sensing, the limiting regime may not be very useful [69]. Suppose, for example, that we ask about the largest singular value $s_{\max}(A)$ (i.e. the square root of the largest eigenvalue of $A^*A$); to be specific, assume that $A$ is an $n \times n$ matrix whose entries are independent standard normal random variables. The asymptotic random matrix theory answers this question as follows: the Bai-Yin law (see Theorem 5.31) states that
$s_{\max}(A)/2\sqrt{n} \to 1$ almost surely
as the dimension $n \to \infty$. Moreover, the limiting distribution of $s_{\max}(A)$ is known to be the Tracy-Widom law (see [71, 27]). In contrast to this, a nonasymptotic answer to the same question is the following: in every dimension $n$, one has
$\mathbb{P}\{ s_{\max}(A) > 2\sqrt{n} + t \} \le 2\exp(-ct^2)$ for all $t \ge 0$;
here $c > 0$ is an absolute constant (see Theorems 5.32 and 5.39). The latter answer is less precise (because of the absolute constant $c$) but more quantitative because for fixed dimensions $n$ it gives an exponential probability of success. (For this specific model of Gaussian matrices, Theorems 5.32 and 5.35 even give a sharp absolute constant here. But the result mentioned here is much more general, as we will see later; it only requires independence of the rows or columns of $A$.) This is the kind of answer we will seek in this text – guarantees up to absolute constants in all dimensions, and with large probability.
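This kind of guarantee is easy to probe numerically. The following sketch (a hedged illustration assuming Python with NumPy; the dimension and trial count are arbitrary choices) samples square Gaussian matrices and compares their largest singular values with the $2\sqrt{n}$ prediction:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 200, 50
# Largest singular value of n x n matrices with i.i.d. N(0,1) entries.
smax = [np.linalg.svd(rng.standard_normal((n, n)), compute_uv=False)[0]
        for _ in range(trials)]
# Bai-Yin predicts s_max ~ 2*sqrt(n); the nonasymptotic bound says that
# exceeding 2*sqrt(n) + t is exponentially unlikely in t.
ratio = np.mean(smax) / (2 * np.sqrt(n))
print(round(ratio, 2))
```

Already at moderate $n$ the ratio is close to $1$, with small fluctuations around the limit value, which is the nonasymptotic picture described above.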
Tall matrices are approximate isometries
The following heuristic will be our guideline: tall random matrices should act as approximate isometries. So, an $N \times n$ random matrix $A$ with $N \gg n$ should act almost like an isometric embedding of $\ell_2^n$ into $\ell_2^N$:
$(1-\delta) K \|x\|_2 \le \|Ax\|_2 \le (1+\delta) K \|x\|_2$ for all $x \in \mathbb{R}^n$,
where $K$ is an appropriate normalization factor and $\delta \ll 1$. Equivalently, this says that all the singular values of $A$ are close to each other:
$(1-\delta) K \le s_{\min}(A) \le s_{\max}(A) \le (1+\delta) K,$
where $s_{\min}(A)$ and $s_{\max}(A)$ denote the smallest and the largest singular values of $A$. Yet equivalently, this means that tall matrices are well conditioned: the condition number of $A$ satisfies $\kappa(A) = s_{\max}(A)/s_{\min}(A) \le (1+\delta)/(1-\delta) \approx 1$.
In the asymptotic regime and for random matrices with independent entries, our heuristic is justified by the Bai-Yin law, which is Theorem 5.31 below. Loosely speaking, it states that as the dimensions $N$ and $n$ increase to infinity while the aspect ratio $n/N$ is fixed, we have
(5.1)  $s_{\min}(A) = \sqrt{N} - \sqrt{n} + o(\sqrt{n}), \qquad s_{\max}(A) = \sqrt{N} + \sqrt{n} + o(\sqrt{n})$  almost surely.
In these notes, we study random matrices with independent rows or independent columns, but not necessarily independent entries. We develop nonasymptotic versions of (5.1) for such matrices, which should hold for all dimensions $N$ and $n$. The desired results should have the form
(5.2)  $\sqrt{N} - C\sqrt{n} \le s_{\min}(A) \le s_{\max}(A) \le \sqrt{N} + C\sqrt{n}$
with large probability, e.g. $1 - 2e^{-n}$, where $C$ is an absolute constant. (More accurately, we should expect $C$ to depend on easily computable quantities of the distribution, such as its moments. This will be clear from the context.) For tall matrices, where $N \gg n$, both sides of this inequality would be close to each other, which would guarantee that $A$ is an approximate isometry.
Models and methods
We shall study quite general models of random matrices – those with independent rows or independent columns that are sampled from high-dimensional distributions. We will place either strong moment assumptions on the distribution (subgaussian growth of moments), or no moment assumptions at all (except finite variance). This leads us to four types of main results: for matrices with independent subgaussian rows, with independent heavy-tailed rows, with independent subgaussian columns, and with independent heavy-tailed columns.
These four models cover many natural classes of random matrices that occur in applications, including random matrices with independent entries (Gaussian and Bernoulli in particular) and random submatrices of orthogonal matrices (random Fourier matrices in particular).
The analysis of these four models is based on a variety of tools of probability theory and geometric functional analysis, most of which have not been covered in the texts on the “classical” random matrix theory. The reader will learn basics on subgaussian and subexponential random variables, isotropic random vectors, large deviation inequalities for sums of independent random variables, extensions of these inequalities to random matrices, and several basic methods of high dimensional probability such as symmetrization, decoupling, and covering (net) arguments.
Applications
In these notes we shall emphasize two applications, one in statistics and one in compressed sensing. Our analysis of random matrices with independent rows immediately applies to a basic problem in statistics – estimating covariance matrices of high-dimensional distributions. If a random matrix $A$ has $N$ i.i.d. rows $X_i$, then $\frac{1}{N} A^* A = \frac{1}{N} \sum_{i=1}^N X_i \otimes X_i$ is the sample covariance matrix. If $A$ has $n$ independent columns $A_j$, then $A^* A = (\langle A_j, A_k \rangle)_{j,k}$ is the Gram matrix. Thus our analysis of the row-independent and column-independent models can be interpreted as a study of sample covariance matrices and Gram matrices of high-dimensional distributions. We will see in Section 5.4.3 that for a general distribution in $\mathbb{R}^n$, its covariance matrix can be estimated from a sample of size $N = O(n \log n)$ drawn from the distribution. Moreover, for subgaussian distributions we have an even better bound $N = O(n)$. For low-dimensional distributions, much fewer samples are needed – if a distribution lies close to a subspace of dimension $k$ in $\mathbb{R}^n$, then a sample of size $N = O(k \log n)$ is sufficient for covariance estimation.
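A small numerical sketch of covariance estimation (assuming NumPy; the identity covariance and the sample size, a constant multiple of the dimension, are illustrative assumptions rather than the theorem's constants):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20
N = 50 * n  # sample size N = O(n); the factor 50 is an arbitrary choice
# Rows of A are i.i.d. samples from an isotropic Gaussian distribution,
# whose true covariance matrix is the identity.
A = rng.standard_normal((N, n))
sample_cov = A.T @ A / N
err = np.linalg.norm(sample_cov - np.eye(n), 2)  # spectral-norm error
print(round(err, 2))
```

The spectral-norm error is small once $N$ is a sufficiently large multiple of $n$, in line with the $N = O(n)$ bound for subgaussian distributions.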
In compressed sensing, the best known measurement matrices are random. A sufficient condition for a matrix to succeed for the purposes of compressed sensing is given by the restricted isometry property. Loosely speaking, this property demands that all submatrices of a given size be well conditioned. This fits well in the circle of problems of nonasymptotic random matrix theory. Indeed, we will see in Section 5.6 that all basic models of random matrices are nice restricted isometries. These include Gaussian and Bernoulli matrices, more generally all matrices with subgaussian independent entries, and even more generally all matrices with subgaussian independent rows or columns. Also, the class of restricted isometries includes random Fourier matrices, more generally random submatrices of bounded orthogonal matrices, and even more generally matrices whose rows are independent samples from an isotropic distribution with uniformly bounded coordinates.
Related sources
This text is a tutorial rather than a survey, so we focus on explaining methods rather than results. This forces us to make some concessions in our choice of the subjects. Concentration of measure and its applications to random matrix theory are only briefly mentioned. For an introduction into concentration of measure suitable for a beginner, see [9] and [49, Chapter 14]; for a thorough exposition see [56, 43]; for connections with random matrices see [21, 44]. The monograph [45] also offers an introduction into concentration of measure and related probabilistic methods in analysis and geometry, some of which we shall use in these notes.
We completely avoid the important (but more difficult) model of symmetric random matrices with independent entries on and above the diagonal. Starting from the work of Füredi and Komlós [29], the largest singular value (the spectral norm) of symmetric random matrices has been a subject of study in many works; see e.g. [50, 83, 58] and the references therein.
We also did not even attempt to discuss sharp small deviation inequalities (of TracyWidom type) for the extreme eigenvalues. Both these topics and much more are discussed in the surveys [21, 44, 69], which serve as bridges between asymptotic and nonasymptotic problems in random matrix theory.
Because of the absolute constant $C$ in (5.2), our analysis of the smallest singular value (the "hard edge") will only be useful for sufficiently tall matrices, where $N \ge C^2 n$. For square and almost square matrices, the hard edge problem will be only briefly mentioned in Section 5.3. The surveys [76, 69] discuss this problem at length, and they offer a glimpse of connections to other problems of random matrix theory and additive combinatorics.
Many of the results and methods presented in these notes are known in one form or another. Some of them are published while some others belong to the folklore of probability in Banach spaces, geometric functional analysis, and related areas. When available, historic references are given in Section 5.7.
Acknowledgements
The author is grateful to the colleagues who made a number of improving suggestions for the earlier versions of the manuscript, in particular to Richard Chen, Subhroshekhar Ghosh, Alexander Litvak, Deanna Needell, Holger Rauhut, S V N Vishwanathan and the anonymous referees. Special thanks are due to Ulas Ayaz and Felix Krahmer who thoroughly read the entire text, and whose numerous comments led to significant improvements of this tutorial.
5.2 Preliminaries
5.2.1 Matrices and their singular values
The main object of our study will be an $N \times n$ matrix $A$ with real or complex entries. We shall state all results in the real case; the reader will be able to adjust them to the complex case as well. Usually, but not always, one should think of tall matrices $A$, those for which $N \ge n$. By passing to the adjoint matrix $A^*$, many results can be carried over to "flat" matrices, those for which $N \le n$.
It is often convenient to study $A$ through the $n \times n$ symmetric positive-semidefinite matrix $A^*A$. The eigenvalues of $|A| := \sqrt{A^*A}$ are therefore nonnegative real numbers. Arranged in a non-increasing order, they are called the singular values (in the literature, singular values are also called s-numbers) of $A$ and denoted $s_1(A) \ge \cdots \ge s_n(A) \ge 0$. Many applications require estimates on the extreme singular values
$s_{\max}(A) := s_1(A), \qquad s_{\min}(A) := s_n(A).$
The smallest singular value is only of interest for tall matrices, since for $N < n$ one automatically has $s_{\min}(A) = 0$.
Equivalently, $s_{\min}(A)$ and $s_{\max}(A)$ are respectively the largest number $m$ and the smallest number $M$ such that
(5.3)  $m\|x\|_2 \le \|Ax\|_2 \le M\|x\|_2$ for all $x \in \mathbb{R}^n$.
In order to interpret this definition geometrically, we look at $A$ as a linear operator from $\mathbb{R}^n$ into $\mathbb{R}^N$. The Euclidean distance between any two points in $\mathbb{R}^n$ can increase by at most the factor $s_{\max}(A)$ and decrease by at most the factor $s_{\min}(A)$ under the action of $A$. Therefore, the extreme singular values control the distortion of the Euclidean geometry under the action of $A$. If $s_{\min}(A) \approx s_{\max}(A) \approx 1$ then $A$ acts as an approximate isometry, or more accurately an approximate isometric embedding of $\ell_2^n$ into $\ell_2^N$.
The extreme singular values can also be described in terms of the spectral norm of $A$, which is by definition
(5.4)  $\|A\| = \sup_{x \in \mathbb{R}^n \setminus \{0\}} \frac{\|Ax\|_2}{\|x\|_2} = \sup_{x \in S^{n-1}} \|Ax\|_2.$
(5.3) gives a link between the extreme singular values and the spectral norm:
$s_{\max}(A) = \|A\|, \qquad s_{\min}(A) = 1/\|A^\dagger\|,$
where $A^\dagger$ denotes the pseudoinverse of $A$; if $A$ is invertible then $A^\dagger = A^{-1}$.
5.2.2 Nets
Nets are a convenient means to discretize compact sets. In our study we will mostly need to discretize the unit Euclidean sphere $S^{n-1}$ in the definition of the spectral norm (5.4). Let us first recall the general definition of an $\varepsilon$-net.
Definition 5.1 (Nets, covering numbers).
Let $(X, d)$ be a metric space and let $\varepsilon > 0$. A subset $\mathcal{N}_\varepsilon$ of $X$ is called an $\varepsilon$-net of $X$ if every point $x \in X$ can be approximated to within $\varepsilon$ by some point $y \in \mathcal{N}_\varepsilon$, i.e. so that $d(x, y) \le \varepsilon$. The minimal cardinality of an $\varepsilon$-net of $X$, if finite, is denoted $\mathcal{N}(X, \varepsilon)$ and is called the covering number of $X$ (at scale $\varepsilon$). (Equivalently, $\mathcal{N}(X, \varepsilon)$ is the minimal number of balls with radii $\varepsilon$ and with centers in $X$ needed to cover $X$.)
From a characterization of compactness we remember that $X$ is compact if and only if $\mathcal{N}(X, \varepsilon) < \infty$ for each $\varepsilon > 0$. A quantitative estimate on $\mathcal{N}(X, \varepsilon)$ would give us a quantitative version of compactness of $X$. (In statistical learning theory and geometric functional analysis, $\log \mathcal{N}(X, \varepsilon)$ is called the metric entropy of $X$. In some sense it measures the "complexity" of the metric space $X$.) Let us therefore take a simple example of a metric space, the unit Euclidean sphere $S^{n-1}$ equipped with the Euclidean metric $d(x, y) = \|x - y\|_2$ (a similar result holds for the geodesic metric on the sphere, since for small $\varepsilon$ these two distances are equivalent), and estimate its covering numbers.
Lemma 5.2 (Covering numbers of the sphere).
The unit Euclidean sphere $S^{n-1}$ equipped with the Euclidean metric satisfies for every $\varepsilon > 0$ that
$\mathcal{N}(S^{n-1}, \varepsilon) \le \left(1 + \frac{2}{\varepsilon}\right)^n.$
Proof.
This is a simple volume argument. Let us fix $\varepsilon > 0$ and choose $\mathcal{N}_\varepsilon$ to be a maximal $\varepsilon$-separated subset of $S^{n-1}$. In other words, $\mathcal{N}_\varepsilon$ is such that $d(x, y) \ge \varepsilon$ for all $x, y \in \mathcal{N}_\varepsilon$, $x \ne y$, and no subset of $S^{n-1}$ containing $\mathcal{N}_\varepsilon$ has this property. (One can in fact construct $\mathcal{N}_\varepsilon$ inductively by first selecting an arbitrary point on the sphere, and at each next step selecting a point that is at distance at least $\varepsilon$ from those already selected. By compactness, this algorithm will terminate after finitely many steps and it will yield a set as we required.)
The maximality property implies that $\mathcal{N}_\varepsilon$ is an $\varepsilon$-net of $S^{n-1}$. Indeed, otherwise there would exist $x \in S^{n-1}$ that is at least $\varepsilon$ far from all points in $\mathcal{N}_\varepsilon$. So $\mathcal{N}_\varepsilon \cup \{x\}$ would still be an $\varepsilon$-separated set, contradicting the maximality property.
Moreover, the separation property implies via the triangle inequality that the balls of radii $\varepsilon/2$ centered at the points in $\mathcal{N}_\varepsilon$ are disjoint. On the other hand, all such balls lie in $(1 + \varepsilon/2)B_2^n$, where $B_2^n$ denotes the unit Euclidean ball centered at the origin. Comparing the volumes gives $|\mathcal{N}_\varepsilon| \cdot (\varepsilon/2)^n \le (1 + \varepsilon/2)^n$. We conclude that $|\mathcal{N}_\varepsilon| \le (1 + \varepsilon/2)^n / (\varepsilon/2)^n = (1 + 2/\varepsilon)^n$, as required. ∎
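The inductive construction from the footnote can be imitated numerically. The sketch below (assuming NumPy; a dense random sample stands in for the whole sphere, so the selected set is only approximately a net of $S^{n-1}$) checks that the cardinality of a greedily built $\varepsilon$-separated set respects the volumetric bound $(1 + 2/\varepsilon)^n$:

```python
import numpy as np

def separated_subset(points, eps):
    """Greedily select points pairwise at distance >= eps (the inductive
    construction from the proof, run on a finite sample)."""
    net = []
    for p in points:
        if all(np.linalg.norm(p - q) >= eps for q in net):
            net.append(p)
    return net

rng = np.random.default_rng(2)
n, eps = 3, 0.5
# A dense random sample of S^{n-1} stands in for the whole sphere.
pts = rng.standard_normal((5000, n))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)
net = separated_subset(pts, eps)
bound = (1 + 2 / eps) ** n  # volumetric bound from Lemma 5.2
print(len(net), bound)
```

Any $\varepsilon$-separated subset of the sphere obeys the volumetric bound, so the assertion holds regardless of the sampling.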
Nets allow us to reduce the complexity of computations with linear operators. One such example is the computation of the spectral norm. To evaluate the spectral norm by definition (5.4) one needs to take the supremum over the whole sphere . However, one can essentially replace the sphere by its net:
Lemma 5.3 (Computing the spectral norm on a net).
Let $A$ be an $N \times n$ matrix, and let $\mathcal{N}_\varepsilon$ be an $\varepsilon$-net of $S^{n-1}$ for some $\varepsilon \in [0, 1)$. Then
$\max_{x \in \mathcal{N}_\varepsilon} \|Ax\|_2 \le \|A\| \le (1 - \varepsilon)^{-1} \max_{x \in \mathcal{N}_\varepsilon} \|Ax\|_2.$
Proof.
The lower bound in the conclusion follows from the definition. To prove the upper bound let us fix $x \in S^{n-1}$ for which $\|A\| = \|Ax\|_2$, and choose $y \in \mathcal{N}_\varepsilon$ which approximates $x$ as $\|x - y\|_2 \le \varepsilon$. By the triangle inequality we have $\|Ax - Ay\|_2 \le \|A\| \|x - y\|_2 \le \varepsilon \|A\|$. It follows that
$\|Ay\|_2 \ge \|Ax\|_2 - \varepsilon\|A\| = (1 - \varepsilon)\|A\|.$
Taking the maximum over all $y \in \mathcal{N}_\varepsilon$ in this inequality, we complete the proof. ∎
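Lemma 5.3 can be verified directly in a small dimension. For the circle $S^1$ an exact $\varepsilon$-net is easy to build by spacing points at chordal distance at most $\varepsilon$; the following sketch (assuming NumPy; the matrix size is an arbitrary choice) then checks the two-sided bound:

```python
import numpy as np

rng = np.random.default_rng(3)
eps = 0.25
A = rng.standard_normal((4, 2))
true_norm = np.linalg.norm(A, 2)
# An exact eps-net of the circle S^1: consecutive points are at chordal
# distance exactly eps, so every point of S^1 is within eps of the net.
step = 2 * np.arcsin(eps / 2)
angles = np.arange(0, 2 * np.pi, step)
net = np.column_stack([np.cos(angles), np.sin(angles)])
net_max = max(np.linalg.norm(A @ x) for x in net)
# Lemma 5.3: net_max <= ||A|| <= net_max / (1 - eps)
print(net_max <= true_norm <= net_max / (1 - eps))
```

The same check in higher dimensions only requires an $\varepsilon$-net of $S^{n-1}$ in place of the equally spaced points on the circle.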
A similar result holds for symmetric $n \times n$ matrices $A$, whose spectral norm can be computed via the associated quadratic form: $\|A\| = \sup_{x \in S^{n-1}} |\langle Ax, x \rangle|$. Again, one can essentially replace the sphere by its net:
Lemma 5.4 (Computing the spectral norm on a net).
Let $A$ be a symmetric $n \times n$ matrix, and let $\mathcal{N}_\varepsilon$ be an $\varepsilon$-net of $S^{n-1}$ for some $\varepsilon \in [0, 1/2)$. Then
$\|A\| = \sup_{x \in S^{n-1}} |\langle Ax, x \rangle| \le (1 - 2\varepsilon)^{-1} \sup_{x \in \mathcal{N}_\varepsilon} |\langle Ax, x \rangle|.$
Proof.
Let us choose $x \in S^{n-1}$ for which $\|A\| = |\langle Ax, x \rangle|$, and choose $y \in \mathcal{N}_\varepsilon$ which approximates $x$ as $\|x - y\|_2 \le \varepsilon$. By the triangle inequality we have
$|\langle Ax, x \rangle - \langle Ay, y \rangle| = |\langle Ax, x - y \rangle + \langle A(x - y), y \rangle| \le 2\varepsilon\|A\|.$
It follows that $|\langle Ay, y \rangle| \ge |\langle Ax, x \rangle| - 2\varepsilon\|A\| = (1 - 2\varepsilon)\|A\|$. Taking the maximum over all $y \in \mathcal{N}_\varepsilon$ in this inequality completes the proof. ∎
5.2.3 Subgaussian random variables
In this section we introduce the class of subgaussian random variables (it would be more rigorous to say that we study subgaussian probability distributions, and the same concerns some other properties of random variables and random vectors we study later in this text; however, it is convenient for us to focus on random variables and vectors because we will form random matrices out of them), those whose distributions are dominated by the distribution of a centered Gaussian random variable. This is a convenient and quite wide class, which contains in particular the standard normal and all bounded random variables.
Let us briefly recall some of the well known properties of the standard normal random variable $g$. The distribution of $g$ has density $f(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$ and is denoted $N(0, 1)$. Estimating the integral of this density between $t$ and $\infty$, one checks that the tail of a standard normal random variable decays super-exponentially:
(5.5)  $\mathbb{P}\{|g| > t\} \le 2e^{-t^2/2}$ for $t \ge 1$;
see e.g. [26, Theorem 1.4] for a more precise two-sided inequality. The absolute moments of $g$ can be computed as
(5.6)  $(\mathbb{E}|g|^p)^{1/p} = \sqrt{2}\left(\frac{\Gamma((1+p)/2)}{\Gamma(1/2)}\right)^{1/p} = O(\sqrt{p})$ as $p \to \infty.$
The moment generating function of $g$ equals
(5.7)  $\mathbb{E}\exp(tg) = e^{t^2/2}$ for all $t \in \mathbb{R}.$
Now let be a general random variable. We observe that these three properties are equivalent – a superexponential tail decay like in (5.5), the moment growth (5.6), and the growth of the moment generating function like in (5.7). We will then focus on the class of random variables that satisfy these properties, which we shall call subgaussian random variables.
Lemma 5.5 (Equivalence of subgaussian properties).
Let $X$ be a random variable. Then the following properties are equivalent with parameters $K_i > 0$ differing from each other by at most an absolute constant factor. (The precise meaning of this equivalence is the following: there exists an absolute constant $C$ such that property $i$ implies property $j$ with parameter $K_j \le C K_i$ for any two properties $i, j$.)

1. Tails: $\mathbb{P}\{|X| > t\} \le \exp(1 - t^2/K_1^2)$ for all $t \ge 0$;

2. Moments: $(\mathbb{E}|X|^p)^{1/p} \le K_2\sqrt{p}$ for all $p \ge 1$;

3. Super-exponential moment: $\mathbb{E}\exp(X^2/K_3^2) \le e$.

Moreover, if $\mathbb{E}X = 0$ then properties 1–3 are also equivalent to the following one:

4. Moment generating function: $\mathbb{E}\exp(tX) \le \exp(t^2 K_4^2)$ for all $t \in \mathbb{R}$.
Proof.
1. ⇒ 2. Assume property 1 holds. By homogeneity, rescaling $X$ to $X/K_1$ we can assume that $K_1 = 1$. Recall that for every nonnegative random variable $Z$, integration by parts yields the identity $\mathbb{E}Z = \int_0^\infty \mathbb{P}\{Z \ge u\}\,du$. We apply this for $Z = |X|^p$. After the change of variables $u = t^p$, we obtain using property 1 that
$\mathbb{E}|X|^p = \int_0^\infty \mathbb{P}\{|X| \ge t\}\, p t^{p-1}\, dt \le \int_0^\infty e^{1-t^2} p t^{p-1}\, dt = \frac{ep}{2}\,\Gamma\!\left(\frac{p}{2}\right) \le \frac{ep}{2}\left(\frac{p}{2}\right)^{p/2}.$
Taking the $p$-th root yields property 2 with a suitable absolute constant $K_2$.

2. ⇒ 3. Assume property 2 holds. As before, by homogeneity we may assume that $K_2 = 1$. Let $c > 0$ be a sufficiently small absolute constant. Writing the Taylor series of the exponential function, we obtain
$\mathbb{E}\exp(cX^2) = 1 + \sum_{p=1}^\infty \frac{c^p\,\mathbb{E}X^{2p}}{p!} \le 1 + \sum_{p=1}^\infty \frac{c^p(2p)^p}{p!} \le 1 + \sum_{p=1}^\infty (2ce)^p.$
The first inequality follows from property 2; in the second one we use $p! \ge (p/e)^p$. For small $c$ this gives $\mathbb{E}\exp(cX^2) \le e$, which is property 3 with $K_3 = c^{-1/2}$.

3. ⇒ 1. Assume property 3 holds. As before we may assume that $K_3 = 1$. Exponentiating and using Markov's inequality (this simple argument is sometimes called the exponential Markov inequality) and then property 3, we have
$\mathbb{P}\{|X| > t\} = \mathbb{P}\{e^{X^2} \ge e^{t^2}\} \le e^{-t^2}\,\mathbb{E}e^{X^2} \le e^{1-t^2}.$
This proves property 1 with $K_1 = 1$.

2. ⇒ 4. Let us now assume that $\mathbb{E}X = 0$ and property 2 holds; as usual we can assume that $K_2 = 1$. We will prove that property 4 holds with an appropriately large absolute constant $C = K_4$. This will follow by estimating the Taylor series of the exponential function
(5.8)  $\mathbb{E}\exp(tX) = 1 + \sum_{p=2}^\infty \frac{t^p\,\mathbb{E}X^p}{p!} \le 1 + \sum_{p=2}^\infty \frac{|t|^p p^{p/2}}{p!} \le 1 + \sum_{p=2}^\infty \left(\frac{e|t|}{\sqrt{p}}\right)^p.$
The first inequality here follows from $\mathbb{E}X = 0$ and property 2; the second one holds since $p! \ge (p/e)^p$. We compare this with the Taylor series for
(5.9)  $\exp(C^2t^2) = 1 + \sum_{k=1}^\infty \frac{(C|t|)^{2k}}{k!} \ge 1 + \sum_{k=1}^\infty \left(\frac{C|t|}{\sqrt{k}}\right)^{2k} = 1 + \sum_{p \in 2\mathbb{N}} \left(\frac{\sqrt{2}\,C|t|}{\sqrt{p}}\right)^p.$
The inequality here holds because $k! \le k^k$; the last equality is obtained by the substitution $p = 2k$. One can show that the series in (5.8) is bounded by the series in (5.9) for a sufficiently large absolute constant $C$. We conclude that $\mathbb{E}\exp(tX) \le \exp(C^2t^2)$, which proves property 4.

4. ⇒ 1. Assume property 4 holds; we can also assume that $K_4 = 1$. Let $\lambda > 0$ be a parameter to be chosen later. By the exponential Markov inequality, and using the bound on the moment generating function given in property 4, we obtain
$\mathbb{P}\{X \ge t\} = \mathbb{P}\{e^{\lambda X} \ge e^{\lambda t}\} \le e^{-\lambda t}\,\mathbb{E}e^{\lambda X} \le e^{-\lambda t + \lambda^2}.$
Optimizing in $\lambda$ and thus choosing $\lambda = t/2$ we conclude that $\mathbb{P}\{X \ge t\} \le e^{-t^2/4}$. Repeating this argument for $-X$, we also obtain $\mathbb{P}\{X \le -t\} \le e^{-t^2/4}$. Combining these two bounds we conclude that $\mathbb{P}\{|X| \ge t\} \le 2e^{-t^2/4} \le e^{1-t^2/4}$. Thus property 1 holds with $K_1 = 2$. The lemma is proved. ∎
Remark 5.6.
1. The constants $e$ appearing in properties 1 and 3 are chosen for convenience. Thus the value $e$ in property 1 can be replaced by any positive number, and the value $e$ in property 3 can be replaced by any number greater than $1$, at the cost of changing the parameters $K_i$ by absolute constant factors.
2. The assumption $\mathbb{E}X = 0$ is only needed to prove the necessity of property 4; the sufficiency holds without this assumption.
Definition 5.7 (Subgaussian random variables).
A random variable $X$ that satisfies one of the equivalent properties 1–3 in Lemma 5.5 is called a subgaussian random variable. The subgaussian norm of $X$, denoted $\|X\|_{\psi_2}$, is defined to be the smallest $K_2$ in property 2. In other words (the subgaussian norm is also called the $\psi_2$ norm in the literature),
$\|X\|_{\psi_2} = \sup_{p \ge 1} p^{-1/2} (\mathbb{E}|X|^p)^{1/p}.$
The class of subgaussian random variables on a given probability space is thus a normed space. By Lemma 5.5, every subgaussian random variable $X$ satisfies:
(5.10)  $\mathbb{P}\{|X| > t\} \le \exp(1 - ct^2/\|X\|_{\psi_2}^2)$ for all $t \ge 0$;
(5.11)  $(\mathbb{E}|X|^p)^{1/p} \le \|X\|_{\psi_2}\sqrt{p}$ for all $p \ge 1$;
(5.12)  if $\mathbb{E}X = 0$, then $\mathbb{E}\exp(tX) \le \exp(Ct^2\|X\|_{\psi_2}^2)$ for all $t \in \mathbb{R}$,
where $C, c > 0$ are absolute constants. Moreover, up to absolute constant factors, $\|X\|_{\psi_2}$ is the smallest possible number in each of these inequalities.
Example 5.8.
Classical examples of subgaussian random variables are Gaussian, Bernoulli and all bounded random variables.

(Gaussian): A standard normal random variable $X$ is subgaussian with $\|X\|_{\psi_2} \le C$, where $C$ is an absolute constant. This follows from (5.6). More generally, if $X$ is a centered normal random variable with variance $\sigma^2$, then $X$ is subgaussian with $\|X\|_{\psi_2} \le C\sigma$.

(Bernoulli): Consider a random variable $X$ with distribution $\mathbb{P}\{X = -1\} = \mathbb{P}\{X = 1\} = 1/2$. We call $X$ a symmetric Bernoulli random variable. Since $|X| = 1$, it follows that $X$ is a subgaussian random variable with $\|X\|_{\psi_2} = 1$.

(Bounded): More generally, consider any bounded random variable $X$, thus $|X| \le M$ almost surely for some $M$. Then $X$ is a subgaussian random variable with $\|X\|_{\psi_2} \le M$. We can write this more compactly as $\|X\|_{\psi_2} \le \|X\|_\infty$.
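The subgaussian norm of these examples can be estimated numerically from the moment characterization $\sup_{p} p^{-1/2}(\mathbb{E}|X|^p)^{1/p}$. The sketch below (assuming Python; it takes the supremum over integer $p$ only, which lower-bounds the supremum over all $p \ge 1$) evaluates this quantity for the standard normal using the exact moment formula (5.6):

```python
import math

def psi2_norm_gaussian(pmax=40):
    # sup_p p^{-1/2} (E|g|^p)^{1/p} over integer p, for g ~ N(0,1), using
    # the exact moments E|g|^p = 2^{p/2} Gamma((p+1)/2) / Gamma(1/2).
    vals = []
    for p in range(1, pmax + 1):
        moment = 2 ** (p / 2) * math.gamma((p + 1) / 2) / math.gamma(0.5)
        vals.append(moment ** (1 / p) / math.sqrt(p))
    return max(vals)

print(round(psi2_norm_gaussian(), 3))
```

The value is a finite absolute constant (below $1$), in agreement with the Gaussian example above.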
A remarkable property of the normal distribution is rotation invariance. Given a finite number of independent centered normal random variables $X_i$, their sum $\sum_i X_i$ is also a centered normal random variable, obviously with $\mathrm{Var}(\sum_i X_i) = \sum_i \mathrm{Var}(X_i)$. Rotation invariance passes onto subgaussian random variables, although approximately:
Lemma 5.9 (Rotation invariance).
Consider a finite number of independent centered subgaussian random variables $X_i$. Then $\sum_i X_i$ is also a centered subgaussian random variable. Moreover,
$\Big\|\sum_i X_i\Big\|_{\psi_2}^2 \le C \sum_i \|X_i\|_{\psi_2}^2,$
where $C$ is an absolute constant.
Proof.
The argument is based on estimating the moment generating function. By independence and by (5.12), for every $t \in \mathbb{R}$ we have
$\mathbb{E}\exp\Big(t\sum_i X_i\Big) = \prod_i \mathbb{E}\exp(tX_i) \le \prod_i \exp(Ct^2\|X_i\|_{\psi_2}^2) = \exp(t^2 K^2), \quad \text{where } K^2 = C\sum_i \|X_i\|_{\psi_2}^2.$
Using the equivalence of properties 4 and 2 of Lemma 5.5, we conclude that $\|\sum_i X_i\|_{\psi_2} \le C_1 K$, where $C_1$ is an absolute constant. The proof is complete. ∎
The rotation invariance immediately yields a large deviation inequality for sums of independent subgaussian random variables:
Proposition 5.10 (Hoeffdingtype inequality).
Let $X_1, \ldots, X_N$ be independent centered subgaussian random variables, and let $K = \max_i \|X_i\|_{\psi_2}$. Then for every $a = (a_1, \ldots, a_N) \in \mathbb{R}^N$ and every $t \ge 0$, we have
$\mathbb{P}\Big\{\Big|\sum_{i=1}^N a_i X_i\Big| \ge t\Big\} \le e \cdot \exp\Big(-\frac{ct^2}{K^2\|a\|_2^2}\Big),$
where $c > 0$ is an absolute constant.
Proof.
This follows from the rotation invariance (Lemma 5.9): the sum $\sum_i a_i X_i$ is a centered subgaussian random variable with $\|\sum_i a_i X_i\|_{\psi_2}^2 \le C\sum_i a_i^2 \|X_i\|_{\psi_2}^2 \le CK^2\|a\|_2^2$, and it remains to apply the tail bound (5.10). ∎
Remark 5.11.
One can interpret these results (Lemma 5.9 and Proposition 5.10) as one-sided nonasymptotic manifestations of the central limit theorem. For example, consider the normalized sum of independent symmetric Bernoulli random variables $S_N = N^{-1/2}\sum_{i=1}^N \varepsilon_i$. Proposition 5.10 yields the tail bounds $\mathbb{P}\{|S_N| \ge t\} \le e \cdot e^{-ct^2}$ for any number of terms $N$. Up to the absolute constants $e$ and $c$, these tails coincide with those of the standard normal random variable (5.5).
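This comparison with the Gaussian tail (5.5) can be observed empirically. The sketch below (assuming NumPy; the number of terms, the trial count, and the level $t$ are illustrative choices) estimates the tail of the normalized Bernoulli sum and checks it against a Gaussian-type benchmark:

```python
import numpy as np

rng = np.random.default_rng(4)
N, trials, t = 400, 20000, 2.0
# Normalized sums S_N = N^{-1/2} * sum_i eps_i of symmetric Bernoulli signs.
eps = rng.choice([-1.0, 1.0], size=(trials, N))
S = eps.sum(axis=1) / np.sqrt(N)
tail = np.mean(np.abs(S) >= t)
gauss_tail = 2 * np.exp(-t * t / 2)  # Gaussian-type benchmark, up to constants
print(tail, gauss_tail)
```

The empirical tail sits below the Gaussian-type bound uniformly in $N$, which is the point of the remark.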
Using moment growth (5.11) instead of the tail decay (5.10), we immediately obtain from Lemma 5.9 a general form of the well known Khintchine inequality:
Corollary 5.12 (Khintchine inequality).
Let $X_1, \ldots, X_N$ be a finite number of independent subgaussian random variables with zero mean, unit variance, and $\|X_i\|_{\psi_2} \le K$. Then, for every sequence of coefficients $(a_i)$ and every exponent $p \ge 2$ we have
$\Big(\sum_i a_i^2\Big)^{1/2} \le \Big(\mathbb{E}\Big|\sum_i a_i X_i\Big|^p\Big)^{1/p} \le CK\sqrt{p}\,\Big(\sum_i a_i^2\Big)^{1/2},$
where $C$ is an absolute constant.
5.2.4 Subexponential random variables
Although the class of subgaussian random variables is natural and quite wide, it leaves out some useful random variables which have tails heavier than Gaussian. One such example is a standard exponential random variable – a nonnegative random variable with exponential tail decay
(5.13)  $\mathbb{P}\{X \ge t\} = e^{-t}$ for $t \ge 0.$
To cover such examples, we consider a class of subexponential random variables, those with at least an exponential tail decay. With appropriate modifications, the basic properties of subgaussian random variables hold for subexponentials. In particular, a version of Lemma 5.5 holds with a similar proof for subexponential properties, except for property 4 of the moment generating function. Thus for a random variable $X$ the following properties are equivalent with parameters $K_i > 0$ differing from each other by at most an absolute constant factor:
(5.14)  $\mathbb{P}\{|X| > t\} \le \exp(1 - t/K_1)$ for all $t \ge 0$;
(5.15)  $(\mathbb{E}|X|^p)^{1/p} \le K_2\,p$ for all $p \ge 1$;
(5.16)  $\mathbb{E}\exp(X/K_3) \le e.$
Definition 5.13 (Subexponential random variables).
A random variable $X$ that satisfies one of the equivalent properties (5.14)–(5.16) is called a subexponential random variable. The subexponential norm of $X$, denoted $\|X\|_{\psi_1}$, is defined to be the smallest $K_2$ in (5.15). In other words,
$\|X\|_{\psi_1} = \sup_{p \ge 1} p^{-1}(\mathbb{E}|X|^p)^{1/p}.$
Lemma 5.14 (Subexponential is subgaussian squared).
A random variable $X$ is subgaussian if and only if $X^2$ is subexponential. Moreover,
$\|X\|_{\psi_2}^2 \le \|X^2\|_{\psi_1} \le 2\|X\|_{\psi_2}^2.$
Proof.
This follows easily from the definition. ∎
The moment generating function of a subexponential random variable has a similar upper bound as in the subgaussian case (property 4 in Lemma 5.5). The only real difference is that the bound only holds in a neighborhood of zero rather than on the whole real line. This is inevitable, as the moment generating function of the exponential random variable (5.13) does not exist for $t \ge 1$.
Lemma 5.15 (Mgf of subexponential random variables).
Let $X$ be a centered subexponential random variable. Then, for $t$ such that $|t| \le c/\|X\|_{\psi_1}$, one has
$\mathbb{E}\exp(tX) \le \exp(Ct^2\|X\|_{\psi_1}^2),$
where $C, c > 0$ are absolute constants.
Proof.
The argument is similar to the subgaussian case. We can assume that $\|X\|_{\psi_1} = 1$ by replacing $X$ with $X/\|X\|_{\psi_1}$ and $t$ with $t\|X\|_{\psi_1}$. Repeating the proof of the implication 2 ⇒ 4 of Lemma 5.5 and using $\mathbb{E}|X|^p \le p^p$ this time, we obtain that $\mathbb{E}\exp(tX) \le 1 + \sum_{p=2}^\infty (e|t|)^p$. If $|t| \le 1/2e$ then the right hand side is bounded by $1 + 2e^2t^2 \le \exp(Ct^2)$. This completes the proof. ∎
Subexponential random variables satisfy a large deviation inequality similar to the one for subgaussians (Proposition 5.10). The only significant difference is that two tails have to appear here – a gaussian tail responsible for the central limit theorem, and an exponential tail coming from the tails of each term.
Proposition 5.16 (Bernsteintype inequality).
Let $X_1, \ldots, X_N$ be independent centered subexponential random variables, and $K = \max_i \|X_i\|_{\psi_1}$. Then for every $a = (a_1, \ldots, a_N) \in \mathbb{R}^N$ and every $t \ge 0$, we have
$\mathbb{P}\Big\{\Big|\sum_{i=1}^N a_i X_i\Big| \ge t\Big\} \le 2\exp\Big[-c\min\Big(\frac{t^2}{K^2\|a\|_2^2}, \frac{t}{K\|a\|_\infty}\Big)\Big],$
where $c > 0$ is an absolute constant.
Proof.
Without loss of generality, we assume that $K = 1$ by replacing $X_i$ with $X_i/K$ and $t$ with $t/K$. We use the exponential Markov inequality for the sum $S = \sum_i a_i X_i$ and with a parameter $\lambda > 0$:
$\mathbb{P}\{S \ge t\} = \mathbb{P}\{e^{\lambda S} \ge e^{\lambda t}\} \le e^{-\lambda t}\,\mathbb{E}e^{\lambda S} = e^{-\lambda t}\prod_i \mathbb{E}\exp(\lambda a_i X_i).$
If $\lambda \le c/\|a\|_\infty$ then $|\lambda a_i| \le c$ for all $i$, so Lemma 5.15 yields
$\mathbb{P}\{S \ge t\} \le e^{-\lambda t}\prod_i \exp(C\lambda^2 a_i^2) = \exp(-\lambda t + C\lambda^2\|a\|_2^2).$
Choosing $\lambda = \min\big(\frac{t}{2C\|a\|_2^2}, \frac{c}{\|a\|_\infty}\big)$, we obtain that
$\mathbb{P}\{S \ge t\} \le \exp\Big[-\min\Big(\frac{t^2}{4C\|a\|_2^2}, \frac{ct}{2\|a\|_\infty}\Big)\Big].$
Repeating this argument for $-X_i$ instead of $X_i$, we obtain the same bound for $\mathbb{P}\{-S \ge t\}$. A combination of these two bounds completes the proof. ∎
Corollary 5.17.
Let $X_1, \ldots, X_N$ be independent centered subexponential random variables, and let $K = \max_i \|X_i\|_{\psi_1}$. Then, for every $\varepsilon \ge 0$, we have
$\mathbb{P}\Big\{\Big|\frac{1}{N}\sum_{i=1}^N X_i\Big| \ge \varepsilon\Big\} \le 2\exp\Big[-c\min\Big(\frac{\varepsilon^2}{K^2}, \frac{\varepsilon}{K}\Big)N\Big],$
where $c > 0$ is an absolute constant.
Proof.
This follows from Proposition 5.16 for $a = (1, \ldots, 1)$ and $t = \varepsilon N$. ∎
Remark 5.18 (Centering).
The definitions of subgaussian and subexponential random variables do not require them to be centered. In any case, one can always center using the simple fact that if $X$ is subgaussian (or subexponential), then so is $X - \mathbb{E}X$. Moreover,
$\|X - \mathbb{E}X\|_{\psi_2} \le 2\|X\|_{\psi_2}.$
This follows by the triangle inequality $\|X - \mathbb{E}X\|_{\psi_2} \le \|X\|_{\psi_2} + \|\mathbb{E}X\|_{\psi_2}$ along with $\|\mathbb{E}X\|_{\psi_2} \le |\mathbb{E}X| \le \mathbb{E}|X| \le \|X\|_{\psi_2}$, and similarly for the subexponential norm.
5.2.5 Isotropic random vectors
Now we carry our work over to higher dimensions. We will thus be working with random vectors $X$ in $\mathbb{R}^n$, or equivalently probability distributions in $\mathbb{R}^n$.
While the concept of the mean $\mu = \mathbb{E}X$ of a random variable remains the same in higher dimensions, the second moment $\mathbb{E}X^2$ is replaced by the $n \times n$ second moment matrix of a random vector $X$, defined as
$\Sigma = \Sigma(X) = \mathbb{E}\,X \otimes X,$
where $\otimes$ denotes the outer product of vectors in $\mathbb{R}^n$. Similarly, the concept of variance of a random variable is replaced in higher dimensions with the covariance matrix of a random vector $X$, defined as
$\mathrm{Cov}(X) = \mathbb{E}(X - \mu) \otimes (X - \mu) = \Sigma(X) - \mu \otimes \mu,$
where $\mu = \mathbb{E}X$. By translation, many questions can be reduced to the case of centered random vectors, for which $\mu = 0$ and $\mathrm{Cov}(X) = \Sigma(X)$. We will also need a higher-dimensional version of unit variance:
Definition 5.19 (Isotropic random vectors).
A random vector $X$ in $\mathbb{R}^n$ is called isotropic if $\Sigma(X) = I$. Equivalently, $X$ is isotropic if
(5.17)  $\mathbb{E}\langle X, x \rangle^2 = \|x\|_2^2$ for all $x \in \mathbb{R}^n.$
Suppose $\Sigma(X)$ is an invertible matrix, which means that the distribution of $X$ is not essentially supported on any proper subspace of $\mathbb{R}^n$. Then $\Sigma(X)^{-1/2}X$ is an isotropic random vector in $\mathbb{R}^n$. Thus every non-degenerate random vector can be made isotropic by an appropriate linear transformation. (This transformation, usually preceded by centering, is a higher-dimensional version of standardizing a random variable, which enforces zero mean and unit variance.) This allows us to mostly focus on studying isotropic random vectors in the future.
Lemma 5.20.
Let $X$, $Y$ be independent isotropic random vectors in $\mathbb{R}^n$. Then $\mathbb{E}\|X\|_2^2 = n$ and $\mathbb{E}\langle X, Y \rangle^2 = n$.
Proof.
The first part follows from $\mathbb{E}\|X\|_2^2 = \mathbb{E}\,\mathrm{tr}(X \otimes X) = \mathrm{tr}(\Sigma(X)) = \mathrm{tr}(I) = n$. The second part follows by conditioning on $Y$, using isotropy of $X$ and using the first part for $Y$: this way we obtain $\mathbb{E}\langle X, Y \rangle^2 = \mathbb{E}\|Y\|_2^2 = n$. ∎
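Both identities of Lemma 5.20 are easy to confirm by simulation for the standard Gaussian vector, which is isotropic. A sketch (assuming NumPy; the dimension and trial count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(5)
n, trials = 10, 100000
# X, Y are independent standard Gaussian vectors, which are isotropic;
# Lemma 5.20 predicts E||X||^2 = n and E<X, Y>^2 = n.
X = rng.standard_normal((trials, n))
Y = rng.standard_normal((trials, n))
norm2 = np.mean(np.sum(X * X, axis=1))
dot2 = np.mean(np.sum(X * Y, axis=1) ** 2)
print(round(norm2, 1), round(dot2, 1))
```

Both empirical averages concentrate near $n$, as the lemma predicts.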
Example 5.21.

(Gaussian): The (standard) Gaussian random vector $X$ in $\mathbb{R}^n$ chosen according to the standard normal distribution $N(0, I)$ is isotropic. The coordinates of $X$ are independent standard normal random variables.

(Bernoulli): A similar example of a discrete isotropic distribution is given by a Bernoulli random vector $X$ in $\mathbb{R}^n$ whose coordinates are independent symmetric Bernoulli random variables.

(Product distributions): More generally, consider a random vector $X$ in $\mathbb{R}^n$ whose coordinates are independent random variables with zero mean and unit variance. Then clearly $X$ is an isotropic vector in $\mathbb{R}^n$.

(Coordinate): Consider a coordinate random vector $X$, which is uniformly distributed in the set $\{\sqrt{n}\,e_i\}_{i=1}^n$, where $\{e_i\}_{i=1}^n$ is the canonical basis of $\mathbb{R}^n$. Clearly $X$ is an isotropic random vector in $\mathbb{R}^n$. (The examples of Gaussian and coordinate random vectors are somewhat opposite – one is very continuous and the other is very discrete. They may be used as test cases in our study of random matrices.)

(Frame): This is a more general version of the coordinate random vector. A frame is a set of vectors $\{u_i\}_{i=1}^M$ in $\mathbb{R}^n$ which obeys an approximate Parseval identity, i.e. there exist numbers $A \le B$ called frame bounds such that
$A\|x\|_2^2 \le \sum_{i=1}^M \langle u_i, x \rangle^2 \le B\|x\|_2^2$ for all $x \in \mathbb{R}^n.$
If $A = B$ the set is called a tight frame. Thus, tight frames are generalizations of orthogonal bases without linear independence. Given a tight frame with bounds $A = B = M$, the random vector $X$ uniformly distributed in the set $\{u_i\}_{i=1}^M$ is clearly isotropic in $\mathbb{R}^n$. (There is clearly a reverse implication, too, which shows that the class of tight frames can be identified with the class of discrete isotropic random vectors.)

(Spherical): Consider a random vector $X$ uniformly distributed on the Euclidean sphere in $\mathbb{R}^n$ with center at the origin and radius $\sqrt{n}$. Then $X$ is isotropic. Indeed, by rotation invariance $\mathbb{E}\langle X, x \rangle^2$ is proportional to $\|x\|_2^2$; the correct normalization $\sqrt{n}$ is derived from Lemma 5.20.

(Uniform on a convex set): In convex geometry, a convex set $K$ in $\mathbb{R}^n$ is called isotropic if a random vector $X$ chosen uniformly from $K$ according to the volume is isotropic. As we noted, every full-dimensional convex set can be made into an isotropic one by an affine transformation. Isotropic convex sets look "well conditioned", which is advantageous in geometric algorithms (e.g. volume computations).
We generalize the concept of subgaussian random variables to higher dimensions using one-dimensional marginals.
Definition 5.22 (Subgaussian random vectors).
We say that a random vector $X$ in $\mathbb{R}^n$ is subgaussian if the one-dimensional marginals $\langle X, x \rangle$ are subgaussian random variables for all $x \in \mathbb{R}^n$. The subgaussian norm of $X$ is defined as
$\|X\|_{\psi_2} = \sup_{x \in S^{n-1}} \|\langle X, x \rangle\|_{\psi_2}.$
Remark 5.23 (Properties of highdimensional distributions).
The definitions of isotropic and subgaussian distributions suggest that more generally, natural properties of highdimensional distributions may be defined via onedimensional marginals. This is a natural way to generalize properties of random variables to random vectors. For example, we shall call a random vector subexponential if all of its onedimensional marginals are subexponential random variables, etc.
One simple way to create subgaussian distributions in $\mathbb{R}^n$ is by taking a product of subgaussian distributions on the line:
Lemma 5.24 (Product of subgaussian distributions).
Let $X_1, \ldots, X_n$ be independent centered subgaussian random variables. Then $X = (X_1, \ldots, X_n)$ is a centered subgaussian random vector in $\mathbb{R}^n$, and
$\|X\|_{\psi_2} \le C \max_{i \le n} \|X_i\|_{\psi_2},$
where $C$ is an absolute constant.
Proof.
This is a direct consequence of the rotation invariance principle, Lemma 5.9. Indeed, for every $x = (x_1, \ldots, x_n) \in S^{n-1}$ we have
$\|\langle X, x \rangle\|_{\psi_2}^2 = \Big\|\sum_{i=1}^n x_i X_i\Big\|_{\psi_2}^2 \le C\sum_{i=1}^n x_i^2\|X_i\|_{\psi_2}^2 \le C\max_{i \le n}\|X_i\|_{\psi_2}^2,$
where we used that $\sum_{i=1}^n x_i^2 = 1$. This completes the proof. ∎
Example 5.25.
Let us analyze the basic examples of random vectors introduced earlier in Example 5.21.

(Gaussian, Bernoulli): Gaussian and Bernoulli random vectors are subgaussian; their subgaussian norms are bounded by an absolute constant. These are particular cases of Lemma 5.24.

(Spherical): A spherical random vector is also subgaussian; its subgaussian norm is bounded by an absolute constant. Unfortunately, this does not follow from Lemma 5.24 because the coordinates of the spherical vector are not independent. Instead, by rotation invariance, the claim clearly follows from the following geometric fact: for every $t \ge 0$, the spherical cap $\{x \in S^{n-1} : x_1 \ge t\}$ makes up at most $e^{-t^2 n/2}$ proportion of the total area on the sphere. (This fact about spherical caps may seem counterintuitive. For example, for small $t$ the cap looks similar to a hemisphere, but the proportion of its area goes to zero very fast as the dimension $n$ increases. This is a starting point of the study of the concentration of measure phenomenon, see [43].) This can be proved directly by integration, and also by elementary geometric considerations [9, Lemma 2.2].

(Coordinate): Although the coordinate random vector $X$ is formally subgaussian as its support is finite, its subgaussian norm is too big: $\|X\|_{\psi_2} \asymp (n/\log n)^{1/2}$. So we would not think of $X$ as a subgaussian random vector.

(Uniform on a convex set): For many isotropic convex sets $K$ (called $\psi_2$ bodies), a random vector $X$ uniformly distributed in $K$ is subgaussian with $\|X\|_{\psi_2} = O(1)$. For example, the cube $[-1, 1]^n$ is a $\psi_2$ body by Lemma 5.24, while the appropriately normalized cross-polytope is not. Nevertheless, Borell's lemma (which is a consequence of the Brunn-Minkowski inequality) implies a weaker property, that $X$ is always subexponential, and $\|X\|_{\psi_1}$ is bounded by an absolute constant. See [33, Section 2.2.b] for a proof and discussion of these ideas.
5.2.6 Sums of independent random matrices
In this section, we mention without proof some results of classical probability theory in which scalars can be replaced by matrices. Such results are useful in particular for problems on random matrices, since we can view a random matrix as a generalization of a random variable. One such remarkable generalization is valid for the Khintchine inequality, Corollary 5.12. The scalars $a_i$ can be replaced by matrices, and the absolute value by the Schatten norm. Recall that for $1 \le p \le \infty$, the Schatten norm of an $n \times n$ matrix $A$ is defined as the $\ell_p$ norm of the sequence of its singular values:
$\|A\|_{C_p} = \Big(\sum_{i=1}^n s_i(A)^p\Big)^{1/p}.$
For $p = \infty$, the Schatten norm equals the spectral norm $\|A\| = \max_{i \le n} s_i(A)$. Using this one can quickly check that already for $p = \log n$ the Schatten and spectral norms are equivalent: $\|A\| \le \|A\|_{C_{\log n}} \le e\|A\|$.
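The equivalence at $p = \log n$ follows from $\|A\| \le \|A\|_{C_p} \le n^{1/p}\|A\|$ together with $n^{1/\log n} = e$. A quick numerical check (assuming NumPy; the Gaussian test matrix is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50
A = rng.standard_normal((n, n))
s = np.linalg.svd(A, compute_uv=False)  # singular values of A
spectral = s[0]
p = np.log(n)
schatten_p = np.sum(s ** p) ** (1 / p)
# ||A|| <= ||A||_{C_p} <= n^{1/p} ||A|| = e ||A||  when p = log n
print(spectral <= schatten_p <= np.e * spectral)
```

The two-sided comparison holds deterministically for every matrix, not just for this random instance.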
Theorem 5.26 (Noncommutative Khintchine inequality, see [61] Section 9.8).
Let $A_1, \ldots, A_N$ be self-adjoint $n \times n$ matrices and $\varepsilon_1, \ldots, \varepsilon_N$ be independent symmetric Bernoulli random variables. Then, for every $2 \le p < \infty$, we have
$\Big\|\Big(\sum_{i=1}^N A_i^2\Big)^{1/2}\Big\|_{C_p} \le \Big(\mathbb{E}\Big\|\sum_{i=1}^N \varepsilon_i A_i\Big\|_{C_p}^p\Big)^{1/p} \le C\sqrt{p}\,\Big\|\Big(\sum_{i=1}^N A_i^2\Big)^{1/2}\Big\|_{C_p},$
where $C$ is an absolute constant.
Remark 5.27.
1. The scalar case of this result, for $n = 1$, recovers the classical Khintchine inequality, Corollary 5.12, for $X_i = \varepsilon_i$.
2. By the equivalence of Schatten and spectral norms for $p = \log n$, a version of the noncommutative Khintchine inequality holds for the spectral norm:
(5.18)  $\mathbb{E}\Big\|\sum_{i=1}^N \varepsilon_i A_i\Big\| \le C\sqrt{\log n}\,\Big\|\Big(\sum_{i=1}^N A_i^2\Big)^{1/2}\Big\|,$
where $C$ is an absolute constant. The logarithmic factor is unfortunately essential; its role will be clear when we discuss applications of this result to random matrices in the next sections.
Corollary 5.28 (Rudelson’s inequality [65]).
Let $x_1, \ldots, x_N$ be vectors in $\mathbb{R}^n$ and $\varepsilon_1, \ldots, \varepsilon_N$ be independent symmetric Bernoulli random variables. Then
$\mathbb{E}\Big\|\sum_{i=1}^N \varepsilon_i\, x_i \otimes x_i\Big\| \le C\sqrt{\log\min(N, n)}\cdot\max_{i \le N}\|x_i\|_2 \cdot \Big\|\sum_{i=1}^N x_i \otimes x_i\Big\|^{1/2},$
where $C$ is an absolute constant.