The lower tail of random quadratic forms, with applications to ordinary least squares and restricted eigenvalue properties
Finite sample properties of random covariance-type matrices have been the subject of much research. In this paper we focus on the “lower tail” of such a matrix, and prove that it is subgaussian under a simple fourth moment assumption on the one-dimensional marginals of the random vectors. A similar result holds for more general sums of random positive semidefinite matrices, and the (relatively simple) proof uses a variant of the so-called PAC-Bayesian method for bounding empirical processes.
We give two applications of the main result. In the first one we obtain a new finite-sample bound for the ordinary least squares estimator in linear regression with random design. Our result is model-free, requires fairly weak moment assumptions and is almost optimal. Our second application is to bounding restricted eigenvalue constants of certain random ensembles with “heavy tails”. These constants are important in the analysis of problems in Compressed Sensing and High Dimensional Statistics, where one recovers a sparse vector from a small number of linear measurements. Our result implies that heavy tails still allow for the fast recovery rates found in efficient methods such as the LASSO and the Dantzig selector. Along the way we strengthen, with a fairly short argument, a recent result of Rudelson and Zhou on the restricted eigenvalue property.
Let be i.i.d. random (column) vectors in with finite second moments. This paper contributes to the problem of obtaining finite-sample concentration bounds for the random covariance-type operator
with mean . This problem has received a great deal of attention recently, and has important applications to the estimation of covariance matrices [SrivastavaVershynin2013, MendelsonPaouris2014], to the analysis of methods for least squares problems [HsuKakadeZhang2012] and to compressed sensing and high dimensional, small sample size statistics [AdamczakRIP2011, RaskuttiEtAl2010, RudelsonZhou2013].
The most basic problem is computing how many samples are needed to bring close to . One needs at least to bring close to , so that the ranks of the two matrices can match. A basic problem is to find conditions under which samples are enough for guaranteeing
where depends only on and on moment assumptions on the ’s.
A well known bound by Rudelson [Rudelson1999, Oliveira2010] implies samples are necessary and sufficient if the vectors have uniformly bounded norms. Removing the factor is relatively easy for subgaussian vectors , but even the seemingly nice case of logconcave random vectors (which have subexponential moments) had to wait for the breakthrough papers by Adamczak et al [AdamczakEtAl2010, AdamczakEtAl2011]. The current best results hold when the and all of their projections have moments [SrivastavaVershynin2013], and when their one-dimensional marginals have moments [MendelsonPaouris2014]; in the latter case one also needs (necessarily) a high probability bound on . None of those finite-moment results gives strong concentration bounds.
It turns out that, for many important applications, only the lower tail of matters. That is, we only need that is not much smaller than for all vectors in a suitable set. Our main result in this paper is that this lower tail is subgaussian under extremely weak conditions. More precisely, we will prove that if there exists a such that
then samples are enough to guarantee an asymmetric version of (2), to wit:
This follows from a more precise result – Theorem 3.1 in Section 3 below – about the more general case of sums of independent and identically distributed positive semidefinite random matrices. We note that the dependence on in our bound is optimal for vectors with independent coordinates, as can be shown via the Bai-Yin theorem [BaiYin1993].
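The asymmetry between the two tails can be illustrated with a small simulation. The sketch below is not taken from the paper: it assumes two-dimensional vectors with independent Student-t coordinates (4.5 degrees of freedom, so fourth moments are finite but higher moments are not), scaled to have identity covariance, and the helper names `student_t`, `eig2` and `run_experiment` are hypothetical. It records how far the smallest eigenvalue of the empirical second-moment matrix falls below 1 and how far the largest one rises above 1.

```python
import math
import random


def student_t(rng, df):
    """Draw one Student-t variate with `df` degrees of freedom, scaled to unit variance."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(df / 2.0, 2.0)  # chi-square with df degrees of freedom
    t = z / math.sqrt(chi2 / df)
    return t / math.sqrt(df / (df - 2.0))  # Var(t_df) = df / (df - 2) for df > 2


def eig2(a, b, c):
    """Eigenvalues (smallest, largest) of the symmetric matrix [[a, b], [b, c]]."""
    mean = (a + c) / 2.0
    radius = math.hypot((a - c) / 2.0, b)
    return mean - radius, mean + radius


def sample_covariance_eigs(rng, n, df):
    """Extreme eigenvalues of (1/n) * sum_i X_i X_i^T for 2-d heavy-tailed X_i."""
    a = b = c = 0.0
    for _ in range(n):
        x, y = student_t(rng, df), student_t(rng, df)
        a += x * x
        b += x * y
        c += y * y
    return eig2(a / n, b / n, c / n)


def run_experiment(seed=0, trials=200, n=500, df=4.5):
    """Return (worst shortfall of lambda_min below 1, worst excess of lambda_max above 1)."""
    rng = random.Random(seed)
    worst_low = worst_high = 0.0
    for _ in range(trials):
        lo, hi = sample_covariance_eigs(rng, n, df)
        worst_low = max(worst_low, 1.0 - lo)
        worst_high = max(worst_high, hi - 1.0)
    return worst_low, worst_high


worst_low, worst_high = run_experiment()
print(f"largest shortfall of lambda_min below 1: {worst_low:.3f}")
print(f"largest excess of lambda_max above 1:   {worst_high:.3f}")
```

The shortfall can never exceed 1 because the matrix is positive semidefinite, whereas the excess has no such cap. A simulation of this kind only suggests the phenomenon; it proves nothing about rates.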
We will give two applications to illustrate our main result. One is to least squares linear regression with random design, which we discuss in Section 4. In this problem one is given data in the form of i.i.d. copies of a random pair , and the goal is to find some such that is as good an approximation to as possible. The most basic method for this problem is the ordinary least squares estimator, and recent finite-sample bounds by Hsu et al. [HsuKakadeZhang2012] and Audibert and Catoni [AudibertCatoni2011] have shown that the error of the ordinary least squares method is , where measures the intensity of the noise. Both results hold in a model-free setting, where the data generating mechanism is not assumed to correspond to a linear model, but their assumptions are stringent in that they involve infinitely many moments and/or . We prove here a result – Theorem 4.1 below – that gives improved bounds under weaker assumptions. In particular, it seems to be the first bound of this form that only assumes finitely many moments of and .
The second application, discussed in Section 5, deals with so-called restricted eigenvalue constants. These values quantify how acts on vectors which are constrained to have a positive fraction of their norm on a set of coordinates. Restricted eigenvalues are used in the analysis of Compressed Sensing and High Dimensional Statistics problems, where one wants to estimate a vector from a number of linear measurements . Estimators such as the LASSO and the Dantzig selector [Tibshirani1996, CandesTao2007] have been analyzed under the condition that is sparse (with nonzero coordinates) and the linear measurement vectors have positive restricted eigenvalues [BickelRT2009, BuhlmannVanDerGeer2008]. It is thus natural to enquire whether random ensembles satisfy this property [RaskuttiEtAl2010, RudelsonZhou2013]. Theorem 5.2 shows that this property may be expected even when the measurement vectors have relatively heavy tails, as long as the sparsity parameter satisfies and one “normalizes” the matrix (which is in fact quite natural). In particular, we sketch in Section 5.3 what this result implies for random design linear regression when .
Let us briefly comment on some proof ideas we think might be useful elsewhere. Theorem 3.1, our main result, is proven via so-called PAC Bayesian methods and is inspired by the recent paper by Audibert and Catoni [AudibertCatoni2011]. We will see that this method allows one to translate properties of moment generating functions of individual random variables into uniform control of certain empirical processes. This is discussed in more detail in Section 3.2.
Later on, when we move to the problem of restricted eigenvalues, we will see that we need to control uniformly over vectors satisfying certain norm constraints. We will prove a “transfer principle” (Lemma 5.1 below) that implies that this control can be deduced from a (logically) weaker control of over sparse vectors. In spite of its very short proof, this result is stronger than a similar theorem in a recent paper by Rudelson and Zhou [RudelsonZhou2013]; this connection is discussed in Appendix A.
Organization: The next section covers some preliminaries and defines the notation we use. Section 3 contains the statement and proof of the main result, Theorem 3.1, along with a discussion of the assumptions and a proof overview. Section 4 presents our result on ordinary least squares, giving some background for the problem. Section 5 follows a similar format for restricted eigenvalues. The final section presents some remarks and open problems. Two Appendices contain a discussion of our improvement over [RudelsonZhou2013], and some estimates used in the main text.
2 Notation and preliminaries
The coordinates of a vector are denoted by . The support of is the set:
The restriction of to a subset is the vector with for and for .
The norm of , denoted by , is simply the cardinality of . For , the norm is defined as:
is the space of matrices with rows, columns and real entries. Given , we denote by its transpose. is symmetric if . Given we let denote the trace of and denote its largest eigenvalue. The identity matrix is denoted by . We identify with the space of column vectors , so that the standard Euclidean inner product of is .
We say that is positive semidefinite if it is symmetric and for all . In this case one can easily show that
The norm of is
For symmetric this is the largest absolute value of its eigenvalues. Moreover, if is positive semidefinite . If is symmetric and invertible, we also have
We use asymptotic notation somewhat informally, in order to illustrate our results with clean statements. We write or to indicate that is very small, and to say that is bounded by a universal constant.
Finally, we state for later use the Burkholder-Davis-Gundy inequality. Let denote a martingale with finite -th moments () and . Then:
Note that the first inequality above is the BDG inequality with optimal constant, and the second inequality follows from Minkowski’s inequality for the norm. We also observe that (7) implies a result for which are i.i.d. random variables:
Better inequalities are known in this case, but we will use (8) for simplicity.
3 The subgaussian lower tail
The goal of this section is to discuss and prove our main result.
Theorem 3.1 (Proven in Section 3.3)
Assume are i.i.d. random positive semidefinite matrices whose coordinates have bounded second moments. Define (this is an entrywise expectation) and
Let be such that for all . Then for any :
Notice that a particular case of this Theorem is when where are i.i.d.. Therefore Theorem 3.1 corresponds to our discussion in the Introduction. In what follows we discuss what our assumption (3) entails and when it is verified. We then discuss the main ideas in the proof, and finally move to the proof itself.
3.1 On the assumption
Let us recall that in the vector case the main assumption we need is that
for some , where and . Note that an inequality in the opposite direction always holds, thanks to Jensen’s inequality:
An obvious case where (9) holds is when are independent, have finite fourth moments and mean . A short calculation shows that we may take
Significantly, the same calculations also work when are four-wise independent; this will be interesting when considering compressed sensing-type applications (cf. Example 1 below). Changing to allows us to consider translations and linear transformations of .
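The short calculation for independent coordinates mentioned above can be checked numerically. The sketch below is an illustration under assumed notation, not the paper's own: `s[i]` stands for the second moment and `m[i]` for the fourth moment of the i-th coordinate, all coordinates have mean zero, and `moment_ratio` returns the ratio of the fourth moment of the marginal to the square of its second moment. The constant in the paper's assumption may be normalized differently (for instance as a square root of this ratio). If every coordinate satisfies `m[i] <= kappa * s[i]**2`, the ratio is at most `max(kappa, 3)` for every direction `v`.

```python
import random


def second_moment(v, s):
    """E<v, X>^2 for independent mean-zero coordinates with second moments s[i]."""
    return sum(vi * vi * si for vi, si in zip(v, s))


def fourth_moment(v, s, m):
    """E<v, X>^4 for independent mean-zero coordinates.

    s[i] = E X_i^2 and m[i] = E X_i^4. Only products in which every index
    appears an even number of times have nonzero mean, which leaves the
    diagonal terms plus 3 times the sum over ordered pairs i != j.
    """
    diag = sum(vi ** 4 * mi for vi, mi in zip(v, m))
    w = [vi * vi * si for vi, si in zip(v, s)]
    total = sum(w)
    cross = total * total - sum(wi * wi for wi in w)  # sum over ordered pairs i != j
    return diag + 3.0 * cross


def moment_ratio(v, s, m):
    """E<v, X>^4 divided by (E<v, X>^2)^2."""
    return fourth_moment(v, s, m) / second_moment(v, s) ** 2


# Hypothetical example: coordinates with kurtosis 9 and assorted variances.
rng = random.Random(0)
kappa = 9.0
s = [rng.uniform(0.5, 2.0) for _ in range(10)]
m = [kappa * si * si for si in s]
worst = max(moment_ratio([rng.gauss(0, 1) for _ in s], s, m) for _ in range(1000))
print(f"largest ratio over 1000 random directions: {worst:.3f} (bound: {max(kappa, 3.0)})")
```

For standard Gaussian coordinates (`s[i] = 1`, `m[i] = 3`) the ratio equals 3 in every direction, which matches the fourth moment of a one-dimensional Gaussian.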
These particular cases include many important examples, such as gaussian, subgaussian, logconcave vectors and their affine transformations. There are also many examples with unbounded moments. If we multiply by an independent scalar with
we just need to replace with . Interestingly, the upper tail of is quite sensitive to this kind of transformation. Even multiplying by a Gaussian random variable may result in an ensemble that does not obey the analogue of the main theorem (cf. the discussion in [SrivastavaVershynin2013, Section 1.8]).
3.2 Proof overview and a preliminary PAC Bayesian result
At first sight it may seem odd that we can obtain such strong concentration from finite moment assumptions. The key point here is that, for any , the expression
is a sum of random variables which are independent, identically distributed and non-negative. Such sums are well known to have subgaussian lower tails under weak assumptions; see e.g. Lemma B.2 below.
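For orientation, here is one standard way to see why non-negative summands have subgaussian lower tails. This is a textbook argument in assumed notation ($Z_1,\dots,Z_n$ i.i.d., $Z_i\ge 0$, $\mathbb{E}Z_1^2<\infty$); it is not necessarily the exact statement or proof of Lemma B.2.

```latex
% Uses e^{-x} \le 1 - x + x^2/2 for x \ge 0, then 1 + y \le e^{y}.
\mathbb{E}\,e^{-\lambda Z_1}
  \;\le\; 1 - \lambda\,\mathbb{E}Z_1 + \tfrac{\lambda^2}{2}\,\mathbb{E}Z_1^2
  \;\le\; \exp\!\Big(-\lambda\,\mathbb{E}Z_1 + \tfrac{\lambda^2}{2}\,\mathbb{E}Z_1^2\Big),
  \qquad \lambda \ge 0.

% Markov's inequality applied to e^{-\lambda \sum_i Z_i}, for t \ge 0:
\mathbb{P}\Big(\tfrac1n\sum_{i=1}^n Z_i \le \mathbb{E}Z_1 - t\Big)
  \;\le\; e^{\lambda n(\mathbb{E}Z_1 - t)}\,\big(\mathbb{E}\,e^{-\lambda Z_1}\big)^n
  \;\le\; \exp\!\Big(-\lambda n t + \tfrac{n\lambda^2}{2}\,\mathbb{E}Z_1^2\Big).

% Choosing \lambda = t / \mathbb{E}Z_1^2:
\mathbb{P}\Big(\tfrac1n\sum_{i=1}^n Z_i \le \mathbb{E}Z_1 - t\Big)
  \;\le\; \exp\!\Big(-\frac{n\,t^2}{2\,\mathbb{E}Z_1^2}\Big).
```

If in addition $\mathbb{E}Z_1^2\le \mathsf{h}^2(\mathbb{E}Z_1)^2$ for some constant $\mathsf{h}$ (a symbol assumed here), taking $t=\varepsilon\,\mathbb{E}Z_1$ gives the bound $\exp(-n\varepsilon^2/(2\mathsf{h}^2))$, with no upper-tail information needed.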
This fact may be used to show concentration of for any fixed . It is less obvious how to turn this into a uniform bound. The standard techniques for this, such as chaining, involve looking at a discretized subset of and moving from this finite set to the whole space. In our case this second step is problematic, because it requires upper bounds on , and we know that our assumptions are not strong enough to obtain this.
What we use instead is the so-called PAC Bayesian method [CatoniBook] for controlling empirical processes. At a very high level this method replaces chaining and union bounds with arguments based on the relative entropy. What this means in our case is that a “smoothened-out” version of the process (), where is averaged over a Gaussian measure, automatically enjoys very strong concentration properties. This implies that the original process is also well behaved as long as the effect of the smoothing can be shown to be negligible. Many of our ideas come from Audibert and Catoni [AudibertCatoni2011], who in turn credit Langford and Shawe-Taylor [LangfordShaweT2002] for the idea of Gaussian smoothing.
To make these ideas more definite we present a technical result that encapsulates the main ideas in our PAC Bayesian approach. This requires some conditions.
is a family of random variables defined on a common probability space . We assume that the map
is continuous for each . Given and a positive semidefinite , we let denote the Gaussian probability measure over with mean and covariance matrix . We will also assume that for all the integrals
are well defined and depend continuously on . We will use the notation to denote the integral of (which may also depend on other parameters) over the variable with the measure .
Proposition 3.1 (PAC Bayesian Proposition)
Assume the above setup, and also that is invertible and for all . Then for any ,
In the next subsection we will apply this to prove Theorem 3.1. Here is a brief overview: we will perform a change of coordinates under which . We will then define as
where will be chosen in terms of and the “other terms” will ensure that . Taking will result in
is a new term introduced by the “smoothing operator” . The choice will ensure that this term is small, and the “other terms” will also turn out to be manageable. The actual proof will be slightly complicated by the fact that we need to truncate the operator to ensure that is highly concentrated.
Proof: [of Proposition 3.1] As a preliminary step, we note that under our assumptions the map:
is measurable. This implies that the event in the statement of the proposition is indeed a measurable set.
To continue, recall the definition of Kullback-Leibler divergence (or relative entropy) for probability measures over a measurable space :
A well-known variational principle [LedouxConcentration, eqn. (5.13)] implies that for any measurable function :
We apply this when , , and . In this case it is well-known that the relative entropy of the two measures is . This implies:
To finish, we prove that:
But this follows from Markov’s inequality and Fubini’s Theorem:
because for any fixed .
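The two analytic facts used in the proof above can be sanity-checked numerically in one dimension. The sketch below rests on standard facts rather than on the paper's exact parametrization: for Gaussians with a common variance, the relative entropy of N(mean, var) with respect to N(0, var) is mean²/(2·var) (in d dimensions, half the quadratic form of the mean vector in the inverse covariance), and the variational principle says that the integral of h under the first measure is at most the relative entropy plus the logarithm of the integral of exp(h) under the second. All function names are hypothetical.

```python
import math


def gauss_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def integrate(f, lo=-30.0, hi=30.0, steps=60000):
    """Composite trapezoid rule on [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for k in range(1, steps):
        total += f(lo + k * h)
    return total * h


def log_ratio(x, mean, var):
    """log of the density ratio N(mean, var) / N(0, var) at x, simplified by hand."""
    return (2.0 * mean * x - mean * mean) / (2.0 * var)


def kl_closed_form(mean, var):
    """KL( N(mean, var) || N(0, var) ) in one dimension."""
    return mean * mean / (2.0 * var)


def kl_numeric(mean, var):
    """The same relative entropy, computed by numerical integration."""
    return integrate(lambda x: gauss_pdf(x, mean, var) * log_ratio(x, mean, var))


def variational_gap(h, mean, var):
    """KL + log E_0[exp(h)] - E_1[h]; the variational principle says this is >= 0."""
    e1_h = integrate(lambda x: gauss_pdf(x, mean, var) * h(x))
    e0_exp_h = integrate(lambda x: gauss_pdf(x, 0.0, var) * math.exp(h(x)))
    return kl_closed_form(mean, var) + math.log(e0_exp_h) - e1_h


print("closed form KL:", kl_closed_form(1.5, 0.7))
print("numeric KL:    ", kl_numeric(1.5, 0.7))
print("gap for h=sin: ", variational_gap(math.sin, 1.5, 0.7))
```

The gap vanishes when h is the log density ratio itself, which is why the principle is called variational: the relative entropy is the supremum over h of the difference between the two sides.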
3.3 Proof of the main result
Proof: [of Theorem 3.1] We will assume throughout the proof that is invertible. If that is not the case, we can restrict ourselves to the range of , which is isometric to for some , noting that and almost surely for any that is orthogonal to the range (this follows from for orthogonal to the range, combined with (5) above).
Granted invertibility, we may define:
and note that are i.i.d. positive semidefinite with . Moreover,
The goal of our proof is to show that, for any :
Replacing with above and using homogeneity reduces this goal to showing:
This is what we will show in the remainder of the proof.
Fix some and define (with hindsight) truncated operators
with the convention that this is simply if . We collect some estimates for later use.
Lemma 3.1 (Proven subsequently)
We have for all with
Fix . We will apply Proposition 3.1 with and
The continuity and integrability assumptions of the Proposition are trivial to check. The assumption follows from independence, which implies:
plus the fact that, for any non-negative, square-integrable random variable ,
which is the same as saying that, with probability , the following inequality holds for all with :
Let us now compute all the integrals with respect to that appear above, for with :
[Displayed equations (18) and (19); both steps use Lemma 3.1.]
We also need estimates for . Standard calculations with the normal distribution show that:
The first two terms inside the brackets are non-negative and, by Cauchy-Schwarz, the absolute value of the rightmost term is at most the sum of the other two. We deduce:
Taking expectations, applying Lemma 3.1 and recalling gives:
This holds for any choice of . Optimizing over this parameter shows that, with probability , we have the following inequality simultaneously for all with .
We now take care of the term between curly brackets in the RHS. This is precisely the moment when the truncation of is useful, as it allows for the use of Bennett’s concentration inequality. More specifically, note that the term under consideration is a sum of i.i.d. random variables that lie between and . Moreover, the variance of each term is at most by Lemma 3.1. We may use Bennett’s inequality to deduce that with probability :
Combining this with (21) implies that, for any , the following inequality holds with probability , simultaneously for all with :
This holds for any . Optimizing over gives:
To finish, we prove the second assertion. Fix some with norm one. We have
by Cauchy-Schwarz. Now note that
Moreover, by the previous estimate on ,
Combining the last three inequalities finishes the proof.
It is instructive to compare this proof with what one would obtain without truncation. In that case everything would go through except for the step where we apply Bennett’s inequality.
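For reference, one standard form of Bennett's inequality is recalled below, in notation assumed here ($W_1,\dots,W_n$ independent, centered, bounded above by $b$); the constants in the version applied in the proof may be arranged differently. The boundedness requirement is exactly what the truncation supplies.

```latex
% Assumptions: W_1,\dots,W_n independent, \mathbb{E}W_i = 0, W_i \le b almost surely,
% and \sigma^2 := \sum_{i=1}^n \mathbb{E}W_i^2 < \infty.
\mathbb{P}\Big(\sum_{i=1}^n W_i \ge t\Big)
  \;\le\; \exp\!\Big(-\frac{\sigma^2}{b^2}\,\phi\Big(\frac{b\,t}{\sigma^2}\Big)\Big),
  \qquad \phi(u) := (1+u)\ln(1+u) - u, \quad t \ge 0.

% Since \phi(u) \ge \frac{u^2}{2(1 + u/3)}, this implies the Bernstein-type bound
\mathbb{P}\Big(\sum_{i=1}^n W_i \ge t\Big)
  \;\le\; \exp\!\Big(-\frac{t^2}{2\,(\sigma^2 + b\,t/3)}\Big).
```

Without an almost sure upper bound $b$ on the summands, neither display is available, which is the step that fails in the untruncated argument.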
4 Ordinary least squares under random design
Linear regression with random design is a central problem in Statistics and Machine Learning. In it one is given data in the form of independent and identically distributed copies of a square-integrable pair , where is a vector of so-called covariates and is a response variable. The goal is to find a vector , which is a function of the data, which makes the square loss
as small as possible. In other words, one is trying to find a linear combination of the coordinates of that is as close as possible to in terms of mean-square error. The random design setting should be contrasted with the technically simpler case of fixed design, where the ’s are assumed fixed and all randomness is in the ’s. Results about this setting are not indicative about out-of-sample prediction, a crucial property in many tasks where least squares is routinely used, as well as in theoretical problems such as linear aggregation; see [AudibertCatoni2011] for further discussion.
The most basic method for minimizing from data – the so-called ordinary least squares (OLS) estimator – replaces the expectation in the definition of by an empirical average.
This estimator is not hard to study when is large, is much smaller than and a linear model is assumed:
Here we want to consider a completely model-free, non-parametric setting where no specific relationship between and is assumed. Moreover, we want to allow for large , the only condition being that should be small. This rules out using classical asymptotic theory (which is not quantitative) as well as Berry-Esseen-type bounds (which do not work for ; see [Bentkus2003] for the best known bounds).
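The OLS estimator described above, which minimizes the empirical average of the square loss, can be sketched in a few lines. This is a minimal illustration and not the paper's procedure for analysis: it assumes the empirical second-moment matrix is invertible, solves the normal equations by plain Gaussian elimination, and uses hypothetical names (`solve_linear_system`, `ols`) and hypothetical synthetic data.

```python
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("empirical second-moment matrix is (numerically) singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def ols(xs, ys):
    """Minimise the empirical square loss (1/n) * sum_i (y_i - <beta, x_i>)^2.

    The minimiser solves the normal equations
        [(1/n) sum_i x_i x_i^T] beta = (1/n) sum_i y_i x_i.
    """
    n, d = len(xs), len(xs[0])
    sigma_hat = [[sum(x[j] * x[k] for x in xs) / n for k in range(d)] for j in range(d)]
    moment_hat = [sum(y * x[j] for x, y in zip(xs, ys)) / n for j in range(d)]
    return solve_linear_system(sigma_hat, moment_hat)


# Hypothetical synthetic data: a linear signal plus small Gaussian noise.
rng = random.Random(0)
beta_true = [2.0, -1.0]
xs = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
ys = [sum(b * xi for b, xi in zip(beta_true, x)) + 0.1 * rng.gauss(0, 1) for x in xs]
print("OLS estimate:", ols(xs, ys))
```

The inverse of the empirical second-moment matrix is where a lower-tail bound matters: the estimator is stable as long as that matrix is not much smaller than its expectation.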
The theoretically optimal choice of that minimizes is simply a vector such that the coordinates of are -orthogonal to . This corresponds to the following generalization of (22).
Moreover, approximating the minimum loss corresponds to approximating itself in the following sense:
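In notation assumed here ($\ell(\beta)=\mathbb{E}(Y-\langle\beta,X\rangle)^2$, $\Sigma=\mathbb{E}[XX^T]$, and $\beta_{\min}$ for a minimizer), the two statements above take the following standard form. This is a sketch of the usual computation, not a quotation of the paper's displays.

```latex
% Orthogonality (normal equations):
\mathbb{E}\big[(Y - \langle \beta_{\min}, X\rangle)\,X\big] = 0
  \quad\Longleftrightarrow\quad
  \Sigma\,\beta_{\min} = \mathbb{E}[\,Y X\,].

% Excess loss: write Y - \langle\beta,X\rangle
%   = (Y - \langle\beta_{\min},X\rangle) - \langle\beta-\beta_{\min},X\rangle
% and expand the square; the cross term vanishes by orthogonality.
\ell(\beta) - \ell(\beta_{\min})
  \;=\; \mathbb{E}\,\langle \beta - \beta_{\min}, X\rangle^2
  \;=\; (\beta - \beta_{\min})^T\,\Sigma\,(\beta - \beta_{\min}).
```

In particular, the excess loss is a quadratic form in $\beta-\beta_{\min}$ weighted by $\Sigma$, so controlling the loss amounts to controlling the estimation error in the geometry induced by $\Sigma$.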
4.2 Our result, and previous work
Here is the precise statement we prove.
Theorem 4.1 (Proven in Section 4.3)
Theorem 4.1 implies
whenever and . This can be shown to be essentially sharp in the particular case of a linear model (22) with Gaussian noise, where OLS satisfies with positive probability, since is simply the variance of the noise in this case.
The proof of Theorem 4.1 consists of three steps. One is to use an explicit expression for OLS in order to express . Theorem 3.1 is used to prove that a matrix that appears in the expression for this difference has bounded norm. The third step is to control the remaining expression, which is a sum of i.i.d. random vectors that we analyze via Lemma 4.1 below.
Given the widespread use of OLS, it seems surprising that all finite-sample results for it prior to 2011 were either considerably weaker (e.g. did not bound directly) or required much stronger assumptions on the data generating mechanism; see [AudibertCatoni2009, Section 1] and [HsuKakadeZhang2012] for more details on previous results. In the last two years, Audibert and Catoni [AudibertCatoni2011] and Hsu et al. [HsuKakadeZhang2012] both proved results related to our own Theorem 4.1 (below). However, our result is less restrictive in important ways. Hsu et al. assumed i.i.d. subgaussian noise and bounded covariate vectors; moreover, they also need the condition , whereas our Theorem works for (assuming bounded in both cases). The conditions of Audibert and Catoni are weaker but they assume uniformly for some constant . It transpires from this brief discussion that Theorem 4.1 seems to be the first finite sample bound of optimal order that only assumes finitely many moments for and .
Hsu et al. [HsuKakadeZhang2012] also derive finite sample performance bounds for ridge regression, a regularized version of OLS with an extra term. Theorem 3.1 and Theorem 4.1 can be adapted to that setting. Audibert and Catoni [AudibertCatoni2011] also propose a “robust” least squares method based on a non-convex optimization problem, which we do not analyze here. It turns out, however, that this robust estimator depends on a quantity which is the same as our , so all computations in [AudibertCatoni2011, Section 3.2] are directly relevant to our setting.