Regularity Properties for Sparse Regression

Fan’s research was partially supported by NIH grants R01GM100474-04 and NIH R01-GM072611-10 and NSF grants DMS-1206464 and DMS-1406266. The bulk of the research was carried out while Edgar Dobriban was an undergraduate student at Princeton University.

Edgar Dobriban, Department of Statistics, Stanford University. Email: dobriban@stanford.edu

Jianqing Fan, Department of Operations Research and Financial Engineering, Princeton University. Email: jqfan@princeton.edu
Received: date / Accepted: date
Abstract

Statistical and machine learning theory has developed several conditions ensuring that popular estimators such as the Lasso or the Dantzig selector perform well in high-dimensional sparse regression, including the restricted eigenvalue, compatibility, and sensitivity properties. However, some of the central aspects of these conditions are not well understood. For instance, it is unknown if these conditions can be checked efficiently on any given data set. This is problematic, because they are at the core of the theory of sparse regression.

Here we provide a rigorous proof that these conditions are NP-hard to check. This shows that the conditions are computationally infeasible to verify, and raises some questions about their practical applications.

However, by taking an average-case perspective instead of the worst-case view of NP-hardness, we show that a particular condition, sensitivity, has certain desirable properties. This condition is weaker and more general than the others. We show that it holds with high probability in models where the parent population is well behaved, and that it is robust to certain data processing steps. These results are desirable, as they provide guidance about when the condition, and more generally the theory of sparse regression, may be relevant in the analysis of high-dimensional correlated observational data.

Keywords: high-dimensional statistics · sparse regression · restricted eigenvalue · sensitivity · computational complexity

MSC: 62J05 · 68Q17 · 62H12
journal: Communications in Mathematics and Statistics

1 Introduction

1.1 Prologue

Open up any recent paper on sparse linear regression – the model $y = X\beta + \varepsilon$, where $X$ is an $n \times p$ matrix of features, $p$ is possibly much larger than $n$, and most coordinates of $\beta$ are zero – and you are likely to find that the main result is of the form: “If the data matrix has the restricted eigenvalue/compatibility/sensitivity property, then our method will successfully estimate the unknown sparse parameter $\beta$, if the sample size is at least …”

In addition to the sparsity of the parameter, the key condition here is the regularity of the matrix of features, such as the restricted eigenvalue, compatibility, or sensitivity property. It states that every suitable submatrix of the feature matrix is “nearly orthogonal”. Such a property is crucial for the success of popular estimators like the Lasso and the Dantzig selector. However, these conditions are somewhat poorly understood. For instance, as the conditions are combinatorial, it is not known how to check them efficiently – in polynomial time – on any given data matrix. Without this knowledge, it is difficult to see whether or not the whole framework is relevant to any particular data analysis setting.

In this paper we seek a better understanding of these problems. We first establish that the most popular conditions for sparse regression – restricted eigenvalue, compatibility, and sensitivity – are all NP-hard to check. This implies that there is likely no efficient way to verify them for deterministic matrices, and raises some questions about their practical applications.

Next, we move away from the worst-case analysis entailed by NP-hardness, and consider an average-case, non-adversarial analysis. We show that the weakest of these conditions, sensitivity, has some desirable properties, including that it holds with high probability in well-behaved random design models, and that it is preserved under certain data processing operations.

1.2 Formal introduction

We now turn to a more formal and thorough introduction. The context of this paper is that high-dimensional data analysis is becoming commonplace in statistics and machine learning. Recent research shows that estimation of high-dimensional parameters may be possible if they are suitably sparse. For instance, in linear regression where most of the regression coefficients are zero, popular estimators such as the Lasso (chen_atomic_2001; tibshirani_regression_1996), SCAD (fan_li01), and the Dantzig selector (candes_dantzig_2007) can have small estimation error – as long as the matrix of covariates is sufficiently “regular”.

There is a large number of suitable regularity conditions, starting with the incoherence condition of Donoho and Huo (donoho_uncertainty_2001), followed by more sophisticated properties such as Candes and Tao’s restricted isometry property (“RIP”) (candes_decoding_2005), Bickel, Ritov and Tsybakov’s weaker and more general restricted eigenvalue (RE) condition (bickel_simultaneous_2009), and Gautier and Tsybakov’s even more general sensitivity properties (gautier_high-dimensional_2011), which also apply to instrumental variables regression.

While it is known that these properties lead to desirable guarantees on the performance of popular statistical methods, it is largely unknown whether they hold in practice. Even more, it is not known how to efficiently check if they hold for any given data set. Due to their combinatorial nature, it is thought that they may be computationally hard to verify (tao_open_2007; raskutti_restricted_2010; daspremont_testing_2011). The assumed difficulty of the computation has motivated convex relaxations for approximating the restricted isometry constant (daspremont_optimal_2008; lee_computing_2008) and sensitivity (gautier_high-dimensional_2011).

However, a rigorous proof is missing. A proof would be desirable for several reasons: (1) to show definitively that there is no computational “shortcut” to find their values, (2) to increase our understanding of why these conditions are difficult to check, and therefore (3) to guide the development of the future theory of sparse regression, based instead on efficiently verifiable conditions.

In this paper we provide such a proof. We show that checking any of the restricted eigenvalue, compatibility, and sensitivity properties for general data matrices is NP-hard (Theorem 3.1). This implies that there is no polynomial-time algorithm to verify them, under the widely believed assumption that $P \neq NP$. This raises some questions about the relevance of these conditions to practical data analysis.

We do not attempt to give a definitive answer here, and instead provide some positive results to enhance our understanding of these conditions. While the previous NP-hardness analysis referred to a worst-case scenario, we next take an average-case, non-adversarial perspective. Previous authors studied RIP, RE and compatibility from this perspective, as well as the relations between these conditions (van_de_geer_conditions_2009). We study sensitivity, for two reasons: First, it is more general than other regularity properties in terms of the correlation structures it can capture, and thus potentially applicable to more highly correlated data. Second, it applies not just to ordinary linear regression, but also to instrumental variables regression, which is relevant in applications such as economics.

Finding conditions under which sensitivity holds is valuable for several reasons: (1) since it is hard to check the condition computationally on any given data set, it is desirable to have some other way to ascertain it, even if that method is somewhat speculative, and (2) it helps us to compare the situations – and statistical models – where this condition is most suitable to the cases where the other conditions are applicable, and thus better understand its scope.

Hence, to increase our understanding of when sensitivity may be relevant, we perform a probabilistic – or “average case” – analysis, and consider a model where the data is randomly sampled from suitable distributions. In this case, we show that there is a natural “population” condition which is sufficient to ensure that sensitivity holds with high probability (Theorem 3.2). This complements the results for RIP (e.g., rauhut_compressed_2008; vershynin_introduction_2010) and RE (raskutti_restricted_2010; rudelson_reconstruction_2012). Further, we define an explicit s-comprehensive property (Definition 1) which implies sensitivity (Theorem 3.3). Such a condition is of interest because there are very few explicit examples where one can ascertain that sensitivity holds.

Finally, we show that the sensitivity property is preserved under several data processing steps that may be used in practice (Proposition 1). This shows that, while it is initially hard to ascertain this property, it may be somewhat robust to downstream data processing.

We introduce the problem in Section 2. Then, in Section 3 we present our results, with a discussion in Section 4, and provide the proofs in Section 5.

2 Setup

We introduce the problems and properties studied, followed by some notions from computational complexity.

2.1 Regression problems and estimators

Consider the linear model $y = X\beta + \varepsilon$, where $y$ is an $n \times 1$ response vector, $X$ is an $n \times p$ matrix of covariates, $\beta$ is a $p \times 1$ vector of coefficients, and $\varepsilon$ is an $n \times 1$ noise vector of independent entries. The observables are $y$ and $X$, where $X$ may be deterministic or random, and we want to estimate the fixed unknown $\beta$. Below we briefly present the models and estimation procedures that are required, and refer to the original publications for full details.

In the case when $p > n$, it is common to assume sparsity, viz., most of the coordinates of $\beta$ are zero. We do not know the locations of the nonzero coordinates. A popular estimator in this case is the Lasso (tibshirani_regression_1996; chen_atomic_2001), which for a given regularization parameter $\lambda > 0$ solves the optimization problem
$$\hat{\beta}_{\mathrm{Lasso}} \in \arg\min_{\beta \in \mathbb{R}^p} \left\{ \frac{1}{2n}\|y - X\beta\|_2^2 + \lambda \|\beta\|_1 \right\}.$$

The Dantzig selector is another estimator for this problem, which for a known noise level $\sigma$ and a tuning parameter $\mu > 0$ takes the form (candes_dantzig_2007)
$$\hat{\beta}_{\mathrm{DS}} \in \arg\min \left\{ \|\beta\|_1 : \ \beta \in \mathbb{R}^p, \ \|X^\top (y - X\beta)\|_\infty \le \mu\sigma \right\}.$$

See fan14 for a view from the sparsest solution in a high-confidence set and its generalizations.
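To make the two estimators above concrete, here is a minimal illustrative sketch (not the authors' code) using the cvxpy modeling library; the problem sizes and the tuning values `lam` and `mu_sigma` are arbitrary choices for the example.

```python
# Illustrative sketch of the Lasso and Dantzig selector (not the authors' code).
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, p, s = 50, 200, 5
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:s] = 1.0
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# Lasso: penalized least squares with an l1 penalty.
lam = 0.1
b = cp.Variable(p)
lasso = cp.Problem(cp.Minimize(cp.sum_squares(y - X @ b) / (2 * n) + lam * cp.norm1(b)))
lasso.solve()
beta_lasso = b.value

# Dantzig selector: minimize the l1 norm subject to a sup-norm constraint on
# the correlation of the residuals with the covariates.
mu_sigma = 0.5  # plays the role of mu * sigma in the display above
d = cp.Variable(p)
dantzig = cp.Problem(cp.Minimize(cp.norm1(d)),
                     [cp.norm_inf(X.T @ (y - X @ d)) <= mu_sigma])
dantzig.solve()
beta_dantzig = d.value
```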

In instrumental variables regression we start with the same linear model $y = X\beta + \varepsilon$. Now some covariates may be correlated with the noise $\varepsilon$, in which case they are called endogenous. Further, we have additional variables $z_1, \ldots, z_L$, called instruments, that are uncorrelated with the noise. In addition to $y$ and $X$, we observe $n$ independent samples of the instruments, which are arranged in the $n \times L$ matrix $Z$. In this setting, gautier_high-dimensional_2011 propose the Self-Tuning Instrumental Variables (STIV) estimator, a generalization of the Dantzig selector, which solves the optimization problem:

(1)

with the minimum taken over a polytope of candidate solutions. Here $D_X$ and $D_Z$ are diagonal normalization matrices, and $c$ is a constant whose choice is described in gautier_high-dimensional_2011. When $X$ is exogenous, we can take $Z = X$, which reduces to a Dantzig-type selector.

2.2 Regularity properties

The performance of the above estimators is characterized under certain “regularity properties”. These depend on a union of cones – called “the cone” for brevity – which is the set of vectors whose $\ell_1$ norm is concentrated on some $s$ coordinates:
$$C_{s,\alpha} \ =\ \left\{ \Delta \in \mathbb{R}^p : \ \|\Delta_{S^c}\|_1 \le \alpha \|\Delta_S\|_1 \ \text{ for some } S \subset \{1, \ldots, p\} \text{ with } |S| \le s \right\},$$
where $\Delta_S$ is the subvector of $\Delta$ with the entries from the subset $S$.

The properties discussed here depend on a triplet of parameters $(s, \alpha, \gamma)$, where $s$ is the sparsity size of the problem, $\alpha$ is the cone opening parameter in $C_{s,\alpha}$, and $\gamma$ is the lower bound. First, the Restricted Eigenvalue condition $RE(s, \alpha, \gamma)$ from (bickel_simultaneous_2009; koltchinskii_dantzig_2009) holds for a fixed $n \times p$ matrix $X$ if
$$\min_{S:\ |S| \le s} \ \min_{\Delta \ne 0:\ \|\Delta_{S^c}\|_1 \le \alpha \|\Delta_S\|_1} \ \frac{\|X\Delta\|_2}{\sqrt{n}\,\|\Delta_S\|_2} \ \ge\ \gamma.$$

We emphasize that this property, and the ones below, are defined for arbitrary deterministic matrices – but later we will consider them for randomly sampled data. bickel_simultaneous_2009 shows that if the normalized data matrix obeys $RE(s, \alpha, \gamma)$ and $\beta$ is $s$-sparse, then the estimation error is small, in the sense that both the estimation error $\|\hat{\beta} - \beta\|_1$ and the prediction error $\|X(\hat{\beta} - \beta)\|_2^2/n$ are controlled, for both the Dantzig and Lasso selectors. See fan14 for more general results and simpler arguments. The “cone opening” $\alpha$ required in the restricted eigenvalue property equals 1 for the Dantzig selector, and 3 for the Lasso.
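As an aside, the combinatorial nature of such conditions can be made concrete with a small sketch (ours, not the paper's): the function below brute-forces the closely related smallest $s$-sparse eigenvalue, $\min_{|S| = s} \lambda_{\min}(X_S^\top X_S / n)$, by enumerating all $\binom{p}{s}$ supports. This is not the RE constant itself, but it already illustrates why exhaustive verification does not scale.

```python
# Illustrative brute force over all size-s supports (ours, not from the paper);
# the binomial growth of itertools.combinations(range(p), s) is what makes
# exhaustive checking infeasible.
from itertools import combinations
import numpy as np

def smallest_sparse_eigenvalue(X, s):
    """min over all supports S with |S| = s of lambda_min(X_S^T X_S / n)."""
    n, p = X.shape
    best = np.inf
    for S in combinations(range(p), s):
        G = X[:, S].T @ X[:, S] / n
        best = min(best, np.linalg.eigvalsh(G)[0])
    return best

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 12))
print(smallest_sparse_eigenvalue(X, 3))  # already ~220 supports for p=12, s=3
```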

Next, the deterministic matrix $X$ obeys the compatibility condition with positive parameters $(s, \alpha, \gamma)$ (van_de_geer_deterministic_2007), if
$$\min_{S:\ |S| \le s} \ \min_{\Delta \ne 0:\ \|\Delta_{S^c}\|_1 \le \alpha \|\Delta_S\|_1} \ \frac{\sqrt{s}\,\|X\Delta\|_2}{\sqrt{n}\,\|\Delta_S\|_1} \ \ge\ \gamma.$$

The two conditions are very similar. The only difference is the change from the $\ell_2$ to the $\ell_1$ norm of $\Delta_S$ in the denominator, together with the factor $\sqrt{s}$. The inequality $\|\Delta_S\|_1 \le \sqrt{s}\,\|\Delta_S\|_2$ shows that the compatibility conditions are – formally at least – weaker than the RE assumptions. van_de_geer_deterministic_2007 provides an oracle inequality for the Lasso under the compatibility condition, see also van_de_geer_conditions_2009; buhlmann_statistics_2011.

Finally, for $\alpha > 0$, the deterministic matrices $X$ of size $n \times p$ and $Z$ of size $n \times L$ satisfy the sensitivity property with parameters $(s, \alpha, \gamma)$, if
$$\min_{\Delta \in C_{s,\alpha},\ \Delta \ne 0} \ \frac{s\,\|Z^\top X \Delta / n\|_\infty}{\|\Delta\|_1} \ \ge\ \gamma.$$

If $Z = X$, the definition is similar to the cone invertibility factors of ye_rate_2010. gautier_high-dimensional_2011 shows that sensitivity is weaker than the RE and compatibility conditions, meaning that in the special case when $Z = X$, the RE property of $X$ implies the sensitivity of the pair $(X, X)$. We note that the definition in gautier_high-dimensional_2011 differs in normalization, but that is not essential. The details are that we have an additional factor of $s$ (this is to ensure direct comparability to the other conditions), and we do not normalize by the diagonal matrices $D_X$ and $D_Z$ for simplicity (to avoid the dependencies introduced by this process). One can easily show that the un-normalized condition is sufficient for the good performance of an un-normalized version of the STIV estimator.
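Since the minimum in the definition runs over the whole cone, any particular vector in the cone only yields an upper bound on the sensitivity constant. The hypothetical helper below (our illustration, not part of the paper) exploits this: by randomly sampling cone vectors it can sometimes refute a claimed lower bound $\gamma$, but it can never certify it, which is consistent with the hardness result in Section 3.1.

```python
# Random search over the cone C_{s,alpha} (ours): every sampled Delta gives an
# upper bound on the sensitivity constant, so this can refute the property but
# never certify it.
import numpy as np

def sensitivity_upper_bound(X, Z, s, alpha, n_trials=2000, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Psi = Z.T @ X / n
    best = np.inf
    for _ in range(n_trials):
        S = rng.choice(p, size=s, replace=False)
        Delta = np.zeros(p)
        Delta[S] = rng.standard_normal(s)          # arbitrary values on the support S
        off = rng.standard_normal(p)
        off[S] = 0.0
        # scale off-support part so that ||Delta_{S^c}||_1 <= alpha * ||Delta_S||_1
        if np.abs(off).sum() > 0:
            off *= alpha * np.abs(Delta[S]).sum() / np.abs(off).sum() * rng.uniform()
        Delta += off
        best = min(best, s * np.max(np.abs(Psi @ Delta)) / np.abs(Delta).sum())
    return best
```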

Finally, we introduce incoherence and the restricted isometry property, which are not analyzed in this paper, but are instead used for illustration purposes. For a deterministic matrix $X$ whose columns $X_j$ are normalized to unit length, the mutual incoherence condition holds if the maximal pairwise correlation $\max_{j \ne k} |X_j^\top X_k|$ is at most a sufficiently small positive constant. Such a notion was defined in donoho_uncertainty_2001, and later used by bunea_sparsity_2007 to derive oracle inequalities for the Lasso.

A deterministic matrix $X$ obeys the restricted isometry property with parameters $s$ and $\delta \in (0, 1)$ if $(1 - \delta)\|v\|_2^2 \le \|Xv\|_2^2 \le (1 + \delta)\|v\|_2^2$ for all $s$-sparse vectors $v$ (candes_decoding_2005).

2.3 Notions from computational complexity

To state formally that the regularity conditions are hard to verify, we need some basic notions from computational complexity theory. Here problems are classified according to the computational resources – such as time and memory – needed to solve them (arora_computational_2009). A well-known complexity class is P, consisting of the problems decidable in polynomial time in the size of the input. For input encoded in $n$ bits, a yes or no answer must be found in time $O(n^k)$ for some fixed $k$. A larger class is NP, the decision problems for which already existing solutions can be verified in polynomial time. This is usually much easier than solving the question itself in polynomial time. For instance, the subset-sum problem: “Given an input set of integers, does there exist a non-empty subset with zero sum?” is in NP, since one can easily check a candidate solution – i.e., a subset of the given integers – to see if it indeed sums to zero. However, finding this subset seems harder, as simply enumerating all subsets is not a polynomial-time algorithm.
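As a toy illustration of this asymmetry (ours, not from the paper), checking a proposed zero-sum subset takes linear time, while the naive search below enumerates all subsets:

```python
# Verifying a candidate subset is easy (linear time); the naive search
# enumerates all 2^n subsets, which is not polynomial in n.
from itertools import chain, combinations

def verify(candidate):
    return len(candidate) > 0 and sum(candidate) == 0

def brute_force_subset_sum(nums):
    subsets = chain.from_iterable(combinations(nums, r) for r in range(1, len(nums) + 1))
    return next((set(s) for s in subsets if sum(s) == 0), None)

print(verify([3, -7, 4]))                     # True: easy to check
print(brute_force_subset_sum([3, 9, -7, 4]))  # a zero-sum subset found by exhaustive search
```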

Formally, the definition of NP requires that if the answer is yes, then there exists an easily verifiable proof. We have $P \subseteq NP$, since a polynomial-time solution is a certificate verifiable in polynomial time. However, it is a famous open problem to decide if P equals NP (cook_p_2000). It is widely believed in the complexity community that $P \neq NP$.

To compare the computational hardness of various problems, one can reduce known hard problems to the novel questions of interest, thereby demonstrating the difficulty of the novel problems. Specifically, a problem $A$ is polynomial-time reducible to a problem $B$ if an oracle solving $B$ – that is, an immediate solver for any instance of $B$ – can be queried once to give a polynomial-time algorithm to solve $A$. This is also known as a polynomial-time many-one reduction, strong reduction, or Karp reduction. A problem is NP-hard if every problem in NP reduces to it, namely it is at least as difficult as all other problems in NP. If one reduces a known NP-hard problem to a new question, this demonstrates the NP-hardness of the new problem.

If indeed $P \neq NP$, then there are no polynomial-time algorithms for NP-hard problems, implying that these are indeed computationally difficult.

3 Results

3.1 Computational Complexity

We now show that the common conditions needed for successful sparse estimation are unfortunately NP-hard to verify. These conditions appear prominently in the theory of high-dimensional statistics, large-scale machine learning, and compressed sensing. In compressed sensing, one can often choose, or “engineer”, the matrix of covariates such that it is as regular as possible – choosing for instance a matrix with iid Gaussian entries. It is well known that the restricted isometry property and its cousins will then hold with high probability.

In contrast, in statistics and machine learning, the data matrix is often observational – or “given to us” – in the application. In this case, it is not known a priori whether the matrix is regular, and one may be tempted to try and verify it. Unfortunately, our results show that this is hard. This distinction between compressed sensing and statistical data analysis was the main motivation for us to write this paper, after the computational difficulty of verifying the restricted isometry property had been established in the information theory literature (bandeira_certifying_2013). We think that researchers in high-dimensional statistics will benefit from the broader view which shows that not just RIP, but also RE, sensitivity, etc., are hard to check. Formally:

Theorem 3.1

Let $X$ be an $n \times p$ matrix, $Z$ an $n \times L$ matrix, $s \le p$ a sparsity size, and $\alpha, \gamma > 0$. It is NP-hard to decide any of the following problems:

  1. Does $X$ obey the restricted eigenvalue condition with parameters $(s, \alpha, \gamma)$?

  2. Does $X$ satisfy the compatibility condition with parameters $(s, \alpha, \gamma)$?

  3. Does the pair $(X, Z)$ have the sensitivity property with parameters $(s, \alpha, \gamma)$?

The proof of Theorem 3.1 is relegated to Section 5.1, and builds on the recent results that computing the spark and checking restricted isometry are NP-hard (bandeira_certifying_2013; tillmann_computational_2012).

3.2 Sensitivity for correlated designs

Since it is hard to check the properties in the worst case on a generic data matrix, it may be interesting to know that they hold at least under certain conditions. To understand when this may occur, we consider probabilistic models for the data, which amounts to an average-case analysis. This type of analysis is common in statistics. To this end, we first need to define a “population” version of sensitivity that refers to the parent population from which the data is sampled. Let $x$ and $z$ be $p$- and $L$-dimensional zero-mean random vectors and denote by $\Psi = \mathbb{E}\,[z x^\top]$ the $L \times p$ matrix of covariances with entries $\Psi_{kj} = \mathbb{E}\,[z_k x_j]$. We say that $\Psi$ satisfies the sensitivity property with parameters $(s, \alpha, \gamma)$ if
$$\min_{\Delta \in C_{s,\alpha},\ \Delta \ne 0} \ \frac{s\,\|\Psi \Delta\|_\infty}{\|\Delta\|_1} \ \ge\ \gamma.$$
One sees that we simply replaced $Z^\top X / n$ from the original definition with its expectation, $\Psi$.

It is then expected that for sufficiently large samples, random matrices with rows sampled independently from a population with the sensitivity property will inherit this condition. However, it is non-trivial to understand the required sample size, and its dependence on the moments of the random quantities. To state precisely the required probabilistic assumptions, we recall that the sub-gaussian norm of a random variable $w$ is defined as $\|w\|_{\psi_2} = \sup_{q \ge 1} q^{-1/2} (\mathbb{E}\,|w|^q)^{1/q}$ (see e.g., vershynin_introduction_2010). The sub-gaussian norm (or sub-gaussian constant) of a $p$-dimensional random vector $x$ is then defined as $\|x\|_{\psi_2} = \sup_{\|u\|_2 = 1} \|\langle x, u \rangle\|_{\psi_2}$.

Our result establishes sufficient conditions for sensitivity to hold for random matrices, under three broad conditions including sub-gaussianity:

Theorem 3.2

Let $x$ and $z$ be zero-mean random vectors, such that the matrix of population covariances $\Psi = \mathbb{E}\,[z x^\top]$ satisfies the sensitivity property with parameters $(s, \alpha, \gamma)$. Given $n$ iid samples arranged in the matrices $X$ and $Z$, and any $\varepsilon \in (0, \gamma)$, the matrix $Z^\top X / n$ has the sensitivity property with parameters $(s, \alpha, \gamma - \varepsilon)$, with high probability, in each of the following settings:

  1. If $x$ and $z$ are sub-gaussian with fixed constants, then sample sensitivity holds with high probability, provided that the sample size is at least of order $s^2 \log(pL)/\varepsilon^2$.

  2. If the entries of the vectors $x$ and $z$ are bounded by fixed constants, the same statement holds.

  3. If the entries have bounded moments of order $4m$, for some positive integer $m$ and all coordinates, then the sensitivity property holds with high probability, assuming a polynomially large sample size.

The constants involved do not depend on the dimensions $p$ and $L$, and the explicit probability bounds and sample size requirements are given in the proofs in Section 5.2.
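The driving quantity in the proof is the maximal entrywise deviation of $Z^\top X/n$ from $\Psi$. A small simulation sketch (ours; the covariance construction is arbitrary) shows this deviation shrinking at roughly the $\sqrt{\log(pL)/n}$ rate used in the sub-gaussian case:

```python
# Simulation sketch (ours, not from the paper): with Gaussian, hence
# sub-gaussian, data the entrywise deviation of Z^T X / n from Psi decays at
# roughly the sqrt(log(p*L)/n) rate.
import numpy as np

rng = np.random.default_rng(0)
p, L = 40, 60
A = rng.standard_normal((p + L, p + L))
Sigma = A @ A.T / (p + L)                 # joint covariance of (x, z)
Psi = Sigma[p:, :p]                       # population cross-covariance E[z x^T]

for n in [100, 1000, 10000]:
    W = rng.multivariate_normal(np.zeros(p + L), Sigma, size=n)
    X, Z = W[:, :p], W[:, p:]
    dev = np.max(np.abs(Z.T @ X / n - Psi))
    print(n, dev, np.sqrt(np.log(p * L) / n))
```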

The general statement of the theorem is applicable to the specific case where $Z = X$. Related results have been obtained for the RIP (rauhut_compressed_2008; rudelson_reconstruction_2012) and RE conditions (raskutti_restricted_2010; rudelson_reconstruction_2012). Our results complement theirs for the weaker notion of sensitivity.

Next, we aim to achieve a better understanding of the population sensitivity property by giving some explicit sufficient conditions under which it holds. Modeling covariance matrices in high dimensions is challenging, as there are few known explicit models. For instance, the examples given in raskutti_restricted_2010 to illustrate RE are quite limited, and include only diagonal, diagonal plus rank one, and ARMA covariance matrices. Therefore we think that the explicit conditions below are of interest, even if they are somewhat abstract.

We start from the case when $Z = X$, in which case $\Psi$ is the covariance matrix of $x$. In particular, if $\Psi$ equals the identity matrix, or is close to it, then $\Psi$ satisfies the sensitivity property. Inspired by this diagonal case, we introduce a more general condition.

Definition 1

The matrix $\Psi$ is called s-comprehensive if for any subset $S \subset \{1, \ldots, p\}$ of size $s$, and for each pattern of signs $\epsilon \in \{-1, +1\}^S$, there exists either a row $v$ of $\Psi$ such that $\mathrm{sign}(v_j) = \epsilon_j$ for $j \in S$, and $v_j = 0$ otherwise, or a row with $\mathrm{sign}(v_j) = -\epsilon_j$ for $j \in S$, and $v_j = 0$ otherwise.

In particular, when $L = p$, diagonal matrices with nonzero diagonal entries are 1-comprehensive. More generally, when $\Psi$ is $s$-comprehensive, we have by simple counting the inequality $L \ge 2^{s-1}\binom{p}{s}$, which shows that the number of instruments must be large for the $s$-comprehensive property to be applicable. In problems where there are many potential instruments, this may be reasonable. To go back to our main point, we show that an $s$-comprehensive covariance matrix satisfies the sensitivity property.
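For intuition, the definition can be checked mechanically on small matrices. The sketch below (our hypothetical helper, not part of the paper) enumerates all size-$s$ subsets and sign patterns; its cost already reflects the counting bound above.

```python
# Hypothetical checker for Definition 1 (ours): Psi (rows = instruments) is
# s-comprehensive if for every size-s subset S and sign pattern eps there is a
# row supported exactly on S whose signs match eps or -eps.
from itertools import combinations, product
import numpy as np

def is_s_comprehensive(Psi, s):
    L, p = Psi.shape
    for S in combinations(range(p), s):
        for eps in product([-1, 1], repeat=s):
            ok = False
            for row in Psi:
                support_ok = np.all(row[list(S)] != 0) and np.all(np.delete(row, S) == 0)
                if support_ok:
                    signs = np.sign(row[list(S)])
                    if np.all(signs == eps) or np.all(signs == -np.array(eps)):
                        ok = True
                        break
            if not ok:
                return False
    return True

print(is_s_comprehensive(np.eye(3), 1))   # True: diagonal matrices are 1-comprehensive
```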

Theorem 3.3

Suppose the matrix of covariances $\Psi$ is s-comprehensive, and that all nonzero entries in $\Psi$ have absolute value at least $\nu > 0$. Then $\Psi$ obeys the sensitivity property with parameters $(s, \alpha)$ and lower bound $\gamma = s\nu/(1 + \alpha)$.

The proof of Theorem 3.3 is found in Section 5.3. The theorem presents a trade-off between the number of instruments and their strength, by showing that with a larger subset size $s$ – and thus a larger number of instruments $L$ – a smaller minimum strength $\nu$ is required to achieve the same sensitivity lower bound $\gamma$.

Finally, to improve our understanding of the relationship between the various conditions, we now give several examples. They show that sensitivity is more general than the rest. The proofs of the following claims can be found in Section 5.4.

Example 1

If $\Psi$ is a diagonal matrix with positive entries $d_1, \ldots, d_p$, then the restricted isometry property holds only if $|d_j - 1|$ is small for all $j$. Restricted eigenvalue only requires that the $d_j$ are bounded away from zero; the same is required for compatibility. This example shows why restricted isometry is the most stringent requirement. Further, sensitivity holds even if a finite number of the $d_j$ go to zero at rate $1/s$. In this case, all other regularity conditions fail. This is an example where sensitivity holds under broader conditions than the others.

The next examples further delineate between the various properties.

Example 2

For the equal correlations model $\Sigma = (1 - \rho) I_p + \rho \mathbf{1}\mathbf{1}^\top$, with $0 < \rho < 1$, restricted isometry requires $\rho$ to be small. In contrast, restricted eigenvalue, compatibility, and sensitivity hold for any $\rho < 1$, and the resulting lower bound is of order $1 - \rho$ (see van_de_geer_conditions_2009; raskutti_restricted_2010).
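A quick numerical check (ours) of the population covariance in this example: the equal-correlation matrix $(1-\rho)I + \rho\mathbf{1}\mathbf{1}^\top$ has smallest eigenvalue $1 - \rho$, so RE-type quantities stay bounded away from zero for any fixed $\rho < 1$.

```python
# Quick numerical check (ours) that the equal-correlation matrix
# (1-rho) I + rho 11^T has smallest eigenvalue 1 - rho.
import numpy as np

p = 50
for rho in [0.1, 0.5, 0.9]:
    Sigma = (1 - rho) * np.eye(p) + rho * np.ones((p, p))
    print(rho, np.linalg.eigvalsh(Sigma)[0])   # prints 1 - rho (up to rounding)
```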

Example 3

If $\Sigma$ has diagonal entries equal to 1, a designated set of off-diagonal entries equal to a common nonzero value, and all other entries equal to zero, then compatibility and sensitivity hold as long as this common value is suitably bounded (Section 5.4). In such a case, however, the restricted eigenvalues can be of a much smaller order. This is an example where compatibility and sensitivity hold but the restricted eigenvalue condition fails.

3.3 Operations preserving regularity

In data analysis, one often processes data by normalization or feature merging. Normalization is performed to bring variables to the same scale. Features are merged via sparse linear combinations to reduce dimension and avoid multicollinearity. Our final result shows that sensitivity is preserved under the above operations, and even more general ones. This may be of interest in cases where downstream data processing is performed after an initial step where the regularity conditions are ascertained. Let $X$ and $Z$ be as above. First, note that sensitivity only depends on the inner products $Z^\top X / n$, therefore it is preserved under simultaneous orthogonal transformations of the samples, $X \mapsto QX$ and $Z \mapsto QZ$, for any $n \times n$ orthogonal matrix $Q$. The next result defines broader classes of transformations that preserve sensitivity. Admittedly the transformations we consider are abstract, but they include some concrete examples, and represent a simple first step to understanding what kind of data processing steps are “admissible” and do not destroy regularity. Furthermore, the result is very elementary, but the goal here is not technical sophistication, but rather increasing our understanding of the behavior of an important property. The precise statement is:

Proposition 1
  1. Let $A$ be a cone-preserving linear transformation on $\mathbb{R}^p$, such that for all $\Delta \in C_{s,\alpha}$ we have $A\Delta \in C_{s,\alpha}$, and let $\tilde{X} = XA$. Suppose further that $\|A\Delta\|_1 \ge c\,\|\Delta\|_1$ for all $\Delta$ in $C_{s,\alpha}$, for some $c > 0$. If $(X, Z)$ has the sensitivity property with parameters $(s, \alpha, \gamma)$, then $(\tilde{X}, Z)$ has sensitivity with parameters $(s, \alpha, c\gamma)$.

  2. Let $B$ be a linear transformation of $\mathbb{R}^L$ such that for all $v$, $\|B^\top v\|_\infty \ge c\,\|v\|_\infty$. If we transform $\tilde{Z} = ZB$, and $(X, Z)$ has the sensitivity property with lower bound $\gamma$, then $(X, \tilde{Z})$ has the same property with lower bound $c\gamma$.

One can check that normalization and feature merging on the matrix $X$ are special cases of the first class of “cone-preserving” transformations. For normalization, $A$ is the diagonal matrix of inverses of the lengths of the columns of $X$. Similarly, normalization of the matrix $Z$ is a special case of the second class of transformations. This shows that our definitions include some concrete, commonly performed data processing steps.
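A short sketch (ours; the matrices $Q$ and $A$ below are illustrative) of two concrete cases mentioned above: a simultaneous rotation of the samples leaves $Z^\top X$ unchanged, and column normalization of $X$ acts as a diagonal matrix on the right.

```python
# Sketch (ours) of two invariances: sample rotations preserve Z^T X exactly,
# and column normalization of X is right-multiplication by a diagonal matrix A.
import numpy as np

rng = np.random.default_rng(2)
n, p, L = 30, 8, 10
X = rng.standard_normal((n, p))
Z = rng.standard_normal((n, L))

Q, _ = np.linalg.qr(rng.standard_normal((n, n)))       # orthogonal matrix
print(np.allclose(Z.T @ X, (Q @ Z).T @ (Q @ X)))        # True: rotations preserve Z^T X

A = np.diag(1.0 / np.linalg.norm(X, axis=0))            # column normalization as a diagonal map
print(np.allclose(X @ A, X / np.linalg.norm(X, axis=0)))  # True
```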

4 Discussion

Our work raises further questions about the theoretical foundations of sparse linear models. What is a good condition to have at the core of the theory? The regularity properties discussed in this paper yield statistical performance guarantees for popular methods such as the Lasso and the Dantzig selector. However, they are not efficiently verifiable. In contrast, incoherence can be checked efficiently, but does not guarantee performance up to the optimal rate (buhlmann_statistics_2011). It may be of interest to investigate if there are intermediate conditions that achieve favorable trade-offs.

5 Proofs

5.1 Proof of Theorem 3.1

The spark of a matrix $X$, denoted $\mathrm{spark}(X)$, is the smallest number of linearly dependent columns. Our proof is a polynomial-time reduction from the NP-hard problem of computing the spark of a matrix (see bandeira_certifying_2013; tillmann_computational_2012 and references therein).

Lemma 1

Given an $n \times p$ matrix $X$ with integer entries, and a sparsity size $s$, it is NP-hard to decide if the spark of $X$ is at most $s$.

We also need the following technical lemma, which provides bounds on the singular values of matrices with bounded integer entries. For a matrix $M$, we denote by $\|M\|$ its operator norm, and by $M_S$ the submatrix of $M$ formed by the columns with indices in $S$.

Lemma 2

Let $X$ be an $n \times p$ matrix with integer entries, and denote $K = \max_{i,j} |X_{ij}|$. Then, we have $\|X\| \le \sqrt{np}\,K$. Further, if $\mathrm{spark}(X) > s$ for some $s \le p$, then for any subset $S$ with $|S| \le s$, we have:
$$\lambda_{\min}(X_S^\top X_S) \ \ge\ \frac{1}{(s\,n\,K^2)^{s-1}}.$$

Proof

The first claim follows from:
$$\|X\| \ \le\ \|X\|_F \ =\ \Big(\sum_{i,j} X_{ij}^2\Big)^{1/2} \ \le\ \sqrt{np}\,K.$$

For the second claim, let $X_S$ denote a submatrix of $X$ with an arbitrary index set $S$ of size at most $s$. Then $\mathrm{spark}(X) > s$ implies that $G = X_S^\top X_S$ is non-singular. Since the absolute values of the entries of $X$ lie in $\{0, 1, \ldots, K\}$, the entries of $G$ are integers with absolute values between 0 and $nK^2$. Moreover, since the non-negative and nonzero determinant of $G$ is an integer, it must be at least 1. Hence,
$$1 \ \le\ \det(G) \ \le\ \lambda_{\min}(G)\,\lambda_{\max}(G)^{|S| - 1} \ \le\ \lambda_{\min}(G)\,(s\,n\,K^2)^{|S| - 1}.$$

Rearranging, we get
$$\lambda_{\min}(X_S^\top X_S) \ \ge\ \frac{1}{(s\,n\,K^2)^{s-1}}.$$

In the middle inequality we have used $\lambda_{\max}(G) \le \mathrm{trace}(G) \le s\,n\,K^2$. This is the desired bound.

For the proof we need the notion of encoding length, which is the size in bits of an object. Thus, a positive integer $K$ has size of order $\log K$ bits. Hence the size of the matrix $X$ is at least $np + \log K$: at least one bit for each entry, and $\log K$ bits to represent the largest entry. To ensure that the reduction is polynomial-time, we need that the size in bits of the objects involved is polynomial in the size of the input $X$. As usual in computational complexity, the numbers here are rational (arora_computational_2009).

Proof of Theorem 3.1. It is enough to prove the result for the special case of $X$ with integer entries, since this statement is in fact stronger than the general case, which also includes rational entries. For each property and given sparsity size $s$, we will exhibit parameters $\alpha, \gamma$ of polynomial size in bits, such that:

  1. if $\mathrm{spark}(X) \le s$, then $X$ does not obey the regularity property with parameters $(s, \alpha, \gamma)$,

  2. if $\mathrm{spark}(X) > s$, then $X$ obeys the regularity property with parameters $(s, \alpha, \gamma)$.

Hence, any polynomial-time algorithm for deciding if the regularity property holds for $X$ can decide if $\mathrm{spark}(X) \le s$ with one call. Here it is crucial that $\alpha, \gamma$ are polynomial in the size of $X$, so that the whole reduction is polynomial in the size of $X$. Since deciding $\mathrm{spark}(X) \le s$ is NP-hard by Lemma 1, this shows the desired NP-hardness of checking the conditions. Now we provide the required parameters for each regularity condition. Similar ideas are used when comparing the conditions.

For the restricted eigenvalue condition, the first claim follows for any $\alpha > 0$ and any $\gamma > 0$. To see this, if the spark of $X$ is at most $s$, there is a nonzero $s$-sparse vector $v$ in the kernel of $X$, and $\|Xv\|_2 = 0$, where $S$ is any set containing the nonzero coordinates of $v$. This vector is clearly also in the cone $C_{s,\alpha}$, and so $X$ does not obey RE with parameters $(s, \alpha, \gamma)$.

For the second claim, note that if $\mathrm{spark}(X) > s$, then for each index set $S$ of size at most $s$, the submatrix $X_S$ has full column rank. This implies a nonzero lower bound on the RE constant of $X$. Indeed, consider a vector $\Delta$ in the cone, and assume specifically that $\|\Delta_{S^c}\|_1 \le \alpha\,\|\Delta_S\|_1$ for an index set $S$ of size at most $s$. Using the decomposition $X\Delta = X_S \Delta_S + X_{S^c}\Delta_{S^c}$, we have
$$\|X\Delta\|_2 \ \ge\ \|X_S \Delta_S\|_2 - \|X_{S^c}\Delta_{S^c}\|_2 \ \ge\ \lambda_{\min}(X_S^\top X_S)^{1/2}\,\|\Delta_S\|_2 \ -\ \|X\|\,\|\Delta_{S^c}\|_2.$$

Further, since $\Delta$ is in the cone, we have

$$\|\Delta_{S^c}\|_2 \ \le\ \|\Delta_{S^c}\|_1 \ \le\ \alpha\,\|\Delta_S\|_1 \ \le\ \alpha\sqrt{s}\,\|\Delta_S\|_2. \qquad (2)$$

Since $X_S^\top X_S$ is non-degenerate and integer-valued, we can use the bounds from Lemma 2. Consequently, with $K = \max_{i,j}|X_{ij}|$, we obtain
$$\|X\Delta\|_2 \ \ge\ \left[\,(s\,n\,K^2)^{-(s-1)/2} \ -\ \alpha\sqrt{s}\,\sqrt{np}\,K\,\right]\|\Delta_S\|_2.$$

By choosing, say, $\alpha$ small enough that the second term in the bracket is at most half of the first, and $\gamma$ equal to half of the first term divided by $\sqrt{n}$, we easily conclude after some computations that $X$ obeys RE with parameters $(s, \alpha, \gamma)$. Moreover, the size in bits of the parameters is polynomially related to that of $X$. Indeed, the size in bits of both parameters is of order $s \log(s\,n\,p\,K)$, and the size of $X$ is at least $np + \log K$, as discussed before the proof. Note that $s \le p$. This proves the claim.

The argument for the compatibility conditions is identical, and therefore omitted.

Finally, for the sensitivity property, we in fact show that the subproblem where $Z = X$ is NP-hard, thus the full problem is also clearly NP-hard. The first condition is again satisfied for all $\alpha > 0$ and $\gamma > 0$. Indeed, if the spark of $X$ is at most $s$, there is a nonzero $s$-sparse vector $v$ in its kernel, and thus $\|X^\top X v\|_\infty = 0$.

For the second condition, we note that $\|X^\top X\Delta\|_\infty \ge \|(X^\top X\Delta)_S\|_\infty \ge \|X_S^\top X\Delta\|_2/\sqrt{s}$. For $\Delta$ in the cone, $\|X_S^\top X\Delta\|_2 \ge \|X_S^\top X_S \Delta_S\|_2 - \|X_S^\top X_{S^c}\Delta_{S^c}\|_2$, and hence
$$\|X_S^\top X\Delta\|_2 \ \ge\ \lambda_{\min}(X_S^\top X_S)\,\|\Delta_S\|_2 \ -\ \|X\|^2\,\|\Delta_{S^c}\|_2.$$

Combination of the last two results gives
$$\|X^\top X\Delta\|_\infty \ \ge\ \frac{1}{\sqrt{s}}\left[\lambda_{\min}(X_S^\top X_S)\,\|\Delta_S\|_2 \ -\ \|X\|^2\,\|\Delta_{S^c}\|_2\right].$$

Finally, since $\|\Delta\|_1 \le (1+\alpha)\|\Delta_S\|_1 \le (1+\alpha)\sqrt{s}\,\|\Delta_S\|_2$, we have $\|\Delta_S\|_2 \ge \|\Delta\|_1/((1+\alpha)\sqrt{s})$, and as $\Delta$ is in the cone, $\|\Delta_{S^c}\|_2 \le \alpha\sqrt{s}\,\|\Delta_S\|_2$, by inequality (2). Therefore,
$$\|X^\top X\Delta\|_\infty \ \ge\ \frac{1}{\sqrt{s}}\left[\lambda_{\min}(X_S^\top X_S) \ -\ \alpha\sqrt{s}\,\|X\|^2\right]\|\Delta_S\|_2.$$

Hence we have essentially reduced to the restricted eigenvalue case. From the proof of that case, a suitable choice of $\alpha$ makes the bracket at least half of $\lambda_{\min}(X_S^\top X_S)$, and hence for this $\alpha$ we also have a positive lower bound on $s\,\|X^\top X\Delta/n\|_\infty/\|\Delta\|_1$, where we have applied a number of coarse bounds. Thus $(X, X)$ obeys the sensitivity property with the parameters $(s, \alpha)$ and a suitable positive rational $\gamma$. As in the previous case, the size in bits of these parameters is polynomial in the size in bits of $X$. This proves the correctness of the reduction for the sensitivity property, and completes the proof.

5.2 Proof of Theorem 3.2

We first establish some large deviation inequalities for random inner products, then finish the proofs directly by a union bound. We discuss the three probabilistic settings one by one.

5.2.1 Sub-gaussian variables

Lemma 3 (Deviation of Inner Products for Sub-gaussians)

Let $x$ and $z$ be zero-mean sub-gaussian random variables, with sub-gaussian norms $\|x\|_{\psi_2}$ and $\|z\|_{\psi_2}$, respectively. Then, given $n$ iid samples $(x_i, z_i)$ of $x$ and $z$, the sample covariance satisfies the tail bound:
$$\mathbb{P}\left( \left| \frac{1}{n}\sum_{i=1}^n x_i z_i - \mathbb{E}\,[x z] \right| \ \ge\ t \right) \ \le\ 2\exp\left[ -c\,n\,\min\left( \frac{t^2}{K^2},\ \frac{t}{K} \right) \right],$$

where $K = 4\,\|x\|_{\psi_2}\,\|z\|_{\psi_2}$ and $c > 0$ is an absolute constant.

Proof

We use the Bernstein-type inequality in Corollary 5.17 from vershynin_introduction_2010. Recalling that the sub-exponential norm of a random variable $w$ is $\|w\|_{\psi_1} = \sup_{q \ge 1} q^{-1}(\mathbb{E}\,|w|^q)^{1/q}$, we need to bound the sub-exponential norms of the products $x_i z_i$. We show that if $x, z$ are sub-gaussian, then $xz$ has sub-exponential norm bounded by

$$\|x z\|_{\psi_1} \ \le\ 2\,\|x\|_{\psi_2}\,\|z\|_{\psi_2}. \qquad (3)$$

Indeed, by the Cauchy-Schwarz inequality $\mathbb{E}\,|x z|^q \le (\mathbb{E}\,|x|^{2q})^{1/2}(\mathbb{E}\,|z|^{2q})^{1/2}$, hence $(\mathbb{E}\,|x z|^q)^{1/q} \le 2q\,\|x\|_{\psi_2}\,\|z\|_{\psi_2}$. Taking the supremum over $q \ge 1$ leads to (3).

The $x_i z_i$ are iid random variables, and their sub-exponential norm is bounded as in (3). Further, by definition $|\mathbb{E}\,[x z]| \le \|x z\|_{\psi_1}$, hence the sub-exponential norm of the centered variable $x_i z_i - \mathbb{E}\,[x z]$ is at most $2\,\|x z\|_{\psi_1}$. The main result then follows by a direct application of Bernstein’s inequality, see Corollary 5.17 from vershynin_introduction_2010.

With these preparations, we now prove Theorem 3.2 for the sub-gaussian case. By a union bound over the entries of the $L \times p$ matrix $Z^\top X/n - \Psi$,
$$\mathbb{P}\left( \max_{k,j} \left| (Z^\top X/n - \Psi)_{kj} \right| \ \ge\ t \right) \ \le\ \sum_{k,j} \mathbb{P}\left( \left| (Z^\top X/n - \Psi)_{kj} \right| \ \ge\ t \right).$$

By Lemma 3 each probability is bounded by a term of the form $2\exp[-c\,n\min(t^2/K_{kj}^2, t/K_{kj})]$, where $K_{kj}$ varies with the entry $(k, j)$. The largest of these bounds corresponds to the largest of the $K_{kj}$-s. Hence the $K$ in the largest term is $4\max_{k,j}\|z_k\|_{\psi_2}\|x_j\|_{\psi_2}$. By the definition of the sub-gaussian norm, this is at most $4\,\|z\|_{\psi_2}\|x\|_{\psi_2}$, where the $x$ and $z$ are now the $p$- and $L$-dimensional vectors, respectively. Therefore we have the uniform bound

$$\mathbb{P}\left( \max_{k,j} \left| (Z^\top X/n - \Psi)_{kj} \right| \ \ge\ t \right) \ \le\ 2\,pL\,\exp\left[ -c\,n\,\min\left( \frac{t^2}{K^2},\ \frac{t}{K} \right) \right], \qquad (4)$$

with $K = 4\,\|x\|_{\psi_2}\,\|z\|_{\psi_2}$.

We choose $t$ so that the right-hand side of (4) equals the prescribed error probability. Since we can assume $t \le K$ (the deviation level of interest is small), the relevant term in the minimum is the one quadratic in $t$: the total probability of error is $2\,pL\,\exp(-c\,n\,t^2/K^2)$. From now on, we will work on the high-probability event that $\max_{k,j} |(Z^\top X/n - \Psi)_{kj}| \le t$.

For any vector $\Delta$, $\|(Z^\top X/n)\Delta\|_\infty \ge \|\Psi\Delta\|_\infty - \|(Z^\top X/n - \Psi)\Delta\|_\infty$, and each entry of $(Z^\top X/n - \Psi)\Delta$ is at most $\max_{k,j}|(Z^\top X/n - \Psi)_{kj}|\cdot\|\Delta\|_1$ in absolute value. With high probability it therefore holds uniformly for all $\Delta$ that:

$$\|(Z^\top X/n)\Delta\|_\infty \ \ge\ \|\Psi\Delta\|_\infty \ -\ t\,\|\Delta\|_1, \qquad (5)$$

for the constant $t$ chosen above.

For vectors in the cone, we bound the $\ell_1$ norm by the $\ell_1$ norm on the dominant support, $\|\Delta\|_1 \le (1+\alpha)\|\Delta_S\|_1$, in the usual way, to get a term depending on $s$ rather than on all $p$ coordinates:

$$\|\Delta\|_1 \ \le\ (1+\alpha)\,\|\Delta_S\|_1 \ \le\ (1+\alpha)\sqrt{s}\,\|\Delta_S\|_2. \qquad (6)$$

Introducing this into (5) and dividing by $\|\Delta\|_1$ gives, with high probability, over all $\Delta$ in the cone:
$$\frac{s\,\|(Z^\top X/n)\Delta\|_\infty}{\|\Delta\|_1} \ \ge\ \frac{s\,\|\Psi\Delta\|_\infty}{\|\Delta\|_1} \ -\ s\,t.$$

If we choose $t$ such that $s\,t \le \varepsilon$, then the second term will be at most $\varepsilon$. Further, since $\Psi$ obeys the sensitivity assumption, the first term will be at least $\gamma$. This shows that $(X, Z)$ satisfies the sensitivity assumption with constant $\gamma - \varepsilon$ with high probability, and finishes the proof. To summarize, it suffices if the sample size is at least

$$n \ \ge\ \frac{C\,s^2\,K^2\,\log(pL)}{\varepsilon^2}, \qquad (7)$$

for a sufficiently large constant $C$.

5.2.2 Bounded variables

If the components of the vectors are bounded, then essentially the same proof goes through. The sub-exponential norm of $x z$ is bounded – by a different argument – because $|x z|$ is bounded by the product of the bounds on $|x|$ and $|z|$, hence so is $\|x z\|_{\psi_1}$. Hence Lemma 3 holds with the same proof, where now the value of $K$ is different. The rest of the proof only relies on Lemma 3, so it goes through unchanged.

5.2.3 Variables with bounded moments

For variates with bounded moments, we also need a large deviation inequality for inner products. The general flow of the argument is classical, and relies on the Markov inequality and a moment-of-sums computation (e.g., petrov_limit_1995). The result is a generalization of a lemma used in covariance matrix estimation (ravikumar_high-dimensional_2011), and our proof is shorter.

Lemma 4 (Deviation for Bounded Moments - Khintchine-Rosenthal)

Let $x$ and $z$ be zero-mean random variables, and $m$ a positive integer, such that $\mathbb{E}\,|x|^{4m} < \infty$ and $\mathbb{E}\,|z|^{4m} < \infty$. Given $n$ iid samples from $x$ and $z$, the sample covariance satisfies the tail bound:
$$\mathbb{P}\left( \left| \frac{1}{n}\sum_{i=1}^n x_i z_i - \mathbb{E}\,[x z] \right| \ \ge\ t \right) \ \le\ \frac{(4m)^{2m}\,\left(\mathbb{E}\,|x|^{4m}\,\mathbb{E}\,|z|^{4m}\right)^{1/2}}{n^m\,t^{2m}}.$$

Proof

Let $w_i = x_i z_i - \mathbb{E}\,[x z]$, and $S_n = \sum_{i=1}^n w_i$. By the Markov inequality, we have
$$\mathbb{P}\left( \left| \frac{S_n}{n} \right| \ \ge\ t \right) \ \le\ \frac{\mathbb{E}\,S_n^{2m}}{(n t)^{2m}}.$$

We now bound the $2m$-th moment of the sum using a type of classical argument, often referred to as Khintchine’s or Rosenthal’s inequality. We can write, recalling that $2m$ is even,

$$\mathbb{E}\,S_n^{2m} \ =\ \sum_{i_1, \ldots, i_{2m}} \mathbb{E}\left[ w_{i_1} w_{i_2} \cdots w_{i_{2m}} \right]. \qquad (8)$$

By independence of the $w_i$, each expectation factors over the distinct indices. As $\mathbb{E}\,[w_i] = 0$, the summands for which there is a singleton index vanish. For the remaining terms, we bound each factor $\mathbb{E}\,|w_i|^{k}$ by $(\mathbb{E}\,|w_i|^{2m})^{k/(2m)}$, by Jensen’s inequality for $k \le 2m$. So each term is bounded by $\mathbb{E}\,|w_1|^{2m}$.

Hence, each nonzero term in (8) is uniformly bounded. We count the sequences of non-negative integer multiplicities that sum to $2m$, and such that if some multiplicity is positive, then it is at least 2. Thus, there are at most $m$ nonzero elements. This shows that the number of such terms is not more than the number of ways to choose $m$ places out of $n$, multiplied by the number of ways to distribute the $2m$ factors among those places, which can be bounded by $n^m\,(2m)^{2m}$. Thus, we have proved that
$$\mathbb{E}\,S_n^{2m} \ \le\ n^m\,(2m)^{2m}\,\mathbb{E}\,|w_1|^{2m}.$$

We can make this even more explicit by the Minkowski and Jensen inequalities: $(\mathbb{E}\,|w_1|^{2m})^{1/(2m)} \le (\mathbb{E}\,|x z|^{2m})^{1/(2m)} + |\mathbb{E}\,[x z]| \le 2\,(\mathbb{E}\,|x z|^{2m})^{1/(2m)}$. Combining this with the Cauchy-Schwarz bound $\mathbb{E}\,|x z|^{2m} \le (\mathbb{E}\,|x|^{4m}\,\mathbb{E}\,|z|^{4m})^{1/2}$ and the Markov inequality above leads to the desired bound
$$\mathbb{P}\left( \left| \frac{S_n}{n} \right| \ \ge\ t \right) \ \le\ \frac{(4m)^{2m}\,\left(\mathbb{E}\,|x|^{4m}\,\mathbb{E}\,|z|^{4m}\right)^{1/2}}{n^m\,t^{2m}}.$$

To prove Theorem 3.2, we note that by a union bound, the probability that $\max_{k,j} |(Z^\top X/n - \Psi)_{kj}| \ge t$ is at most $pL$ times the bound from Lemma 4. Since $m$ is fixed, for simplicity of notation, we can denote the resulting constant by $C(m)$. Choosing $t = \varepsilon/s$, the above probability is at most $C(m)\,pL\,s^{2m}/(n^m \varepsilon^{2m})$.

The bound