
# ExSIS: Extended Sure Independence Screening for Ultrahigh-dimensional Linear Models

Talal Ahmed and Waheed U. Bajwa. This work is supported in part by the National Science Foundation under awards CCF-1525276 and CCF-1453073, and by the Army Research Office under award W911NF-14-1-0295. The authors are with the Department of Electrical and Computer Engineering, Rutgers, The State University of New Jersey, 94 Brett Road, Piscataway, NJ 08854, USA. (Emails: talal.ahmed@rutgers.edu and waheed.bajwa@rutgers.edu).
###### Abstract

Statistical inference can be computationally prohibitive in ultrahigh-dimensional linear models. Correlation-based variable screening, in which one leverages marginal correlations for removal of irrelevant variables from the model prior to statistical inference, can be used to overcome this challenge. Prior works on correlation-based variable screening either impose strong statistical priors on the linear model or assume specific post-screening inference methods. This paper first extends the analysis of correlation-based variable screening to arbitrary linear models and post-screening inference techniques. In particular, (i) it shows that a condition—termed the screening condition—is sufficient for successful correlation-based screening of linear models, and (ii) it provides insights into the dependence of marginal correlation-based screening on different problem parameters. Numerical experiments confirm that these insights are not mere artifacts of analysis; rather, they are reflective of the challenges associated with marginal correlation-based variable screening. Second, the paper explicitly derives the screening condition for two families of linear models, namely, sub-Gaussian linear models and arbitrary (random or deterministic) linear models. In the process, it establishes that—under appropriate conditions—it is possible to reduce the dimension of an ultrahigh-dimensional, arbitrary linear model to almost the sample size even when the number of active variables scales almost linearly with the sample size.

## I Introduction

The ordinary linear model $y = X\beta + \eta$, despite its apparent simplicity, has been the bedrock of signal processing, statistics, and machine learning for decades. The last decade, however, has witnessed a marked transformation of this model: instead of the classical low-dimensional setting in which the dimension, $n$, of $y$ (henceforth, referred to as the sample size) exceeds the dimension, $p$, of $\beta$ (henceforth, referred to as the number of features/predictors/variables), we are increasingly having to operate in the high-dimensional setting in which the number of variables far exceeds the sample size (i.e., $p \gg n$). While the high-dimensional setting should ordinarily lead to ill-posed problems, the principle of parsimony—which states that only a small number of variables typically affect the response $y$—helps obtain unique solutions to inference problems based on high-dimensional linear models.

Our focus in this paper is on ultrahigh-dimensional linear models, in which the number of variables can scale exponentially with the sample size: $\log p = O(n^\delta)$ for some $\delta \in (0, 1)$ (recall Landau's big-$O$ notation: $f(n) = O(g(n))$ if $\limsup_{n \to \infty} f(n)/g(n) < \infty$, and $f(n) = \Omega(g(n))$ if $g(n) = O(f(n))$). Such linear models are increasingly becoming common in application areas ranging from genomics [1, 2, 3, 4] and proteomics [5, 6, 7] to sentiment analysis [8, 9, 10] and hyperspectral imaging [11, 12, 13]. While there exist a number of techniques in the literature—such as forward selection/matching pursuit, backward elimination [14], least absolute shrinkage and selection operator (LASSO) [15], elastic net [16], smoothly clipped absolute deviation (SCAD) [17], bridge regression [18, 19], adaptive LASSO [20], group LASSO [21], and Dantzig selector [22]—that can be employed for inference from high-dimensional linear models, all these techniques have super-linear (in the number of variables $p$) computational complexity. In the ultrahigh-dimensional setting, therefore, use of the aforementioned methods for statistical inference can easily become computationally prohibitive. Variable selection-based dimensionality reduction, commonly referred to as variable screening, has been put forth as a practical means of overcoming this curse of dimensionality [23]: since only a small number of (independent) variables actually contribute to the response (dependent variable) in the ultrahigh-dimensional setting, one can first—in principle—discard most of the variables (the screening step) and then carry out inference on a relatively low-dimensional linear model using any one of the sparsity-promoting techniques. There are two main challenges that arise in the context of variable screening in ultrahigh-dimensional linear models. First, the screening algorithm should have low computational complexity (ideally, $O(np)$). Second, the screening algorithm should be accompanied with mathematical guarantees that ensure the reduced linear model contains all relevant variables that affect the response.
Our goal in this paper is to revisit one of the simplest screening algorithms, which uses marginal correlations between the variables and the response for screening purposes [24, 25], and provide a comprehensive theoretical understanding of its screening performance for arbitrary (random or deterministic) ultrahigh-dimensional linear models.

### I-A Relationship to Prior Work

Researchers have long intuited that the (absolute) marginal correlation $|X_i^\top y|$ is a strong indicator of whether the $i$-th variable contributes to the response variable. Indeed, methods such as stepwise forward regression are based on this very intuition. It is only recently, however, that we have obtained a rigorous understanding of the role of marginal correlations in variable screening. One of the earliest screening works in this regard that is agnostic to the choice of the subsequent inference techniques is termed sure independence screening (SIS) [26]. SIS is based on simple thresholding of marginal correlations and satisfies the so-called sure screening property—which guarantees that all important variables survive the screening stage with high probability—for the case of normally distributed variables. An iterative variant of SIS, termed ISIS, is also discussed in [26], while [27] presents variants of SIS and ISIS that can lead to reduced false selection rates of the screening stage. Extensions of SIS to generalized linear models are discussed in [27, 28], while its generalizations for semi-parametric (Cox) models and non-parametric models are presented in [29, 30] and [31, 32], respectively.

The marginal correlation can be considered an empirical measure of the Pearson correlation coefficient, which is a natural choice for discovering linear relations between the independent variables and the response. In order to perform ultrahigh-dimensional variable screening in the presence of non-linear relations between the independent variables and the response and/or heavy-tailed variables, [33] and [34] have put forth screening using the generalized (empirical) correlation and the Kendall rank correlation, respectively.

The defining characteristic of the works referenced above is that they are agnostic to the inference technique that follows the screening stage. In recent years, screening methods have also been proposed for specific optimization-based inference techniques. To this end, [35] formulates a marginal correlation-based screening method, termed SAFE, for the LASSO problem and shows that SAFE results in a zero false selection rate. In [36], the so-called strong rules for variable screening in LASSO-type problems are proposed; these rules are still based on marginal correlations, and they result in the discarding of far more variables than the SAFE method. The screening tests of [35, 36] for the LASSO problem are further improved in [37, 38, 39] by analyzing the dual of the LASSO problem. We refer the reader to [40] for an excellent review of these different screening tests for LASSO-type problems.

Notwithstanding these prior works, we have holes in our understanding of variable screening in ultrahigh-dimensional linear models. Works such as [35, 36, 37, 38, 39] necessitate the use of LASSO-type inference techniques after the screening stage. In addition, these works do not help us understand the relationship between the problem parameters and the dimensions of the reduced model. Stated differently, it is difficult to a priori quantify the computational savings associated with the screening tests proposed in [35, 36, 37, 38, 39]. Similar to [26, 27, 33, 34], and in contrast to [35, 36, 37, 38, 39], our focus in this paper is on screening that is agnostic to the post-screening inference technique. To this end, [33] lacks a rigorous theoretical understanding of variable screening using the generalized correlation. While [26, 27, 34] overcome this shortcoming of [33], these works have two major limitations. First, their results are derived under the assumption of restrictive statistical priors on the linear model (e.g., normally distributed independent variables). In many applications, however, it can be a challenge to ascertain the distribution of the independent variables. Second, the analyses in [26, 27, 34] assume the variance of the response variable to be bounded by a constant; this assumption, in turn, imposes the condition $\|\beta\|_2 = O(1)$. In contrast, defining $\beta_{\min}$ as the smallest non-zero entry of $\beta$ in magnitude, we establish in the sequel that the ratio $\beta_{\min}/\|\beta\|_2$ (and not $\|\beta\|_2$) directly influences the performance of marginal correlation-based screening procedures.

### I-B Our Contributions

Our focus in this paper is on marginal correlation-based screening of ultrahigh-dimensional linear models that is agnostic to the post-screening inference technique. To this end, we provide an extended analysis of the thresholding-based SIS procedure of [26]. The resulting screening procedure, which we term extended sure independence screening (ExSIS), provides new insights into marginal correlation-based screening of arbitrary (random or deterministic) ultrahigh-dimensional linear models. Specifically, we first provide a simple, distribution-agnostic sufficient condition—termed the screening condition—for (marginal correlation-based) screening of linear models. This sufficient condition, which succinctly captures joint interactions among both the active and the inactive variables, is then leveraged to explicitly characterize the performance of ExSIS as a function of various problem parameters, including the noise variance, the ratio $\beta_{\min}/\|\beta\|_2$, and the model sparsity. The numerical experiments reported at the end of this paper confirm that the dependencies highlighted in this screening result are reflective of the actual challenges associated with marginal correlation-based screening and are not mere artifacts of our analysis.

Next, despite the theoretical usefulness of the screening condition, it cannot be explicitly verified in polynomial time for any given linear model. This is reminiscent of related conditions such as the incoherence condition [41], the irrepresentable condition [42], the restricted isometry property [43], and the restricted eigenvalue condition [44] studied in the literature on high-dimensional linear models. In order to overcome this limitation of the screening condition, we explicitly derive it for two families of linear models. The first family corresponds to sub-Gaussian linear models, in which the independent variables are independently drawn from (possibly different) sub-Gaussian distributions. We show that the ExSIS results for this family of linear models generalize the SIS results derived in [26] for normally distributed linear models. The second family corresponds to arbitrary (random or deterministic) linear models in which the (empirical) correlations between independent variables satisfy certain polynomial-time verifiable conditions. The ExSIS results for this family of linear models establish that, under appropriate conditions, it is possible to reduce the dimension of an ultrahigh-dimensional linear model to almost the sample size even when the number of active variables scales almost linearly with the sample size. This, to the best of our knowledge, is the first screening result that provides such explicit and optimistic guarantees without imposing a statistical prior on the distribution of the independent variables.

### I-C Notation and Organization

The following notation is used throughout this paper. Lower-case letters are used to denote scalars and vectors, while upper-case letters are used to denote matrices. Given $a \in \mathbb{R}$, $\lceil a \rceil$ denotes the smallest integer greater than or equal to $a$. Given $m \in \mathbb{N}$, we use $[m]$ as a shorthand for $\{1, 2, \dots, m\}$. Given a vector $v$, $\|v\|_q$ denotes its $\ell_q$ norm. Given a matrix $A$, $A_i$ denotes its $i$-th column and $A_{ij}$ denotes the entry in its $i$-th row and $j$-th column. Further, given a set $\mathcal{I}$, $A_{\mathcal{I}}$ (resp., $v_{\mathcal{I}}$) denotes a submatrix (resp., subvector) obtained by retaining the columns of $A$ (resp., entries of $v$) corresponding to the indices in $\mathcal{I}$. Finally, the superscript $(\cdot)^\top$ denotes the transpose operation.

The rest of this paper is organized as follows. We formulate the problem of marginal correlation-based screening in Sec. II. Next, in Sec. III, we define the screening condition and present one of our main results that establishes the screening condition as a sufficient condition for ExSIS. In Sec. IV, we derive the screening condition for sub-Gaussian linear models and discuss the resulting ExSIS guarantees in relation to prior work. In Sec. V, we derive the screening condition for arbitrary linear models based on the correlations between independent variables and discuss implications of the derived ExSIS results. Finally, results of extensive numerical experiments on both synthetic and real data are reported in Sec. VI, while concluding remarks are presented in Sec. VII.

## II Problem Formulation

Our focus in this paper is on the ultrahigh-dimensional ordinary linear model $y = X\beta + \eta$, where $y \in \mathbb{R}^n$, $X \in \mathbb{R}^{n \times p}$, and $\log p = O(n^\delta)$ for some $\delta \in (0, 1)$. In the statistics literature, $X$ is referred to as the data/design/observation matrix with the rows of $X$ corresponding to individual observations and the columns of $X$ corresponding to individual features/predictors/variables, $y$ is referred to as the observation/response vector with individual responses given by $y_i$, $\beta \in \mathbb{R}^p$ is referred to as the parameter vector, and $\eta \in \mathbb{R}^n$ is referred to as modeling error or observation noise. Throughout this paper, we assume $X$ has unit $\ell_2$-norm columns, $\beta$ is $k$-sparse with $k < n$ (i.e., $\|\beta\|_0 = k < n$), and $\eta$ is a zero-mean Gaussian vector with (entry-wise) variance $\sigma^2$ and covariance $\sigma^2 I$. Here, $\eta$ is taken to be Gaussian with covariance $\sigma^2 I$ for the sake of this exposition, but our analysis is trivially generalizable to other noise distributions and/or covariance matrices. Further, we make no a priori assumption on the distribution of $X$. Finally, we define $S := \{i \in [p] : \beta_i \neq 0\}$ to be the set that indexes the non-zero components of $\beta$. Using this notation, the linear model can equivalently be expressed as

$$y = X\beta + \eta = X_S\beta_S + \eta. \tag{1}$$

Given (1), the goal of variable screening is to reduce the number of variables in the linear model from $p$ (since $p \gg n$) to a moderate scale $d$ (with $d \ll p$) using a fast and efficient method. Our focus here is in particular on screening methods that satisfy the so-called sure screening property [26]; specifically, a method is said to carry out sure screening if the $d$-dimensional model returned by it is guaranteed with high probability to retain all the columns of $X$ that are indexed by $S$. The motivation here is that once one obtains a moderate-dimensional model through sure screening of (1), one can use computationally intensive model selection, regression, and estimation techniques on the $d$-dimensional model for reliable model selection (identification of $S$), prediction (estimation of $y$), and reconstruction (estimation of $\beta$), respectively.

In this paper, we study sure screening using marginal correlations between the response vector $y$ and the columns of $X$. The resulting screening procedure is outlined in Algorithm 1, which is based on the principle that the higher the correlation of a column of $X$ with the response vector, the more likely it is that the said column contributes to the response vector (i.e., it is indexed by the set $S$).

The computational complexity of Algorithm 1 is only $O(np)$ and its ability to screen ultrahigh-dimensional linear models has been investigated in recent years by a number of researchers [24, 25]. The fundamental difference among these works stems from the manner in which the parameter $d$ (the dimension of the screened model) is computed from (1). Our goal in this paper is to provide an extended understanding of the screening performance of Algorithm 1 for arbitrary (random or deterministic) design matrices. The term sure independence screening (SIS) was coined in [26] to refer to screening of ultrahigh-dimensional Gaussian linear models using Algorithm 1. In this vein, we refer to variable screening using Algorithm 1 and the analysis of this paper as extended sure independence screening (ExSIS). The main research challenge for ExSIS is specification of $d$ for arbitrary matrices $X$ such that $S \subseteq \hat{S}$ with high probability, where $\hat{S}$ denotes the index set returned by Algorithm 1. Note that there is an inherent trade-off in addressing this challenge: the higher the value of $d$, the more likely $\hat{S}$ is to satisfy the sure screening property; however, the smaller the value of $d$, the lower the computational cost of performing model selection, regression, estimation, etc., on the $d$-dimensional problem. This leads us to the following research questions for ExSIS: (i) What are the conditions on $X$ under which $S \subseteq \hat{S}$? (ii) How small can $d$ be for arbitrary matrices $X$ such that $S \subseteq \hat{S}$? (iii) What are the constraints on the sparsity parameter $k$ under which $S \subseteq \hat{S}$? Note that there is also an interplay between the sparsity level and the allowable value of $d$ for sure screening: the lower the sparsity level, the easier it should be to screen a larger number of columns of $X$. Thus, an understanding of ExSIS also requires characterization of this relationship between $k$ and $d$ for marginal correlation-based screening. In the sequel, we not only address the aforementioned questions for ExSIS, but also characterize this relationship.
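
Algorithm 1 itself amounts to a single pass over the marginal correlations $X^\top y$ followed by retention of the $d$ largest ones in magnitude. A minimal sketch of this procedure (the function name, toy data, and parameter choices below are ours, not the paper's):

```python
import numpy as np

def exsis_screen(X, y, d):
    """Marginal correlation screening: keep the d columns of X whose
    absolute marginal correlations |X_i^T y| are largest.

    Assumes the columns of X have unit l2 norm."""
    w = X.T @ y                          # all marginal correlations, O(np)
    keep = np.argsort(np.abs(w))[-d:]    # indices of the d largest |w_i|
    return np.sort(keep)

# Toy example: k = 3 active variables out of p = 1000, n = 200 samples.
rng = np.random.default_rng(0)
n, p, k, d = 200, 1000, 3, 20
X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0)           # unit-norm columns
S = np.arange(k)                         # true support
beta = np.zeros(p)
beta[S] = 1.0
y = X @ beta + 0.01 * rng.standard_normal(n)
kept = exsis_screen(X, y, d)
print(set(S) <= set(kept))               # does the screened model retain S?
```

On this well-conditioned draw the screened $d$-dimensional model retains the full support, which is exactly the sure screening property discussed above.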

## III Sufficient Conditions for Sure Screening

In this section, we derive the most general sufficient conditions for ExSIS of ultrahigh-dimensional linear models. The results reported in this section provide important insights into the workings of ExSIS without imposing any statistical priors on $X$ and $\beta$. We begin with a definition of the screening condition for the design matrix $X$.

###### Definition 1 ($(k, b(n,p))$-Screening Condition).

Fix an arbitrary $\beta \in \mathbb{R}^p$ that is $k$-sparse. The (normalized) matrix $X$ satisfies the screening condition if there exists a $b(n,p) > 0$ such that the following hold:

$$\max_{i \in S}\Bigg|\sum_{\substack{j \in S \\ j \neq i}} X_i^\top X_j\,\beta_j\Bigg| \le b(n,p)\,\|\beta\|_2, \tag{SC-1}$$

$$\max_{i \in S^c}\Bigg|\sum_{j \in S} X_i^\top X_j\,\beta_j\Bigg| \le b(n,p)\,\|\beta\|_2. \tag{SC-2}$$

The screening condition is a statement about the collinearity of the independent variables in the design matrix. The parameter $b(n,p)$ in the screening condition captures the similarity between (i) the columns of $X_S$, and (ii) the columns of $X_S$ and $X_{S^c}$; the smaller the parameter $b(n,p)$ is, the less similar the columns are. Furthermore, since the sums in the screening condition run over the $k$ non-zero entries of $\beta$, the parameter $b(n,p)$ also reflects constraints on the sparsity parameter $k$.
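
In simulations where the support is known, the two maxima appearing in the screening condition can be computed directly. A sketch (the helper name and the test parameters are ours; the quantity we compare against corresponds to the sub-Gaussian value of $b(n,p)$ derived later in Sec. IV, under i.i.d. Gaussian entries):

```python
import numpy as np

def screening_condition_lhs(X, beta, S):
    """Left-hand sides of (SC-1) and (SC-2) when the support S is known."""
    p = X.shape[1]
    Sc = np.setdiff1d(np.arange(p), S)
    G = X.T @ X                          # Gram matrix of the columns
    GS = G[np.ix_(S, S)].copy()
    np.fill_diagonal(GS, 0.0)            # drop the j = i terms in (SC-1)
    sc1 = np.max(np.abs(GS @ beta[S]))
    sc2 = np.max(np.abs(G[np.ix_(Sc, S)] @ beta[S]))
    return sc1, sc2

# Example: a random design with nearly orthogonal unit-norm columns.
rng = np.random.default_rng(0)
n, p, k = 200, 500, 5
X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0)
S = np.arange(k)
beta = np.zeros(p)
beta[S] = 1.0
sc1, sc2 = screening_condition_lhs(X, beta, S)
# b(n,p)*||beta||_2 with the sub-Gaussian choice of b(n,p) from Sec. IV
# (taking b*/sigma* = 1 for i.i.d. Gaussian entries):
bound = np.sqrt(8 * np.log(p) / n) * np.linalg.norm(beta)
print(sc1, sc2, bound)
```

For such a design, both left-hand sides sit comfortably below the bound, illustrating why random designs with nearly orthogonal columns are easy to screen.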

We now present one of our main screening results for arbitrary design matrices, which highlights the significance of the screening condition and the role of the parameter $b(n,p)$ within ExSIS.

###### Theorem 1 (Sufficient Conditions for ExSIS).

Let $y = X\beta + \eta$ with $\beta \in \mathbb{R}^p$ a $k$-sparse vector and the entries of $\eta$ independently distributed as $\mathcal{N}(0, \sigma^2)$. Define $\tilde{\eta} := X^\top\eta$ and $\beta_{\min} := \min_{i \in S}|\beta_i|$, and let $G_\eta$ be the event $\{\|\tilde{\eta}\|_\infty \le 2\sigma\sqrt{\log p}\}$. Suppose $X$ satisfies the screening condition and assume $\frac{\beta_{\min}}{\|\beta\|_2} > 2b(n,p) + \frac{4\sqrt{\sigma^2\log p}}{\|\beta\|_2}$. Then, conditioned on $G_\eta$, Algorithm 1 satisfies $S \subseteq \hat{S}$ as long as
$$d \ge \left\lceil \frac{\sqrt{k}}{\dfrac{\beta_{\min}}{\|\beta\|_2} - 2b(n,p) - \dfrac{4\sqrt{\sigma^2\log p}}{\|\beta\|_2}} \right\rceil.$$

We refer the reader to Sec. III-B for a proof of this theorem.

### III-A Discussion

Theorem 1 highlights the dependence of ExSIS on the observation noise, the ratio $\beta_{\min}/\|\beta\|_2$, the parameter $b(n,p)$, and the model sparsity. We first comment on the relationship between ExSIS and the observation noise $\eta$. Notice that the statement of Theorem 1 is dependent upon the event $G_\eta$. However, for any $\epsilon > 0$, we have (see, e.g., [45, Lemma 6])

$$\Pr\big(\|\tilde{\eta}\|_\infty \ge \sigma\epsilon\big) < \frac{4p}{\epsilon\sqrt{2\pi}}\exp\Big(-\frac{\epsilon^2}{2}\Big). \tag{2}$$

Therefore, substituting $\epsilon = 2\sqrt{\log p}$ in (2), we obtain

$$\Pr(G_\eta) \ge 1 - \frac{2}{p\sqrt{2\pi\log p}}. \tag{3}$$
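
The bound in (3) can be sanity-checked by Monte Carlo simulation, under the reading that $\tilde{\eta} = X^\top \eta$ with unit-norm columns; all simulation parameters below are our illustrative choices:

```python
import numpy as np

# Monte Carlo check of (3): the event G_eta = {||eta_tilde||_inf <= 2*sigma*sqrt(log p)}
# should fail with probability at most 2 / (p * sqrt(2*pi*log p)).
rng = np.random.default_rng(1)
n, p, sigma, trials = 50, 100, 1.0, 20000
X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0)                   # unit-norm columns
thresh = 2 * sigma * np.sqrt(np.log(p))
etas = sigma * rng.standard_normal((n, trials))  # one noise draw per trial
corr = np.abs(X.T @ etas)                        # |eta_tilde| for every trial
fail_rate = np.mean(corr.max(axis=0) > thresh)
bound = 2 / (p * np.sqrt(2 * np.pi * np.log(p)))
print(fail_rate, "<=", bound)
```

The empirical failure rate sits below the analytical bound, consistent with (3) being a (slightly loose) union-bound-style tail estimate.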

Thus, Algorithm 1 possesses the sure screening property with high probability in the case of observation noise distributed as $\mathcal{N}(0, \sigma^2 I)$. We further note from the statement of Theorem 1 that the higher the signal-to-noise ratio (SNR), defined here as $\|\beta\|_2^2/\sigma^2$, the more Algorithm 1 can screen irrelevant/inactive variables. It is also worth noting here trivial generalizations of Theorem 1 for other noise distributions. In the case of $\eta$ distributed as $\mathcal{N}(0, \Sigma)$, Theorem 1 has $\sigma^2$ replaced by the largest eigenvalue of the covariance matrix $\Sigma$. In the case of $\eta$ following a non-Gaussian distribution, Theorem 1 has $2\sigma\sqrt{\log p}$ replaced by a distribution-specific upper bound on $\|\tilde{\eta}\|_\infty$ that holds with high probability.

In addition to the noise distribution, the performance of ExSIS also seems to be impacted by the minimum-to-signal ratio (MSR), defined here as $\beta_{\min}/\|\beta\|_2$. Specifically, the higher the MSR, the more Algorithm 1 can screen inactive variables. Stated differently, the independent variable with the weakest contribution to the response determines the size of the screened model. Finally, the parameter $b(n,p)$ in the screening condition also plays a central role in characterization of the performance of ExSIS. First, the smaller the parameter $b(n,p)$, the more Algorithm 1 can screen inactive variables. Second, the smaller the parameter $b(n,p)$, the more independent variables can be active in the original model; indeed, since $\beta_{\min}/\|\beta\|_2 \le 1/\sqrt{k}$, we have from the theorem statement that $k < (2b(n,p))^{-2}$. Third, the smaller the parameter $b(n,p)$, the lower the smallest allowable value of the MSR; indeed, we have from the theorem statement that $\frac{\beta_{\min}}{\|\beta\|_2} > 2b(n,p) + \frac{4\sqrt{\sigma^2\log p}}{\|\beta\|_2}$.

It is evident from the preceding discussion that the screening condition (equivalently, the parameter $b(n,p)$) is one of the most important factors that helps understand the workings of ExSIS and helps quantify its performance. Unfortunately, the usefulness of this knowledge is limited in the sense that the screening condition cannot be utilized in practice. Specifically, the screening condition is defined in terms of the set $S$, which is of course unknown. We overcome this limitation of Theorem 1 by implicitly deriving the screening condition for sub-Gaussian design matrices in Sec. IV and for a class of arbitrary (random or deterministic) design matrices in Sec. V.

### III-B Proof of Theorem 1

We first provide an outline of the proof of Theorem 1, which is followed by its formal proof. Define $w := X^\top y$, $p_0 := p$, and $\hat{S}_{p_0} := [p]$. Next, fix a positive integer $p_1 < p_0$ and define

$$\hat{S}_{p_1} := \{i \in \hat{S}_{p_0} : |w_i| \text{ is among the } p_1 \text{ largest of all marginal correlations}\}.$$

The idea is to first derive an initial upper bound on $t_1 := |\{i \in \hat{S}_{p_0} : |w_i| \ge \min_{l \in S}|w_l|\}|$, denoted by $\bar{t}_1$, and then choose $p_1 := \lceil \bar{t}_1 \rceil$; trivially, we have $S \subseteq \hat{S}_{p_1}$. As a result, we get

$$y = X\beta + \eta = X_{\hat{S}_{p_0}}\beta_{\hat{S}_{p_0}} + \eta = X_{\hat{S}_{p_1}}\beta_{\hat{S}_{p_1}} + \eta. \tag{4}$$

Note that while deriving $\bar{t}_1$, we need to ensure $p_1 < p_0$; this in turn imposes some conditions on $X$ that also need to be specified. Next, we can repeat the aforementioned steps to obtain $\hat{S}_{p_2}$ from $\hat{S}_{p_1}$ for a fixed positive integer $p_2 < p_1$. Specifically, define

$$\hat{S}_{p_2} := \{i \in \hat{S}_{p_1} : |w_i| \text{ is among the } p_2 \text{ largest of all marginal correlations}\}$$

and $t_2 := |\{i \in \hat{S}_{p_1} : |w_i| \ge \min_{l \in S}|w_l|\}|$. We can then derive an upper bound on $t_2$, denoted by $\bar{t}_2$, and then choose $p_2 := \lceil \bar{t}_2 \rceil$; once again, we have $S \subseteq \hat{S}_{p_2}$. Notice further that we do require $p_2 < p_1$, which again will impose conditions on $X$.

In a similar vein, we can keep on repeating this procedure to obtain a decreasing sequence of numbers $p_0 > p_1 > p_2 > \cdots$ and sets $\hat{S}_{p_0} \supseteq \hat{S}_{p_1} \supseteq \hat{S}_{p_2} \supseteq \cdots$ as long as $p_i < p_{i-1}$, where $p_i := \lceil \bar{t}_i \rceil$ and $t_i := |\{j \in \hat{S}_{p_{i-1}} : |w_j| \ge \min_{l \in S}|w_l|\}|$. The complete proof of Theorem 1 follows from a careful combination of these (analytical) steps. In order for us to be able to do that, however, we need two lemmas. The first lemma provides an upper bound on $t_i$, denoted by $\bar{t}_i$. The second lemma provides conditions on the design matrix such that $p_i < p_{i-1}$. The proof of the theorem follows from repeated application of the two lemmas.
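
The shrinking sequence of model sizes described in this outline can be illustrated numerically by iterating the upper bound of Lemma 1 below; all parameter values in this sketch are our illustrative choices rather than quantities from the paper:

```python
import numpy as np

# Iterate the upper bound of Lemma 1: p_i = ceil(t_bar_i), starting at p_0 = p.
# beta is taken to have k entries equal to one; all values are illustrative.
p, k, sigma = 10_000, 10, 0.01
beta_min, l2, l1 = 1.0, np.sqrt(10.0), 10.0   # beta_min, ||beta||_2, ||beta||_1
b = 0.02                                      # an assumed value of b(n, p)
noise = np.sqrt(sigma**2 * np.log(p))         # the sqrt(sigma^2 log p) term
denom = beta_min - b * l2 - 2 * noise
assert denom > 0                              # needed for the bound to be useful
sizes = [p]
while True:
    t_bar = (sizes[-1] * b * l2 + l1 + 2 * sizes[-1] * noise) / denom
    nxt = int(np.ceil(t_bar))
    if nxt >= sizes[-1]:                      # the recursion has converged
        break
    sizes.append(nxt)
print(sizes)  # a rapidly decreasing sequence that stays above k
```

The sequence collapses from $p$ to a few tens of variables in a handful of (purely analytical) passes, which is why a single choice of $d$ suffices in Algorithm 1.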

###### Lemma 1.

Fix $i \in \mathbb{N}$ and suppose $S \subseteq \hat{S}_{p_{i-1}}$, where $\hat{S}_{p_{i-1}} \subseteq [p]$ and $|\hat{S}_{p_{i-1}}| = p_{i-1}$. Further, suppose the design matrix $X$ satisfies the screening condition for the $k$-sparse vector $\beta$ and the event $G_\eta$ holds true. Finally, define $t_i := |\{j \in \hat{S}_{p_{i-1}} : |w_j| \ge \min_{l \in S}|w_l|\}|$. Under these conditions, we have

$$t_i \le \frac{p_{i-1}\,b(n,p)\|\beta\|_2 + \|\beta\|_1 + 2p_{i-1}\sqrt{\sigma^2\log p}}{\beta_{\min} - b(n,p)\|\beta\|_2 - 2\sqrt{\sigma^2\log p}} =: \bar{t}_i. \tag{5}$$

The proof of this lemma is provided in Appendix A. The second lemma, whose proof is given in Appendix B, provides conditions on $X$ under which the upper bound $\bar{t}_i$ derived on $t_i$ is non-trivial.

###### Lemma 2.

Fix $i \in \mathbb{N}$. Suppose $\frac{\beta_{\min}}{\|\beta\|_2} > 2b(n,p) + \frac{4\sqrt{\sigma^2\log p}}{\|\beta\|_2}$ and $p_{i-1} > \left\lceil \frac{\sqrt{k}}{\frac{\beta_{\min}}{\|\beta\|_2} - 2b(n,p) - \frac{4\sqrt{\sigma^2\log p}}{\|\beta\|_2}} \right\rceil$. Then, we have $\lceil \bar{t}_i \rceil < p_{i-1}$.

We are now ready to present a complete technical proof of Theorem 1.

###### Proof.

The idea is to use Lemma 1 and Lemma 2 repeatedly to screen columns of $X$. Note, however, that this is simply an analytical technique and we do not actually need to perform such an iterative procedure to specify $d$ in Algorithm 1. To begin, recall that we have $w := X^\top y$, $p_0 := p$, $\hat{S}_{p_0} := [p]$,

$$\hat{S}_{p_1} := \{i \in \hat{S}_{p_0} : |w_i| \text{ is among the } p_1 \text{ largest of all marginal correlations}\},$$

and $p_1 := \lceil \bar{t}_1 \rceil$, where $\bar{t}_1$ is defined in (5). By Lemma 1 and Lemma 2, we have $S \subseteq \hat{S}_{p_1}$ and $p_1 < p_0$, respectively. Next, given $\hat{S}_{p_1}$, we can use Lemma 1 and Lemma 2 to obtain $\hat{S}_{p_2}$ from $\hat{S}_{p_1}$ in a similar fashion. Specifically, let

$$\hat{S}_{p_2} := \{i \in \hat{S}_{p_1} : |w_i| \text{ is among the } p_2 \text{ largest of all marginal correlations}\}$$

and $p_2 := \lceil \bar{t}_2 \rceil$, where $\bar{t}_2$ is defined in (5). Then, by Lemma 1 and Lemma 2, we have $S \subseteq \hat{S}_{p_2}$ and $p_2 < p_1$, respectively.

Notice that we can keep on repeating this procedure to obtain sub-models $\hat{S}_{p_i}$ such that $S \subseteq \hat{S}_{p_i}$ and $p_i < p_{i-1}$. By repeated applications of Lemma 1 and Lemma 2, we have $S \subseteq \hat{S}_{p_i}$ for every such $i$. Further, we are also guaranteed that the procedure can be repeated as long as $p_{i-1}$ exceeds the lower bound on $d$ in the theorem statement. Thus, we can choose any $d$ satisfying that lower bound in Algorithm 1 in one shot and have $S \subseteq \hat{S}$. ∎

## IV Screening of Sub-Gaussian Design Matrices

In this section, we characterize the implications of Theorem 1 for ExSIS of the family of sub-Gaussian design matrices. As noted in Sec. III, this effort primarily involves establishing the screening condition for sub-Gaussian matrices and specifying the parameter $b(n,p)$ for such matrices. We begin by first recalling the definition of a sub-Gaussian random variable.

###### Definition 2.

A zero-mean random variable $x$ is said to follow a sub-Gaussian distribution if there exists a sub-Gaussian parameter $b > 0$ such that $\mathbb{E}[\exp(tx)] \le \exp(b^2t^2/2)$ for all $t \in \mathbb{R}$.

In words, a sub-Gaussian random variable is one whose moment generating function is dominated by that of a $\mathcal{N}(0, b^2)$ random variable. Some common examples of sub-Gaussian random variables include:

• A Gaussian random variable $x \sim \mathcal{N}(0, \sigma^2)$, which is sub-Gaussian with parameter $b = \sigma$.

• A Rademacher random variable $x$ taking values $\pm 1$ with equal probability, which is sub-Gaussian with parameter $b = 1$.

• A random variable $x$ uniformly distributed on $[-a, a]$, which is sub-Gaussian with parameter $b = a$.

• Any zero-mean random variable $x$ satisfying $|x| \le B$ almost surely, which is sub-Gaussian with parameter $b = B$.
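
The moment generating function bound of Definition 2 is easy to verify numerically for specific distributions; the following check uses a Rademacher variable (parameter $b = 1$) and a uniform variable on $[-a, a]$ (parameter $b = a$), both standard examples (the grid of $t$ values and the choice $a = 2$ are ours):

```python
import numpy as np

# Check E[exp(t*x)] <= exp(b^2 t^2 / 2) on a grid of t values.
t = np.linspace(-5.0, 5.0, 1001)
# Rademacher: E[exp(t*x)] = cosh(t), sub-Gaussian parameter b = 1.
rademacher_ok = np.all(np.cosh(t) <= np.exp(t**2 / 2) + 1e-12)
# Uniform on [-a, a]: E[exp(t*x)] = sinh(a*t)/(a*t), parameter b = a.
a = 2.0
with np.errstate(invalid="ignore"):
    mgf = np.where(t == 0.0, 1.0, np.sinh(a * t) / (a * t))
uniform_ok = np.all(mgf <= np.exp(a**2 * t**2 / 2) + 1e-12)
print(rademacher_ok, uniform_ok)  # → True True
```

Both inequalities follow term-by-term from the Taylor expansions of $\cosh$ and $\sinh$, so the grid check is a sanity test rather than a proof.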

Our focus in this paper is on design matrices in which entries are first independently drawn from sub-Gaussian distributions and then the columns are normalized. In contrast to prior works, however, we do not require the (pre-normalized) entries to be identically distributed. Rather, we allow each independent variable to be distributed as a sub-Gaussian random variable with a different sub-Gaussian parameter. Thus, the ExSIS analysis of this section is applicable to design matrices in which different columns might have different sub-Gaussian distributions. It is also straightforward to extend our analysis to the case where all (and not just the across-column) entries of the design matrix are non-identically distributed; we do not pursue this extension here for the sake of notational clarity.
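
As a concrete illustration of this design-matrix model, the following sketch draws different columns from different sub-Gaussian distributions and then normalizes each column to unit $\ell_2$ norm (the specific distributions and sizes are our choices):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 6
cols = []
for j in range(p):
    if j % 3 == 0:
        col = rng.standard_normal(n)            # Gaussian column
    elif j % 3 == 1:
        col = rng.choice([-1.0, 1.0], size=n)   # Rademacher column
    else:
        col = rng.uniform(-2.0, 2.0, size=n)    # bounded (uniform) column
    cols.append(col)
A = np.column_stack(cols)                       # pre-normalized matrix
X = A / np.linalg.norm(A, axis=0)               # unit-norm columns
print(np.allclose(np.linalg.norm(X, axis=0), 1.0))  # → True
```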

### IV-A Main Result

The ExSIS of linear models involving sub-Gaussian design matrices mainly requires establishing the screening condition and characterizing the parameter $b(n,p)$ for sub-Gaussian matrices. We accomplish this by individually deriving (SC-1) and (SC-2) in Definition 1 for sub-Gaussian design matrices in the following two lemmas.

###### Lemma 3.

Let $A$ be an $n \times p$ matrix whose entries in the $j$-th column are independently drawn from a zero-mean sub-Gaussian distribution with parameter $b_j$ and variance $\sigma_j^2$. Suppose the design matrix $X$ is obtained by normalizing the columns of $A$, i.e., $X_j := A_j/\|A_j\|_2$, $j \in [p]$. Finally, fix an arbitrary $\beta \in \mathbb{R}^p$ that is $k$-sparse, and define $b^* := \max_{j \in [p]} b_j$ and $\sigma^* := \min_{j \in [p]} \sigma_j$. Then, with probability exceeding $1 - c\,p^{-1}$ for an absolute constant $c > 0$, we have

$$\max_{i \in S}\Bigg|\sum_{\substack{j \in S \\ j \neq i}} X_i^\top X_j\,\beta_j\Bigg| \le \sqrt{\frac{8\log p}{n}}\left(\frac{b^*}{\sigma^*}\right)\|\beta\|_2.$$
###### Lemma 4.

Let $A$ be an $n \times p$ matrix whose entries in the $j$-th column are independently drawn from a zero-mean sub-Gaussian distribution with parameter $b_j$ and variance $\sigma_j^2$. Suppose the design matrix $X$ is obtained by normalizing the columns of $A$, i.e., $X_j := A_j/\|A_j\|_2$, $j \in [p]$. Finally, fix an arbitrary $\beta \in \mathbb{R}^p$ that is $k$-sparse, and define $b^* := \max_{j \in [p]} b_j$ and $\sigma^* := \min_{j \in [p]} \sigma_j$. Then, with probability exceeding $1 - c\,p^{-1}$ for an absolute constant $c > 0$, we have

$$\max_{i \in S^c}\Bigg|\sum_{j \in S} X_i^\top X_j\,\beta_j\Bigg| \le \sqrt{\frac{8\log p}{n}}\left(\frac{b^*}{\sigma^*}\right)\|\beta\|_2.$$

The proofs of Lemma 3 and Lemma 4 are provided in Appendix C and Appendix D, respectively. It now follows from a simple union bound argument that the screening condition holds for sub-Gaussian design matrices with probability exceeding $1 - c\,p^{-1}$ for an absolute constant $c > 0$. In particular, we have from Lemma 3 and Lemma 4 that $b(n,p) = \sqrt{\frac{8\log p}{n}}\left(\frac{b^*}{\sigma^*}\right)$ for sub-Gaussian matrices. We can now use this knowledge and Theorem 1 to provide the main result for ExSIS of ultrahigh-dimensional linear models involving sub-Gaussian design matrices.

###### Theorem 2 (ExSIS and Sub-Gaussian Matrices).

Let $A$ be an $n \times p$ matrix whose entries in the $j$-th column are independently drawn from a zero-mean sub-Gaussian distribution with parameter $b_j$ and variance $\sigma_j^2$. Suppose the design matrix $X$ is obtained by normalizing the columns of $A$, i.e., $X_j := A_j/\|A_j\|_2$. Next, let $y = X\beta + \eta$ with $\beta$ a $k$-sparse vector and the entries of $\eta$ independently distributed as $\mathcal{N}(0, \sigma^2)$. Finally, define $\tilde{\eta} := X^\top\eta$ and $\beta_{\min} := \min_{i \in S}|\beta_i|$, and let $b^* := \max_{j \in [p]} b_j$ and $\sigma^* := \min_{j \in [p]} \sigma_j$. Then Algorithm 1 guarantees $S \subseteq \hat{S}$ with probability exceeding $1 - c\,p^{-1}$ for an absolute constant $c > 0$ as long as

$$d \ge \left\lceil \frac{\sqrt{k}}{\dfrac{\beta_{\min}}{\|\beta\|_2} - 2\sqrt{\dfrac{8\log p}{n}}\left(\dfrac{b^*}{\sigma^*}\right) - \dfrac{4\sqrt{\sigma^2\log p}}{\|\beta\|_2}} \right\rceil.$$
###### Proof.

Let $G_X$ be the event that the design matrix $X$ satisfies the screening condition with parameter $b(n,p) = \sqrt{8\log p/n}\,(b^*/\sigma^*)$. Further, let $G_\eta$ be the event as defined in Theorem 1. It then follows from Lemma 3, Lemma 4, (3), and the union bound that the event $G_X \cap G_\eta$ holds with probability exceeding $1 - c\,p^{-1}$ for an absolute constant $c > 0$. The advertised claim now follows directly from Theorem 1. ∎

### IV-B Discussion

Since Theorem 2 follows from Theorem 1, it shares many of the insights discussed in Sec. III-A. In particular, Theorem 2 allows for exponential scaling of the number of independent variables, $p$, and dictates that the number of variables, $d$, retained after the screening stage be increased with an increase in the sparsity level and/or the number of independent variables, while it can be decreased with an increase in the SNR, MSR, and/or the number of samples. Notice that the lower bound on $d$ in Theorem 2 does require knowledge of the sparsity level $k$. However, this limitation can be overcome in a straightforward manner, as shown below.

###### Corollary 2.1.

Let $A$ be an $n \times p$ matrix whose entries in the $j$-th column are independently drawn from a zero-mean sub-Gaussian distribution with parameter $b_j$ and variance $\sigma_j^2$. Suppose the design matrix $X$ is obtained by normalizing the columns of $A$, i.e., $X_j := A_j/\|A_j\|_2$. Next, let $y = X\beta + \eta$ with $\beta$ a $k$-sparse vector and the entries of $\eta$ independently distributed as $\mathcal{N}(0, \sigma^2)$. Further, define $\beta_{\min} := \min_{i \in S}|\beta_i|$, $b^* := \max_{j} b_j$, and $\sigma^* := \min_{j} \sigma_j$. Finally, let $k \le c_3\,n/\log p$ and $\frac{\beta_{\min}}{\|\beta\|_2} \ge 2c_1\sqrt{\frac{8\log p}{n}}\left(\frac{b^*}{\sigma^*}\right) + c_2\frac{4\sqrt{\sigma^2\log p}}{\|\beta\|_2}$ for some constants $c_1, c_2 > 1$ and $c_3 > 0$. Then Algorithm 1 guarantees $S \subseteq \hat{S}$ with probability exceeding $1 - c\,p^{-1}$ as long as $d \ge \lceil c_4\,n/\log p \rceil$, where $c_4 > 0$ is a constant that depends only on $c_1$ and $c_3$.

###### Proof.

Theorem 2 and the condition $\frac{\beta_{\min}}{\|\beta\|_2} \ge 2c_1\sqrt{\frac{8\log p}{n}}\left(\frac{b^*}{\sigma^*}\right) + c_2\frac{4\sqrt{\sigma^2\log p}}{\|\beta\|_2}$ dictate

$$d \ge \left\lceil \frac{\sqrt{k}}{2(c_1-1)\sqrt{\dfrac{8\log p}{n}}\left(\dfrac{b^*}{\sigma^*}\right) + \dfrac{4(c_2-1)\sqrt{\sigma^2\log p}}{\|\beta\|_2}} \right\rceil \tag{6}$$

for sure screening of sub-Gaussian design matrices. The claim now follows by noting that $d \ge \lceil c_4\,n/\log p \rceil$ is a sufficient condition for (6) since $k \le c_3\,n/\log p$ and $b^*/\sigma^* \ge 1$ for sub-Gaussian random variables. ∎

A few remarks are in order now concerning our analysis of ExSIS for sub-Gaussian design matrices and that of SIS for random matrices in the existing literature. To this end, we focus on the results reported in [26], which is one of the most influential SIS works. In contrast to the screening condition presented in this paper, the analysis in [26] is carried out for design matrices that satisfy a certain concentration property. Since the said concentration property has only been shown in [26] to hold for Gaussian matrices, our discussion in the following is limited to Gaussian design matrices with independent entries.

The SIS results reported in [26] hold under four specific conditions. In particular, Condition 3 in [26] requires that: (i) the variance of the response variable is $O(1)$, (ii) $\beta_{\min} \ge c_1 n^{-\kappa}$ for some $c_1 > 0$ and $\kappa \ge 0$, and (iii) $\log p = O(n^\xi)$ for some $\xi \in (0, 1 - 2\kappa)$. Notice, however, that the variance condition is equivalent to having $\|\beta\|_2 = O(1)$. Our analysis, in contrast, imposes no such restriction. Rather, Theorem 2 shows that marginal correlation-based sure screening is fundamentally affected by the MSR $\beta_{\min}/\|\beta\|_2$. While Theorem 2 is only concerned with sufficient conditions, numerical experiments reported in Sec. VI confirm this dependence. Next, notice that $\|\beta\|_2 = O(1)$ together with $\beta_{\min} \ge c_1 n^{-\kappa}$ implies $\beta_{\min}/\|\beta\|_2 = \Omega(n^{-\kappa})$. It therefore follows that (SC-1) in the screening condition is a non-statistical variant of the corresponding condition in [26].

We next assume $\|\beta\|_2 = \Theta(1)$ for the sake of simplicity of argument and explicitly compare Theorem 2 and [26, Theorem 1] for the case of Gaussian design matrices with independent entries. Similar to [26], we also impose the condition $\beta_{\min} \ge c_1 n^{-\kappa}$ for comparison purposes. In this setting, both theorems guarantee sure screening with high probability. It is easy to verify that substituting $\|\beta\|_2 = \Theta(1)$ and $b^* = \sigma^*$ in Theorem 2 results in constraints on $d$ and $\beta_{\min}$ that are identical to those of [26, Theorem 1] for our analysis. Next, [26, Theorem 1] also imposes a sparsity constraint for the sure screening result to hold. However, the condition $\log p = O(n^\delta)$ with $\delta \in (0, 1)$ reduces this constraint to $k = O(n/\log p)$, which matches the sparsity constraint imposed by Theorem 2 (cf. Corollary 2.1). To summarize, the ExSIS results derived in this paper coincide with the ones in [26] for the case of Gaussian design matrices. However, our results are more general in the sense that they explicitly bring out the dependence of Algorithm 1 on the SNR and the MSR, which is something missing in [26], and they are applicable to sub-Gaussian design matrices.

## V Screening of Arbitrary Design Matrices

The ExSIS analysis in Sec. IV specializes Theorem 1 to sub-Gaussian design matrices. But what about design matrices whose entries either do not follow sub-Gaussian distributions or have unknown statistical distributions? We address this question in this section by deriving verifiable sufficient conditions that guarantee the screening condition for any arbitrary (random or deterministic) design matrix. These sufficient conditions are stated in terms of two measures of similarity among the columns of a design matrix, termed worst-case coherence and average coherence, which are defined as follows.

###### Definition 3 (Worst-case and Average Coherences).

Let $X$ be an $n \times p$ matrix with unit $\ell_2$-norm columns. The worst-case coherence of $X$ is denoted by $\mu$ and is defined as [46]:

$$\mu := \max_{i,j:\, i \neq j} \big| X_i^\top X_j \big|.$$

On the other hand, the average coherence of $X$ is denoted by $\nu$ and is defined as [45]:

$$\nu := \frac{1}{p-1} \max_i \Big| \sum_{j:\, j \neq i} X_i^\top X_j \Big|.$$

Notice that both the worst-case and the average coherences are readily computable in polynomial time. Heuristically, the worst-case coherence is an indirect measure of pairwise similarity among the columns of $X$: $\mu$ approaches zero as the columns of $X$ become less similar, and it approaches one as at least two columns of $X$ become more similar. The average coherence, on the other hand, is an indirect measure of both the collective similarity among the columns of $X$ and the spread of the columns of $X$ within the unit sphere: $\nu$ decreases as the columns of $X$ become more spread out, and it increases as the columns of $X$ become less spread out. We refer the reader to [47] for further discussion of these two measures as well as their values for commonly encountered matrices.
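Both coherences involve only the Gram matrix of the columns, so each is computable in $O(np^2)$ time. The following NumPy sketch (the function name and test sizes are ours) illustrates the computation:

```python
import numpy as np

def coherences(X):
    """Worst-case and average coherence of a matrix with unit-norm columns.

    mu = max_{i != j} |<X_i, X_j>|
    nu = (1/(p-1)) * max_i |sum_{j != i} <X_i, X_j>|
    """
    p = X.shape[1]
    G = X.T @ X                        # Gram matrix of the columns
    off = G - np.diag(np.diag(G))      # zero out the unit diagonal
    mu = np.max(np.abs(off))
    nu = np.max(np.abs(off.sum(axis=1))) / (p - 1)
    return mu, nu

# example: a column-normalized Gaussian design
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 500))
X /= np.linalg.norm(X, axis=0)         # unit l2-norm columns
mu, nu = coherences(X)
```

Note that $\nu \le \mu$ always holds, by the triangle inequality applied to the inner sum.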

We are now ready to describe the main results of this section. The first result connects the screening condition to the worst-case coherence. We will see, however, that this result suffers from the so-called square-root bottleneck: ExSIS analysis based solely on the worst-case coherence can, at best, handle $k = O(\sqrt{n})$ scaling of the sparsity parameter. The second result overcomes this bottleneck by connecting the screening condition to both the worst-case and average coherences. The caveat here is that the second result imposes a mild statistical prior on the true model $S$.

### V-a ExSIS and the Worst-case Coherence

We begin by relating the worst-case coherence of an arbitrary design matrix with unit-norm columns to the screening condition.

###### Lemma 5 (Worst-case Coherence and the Screening Condition).

Let $X$ be an $n \times p$ design matrix with unit-norm columns. Then, we have

$$\max_{i \in S} \Big| \sum_{\substack{j \in S \\ j \neq i}} X_i^\top X_j \, \beta_j \Big| \le \mu \sqrt{k} \, \|\beta\|_2, \quad\text{and}\quad \max_{i \in S^c} \Big| \sum_{j \in S} X_i^\top X_j \, \beta_j \Big| \le \mu \sqrt{k} \, \|\beta\|_2.$$

The proof of this lemma is provided in Appendix E. It follows from Lemma 5 that a design matrix satisfies the screening condition with parameter as long as . We now combine this implication of Lemma 5 with Theorem 1 to provide a result for ExSIS of arbitrary linear models.
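Before combining these pieces, the two bounds of Lemma 5 can be sanity-checked numerically on a toy instance (the sizes, seed, and support below are our own choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 64, 256, 5
X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0)             # unit-norm columns

S = rng.choice(p, size=k, replace=False)   # support of beta
beta = np.zeros(p)
beta[S] = rng.standard_normal(k)

G = X.T @ X
mu = np.max(np.abs(G - np.diag(np.diag(G))))
bound = mu * np.sqrt(k) * np.linalg.norm(beta)

# left-hand sides of the two inequalities in Lemma 5
lhs_in = max(abs(sum(G[i, j] * beta[j] for j in S if j != i)) for i in S)
Sc = np.setdiff1d(np.arange(p), S)
lhs_out = max(abs(sum(G[i, j] * beta[j] for j in S)) for i in Sc)
```

Both left-hand sides are bounded by `bound`, as guaranteed by the triangle and Cauchy–Schwarz inequalities underlying the lemma.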

###### Theorem 3.

Let $y = X\beta + w$, with $\beta$ a $k$-sparse vector and the entries of $w$ independently distributed as $\mathcal{N}(0, \sigma^2)$. Suppose and . Then, Algorithm 1 satisfies with probability exceeding as long as .

The proof of this theorem follows directly from Lemma 5 and Theorem 1. Next, a straightforward corollary of Theorem 3 shows that ExSIS of arbitrary linear models can in fact be carried out without explicit knowledge of the sparsity parameter $k$.

###### Corollary 3.1.

Let $y = X\beta + w$, with $\beta$ a $k$-sparse vector and the entries of $w$ independently distributed as $\mathcal{N}(0, \sigma^2)$. Suppose , , and for some . Then, Algorithm 1 satisfies with probability exceeding as long as .

###### Proof.

Under the assumption of , notice that

$$d \;\ge\; \left\lceil \sqrt{k} \left( 2(c_1 - 1)\,\mu\sqrt{k} + \frac{4(c_2 - 1)\sqrt{\sigma^2 \log p}}{\|\beta\|_2} \right) \right\rceil \qquad (7)$$

is a sufficient condition for . Further, note that is a sufficient condition for (7). Next, since $p > n$, we also have $\mu \ge \sqrt{(p-n)/(n(p-1))}$ from the Welch bound on the worst-case coherence of design matrices [48]. Thus, is a sufficient condition for . ∎
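For concreteness, the Welch bound states that any $n \times p$ matrix with unit-norm columns and $p > n$ has $\mu \ge \sqrt{(p-n)/(n(p-1))}$, which approaches $1/\sqrt{n}$ as $p$ grows. The snippet below (sizes are ours) evaluates the bound and the resulting cap on any sparsity constraint of the illustrative form $\mu k \le 1$:

```python
import math

def welch_bound(n, p):
    """Welch lower bound on the worst-case coherence of any n x p
    matrix with unit-norm columns (p > n)."""
    return math.sqrt((p - n) / (n * (p - 1)))

n, p = 1000, 100_000
mu_min = welch_bound(n, p)   # close to 1/sqrt(n) for p >> n

# a coherence-only sparsity condition of the form mu * k <= 1 can
# therefore never certify more than roughly sqrt(n) active variables
k_cap = 1.0 / mu_min
```

This is exactly the square-root bottleneck: no matter how well-conditioned the design is, $\mu$ cannot fall much below $1/\sqrt{n}$, so coherence-only analysis stalls near $k = O(\sqrt{n})$.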

It is instructive to compare this result for arbitrary linear models with Corollary 2.1 for sub-Gaussian linear models. Corollary 2.1 requires the size of the screened model to scale as , whereas this result requires it to scale only as . While this may seem to suggest that Corollary 3.1 is better than Corollary 2.1, such an observation ignores the respective constraints on the sparsity parameter in the two results. Specifically, Corollary 2.1 allows for almost linear scaling of the sparsity parameter, whereas Corollary 3.1 suffers from the so-called square-root bottleneck, $k = O(\sqrt{n})$, because of the Welch bound. Stated differently, Corollary 3.1 fails to specialize to Corollary 2.1 in the case of a sub-Gaussian design matrix. We overcome this limitation in the next subsection by bringing the average coherence into the analysis and imposing a statistical prior on the true model.
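For reference, the screening operation analyzed throughout this section is marginal-correlation ranking: keep the $d$ columns most correlated with the response. A minimal sketch under that reading (the function name, interface, and toy instance are ours; Algorithm 1 itself is stated earlier in the paper):

```python
import numpy as np

def marginal_screen(X, y, d):
    """Keep the indices of the d columns of X (unit-norm columns
    assumed) with the largest marginal correlations |X_j^T y|."""
    scores = np.abs(X.T @ y)
    return np.sort(np.argsort(scores)[-d:])

# toy sparse linear model y = X beta + w
rng = np.random.default_rng(2)
n, p, k, d = 128, 1024, 4, 32
X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0)
beta = np.zeros(p)
beta[:k] = 5.0                      # support {0, ..., k-1}
y = X @ beta + 0.1 * rng.standard_normal(n)

kept = marginal_screen(X, y, d)
# sure screening succeeded if the true support survives
screened_ok = set(range(k)).issubset(kept.tolist())
```

On this well-separated toy instance the support comfortably survives the screen; the theorems above quantify when such success is guaranteed.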

### V-B ExSIS and the Coherence Property

In order to break the square-root bottleneck for ExSIS of arbitrary linear models, we first define the notion of the coherence property.

###### Definition 4 (The Coherence Property).

An $n \times p$ design matrix $X$ with unit-norm columns is said to obey the coherence property if there exists a constant such that and .

Heuristically, the coherence property, first introduced in [45], requires the independent variables to be sufficiently uncorrelated, both marginally and jointly. Notice that, unlike many conditions in high-dimensional statistics (see, e.g., [41, 42, 43, 44]), the coherence property is explicitly certifiable in polynomial time for any given design matrix. We now establish that the coherence property implies that the design matrix satisfies the screening condition with high probability, where the probability is with respect to a uniform prior on the true model $S$.
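Certifying the coherence property is indeed a polynomial-time computation. The exact thresholds of Definition 4 are not legible in this excerpt, so the checker below uses illustrative placeholders in the spirit of [45] ($\mu \le c/\sqrt{\log p}$ and $\nu \le \mu/\sqrt{n}$, with $c$ a stand-in constant), not the paper's exact constants:

```python
import numpy as np

def check_coherence_property(X, c=0.5):
    """Polynomial-time certificate sketch for a coherence property.

    The thresholds mu <= c / sqrt(log p) and nu <= mu / sqrt(n) are
    illustrative stand-ins, not the exact constants of Definition 4.
    """
    n, p = X.shape
    G = X.T @ X
    off = G - np.diag(np.diag(G))
    mu = np.max(np.abs(off))
    nu = np.max(np.abs(off.sum(axis=1))) / (p - 1)
    return bool(mu <= c / np.sqrt(np.log(p)) and nu <= mu / np.sqrt(n))
```

An orthonormal design passes trivially ($\mu = \nu = 0$), while a design with two identical columns fails since $\mu = 1$.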

###### Lemma 6 (Coherence Property and the Screening Condition).

Let $X$ be an $n \times p$ design matrix that satisfies the coherence property with , and suppose and . Further, assume the true model $S$ is drawn uniformly at random from all $k$-subsets of $\{1, \dots, p\}$. Then, with probability exceeding , we have

$$\max_{i \in S} \Big| \sum_{\substack{j \in S \\ j \neq i}} X_i^\top X_j \, \beta_j \Big| \le c_\mu \, \mu \sqrt{\log p} \, \|\beta\|_2, \quad\text{and}\quad \max_{i \in S^c} \Big| \sum_{j \in S} X_i^\top X_j \, \beta$$