Group Model Selection Using Marginal Correlations: The Good, the Bad and the Ugly
Group model selection is the problem of determining a small subset of groups of predictors (e.g., the expression data of genes) that are responsible for majority of the variation in a response variable (e.g., the malignancy of a tumor). This paper focuses on group model selection in high-dimensional linear models, in which the number of predictors far exceeds the number of samples of the response variable. Existing works on high-dimensional group model selection either require the number of samples of the response variable to be significantly larger than the total number of predictors contributing to the response or impose restrictive statistical priors on the predictors and/or nonzero regression coefficients. This paper provides comprehensive understanding of a low-complexity approach to group model selection that avoids some of these limitations. The proposed approach, termed Group Thresholding (GroTh), is based on thresholding of marginal correlations of groups of predictors with the response variable and is reminiscent of existing thresholding-based approaches in the literature. The most important contribution of the paper in this regard is relating the performance of GroTh to a polynomial-time verifiable property of the predictors for the general case of arbitrary (random or deterministic) predictors and arbitrary nonzero regression coefficients.
1.1 Motivation and Background
One of the most fundamental of problems in statistical data analysis is to learn the relationship between the samples of a dependent or response variable (e.g., the malignancy of a tumor, the health of a network) and the samples of independent or predictor variables (e.g., the expression data of genes, the traffic data in the network). This problem was relatively easy in the data-starved world of yesteryears. We had samples and predictors, and our inability to observe too many variables meant that we lived in the “ greater than or equal to ” world. Times have changed now. The data-rich world of today has enabled us to simultaneously observe an unprecedented number of variables per sample. It is nearly impossible in many of these instances to collect as many, or more, samples as the number of predictors. Imagine, for example, collecting hundreds of thousands of thyroid tumors in a clinical setting. The “ smaller than ” world is no longer a theoretical construct in statistical data analysis. It has finally arrived; and it is here to stay.
This paper concerns statistical inference in the “ smaller than ” setting for the case when the response variable depends linearly on the predictors. Mathematically, a model of this form can be expressed as
Here, denotes the -th sample of the response variable, denotes the -th sample of the -th predictor, denotes the error in the model, and the parameters are called regression coefficients. This relationship between the samples of the response variable and those of the predictors can be expressed compactly in matrix-vector form as . The matrix in this form, termed the design matrix, is an matrix whose -th column comprises the samples of the -th predictor. In tumor classification, for example, an entry in the response variable could correspond to the malignancy (expressed as a numerical number) of a tumor sample, while the corresponding row in would correspond to the expression level of genes in that tumor sample.
The linear model , despite its mathematical simplicity, continues to make profound impacts in countless application areas . Such models are used for various inferential purposes. In this paper, we focus on the problem of model selection in high-dimensional linear models, which involves determining a small subset of predictors that are responsible for majority (or all) of the variation in the response variable . High-dimensional model selection can be used to implicate a small number of genes in the development of cancerous tumors, identify a small number of genes that primarily affect prognosis of a disease, etc.
1.2 Group Model Selection and Our Contributions
There exist many applications in statistical model selection where the implication of a single predictor in the response variable implies presence of other related predictors in the true model. This happens, for instance, in the case of microarray data when the genes (predictors) share the same biological pathway . In such situations, it is better to reformulate the problem of model selection in a “group” setting. Specifically, the response variable in high-dimensional linear models in group settings can be best explained by a small number of groups of predictors:
where , an submatrix of , denotes the -th group of predictors, denotes the group of regression coefficients associated with the group of predictors , and the set denotes the underlying true (group) model, corresponding to the groups of predictors that explain .
One of the main contributions of this paper is comprehensive understanding of
a polynomial-time algorithm, which we term Group Thresholding (GroTh),
that returns an estimate of the true (group) model for the
general case of arbitrary (random or deterministic) design matrices and
arbitrary nonzero regression coefficients. To this end, we make use of two
computable geometric measures of group coherence of a design
matrix—the worst-case group coherence and the
average group coherence —to provide a nonasymptotic
analysis of GroTh (Algorithm 1). We in particular establish
that if satisfies a verifiable group coherence property then, for
all but a vanishingly small fraction of possible models , GroTh: ()
handles linear scaling of the total number of predictors contributing to the
1.3 Relationship to Previous Work
The basic idea of using grouped predictors for inference in linear models has been explored by various researchers in recent years. Some notable works in this direction in the “” setting include [3, 4, 5, 6, 7, 8, 9]. Despite these inspiring results, more work needs to be done for high-dimensional group model selection. This is because the results reported in [3, 4, 5, 6, 7, 8, 9] do not guarantee linear scaling of the total number of predictors contributing to the response for the case of arbitrary design matrices and nonzero regression coefficients.
The work in this paper is also related to another body of work in statistics and signal processing literature that studies the high-dimensional linear model for the restrictive case of having a Kronecker structure: for some matrix , where denotes the Kronecker product. An incomplete list of works in this direction includes [10, 11, 12, 13, 14, 15, 16]. These restrictive works, however, also fail to guarantee linear scaling of the total number of predictors contributing to the response for the case of arbitrary nonzero regression coefficients.
Finally, note that the group model selection procedure studied in this paper is based on analyzing the marginal correlations, , of predictors with the response variable. Therefore, our work is algorithmically similar to the group thresholding approaches of [13, 14, 8]. The main appeal of such approaches is their low computational complexity of , which is much smaller than the typical computational complexity associated with other model selection procedures . In addition to the scaling limitations of the total number of influential predictors discussed earlier, however, the works in [13, 14, 8] also incorrectly conclude that performance of thresholding-based approaches is inversely proportional to the dynamic range, , of the nonzero groups of regression coefficients.
1.4 Mathematical Convention
The predictors and the response variable are assumed to be real valued throughout the paper, with the understanding that extensions to a complex-valued setting can be carried out in a straightforward manner. Uppercase letters are reserved for matrices, while lowercase letters are used for both vectors and scalars. Constants that do not depend upon the problem parameters (such as , , , and ) are denoted by , , etc. The notation for is a shorthand for the set , while the notation signifies equality in distribution. The transpose operation is denoted by and the spectral norm of a matrix is denoted by . Finally, the norm of a vector with each is defined as for , where denotes the usual norm. Note that and for .
In Section 2, we mathematically formulate the problem of group model selection, rigorously define the notions of worst-case group coherence, average group coherence and the group coherence property, and state and discuss the main result of the paper. In Section 3, we prove the main result of the paper. Finally, we present some numerical results in Section 4 and conclude in Section 5.
2 Group Model Selection Using GroTh
2.1 Problem Formulation
The object of attention in this paper is the high-dimensional linear model relating the response variable to the predictors comprising the columns of the design matrix . Since scalings of the columns of can be absorbed into the regression vector , we assume without loss of generality that the columns of have unit norms. There are three simplifying assumptions we make in this paper that will be relaxed in a sequel to this work. First, the modeling error is zero, , and thus the response variable is exactly equal to a parsimonious linear combination of grouped predictors: . Second, the groups of predictors are characterized by the same number of predictors per group: with . Third, the groups of predictors are orthonormalized: .
The main goal of this paper is characterization of the performance of a group model selection procedure, termed GroTh, that returns an estimate of the true model by sorting the groups of marginal correlations according to their -norms, , in descending order and setting to be indices of the first sorted groups of marginal correlations (see Algorithm 1). Instead of focusing on the worst-case performance of GroTh, however, we seek to characterize its performance for an arbitrary (but fixed) set of nonzero (grouped) regression coefficients supported on most models. Specifically, we do not impose any statistical prior on the set of nonzero regression coefficients, while we assume that the true (group) model is a uniformly random -subset of . Finally, the metrics of goodness we use in this paper are the false-discovery proportion (fdp) and the non-discovery proportion (ndp), defined as
respectively. These two metrics have gained widespread usage in multiple hypotheses testing problems in recent years. In particular, the expectation of the fdp is the well-known false-discovery rate (fdr) [18, 19].
2.2 Main Result and Discussion
Heuristically, successful group model selection requires the groups of predictors contributing to the response variable to be sufficiently distinguishable from the ones outside the true model. In this paper, we capture the notion of distinguishability of predictors through two easily computable, global geometric measures of the design matrix, namely, the worst-case group coherence and the average group coherence. The worst-case group coherence of is defined as
while the average group coherence of is defined as
Note that is a trivial upper bound on . It is also worth pointing that the worst-case group coherence and its variants have existed in earlier literature [7, 20], but the average group coherence is defined for the first time in here.
The central thesis of this paper is that group model selection using GroTh can be successful if these two measures of group coherence of are small enough. In particular, we address the question of how small should these two measures be in terms of the group coherence property.
Definition 1 (The Group Coherence Property).
The design matrix is said to satisfy the group coherence property if the following two conditions hold for some positive constants and :
It is straightforward to observe from the above definition that the group coherence property is a global property of that can be explicitly verified in polynomial time. Finally, we define to be the -th largest group of nonzero regression coefficients: . We are now ready to state the main result of this paper.
Theorem 1 (Group Model Selection Using GroTh).
Suppose the design matrix satisfies the group coherence property with parameters and . Next, fix parameters , , and define parameter . Then, under the assumptions , , and , we have with probability exceeding that
resulting in and , where is defined to be the largest integer for which the inequality holds. Here, the probability is with respect to the uniform distribution of the true model over all possible models.
A proof of this theorem is given in Section 3. We now provide a brief discussion of the significance of this result. First, Theorem 1 indicates that a polynomial-time verifiable property, namely, the group coherence property, of the design matrix can be checked to ascertain whether GroTh, which has computational complexity of , is well suited for group model selection. Second, it states that if satisfies the group coherence property then GroTh handles linear scaling of the total number of predictors contributing to the response, , for all but a vanishingly small fraction of models. This is in stark contrast to the earlier works [13, 14, 8] on thresholding-based approaches in high-dimensional linear models, which do not guarantee such linear scaling for the case of arbitrary nonzero regression coefficients. Note that while we do not provide in this paper explicit examples of design matrices satisfying the group coherence property, numerical results in Section 4 show that the set of design matrices satisfying the group coherence property is not empty.
Finally, Theorem 1 offers a nice interpretation of the price one might have to pay in estimating the true model using only marginal correlations. Specifically, (6) in the theorem implies group thresholding of marginal correlations effectively gives rise to a self-noise floor of . In words, the estimate returned by GroTh is guaranteed to return the indices of all the groups of predictors whose contributions to the response variable (in the sense) are above the self-noise floor of (cf. 6). This is again a significant improvement over the earlier works [13, 14, 8], which suggest that performance of thresholding-based approaches is inversely proportional to the dynamic range, , of the nonzero groups of regression coefficients. In order to expand on this, we observe from (6) that
Theorem 1 and the left-hand side of (7) indicate that inclusion of the -th group of predictors in the estimate is in fact related to the ratio of the energy contributed by the -th group of predictors to the average energy contributed per group of nonzero predictors: . Further, this implies that an increase in the dynamic range that comes from a decrease in cannot affect the performance of GroTh too much since increases for most groups of predictors in this case. This is indeed confirmed by the numerical experiments reported in Section 4.
3 Proof of the Main Result
We begin by developing some notation to facilitate the forthcoming analysis. Notice that the -dimensional vector of marginal correlations, , can be written as groups of -dimensional marginal correlations: with the vector . In the following, we use (an submatrix of ), (an subvector of ), and (an subvector of ) to denote the groups of predictors, groups of regression coefficients, and the marginal correlations corresponding to the true model , respectively. Similarly, we use and to denote the groups of predictors and the marginal correlations corresponding to the complement set , respectively.
Proof of Theorem 1 requires understanding the behaviors of the group vector and the group vector . In this subsection, we state and prove two lemmas that help us toward this goal. We will then leverage these two lemmas to provide a proof of Theorem 1.
Before proceeding, recall that is taken to be a uniformly random -subset of , while the set of nonzero group regression coefficients is considered to be deterministic (and fixed) but unknown. It therefore follows that the -dimensional group vector can be equivalently expressed as
where is a random permutation of , denotes the first elements of , is an submatrix of , and is an (group) vector of nonzero regression coefficients. Similarly, the -dimensional group vector can be expressed as
where denotes the last elements of and is an submatrix of .
Fix and . Next, assume and let denote the first elements of a random permutation of . Then for any fixed group vector
where is an absolute constant.
The proof of this lemma relies heavily on Banach-space-valued Azuma’s inequality stated in the Appendix. To begin, note that
We next fix an and define the event for . Then conditioned on , we have
In order to make use of the concentration inequality in Proposition 1 in the Appendix for upper bounding (12), we construct an -valued Doob martingale on . We first define and then define the Doob martingale as follows:
where denotes the first elements of . The next step involves showing that the constructed martingale has bounded differences. In order for this, we use to denote the -th element of and define
In order to upper bound , we first define an random matrix
Next, we notice that for every , the random variable conditioned on has a uniform distribution over , while conditioned on has a uniform distribution over . Therefore, we get
In order to evaluate for , we consider three cases for the index . In the first case of , it can be seen that for every and for . In the second case of , it can similarly be seen that for every and , while for . In the final case of , it can be argued that for every , for , and for . Consequently, regardless of the initial choice of , we have
The next step needed to upper bound (12) involves providing an upper bound on . To this end, note that
where follows since conditioned on has a uniform distribution over and is a consequence of the definition of average group coherence. Finally, we note from [23, Lemma B.1] that defined in Proposition 1 satisfies for . Consequently, under the assumption that , it can be seen from our construction of the Doob martingale that
where follows from Banach-space-valued Azuma’s inequality stated in the Appendix. Further, it can be established using (18) through tedious algebraic manipulations that
where , follows from the union bound and the fact that ’s are identically distributed, while follows since has a uniform distribution over the set . ∎
Fix and . Next, assume , and let and denote the first elements and the last elements of a random permutation of , respectively. Then for any fixed group vector
where is an absolute constant.
We next fix an and define for . Then conditioned on , we again have the following simple equality:
where denotes the first elements of . The next step in the proof involves showing is bounded for all . To do this, we define
for and once again resort to the argument in Lemma 1 that . Further, we define an random matrix
and notice that for , for , and for . It then follows from this discussion that