A J_{\lambda} norm properties

Abstract

Sorted L-One Penalized Estimation (SLOPE, [10]) is a relatively new convex optimization procedure which allows for adaptive selection of regressors under sparse high dimensional designs. Here we extend the idea of SLOPE to deal with the situation when one aims at selecting whole groups of explanatory variables instead of single regressors. This approach is particularly useful when variables in the same group are strongly correlated and thus true predictors are difficult to distinguish from their correlated ”neighbors”’. We formulate the respective convex optimization problem, gSLOPE (group SLOPE), and propose an efficient algorithm for its solution. We also define a notion of the group false discovery rate (gFDR) and provide a choice of the sequence of tuning parameters for gSLOPE so that gFDR is provably controlled at a prespecified level if the groups of variables are orthogonal to each other. Moreover, we prove that the resulting procedure adapts to unknown sparsity and is asymptotically minimax with respect to the estimation of the proportions of variance of the response variable explained by regressors from different groups. We also provide a method for the choice of the regularizing sequence when variables in different groups are not orthogonal but statistically independent and illustrate its good properties with computer simulations.

\makenomenclature

Group SLOPE - adaptive selection of groups of predictors

Damian Brzyski, Weijie Su, Małgorzata Bogdan,

Faculty of Mathematics, Wroclaw University of Technology, Poland

Institute of Mathematics, Jagiellonian University, Cracow, Poland

Department of Statistics, Stanford University, Stanford, CA 94305, USA

Institute of Mathematics, Wroclaw University, Poland

Key words: Multiple Regression, Model Selection, Group selection, SLOPE, False Discovery Rate, Asymptotic Minimax

1 Introduction

Consider the classical multiple regression model of the form

 y=Xβ+z, (1.1)

where is the dimensional vector of values of the response variable, is the by experiment (design) matrix and . We assume that and are known, while is unknown. In many cases of data mining the purpose of the statistical analysis is to recover the support of , which identifies the set of important regressors. Here, the true support corresponds to truly relevant variables (i.e. variables which have impact on observations). Common procedures to solve this model selection problem rely on minimization of some objective function consisting of the weighted sum of two components: first term responsible for the goodness of fit and second term penalizing the model complexity. Among such procedures one can mention classical model selection criteria like the Akaike Information Criterion (AIC) [3] and the Bayesian Information Criterion (BIC) [17], where the penalty depends on the number of variables included in the model, or LASSO [21], where the penalty depends on the norm of regression coefficients. The main advantage of LASSO over classical model selection criteria is that it is a convex optimization problem and, as such, it can be easily solved even for very large design matrices.

LASSO solution is obtained by solving the optimization problem

 argminb  {12∥∥y−Xb∥∥2+λL∥b∥1}, (1.2)

where is a tuning parameter defining the trade-off between the model fit and the sparsity of solution. In practical applications the selection of good might be very challenging. For example it has been reported that in high dimensional settings the popular cross-validation typically leads to detection of a large number of false regressors (see e.g. [10]). The general rule is that when one reduces , then LASSO can identify more elements from the true support (true discoveries) but at the same time it produces more false discoveries. In general the numbers of true and false discoveries for a given depend on unknown properties on the data generating mechanism, like the number of true regressors and the magnitude of their effects. A very similar problem occurs when selecting thresholds for individual tests in the context of multiple testing. Here it was found that the popular Benjamini-Hochberg rule (BH, [7]), aimed at control of the False Discovery Rate (FDR), adapts to the unknown data generating mechanism and has some desirable optimality properties under a variety of statistical settings (see e.g. [1, 8, 16, 12]). The main property of this rule is that it relaxes the thresholds along the sequence of test statistics, sorted in the decreased order of magnitude. Recently the same idea was used in a new generalization of LASSO, named SLOPE (Sorted L-One Penalized Estimation, [9, 10]). Instead of the norm (as in LASSO case), the method uses FDR control properties of norm, defined as follows; for sequence satisfying and , where is the vector of sorted absolute values of coordinates of . SLOPE is the solution to a convex optimization problem

 argminb  {12∥∥y−Xb∥∥2+Jλ(b)}, (1.3)

which clearly reduces to LASSO for . Similarly as in classical model selection, the support of solution defines the subset of variables estimated as relevant. It was shown in [10] that SLOPE is strongly connected with BH procedure under orthogonal case, i.e. when . The main theoretical result presented in [10] states that under such assumption, the sequence of tuning parameters could be specifically selected, such that the FDR control is guaranteed. Moreover, in [20] it is proved that SLOPE with this sequence of tuning parameters adapts to unknown sparsity and is asymptotically minimax under orthogonal and random Gaussian designs.

In the sequence of examples presented in [9] and [10] it was shown that SLOPE has very desirable properties in terms of FDR control in case when the regressor variables are weakly correlated. While there exist other interesting approaches which allow to control FDR also under correlated designs (e.g. [5]), the efforts to prevent detection of false regressors which are strongly correlated with true ones inevitably lead to a huge loss of power. An alternative approach to deal with strongly correlated predictors is to simply give up the idea of distinguishing between them and include all of them into the selected model as a group. This leads to the problem of group selection in linear regression, extensively investigated and applied in many fields of science. In many of these applications the groups are selected not only due to the strong correlations but also taking into account the problem specific scientific knowledge.

Probably the most known convex optimization method for selection of gropus of explanatory variables is the group LASSO (gLASSO) [4]. For a fixed tuning parameter, , the gLASSO estimate is most frequently (e.g. [23], [18]) defined as a solution to optimization problem

 argminb  {12∥∥y−m∑i=1XIibIi∥∥22+σλgLm∑i=1√|Ii|∥bIi∥2}, (1.4)

where the sets form the partition of the set , denotes the number of elements in set , is the submatrix of composed of columns indexed by and is the restriction of to indices from . The method introduced in this article is, however, closer to the alternative version of gLASSO, in which penalties are imposed on rather than . This method was formulated in [19], where authors defined estimate of as

 Missing or unrecognized delimiter for \Big (1.5)

with the condition serving as a group relevance indicator.

Similarly as in the context of regular model selection, the properties of gLASSO strongly depend on the smoothing parameter , whose optimal value is the function of unknown parameters of true data generating mechanism. Thus, a natural question arises if the idea of SLOPE can be used for construction of the similar adaptive procedure for the group selection. To answer this query in this paper we define and investigate the properties of the group SLOPE (gSLOPE). We formulate the respective optimization problem and provide the algorithm for its solution. We also define the notion of the group FDR (gFDR), and provide the theoretical choice of the sequence of smoothing parameters, which guarantees that SLOPE controls gFDR in the situation when variables in different groups are orthogonal to each other. Moreover, we prove that the resulting procedure adapts to unknown sparsity and is asymptotically minimax with respect to the estimation of the proportions of variance of the response variable explained by regressors from different groups. Additionally, we provide the way of constructing the sequence of smoothing parameters under the assumption that the regressors from distinct groups are independent and use computer simulation to show that it allows to control gFDR.

2 Group SLOPE

2.1 Formulation of the optimization problem

Let the design matrix belong to the space of matrices with rows and columns. Furthermore, suppose that is some partition of the set , i.e. ’s are nonempty sets, for and . We will consider the linear regression model with groups of the form

 y=m∑i=1XIiβIi+z, (2.1)

where is the submatrix of composed of columns indexed by and is the restriction of to indices from the set . We will use notations to refer to the ranks of submatrices . To simplify notations in further part, we will assume that (i.e. there is at least one nonzero entry of for all ). Besides this, may be absolutely arbitrary matrix, in particular any linear dependencies inside submatrices are allowed.

In this article we will treat the value as a measure of an impact of th group on the response and we will say that the group is truly relevant if and only if . Thus our task of the identification of the relevant groups is equivalent with finding the support of the vector .

To estimate the nonzero coefficients of , we will use a new penalized method, namely group SLOPE (gSLOPE). For a given sequence of nonincreasing, nonnegative tuning parameters, , given sequence of positive weights, , and design matrix, , the gSLOPE, , is defined as any solution to

 β\tiny{gS}:=argminb  {12∥∥y−Xb∥∥22+σJλ(W\llbracketb\rrbracketX,I)}, (2.2)

where is diagonal matrix defined by equations for . The estimate of support is simply defined by the indices corresponding to nonzeros of .

It is easy to see that when one considers groups containing only one variable (i.e. single-groups situation), then taking all weights equal to one reduces (2.2) to SLOPE (1.3). On the other hand, taking and putting , immediately gives gLASSO problem (1.5) with the smoothing parameter . The gSLOPE could be therefore treated both: as the extension to SLOPE, and the extension to group LASSO.

Now, let us define and consider the following partition, , of the set

 I1:={1,…,l1},I2:={l1+1,…,l1+l2}, …,Im:={m−1∑j=1li+1,…,m∑j=1li}. (2.3)

Observe that each can be represented as , where is any matrix with orthogonal columns of a unit norm, whose span coincides with the space spanned by the columns of , and is the corresponding matrix of a full row rank. Therefore, for the purpose of estimating the group effects, we can reduce the design matrix to the matrix such that and for all .

Now observe that denoting for we immediately obtain

 Xb=∑mi=1XIibIi=∑mi=1UiRibIi=∑mi=1˜XIicIi=˜Xc,(\llbracketb\rrbracketX,I)i=∥XIibIi∥2=∥RibIi∥2=∥cIi∥2 (2.4)

and the problem (2.2) can be equivalently presented in the form

 ⎧⎪ ⎪⎨⎪ ⎪⎩c\tiny{gS}:=argminc  {12∥∥y−˜Xc∥∥22+σJλ(W\llbracketc\rrbracketI)}c\tiny{gS}Ii:=Riβ\tiny{gS}% Ii, i=1,…,m, (2.5)

for Therefore to identify the relevant groups and estimate their group effects it is enough to solve the optimization problem (2.5). We will say that (2.5) is the standardized version of the problem (2.2).

Remark 2.1.

At the time of finishing this article, we have been informed that the similar formulation of the group SLOPE was proposed in an independent work of Alexej Gossmann et al. [13]. However [13] considers only the case when the weights are equal to the square root of the group size and penalties are imposed directly on rather than on group effects . This makes the method of [13] dependent on scaling or rotations of variables in a given group. In comparison to [13], who propose a Monte Carlo approach for estimating the regularizing sequence, our article provides the choice of the smoothing parameters which provably allows for FDR control in case where the regressors in different groups are orthogonal to each other and, according to our simulation study, allows for FDR control where regressors in different groups are independent.

2.2 Group FDR

Group SLOPE is designed to select groups of variables, which might be very strongly correlated within a group or even linearly dependent. In this context we do not intend to identify single important predictors but rather want to point at the groups which contain at least one true regressor. To theoretically investigate the properties of gSLOPE in this context we now introduce the respective notion of group FDR (gFDR).

Definition 2.2.

Consider model (2.1) with some design matrix . Let be an estimate given by (2.2). We define two random variables: the number of all groups selected by gSLOPE (Rg) and the number of groups falsely discovered by gSLOPE (Vg), as

 Rg:=∣∣{i: ∥XIiβ\tiny{gS}Ii∥2≠0}∣∣,Vg:=∣∣{i: ∥XIiβIi∥2=0, ∥XIiβ\tiny{gS}Ii∥2≠0}∣∣.
Definition 2.3.

We define the false discovery rate for groups (gFDR) as

 gFDR:=E[Vgmax{Rg,1}]. (2.6)

Our goal is the identification of the regularizing sequence for SLOPE such that gFDR can be controlled at any given level . In the next section we will provide such a sequence, which provably controls gFDR in case when variables in different groups are orthogonal to each other. In subsequent sections we will replace this condition with the weaker assumption of the stochastic independence of regressors in different groups.

2.3 Control of gFDR when variables from different groups are orthogonal

In this section we will present the sequence of smoothing parameters for gSLOPE, which guarantees gFDR control at the assumed level if variables from different groups are orthogonal to each other. Before the statement of the respective theorem, we will recall the definition of distribution and define a scaled distribution.

Definition 2.4.

We will say that a random variable has a distribution with degrees of freedom, and write , when could be expressed as for having a distribution with degrees of freedom. We will say that a random variable has a scaled distribution with degrees of freedom and scale , when could be expressed as for having a distribution with degrees of freedom. We will use the notation .

Theorem 2.5 (gFDR control under orthogonal case).

Consider model (2.1) with the design matrix satisfying , for any . Denote the number of zero coefficients in by and let be positive numbers. Moreover,

• let be diagonal matrix such as , for ,

• for each denote by the rank of submatrix ,

• define , with , where is a cumulative distribution function of distribution with degrees of freedom,

Then any solution, , to problem gSLOPE (2.2) generates the same vector and it holds

 gFDR=E[Vgmax{Rg,1}]≤q⋅m0m.
Proof.

We will start with the standardized version of the gSLOPE problem, given by (2.5). Based on results discussed in Appendix B, we can consider problem

 ⎧⎪⎨⎪⎩c∗=argminc{12∑mi=1(∥˜yIi∥2−w−1ici)2+Jσλ(c)}∥XIiβ\tiny{gS}Ii∥2=c∗i(wi∥˜yIi∥2)−1˜yIi,i=1,…,m (2.7)

as an equivalent formulation of (2.5) and work with the model , with and . The uniqueness of follows simply from the uniqueness of in (2.7). Define random variables and . Clearly, then and . Consequently, it is enough to show that

 E[Vmax{R,1}]≤q⋅m0m.

Without loss of generality we can assume that groups are truly irrelevant, which gives and for Suppose now that are some fixed indices from . From definition of

 λr≥1wiF−1χli(1−qrm) ⟹ 1−Fχli(λrwi)≤qrm. (2.8)

Since , for we have

 P(w−1i∥˜yIi∥2≥σλr)=P(σ−1∥˜yIi∥2≥λrwi)=1−Fχli(λrwi)≤qrm. (2.9)

Now, denote by the number of nonzero coefficients in SLOPE estimate of (2.7) after reducing sample by excluding variable, as it was described in the statement of C.7. Thanks to lemmas C.6 and C.7, we immediately get

 {\llbracket˜y\rrbracketI: c∗i≠0 and R=r}⊂{\llbracket˜y\rrbracketI: w−1i∥˜yIi∥2>σλr and ˜Ri=r−1}, (2.10)

which together with (2.10) and (2.9) raises

 P(c∗i≠0 and R=r) ≤P(w−1i∥˜yIi∥2>σλr and ˜Ri=r−1)=P(w−1i∥˜yIi∥2>σλr)P(˜Ri=r−1) ≤qrmP(˜Ri=r−1). (2.11)

Therefore

 E[Vmax{R,1}]=m∑r=1E[Vr\mathds1{R=r}]=m∑r=11rE[m0∑i=1\mathds1{c∗i≠0}\mathds1{R=r}]=m∑r=11rm0∑i=1P(c∗i≠0 and R=r) ≤ m0∑i=1qmm∑r=1P(˜Ri=r−1)=qm0m, (2.12)

which finishes the proof. ∎

In further part of this article, we will use the term basic lambdas and use the notation to refer to the sequence of tuning parameters defined in Theorem 2.5. Figure 4(a) illustrates the gFDR achieved by gSLOPE with design matrix (hence the rank of group, , coincides with its size) and for the sequence . In simulation we have fixed groups sizes from the set , and for each size groups were considered, which gave and . Signal sizes were generated such that for truly relevant groups, which were randomly chosen in each iteration. Parameter was selected to satisfy the condition , where is defined in (E.4).

It could be observed, that the selected tuning parameters are rather conservative, i.e. the achieved gFDR is significantly lower than assumed. This suggests, that penalties (dictated by lambdas) could be slightly decreased, such as the method gets more power and still achieves the gFDR below the assumed level. Returning to the proof of Theorem 2.5, we can see that the crucial property of sequence is that for each . The possible relaxation of this condition is to assume only that

 m∑i=1(1−Fw−1i\raisebox2.0pt$χ$li(λr))≤qr. (2.13)

Under equal weights and equal ’s assumption, the above conditions are equivalent. In general case, however, the second approach (with the inequality replaced by equality) produces smaller penalties compared to tuning parameters given by Theorem 2.5. Most often, such a change results in improving power (and increasing gFDR at the same time). Replacing the inequality in (2.13) by equality yields the following strategy of choosing relaxed sequence (denoted by )

 λmeanr:=¯¯¯¯F−1(1−qrm)% for¯¯¯¯F(x):=1mm∑i=1Fw−1i\raisebox2.0pt$χ$li(x),r∈{1,…,m}, (2.14)

where is the cumulative distribution function of scaled chi distribution with degrees of freedom and scale . In Figure 4b we present estimated gFDR, for tuning parameters given by (2.14). The results suggest that with relaxed version of tuning parameters, we can still achieve the gFDR control while essentially improving the power of gSLOPE. Such a strategy could be especially important in situation, when differences between the smallest and the largest quantiles (among distributions ) are relatively large. When this is the case, the gSLOPE with lambdas given by Theorem 2.5 in orthogonal situation could be considered as too conservative.

Up until this point, we have only considered the testing properties of gSLOPE. Though originally proposed to control the FDR, surprisingly, SLOPE enjoys appealing estimation properties as well [20]. It thus would be desirable to extend this link between testing and estimation for gSLOPE. In measuring the deviation of an estimator from the ground truth , as earlier, we focus on the group level instead of an individual. Accordingly, here we aim to estimate or , equivalently. For illustration purpose, we employ the setting described as follows. Imagine that we have a sequences of problems with the number of groups growing to infinity: the design is orthonormal at groups level; ranks of submatrices , , are bounded, that is, for some constant integer ; denoting by the sparsity level (that is, the number of relevant groups), we assume the asymptotic . Now we state our minimax theorem, where we write if in the asymptotic limit, and denotes the number of nonzero entries of . The proof makes use of the same techniques for proving Theorem 1.1 in [20] and is deferred to the Appendix.

Theorem 2.6.

Fix any constant , let and for . Under the preceding conditions, gSLOPE is asymptotically minimax over the nearly black object , i.e.,

 sup∥\llbracketβ\rrbracketX,I∥0≤kE(∥∥\llbracketβ\tiny{gS}\rrbracketX,I−\llbracketβ\rrbracketX,I∥∥22)∼infˆβsup∥\llbracketβ\rrbracketX,I∥0≤kE(∥∥\llbracketˆβ\rrbracketX,I−\llbracketβ\rrbracketX,I∥∥22),

where the infimum is taken over all measurable estimators .

Notably, in this theorem the choice of does not assume the knowledge of sparsity level. Or putting it differently, in stark contrast to gLASSO, gSLOPE is adaptive to a range of sparsity in achieving the exact minimaxity. Combining Theorems 2.5 and 2.6, we see the remarkable link between FDR control and minimax estimation also applies to gSLOPE [1, 20]. While it is out of the scope of this paper, it is of great interest to extend this minimax result to general design matrices.

2.4 The impact of chosen weights

In this subsection we will discuss the influence of chosen weights, , on results. Let be given division into groups and be ranks of submatrices . Assume the orthogonality at groups level, i.e. that it holds , for , and suppose that . The support of coincides with the support of vector defined in (2.7), namely

 Missing or unrecognized delimiter for \Big (2.15)

where is diagonal matrix with positive numbers on diagonal. Suppose now, that has exactly nonzero coefficients. From Corollary C.4, these indices are given by , where is permutation which orders . Hence, the order of realizations decides about the subset of groups labeled by gSLOPE as relevant. Suppose that groups and are truly relevant, i.e and . If we want to achieve the situation in which subset of truly discovered groups is not significantly affected by , we should choose weights such as and are ”comparable”. One sensible strategy is to look at this issue from the side of expected values. The distributions of and are noncentral distributions, with and degrees of freedom, and the noncentrality parameters equal to and , respectively. Now, the expected value of the noncentral distribution could be well approximated by the square root of the expected value of the noncentral distribution, which gives

 E(w−1i∥˜yIi∥2)≈w−1i√E(∥˜yIi∥22)=w−1i√li+∥˜βIi∥22.

Therefore, roughly speaking, truly relevant groups and are treated as comparable, when it occurs . This gives us the intuition about the behavior of gSLOPE with the choice for each : gSLOPE treats two truly relevant groups as comparable, if groups effect sizes satisfy the condition . The derived condition could be recast as . This gives a nice interpretation: with the choice , under orthogonality at groups level and with linear independence of columns inside groups, gSLOPE treats two groups as comparable, when these groups have similar squared effect group sizes per coefficient. One possible idealistic situation, when such a property occurs, is when all truly relevant variables have the same impact on the response and were divided into groups containing either all truly relevant or all truly irrelavant regressors.

In Figure 8 (results for previously described simulations performed for orthogonal situation and various groups sizes) we see that when the condition is met, the fractions of groups with different sizes in the selected truly relevant groups (STRG) are approximately equal. To investigate the impact of selected weights on the set of discovered groups, we performed simulations with different settings, namely we used and (without changing other parameters). With the first choice, larger groups are penalized less than before, while the second choice yields the opposite situation. This is reflected in the proportion of each groups in STRG (Figure 8).

The values of gFDR are very similar under all choices of weights. Consequently, we are equipped in the entire family of settings, according to the rule: fix weights to induce the comparability in the domain of group effect strengths for different group sizes, and select lambdas to control gFDR for given weights.

2.5 Near-orthogonal situation

In this section we will deal with the case in which the columns in the design matrix are only ”almost orthogonal”, which could be valid for real-world applications. To simulate such a situation we will assume that by design matrix is a realization of the random matrix with independent entries drawn from the normal distribution, , so as the expected value of is equal to for , and equal to otherwise. The main objective is to derive the lambda sequence, which could be applied to achieve gFDR control under assumption that the is sparse. In this subsection we will confine ourselves only to the case , and when the number of elements in each group is relatively small as compared to the number of observations (). For simplicity in this subsection we will fix . In case when , the proposed sequence lambda should be multiplied by , as in expression (2.2).In the heuristic presented in this subsection, we will use the notation , in order to express that with large probability the differences between corresponding entries of matrices and are very small.

In situation when entries of comes from distribution and sizes of groups are relatively small, very good approximation of could be obtained by , defined as

 Missing or unrecognized delimiter for \bigg (2.16)

Assume for simplicity that , for , satisfies the same conditions for some (this implies in particular, that true signals are relatively strong) and the true model is sparse. Divide into two families of sets and . To derive optimality condition for we will prove the following

Theorem 2.7.

Let be such that , for and denote . If , then it holds:

 ⎧⎪⎨⎪⎩gIi=wλibIi∥bIi∥2, i=1,…,s\llbracketg\rrbracketIc∈Cwλc, (2.17)

where the set (here with instead of ) is defined in appendix (F).

Proof.

For define and put If , then for all from definition of subgradient it holds

 Missing or unrecognized delimiter for \big (2.18)

for and . Define , with set . Then . Consider first case, when belongs to the set . This yields

 m−s∑i=1λs+i(\llbrackethc\rrbracket˜I)(i)≥(gc)Thc. (2.19)

Since is open in and contains zero, from Corollary G.1 we have that and the inequality (G.3) is true for any yielding

 Unknown environment 'array% (2.20)

see Proposition F.1. This result immediately gives condition , which is equivalent with . To find conditions for with , define sets . For , (G.2) reduces to . Since the set is open in and contains zero, from Corollary G.1 we have for , . Since is convex and differentiable in , it holds , which finishes the proof. ∎

The above theorem allows to write the optimality condition for in form

 ⎧⎪ ⎪⎨⎪ ⎪⎩XTIi(y−X^β)=wλi^βIi∥^βIi∥2, i=1,…,s\llbracketXT(y−X^β)\rrbracketIc∈Cwλc. (2.21)

Since , for we get , where is matrix without columns from and denotes vector with removed coefficients indexed by . This means that, for , vector is approximately collinear with . Since