1 Introduction
###### Abstract

Sorted L-One Penalized Estimation (SLOPE, [10]) is a relatively new convex optimization procedure which allows for adaptive selection of regressors under sparse high dimensional designs. Here we extend the idea of SLOPE to deal with the situation when one aims at selecting whole groups of explanatory variables instead of single regressors. Such groups can be formed by clustering strongly correlated predictors or groups of dummy variables corresponding to different levels of the same qualitative predictor. We formulate the respective convex optimization problem, gSLOPE (group SLOPE), and propose an efficient algorithm for its solution. We also define a notion of the group false discovery rate (gFDR) and provide a choice of the sequence of tuning parameters for gSLOPE so that gFDR is provably controlled at a prespecified level if the groups of variables are orthogonal to each other. Moreover, we prove that the resulting procedure adapts to unknown sparsity and is asymptotically minimax with respect to the estimation of the proportions of variance of the response variable explained by regressors from different groups. We also provide a method for the choice of the regularizing sequence when variables in different groups are not orthogonal but statistically independent and illustrate its good properties with computer simulations. Finally, we illustrate the advantages of gSLOPE in the context of Genome Wide Association Studies. R package grpSLOPE with implementation of our method is available on CRAN.


Group SLOPE - adaptive selection of groups of predictors¹

¹An earlier version of the paper appeared on arXiv.org in November 2015: arXiv:1511.09078

Damian Brzyski, Alexej Gossmann, Weijie Su, Małgorzata Bogdan

Department of Epidemiology and Biostatistics, Indiana University, Bloomington, IN 47405, USA

Institute of Mathematics, Jagiellonian University, 30-348 Cracow, Poland

Department of Mathematics, Tulane University, New Orleans, LA 70118, USA

Department of Statistics, University of Pennsylvania, Philadelphia, PA 19104, USA

Institute of Mathematics, University of Wroclaw, 50-384 Wroclaw, Poland

Key words: Asymptotic Minimax, False Discovery Rate, Group Selection, Model Selection, Multiple Regression, SLOPE

## 1 Introduction

Consider the classical multiple regression model of the form

 y=Xβ+z, (1.1)

where y is the n-dimensional vector of values of the response variable, X is the n by p experiment (design) matrix and z ∼ N(0, σ²I_n). We assume that X and σ are known, while β is unknown. In many applications the purpose of the statistical analysis is to recover the support of β, which identifies the set of important regressors. Here, the true support corresponds to the truly relevant variables (i.e., variables which have an impact on the observations). Common procedures for solving this model selection problem rely on the minimization of an objective function consisting of a weighted sum of two components: a first term responsible for the goodness of fit and a second term penalizing the model complexity. Among such procedures one can mention classical model selection criteria like the Akaike Information Criterion (AIC) [3] and the Bayesian Information Criterion (BIC) [22], where the penalty depends on the number of variables included in the model, or LASSO [27], where the penalty depends on the ℓ1 norm of the regression coefficients. The main advantage of LASSO over classical model selection criteria is that it is a convex optimization problem and, as such, it can be solved efficiently even for very large design matrices.

The LASSO solution is obtained by solving the optimization problem

 argmin_b { (1/2)‖y − Xb‖_2² + λ_L‖b‖_1 },  (1.2)

where λ_L is a tuning parameter defining the trade-off between the model fit and the sparsity of the solution. In practical applications the selection of a good λ_L might be very challenging. For example, it has been reported that in high dimensional settings the popular cross-validation typically leads to the detection of a large number of false regressors (see e.g. [10]). The general rule is that when one reduces λ_L, LASSO can identify more elements from the true support (true discoveries), but at the same time it generates more false discoveries. In general the numbers of true and false discoveries for a given λ_L depend on unknown properties of the data generating mechanism, like the number of true regressors and the magnitude of their effects. A very similar problem occurs when selecting thresholds for individual tests in the context of multiple testing. Here it was found that the popular Benjamini-Hochberg rule (BH, [7]), aimed at control of the False Discovery Rate (FDR), adapts to the unknown data generating mechanism and has some desirable optimality properties under a variety of statistical settings (see e.g. [1, 8, 20, 15]). The main property of this rule is that it relaxes the thresholds along the sequence of test statistics, sorted in decreasing order of magnitude. Recently the same idea was used in a new generalization of LASSO, named SLOPE (Sorted L-One Penalized Estimation, [9, 10]). Instead of the ℓ1 norm (as in the LASSO case), the method exploits the FDR control properties of the sorted ℓ1 norm, defined as J_λ(b) := Σ_{i=1}^p λ_i|b|_(i) for a sequence λ satisfying λ_1 ≥ … ≥ λ_p ≥ 0 and λ_1 > 0, where |b|_(1) ≥ … ≥ |b|_(p) is the vector of sorted absolute values of the coordinates of b. SLOPE is the solution to the convex optimization problem

 argmin_b { (1/2)‖y − Xb‖_2² + J_λ(b) },  (1.3)

which clearly reduces to LASSO for the constant sequence λ_1 = … = λ_p = λ_L. Similarly as in classical model selection, the support of the solution defines the subset of variables estimated as relevant. In [10] it is shown that when the sequence λ corresponds to the decreasing sequence of thresholds of the BH procedure, then SLOPE controls FDR under orthogonal designs, i.e. when X′X = I. Moreover, in [26] it is proved that SLOPE with this sequence of tuning parameters adapts to unknown sparsity and is asymptotically minimax under orthogonal and random Gaussian designs.
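For concreteness, the sorted ℓ1 norm can be computed in a few lines (a minimal Python/NumPy sketch; the function name is ours):

```python
import numpy as np

def sorted_l1_norm(b, lam):
    """Sorted l1 norm: J_lam(b) = sum_i lam[i] * |b|_(i), where
    |b|_(1) >= |b|_(2) >= ... are the sorted absolute values of b and
    lam is a nonincreasing sequence of nonnegative numbers."""
    abs_sorted = np.sort(np.abs(b))[::-1]   # |b|_(1) >= |b|_(2) >= ...
    return float(np.dot(lam, abs_sorted))

# The largest coefficient meets the largest lambda, and so on:
b = np.array([0.5, -2.0, 1.0])
print(sorted_l1_norm(b, np.array([3.0, 2.0, 1.0])))   # 3*2 + 2*1 + 1*0.5 = 8.5
```

With a constant sequence the penalty equals λ_L‖b‖_1, mirroring the reduction of SLOPE to LASSO noted above.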

The sequence of examples presented in [9], [10] and [11] shows that SLOPE has very desirable properties in terms of FDR control when the regressor variables are weakly correlated. While there exist other interesting approaches which allow one to control FDR under correlated designs (e.g., [5]), the efforts to prevent the detection of false regressors which are strongly correlated with true ones inevitably lead to a loss of power. An alternative approach to dealing with strongly correlated predictors is to simply give up the idea of distinguishing between them and include all of them in the selected model as a group. This leads to the problem of group selection in linear regression, extensively investigated and applied in many fields of science. In many of these applications the groups are formed not only on the basis of strong correlations but also by taking into account problem specific scientific knowledge. It is also common to cluster dummy variables corresponding to different levels of qualitative predictors.

Probably the best known convex optimization method for the selection of groups of explanatory variables is the group LASSO (gLASSO) [4]. For a fixed tuning parameter, λ_gL, the gLASSO estimate is most frequently (e.g. [29], [23]) defined as a solution to the optimization problem

 argmin_b { (1/2)‖y − Σ_{i=1}^m X_{I_i}b_{I_i}‖_2² + σλ_gL Σ_{i=1}^m √|I_i| ‖b_{I_i}‖_2 },  (1.4)

where the sets I_1,…,I_m form a partition of the set {1,…,p}, |I_i| denotes the number of elements in the set I_i, X_{I_i} is the submatrix of X composed of the columns indexed by I_i and b_{I_i} is the restriction of b to the indices from I_i. The method introduced in this article is, however, closer to an alternative version of gLASSO, in which penalties are imposed on the group effects X_{I_i}b_{I_i} rather than on b_{I_i}. This method was formulated in [24], where the authors defined the estimate of β as

 argmin_b { (1/2)‖y − Σ_{i=1}^m X_{I_i}b_{I_i}‖_2² + σλ_gL Σ_{i=1}^m √|I_i| ‖X_{I_i}b_{I_i}‖_2 },  (1.5)

with the condition ‖X_{I_i}b_{I_i}‖_2 ≠ 0 serving as a group relevance indicator.
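The penalty in (1.4) is just a weighted sum of within-group ℓ2 norms; a minimal sketch (Python/NumPy, function name ours):

```python
import numpy as np

def glasso_penalty(b, groups, lam):
    """Group LASSO penalty term from (1.4):
    lam * sum_i sqrt(|I_i|) * ||b_{I_i}||_2,
    where `groups` is a list of index arrays partitioning the coordinates of b."""
    return lam * sum(np.sqrt(len(I)) * np.linalg.norm(b[I]) for I in groups)

b = np.array([3.0, 4.0, 0.0, 0.0, 5.0])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4])]
print(glasso_penalty(b, groups, 1.0))   # sqrt(2)*5 + sqrt(2)*0 + 1*5
```

Zeroing a whole group removes its contribution entirely, which is what drives group-level sparsity.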

Similarly as in the context of regular model selection, the properties of gLASSO strongly depend on the shrinkage parameter λ_gL, whose optimal value is a function of unknown parameters of the true data generating mechanism. Thus, a natural question arises whether the idea of SLOPE can be used to construct a similar adaptive procedure for group selection. To answer this question, in this paper we define and investigate the properties of the group SLOPE (gSLOPE). We formulate the respective optimization problem and provide an algorithm for its solution. We also define the notion of the group FDR (gFDR) and provide a theoretical choice of the sequence of regularization parameters which guarantees that gSLOPE controls gFDR in the situation when variables in different groups are orthogonal to each other. Moreover, we prove that the resulting procedure adapts to unknown sparsity and is asymptotically minimax with respect to the estimation of the proportions of variance of the response variable explained by regressors from different groups. Additionally, we provide a way of constructing the sequence of regularization parameters under the assumption that the regressors from distinct groups are independent and use computer simulations to show that it allows control of gFDR. The good properties of group SLOPE are illustrated using a practical example of a Genome Wide Association Study. The R package grpSLOPE with an implementation of our method is available on CRAN.

## 2 Group SLOPE

### 2.1 Formulation of the optimization problem

Let the design matrix X belong to ℝ^{n×p}, the space of matrices with n rows and p columns. Furthermore, suppose that I = {I_1,…,I_m} is some partition of the set {1,…,p}, i.e. the I_i's are nonempty sets, I_i ∩ I_j = ∅ for i ≠ j and I_1 ∪ … ∪ I_m = {1,…,p}. We will consider the linear regression model with groups of the form

 y=m∑i=1XIiβIi+z, (2.1)

where X_{I_i} is the submatrix of X composed of the columns indexed by I_i and β_{I_i} is the restriction of β to the indices from the set I_i. We will use the notations l_1,…,l_m to refer to the ranks of the submatrices X_{I_1},…,X_{I_m}. To simplify the notation in the remainder, we will assume that l_i ≥ 1 (i.e. there is at least one nonzero entry of X_{I_i} for all i). Besides this, X may be an absolutely arbitrary matrix; in particular, any linear dependencies inside the submatrices X_{I_i} are allowed.

In this article we will treat the value ‖X_{I_i}β_{I_i}‖_2 as a measure of the impact of the i-th group on the response, and we will say that the group I_i is truly relevant if and only if ‖X_{I_i}β_{I_i}‖_2 ≠ 0. Thus our task of identifying the relevant groups is equivalent to finding the support of the vector of group effects ⟦β⟧_{X,I} := (‖X_{I_1}β_{I_1}‖_2,…,‖X_{I_m}β_{I_m}‖_2)ᵀ.

To estimate the nonzero coefficients of ⟦β⟧_{X,I}, we will use a new penalized method, namely group SLOPE (gSLOPE). For a given nonincreasing sequence of nonnegative tuning parameters, λ_1 ≥ … ≥ λ_m, a given sequence of positive weights, w_1,…,w_m, and a design matrix, X, the gSLOPE estimate, β^gS, is defined as a solution to

 β^gS := argmin_b { (1/2)‖y − Xb‖_2² + σ J_λ(W⟦b⟧_{X,I}) },  (2.2)

where W is the m × m diagonal matrix with W_{i,i} := w_i for i = 1,…,m. The estimate of the support is simply defined by the indices corresponding to the nonzero coordinates of ⟦β^gS⟧_{X,I}.

It is easy to see that when one considers groups containing only one variable each (i.e. the singleton groups situation), then taking all weights equal to one reduces (2.2) to SLOPE (1.3). On the other hand, taking λ_1 = … = λ_m = λ_gL and putting w_i = √|I_i| immediately gives the gLASSO problem (1.5) with the smoothing parameter λ_gL. The gSLOPE can therefore be treated both as an extension of SLOPE and as an extension of the group LASSO.
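To make the objective concrete, the penalty term of (2.2) can be sketched as follows (Python/NumPy, names ours). With singleton groups, unit-norm columns and unit weights it reduces to the SLOPE penalty:

```python
import numpy as np

def gslope_penalty(b, X, groups, lam, w):
    """gSLOPE penalty J_lam(W [[b]]_{X,I}): compute the group effects
    ||X_{I_i} b_{I_i}||_2, multiply them by the weights w_i, sort in
    decreasing order and take the inner product with the nonincreasing lam."""
    effects = np.array([np.linalg.norm(X[:, I] @ b[I]) for I in groups])
    weighted_sorted = np.sort(w * effects)[::-1]
    return float(np.dot(lam, weighted_sorted))

# Singleton groups + identity design recover the sorted l1 norm of b:
X = np.eye(3)
groups = [np.array([0]), np.array([1]), np.array([2])]
b = np.array([0.5, -2.0, 1.0])
print(gslope_penalty(b, X, groups, np.array([3.0, 2.0, 1.0]), np.ones(3)))  # 8.5
```

Note that the penalty depends on b only through the group effects, so it is invariant to transformations of the variables within a group that preserve X_{I_i}b_{I_i}.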

Now, let us consider the following partition, I = {I_1,…,I_m}, of the set {1,…,l_1+⋯+l_m}:

 I_1 := {1,…,l_1},  I_2 := {l_1+1,…,l_1+l_2},  …,  I_m := {Σ_{j=1}^{m−1} l_j + 1,…,Σ_{j=1}^{m} l_j}.  (2.3)

Observe that each X_{I_i} can be represented as X_{I_i} = U_iR_i, where U_i is a matrix with orthogonal columns of unit norm, whose span coincides with the space spanned by the columns of X_{I_i}, and R_i is the corresponding matrix of full row rank. Define the n by (l_1+⋯+l_m) matrix X̃ by putting X̃_{I_i} := U_i for i = 1,…,m. Now observe that, denoting c_{I_i} := R_ib_{I_i} for i = 1,…,m, we immediately obtain

 Xb = Σ_{i=1}^m X_{I_i}b_{I_i} = Σ_{i=1}^m U_iR_ib_{I_i} = Σ_{i=1}^m X̃_{I_i}c_{I_i} = X̃c,   (⟦b⟧_{X,I})_i = ‖X_{I_i}b_{I_i}‖_2 = ‖R_ib_{I_i}‖_2 = ‖c_{I_i}‖_2,  (2.4)

and the problem (2.2) can be equivalently presented in the form

 c^gS := argmin_c { (1/2)‖y − X̃c‖_2² + σ J_λ(W⟦c⟧_I) },   c^gS_{I_i} = R_iβ^gS_{I_i},  i = 1,…,m,  (2.5)

for ⟦c⟧_I := (‖c_{I_1}‖_2,…,‖c_{I_m}‖_2)ᵀ. Therefore, to identify the relevant groups and estimate their group effects, it is enough to solve the optimization problem (2.5). We will say that (2.5) is the standardized version of the problem (2.2).
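The decomposition X_{I_i} = U_iR_i can be obtained, for instance, from a reduced SVD of each submatrix; a short NumPy sketch (function names are ours) that also verifies the identities (2.4):

```python
import numpy as np

def standardize_groups(X, groups):
    """Factor each submatrix X_{I_i} = U_i R_i, where U_i has orthonormal
    columns spanning the column space of X_{I_i} and R_i has full row rank.
    A reduced SVD is used, so rank-deficient groups (linear dependence
    within a group) are handled as well. Returns Xtilde = [U_1 ... U_m]
    and the list of R_i factors."""
    Us, Rs = [], []
    for I in groups:
        U, s, Vt = np.linalg.svd(X[:, I], full_matrices=False)
        r = int(np.sum(s > 1e-10 * s[0]))      # numerical rank l_i
        Us.append(U[:, :r])
        Rs.append(s[:r, None] * Vt[:r, :])     # R_i = diag(s) V^T (full row rank)
    return np.hstack(Us), Rs

# Check (2.4): X b = Xtilde c for c_{I_i} = R_i b_{I_i}
rng = np.random.default_rng(0)
X = rng.standard_normal((8, 5))
groups = [np.array([0, 1, 2]), np.array([3, 4])]
Xt, Rs = standardize_groups(X, groups)
b = rng.standard_normal(5)
c = np.concatenate([R @ b[I] for R, I in zip(Rs, groups)])
print(np.allclose(X @ b, Xt @ c))   # True
```

Because the columns of each U_i are orthonormal, the group effect norms are preserved as well: ‖X_{I_i}b_{I_i}‖_2 = ‖R_ib_{I_i}‖_2.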

###### Remark 2.1.

A similar formulation of the group SLOPE was proposed in [16]. However, [16] considers only the case when the weights are equal to the square root of the group size and the penalties are imposed directly on b_{I_i} rather than on the group effects X_{I_i}b_{I_i}. This makes the method of [16] dependent on the scaling or rotation of the variables in a given group. In comparison to [16], where a Monte Carlo approach for estimating the regularizing sequence was proposed, our article provides a choice of the smoothing parameters which provably allows for FDR control in the case where the regressors in different groups are orthogonal to each other, as well as its modification, which according to our simulation study allows for FDR control when regressors in different groups are independent.

### 2.2 Numerical algorithm

As shown in Appendix B, the gSLOPE penalty is a norm and the optimization problem (2.5) can be solved using proximal gradient methods. In our R package grpSLOPE, available on CRAN (The Comprehensive R Archive Network), the accelerated proximal gradient method known as FISTA [6] is applied, which uses a specific procedure for choosing step sizes to achieve a fast convergence rate. The proximal operator for gSLOPE is obtained by an appropriate transformation and reduction of the problem, so that the fast proximal operator for SLOPE [9] can be used. To derive proper stopping criteria, we have considered the dual problem to gSLOPE and employed the strong duality property. A detailed description of the proximal operator for gSLOPE, as well as of the dual norm and the conjugate of the grouped sorted ℓ1 norm, is provided in Appendix B.
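For intuition, the proximal operator of the sorted ℓ1 norm itself can be computed by a stack-based pool-adjacent-violators scheme in the spirit of the fast SLOPE prox of [9]. The following is a simplified Python sketch (names ours; the grpSLOPE package uses its own, more optimized implementation):

```python
import numpy as np

def prox_sorted_l1(y, lam):
    """Proximal operator of the sorted l1 norm:
    argmin_x 0.5*||x - y||_2^2 + sum_i lam[i] * |x|_(i),
    where lam is nonincreasing and nonnegative."""
    sign = np.sign(y)
    order = np.argsort(-np.abs(y), kind="stable")
    v = np.abs(y)[order] - lam                 # work on sorted magnitudes
    blocks = []                                # each block: [start, end, total]
    for i, vi in enumerate(v):
        blocks.append([i, i, vi])
        # merge blocks while their averages violate the nonincreasing order
        while len(blocks) > 1 and \
                blocks[-2][2] / (blocks[-2][1] - blocks[-2][0] + 1) <= \
                blocks[-1][2] / (blocks[-1][1] - blocks[-1][0] + 1):
            _, e, t = blocks.pop()
            blocks[-1][1] = e
            blocks[-1][2] += t
    x_sorted = np.empty_like(v)
    for s, e, t in blocks:
        x_sorted[s:e + 1] = max(t / (e - s + 1), 0.0)   # clip at zero
    x = np.empty_like(x_sorted)
    x[order] = x_sorted                        # undo sorting, restore signs
    return sign * x

# With a constant lam the prox reduces to ordinary soft thresholding:
print(prox_sorted_l1(np.array([3.0, -1.5, 0.5]), np.array([1.0, 1.0, 1.0])))
```

Applying this prox to the vector of weighted group effects is the core step of the proximal gradient iteration for the standardized problem (2.5).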

### 2.3 Group FDR

Group SLOPE is designed to select groups of variables, which might be very strongly correlated within a group or even linearly dependent. In this context we do not intend to identify single important predictors but rather want to point at the groups which contain at least one true regressor. To theoretically investigate the properties of gSLOPE in this context we now introduce the respective notion of group FDR (gFDR).

###### Definition 2.2.

Consider model (2.1) and let β^gS be an estimate given by (2.2). We define two random variables, the number of all groups selected by gSLOPE (R_g) and the number of groups falsely discovered by gSLOPE (V_g), as

 R_g := |{i : ‖X_{I_i}β^gS_{I_i}‖_2 ≠ 0}|,   V_g := |{i : ‖X_{I_i}β_{I_i}‖_2 = 0, ‖X_{I_i}β^gS_{I_i}‖_2 ≠ 0}|.
###### Definition 2.3.

We define the false discovery rate for groups (gFDR) as

 gFDR := E[ V_g / max{R_g, 1} ].  (2.6)
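For a single simulated data set, the quantity inside the expectation in (2.6) is straightforward to compute from the true and estimated group effects (a small sketch; function name is ours):

```python
import numpy as np

def gfdp(true_effects, estimated_effects):
    """Given the vectors of true and estimated group effects
    ||X_{I_i} beta_{I_i}||_2, return (V_g, R_g) and the false discovery
    proportion V_g / max(R_g, 1), whose expectation is the gFDR of (2.6)."""
    selected = np.asarray(estimated_effects) != 0
    null = np.asarray(true_effects) == 0
    Rg = int(np.sum(selected))
    Vg = int(np.sum(selected & null))
    return Vg, Rg, Vg / max(Rg, 1)

# Three groups selected, two of them truly null:
print(gfdp([1.2, 0.0, 0.0, 0.7], [0.9, 0.4, 0.2, 0.0]))   # V_g=2, R_g=3, FDP=2/3
```

Averaging this proportion over Monte Carlo repetitions gives the empirical gFDR estimates reported in the figures below.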

### 2.4 Control of gFDR when variables from different groups are orthogonal

Our goal is the identification of a regularizing sequence for gSLOPE such that gFDR can be controlled at any given level q ∈ (0,1). In this section we will provide such a sequence, which provably controls gFDR in the case when variables in different groups are orthogonal to each other. In subsequent sections we will replace this condition with the weaker assumption of stochastic independence of the regressors in different groups. Before stating the main theorem on gFDR control, we will recall the definition of the χ distribution and define a scaled χ distribution.

###### Definition 2.4.

We will say that a random variable X has a χ distribution with l degrees of freedom, and write X ∼ χ_l, when X can be expressed as X = √Z for Z having a χ² distribution with l degrees of freedom. We will say that a random variable Y has a scaled χ distribution with l degrees of freedom and scale a > 0, when Y can be expressed as Y = aX for X having a χ distribution with l degrees of freedom. We will use the notation Y ∼ aχ_l.

###### Theorem 2.5 (gFDR control under orthogonal case).

Consider model (2.1) with the design matrix X satisfying X_{I_i}ᵀX_{I_j} = 0 for any i ≠ j. Denote the number of zero coefficients in ⟦β⟧_{X,I} by m_0 and let w_1,…,w_m be positive numbers. Moreover, define the sequence of regularizing parameters λ^max := (λ_1^max,…,λ_m^max), with

 λ_i^max := max_{j=1,…,m} { w_j^{-1} F^{-1}_{χ_{l_j}}(1 − q·i/m) },  (2.7)

where F_{χ_l} is the cumulative distribution function of the χ distribution with l degrees of freedom. Then any solution, β^gS, to the gSLOPE problem (2.2) generates the same vector ⟦β^gS⟧_{X,I}, and it holds that

 gFDR = E[ V_g / max{R_g, 1} ] ≤ q·m_0/m.
###### Proof.

We will start with the standardized version of the gSLOPE problem, given by (2.5). Based on results discussed in Appendix C, we can consider an equivalent formulation of (2.5)

 c* := argmin_c { (1/2) Σ_{i=1}^m (‖ỹ_{I_i}‖_2 − w_i^{-1}c_i)² + J_{σλ}(c) },   c^gS_{I_i} = c*_i (w_i‖ỹ_{I_i}‖_2)^{-1} ỹ_{I_i},  i = 1,…,m,  (2.8)

where ỹ := X̃ᵀy has a multivariate normal distribution with cov(ỹ) = σ²I. The uniqueness of ⟦β^gS⟧_{X,I} follows simply from the uniqueness of c* in (2.8). Define the random variables R := |{i : c*_i ≠ 0}| and V := |{i : ‖X_{I_i}β_{I_i}‖_2 = 0, c*_i ≠ 0}|. Clearly, then R = R_g and V = V_g. Consequently, it is enough to show that

 E[ V / max{R, 1} ] ≤ q·m_0/m.

Without loss of generality we can assume that the first m_0 groups are truly irrelevant, which gives ‖X_{I_i}β_{I_i}‖_2 = 0 and E(ỹ_{I_i}) = 0 for i = 1,…,m_0. Suppose now that i and r are some fixed indices from {1,…,m}. From the definition of λ^max,

 (2.9)

Now, let us fix r. Since σ^{-1}‖ỹ_{I_i}‖_2 ∼ χ_{l_i} for a truly irrelevant group i, we have

 P(w_i^{-1}‖ỹ_{I_i}‖_2 ≥ σλ_r^max) = P(σ^{-1}‖ỹ_{I_i}‖_2 ≥ λ_r^max w_i) = 1 − F_{χ_{l_i}}(λ_r^max w_i) ≤ qr/m.  (2.10)

Now, denote by R̃_i the number of nonzero coefficients in the SLOPE estimate (2.8) after eliminating the i-th group of explanatory variables. Thanks to Lemmas D.6 and D.7, we immediately get

 {⟦ỹ⟧_I : c*_i ≠ 0 and R = r} ⊂ {⟦ỹ⟧_I : w_i^{-1}‖ỹ_{I_i}‖_2 > σλ_r^max and R̃_i = r − 1},  (2.11)

which together with (2.10) gives

 P(c*_i ≠ 0 and R = r) ≤ P(w_i^{-1}‖ỹ_{I_i}‖_2 > σλ_r^max and R̃_i = r − 1) = P(w_i^{-1}‖ỹ_{I_i}‖_2 > σλ_r^max)·P(R̃_i = r − 1) ≤ (qr/m)·P(R̃_i = r − 1).  (2.12)

Therefore

 E[ V / max{R, 1} ] = Σ_{r=1}^m E[ (V/r)·𝟙{R=r} ] = Σ_{r=1}^m (1/r) E[ Σ_{i=1}^{m_0} 𝟙{c*_i ≠ 0}𝟙{R=r} ] = Σ_{r=1}^m (1/r) Σ_{i=1}^{m_0} P(c*_i ≠ 0 and R = r) ≤ Σ_{i=1}^{m_0} (q/m) Σ_{r=1}^m P(R̃_i = r − 1) = q·m_0/m,  (2.13)

which finishes the proof. ∎

Figure 1 illustrates the performance of gSLOPE under an orthogonal design (hence the rank of a group, l_i, coincides with its size). In Figure 1 (a) all groups are of the same size, while in Figures 1 (b) - (d) the explanatory variables are clustered into groups of different sizes, with the same number of groups of each size. Each coefficient of β in a truly relevant group was generated independently from the normal distribution and then scaled so that all truly relevant groups have the same effect size, chosen to satisfy the condition involving the quantity defined in (F.4). Such signals are comparable to the maximal noise and can be detected with moderate power, which allows for a meaningful comparison between different methods.

Figure 1 (a) illustrates that the sequence λ^max allows one to keep the gFDR very close to the "nominal" level when the groups are of the same size. However, Figure 1 (b) shows that for groups of different sizes λ^max is rather conservative, i.e. the achieved gFDR is significantly lower than assumed. This suggests that the shrinkage (dictated by λ^max) could be slightly decreased, so that the method gains power while still keeping the gFDR below the assumed level. Returning to the proof of Theorem 2.5, we can see that for each i we have

 1 − F_{χ_{l_i}}(λ_r^max w_i) ≤ qr/m,  (2.14)

with equality holding only for i being the index attaining the maximum in (2.7). As a result, the inequality in (2.13) is usually strict and the true gFDR might be substantially smaller than the nominal level. A natural relaxation of (2.14) is to require only that

 Σ_{i=1}^m (1 − F_{w_i^{-1}χ_{l_i}}(λ_r)) ≤ qr.  (2.15)

Replacing the inequality in (2.15) by equality yields the strategy of choosing the relaxed sequence

 λ_r^mean := F̄^{-1}(1 − qr/m)  for  F̄(x) := (1/m) Σ_{i=1}^m F_{w_i^{-1}χ_{l_i}}(x),  r ∈ {1,…,m},  (2.16)

where F_{w_i^{-1}χ_{l_i}} is the cumulative distribution function of the scaled χ distribution with l_i degrees of freedom and scale w_i^{-1}. In Figure 1c we present the estimated gFDR for the tuning parameters given by (2.16). The results suggest that with the relaxed version of the tuning parameters we can still achieve the "average" gFDR control, where the "average" is taken with respect to the uniform distribution over all possible signal placements. As shown in Figure 1d, the application of λ^mean allows one to achieve a substantially larger power than the one provided by λ^max. Such a strategy could be especially important in situations when the differences between the smallest and the largest quantiles (among the distributions w_i^{-1}χ_{l_i}) are relatively large and all groups have the same prior probability of being relevant.
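Both sequences are straightforward to compute numerically. The sketch below (plain Python; a self-contained χ CDF based on the incomplete gamma series is used instead of a statistics library, and all function names are ours) evaluates λ^max from (2.7) and λ^mean from (2.16) by bisection:

```python
import math

def gammainc_lower_reg(a, x, tol=1e-12):
    """Regularized lower incomplete gamma P(a, x), via its power series."""
    if x <= 0:
        return 0.0
    term = 1.0 / a
    total = term
    n = 0
    while term > tol * total:
        n += 1
        term *= x / (a + n)
        total += term
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))

def chi_cdf(t, l):
    """CDF of the chi distribution with l degrees of freedom."""
    return gammainc_lower_reg(l / 2.0, t * t / 2.0)

def chi_quantile(p, l):
    lo, hi = 0.0, 20.0           # quantiles of interest lie far below 20
    for _ in range(100):         # bisection is plenty accurate here
        mid = 0.5 * (lo + hi)
        if chi_cdf(mid, l) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def lambda_max(q, ls, ws):
    """Sequence (2.7): lam_i = max_j w_j^{-1} F^{-1}_{chi_{l_j}}(1 - q*i/m)."""
    m = len(ls)
    return [max(chi_quantile(1.0 - q * (i + 1) / m, l) / w
                for l, w in zip(ls, ws)) for i in range(m)]

def lambda_mean(q, ls, ws):
    """Relaxed sequence (2.16): solve Fbar(x) = 1 - q*r/m, with Fbar the
    average CDF of the scaled chi distributions w_i^{-1} chi_{l_i}."""
    m = len(ls)
    fbar = lambda x: sum(chi_cdf(x * w, l) for l, w in zip(ls, ws)) / m
    out = []
    for r in range(1, m + 1):
        lo, hi, p = 0.0, 20.0, 1.0 - q * r / m
        for _ in range(100):
            mid = 0.5 * (lo + hi)
            if fbar(mid) < p:
                lo = mid
            else:
                hi = mid
        out.append(0.5 * (lo + hi))
    return out
```

For equal group ranks and equal weights the two sequences coincide, while for heterogeneous groups λ^mean is pointwise no larger than λ^max, reflecting the weaker requirement (2.15).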

### 2.5 The accuracy of estimation

Up until this point, we have only considered the testing properties of gSLOPE. Though originally proposed to control the FDR, SLOPE surprisingly enjoys appealing estimation properties as well [26]. It would thus be desirable to extend this link between testing and estimation to gSLOPE. In measuring the deviation of an estimator from the ground truth, we focus, as earlier, on the group level instead of the individual one. Accordingly, here we aim to estimate the parts of the variance of the response explained by every group, which are contained in the vector ⟦β⟧_{X,I}. For illustration purposes, we employ the following setting. Imagine that we have a sequence of problems with the number of groups m growing to infinity: the design is orthonormal at the group level; the ranks of the submatrices, l_i, are bounded by some constant integer; denoting by k the sparsity level (that is, the number of relevant groups), we assume the asymptotics k/m → 0. Now we state our minimax theorem, where we write a ∼ b if a/b → 1 in the asymptotic limit, and ‖·‖_0 denotes the number of nonzero entries of its argument. The proof makes use of the same techniques as that of Theorem 1.1 in [26] and is deferred to the Appendix.

###### Theorem 2.6.

Fix any constant q ∈ (0,1) and let the regularizing sequence be given by (2.7). Under the preceding conditions, gSLOPE is asymptotically minimax over the nearly black object {β : ‖⟦β⟧_{X,I}‖_0 ≤ k}, i.e.,

 sup_{‖⟦β⟧_{X,I}‖_0 ≤ k} E(‖⟦β^gS⟧_{X,I} − ⟦β⟧_{X,I}‖_2²) ∼ inf_{β̂} sup_{‖⟦β⟧_{X,I}‖_0 ≤ k} E(‖⟦β̂⟧_{X,I} − ⟦β⟧_{X,I}‖_2²),

where the infimum is taken over all measurable estimators β̂.

Notably, in this theorem the choice of λ does not assume knowledge of the sparsity level. Putting it differently, in stark contrast to gLASSO, gSLOPE is adaptive to a range of sparsity levels in achieving exact minimaxity. Combining Theorems 2.5 and 2.6, we see that the remarkable link between FDR control and minimax estimation also applies to gSLOPE [1, 26]. While it is beyond the scope of this paper, it would be of great interest to extend this minimax result to general design matrices.

### 2.6 The impact of chosen weights

In this subsection we will discuss the influence of the chosen weights, w_1,…,w_m, on the results. Let I = {I_1,…,I_m} be a given partition into groups and l_1,…,l_m be the ranks of the submatrices X_{I_1},…,X_{I_m}. Assume orthogonality at the group level, i.e., X_{I_i}ᵀX_{I_j} = 0 for i ≠ j, and suppose that σ = 1. The support of ⟦β^gS⟧_{X,I} coincides with the support of the vector c* defined in (2.8), namely

 (2.17)

where W is a diagonal matrix with positive numbers on the diagonal. Suppose now that c* has exactly r nonzero coefficients. From Corollary D.4, these indices correspond to the r largest values among w_i^{-1}‖ỹ_{I_i}‖_2. Hence, the order of the realizations of w_i^{-1}‖ỹ_{I_i}‖_2 decides about the subset of groups labeled by gSLOPE as relevant. Suppose that the groups I_i and I_j are truly relevant, i.e., ‖X_{I_i}β_{I_i}‖_2 ≠ 0 and ‖X_{I_j}β_{I_j}‖_2 ≠ 0. The distributions of ‖ỹ_{I_i}‖_2 and ‖ỹ_{I_j}‖_2 are then noncentral χ distributions with l_i and l_j degrees of freedom, and noncentrality parameters equal to ‖β̃_{I_i}‖_2 and ‖β̃_{I_j}‖_2, respectively, where β̃_{I_i} := R_iβ_{I_i}. Now, the expected value of the noncentral χ distribution can be well approximated by the square root of the expected value of the noncentral χ² distribution, which gives

 E(w_i^{-1}‖ỹ_{I_i}‖_2) ≈ w_i^{-1}√(E(‖ỹ_{I_i}‖_2²)) = w_i^{-1}√(l_i + ‖β̃_{I_i}‖_2²).

Therefore, roughly speaking, truly relevant groups I_i and I_j are treated as comparable when w_i^{-1}√(l_i + ‖β̃_{I_i}‖_2²) ≈ w_j^{-1}√(l_j + ‖β̃_{I_j}‖_2²). This gives us an intuition about the behavior of gSLOPE with the choice w_i = √l_i for each i. Firstly, gSLOPE treats all irrelevant groups as comparable, i.e. the size of a group has a relatively small influence on its chances of being selected as a false discovery. Secondly, gSLOPE treats two truly relevant groups as comparable if the group effect sizes satisfy the condition ‖β̃_{I_i}‖_2²/l_i ≈ ‖β̃_{I_j}‖_2²/l_j. This gives a nice interpretation: with the choice w_i = √l_i, gSLOPE treats two groups as comparable when these groups have similar squared group effect sizes per coefficient. One possible idealistic situation where such a property occurs is when all the coefficients in the truly relevant groups are comparable.

In Figure 2 we see that with the choice w_i = √l_i the fractions of groups of different sizes among the selected truly relevant groups (STRG) are approximately equal. To investigate the impact of the selected weights on the set of discovered groups, we performed simulations with two alternative settings for the weights (without changing the other parameters). With the first choice, larger groups are penalized less than before, while the second choice yields the opposite situation. This is reflected in the proportions of groups of each size in the STRG (Figure 2).

The values of gFDR are very similar under all choices of weights.

### 2.7 Independent groups and unknown σ

The assumption that variables in different groups are orthogonal to each other can be satisfied only in rare situations of specifically designed experiments. However, in a variety of applications one can assume that the variables in different groups are independent. Such a situation occurs for example in the context of identifying influential genes using distant genetic markers, whose genotypes can be considered stochastically independent. In this case a group can be formed by clustering the dummy variables corresponding to different genotypes of a given marker. Though the difference between stochastic independence and algebraic orthogonality seems rather small, it turns out that small sample correlations between independent regressors, together with the shrinkage of regression coefficients, lead to a magnification of the effective noise and require an adjustment of the tuning sequence (see [25] for a discussion of this phenomenon in the context of LASSO). Concerning regular SLOPE, this problem was addressed by a heuristic modification of the λ sequence proposed in [10] and [9]. This modified sequence was calculated under the assumption that the explanatory variables are randomly sampled from the Gaussian distribution. However, simulation results from [9] illustrate that it controls FDR also in the case when the columns of the design matrix correspond to additive effects of independent SNPs and the number of causal genes is moderately small.

Following the ideas for regular SLOPE presented in [9], we propose Procedure 1 for calculating the sequence of tuning parameters in the case when variables in different groups are independent. The heuristics justifying this choice are substantially more technically involved than those for regular SLOPE, and their details are presented in Appendix G. Procedure 1 is based on the sequence λ^mean, but the version for the conservative choice λ^max follows analogously. The proposed sequence of tuning parameters flattens out at a certain index dependent on n and p. It is supposed to control gFDR when the number of identified groups is not much larger than this index.

Up until this moment, we have used σ in the gSLOPE optimization problem, assuming that this parameter is known. However, in many applications σ is unknown and its estimation is an important issue. When n > p, the standard procedure is to use the unbiased estimator of σ² given by

 σ̂² := ‖y − Xβ̂_OLS‖_2² / (n − p),  (2.18)

where β̂_OLS is the least squares estimate.

For the target situation, with p much larger than n, such an estimator cannot be used. To estimate σ we will therefore apply the procedure which was proposed for this purpose in [9] in the context of SLOPE. Below we present the algorithm adjusted to gSLOPE (Procedure 2).

The idea behind the procedure is simple. The property of gSLOPE of producing sparse estimators is exploited: in each iteration the columns of the design matrix are first restricted to the support of the current gSLOPE estimate, so that the number of rows exceeds the number of columns and (2.18) can be used. The algorithm terminates when gSLOPE finds the same subset of relevant variables as in the preceding iteration.
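A minimal sketch of this iteration in Python/NumPy (the gSLOPE solver itself is abstracted away as a user-supplied `select_support` callback, which is our simplification; Procedure 2 in the text uses gSLOPE for this step):

```python
import numpy as np

def estimate_sigma(y, X, select_support, max_iter=20):
    """Iterative sigma estimation in the spirit of Procedure 2 (and of the
    SLOPE procedure of [9]): alternate between (a) running a sparse selector
    with the current sigma and (b) applying the OLS-based unbiased estimator
    (2.18) on the selected columns only. `select_support(y, X, sigma)` stands
    in for a gSLOPE solver returning the selected column indices."""
    n = X.shape[0]
    support = np.array([], dtype=int)
    sigma = float(np.std(y, ddof=1))              # crude initialization
    for _ in range(max_iter):
        new_support = np.asarray(select_support(y, X, sigma), dtype=int)
        if np.array_equal(np.sort(new_support), np.sort(support)):
            break                                 # support stabilized: stop
        support = new_support
        Xs = X[:, support]
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        rss = float(np.sum((y - Xs @ beta) ** 2))
        sigma = float(np.sqrt(rss / (n - Xs.shape[1])))   # estimator (2.18)
    return sigma, support
```

The restriction step guarantees that the residual degrees of freedom n − |support| in (2.18) stay positive as long as the selector is sufficiently sparse.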

To investigate the performance of gSLOPE under the Gaussian design and various group sizes, we performed simulations with a large number of groups. Their sizes were drawn from the binomial distribution, so that the expected value of the group size was fixed (Figure 3c). As a result, we obtained the variables divided into groups (the same division was used in all iterations and scenarios). For each sparsity level, each gFDR level q, and each iteration, we generated the entries of the design matrix from the normal distribution; then X was standardized and the values of the response variable were generated according to model (2.1), with the signals generated as in the simulations for Figure 1. To identify the relevant groups based on the simulated data, we used the iterative version of gSLOPE, with σ estimation (Procedure 2) and the λ sequence given by Procedure 1. We performed repetitions for each scenario. The results are presented in Figure 3 and show that our procedure allows control of gFDR at the assumed level.

Additionally, Figure 3 compares gSLOPE to gLASSO with two choices of the smoothing parameter λ_gL. Firstly, we used the value which allows control of FDR under the total null hypothesis. Secondly, for each iteration we chose λ_gL based on leave-one-out cross-validation. It turns out that the first of these choices becomes rather conservative when the number of truly relevant groups increases. Then gLASSO has a smaller FDR but also a much smaller power than gSLOPE (by a factor of three for the larger sparsity levels). Cross-validation works in the opposite way: it yields a large power but also results in a huge proportion of false discoveries, which in our simulations systematically exceeds 60%.

### 2.8 Simulations in the context of Genome-Wide Association Studies

To test the performance of gSLOPE in the context of Genome-Wide Association Studies (GWAS) we used the North Finland Birth Cohort (NFBC) dataset, available in dbGaP with accession number phs000276.v2.p1 (http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000276.v2.p1) and described in detail in [21]. The raw data contain the genotypes of the markers for the study subjects. To obtain roughly independent SNPs, this data set was initially screened so that in the final data set the maximal correlation between any pair of SNPs does not exceed a prespecified threshold. The reduced data set contains the retained SNPs.

The explanatory variables for our genetic model were defined in Table 1, where denotes the less frequent (variant) allele.

In the case when the population frequencies of both alleles are the same, the variables describing the additive and the dominance effect are uncorrelated. In other cases the correlation between these variables is different from zero and can be very strong for rare genetic variants. Since each SNP is described by two dummy variables, the full design matrix contains twice as many potential regressors as there are SNPs. This matrix was then centered and standardized, so that the columns of the final design matrix have zero mean and unit norm.
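For illustration, one common GWAS coding of the two dummy variables per SNP (an additive allele count and a heterozygote indicator) can be built as follows; this coding is our assumption for the sketch and the exact definitions are those of Table 1:

```python
import numpy as np

def snp_design(genotypes):
    """Build additive and dominance dummy variables from genotypes coded as
    the number of copies of the variant allele a (0 = AA, 1 = aA, 2 = aa).
    This is one common coding, assumed here for illustration; Table 1 in the
    text gives the definitions actually used. Columns are centered and
    scaled to unit norm, as described above."""
    g = np.asarray(genotypes, dtype=float)
    add = g.copy()                   # additive effect: variant allele count
    dom = (g == 1).astype(float)     # dominance effect: heterozygote indicator
    M = np.column_stack([add, dom])
    M -= M.mean(axis=0)              # center
    norms = np.linalg.norm(M, axis=0)
    norms[norms == 0] = 1.0          # guard against monomorphic columns
    return M / norms                 # unit-norm columns

# With equal allele frequencies the two columns come out uncorrelated:
M = snp_design([0, 1, 2, 1, 0, 2])
print(np.allclose(M.mean(axis=0), 0), np.allclose(np.linalg.norm(M, axis=0), 1))
```

The example genotype vector has variant allele frequency 1/2, so the inner product of the two centered columns is zero, matching the remark above about equal allele frequencies.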

The trait values are simulated according to two scenarios. In Scenario 1 we simulate from an additive model, where each of the causal SNPs influences the trait only through the additive dummy variable in the matrix X,

 y=XβX+ϵ. (2.19)

Here the number of ‘causal’ SNPs varies over a grid of values and each causal SNP has an additive effect (a non-zero component of β_X) of the same fixed magnitude. In each of the iterations of our experiment the causal SNPs were randomly selected from the full set of SNPs.

The additive model (2.19) assumes that, for each of the SNPs, the expected value of the trait for the heterozygote aA is the average of the expected trait values for both homozygotes aa and AA. This idealistic assumption is usually not satisfied and many SNPs exhibit some dominance effects. To illustrate the performance of gSLOPE in the presence of dominance effects, we simulated data according to Scenario 2;

 y = [X Z][β_X; β_Z] + ε,  (2.20)

which differs from Scenario 1 by the addition of dominance effects (non-zero components of β_Z), which for each of the selected causal SNPs are sampled from a uniform distribution. The simulated data sets were analyzed using three different approaches:

• gSLOPE with groups, where each of the groups contains two explanatory variables, describing the additive and the dominance effect of the same SNP,

• SLOPE, where the regular SLOPE is used to search through the reduced design matrix X, containing only the additive effects (as in [9] or [11]),

• SLOPE, where the regular SLOPE is used to search through the full design matrix [X Z].

In all versions of SLOPE we used the iterative procedure for the estimation of σ and the sequence of tuning parameters heuristically adjusted to the case of a Gaussian design matrix, as implemented in the CRAN packages SLOPE and grpSLOPE.

Figure 4 provides a summary of this simulation study. Here FDR and power are calculated at the SNP level. Specifically, in the case of SLOPE a SNP is counted as one discovery if the corresponding additive or dominance dummy variable is selected.
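Under the convention that each SNP contributes two adjacent columns (additive, then dominance) to the design matrix, the SNP-level FDR and power can be computed as in the sketch below; the column layout and the helper name are our assumptions, not part of the paper:

```python
import numpy as np

def snp_level_fdr_power(selected_vars, causal_snps):
    """Count a SNP as discovered if either of its two dummy variables
    (columns 2j and 2j+1 for SNP j) is selected."""
    discovered = {v // 2 for v in selected_vars}
    causal = set(causal_snps)
    n_false = len(discovered - causal)
    fdr = n_false / max(len(discovered), 1)
    power = len(discovered & causal) / max(len(causal), 1)
    return fdr, power

# toy example: SNPs 0 and 3 are causal; variables 0, 1 (SNP 0), 6 (SNP 3)
# and 4 (SNP 2, a false discovery) are selected
fdr, power = snp_level_fdr_power([0, 1, 6, 4], [0, 3])
# fdr = 1/3, power = 1.0
```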

As shown in Figure 4, for both simulated scenarios all versions of SLOPE control gFDR for all considered values of . When the data are simulated according to the additive model the highest power is offered by SLOPE, with the power of gSLOPE being smaller by approximately over the whole range of . However, in the presence of large dominance effects the situation is reversed and gSLOPE offers the highest power, systematically exceeding the power of SLOPE by roughly the same amount. In our simulations SLOPE has intermediate performance and does not substantially improve upon the power of SLOPE in the presence of dominance effects. Thus our simulations suggest that gSLOPE provides information complementary to SLOPE and might be a useful tool in the context of Genome-Wide Association Studies.

### 2.9 gSLOPE in GWAS applications: real phenotype data

Finally, we applied group SLOPE to identify SNPs associated with four lipid phenotypes available in the NFBC dataset: high-density lipoproteins (HDL), low-density lipoproteins (LDL), triglycerides (TG), and total cholesterol (CHOL). The data set contains genotypes of SNPs for individuals and was previously analyzed in [11] using regular SLOPE to search for additive SNP effects. Before this analysis the data were reduced by applying a p-value threshold for single-marker t-tests and selecting representatives of strongly correlated SNPs, also based on p-values. Since this pre-processing selects the most promising SNPs by performing multiple testing on the full set of SNPs, the sequence of tuning parameters for SLOPE needs to be adjusted to the full number of SNPs rather than to the number of selected representatives. The algorithm for this analysis is implemented in the R package geneSLOPE and its details are explained in [11]. According to the extensive simulation study and real data analysis reported in [11], geneSLOPE makes it possible to control FDR in the analysis of full-size GWAS data.

In our data analysis we used three versions of SLOPE: geneSLOPE for additive effects (as in [11]), geneSLOPE with the design matrix extended by the inclusion of dominance dummy variables, and gene group SLOPE (geneGSLOPE). In geneSLOPE and geneGSLOPE representative SNPs were selected based on one-way ANOVA tests. For all these procedures the pre-processing was based on a p-value threshold and a correlation cutoff , which allowed us to reduce the data set to roughly 8500 representative SNPs. For the convenience of the reader, Procedure 3 for the full geneGSLOPE analysis is provided below.
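The screening step can be read as p-value thresholding followed by greedy correlation clumping. Below is a hypothetical sketch of that idea; the function name, thresholds, and toy data are ours and are not taken from the geneSLOPE implementation:

```python
import numpy as np

def select_representatives(X, pvals, p_thresh, rho):
    """Keep the most significant candidate SNP, drop all remaining candidates
    whose absolute correlation with it is at least rho, and repeat."""
    candidates = [j for j in np.argsort(pvals) if pvals[j] <= p_thresh]
    C = np.corrcoef(X, rowvar=False)       # column-wise correlation matrix
    reps, removed = [], set()
    for j in candidates:                    # visited in order of significance
        if j in removed:
            continue
        reps.append(j)
        removed |= {k for k in candidates if k != j and abs(C[j, k]) >= rho}
    return reps

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 6))
X[:, 1] = X[:, 0] + 0.01 * rng.standard_normal(100)  # SNP 1 nearly duplicates SNP 0
pvals = np.array([0.001, 0.002, 0.2, 0.01, 0.5, 0.03])
reps = select_representatives(X, pvals, p_thresh=0.05, rho=0.9)
# SNP 0 represents the correlated pair; SNPs 2 and 4 fail the p-value threshold
```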

The numbers of discoveries given by geneSLOPE, geneSLOPE and geneGSLOPE are summarized in Table 2, where we can observe that both versions of geneSLOPE gave identical results for LDL, CHOL and TG. Compared to these methods geneGSLOPE did not reveal any new response-related SNPs for LDL and CHOL; in fact, for these two traits geneGSLOPE missed some SNPs detected by the other two methods.

A different situation takes place for TG, where geneGSLOPE identifies 6 additional SNPs as compared to the other two methods. All these detections have a similar structure, showing a significant recessive effect of the minor allele. In all these cases the minor allele frequency was smaller than 0.1. The detection of such "rare" recessive effects by a simple linear regression model is rather difficult, since the regression line adjusts mainly to the two prevalent genotype groups and is almost flat [19].

In the case of HDL all three versions of SLOPE gave different results. geneSLOPE identifies one new SNP as compared to geneSLOPE, while geneGSLOPE identifies one more SNP and misses one of the discoveries obtained by the other two methods. In Figure 5 we compare two exemplary discoveries: one detected by both geneSLOPE and geneGSLOPE (known discovery) and one detected only by geneGSLOPE (new discovery). This example clearly shows the additive effect of the previously detected SNP and the recessive character of the second SNP. In the case of the new discovery there are only 5 individuals in the last genotype group, which makes the change in the mean undetectable by simple linear regression.

The results of the real data analysis agree with the results of the simulations. They show that geneGSLOPE has lower power than geneSLOPE for the detection of additive effects, but can be very helpful in detecting rare recessive variants. Thus these two methods are complementary and can be used together to enhance the power of detection of influential genes.

## 3 Discussion

Group SLOPE is a new convex optimization procedure for the selection of important groups of explanatory variables, which can be considered a generalization of group LASSO and of SLOPE. In this article we provide an algorithm for solving group SLOPE and discuss the choice of the sequence of regularizing parameters. Our major focus is the control of the group FDR, which can be obtained when variables in different groups are orthogonal to each other, or when they are stochastically independent and the signal is sufficiently sparse. After some preprocessing of the data such situations occur frequently in the context of genetic studies, which in this paper serve as a major example of application. While we concentrated mainly on using gSLOPE to group dummy variables corresponding to different effects of the same SNP, gSLOPE can also be used to group SNPs based on biological function, physical location, etc. We also expect this method to be advantageous in the context of the identification of groups of rare genetic variants, where considering their joint effect on the phenotype should substantially increase the power of detection.

The major purpose of controlling FDR, rather than absolutely eliminating false discoveries, is to increase the power of detection of signals which are comparable to the noise level. As shown by a variety of theoretical and empirical results, this allows SLOPE to obtain an optimal balance between the numbers of false and true discoveries and leads to very good estimation and predictive properties (see e.g. [10], [9] or [26]). Our Theorem 2.6 illustrates that these good estimation properties are inherited by group SLOPE.

We provide the regularizing sequence , which provably controls gFDR in the case when variables in different groups are orthogonal. Additionally, we propose its relaxation , which according to our extensive simulations controls the "average" gFDR, where the average is taken with respect to all possible signal placements. This sequence can easily be modified to take into account a prior distribution on the signal placement. Such a "Bayesian" version of gSLOPE, together with a proof of control of the respective average gFDR, remains an interesting topic for further research.

Another important topic for further research is a formal proof of gFDR control when variables in different groups are independent, together with precise limits on the sparsity levels under which such control can be achieved. Asymptotic formulas which allow for very accurate prediction of the FDR of LASSO under a Gaussian design are provided in [25]. We expect that similar results can be obtained for SLOPE and gSLOPE and generalized to the case of random matrices whose variables are independent and come from sub-Gaussian distributions. However, the technical complexity of the results reported in [25] illustrates that this task is rather challenging. An alternative approach to exact gFDR control under random designs is to couple gSLOPE with the new knockoff procedure proposed in [12]. We expect that such a combination should increase the power of detection of relevant features as compared to other methods currently used with knockoffs.

While we concentrated on the control of FDR in the case when groups of variables are roughly orthogonal to each other, it is worth mentioning that the original SLOPE also has very interesting properties when regressors are strongly correlated. As shown e.g. in [14], the Sorted L-One norm has a tendency to average estimated regression coefficients over groups of strongly correlated predictors, which enhances predictive properties and prevents the loss of important predictors due to their correlation with other features. We expect similar properties to hold for gSLOPE, but the investigation of its behavior when variables in different groups are strongly correlated remains an interesting topic for further research.

## Acknowledgement

We would like to thank Emmanuel J. Candès and Jan Mielniczuk for helpful remarks and suggestions and Christine Peterson for screening the North Finland Birth Cohort (NFBC) dataset. D. B. would like to thank Professor Jerzy Ombach for significant help with the process of obtaining access to the data. D. B. and M. B. are supported by the European Union's 7th Framework Programme for research, technological development and demonstration under Grant Agreement no. 602552 and by the Polish Ministry of Science and Higher Education according to agreement 2932/7.PR/2013/2. Additionally, D. B. acknowledges the support of NIMH grant R01MH108467. W. S. was partially supported by a General Wang Yaowu Stanford Graduate Fellowship.

## References

• [1] Abramovich F. and Benjamini Y. and Donoho D. L. and Johnstone I. M. Adapting to unknown sparsity by controlling the false discovery rate. Ann. Statist., 34(2):584–653, 2006.
• [2] Abramowitz M. and Stegun I. A. Handbook of mathematical functions: with formulas, graphs, and mathematical tables. Number 55. Courier Corporation, 1964.
• [3] Akaike H. A New Look at the Statistical Model Identification. IEEE Transactions on Automatic Control, 19(6):716–723, 1974.
• [4] Bakin S. Adaptive regression and model selection in data mining problems. PhD thesis, The Australian National University, Canberra, 1999.
• [5] Barber R. F. and Candès E. J. Controlling the False Discovery Rate via Knockoffs. The Annals of Statistics, 43(5):2055–2085, 2015.
• [6] Beck A. and Teboulle M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 1:183–202, 2009.
• [7] Benjamini Y. and Hochberg Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society, Series B, 57(1):289–300, 1995.
• [8] Bogdan M. and Chakrabarti A. and Frommlet F. and Ghosh J. K. Asymptotic Bayes Optimality under sparsity of some multiple testing procedures. Annals of Statistics, 39:1551–1579, 2011.
• [9] Bogdan M. and van den Berg E. and Sabatti C. and Su W. and Candès E. J. SLOPE – adaptive variable selection via convex optimization. Annals of Applied Statistics, 9(3):1103–1140, 2015.
• [10] Bogdan M. and van den Berg E. and Su W. and Candès E. J. Statistical Estimation and Testing via the Sorted L1 Norm. arXiv:1310.1969, 2013.
• [11] Brzyski, D. and Peterson, C.B. and Sobczyk, P. and Candès, E.J. and Bogdan, M. and Sabatti,C. Controlling the rate of GWAS false discoveries. bioRxiv 058230, to appear in Genetics, 2016.
• [12] Candès, E.J and Fan, Y. and Janson L. and Lv J. Panning for gold: model-free knockoffs for high-dimensional controlled variable selection. arXiv:1610.02351, 2016.
• [13] Donoho D. L. and Johnstone I. M. Minimax risk over ℓp-balls for ℓq-error. Probability Theory and Related Fields, 99(2):277–303, 1994.
• [14] Figueiredo M. A. T. and Nowak R. D. Ordered Weighted ℓ1 Regularized Regression with Strongly Correlated Covariates: Theoretical Aspects. Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, JMLR:W&CP, 51:930–938, 2016.
• [15] Frommlet F. and Bogdan M. Some optimality properties of FDR controlling rules under sparsity. Electronic Journal of Statistics, 7:1328–1368, 2013.
• [16] Gossmann A. and Cao S. and Wang Y.-P. Identification of significant genetic variants via SLOPE, and its extension to Group SLOPE. In Proceedings of the International Conference on Bioinformatics, Computational Biology and Biomedical Informatics, 2015.
• [17] Hardy G. H. and Littlewood J. E. and Pólya G. Inequalities. Cambridge University Press, 1952.
• [18] Inglot T. Inequalities for quantiles of the chi-square distribution. Probability and Mathematical Statistics, 30(2):339–351, 2010.
• [19] Lettre G. and Lange C. and Hirschhorn J. N. Genetic model testing and statistical power in population-based association studies of quantitative traits. Genetic Epidemiology, 31(4): 358–362, 2007.
• [20] Neuvial P. and Roquain E. On false discovery rate thresholding for classification under sparsity. Annals of Statistics, 40:2572–2600, 2012.
• [21] Sabatti C. and Service S. K. and Hartikainen A. and Pouta A. and Ripatti S. and Brodsky J. and Jones C. G. and Zaitlen N. A. and Varilo T. and Kaakinen M. and Sovio U. and Ruokonen A. and Laitinen J. and Jakkula E. and Coin L. and Hoggart C. and Collins A. and Turunen H. and Gabriel S. and Elliot P. and McCarthy M. I. and Daly M. J. and Järvelin M. and Freimer N. B. and Peltonen L. Genome-wide association analysis of metabolic traits in a birth cohort from a founder population. Nature Genetics, 41(1):35–46, 2009.
• [22] Schwarz G. Estimating the Dimension of a Model. The Annals of Statistics, 6(2):461–464, 1978.
• [23] Simon N. and Friedman J. and Hastie T. and Tibshirani R. A sparse-group lasso. Journal of Computational and Graphical Statistics, 2013.
• [24] Simon N. and Tibshirani R. Standardization and the group lasso penalty. Technical report, 2011.
• [25] Su W. and Bogdan M. and Candès E.J. False discoveries occur early on the lasso path. arXiv:1511.01957, to appear in Ann. Statist., 2015.
• [26] Su W. and Candès E. SLOPE is adaptive to unknown sparsity and asymptotically minimax. Annals of Statistics, 40:1038–1068, 2016.
• [27] Tibshirani R. Regression Shrinkage and Selection Via the Lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1996.
• [28] Tseng P. On accelerated proximal gradient methods for convex-concave optimization. 2008.
• [29] Yuan M. and Lin Y. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68(1):49–67, 2006.

## Appendix A Jλ norm properties

For a nonnegative, nonincreasing sequence λ = (λ1, …, λp), consider the function Jλ given by Jλ(b) = λ1|b|(1) + … + λp|b|(p), where |b|(1) ≥ … ≥ |b|(p) is the vector of sorted absolute values of the coordinates of b.
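For reference, this sorted ℓ1 norm is easy to evaluate numerically; a minimal sketch, assuming λ is nonnegative and nonincreasing:

```python
import numpy as np

def sorted_l1_norm(b, lam):
    """J_lambda(b) = sum_i lam_i * |b|_(i), where |b|_(1) >= |b|_(2) >= ...
    are the absolute values of b sorted in decreasing order."""
    abs_sorted = np.sort(np.abs(b))[::-1]  # decreasing sort of |b|
    return float(np.dot(lam, abs_sorted))

lam = np.array([3.0, 2.0, 1.0])
b = np.array([-1.0, 5.0, 2.0])
value = sorted_l1_norm(b, lam)   # 3*5 + 2*2 + 1*1 = 20.0
```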

###### Proposition A.1.

If , are such that , then .

###### Proof.

Without loss of generality we can assume that and are nonnegative and that . We will show that for . Fix such and consider the set