
# Group Lasso with Overlaps: the Latent Group Lasso approach

Guillaume Obozinski* (guillaume.obozinski@ens.fr)
Sierra team, INRIA
Ecole Normale Supérieure (INRIA/ENS/CNRS UMR 8548)
Paris, France

Laurent Jacob* (laurent@stat.berkeley.edu)
Department of Statistics
University of California
Berkeley CA 94720, USA

Jean-Philippe Vert (Jean-Philippe.Vert@mines.org)
Centre for Computational Biology, Mines ParisTech
Fontainebleau, F-77300, France
INSERM U900, Institut Curie
Paris, F-75005, France

*Equal contribution
###### Abstract

We study a norm for structured sparsity which leads to sparse linear predictors whose supports are unions of predefined overlapping groups of variables. We call the obtained formulation latent group Lasso, since it is based on applying the usual group Lasso penalty on a set of latent variables. A detailed analysis of the norm and its properties is presented, and we characterize conditions under which the set of groups associated with latent variables is correctly identified. We motivate and discuss the delicate choice of the weights associated with each group, and illustrate this approach on simulated data and on the problem of breast cancer prognosis from gene expression data.


Keywords: group Lasso, sparsity, graph, support recovery, block regularization, feature selection

## 1 Introduction

Sparsity has triggered much research in statistics, machine learning and signal processing recently. Sparse models are attractive in many application domains because they lend themselves particularly well to interpretation and data compression. Moreover, from a statistical viewpoint, betting on sparsity is a way to reduce the complexity of inference tasks in large dimensions with limited amounts of observations. While sparse models have traditionally been estimated with greedy feature selection approaches, more recent formulations as optimization problems involving a non-differentiable convex penalty have proven very successful both theoretically and practically. The canonical example is the penalization of a least-squares criterion by the ℓ1 norm of the estimator, known as Lasso in statistics (Tibshirani, 1996) or basis pursuit in signal processing (Chen et al., 1998). Under appropriate assumptions, the Lasso can be shown to recover the exact support of a sparse model from data generated by this model if the covariates are not too correlated (Zhao and Yu, 2006; Wainwright, 2009). It is consistent even in high dimensions, with fast rates of convergence (Lounici, 2008; Bickel et al., 2009). We refer the reader to van de Geer (2010) for a detailed review.

While the ℓ1 norm penalty leads to sparse models, it does not encode any prior information about the structure of the sets of covariates that one may wish to see selected jointly, such as predefined groups of covariates. An extension of the Lasso for the selection of variables in groups was proposed under the name group Lasso by Yuan and Lin (2006), who considered the case where the groups form a partition of the set of variables. The group Lasso penalty, also called ℓ1/ℓ2 penalty, is defined as the sum (i.e., the ℓ1 norm) of the ℓ2 norms of the restrictions of the parameter vector of the model to the different groups of covariates. The work of several authors shows that when the support can be encoded well by the groups defining the norm, support recovery and estimation are improved (Lounici et al., 2010; Huang and Zhang, 2010; Obozinski et al., 2010; Negahban and Wainwright, 2011; Lounici et al., 2009; Kolar et al., 2011).

Subsequently, the notion of structured sparsity emerged as a natural generalization of selection in groups, where the support of the model one wishes to recover is not anymore required to be just sparse but also to display certain structure. One of the first natural approaches to structured sparsity has been to consider extensions of the ℓ1/ℓ2 penalty to situations in which the groups considered overlap, so that the possible support patterns exhibit some structure (Zhao et al., 2009; Bach, 2009). Jenatton et al. (2011) formalized this approach and proposed a norm construction for families of allowed supports that are stable by intersection. Other approaches to structured sparsity are quite diverse: Bayesian or non-convex approaches that directly exploit the recursive structure of some sparsity patterns such as trees (He and Carin, 2009; Baraniuk et al., 2010), greedy approaches based on block-coding (Huang et al., 2009), relaxation of submodular penalties (Bach, 2010), generic variational formulations (Micchelli et al., 2011).

While Jenatton et al. (2011) proposed a norm inducing supports that arise as intersections of a sub-collection of the groups defining the norm, we consider in this work norms which, albeit also defined by a collection of overlapping groups, induce supports that are rather unions of a sub-collection of the groups encoding prior information. The main idea is that instead of directly applying the ℓ1/ℓ2 norm to a vector, we apply it to a set of latent variables, each supported by one of the groups, which are combined linearly to form the estimated parameter vector. In the regression case, we therefore call our approach latent group Lasso.

The corresponding decomposition of a parameter vector into latent variables calls for the notion of group-support, which we introduce and which corresponds to the set of non-zero latent variables. In the context of a learning problem regularized by the norm we propose, we study the problem of group-support recovery, a notion stronger than the classical support recovery. Group-support recovery typically implies support recovery (although not always) if the support of a parameter vector is exactly a union of groups. We provide sufficient conditions for consistent group-support recovery.

In the definition of our norm, a weight is associated with each group. These weights play a much more important role in the case of overlapping groups than in the case of disjoint groups, since in the former case they determine the set of recoverable supports and the complexity of the class of possible models. We discuss the delicate question of the choice of these weights.

While the norm we consider is quite general and has potentially many applications, we illustrate its potential on the particular problem of learning sparse predictive models for cancer prognosis from high-dimensional gene expression data. The problem of identifying a predictive molecular signature made of a small set of genes is often ill-posed and so noisy that exact variable selection may be elusive. We propose that, instead, selecting genes in groups that are involved in the same biological process or connected in a functional or interaction network could be performed more reliably, and potentially lead to better predictive models. We empirically explore this application, after extensive experiments on simulated data illustrating some of the properties of our norm.

To summarize, the main contributions of this paper, which rephrases and extends a preliminary version published in Jacob et al. (2009), are the following:

• We define the latent group Lasso penalty to infer sparse models with unions of predefined groups as supports, and analyze in detail some of its mathematical properties.

• We introduce the notion of group-support and establish group-support recovery results. Using correspondence theory, we show, under appropriate conditions, that in a classical asymptotic setting, estimators for linear regression regularized with the proposed norm are consistent for the estimation of a sufficiently sparse group-support.

• We discuss at length the choice of the weights associated with each group, which play a crucial role in the presence of overlapping groups of different sizes.

• We provide extended experimental results both on simulated data — addressing support-recovery, estimation error and role of weights — and on breast cancer data, using biological pathways and genes networks as prior information to construct latent group Lasso formulations.

The rest of the paper is structured as follows. We first introduce the latent group Lasso penalty and position it in the context of related work in Section 3. In Section 4 we show that it is a norm and provide several characterizations and variational formulations; we also show that regularizing with this norm is equivalent to covariate duplication (Section 4.6) and derive a corresponding multiple kernel learning formulation (Section 4.7). We briefly discuss algorithms in Section 4.8. In Section 5, we introduce the notion of group-support and consider in Section 6 a few toy examples to illustrate the concepts and properties discussed so far. We study group-support consistency in Section 7. The difficult question of the choice of the weighting scheme is discussed in Section 8. Section 9 presents the latent graph Lasso, a variant of the latent group Lasso when covariates are organized into a graph. Finally, in Section 10, we present several experiments: first, on artificial data to illustrate the gain in support recovery and estimation over the classical Lasso, as well as the influence of the choice of the weights; second, on the real problem of breast cancer prognosis from gene expression data, using biological pathways and gene networks as prior information.

## 2 Notations

In this section we introduce notations that will be used throughout the article. For any vector w ∈ ℝ^p and any q ≥ 1, ‖w‖_q denotes the ℓq norm of w. We simply use the notation ‖w‖ for the Euclidean norm ‖w‖_2. supp(w) denotes the support of w, i.e., the set of covariates i such that w_i ≠ 0. A group of covariates is a subset g ⊆ [1,p]. The set of all possible groups is therefore the power set of [1,p]. For any group g, g^c denotes the complement of g in [1,p], i.e., the covariates which are not in g. For any vector w, the projection of w onto g is the vector whose entries are the same as those of w for the covariates in g, and are 0 for the other covariates; we will usually denote it by w_g. We say that two groups overlap if they have at least one covariate in common.

Throughout the article, G denotes a set of groups, usually fixed in advance for each application, and we denote by m the number of groups in G. We require that all covariates belong to at least one group, i.e.,

 ⋃g∈Gg=[1,p].

We denote by V_G the set of m-tuples of vectors v̄ = (v^g)_{g∈G}, where each v^g is a vector in ℝ^p that satisfies supp(v^g) ⊆ g for each g ∈ G.

For any differentiable function L : ℝ^p → ℝ, we denote by ∇L(w) the gradient of L at w and by ∇_g L(w) the partial gradient of L with respect to the covariates in g.

In optimization problems throughout the paper we will use the convention that 0/0 = 0, so that the [0,+∞]-valued function (x, λ) ↦ x²/λ is well defined and jointly convex on ℝ × ℝ₊.

## 3 Group Lasso with overlapping groups

Given a set of groups G which forms a partition of [1,p], the group Lasso penalty (Yuan and Lin, 2006) is a norm over ℝ^p defined as:

 ∀w ∈ ℝ^p, ‖w‖_{ℓ1/ℓ2} = Σ_{g∈G} d_g ‖w_g‖, (1)

where (d_g)_{g∈G} are positive weights. This is a norm whose balls have singularities when some w_g are equal to zero. Minimizing a smooth convex loss functional L over such a ball, or equivalently solving the following optimization problem for some λ > 0:

 min_{w∈ℝ^p} L(w) + λ Σ_{g∈G} d_g ‖w_g‖, (2)

often leads to a solution that lies on a singularity, i.e., to a vector ŵ such that ŵ_g = 0 for some of the groups g in G. Equivalently, the solution is sparse at the group level, in the sense that the coefficients within a group are usually zero or nonzero together. The hyperparameter λ in (2) is used to adjust the tradeoff between minimizing the loss and finding a solution which is sparse at the group level.
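As a quick numerical illustration of the penalty (1), it can be computed directly from the group restrictions; the groups, weights and vector below are illustrative and not tied to any experiment in this paper:

```python
import numpy as np

def group_lasso_penalty(w, groups, d):
    """l1/l2 penalty (1): weighted sum of Euclidean norms of group restrictions."""
    return sum(d[k] * np.linalg.norm(w[list(g)]) for k, g in enumerate(groups))

# a partition of six covariates into three groups (illustrative)
groups = [(0, 1), (2, 3), (4, 5)]
d = [1.0, 1.0, 1.0]

w = np.array([3.0, 4.0, 0.0, 0.0, 1.0, 0.0])
print(group_lasso_penalty(w, groups, d))  # 5.0 + 0.0 + 1.0 = 6.0
```

Note that the second group contributes nothing: it is set to zero as a whole, which is exactly the group-level sparsity discussed above.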

When G is not a partition anymore and some of its groups overlap, the penalty (1) is still a norm, because we assume that all covariates belong to at least one group. However, while the Lasso is sometimes loosely presented as selecting covariates and the group Lasso as selecting groups of covariates, the group Lasso estimator (2) does not necessarily select groups in that case. The reason is that the precise effect of non-differentiable penalties is to set covariates, or groups of covariates, to zero, and not to select them. When there is no overlap between groups, setting some groups to zero leaves the other groups entirely nonzero, which can give the impression that the group Lasso is generally appropriate to select a small number of groups. When the groups overlap, however, setting one group to zero shrinks its covariates to zero even if they belong to other groups, in which case these other groups will not be entirely selected. This is illustrated in Figure 1(a) with three overlapping groups of covariates. If the penalty leads to an estimate in which the norms of the first and of the third groups are zero, what remains nonzero is not the second group, but the covariates of the second group which are neither in the first nor in the third one. More formally, the overlapping case has been extensively studied by Jenatton et al. (2009), who showed that in the case where L is an empirical risk and under very general assumptions on the data, the support of a solution ŵ of (2) almost surely satisfies

 supp(ŵ) = ( ⋃_{g∈G_0} g )^c

for some G_0 ⊆ G, i.e., the support is almost surely the complement of a union of groups. Equivalently, the support is an intersection of the complements of some of the groups considered.

In this work, we are interested in penalties which induce a different effect: we want the estimator to select entire groups of covariates, or more precisely we want the support of the solution to be a union of groups. For that purpose, we introduce a set of latent variables v̄ = (v^g)_{g∈G} ∈ V_G, such that supp(v^g) ⊆ g for each group g, and propose to solve the following problem instead of (2):

 min_{w∈ℝ^p, v̄∈V_G} L(w) + λ Σ_{g∈G} d_g ‖v^g‖ s.t. w = Σ_{g∈G} v^g. (3)

Problem (3) is always feasible since we assume that all covariates belong to at least one group. Intuitively, the vectors v^g in (3) represent a decomposition of w as a sum of latent vectors whose supports are included in each group, as illustrated in Figure 1(b). Applying the ℓ1/ℓ2 penalty to these latent vectors favors solutions which shrink some v^g to 0, while the non-shrunk latent vectors satisfy supp(v^g) ⊆ g. On the other hand, since we enforce w = Σ_{g∈G} v^g, a coefficient w_i can be nonzero as long as covariate i belongs to at least one non-shrunk group. More precisely, if we denote by G_1 the set of groups with v^g ≠ 0 for the solution of (3), then we immediately get supp(ŵ) ⊆ ⋃_{g∈G_1} g, and therefore we can expect:

 supp(^w)=⋃g∈G1g.

In other words, this formulation leads to sparse solutions whose support is likely to be a union of groups.

Interestingly, problem (3) can be reformulated as the minimization of the cost function L penalized by a new regularizer which is a function of w only. Indeed, since the minimization over v̄ only involves the penalty term and the constraints, we can rewrite (3) as

 min_{w∈ℝ^p} L(w) + λ Ω_G(w), (4)

with

 Ω_G(w) = min_{v̄∈V_G} Σ_{g∈G} d_g ‖v^g‖ s.t. w = Σ_{g∈G} v^g. (5)

We call this penalty Ω_G the latent group Lasso penalty, in reference to its formulation as a group Lasso over latent variables. When the groups do not overlap and form a partition, there exists a unique decomposition of w as Σ_{g∈G} v^g with supp(v^g) ⊆ g, namely, v^g = w_g for all g. In that case, both the group Lasso penalty (1) and the latent group Lasso penalty (5) are equal and boil down to the same standard group Lasso. When some groups overlap, however, the two penalties differ. For example, Figure 2 shows the unit ball of both norms in ℝ³ with groups G = {{1,2},{2,3}}. The pillow-shaped ball of ‖·‖_{ℓ1/ℓ2} has four singularities corresponding to cases where either only w_1 or only w_3 is nonzero. By contrast, Ω_G has two circular sets of singularities corresponding to cases where only (w_1,w_2) or only (w_2,w_3) is nonzero. For comparison, we also show the unit ball when we consider the partition {{1,2},{3}}, in which case both norms coincide: singularities appear for w_1 = w_2 = 0 or w_3 = 0.
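For the two-group example just described (g1 = {1,2} and g2 = {2,3} in ℝ³), the only freedom in a decomposition w = v¹ + v² is how the shared coordinate w_2 is split between the two latent vectors, so the penalty (5) reduces to a one-dimensional minimization. A minimal numerical sketch, with illustrative unit weights and test vector:

```python
import numpy as np
from scipy.optimize import minimize_scalar

d1 = d2 = 1.0  # unit group weights (illustrative)

def omega(w):
    """Latent group Lasso penalty (5) for groups g1 = {1,2}, g2 = {2,3} in R^3.
    Any decomposition is v1 = (w1, t, 0), v2 = (0, w2 - t, w3);
    minimizing over the scalar split t gives the penalty value."""
    obj = lambda t: d1 * np.hypot(w[0], t) + d2 * np.hypot(w[1] - t, w[2])
    return minimize_scalar(obj, bounds=(-10.0, 10.0), method='bounded').fun

w = np.array([1.0, 0.5, 0.0])          # supported on g1 only
print(omega(w), np.hypot(w[0], w[1]))  # the penalty equals d1 * ||w_g1|| here
```

For this vector, supported on a single group, the optimal split puts all of w_2 in the first latent vector, consistent with the union-of-groups supports discussed above.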

To summarize, we enforce the prior we have on the support of w by introducing new variables in the optimization problem (3). The constraint we impose is that some groups should be shrunk to zero, and that a covariate should have zero weight in ŵ if all the groups to which it belongs are shrunk to zero. Equivalently, the support of ŵ should be a union of groups. This new problem can be rewritten as a classical minimization of the empirical risk, penalized by the particular penalty Ω_G defined in (5). This penalty itself associates to each vector w the optimal value of a particular constrained optimization problem. While this formulation may not be the most intuitive, it allows us to reframe the problem in the classical context of penalized empirical risk minimization. In the remainder of this article, we investigate in more detail the latent group Lasso penalty Ω_G, both theoretically and empirically.

### 3.1 Related work

The idea of decomposing a parameter vector into latent components and of regularizing each of these components separately has appeared recently in the literature independently of this work. In particular, Jalali et al. (2010) proposed to consider such a decomposition in the case of multi-task learning, where each task-specific parameter vector is decomposed into a sum of two vectors, one regularized with an ℓ1 norm and the other regularized with an ℓ1/ℓ∞ norm, so as to share its sparsity pattern with all other tasks. The norm considered in that work can be interpreted as a special case of the latent group Lasso, where the set of groups consists of all singletons together with the groups of coefficients associated with the same feature across tasks. The decomposition into latent variables is even more natural in the context of the work of Chen et al. (2011), Candes et al. (2009), or Agarwal et al. (2011) on robust PCA and matrix decomposition, in which a matrix is decomposed into a low-rank matrix regularized by the trace norm and a sparse or column-sparse matrix regularized by an ℓ1 or group ℓ1/ℓ2 norm.

Another type of decomposition related to this norm is the idea of a cover of the support. In particular, it is interesting to consider the ℓ0 counterpart of this norm, which could be written as

 Ω_0(w) = min_{v̄∈V_G} Σ_{g∈G} d_g 1{v^g ≠ 0} s.t. w = Σ_{g∈G} v^g.

Ω_0(w) can then be interpreted as the value of a min weighted set-cover of the support of w. This penalization has been considered in Huang et al. (2009) under the name block coding, since, indeed, when d_g is interpreted as a coding length, this penalization induces a coding length on all sets, which can be interpreted in the MDL framework.

More generally, one could consider ℓ1/ℓq penalties, for any q ≥ 1, by replacing the ℓ2 norm used in the definition of the latent group Lasso penalty (5) by an ℓq norm. It should be noted then that, unlike the support, the definition of group-support we introduce in Section 5 changes if one considers the latent group Lasso with a different ℓq norm, and even if only the weights change (we discuss the choice of weights in detail in Section 8).

Obozinski and Bach (2011) consider the case where the weights are given by a set function and show that the resulting norm is then the tightest convex relaxation of the block-coding scheme of Huang et al. (2009). They also show that when q = ∞ and the weights are an appropriate power of a submodular function, the resulting norm is the one that naturally extends the norm considered by Bach (2010).

It should be noted that recent theoretical analyses of the norm studied in this paper have been proposed by Percival (2011) and Maurer and Pontil (2011). They adopt points of view or focus on questions that are complementary to those of this work; we discuss them in Section 7.3.

## 4 Some properties of the latent group Lasso penalty

In this section we study a few properties of the latent group Lasso penalty Ω_G, which will be particularly useful to prove consistency results in the next section. After showing that Ω_G is a valid norm, we compute its dual norm and provide two variational formulas. We then characterize its unit ball as the convex hull of basic disks, and compute its subdifferential. When used as a penalty for statistical inference, we further reinterpret it in the context of covariate duplication and multiple kernel learning. To lighten notations, in the rest of the paper we simply denote Ω_G by Ω.

### 4.1 Basic properties

We first analyze the decomposition of a vector w as w = Σ_{g∈G} v^g induced by (5). We denote by V(w) the set of m-tuples of vectors v̄ ∈ V_G that are solutions to the optimization problem in (5), i.e., which satisfy

 w = Σ_{g∈G} v^g and Ω(w) = Σ_{g∈G} d_g ‖v^g‖.

We first have that:

###### Lemma 1

For any w ∈ ℝ^p, the set V(w) is non-empty, compact and convex.

Proof  The objective of problem (5) is a proper closed convex function with no direction of recession. Lemma 1 is then the consequence of classical results in convex analysis, such as Theorem 27.2 page 265 of Rockafellar (1997).
The following statement shows that, unsurprisingly, we can regard Ω as a classical norm-based penalty.

###### Lemma 2

Ω is a norm.

Proof  Positive homogeneity and positive definiteness hold trivially. We show the triangle inequality. Consider w, w′ ∈ ℝ^p, and let v̄ and v̄′ be respectively optimal decompositions of w and w′, so that Ω(w) = Σ_{g∈G} d_g ‖v^g‖ and Ω(w′) = Σ_{g∈G} d_g ‖v′^g‖, with supp(v^g) ⊆ g and supp(v′^g) ⊆ g. Since (v^g + v′^g)_{g∈G} is a (a priori non-optimal) decomposition of w + w′, we clearly have:

 Ω(w+w′) ≤ Σ_{g∈G} d_g ‖v^g + v′^g‖ ≤ Σ_{g∈G} d_g ( ‖v^g‖ + ‖v′^g‖ ) = Ω(w) + Ω(w′).

### 4.2 Dual norm and variational characterizations

Ω being a norm by Lemma 2, we can consider its Fenchel dual norm Ω*, defined by:

 ∀α∈Rp,Ω∗(α)=supw∈Rp{w⊤α|Ω(w)≤1}. (6)

The following lemma shows that Ω* has a simple closed-form expression:

###### Lemma 3 (dual norm)

The Fenchel dual norm Ω* of Ω satisfies:

 ∀α ∈ ℝ^p, Ω*(α) = max_{g∈G} d_g^{-1} ‖α_g‖.

Proof  We start from the definition of the dual norm (6) and compute:

 Ω*(α) = max_{w∈ℝ^p} w⊤α s.t. Ω(w) ≤ 1
 = max_{w∈ℝ^p, v̄∈V_G} w⊤α s.t. w = Σ_{g∈G} v^g, Σ_{g∈G} d_g ‖v^g‖ ≤ 1
 = max_{v̄∈V_G} Σ_{g∈G} v^g⊤α s.t. Σ_{g∈G} d_g ‖v^g‖ ≤ 1
 = max_{v̄∈V_G, η∈ℝ^m₊} Σ_{g∈G} v^g⊤α s.t. Σ_{g∈G} η_g ≤ 1, ∀g∈G, d_g ‖v^g‖ ≤ η_g
 = max_{η∈ℝ^m₊} Σ_{g∈G} η_g d_g^{-1} ‖α_g‖ s.t. Σ_{g∈G} η_g ≤ 1
 = max_{g∈G} d_g^{-1} ‖α_g‖.

The second equality is due to the fact that:

 { w ∈ ℝ^p : Ω(w) ≤ 1 } = { Σ_{g∈G} v^g : v̄ ∈ V_G, Σ_{g∈G} d_g ‖v^g‖ ≤ 1 },

and the fifth results from solving explicitly the maximization in v̄ in the fourth line.
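The closed form of Lemma 3 is easy to sanity-check numerically: a vector supported on the maximizing group, rescaled so that its single-group decomposition has penalty 1, attains the value max_g d_g^{-1}‖α_g‖ in the sup of definition (6). A sketch with illustrative groups and weights:

```python
import numpy as np

rng = np.random.default_rng(0)

# illustrative overlapping groups over R^5, with weights
groups = [(0, 1, 2), (2, 3), (3, 4)]
d = np.array([1.0, 1.5, 2.0])

def dual_norm(alpha):
    """Closed form of Lemma 3: Omega*(alpha) = max_g d_g^{-1} ||alpha_g||."""
    return max(np.linalg.norm(alpha[list(g)]) / d[k] for k, g in enumerate(groups))

alpha = rng.normal(size=5)
scores = [np.linalg.norm(alpha[list(g)]) / d[k] for k, g in enumerate(groups)]
k_star = int(np.argmax(scores))
g = list(groups[k_star])

# vector supported on the maximizing group, scaled so that its single-group
# decomposition has penalty d_g * ||.|| equal to 1, hence Omega(atom) <= 1
atom = np.zeros(5)
atom[g] = alpha[g] / (d[k_star] * np.linalg.norm(alpha[g]))
assert d[k_star] * np.linalg.norm(atom[g]) <= 1 + 1e-12

# it attains the closed-form value in the sup of definition (6)
print(atom @ alpha, dual_norm(alpha))
```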

###### Remark 4

Remembering that the infimal convolution of two convex functions f and g is defined as (f □ g)(w) = inf_v { f(v) + g(w−v) } (see Rockafellar, 1997), it can be noted that Ω is the infimal convolution of the functions ω_g for g ∈ G, defined as ω_g(v) = d_g ‖v‖ if supp(v) ⊆ g and ω_g(v) = +∞ otherwise. One of the main properties motivating the notion of infimal convolution is the fact that (f □ g)* = f* + g*, where * denotes Fenchel-Legendre conjugation. Several of the properties of Ω can be derived from this interpretation, but we will however show them directly.

The norm Ω was initially defined as the optimal value of an optimization problem in (5). From the characterization of Ω* we can easily derive a second variational formulation:

###### Lemma 5 (second variational formulation)

For any , we have

 Ω(w) = max_{α∈ℝ^p} α⊤w s.t. ‖α_g‖ ≤ d_g for all g ∈ G. (7)

Proof  Since the bi-dual of a norm is the norm itself, we have the variational form

 Ω(w)=maxα∈Rpα⊤ws.t.Ω∗(α)≤1. (8)

Plugging the characterization of Ω* from Lemma 3 into this equation finishes the proof.
For any w ∈ ℝ^p, we denote by A(w) the set of vectors α in the dual unit sphere which solve the second variational formulation (7) of Ω(w), namely:

 A(w)Δ=argmaxα∈Rp,Ω∗(α)≤1α⊤w. (9)

With a little more effort, we can also derive a third variational representation of the norm Ω, which will be useful in the consistency proofs of Section 7:

###### Lemma 6 (third variational formulation)

For any , we also have

 Ω(w) = ½ min_{λ∈ℝ^m₊} [ Σ_{i=1}^p w_i² / ( Σ_{g∋i} λ_g ) + Σ_{g∈G} d_g² λ_g ]. (10)

Proof  For any w ∈ ℝ^p, we can rewrite the solution of the constrained optimization problem of the second variational formulation (7) as the saddle point of the Lagrangian:

 Ω(w) = min_{λ∈ℝ^m₊} max_{α∈ℝ^p} w⊤α − ½ Σ_{g∈G} λ_g ( ‖α_g‖² − d_g² ).

Optimizing in α leads to α being a solution of w_i = α_i Σ_{g∋i} λ_g, which (distinguishing the cases Σ_{g∋i} λ_g = 0 and Σ_{g∋i} λ_g > 0) yields problem (10) when replacing α by its optimal value.
Let us denote by Λ(w) the set of solutions to the third variational formulation (10). Note that there is not necessarily a unique solution to (10), because the Hessian of the objective function is not always positive definite (see Lemma 48 in Appendix D for a characterization of cases in which positive definiteness can be guaranteed). For any w ∈ ℝ^p, we now have three variational formulations for Ω(w), namely (5), (7) and (10), with respective solution sets V(w), A(w) and Λ(w). The following lemma shows that V(w) is in bijection with Λ(w).
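The agreement between the definition (5) and the third variational formulation (10) can be checked numerically on the two-group example g1 = {1,2}, g2 = {2,3} in ℝ³, where (5) reduces to a one-dimensional minimization over the split of the shared coordinate; groups, weights and the test vector are illustrative:

```python
import numpy as np
from scipy.optimize import minimize, minimize_scalar

# two overlapping groups g1 = {1,2}, g2 = {2,3} in R^3, unit weights (illustrative)
w = np.array([1.0, 0.5, 0.3])
d = np.array([1.0, 1.0])

# Omega(w) from definition (5): only the split t of the shared coordinate is free
split = lambda t: d[0] * np.hypot(w[0], t) + d[1] * np.hypot(w[1] - t, w[2])
omega_def = minimize_scalar(split, bounds=(-5.0, 5.0), method='bounded').fun

# Omega(w) from the third variational formulation (10):
# coordinate 1 is covered by lambda_1 only, coordinate 2 by both, coordinate 3 by lambda_2
def obj(lam):
    l1, l2 = lam
    return 0.5 * (w[0] ** 2 / l1 + w[1] ** 2 / (l1 + l2) + w[2] ** 2 / l2
                  + d[0] ** 2 * l1 + d[1] ** 2 * l2)

omega_var = minimize(obj, x0=[1.0, 1.0], bounds=[(1e-9, None)] * 2).fun
print(omega_def, omega_var)
```

Up to the solvers' numerical tolerance, the two values coincide, as formulation (10) asserts.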

###### Lemma 7

Let w ∈ ℝ^p. The mapping

 λ : V_G → ℝ^m, v̄ ↦ λ(v̄) = ( d_g^{-1} ‖v^g‖ )_{g∈G} (11)

is a bijection from V(w) to Λ(w). For any λ ∈ Λ(w), the only v̄ ∈ V(w) that satisfies λ(v̄) = λ is given by v^g = λ_g α_g, where α is any vector of A(w).

Proof  To express the penalty as a minimization problem, let us use the following basic equality, valid for any x ≥ 0:

 x = ½ min_{η≥0} [ x²/η + η ],

where the unique minimum in η is reached for η = x. From this we deduce that, for any v ∈ ℝ^p and d > 0:

 d ‖v‖ = ½ min_{η≥0} d [ ‖v‖²/η + η ] = ½ min_{λ′≥0} [ ‖v‖²/λ′ + d² λ′ ],

where the unique minimum in the last term is attained for λ′ = d^{-1} ‖v‖. Using definition (5) we can therefore write Ω(w) as the optimal value of a jointly convex optimization problem in v̄ and λ′:

 Ω(w) = min_{v̄∈V_G, Σ_{g∈G} v^g = w, λ′∈ℝ^m₊} ½ Σ_{g∈G} [ ‖v^g‖²/λ′_g + d_g² λ′_g ], (12)

where, for any v̄, the minimum in λ′ is uniquely attained for λ′ = λ(v̄) defined in (11). By definition of V(w), the set of solutions of (12) is therefore exactly the set of pairs of the form (v̄, λ(v̄)) for v̄ ∈ V(w). Let us now isolate the minimization over λ′ in (12). To incorporate the constraint Σ_{g∈G} v^g = w, we rewrite (12) with a Lagrangian:

 Ω(w) = min_{λ′∈ℝ^m₊} max_{α′∈ℝ^p} min_{v̄∈V_G} ½ Σ_{g∈G} [ ‖v^g‖²/λ′_g + d_g² λ′_g ] + α′⊤( w − Σ_{g∈G} v^g ).

The inner minimization in v̄, for fixed λ′ and α′, yields v^g = λ′_g α′_g. The constraint Σ_{g∈G} v^g = w therefore implies that, after optimization in v̄ and α′, we have w_i = α′_i Σ_{g∋i} λ′_g, and as a consequence that α′_i = w_i / Σ_{g∋i} λ′_g. A small computation now shows that, after optimization in v̄ and α′, for a fixed λ′, we have:

 Σ_{g∈G} ‖v^g‖²/λ′_g = Σ_{i=1}^p Σ_{g∋i} (v^g_i)²/λ′_g = Σ_{i=1}^p Σ_{g∋i} λ′_g w_i² / ( Σ_{h∋i} λ′_h )² = Σ_{i=1}^p w_i² / ( Σ_{h∋i} λ′_h ).

Plugging this into (12), we see that after optimization in v̄, the optimization problem in λ′ is exactly (10), which by definition admits Λ(w) as its set of solutions, while we showed that (12) admits the pairs (v̄, λ(v̄)) for v̄ ∈ V(w) as its set of solutions. This shows that λ(V(w)) = Λ(w), and since for any λ′ ∈ Λ(w) there exists a unique v̄ ∈ V(w) that satisfies λ(v̄) = λ′, namely v^g = λ′_g α′_g with α′_i = w_i / Σ_{h∋i} λ′_h, the mapping λ is indeed a bijection from V(w) to Λ(w). Finally, we noted in the proof of Lemma 6 that for any α ∈ A(w) and λ ∈ Λ(w), w_i = α_i Σ_{g∋i} λ_g. This shows that the unique v̄ associated to a λ ∈ Λ(w) can equivalently be written v^g = λ_g α_g for any α ∈ A(w), which concludes the proof of Lemma 7.

### 4.3 Characterization of the unit ball of Ω as a convex hull

Figure 2(b) suggests visually that the unit ball of is just the convex hull of a horizontal disk and a vertical one. This impression is correct and formalized more generally in the following lemma.

###### Lemma 8

For any group g ∈ G, define the hyper-disk D_g = { w ∈ ℝ^p : supp(w) ⊆ g, d_g ‖w‖ ≤ 1 }. Then the unit ball of Ω is the convex hull of the union of hyper-disks, conv( ⋃_{g∈G} D_g ).

Proof  Let w ∈ conv( ⋃_{g∈G} D_g ); then there exist t_g ≥ 0 with Σ_{g∈G} t_g ≤ 1 and α^g ∈ D_g, for all g ∈ G, such that w = Σ_{g∈G} t_g α^g. Letting v^g = t_g α^g as a suboptimal decomposition of w, we easily get

 Ω(w) ≤ Σ_{g∈G} d_g ‖t_g α^g‖ ≤ Σ_{g∈G} t_g ≤ 1.

Conversely, if Ω(w) ≤ 1, then there exists v̄ ∈ V(w) such that Σ_{g∈G} d_g ‖v^g‖ ≤ 1, and we obtain w as a combination Σ_{g∈G} t_g α^g with α^g ∈ D_g and (t_g)_{g∈G} in the simplex, by letting t_g = d_g ‖v^g‖ and α^g = v^g / ( d_g ‖v^g‖ ) whenever v^g ≠ 0 (and α^g = 0 otherwise).

It should be noted that this lemma shows that Ω is the gauge of the convex hull of the disks D_g; in other words, in the terminology introduced by Chandrasekaran et al. (2010), the unit ball of Ω is the unit ball of the atomic norm associated with the union of disks ⋃_{g∈G} D_g.

### 4.4 Subdifferential of Ω

The subdifferential of Ω at w is, by definition:

 ∂Ω(w) ≜ { s ∈ ℝ^p | ∀h ∈ ℝ^p, Ω(w+h) − Ω(w) ≥ s⊤h }.

It is a standard result of convex analysis (resulting e.g. from the characterization of the subdifferential in Theorem 23.5, p. 218, Rockafellar, 1997) that ∂Ω(w) = A(w) for all w ∈ ℝ^p, where A(w) was defined in (9).

We can now show a simple relationship between the decomposition of a vector induced by , and the subdifferential of .

###### Lemma 9

For any v̄ ∈ V(w) and any α ∈ A(w), we have α⊤v^g = d_g ‖v^g‖ for all g ∈ G.

Proof  Let v̄ ∈ V(w) and α ∈ A(w). Since Ω*(α) ≤ 1, we have ‖α_g‖ ≤ d_g, which implies α⊤v^g ≤ d_g ‖v^g‖. On the other hand, we also have α⊤w = Ω(w), so that Σ_{g∈G} ( d_g ‖v^g‖ − α⊤v^g ) = 0, which is a sum of non-negative terms. We conclude that, for all g ∈ G, α⊤v^g = d_g ‖v^g‖, which yields the result.
We can deduce a general property of all decompositions of a given vector:

###### Corollary 10

Let w ∈ ℝ^p. For all v̄, v̄′ ∈ V(w) and for all g ∈ G, we have v^g = 0, or v′^g = 0, or there exists t > 0 such that v^g = t v′^g.

Proof  By Lemma 9, if v^g ≠ 0 and v′^g ≠ 0, then for any α ∈ A(w) we have α_g = d_g v^g / ‖v^g‖ = d_g v′^g / ‖v′^g‖, so that v^g = ( ‖v^g‖ / ‖v′^g‖ ) v′^g.

### 4.5 Ω as a regularizer

We consider in this section the situation where Ω is used as a regularizer for an empirical risk minimization problem. Specifically, let us consider a convex differentiable loss function ℓ, such as the squared error ℓ(y, y′) = (y − y′)² for regression problems, or the logistic loss for classification problems where y ∈ {−1, 1}. Given a set of training pairs (x_i, y_i) ∈ ℝ^p × ℝ, i = 1, …, n, we define the empirical risk L(w) = (1/n) Σ_{i=1}^n ℓ(y_i, w⊤x_i) and consider the regularized empirical risk minimization problem

 minw∈RpL(w)+λΩ(w). (13)

Its solutions are characterized by optimality conditions from subgradient calculus:

###### Lemma 11

A vector w ∈ ℝ^p is a solution of (13) if and only if one of the following equivalent conditions is satisfied:

(a) −∇L(w) ∈ λ ∂Ω(w);

(b) w can be decomposed as w = Σ_{g∈G} v^g for some v̄ ∈ V_G with, for all g ∈ G:

 either v^g ≠ 0 and ∇_g L(w) = −λ d_g v^g / ‖v^g‖, or v^g = 0 and d_g^{-1} ‖∇_g L(w)‖ ≤ λ.

Proof  (a) is immediate from subgradient calculus and the fact that ∂Ω(w) = A(w) (see Section 4.4). (b) is immediate from Lemma 9.
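In the simplest, non-overlapping case, where Ω reduces to the ordinary group Lasso, conditions of the type of Lemma 11(b) can be verified on the closed-form minimizer of L(w) = ½‖w − z‖² plus the penalty, which is group-wise soft-thresholding; the group layout and data below are illustrative:

```python
import numpy as np

# partitioned groups (no overlap): Omega is then the ordinary group Lasso penalty
groups = [(0, 1), (2, 3)]
d = np.array([1.0, 1.0])
lam = 0.7
z = np.array([2.0, 1.0, 0.2, 0.1])   # L(w) = 0.5 * ||w - z||^2, so grad L = w - z

# closed-form minimizer of L + lam * Omega: group-wise soft-thresholding
w = np.zeros_like(z)
for k, g in enumerate(groups):
    g = list(g)
    nz = np.linalg.norm(z[g])
    if nz > lam * d[k]:
        w[g] = (1 - lam * d[k] / nz) * z[g]

grad = w - z
for k, g in enumerate(groups):
    g = list(g)
    nw = np.linalg.norm(w[g])
    if nw > 0:   # active group: gradient equals -lam * d_g * w_g / ||w_g||
        assert np.allclose(grad[g], -lam * d[k] * w[g] / nw)
    else:        # inactive group: dual-norm condition d_g^{-1} ||grad_g|| <= lam
        assert np.linalg.norm(grad[g]) / d[k] <= lam + 1e-12
print(w)
```

The first group survives thresholding and satisfies the alignment condition; the second is shrunk to zero and satisfies the dual-norm bound.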

### 4.6 Covariate duplication

In this section we show that empirical risk minimization penalized by Ω is equivalent to a regular group Lasso in a higher-dimensional covariate space obtained by duplication of the covariates belonging to several groups. This has implications for the practical implementation of Ω as a regularizer and for its generalization to non-linear classification.

More precisely, let us consider the duplication operator:

 Rp→R∑g∈G|g|x↦~x=⨁g∈G(xi)i∈g. (14)

In other words, x̃ is obtained by stacking the restrictions of x to each group on top of each other, resulting in a ( Σ_{g∈G} |g| )-dimensional vector. Note that any coordinate of x that occurs in several groups will be duplicated as many times in x̃. Similarly, for a tuple v̄ ∈ V_G, let us denote by ṽ the ( Σ_{g∈G} |g| )-dimensional vector obtained by stacking the restrictions of the successive v^g to their corresponding groups on top of each other (resulting in no loss of information, since v^g is null outside of g). This operation is illustrated in (18) below. Then, for any w and v̄ ∈ V_G such that w = Σ_{g∈G} v^g, we easily get, for any x:

 w⊤x=∑g∈Gvg⊤x=~v⊤~x. (15)

Consider now a learning problem with training points x_1, …, x_n, where we minimize over w a penalized risk function that depends on w only through inner products with the training points, i.e., of the form

 minw∈Rp~L(Xw)+λΩ(w), (16)

where X is the matrix of training points and Xw is therefore the vector of inner products of w with the training points. Many problems, in particular those considered in Section 4.5, have this form. By definition of Ω, we can rewrite (16) as

 min_{w∈ℝ^p, v̄∈V_G, Σ_{g∈G} v^g = w} L̃(Xw) + λ Σ_{g∈G} d_g ‖v^g‖,

which by (15) is equivalent to

 min_{ṽ∈ℝ^{Σ_{g∈G}|g|}} L̃(X̃ṽ) + λ Σ_{g∈G} d_g ‖ṽ_g‖, (17)

where X̃ is the matrix of duplicated training points, and ṽ_g refers to the restriction of ṽ to the coordinates of group g. In other words, we have eliminated w from the optimization problem and reformulated it as a simple group Lasso problem, without overlap between groups, in an expanded space of dimension Σ_{g∈G} |g|.

On the example of Figure 1, with three overlapping groups, this duplication trick can be rewritten as follows:

 (18)

This formulation as a classical group Lasso problem in an expanded space has several implications, detailed in the next two sections. On the one hand, it makes it possible to extend the penalty to non-linear functions by considering infinite-dimensional duplicated spaces endowed with positive definite kernels (Section 4.7). On the other hand, it leads to straightforward implementations by borrowing classical group Lasso implementations after feature duplication (Section 4.8). Note, however, that the theoretical results we will show in Section 7, on the consistency of the proposed estimator, are not mere consequences of existing results for the classical group Lasso, because, in the case we consider, not only is the design matrix rank-deficient, but so are all of its restrictions to sets of variables corresponding to any union of overlapping groups.
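The duplication trick makes any standard (non-overlapping) group Lasso solver applicable. The sketch below solves a least-squares instance of (17) by proximal gradient descent, with group soft-thresholding as the proximal operator, then maps the latent solution back to w; the data, groups and regularization level are illustrative and not taken from the paper's experiments:

```python
import numpy as np

rng = np.random.default_rng(1)

p, n = 6, 100
groups = [(0, 1, 2), (2, 3, 4), (4, 5)]   # overlapping groups (illustrative)
d = np.array([1.0, 1.0, 1.0])
lam = 0.05

# ground truth supported on the first group only (illustrative)
w_true = np.zeros(p)
w_true[[0, 1, 2]] = [1.0, -2.0, 1.5]
X = rng.normal(size=(n, p))
y = X @ w_true + 0.01 * rng.normal(size=n)

# duplicated design (14): stack the columns of each group side by side
Xd = np.hstack([X[:, list(g)] for g in groups])
starts = np.cumsum([0] + [len(g) for g in groups])
step = n / np.linalg.norm(Xd, 2) ** 2      # 1/L for the smooth part below

v = np.zeros(Xd.shape[1])
for _ in range(5000):
    v = v - step * Xd.T @ (Xd @ v - y) / n  # gradient step on (1/2n)||Xd v - y||^2
    for k in range(len(groups)):            # prox step: group soft-thresholding
        blk = slice(starts[k], starts[k + 1])
        nrm = np.linalg.norm(v[blk])
        thr = step * lam * d[k]
        v[blk] = 0.0 if nrm <= thr else (1 - thr / nrm) * v[blk]

# map the latent solution back: w is the sum of the latent vectors
w_hat = np.zeros(p)
for k, g in enumerate(groups):
    w_hat[list(g)] += v[starts[k]:starts[k + 1]]
print(np.round(w_hat, 2))
```

Note that the duplicated design Xd is rank-deficient by construction (shared columns appear twice), which is precisely the situation discussed above; proximal gradient descent nevertheless converges on this convex problem.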

### 4.7 Multiple Kernel Learning formulations

Given the reformulation in a duplicated variable space presented above, we provide in this section a multiple kernel learning (MKL) interpretation of the regularization by our norm, and show that it naturally extends the case of disjoint groups.

To introduce it, we first return to the concept of MKL (Lanckriet et al., 2004; Bach et al., 2004), which can be presented as follows. If one considers a learning problem of the form

$$H(K) = \min_{w\in\mathbb{R}^p}\ \tilde{L}(Xw) + \frac{\lambda}{2}\|w\|^2, \quad \text{with } K = XX^\top, \qquad (19)$$

then by the representer theorem the optimal value of the objective only depends on the input data through the Gram matrix $K = XX^\top$, which can therefore be replaced by any positive definite (p.d.) kernel matrix between the data points. Moreover, $H$ can be shown to be a convex function of $K$ (Lanckriet et al., 2004). Given a collection of p.d. kernels $K_1,\dots,K_k$, any convex combination $\sum_i \eta_i K_i$ with $\eta_i \geq 0$ and $\sum_i \eta_i = 1$ is itself a p.d. kernel. The multiple kernel learning problem consists in finding the best such combination in the sense of minimizing $H$:

$$\min_{\eta\in\mathbb{R}^k_+}\ H\Big(\sum_i \eta_i K_i\Big) \quad \text{s.t.} \quad \sum_i \eta_i = 1. \qquad (20)$$

The kernels considered in the linear combination above are typically reproducing kernels associated with different reproducing kernel Hilbert spaces (RKHS).
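For the squared loss, $H$ has a closed form in terms of the Gram matrix, which makes both its dependence on $K$ and its convexity easy to check numerically. The sketch below assumes $\tilde{L}(Xw) = \frac{1}{2}\|y - Xw\|^2$ (so that $H(K) = \frac{\lambda}{2} y^\top (K + \lambda I)^{-1} y$, the standard ridge-regression dual value); the function name and toy data are ours.

```python
import numpy as np

def H(K, y, lam):
    """Optimal value of min_w 0.5*||y - Xw||^2 + (lam/2)*||w||^2, expressed
    through the Gram matrix K = X X^T (standard ridge regression identity)."""
    n = K.shape[0]
    return 0.5 * lam * y @ np.linalg.solve(K + lam * np.eye(n), y)

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 3))
y = rng.standard_normal(8)
lam = 0.5

# Sanity check against the primal solution w* = (X^T X + lam I)^{-1} X^T y:
w = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
primal = 0.5 * np.sum((y - X @ w) ** 2) + 0.5 * lam * w @ w
assert np.isclose(H(X @ X.T, y, lam), primal)

# H is convex in K: midpoint convexity on two p.d. kernels.
K1, K2 = X @ X.T, np.eye(8)
assert H(0.5 * (K1 + K2), y, lam) <= 0.5 * (H(K1, y, lam) + H(K2, y, lam)) + 1e-12
```

The first assertion verifies that $H$ indeed depends on the data only through $K$, which is what licenses replacing $XX^\top$ by an arbitrary p.d. kernel in (20).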

Bach et al. (2004) showed that problems regularized by a squared group-$\ell_1/\ell_2$-norm and multiple kernel learning are intrinsically related. More precisely, they show that, if $\mathcal{G}$ forms a partition of $\{1,\dots,p\}$, letting problems $(P_1)$ and $(P_2)$ be defined through

$$(P_1): \quad \min_{w\in\mathbb{R}^p}\ \tilde{L}(Xw) + \frac{\lambda}{2}\Big(\sum_{g\in\mathcal{G}} d_g \|w_g\|\Big)^2,$$

$$(P_2): \quad \min_{\eta\in\mathbb{R}^m_+}\ H\Big(\sum_{g\in\mathcal{G}} \eta_g K_g\Big) \quad \text{s.t.} \quad \sum_{g\in\mathcal{G}} d_g^2 \eta_g = 1,$$

with $K_g = X_g X_g^\top$, then $(P_1)$ and $(P_2)$ are equivalent, in the sense that the optimal values of both objectives are equal, with a bijection between the optimal solutions. Note that such an equivalence does not hold if the groups overlap.

Now turning to the norm we introduced, using the same derivation as the one leading from problem (16) to problem (17), we can show that minimizing $\tilde{L}(Xw) + \frac{\lambda}{2}\Omega(w)^2$ w.r.t. $w$ is equivalent to minimizing $\tilde{L}(\tilde{X}\tilde{v}) + \frac{\lambda}{2}\big(\sum_{g\in\mathcal{G}} d_g\|\tilde{v}_g\|\big)^2$ w.r.t. $\tilde{v}$ and setting $w = \sum_{g\in\mathcal{G}} \tilde{v}_g$. At this point, the result of Bach et al. (2004) applied to the latter formulation in the space of duplicates shows that it is equivalent to the multiple kernel learning problem

$$\min_{\eta\in\mathbb{R}^m_+}\ H\Big(\sum_{g\in\mathcal{G}} \eta_g K_g\Big) \quad \text{s.t.} \quad \sum_{g\in\mathcal{G}} d_g^2 \eta_g = 1, \quad \text{with } K_g = X_g X_g^\top. \qquad (21)$$

This shows that minimizing $\tilde{L}(Xw) + \frac{\lambda}{2}\Omega(w)^2$ is equivalent to the MKL problem above. Compared with the original result of Bach et al. (2004), it should be noted that, because of the duplication mechanism implicit in our norm, the original sets are no longer required to be disjoint. In fact, this derivation shows that, in some sense, the norm we introduced is the one that corresponds to the most natural extension of multiple kernel learning to the case of overlapping groups.
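The group kernels in (21) are tied to the duplicated design of Section 4.6: with uniform weights, $\sum_g K_g = \tilde{X}\tilde{X}^\top$. A short numerical check, with hypothetical toy groups:

```python
import numpy as np

# Hypothetical overlapping groups over p = 5 features.
groups = [[0, 1, 2], [2, 3], [3, 4]]
rng = np.random.default_rng(1)
X = rng.standard_normal((6, 5))

# Group kernels K_g = X_g X_g^T built from the restricted design matrices.
K = [X[:, g] @ X[:, g].T for g in groups]

# With uniform weights eta_g = 1, the combined MKL kernel coincides with
# the Gram matrix of the duplicated design X_tilde = [X_g1, X_g2, ...]:
X_tilde = np.hstack([X[:, g] for g in groups])
assert np.allclose(sum(K), X_tilde @ X_tilde.T)
```

This is why the equivalence of Bach et al. (2004), applied in the duplicated space where the groups are disjoint by construction, yields (21) even when the original groups overlap.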

Conversely, it should be noted that, while one of the applications of multiple kernel learning is data fusion, which combines kernels corresponding to functions of intrinsically different input variables, MKL can also be used to select and combine elements from different function spaces defined on the same input. In general, these function spaces are not orthogonal and are typically not even disjoint. In that case, the MKL formulation corresponds implicitly to using the norm presented in this paper.

Finally, another MKL formulation corresponding to the norm is possible. If we denote by $K_i$ the rank-one kernel corresponding to the $i$th feature, then we can write $K_g = \sum_{i\in g} K_i$. If $A$ is the binary matrix defined by $A_{ig} = 1_{\{i\in g\}}$, and $Z$ is the image of the canonical simplex of $\mathbb{R}^m$ by the linear transformation associated with $A$, then with $\zeta$ obtained through $\zeta = A\eta$, the MKL problem above can be reformulated as

$$\min_{\zeta\in Z}\ H\Big(\sum_{i=1}^p \zeta_i K_i\Big). \qquad (22)$$

This last formulation can be viewed as the structured MKL formulation associated with the norm (see Bach et al., 2011, sec. 1.5.4). It is clearly more interesting computationally when the number of feature kernels $p$ is small compared to the number of groups $m$. It is, however, restricted to a particular form of kernel for each group, which has to be a sum of feature kernels $K_i$. In particular, it does not allow for interactions among features within a group.
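The reparameterization behind (22) rests on the identity $\sum_g \eta_g K_g = \sum_i \zeta_i K_i$ for $\zeta = A\eta$, which follows from $K_g = \sum_{i\in g} K_i$ by exchanging the two sums. A sketch with hypothetical toy groups; the variable names follow the text:

```python
import numpy as np

groups = [[0, 1, 2], [2, 3], [3, 4]]
p, m = 5, len(groups)
rng = np.random.default_rng(2)
X = rng.standard_normal((6, p))

# Rank-one feature kernels K_i = x_i x_i^T (x_i = i-th column of X).
K_feat = [np.outer(X[:, i], X[:, i]) for i in range(p)]

# Binary matrix A with A[i, j] = 1 iff feature i belongs to group j.
A = np.zeros((p, m))
for j, g in enumerate(groups):
    A[g, j] = 1.0

eta = rng.random(m)   # nonnegative group weights
zeta = A @ eta        # induced feature weights, zeta_i = sum_{g : i in g} eta_g

# sum_g eta_g K_g equals sum_i zeta_i K_i, since K_g = sum_{i in g} K_i:
K_groups = sum(eta[j] * X[:, g] @ X[:, g].T for j, g in enumerate(groups))
assert np.allclose(K_groups, sum(zeta[i] * K_feat[i] for i in range(p)))
```

Note that each feature shared by several groups accumulates the weights of all groups containing it, which is exactly the role of the linear map $A$.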

In the two formulations above, it is obviously possible to replace the linear kernel used in the derivation by a non-linear kernel. In the case of (21), the combinatorial structure of the problem is a priori lost, in the sense that the different kernels are no longer linear combinations of a set of “primary” kernels, while this is still the case for (22).

Using non-linear kernels such as the RBF kernel, or kernels on discrete structures such as sequence or graph kernels, may prove useful in cases where the relationship between the covariates in the groups and the output is expected to be non-linear. For example, if $g$ is a group of genes and the coexpression patterns of genes within the group are associated with the output, the group will be deemed important by a non-linear kernel while a linear one may miss it. More generally, this allows for structured non-linear feature selection.

### 4.8 Algorithms

There are several possible algorithmic approaches to solve the optimization problem (13), depending on the structure of the groups in $\mathcal{G}$. The approach we chose in this paper is based on the reformulation by covariate duplication of Section 4.6, and applies an algorithm for the group Lasso in the space of duplicates. To be specific, for the experiments presented in Section 10, we implemented the block-coordinate descent algorithm of Meier et al. (2008) combined with the working set strategy proposed by Roth and Fischer (2008). Note that the covariate duplication of the input matrix need not be done explicitly in computer memory, since only fast access to the corresponding entries of $X$ is required. Only the optimized vector $\tilde{v}$ has to be stored in the duplicated space, and it is potentially of large dimension (although sparse) if $\mathcal{G}$ has many groups.
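To make the overall mechanics concrete, here is a minimal proximal-gradient sketch in the duplicated space. It is a simplification, not the block-coordinate descent of Meier et al. (2008) that we use in the experiments, it omits the working-set strategy, it materializes $\tilde{X}$ explicitly, and all names are ours. It assumes a squared loss.

```python
import numpy as np

def latent_group_lasso(X, y, groups, lam, weights=None, n_iter=500):
    """Proximal-gradient sketch for 0.5*||y - Xw||^2 + lam * sum_g d_g ||v_g||,
    solved over the duplicated variables and mapped back via w = sum_g v_g."""
    if weights is None:
        weights = [1.0] * len(groups)
    X_tilde = np.hstack([X[:, g] for g in groups])   # duplicated design
    v = np.zeros(X_tilde.shape[1])
    step = 1.0 / np.linalg.norm(X_tilde, 2) ** 2     # 1 / Lipschitz constant
    offsets = np.cumsum([0] + [len(g) for g in groups])
    for _ in range(n_iter):
        v = v - step * (X_tilde.T @ (X_tilde @ v - y))   # gradient step
        for j in range(len(groups)):                     # group soft-thresholding
            blk = slice(offsets[j], offsets[j + 1])
            norm = np.linalg.norm(v[blk])
            shrink = max(0.0, 1 - step * lam * weights[j] / norm) if norm > 0 else 0.0
            v[blk] = shrink * v[blk]
    w = np.zeros(X.shape[1])
    for j, g in enumerate(groups):
        w[np.array(g)] += v[offsets[j]:offsets[j + 1]]
    return w

# Toy usage: support of w is the union of the first two groups.
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 5))
groups = [[0, 1, 2], [2, 3], [3, 4]]
y = X @ np.array([1.0, 1.0, 1.0, 0.0, 0.0])

w_big = latent_group_lasso(X, y, groups, lam=1e8)        # huge penalty: w = 0
assert np.allclose(w_big, 0)
w_small = latent_group_lasso(X, y, groups, lam=0.01, n_iter=2000)
assert np.sum((y - X @ w_small) ** 2) < 0.5 * np.sum(y ** 2)
```

The shrinkage step is applied block by block in the duplicated space, which is exactly what makes the approach a plain (non-overlapping) group Lasso solver there.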

Alternatively, efficient algorithms that do not require working in the space of duplicated covariates are possible. Such an algorithm was proposed by Mosci et al. (2010), who suggested using a proximal algorithm and computing the proximal operator of the norm via an approximate projection on the unit ball of the dual norm in the input space. To avoid duplication, it would also be possible to use an approach similar to that of Rakotomamonjy et al. (2008). Finally, one could also consider algorithms from the multiple kernel learning literature.

## 5 Group-support

A natural question associated with the norm is which sparsity patterns are elicited when the norm is used as a regularizer. This question arises naturally in the context of support recovery. If the groups are disjoint, one can equivalently ask which patterns of selected groups are possible, since the two questions are equivalent. This suggests a view in which the support is expressed in terms of groups. We formalize this idea through the concept of group-support of a vector $w$, which, put informally, is the set of groups that are non-zero in a decomposition of $w$. We will see that this notion is useful to characterize induced decompositions and recovery properties of the norm.

### 5.1 Definitions

More formally, we naturally call group-support of a decomposition $v = (v^g)_{g\in\mathcal{G}}$ the set of groups $g$ such that $v^g \neq 0$. We extend this definition to a vector $w$ as follows:

###### Definition 12 (Strong group-support)

The strong group-support of a vector $w$ is the union of the group-supports of all its optimal decompositions, namely:

$$\breve{\mathcal{G}}_1(w) \overset{\Delta}{=} \big\{g\in\mathcal{G} \mid \exists\, \bar{v}\in\mathcal{V}(w) \text{ s.t. } \bar{v}^g \neq 0\big\}.$$

If $w$ has a unique optimal decomposition $\bar{v}$, then $\breve{\mathcal{G}}_1(w)$ is the group-support of this decomposition. We also define a notion of weak group-support in terms of uniqueness of the optimal dual variables.

###### Definition 13 (Weak group-support)

The weak group-support of a vector $w$ is