New Perspectives on k-Support and Cluster Norms
Abstract
We study a regularizer which is defined as a parameterized infimum of quadratics, and which we call the box-norm. We show that the k-support norm, a regularizer proposed by Argyriou et al. (2012) for sparse vector prediction problems, belongs to this family, and the box-norm can be generated as a perturbation of the former. We derive an improved algorithm to compute the proximity operator of the squared box-norm, and we provide a method to compute the norm. We extend the norms to matrices, introducing the spectral k-support norm and the spectral box-norm. We note that the spectral box-norm is essentially equivalent to the cluster norm, a multitask learning regularizer introduced by Jacob et al. (2009a), which in turn can be interpreted as a perturbation of the spectral k-support norm. Centering the norms is important for multitask learning, and we also provide a method to use centered versions of the norms as regularizers. Numerical experiments indicate that the spectral k-support norm and box-norm, and their centered variants, provide state-of-the-art performance in matrix completion and multitask learning problems, respectively.
Massimiliano Pontil m.pontil@cs.ucl.ac.uk
Department of Computer Science
University College London
Gower Street, London WC1E 6BT, UK
Istituto Italiano di Tecnologia
Via Morego, 30, 16163 Genoa, Italy

Dimitris Stamos d.stamos@cs.ucl.ac.uk
Department of Computer Science
University College London
Gower Street, London WC1E 6BT, UK
Keywords: Convex Optimization, Matrix Completion, Multitask Learning, Spectral Regularization, Structured Sparsity.
1 Introduction
We continue the study of a family of norms which are obtained by taking the infimum of a class of quadratic functions. These norms can be used as regularizers in linear regression learning problems, where the parameter set can be tailored to assumptions on the underlying regression model. This family of norms is sufficiently rich to encompass regularizers such as the $\ell_p$ norms, the group Lasso with overlap (Jacob et al., 2009b) and the norms studied by Micchelli et al. (2013). In this paper we focus on a particular norm in this framework, the box-norm, in which the parameter set involves box constraints and a linear constraint. We study the norm in detail and show that it can be generated as a perturbation of the k-support norm introduced by Argyriou et al. (2012) for sparse vector estimation, which hence can be seen as a special case of the box-norm. Furthermore, our variational framework allows us to study efficient algorithms to compute the norms and the proximity operator of the squared norms.
Another main goal of this paper is to extend the k-support and box-norms to a matrix setting. We observe that both norms are symmetric gauge functions, hence by applying them to the spectrum of a matrix we obtain two orthogonally invariant matrix norms. In addition, we observe that the spectral box-norm is essentially equivalent to the cluster norm introduced by Jacob et al. (2009a) for multitask clustering, which in turn can be interpreted as a perturbation of the spectral k-support norm.
The characteristic properties of the vector norms translate in a natural manner to matrices. In particular, the unit ball of the spectral k-support norm is the convex hull of the set of matrices of rank no greater than $k$ and Frobenius norm bounded by one. In numerical experiments we present empirical evidence of the strong performance of the spectral k-support norm in low rank matrix completion and multitask learning problems.
Moreover, our computation of the vector box-norm and its proximity operator extends naturally to the spectral case, which allows us to use proximal gradient methods to solve regularization problems using the cluster norm. Finally, we provide a method to use the centered versions of the penalties, which are important in applications (see e.g. Evgeniou et al., 2007; Jacob et al., 2009a).
1.1 Related Work
Our work builds upon a recent line of papers which considered convex regularizers defined as an infimum problem over a parametric family of quadratics, as well as related infimal convolution problems (see Jacob et al., 2009b; Bach et al., 2011; Maurer and Pontil, 2012; Micchelli and Pontil, 2005; Obozinski and Bach, 2012, and references therein). Related variational formulations for the Lasso have also been discussed in (Grandvalet, 1998) and further studied in (Szafranski et al., 2007).
To our knowledge, the box-norm was first suggested by Jacob et al. (2009a), where it is used as a symmetric gauge function in matrix learning problems. The induced orthogonally invariant matrix norm is named the cluster norm in (Jacob et al., 2009a) and was motivated as a convex relaxation of a multitask clustering problem. Here we formally prove that the cluster norm is indeed an orthogonally invariant norm. More importantly, we explicitly compute the norm and its proximity operator.
A key observation of this paper is the link between the box-norm and the k-support norm and, in turn, the link between the cluster norm and the spectral k-support norm. The k-support norm was proposed in (Argyriou et al., 2012) for sparse vector prediction and was shown to empirically outperform the Lasso (Tibshirani, 1996) and Elastic Net (Zou and Hastie, 2005) penalties. See also Gkirtzou et al. (2013) for further empirical results.
In recent years there has been a great deal of interest in the problem of learning a low rank matrix from a set of linear measurements. A widely studied and successful instance of this problem arises in the context of matrix completion or collaborative filtering, in which we want to recover a low rank (or approximately low rank) matrix from a small sample of its entries, see e.g. Srebro et al. (2005); Abernethy et al. (2009) and references therein. One prominent method of solving this problem is trace norm regularization: we look for a matrix which closely fits the observed entries and has a small trace norm (sum of singular values) (Jaggi and Sulovsky, 2010; Toh and Yun, 2011; Mazumder et al., 2010). In our numerical experiments we consider the spectral k-support norm and the spectral box-norm as alternatives to the trace norm and compare their performance.
Another application of matrix learning is multitask learning. In this framework a number of tasks, such as classifiers or regressors, are learned by taking advantage of commonalities between them. This can improve upon learning the tasks separately, for instance when insufficient data is available to solve each task in isolation (see e.g. Evgeniou et al., 2005; Argyriou et al., 2007, 2008; Jacob et al., 2009a; Cavallanti et al., 2010; Maurer, 2006; Maurer and Pontil, 2008). An approach which has been successful is the use of spectral regularizers such as the trace norm to learn a matrix whose columns represent the individual tasks, and in this paper we compare the performance of the spectral k-support and box-norms as penalties in multitask learning problems.
Finally, we note that this is a longer version of the conference paper (McDonald et al., 2014) and includes new theoretical and experimental results.
1.2 Contributions
We summarise the main contributions of this paper.

We show that the vector k-support norm is a special case of the more general box-norm, which in turn can be seen as a perturbation of the former. The box-norm can be written as a parameterized infimum of quadratics, and this framework is instrumental in deriving a fast algorithm to compute the norm and the proximity operator of the squared norm in $O(d\log d)$ time. Apart from improving on the algorithm for the proximity operator in Argyriou et al. (2012), this method allows one to use optimal first order optimization algorithms (Nesterov, 2007) for the box-norm.¹

¹ We note that recently Chatterjee et al. (2014) showed that the proximity operator of the vector k-support norm itself can also be computed efficiently. Here we directly follow Argyriou et al. (2012) and consider the squared k-support norm.

We extend the k-support and box-norms to orthogonally invariant matrix norms. We note that the spectral box-norm is essentially equivalent to the cluster norm, which in turn can be interpreted as a perturbation of the spectral k-support norm in the sense of the Moreau envelope. Our computation of the vector box-norm and its proximity operator also extends naturally to the spectral case. This allows us to use proximal gradient methods for the cluster norm. Furthermore, we provide a method to apply the centered versions of the penalties, which are important in applications.

We present extensive numerical experiments on both synthetic and real matrix learning datasets. Our findings indicate that regularization with the spectral k-support and box-norms produces state-of-the-art results on a number of popular matrix completion benchmarks, and that centered variants of the norms show a significant improvement in performance over the centered trace norm and the matrix elastic net on multitask learning benchmarks.
1.3 Notation
We use $\mathbb{N}_d$ for the set of integers from $1$ up to and including $d$. We let $\mathbb{R}^d$ be the $d$-dimensional real vector space, whose elements are denoted by lower case letters. We let $\mathbb{R}^d_+$ and $\mathbb{R}^d_{++}$ be the subsets of vectors with nonnegative and strictly positive components, respectively. We denote by $\Delta$ the unit simplex, $\Delta=\{\theta\in\mathbb{R}^d_+:\sum_{i=1}^d\theta_i=1\}$. For any vector $w$, its support is defined as $\mathrm{supp}(w)=\{i:w_i\neq0\}$. We use $\mathbf{1}$ to denote either the scalar or a vector of all ones, whose dimension is determined by its context. Given a subset $J$ of $\mathbb{N}_d$, the $d$-dimensional vector $\mathbf{1}_J$ has ones on the support $J$, and zeros elsewhere. We let $\mathbb{R}^{d\times m}$ be the space of $d\times m$ real matrices and write $W=[w_1,\dots,w_m]$ to denote the matrix whose columns are formed by the vectors $w_1,\dots,w_m\in\mathbb{R}^d$. For a vector $\theta\in\mathbb{R}^d$, we denote by $\mathrm{diag}(\theta)$ the $d\times d$ diagonal matrix having elements $\theta_i$ on the diagonal. We say a matrix $D$ is diagonal if $D_{ij}=0$ whenever $i\neq j$. We denote the trace of a matrix $W$ by $\mathrm{tr}(W)$, and its rank by $\mathrm{rank}(W)$. We let $\sigma(W)=(\sigma_1(W),\dots,\sigma_r(W))$ be the vector formed by the singular values of $W$, where $r=\min(d,m)$, and where we assume that the singular values are ordered nonincreasing, i.e. $\sigma_1(W)\geq\cdots\geq\sigma_r(W)\geq0$. We use $\mathbf{S}^d$ to denote the set of $d\times d$ real symmetric matrices, and $\mathbf{S}^d_+$ to denote the subset of positive semidefinite matrices. We use $\succeq$ to denote the positive semidefinite ordering on $\mathbf{S}^d$. The notation $\langle\cdot,\cdot\rangle$ denotes the standard inner products on $\mathbb{R}^d$ and $\mathbb{R}^{d\times m}$, that is $\langle w,v\rangle=\sum_iw_iv_i$ for $w,v\in\mathbb{R}^d$, and $\langle W,V\rangle=\mathrm{tr}(W^\top V)$ for $W,V\in\mathbb{R}^{d\times m}$. Given a norm $\|\cdot\|$ on $\mathbb{R}^d$ or $\mathbb{R}^{d\times m}$, $\|\cdot\|_*$ denotes the corresponding dual norm, given by $\|u\|_*=\sup\{\langle u,w\rangle:\|w\|\leq1\}$. On $\mathbb{R}^d$ we denote by $\|\cdot\|_2$ the Euclidean norm, and on $\mathbb{R}^{d\times m}$ we denote by $\|\cdot\|_F$ the Frobenius norm and by $\|\cdot\|_{\mathrm{tr}}$ the trace norm, that is the sum of singular values.
1.4 Organization
The paper is organized as follows. In Section 2, we review a general class of norms and characterize their unit balls. In Section 3, we specialize these norms to the box-norm, which we show is a perturbation of the k-support norm. We study the properties of the norms and we describe the geometry of the unit balls. In Section 4, we compute the box-norm and we provide an efficient method to compute the proximity operator of the squared norm. In Section 5, we extend the norms to orthogonally invariant matrix norms, the spectral k-support and spectral box-norms, and we show that these exhibit a number of properties which relate to the vector properties in a natural manner. In Section 6, we review the clustered multitask learning setting, we recall the cluster norm introduced by Jacob et al. (2009a), and we show that the cluster norm corresponds to the spectral box-norm. We also provide a method for solving the resulting matrix regularization problem using “centered” norms. In Section 7, we apply the norms to matrix learning problems on a number of simulated and real datasets and report on their performance. In Section 8, we discuss extensions to the framework and suggest directions for future research. Finally, in Section 9, we conclude.
2 Preliminaries
In this section we review a family of norms parameterized by a set $\Theta$, which we call the $\Theta$-norms. They are closely related to the norms considered in Micchelli et al. (2010, 2013). Similar norms are also discussed in Bach et al. (2011, Sect. 1.4.2). We first recall the definition of the $\Theta$-norm.
Definition 1
Let $\Theta$ be a convex bounded subset of the open positive orthant in $\mathbb{R}^d$. For $w\in\mathbb{R}^d$, the $\Theta$-norm is defined as
$$\|w\|_\Theta=\sqrt{\inf_{\theta\in\Theta}\sum_{i=1}^d\frac{w_i^2}{\theta_i}}.\qquad(1)$$
Note that the function $\theta\mapsto\sum_{i=1}^d\frac{w_i^2}{\theta_i}$ is strictly convex on $\mathbb{R}^d_{++}$, hence every minimizing sequence converges to the same point. The infimum is, however, not attained in general because a minimizing sequence may converge to a point on the boundary of $\Theta$. For instance, if $\Theta$ is the interior of the unit simplex, then $\|w\|_\Theta=\|w\|_1$ and the minimizing sequence converges to the point $\theta=\big(\frac{|w_1|}{\|w\|_1},\dots,\frac{|w_d|}{\|w\|_1}\big)$, which belongs to $\Theta$ only if all the components of $w$ are different from zero.
Proposition 2
The $\Theta$-norm is well defined and the dual norm is given, for $u\in\mathbb{R}^d$, by
$$\|u\|_{\Theta,*}=\sqrt{\sup_{\theta\in\Theta}\sum_{i=1}^d\theta_iu_i^2}.\qquad(2)$$
Proof Consider the expression (2) for the dual norm. The function $u\mapsto\sup_{\theta\in\Theta}\sqrt{\sum_i\theta_iu_i^2}$ is a norm since it is a supremum of norms. Recall that the Fenchel conjugate of a function $f:\mathbb{R}^d\to\mathbb{R}$ is defined for every $u\in\mathbb{R}^d$ as $f^*(u)=\sup\{\langle u,w\rangle-f(w):w\in\mathbb{R}^d\}$. It is a standard result from convex analysis that for any norm $\|\cdot\|$, the Fenchel conjugate of the function $f=\frac12\|\cdot\|^2$ satisfies $f^*=\frac12\|\cdot\|_*^2$, where $\|\cdot\|_*$ is the corresponding dual norm (see, e.g. Lewis, 1995). By the same result, for any norm the biconjugate is equal to the norm itself, that is $f^{**}=f$. Applying this to the dual norm defined in (2) we have, for every $w\in\mathbb{R}^d$, that
$$\frac12\|w\|_\Theta^2=\sup_{u\in\mathbb{R}^d}\Big\{\langle u,w\rangle-\frac12\sup_{\theta\in\Theta}\sum_{i=1}^d\theta_iu_i^2\Big\}=\sup_{u\in\mathbb{R}^d}\inf_{\theta\in\Theta}\Big\{\langle u,w\rangle-\frac12\sum_{i=1}^d\theta_iu_i^2\Big\}.$$
This is a minimax problem in the sense of von Neumann (see e.g. Prop. 2.6.3 in Bertsekas et al., 2003), and we can exchange the order of the $\sup$ and the $\inf$, and solve the inner problem (which is in fact a maximum) componentwise. The gradient with respect to $u_i$ is zero for $u_i=w_i/\theta_i$, and substituting this into the objective function we obtain $\frac12\inf_{\theta\in\Theta}\sum_{i=1}^d\frac{w_i^2}{\theta_i}$. It follows that the expression in (1) defines a norm, and its dual norm is given by (2), as required.
The $\Theta$-norm (1) encompasses a number of well known norms. For instance, if $\Theta$ is the interior of the unit simplex, the norm coincides with the $\ell_1$ norm, as noted above. More generally, for $p\in[1,2)$ one can show (Micchelli and Pontil, 2005, Lemma 26) that $\|w\|_{\Theta_p}=\|w\|_p$, where we have defined $\Theta_p=\{\theta\in\mathbb{R}^d_{++}:\sum_{i=1}^d\theta_i^{\frac{p}{2-p}}\leq1\}$. For $p=1$ this confirms the set corresponding to the $\ell_1$ norm claimed above. Similarly, for $p\in(2,\infty]$ we have that $\|u\|_{\Theta_q,*}=\|u\|_p$, where $\frac1p+\frac1q=1$. The $\ell_2$ norm is obtained as both a primal and dual norm in the limit as $p$ tends to 2. See also Aflalo et al. (2011), who considered the case $p\geq2$.
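To make the variational formula (1) concrete, the following sketch (plain Python, purely illustrative and not part of the paper's analysis) checks numerically that when $\Theta$ is the interior of the unit simplex the infimum equals $\|w\|_1^2$, approached at $\theta_i=|w_i|/\|w\|_1$.

```python
def quad_objective(w, theta):
    # objective of the variational problem (1): sum_i w_i^2 / theta_i
    return sum(wi * wi / ti for wi, ti in zip(w, theta))

w = [3.0, -1.0, 2.0]
l1 = sum(abs(wi) for wi in w)  # ||w||_1 = 6

# candidate minimizer theta_i = |w_i| / ||w||_1 (a point of the simplex)
theta_star = [abs(wi) / l1 for wi in w]
val_star = quad_objective(w, theta_star)
assert abs(val_star - l1 ** 2) < 1e-9  # attains ||w||_1^2 = 36

# crude cross-check: no point of a simplex grid does better
n = 60
best = min(
    quad_objective(w, (i / n, j / n, (n - i - j) / n))
    for i in range(1, n) for j in range(1, n - i)
)
assert best >= val_star - 1e-9
print(round(val_star, 6))  # 36.0
```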
Other norms which belong to the family (1) are presented in (Micchelli et al., 2013) and correspond to choosing $\Theta=\{\theta\in\mathrm{int}(\Delta):\theta\in K\}$, where $K$ is a convex cone. A specific example described therein is the wedge penalty, which corresponds to choosing $K=\{\theta\in\mathbb{R}^d:\theta_1\geq\theta_2\geq\cdots\geq\theta_d\}$.
We now describe the unit ball of the $\Theta$-norm when the set $\Theta$ is a polyhedron, and we characterize the unit ball of the norm. This setting applies to a number of norms of practical interest, including the group lasso with overlap, the wedge norm mentioned above and, as we shall see, the k-support norm. To describe our observation, for every vector $\lambda\in\mathbb{R}^d_+$, we define the seminorm
$$\|w\|_\lambda=\Big(\sum_{i\in\mathrm{supp}(\lambda)}\frac{w_i^2}{\lambda_i}\Big)^{\frac12},\qquad w\in\mathbb{R}^d.$$
Proposition 3
Let $\lambda^1,\dots,\lambda^m\in\mathbb{R}^d_+$ be such that $\bigcup_{j=1}^m\mathrm{supp}(\lambda^j)=\mathbb{N}_d$ and let $\Theta$ be the interior of $\mathrm{conv}\{\lambda^1,\dots,\lambda^m\}$.
Then we have, for every $w\in\mathbb{R}^d$, that
$$\|w\|_\Theta=\min\Big\{\sum_{j=1}^m\|v^j\|_{\lambda^j}:v^j\in\mathbb{R}^d,\ \mathrm{supp}(v^j)\subseteq\mathrm{supp}(\lambda^j),\ \sum_{j=1}^mv^j=w\Big\}.\qquad(3)$$
Moreover, the unit ball of the norm is given by the convex hull of the set
$$\bigcup_{j=1}^m\big\{w\in\mathbb{R}^d:\mathrm{supp}(w)\subseteq\mathrm{supp}(\lambda^j),\ \|w\|_{\lambda^j}\leq1\big\}.\qquad(4)$$
The proof of this result is presented in the appendix. It is based on observing that the Minkowski functional (see e.g. Rudin, 1991) of the convex hull of the set (4) is a norm and it is given by the right hand side of equation (3); we then prove that this norm coincides with the $\Theta$-norm by noting that both norms share the same dual norm. To illustrate an application of the proposition, we specialize it to the group Lasso with overlap (Jacob et al., 2009b).
Corollary 4
If $\mathcal{G}$ is a collection of subsets of $\mathbb{N}_d$ such that $\bigcup_{g\in\mathcal{G}}g=\mathbb{N}_d$ and $\Theta$ is the interior of the set $\mathrm{conv}\{\mathbf{1}_g:g\in\mathcal{G}\}$, then we have, for every $w\in\mathbb{R}^d$, that
$$\|w\|_\Theta=\min\Big\{\sum_{g\in\mathcal{G}}\|v^g\|_2:v^g\in\mathbb{R}^d,\ \mathrm{supp}(v^g)\subseteq g,\ \sum_{g\in\mathcal{G}}v^g=w\Big\}.\qquad(5)$$
Moreover, the unit ball of the norm is given by the convex hull of the set
$$\bigcup_{g\in\mathcal{G}}\big\{w\in\mathbb{R}^d:\mathrm{supp}(w)\subseteq g,\ \|w\|_2\leq1\big\}.\qquad(6)$$
We do not claim any originality in the above corollary and proposition, although we cannot find a specific reference. The utility of the result is that it links seemingly different norms such as the group Lasso with overlap and the $\Theta$-norms; the latter provide a more compact representation, involving only $d$ additional variables. This formulation is especially useful whenever the optimization problem (1) can be solved in closed form. One such example is provided by the wedge norm described above. In the next section we discuss one more important case, the box-norm, which plays a central role in this paper.
3 The Box-Norm and the k-Support Norm
We now specialize our analysis to the case that
$$\Theta=\Big\{\theta\in\mathbb{R}^d:a\leq\theta_i\leq b,\ \sum_{i=1}^d\theta_i\leq c\Big\},\qquad(7)$$
where $0<a\leq b$ and $da\leq c\leq db$. We call the norm defined by (1) the box-norm and we denote it by $\|\cdot\|_{\mathrm{box}}$.
The structure of the set $\Theta$ for the box-norm will be fundamental in computing the norm and deriving the proximity operator in Section 4. Furthermore, we note that the constraints are invariant with respect to permutations of the components of $\theta$ and, as we shall see in Section 5, this property is key to extending the norm to matrices. Finally, while a restriction of the general family, the box-norm nevertheless encompasses a number of norms, including the $\ell_1$ and $\ell_2$ norms, as well as the k-support norm, which we now recall.
For every $k\in\mathbb{N}_d$, the k-support norm $\|\cdot\|_{(k)}$ (Argyriou et al., 2012) is defined as the norm whose unit ball is the convex hull of the set of vectors of cardinality at most $k$ and $\ell_2$ norm no greater than one. The authors show that the k-support norm can be written as the infimal convolution (see Rockafellar, 1970, p. 34)
$$\|w\|_{(k)}=\min\Big\{\sum_{g\in\mathcal{G}_k}\|v_g\|_2:v_g\in\mathbb{R}^d,\ \mathrm{supp}(v_g)\subseteq g,\ \sum_{g\in\mathcal{G}_k}v_g=w\Big\},\qquad(8)$$
where $\mathcal{G}_k$ is the collection of all subsets of $\mathbb{N}_d$ containing at most $k$ elements. The k-support norm is a special case of the group lasso with overlap (Jacob et al., 2009b), where the cardinality of the support sets is at most $k$. When used as a regularizer, the norm encourages vectors $w$ to be a sum of a limited number of vectors with small support. Note that while definition (8) involves a combinatorial number of variables, Argyriou et al. (2012) observed that the norm can be computed efficiently, a point we return to in Section 4.
Comparing equation (8) with Corollary 4, it is evident that the k-support norm is a $\Theta$-norm where $\Theta$ is the interior of $\mathrm{conv}\{\mathbf{1}_g:g\in\mathcal{G}_k\}$, which by symmetry can be expressed as $\{\theta:0<\theta_i\leq1,\ \sum_{i=1}^d\theta_i\leq k\}$. Hence, we see that the k-support norm is a special case of the box-norm.
Despite the complicated form of (8), Argyriou et al. (2012) observe that the dual norm has a simple formulation, namely the $\ell_2$ norm of the $k$ largest components,
$$\|u\|_{(k),*}=\Big(\sum_{i=1}^k\big(|u|^\downarrow_i\big)^2\Big)^{\frac12},\qquad u\in\mathbb{R}^d,\qquad(9)$$
where $|u|^\downarrow$ is the vector obtained from $u$ by reordering its components so that they are nonincreasing in absolute value. Note from equation (9) that for $k=1$ and $k=d$, the dual norm is equal to the $\ell_\infty$ norm and $\ell_2$ norm, respectively. It follows that the k-support norm includes the $\ell_1$ norm and $\ell_2$ norm as special cases.
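Equation (9) makes the dual norm straightforward to evaluate: sort the components by absolute value and take the $\ell_2$ norm of the $k$ largest. A minimal illustrative sketch (not taken from the paper's code):

```python
import math

def dual_k_support(u, k):
    # l2 norm of the k largest components of u in absolute value, cf. eq. (9)
    top = sorted((abs(x) for x in u), reverse=True)[:k]
    return math.sqrt(sum(x * x for x in top))

u = [3.0, -4.0, 1.0, 2.0]
d = len(u)
# k = 1 recovers the l-infinity norm, k = d recovers the l2 norm
assert dual_k_support(u, 1) == max(abs(x) for x in u)        # 4.0
assert abs(dual_k_support(u, d) - math.sqrt(30.0)) < 1e-12   # ||u||_2
print(dual_k_support(u, 2))  # sqrt(3^2 + 4^2) = 5.0
```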
We now provide a different argument illustrating that the k-support norm belongs to the family of box-norms, using the dual norm. We first derive the dual box-norm.
Proposition 5
The dual box-norm is given by
$$\|u\|_*^2=a\|u\|_2^2+(b-a)\Big(\sum_{i=1}^k\big(|u|^\downarrow_i\big)^2+\rho\big(|u|^\downarrow_{k+1}\big)^2\Big),\qquad(10)$$
where $\rho=\frac{c-da}{b-a}-k$ and $k$ is the largest integer not exceeding $\frac{c-da}{b-a}$.
Proof
We need to solve problem (2). We make the change of variable $\phi_i=\frac{\theta_i-a}{b-a}$ and observe that the constraints on $\theta$ induce the constraint set $\Phi=\{\phi:0\leq\phi_i\leq1,\ \sum_{i=1}^d\phi_i\leq\frac{c-da}{b-a}\}$. Furthermore, $\sum_{i=1}^d\theta_iu_i^2=a\|u\|_2^2+(b-a)\sum_{i=1}^d\phi_iu_i^2$.
The result then follows by taking the supremum over $\phi\in\Phi$.
We see from equation (10) that the dual norm decomposes into a weighted combination of the $\ell_2$ norm, the dual k-support norm and a residual term, which vanishes if $\rho=0$. For the rest of this paper we assume this holds, which loses little generality. This choice is equivalent to requiring that $c=(b-a)k+da$ for some $k\in\mathbb{N}_d$, which is the case considered by Jacob et al. (2009a) in the context of multitask clustering, where $k$ is interpreted as the number of clusters and $d$ as the number of tasks. We return to this case in Section 6, where we explain in detail the link between the spectral k-support norm and the cluster norm.
Observe that if $a=0$, $b=1$ and $c=k$, the dual box-norm (10) coincides with the dual k-support norm in equation (9). We conclude that if
$$\Theta=\Big\{\theta\in\mathbb{R}^d:0<\theta_i\leq1,\ \sum_{i=1}^d\theta_i\leq k\Big\}$$
then the $\Theta$-norm coincides with the k-support norm.
3.1 Properties of the Norms
In this section we illustrate a number of properties of the box-norm and its connection to the k-support norm. The first result follows as a special case of Proposition 3.
Corollary 6
If $0<a<b$ and $c=(b-a)k+da$, for some $k\in\mathbb{N}_d$, then it holds that
$$\|w\|=\min\Big\{\sum_{g\in\mathcal{G}_k}\Big(\sum_{i\in g}\frac{(v_g)_i^2}{b}+\sum_{i\notin g}\frac{(v_g)_i^2}{a}\Big)^{\frac12}:\sum_{g\in\mathcal{G}_k}v_g=w\Big\}.$$
Furthermore, the unit ball of the norm is given by the convex hull of the set
$$\bigcup_{g\in\mathcal{G}_k}\Big\{w\in\mathbb{R}^d:\sum_{i\in g}\frac{w_i^2}{b}+\sum_{i\notin g}\frac{w_i^2}{a}\leq1\Big\}.\qquad(11)$$
Notice in Equation (11) that if $b=1$, then as $a$ tends to zero, we obtain the expression of the k-support norm (8), recovering in particular the support constraints. If $a$ is small and positive, the support constraints are not imposed, however most of the weight of each $v_g$ tends to be concentrated on $g$. Hence, Corollary 6 suggests that if $a\ll b$ then the box-norm regularizer will encourage vectors $w$ whose dominant components are a subset of a union of a small number of groups $g\in\mathcal{G}_k$.
Our next result links two norms whose parameter sets are related by a linear transformation with positive coefficients.
Lemma 7
Let $\Theta$ be a convex bounded subset of the positive orthant in $\mathbb{R}^d$, and let $\tilde\Theta=\{\delta\mathbf{1}+\theta:\theta\in\Theta\}$, where $\delta>0$. Then
$$\|w\|_{\tilde\Theta}^2=\min_{z\in\mathbb{R}^d}\Big\{\frac1\delta\|w-z\|_2^2+\|z\|_\Theta^2\Big\}.$$
Proof We consider the definition of the norm in (1). We have
$$\|w\|_{\tilde\Theta}^2=\inf_{\theta\in\Theta}\sum_{i=1}^d\frac{w_i^2}{\delta+\theta_i},\qquad(12)$$
where we have made the change of variable $\tilde\theta_i=\delta+\theta_i$. Next we observe that
$$\frac{w_i^2}{\delta+\theta_i}=\min_{z_i\in\mathbb{R}}\Big\{\frac{(w_i-z_i)^2}{\delta}+\frac{z_i^2}{\theta_i}\Big\},\qquad(13)$$
where the minimum is attained by setting $z_i=\frac{\theta_iw_i}{\delta+\theta_i}$.
The result now follows by combining equations (12) and (13) and interchanging the order of the minimum over $z$ and the infimum over $\theta$.
In Section 3 we characterized the k-support norm as a special case of the box-norm. Conversely, Lemma 7 allows us to interpret the box-norm as a perturbation of the k-support norm with a quadratic regularization term.
Proposition 8
Let $\|\cdot\|$ be the box-norm on $\mathbb{R}^d$ with parameters $0<a<b$ and $c=(b-a)k+da$, for some $k\in\mathbb{N}_d$. Then
$$\|w\|^2=\min_{z\in\mathbb{R}^d}\Big\{\frac1{b-a}\|z\|_{(k)}^2+\frac1a\|w-z\|_2^2\Big\}.\qquad(14)$$
Proof
The result directly follows from Lemma 7 with
$\delta=a$ and $\Theta=\{\theta:0\leq\theta_i\leq b-a,\ \sum_{i=1}^d\theta_i\leq(b-a)k\}$, for which $\|z\|_\Theta^2=\frac1{b-a}\|z\|_{(k)}^2$.
Lemma 7 and Proposition 8 can further be interpreted using the Moreau envelope from convex optimization, which we now recall (Rockafellar and Wets, 2009, Ch. 1 §G).
Definition 9
Let $f:\mathbb{R}^d\to(-\infty,+\infty]$ be proper and lower semicontinuous, and let $\lambda>0$. The Moreau envelope of $f$ with parameter $\lambda$ is defined as
$$\big(M_\lambda f\big)(w)=\inf_{z\in\mathbb{R}^d}\Big\{f(z)+\frac1{2\lambda}\|w-z\|_2^2\Big\}.$$
Note that $M_\lambda f$ minorizes $f$ and, for convex $f$, is convex and smooth (see e.g. Bauschke and Combettes, 2010). It acts as a parameterized smooth approximation to $f$ from below, which motivates its use in variational analysis (see e.g. Rockafellar and Wets, 2009, for further discussion). Lemma 7 therefore says that $\frac12\|\cdot\|_{\tilde\Theta}^2$ is a Moreau envelope of $\frac12\|\cdot\|_\Theta^2$ with parameter $\delta$ whenever $\tilde\Theta$ is defined as $\tilde\Theta=\{\delta\mathbf{1}+\theta:\theta\in\Theta\}$, $\delta>0$. In particular, we see from (14) that the squared box-norm, scaled by a factor of one half, is a Moreau envelope of the (scaled) squared k-support norm, as we have
$$\tfrac12\|w\|^2=\big(M_\lambda f\big)(w),\qquad(15)$$
where $\lambda=a$ and $f=\frac1{2(b-a)}\|\cdot\|_{(k)}^2$.
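As a concrete instance of Definition 9, the Moreau envelope of the absolute value is the Huber function, a standard fact that the sketch below (illustrative only; a fine grid search stands in for the exact minimization over $z$) verifies numerically.

```python
def moreau_env_abs(x, lam):
    # numerically evaluate (M_lam f)(x) for f = |.| by minimizing
    # |z| + (1/(2*lam)) * (x - z)^2 over a fine grid of z values
    n = 4000
    lo, hi = -abs(x) - 1.0, abs(x) + 1.0
    zs = (lo + (hi - lo) * i / n for i in range(n + 1))
    return min(abs(z) + (x - z) ** 2 / (2.0 * lam) for z in zs)

def huber(x, lam):
    # closed form of the envelope: quadratic near zero, linear in the tails
    return x * x / (2.0 * lam) if abs(x) <= lam else abs(x) - lam / 2.0

lam = 0.5
for x in [-2.0, -0.3, 0.0, 0.25, 1.7]:
    assert abs(moreau_env_abs(x, lam) - huber(x, lam)) < 1e-3
```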
Proposition 8 further allows us to decompose the solution of a vector learning problem with the squared box-norm into two components with particular structure. Specifically, consider the regularization problem
$$\min_{w\in\mathbb{R}^d}\ \|Xw-y\|_2^2+\lambda\|w\|^2,\qquad(16)$$
with data $X\in\mathbb{R}^{n\times d}$ and response $y\in\mathbb{R}^n$. Using Proposition 8 and setting $w=q+z$, we see that (16) is equivalent to
$$\min_{q,z\in\mathbb{R}^d}\ \|X(q+z)-y\|_2^2+\lambda\Big(\frac1a\|q\|_2^2+\frac1{b-a}\|z\|_{(k)}^2\Big).\qquad(17)$$
Furthermore, if $(\hat q,\hat z)$ solves problem (17) then $\hat w=\hat q+\hat z$ solves problem (16). The solution can therefore be interpreted as the superposition of a vector which has small $\ell_2$ norm and a vector which has small k-support norm, with the parameter $a$ regulating these two components. Specifically, as $a$ tends to zero, in order to prevent the objective from blowing up, $\hat q$ must also tend to zero and we recover k-support norm regularization. Similarly, as $a$ tends to $b$, $\hat z$ vanishes and we have a simple ridge regression problem.
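For small instances the decomposition of Proposition 8 can be checked numerically. With $k=1$ the k-support norm reduces to the $\ell_1$ norm, so for $d=2$ both sides of the identity (14) are inexpensive grid searches. The sketch below is illustrative only, with arbitrarily chosen parameter values.

```python
import math

a, b, d, k = 0.2, 1.0, 2, 1
c = (b - a) * k + d * a   # parameter coupling assumed in Proposition 8
w = (1.1, 0.5)

# left-hand side: squared box-norm via a grid over the feasible set of theta
n = 600
lhs = min(
    w[0] ** 2 / t1 + w[1] ** 2 / t2
    for i in range(n + 1) for j in range(n + 1)
    for t1 in [a + (b - a) * i / n] for t2 in [a + (b - a) * j / n]
    if t1 + t2 <= c
)

# right-hand side: min_z ||z||_1^2/(b-a) + ||w-z||_2^2/a  (k = 1 case)
m = 600
rhs = min(
    (abs(z1) + abs(z2)) ** 2 / (b - a)
    + ((w[0] - z1) ** 2 + (w[1] - z2) ** 2) / a
    for i in range(m + 1) for j in range(m + 1)
    for z1 in [2.0 * i / m - 0.4] for z2 in [2.0 * j / m - 0.4]
)

# both grids approximate the same value (about 2.133 for these parameters)
assert abs(lhs - rhs) < 2e-2
```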
A further consequence of Proposition 8 is the differentiability of the squared box-norm.
Proposition 10
If $a>0$, the squared box-norm is differentiable on $\mathbb{R}^d$ and its gradient
$\nabla\big(\|\cdot\|^2\big)$ is Lipschitz continuous with parameter $\frac2a$.
3.2 Geometry of the Norms
In this section, we briefly investigate the geometry of the box-norm. Figure 4 depicts unit balls of the k-support norm in $\mathbb{R}^3$ for various parameter values. For $k=1$ and $k=3$ we recognize the $\ell_1$ and $\ell_2$ balls, respectively. For $k=2$ the unit ball retains characteristics of both norms, and in particular we note the discontinuities along each of the coordinate planes, as in the case of the $\ell_1$ norm.
Figure 4 depicts the unit balls of the box-norm for a range of values of $a$ and $c$, holding $b$ fixed. We see that in general the balls increase in volume with each of $a$ and $c$, holding the other parameter fixed. Comparing the k-support norm, which corresponds to $a=0$, with the box-norm for small positive $a$, we see that the parameter $a$ smooths out the sharp edges of the former; this illustrates the smoothing effect of the parameter $a$, as suggested by Proposition 10.
We can gain further insight into the shape of the unit balls of the box-norm from Corollary 6. Equation (11) shows that the primal unit ball is the convex hull of ellipsoids in $\mathbb{R}^d$, where for each group $g\in\mathcal{G}_k$ the semiprincipal axis along dimension $i$ has length $\sqrt b$ if $i\in g$, and length $\sqrt a$ if $i\notin g$. Similarly, the unit ball of the dual box-norm is the intersection of ellipsoids in $\mathbb{R}^d$ where for each group $g$ the semiprincipal axis in dimension $i$ has length $\frac1{\sqrt b}$ if $i\in g$, and length $\frac1{\sqrt a}$ if $i\notin g$ (see also Equation 37 in the appendix). It is instructive to further consider the effect of the parameter $c$ on the unit balls for fixed $a$ and $b$. To this end, recall that $c=(b-a)k+da$, so that when $k=d$ we have $c=db$. In this case, for all values of $a$ in $[0,b]$, the objective in (1) is attained by setting $\theta_i=b$ for all $i$, and we recover the $\ell_2$ norm, scaled by $\frac1{\sqrt b}$, for the primal box-norm. Similarly in (2), the dual norm gives rise to the $\ell_2$ norm, scaled by $\sqrt b$. In the remainder of this section we therefore only consider the cases $k<d$, that is $c<db$.
For $c=b+(d-1)a$, that is $k=1$ and $\rho=0$, the unit ball of the primal box-norm is the convex hull of the ellipsoids defined by
$$\frac{w_i^2}{b}+\sum_{j\neq i}\frac{w_j^2}{a}\leq1,\qquad i\in\mathbb{N}_d,\qquad(18)$$
and the unit ball of the dual box-norm is the intersection of the ellipsoids defined by
$$bu_i^2+a\sum_{j\neq i}u_j^2\leq1,\qquad i\in\mathbb{N}_d.\qquad(19)$$
For $c\in\big(b+(d-1)a,\,2b+(d-2)a\big)$, we have $k=1$ and $\rho\in(0,1)$; let $\gamma=a+\rho(b-a)$. The unit ball of the primal box-norm is the convex hull of the ellipsoids defined by (18) in addition to the following
$$\frac{w_i^2}{b}+\frac{w_j^2}{\gamma}+\sum_{l\neq i,j}\frac{w_l^2}{a}\leq1,\qquad i\neq j,\qquad(20)$$
and the unit ball of the dual box-norm is the intersection of the ellipsoids defined by (19) in addition to the following
$$bu_i^2+\gamma u_j^2+a\sum_{l\neq i,j}u_l^2\leq1,\qquad i\neq j.\qquad(21)$$
For the primal norm, note that since $\gamma\geq a$, each of the ellipsoids in (18) is entirely contained within one of those defined by (20), hence when taking the convex hull we need only consider the set (20). Similarly, for the dual norm, since $\gamma\geq a$, each of the ellipsoids in (21) is contained within one of those defined by (19), hence when taking the intersection we need only consider the set (21).
The figures further depict the constituent ellipses for various parameter values for the primal and dual norms. As $a$ tends to zero the ellipses become degenerate; taking the convex hull we recover the unit ball of the primal norm, and taking the intersection we recover the unit ball of the dual norm. As $a$ tends to $b$ we recover the $\ell_2$ norm in both the primal and the dual.
4 Computation of the Norm and the Proximity Operator
In this section, we compute the box-norm and the proximity operator of the squared box-norm by explicitly solving the optimization problem (1). We also specialize our results to the k-support norm and comment on the improvement with respect to the method by Argyriou et al. (2012). Recall that, for every vector $w\in\mathbb{R}^d$, $|w|^\downarrow$ denotes the vector obtained from $w$ by reordering its components so that they are nonincreasing in absolute value.
Theorem 11
For every $w\in\mathbb{R}^d$ it holds that
$$\|w\|^2=\frac1b\sum_{i=1}^q\big(|w|^\downarrow_i\big)^2+\frac{\big(\sum_{i=q+1}^{d-\ell}|w|^\downarrow_i\big)^2}{c-qb-\ell a}+\frac1a\sum_{i=d-\ell+1}^d\big(|w|^\downarrow_i\big)^2,\qquad(22)$$
where $q$ and $\ell$ are the unique integers in $\{0,\dots,d\}$ that satisfy $q+\ell\leq d$,
$$\frac{|w|^\downarrow_q}{b}\geq\frac1{c-qb-\ell a}\sum_{i=q+1}^{d-\ell}|w|^\downarrow_i>\frac{|w|^\downarrow_{q+1}}{b},\qquad\frac{|w|^\downarrow_{d-\ell}}{a}>\frac1{c-qb-\ell a}\sum_{i=q+1}^{d-\ell}|w|^\downarrow_i\geq\frac{|w|^\downarrow_{d-\ell+1}}{a},\qquad(23)$$
and we have defined $|w|^\downarrow_0=+\infty$ and $|w|^\downarrow_{d+1}=0$. Furthermore, the minimizer $\theta$ has the form
$$\theta_i=\begin{cases}b,&i\in\{1,\dots,q\},\\ \alpha|w|^\downarrow_i,&i\in\{q+1,\dots,d-\ell\},\\ a,&i\in\{d-\ell+1,\dots,d\},\end{cases}\qquad\text{where }\alpha=\frac{c-qb-\ell a}{\sum_{i=q+1}^{d-\ell}|w|^\downarrow_i}.$$
Proof We solve the constrained optimization problem
$$\inf\Big\{\sum_{i=1}^d\frac{w_i^2}{\theta_i}:a\leq\theta_i\leq b,\ \sum_{i=1}^d\theta_i\leq c\Big\}.\qquad(24)$$
To simplify the notation we assume without loss of generality that the $w_i$ are positive and ordered nonincreasing, and note that the optimal $\theta_i$ are then also ordered nonincreasing. To see this, let $\Omega(\theta)=\sum_{i=1}^d\frac{w_i^2}{\theta_i}$. Now suppose that $\theta_i<\theta_j$ for some $i<j$, and define $\tilde\theta$ to be identical to $\theta$, except with the $i$-th and $j$-th elements exchanged. The difference in objective values is
$$\Omega(\tilde\theta)-\Omega(\theta)=\big(w_i^2-w_j^2\big)\Big(\frac1{\theta_j}-\frac1{\theta_i}\Big),$$
which is negative, so $\theta$ cannot be a minimizer.
We further assume without loss of generality that $w_i\neq0$ for all $i$, and that $da<c<db$ (see Remark 12 below). The objective is continuous and we take the infimum over a closed bounded set, so a solution exists and it is unique by strict convexity. Furthermore, since $c<db$, the sum constraint will be tight at the optimum. Consider the Lagrangian function
$$L(\theta,\alpha)=\sum_{i=1}^d\frac{w_i^2}{\theta_i}+\frac1{\alpha^2}\Big(\sum_{i=1}^d\theta_i-c\Big),\qquad(25)$$
where $\frac1{\alpha^2}$ is a strictly positive multiplier, and $\alpha$ is to be chosen so as to make the sum constraint tight; call this value $\alpha^*$. Let $\theta^*$ be the minimizer of $L(\theta,\alpha^*)$ over $\theta$ subject to the box constraint $a\leq\theta_i\leq b$.
We claim that $\theta^*$ solves equation (24). Indeed, for any $\theta$ with $a\leq\theta_i\leq b$, $L(\theta^*,\alpha^*)\leq L(\theta,\alpha^*)$, which implies that
$$\sum_{i=1}^d\frac{w_i^2}{\theta_i^*}\leq\sum_{i=1}^d\frac{w_i^2}{\theta_i}+\frac1{(\alpha^*)^2}\Big(\sum_{i=1}^d\theta_i-c\Big).$$
If in addition we impose the constraint $\sum_{i=1}^d\theta_i\leq c$, the second term on the right hand side is at most zero, so we have, for all $\theta$ such that $a\leq\theta_i\leq b$ and $\sum_{i=1}^d\theta_i\leq c$,
$$\sum_{i=1}^d\frac{w_i^2}{\theta_i^*}\leq\sum_{i=1}^d\frac{w_i^2}{\theta_i},$$
whence it follows that $\theta^*$ is the minimizer of (24).
We can therefore solve the original problem by minimizing the Lagrangian (25) over the box constraint alone. Once the sum constraint has been incorporated via the multiplier, the problem is separable, and we can solve the simplified problem componentwise (see Micchelli et al., 2013, Theorem 3.1). For completeness we repeat the argument here. For every $\alpha>0$ and $i\in\mathbb{N}_d$, the unique solution to the problem $\min\big\{\frac{w_i^2}{\theta}+\frac{\theta}{\alpha^2}:a\leq\theta\leq b\big\}$ is given by
$$\theta_i=\begin{cases}a,&\text{if }\alpha|w_i|<a,\\ \alpha|w_i|,&\text{if }a\leq\alpha|w_i|\leq b,\\ b,&\text{if }\alpha|w_i|>b.\end{cases}\qquad(26)$$
Indeed, for fixed $\alpha$, the objective function is strictly convex on $\mathbb{R}_{++}$ and has a unique minimum on $(0,\infty)$ at $\theta=\alpha|w_i|$ (see Figure 1.b in Micchelli et al. (2013) for an illustration). The derivative of the objective function is zero at $\theta=\alpha|w_i|$, strictly negative below it and strictly positive above it. Considering these three cases the result follows, and the minimizer of (25) is determined by (26), where $\alpha$ satisfies $\sum_{i=1}^d\theta_i(\alpha)=c$.
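Equation (26) also yields a simple, if not the fastest, way to evaluate the box-norm numerically: the sum $\sum_i\theta_i(\alpha)$ is continuous and nondecreasing in $\alpha$, so the multiplier that makes the sum constraint tight can be found by bisection. The sketch below is illustrative only (it is not the fast procedure developed in this section) and cross-checks the result against a brute-force grid for $d=2$.

```python
import math

def box_norm(w, a, b, c, iters=200):
    # theta_i(alpha) = clip(alpha * |w_i|, a, b); choose alpha so the sum is c
    def s(alpha):
        return sum(min(b, max(a, alpha * abs(wi))) for wi in w)
    lo, hi = 0.0, 1.0
    while s(hi) < c:          # grow the bracket until it contains the root
        hi *= 2.0
    for _ in range(iters):    # bisection on the monotone function s
        mid = 0.5 * (lo + hi)
        if s(mid) < c:
            lo = mid
        else:
            hi = mid
    alpha = 0.5 * (lo + hi)
    theta = [min(b, max(a, alpha * abs(wi))) for wi in w]
    return math.sqrt(sum(wi * wi / ti for wi, ti in zip(w, theta)))

w, a, b, c = [1.5, 0.4], 0.1, 1.0, 1.2   # requires d*a <= c < d*b
val = box_norm(w, a, b, c)

# brute force over the feasible box {a <= t_i <= b, t1 + t2 <= c}
n = 500
best = min(
    w[0] ** 2 / t1 + w[1] ** 2 / t2
    for i in range(n + 1) for j in range(n + 1)
    for t1 in [a + (b - a) * i / n] for t2 in [a + (b - a) * j / n]
    if t1 + t2 <= c
)
assert abs(val - math.sqrt(best)) < 1e-2
```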
The minimizer then has the form
$$\theta=(\underbrace{b,\dots,b}_{q},\,\theta_{q+1},\dots,\theta_{d-\ell},\,\underbrace{a,\dots,a}_{\ell}),$$
where $\theta_i=\alpha w_i$ for $i\in\{q+1,\dots,d-\ell\}$, and $q$ and $\ell$ are determined by the value of $\alpha$ which satisfies the constraint $\sum_{i=1}^d\theta_i=c$,
i.e. $qb+\alpha\sum_{i=q+1}^{d-\ell}w_i+\ell a=c$, where $\alpha=\dfrac{c-qb-\ell a}{\sum_{i=q+1}^{d-\ell}w_i}$.
The value of the norm follows by substituting $\theta$ into the objective, and we get
$$\|w\|^2=\frac1b\sum_{i=1}^qw_i^2+\frac1\alpha\sum_{i=q+1}^{d-\ell}w_i+\frac1a\sum_{i=d-\ell+1}^dw_i^2=\frac1b\sum_{i=1}^qw_i^2+\frac{\big(\sum_{i=q+1}^{d-\ell}w_i\big)^2}{c-qb-\ell a}+\frac1a\sum_{i=d-\ell+1}^dw_i^2,$$
as required. We can further characterize $q$ and $\ell$ by considering the form of $\theta$. By construction we have $\alpha w_q\geq b>\alpha w_{q+1}$ and $\alpha w_{d-\ell}>a\geq\alpha w_{d-\ell+1}$, or equivalently