
Fitting Spectral Decay with the k-Support Norm

Abstract

The spectral k-support norm enjoys good estimation properties in low rank matrix learning problems, empirically outperforming the trace norm. Its unit ball is the convex hull of rank k matrices with unit Frobenius norm. In this paper we generalize the norm to the spectral (k, p)-support norm, whose additional parameter p can be used to tailor the norm to the decay of the spectrum of the underlying model. We characterize the unit ball and we explicitly compute the norm. We further provide a conditional gradient method to solve regularization problems with the norm, and we derive an efficient algorithm to compute the Euclidean projection on the unit ball in the case p = ∞. In numerical experiments, we show that allowing p to vary significantly improves performance over the spectral k-support norm on various matrix completion benchmarks, and better captures the spectral decay of the underlying model.

Keywords. k-support norm, orthogonally invariant norms, matrix completion, multitask learning, proximal point algorithms.

1 Introduction

The problem of learning a sparse vector or a low rank matrix has generated much interest in recent years. A popular approach is to use convex regularizers which encourage sparsity, and a number of these have been studied with applications including image denoising, collaborative filtering and multitask learning, see for example, [Bühlmann and van de Geer 2011, Wainwright 2014] and references therein.

Recently, the k-support norm was proposed by [Argyriou et al. 2012], motivated as a tight relaxation of the set of k-sparse vectors of unit Euclidean norm. The authors argue that, as a regularizer for sparse vector estimation, the norm empirically outperforms the Lasso [Tibshirani 1996] and Elastic Net [Zou and Hastie 2005] penalties. Statistical bounds on the Gaussian width of the k-support norm have been provided by [Chatterjee et al. 2014]. The k-support norm has also been extended to the matrix setting. By applying the norm to the vector of singular values of a matrix, [McDonald et al. 2014] obtain the orthogonally invariant spectral k-support norm, reporting state of the art performance on matrix completion benchmarks.

Motivated by the performance of the k-support norm in sparse vector and matrix learning problems, in this paper we study a natural generalization obtained by using the ℓp-norms (for p ∈ [1, ∞]) in place of the Euclidean norm. These allow a further degree of freedom when fitting a model to the underlying data. We call the ensuing norm the (k, p)-support norm. As we demonstrate in numerical experiments, p = 2 is not necessarily the best choice in all instances. By tuning the value of p, the model can incorporate prior information regarding the singular values. When prior knowledge is lacking, the parameter p can be chosen by validation, hence the model can adapt to a variety of decay patterns of the singular values. An interesting property of the norm is that it interpolates between the ℓ1-norm (for k = 1) and the ℓp-norm (for k = d). It follows that varying both k and p allows one to learn sparse vectors which exhibit different patterns of decay in the non-zero elements. In particular, when p = ∞ the norm prefers vectors whose non-zero components are constant.

A main goal of the paper is to study the proposed norm in matrix learning problems. The (k, p)-support norm is a symmetric gauge function, hence it induces an orthogonally invariant norm, the spectral (k, p)-support norm. This interpolates between the trace norm (for k = 1) and the Schatten p-norms (for k = d), and its unit ball has a simple geometric interpretation as the convex hull of matrices of rank no greater than k and Schatten p-norm no greater than one. This suggests that the new norm favors low rank structure, while varying p allows for different patterns of decay in the spectrum. In the special case p = ∞, the (k, ∞)-support norm is the dual of the Ky-Fan k-norm [Bhatia 1997] and it encourages a flat spectrum when used as a regularizer.

The main contributions of the paper are: i) we propose the (k, p)-support norm as an extension of the k-support norm, and we characterize in particular the unit ball of the induced orthogonally invariant matrix norm (Section 3); ii) we show that the norm can be computed efficiently and we discuss the role of the parameter p (Section 4); iii) we outline a conditional gradient method to solve the associated regularization problem for both vector and matrix problems (Section 5), and in the special case p = ∞ we provide an O(d log d) computation of the projection operator (Section 5.1); finally, iv) we present numerical experiments on matrix completion benchmarks which demonstrate that the proposed norm offers significant improvement over previous methods, and we discuss the effect of the parameter p (Section 6). The appendix contains derivations of results which are sketched in, or omitted from, the main body of the paper.

Notation. We use [d] for the set of integers from 1 up to and including d. We let ℝ^d be the d-dimensional real vector space, whose elements are denoted by lower case letters. For any vector w ∈ ℝ^d, its support is defined as supp(w) = {i : w_i ≠ 0}, and its cardinality is defined as card(w) = |supp(w)|. We let ℝ^{m×n} be the space of m×n real matrices. We denote the rank of a matrix W as rank(W). We let σ(W) ∈ ℝ^r be the vector formed by the singular values of W, where r = min(m, n), and where we assume that the singular values are ordered nonincreasingly, that is σ_1(W) ≥ … ≥ σ_r(W) ≥ 0. For p ∈ [1, ∞), the ℓp-norm of a vector w is defined as ‖w‖_p = (Σ_{i=1}^{d} |w_i|^p)^{1/p}, and ‖w‖_∞ = max_{i∈[d]} |w_i|. Given a norm ‖·‖ on ℝ^d or ℝ^{m×n}, ‖·‖_* denotes the corresponding dual norm, defined by ‖u‖_* = sup{ ⟨u, w⟩ : ‖w‖ ≤ 1 }. The convex hull of a subset S of a vector space is denoted co(S).

2 Background and Previous Work

For every k ∈ [d], the k-support norm ‖·‖_(k) is defined as the norm whose unit ball is given by

C_k = co{ w ∈ ℝ^d : card(w) ≤ k, ‖w‖_2 ≤ 1 },   (2.1)

that is, the convex hull of the set of vectors of cardinality at most k and ℓ2-norm no greater than one [Argyriou et al. 2012]. We readily see that for k = 1 and k = d we recover the unit ball of the ℓ1 and ℓ2-norms, respectively.

The k-support norm of a vector w ∈ ℝ^d can be expressed as an infimal convolution [Rockafellar 1970, p. 34],

‖w‖_(k) = inf{ Σ_{g∈G_k} ‖v_g‖_2 : Σ_{g∈G_k} v_g = w },   (2.2)

where G_k is the collection of all subsets of [d] containing at most k elements and the infimum is over all vectors v_g ∈ ℝ^d such that supp(v_g) ⊆ g, for g ∈ G_k. Equation (2.2) highlights that the k-support norm is a special case of the group lasso with overlap [Jacob et al. 2009], where the cardinality of the support sets is at most k. This expression suggests that, when used as a regularizer, the norm encourages vectors to be a sum of a limited number of vectors with small support. Due to the variational form of (2.2), computing the norm is not straightforward; however, [Argyriou et al. 2012] note that the dual norm has a simple form, namely it is the ℓ2-norm of the k largest components,

‖u‖_{(k),*} = ( Σ_{i=1}^{k} (|u|↓_i)^2 )^{1/2},   u ∈ ℝ^d,   (2.3)

where |u|↓ is the vector obtained from u by reordering its components so that they are nonincreasing in absolute value. Note also from equation (2.3) that for k = 1 and k = d, the dual norm is equal to the ℓ∞-norm and ℓ2-norm, respectively, which agrees with our earlier observation regarding the primal norm.

A related problem which has been studied in recent years is learning a matrix from a set of linear measurements, in which the underlying matrix is assumed to have sparse spectrum (low rank). The trace norm, the ℓ1-norm of the singular values of a matrix, has been shown to perform well in this setting, see e.g. [Argyriou et al. 2008, Jaggi and Sulovsky 2010]. Recall that a norm ‖·‖ on ℝ^{m×n} is called orthogonally invariant if ‖W‖ = ‖UWV⊤‖, for any orthogonal matrices U ∈ ℝ^{m×m} and V ∈ ℝ^{n×n}. A classical result by von Neumann establishes that a norm is orthogonally invariant if and only if it is of the form ‖W‖ = g(σ(W)), where σ(W) is the vector formed by the singular values of W in nonincreasing order, and g is a symmetric gauge function [Von Neumann 1937]. In other words, g is a norm which is invariant under permutations and sign changes of the vector components, that is g(w) = g(PJw), where P is any permutation matrix and J is diagonal with entries equal to ±1 [Horn and Johnson 1991, p. 438].

Examples of symmetric gauge functions are the ℓp-norms for p ∈ [1, ∞], and the corresponding orthogonally invariant norms are called the Schatten p-norms [Horn and Johnson 1991, p. 441]. In particular, these include the trace norm and the Frobenius norm for p = 1 and p = 2, respectively. Regularization with Schatten p-norms has been previously studied by [Argyriou et al. 2007] and a statistical analysis has been performed by [Rohde and Tsybakov 2011]. As the set G_k includes all subsets of [d] of cardinality at most k, expression (2.2) for the k-support norm reveals that it is a symmetric gauge function. [McDonald et al. 2014] use this fact to introduce the spectral k-support norm for matrices, by defining ‖W‖_(k) = ‖σ(W)‖_(k) for W ∈ ℝ^{m×n}, and report state of the art performance on matrix completion benchmarks.

3 The (k, p)-Support Norm

In this section we introduce the (k, p)-support norm as a natural extension of the k-support norm. This follows by applying the ℓp-norm, rather than the Euclidean norm, in the infimal convolution definition of the norm.

Definition 1.

Let k ∈ [d] and p ∈ [1, ∞]. The (k, p)-support norm ‖·‖_{(k,p)} of a vector w ∈ ℝ^d is defined as

‖w‖_{(k,p)} = inf{ Σ_{g∈G_k} ‖v_g‖_p : Σ_{g∈G_k} v_g = w },   (3.1)

where the infimum is over all vectors v_g ∈ ℝ^d such that supp(v_g) ⊆ g, for g ∈ G_k.

Let us note that the norm is well defined. Indeed, positivity, homogeneity and non degeneracy are immediate. To prove the triangle inequality, let w, w′ ∈ ℝ^d. For any ε > 0 there exist (v_g)_{g∈G_k} and (z_g)_{g∈G_k} such that Σ_g v_g = w, Σ_g z_g = w′, Σ_g ‖v_g‖_p ≤ ‖w‖_{(k,p)} + ε, and Σ_g ‖z_g‖_p ≤ ‖w′‖_{(k,p)} + ε. As Σ_g (v_g + z_g) = w + w′, we have

‖w + w′‖_{(k,p)} ≤ Σ_g ‖v_g + z_g‖_p ≤ Σ_g ‖v_g‖_p + Σ_g ‖z_g‖_p ≤ ‖w‖_{(k,p)} + ‖w′‖_{(k,p)} + 2ε,

and the result follows by letting ε tend to zero.

Note that, since a compact convex set is the convex hull of its extreme points, Definition 1 implies that the unit ball of the (k, p)-support norm, denoted by C_{k,p}, is given by the convex hull of the set of vectors with cardinality no greater than k and ℓp-norm no greater than 1, that is

C_{k,p} = co{ w ∈ ℝ^d : card(w) ≤ k, ‖w‖_p ≤ 1 }.   (3.2)

Definition 1 gives the norm as the solution of a variational problem. Its explicit computation is not straightforward in the general case; however, for p = 1 the unit ball (3.2) does not depend on k and is always equal to the ℓ1 unit ball. Thus, the (k, 1)-support norm is always equal to the ℓ1-norm, and we do not consider this case further in this section. Similarly, for k = 1 we recover the ℓ1-norm for all values of p. For p = ∞, from the definition of the dual norm it is not difficult to show that ‖w‖_{(k,∞)} = max( ‖w‖_∞, ‖w‖_1/k ). We return to this in Section 4 when we describe how to compute the norm for all values of p.

Note further that in equation (3.1), as p tends to ∞, the ℓp-norm of each v_g is increasingly dominated by the largest component of v_g. As the variational formulation tries to identify vectors with small aggregate ℓp-norm, this suggests that higher values of p encourage each v_g to tend to a vector whose non-zero entries are equal in magnitude. In this manner, varying p allows us to adjust the degree to which the components of the vector w can be clustered into (possibly overlapping) groups of size k.

As in the case of the k-support norm, the dual (k, p)-support norm has a simple expression. Recall that the dual norm of a vector u ∈ ℝ^d is defined by the optimization problem

‖u‖_{(k,p),*} = max{ ⟨u, w⟩ : ‖w‖_{(k,p)} ≤ 1 }.   (3.3)
Proposition 2.

Let p ∈ (1, ∞] and let q ∈ [1, ∞) satisfy 1/p + 1/q = 1. Then the dual (k, p)-support norm is given by

‖u‖_{(k,p),*} = ( Σ_{i∈I_k} |u_i|^q )^{1/q},

where I_k is the set of indices of the k largest components of u in absolute value. Furthermore, if p ∈ (1, ∞) and u ≠ 0, then the maximum in (3.3) is attained for

w_i = sign(u_i) |u_i|^{q−1} / ( Σ_{j∈I_k} |u_j|^q )^{1/p}  if i ∈ I_k,  and w_i = 0 otherwise.   (3.4)

If p = ∞ the maximum is attained for

w_i = sign(u_i)  if i ∈ I_k,  and w_i = 0 otherwise.

Note that for p = 2 we recover the dual of the k-support norm in (2.3).
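
To make Proposition 2 concrete, the following is a minimal numpy sketch (ours, not part of the original paper; the function names are placeholders) that evaluates the dual (k, p)-support norm as the ℓq-norm of the k largest components in absolute value and returns a maximizer of (3.3) as in (3.4):

    import numpy as np

    def kp_dual_norm(u, k, p):
        """Dual (k, p)-support norm: the l_q norm of the k largest |u_i|, with q = p/(p-1)."""
        a = np.sort(np.abs(u))[::-1][:k]          # k largest components in absolute value
        if np.isinf(p):
            return a.sum()                        # q = 1
        q = p / (p - 1.0)
        return (a ** q).sum() ** (1.0 / q)

    def kp_dual_maximizer(u, k, p):
        """A vector w with ||w||_(k,p) <= 1 attaining <u, w> = kp_dual_norm(u, k, p); cf. (3.4).
        Assumes u != 0, as in Proposition 2."""
        idx = np.argsort(np.abs(u))[::-1][:k]     # indices of the k largest |u_i|
        w = np.zeros(len(u))
        if np.isinf(p):
            w[idx] = np.sign(u[idx])
            return w
        q = p / (p - 1.0)
        denom = (np.abs(u[idx]) ** q).sum() ** (1.0 / p)
        w[idx] = np.sign(u[idx]) * np.abs(u[idx]) ** (q - 1.0) / denom
        return w

As a quick check, np.dot(u, kp_dual_maximizer(u, k, p)) agrees with kp_dual_norm(u, k, p) for any nonzero u.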

3.1 The Spectral (k, p)-Support Norm

From Definition 1 it is clear that the (k, p)-support norm is a symmetric gauge function. This follows since G_k contains all groups of cardinality at most k and the ℓp-norms only involve the absolute values of the components. Hence we can define the spectral (k, p)-support norm as

‖W‖_{(k,p)} = ‖σ(W)‖_{(k,p)},   W ∈ ℝ^{m×n}.

Since the dual of any orthogonally invariant norm ‖·‖ = g(σ(·)) is given by ‖·‖_* = g_*(σ(·)), see e.g. [Lewis 1995], we conclude that the dual spectral (k, p)-support norm is given by

‖Z‖_{(k,p),*} = ‖σ(Z)‖_{(k,p),*},   Z ∈ ℝ^{m×n}.

The next result characterizes the unit ball of the spectral (k, p)-support norm. Due to the relationship between an orthogonally invariant norm and its corresponding symmetric gauge function, we see that the cardinality constraint for vectors generalizes in a natural manner to a rank constraint for matrices.

Proposition 3.

The unit ball of the spectral (k, p)-support norm is the convex hull of the set of matrices of rank at most k and Schatten p-norm no greater than one.

In particular, if p = ∞, the dual vector norm is given, for u ∈ ℝ^d, by ‖u‖_{(k,∞),*} = Σ_{i=1}^{k} |u|↓_i. Hence, for any Z ∈ ℝ^{m×n}, the dual spectral norm is given by ‖Z‖_{(k,∞),*} = Σ_{i=1}^{k} σ_i(Z), that is, the sum of the k largest singular values, which is also known as the Ky-Fan k-norm, see e.g. [Bhatia 1997].
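
As a small illustration (our own sketch), the dual spectral norm can be evaluated directly from the singular values; for p = ∞ this is the Ky-Fan k-norm:

    import numpy as np

    def spectral_kp_dual_norm(Z, k, p):
        """Dual spectral (k, p)-support norm: the vector dual norm applied to sigma(Z)."""
        sigma = np.linalg.svd(Z, compute_uv=False)[:k]   # singular values are returned in descending order
        if np.isinf(p):
            return sigma.sum()                           # Ky-Fan k-norm
        q = p / (p - 1.0)
        return (sigma ** q).sum() ** (1.0 / q)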

4 Computing the Norm

In this section we compute the norm, illustrating how it interpolates between the ℓ1-norm and the ℓp-norm.

Theorem 4.

Let p ∈ (1, ∞) and let q = p/(p−1). For every w ∈ ℝ^d and k ∈ [d], it holds that

‖w‖_{(k,p)} = [ Σ_{i=1}^{ℓ} (|w|↓_i)^p + (1/(k−ℓ)^{p−1}) ( Σ_{i=ℓ+1}^{d} |w|↓_i )^p ]^{1/p},   (4.1)

where, if |w|↓_{k+1} = 0 (or k = d), we set ℓ = k, and otherwise ℓ is the largest integer in {0, …, k−1} satisfying

(k − ℓ) |w|↓_ℓ ≥ Σ_{i=ℓ+1}^{d} |w|↓_i,   (4.2)

with the convention that (4.2) always holds for ℓ = 0. Furthermore, the norm can be computed in O(d log d) time.

Proof.

Note first that in (4.1), when ℓ = 0 we understand the first term in the right hand side to be zero, and when ℓ = k we understand the second term to be zero.

We need to compute

‖w‖_{(k,p)} = max{ ⟨u, w⟩ : ‖u‖_{(k,p),*} ≤ 1 },

where the dual norm is described in Proposition 2. Without loss of generality we may assume that the components of w are nonnegative and ordered nonincreasingly, so that w_i = |w|↓_i. The problem is then equivalent to

max{ Σ_{i=1}^{d} w_i u_i : Σ_{i=1}^{k} u_i^q ≤ 1, u_1 ≥ u_2 ≥ … ≥ u_d ≥ 0 }.   (4.3)

Since the components u_{k+1}, …, u_d appear only in the objective and the ordering constraints, at the optimum they all equal u_k, and the problem further simplifies to the k-dimensional problem

max{ Σ_{i=1}^{k−1} w_i u_i + u_k Σ_{i=k}^{d} w_i : Σ_{i=1}^{k} u_i^q ≤ 1, u_1 ≥ … ≥ u_k ≥ 0 }.

Note that when k = d, the solution is given by the dual of the ℓq-norm, that is, the ℓp-norm. For the remainder of the proof we assume that k < d. We can now attempt to use Hölder's inequality, which states that ⟨u, z⟩ ≤ ‖u‖_q ‖z‖_p for all vectors u, z ∈ ℝ^k with 1/p + 1/q = 1, and the inequality is tight if and only if

u_i = c sign(z_i) |z_i|^{p−1},  i ∈ [k], for some constant c ≥ 0.

We use it for the vector z = (w_1, …, w_{k−1}, Σ_{i=k}^{d} w_i). The components of the maximizer satisfy u_i = (w_i/M_{k−1})^{p−1} if i ≤ k−1, and

u_k = ( Σ_{i=k}^{d} w_i / M_{k−1} )^{p−1},

where for every ℓ ∈ {0, …, k}, M_ℓ denotes the r.h.s. in equation (4.1). We then need to verify that the ordering constraints are satisfied. This requires that

w_{k−1} ≥ Σ_{i=k}^{d} w_i,

which is equivalent to inequality (4.2) for ℓ = k−1. If this inequality is true we are done, otherwise we set u_{k−1} = u_k and solve the smaller problem. We use again Hölder's inequality and keep the result if the ordering constraints are fulfilled. Continuing in this way, the generic problem we need to solve is

max{ Σ_{i=1}^{ℓ} w_i u_i + t Σ_{i=ℓ+1}^{d} w_i : Σ_{i=1}^{ℓ} u_i^q + (k−ℓ) t^q ≤ 1, u_1 ≥ … ≥ u_ℓ ≥ t ≥ 0 },

where ℓ ∈ {0, …, k−1}. Without the ordering constraints the maximum, M_ℓ, is obtained by the change of variable s = (k−ℓ)^{1/q} t followed by applying Hölder's inequality. A direct computation provides that the maximizer is u_i = (w_i/M_ℓ)^{p−1} if i ≤ ℓ, and

s = ( Σ_{i=ℓ+1}^{d} w_i / ((k−ℓ)^{1/q} M_ℓ) )^{p−1}.

Using the relationship t = (k−ℓ)^{−1/q} s and p − 1 = p/q, we can rewrite this as

t = ( Σ_{i=ℓ+1}^{d} w_i / ((k−ℓ) M_ℓ) )^{p−1}.

Hence, the ordering constraints are satisfied if

u_ℓ ≥ t, that is, (k−ℓ) w_ℓ ≥ Σ_{i=ℓ+1}^{d} w_i,

which is equivalent to (4.2). Finally, note that M_ℓ is a nondecreasing function of ℓ. This is because the problem with a smaller value of ℓ is more constrained, namely, it solves (4.3) with the additional constraints u_{ℓ+1} = … = u_k. Moreover, if the constraint (4.2) holds for some value of ℓ then it also holds for any smaller value of ℓ, hence we maximize the objective by choosing the largest ℓ for which (4.2) holds.

The computational complexity stems from the monotonicity of condition (4.2) with respect to ℓ, which allows us to identify the critical value of ℓ by binary search, after sorting the components of w and precomputing partial sums. ∎

Note that for k = d we recover the ℓp-norm, and for p = 2 we recover the result in [Argyriou et al. 2012, McDonald et al. 2014]; however, our proof technique is different from theirs.

Remark 5 (Computation of the norm for p ∈ {1, ∞}).

Since the norm computed above for p ∈ (1, ∞) is continuous in p, the special cases p = 1 and p = ∞ can be derived by a limiting argument. We readily see that for p = 1 the norm does not depend on k and is always equal to the ℓ1-norm, in agreement with our observation in the previous section. For p = ∞ we obtain that ‖w‖_{(k,∞)} = max( ‖w‖_∞, ‖w‖_1/k ).
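
For concreteness, here is a small numpy sketch (ours, not from the paper) of the formula stated in Theorem 4; it uses a linear scan for the critical index, while a binary search over the sorted components gives the O(d log d) bound:

    import numpy as np

    def kp_support_norm(w, k, p):
        """(k, p)-support norm of w via the formula of Theorem 4 (1 < p < inf)."""
        a = np.sort(np.abs(w))[::-1]                  # nonincreasing absolute values
        d = a.size
        if k >= d or a[k:].sum() == 0:                # at most k nonzeros: plain l_p norm
            return np.linalg.norm(a, ord=p)
        tail = np.concatenate(([a.sum()], a.sum() - np.cumsum(a)))   # tail[l] = sum_{i > l} a_i
        ell = 0                                       # l = 0 always satisfies condition (4.2)
        for l in range(k - 1, 0, -1):                 # largest l in {0, ..., k-1} with (k-l)*a_l >= tail[l]
            if (k - l) * a[l - 1] >= tail[l]:
                ell = l
                break
        head = (a[:ell] ** p).sum()
        return (head + tail[ell] ** p / (k - ell) ** (p - 1.0)) ** (1.0 / p)

    w = np.array([3.0, 1.0, 0.5, 0.1])
    print(kp_support_norm(w, k=2, p=2))               # 3.4  (the k-support norm, p = 2)
    print(kp_support_norm(w, k=2, p=50))              # close to max(||w||_inf, ||w||_1 / k) = 3.0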

5 Optimization

In this section, we describe how to solve regularization problems using the vector and matrix (k, p)-support norms. We consider the constrained optimization problem

min{ f(w) : ‖w‖_{(k,p)} ≤ α },   (5.1)

where w lies in ℝ^d or ℝ^{m×n}, α > 0 is a regularization parameter and the error function f is assumed to be convex and continuously differentiable. For example, in linear regression a valid choice is the square error, f(w) = ‖Xw − y‖_2^2, where X is a matrix of observations and y a vector of response variables. Constrained problems of the form (5.1) are also referred to as Ivanov regularization in the inverse problems literature [Ivanov et al. 1978].

A convenient tool to solve problem (5.1) is provided by the Frank-Wolfe method [Frank and Wolfe 1956], see also [Jaggi 2013] for a recent account.

  Choose w^(0) such that ‖w^(0)‖_{(k,p)} ≤ α
  for t = 0, 1, 2, … do
      Compute g^(t) = ∇f(w^(t))
      Compute s^(t) ∈ argmax{ ⟨s, −g^(t)⟩ : ‖s‖_{(k,p)} ≤ α }
      Update w^(t+1) = (1 − γ_t) w^(t) + γ_t s^(t), for γ_t = 2/(t+2)
  end for
Algorithm 1 Frank-Wolfe.

The method is outlined in Algorithm 1, and it has worst case convergence rate O(1/t). The key step of the algorithm is to solve the subproblem

s^(t) ∈ argmax{ ⟨s, −g^(t)⟩ : ‖s‖_{(k,p)} ≤ α },   (5.2)

where g^(t) = ∇f(w^(t)), that is, the gradient of the objective function at the t-th iteration. This problem involves computing a subgradient of the dual norm at −g^(t). It can be solved exactly and efficiently as a consequence of Proposition 2. We discuss here the vector case and postpone the discussion of the matrix case to Section 5.2. By symmetry of the ℓp-norm, problem (5.2) can be solved in the same manner as the maximum in Proposition 2, and the solution is given by s^(t) = α w̄, where w̄ is given by (3.4) applied to u = −g^(t). Specifically, letting I_k be the set of indices of the k largest components of u in absolute value, for p ∈ (1, ∞) we have

s_i^(t) = α sign(u_i) |u_i|^{q−1} / ( Σ_{j∈I_k} |u_j|^q )^{1/p}  if i ∈ I_k, and s_i^(t) = 0 otherwise,   (5.3)

and, for p = ∞ we choose the subgradient

s_i^(t) = α sign(u_i)  if i ∈ I_k, and s_i^(t) = 0 otherwise.   (5.4)
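
As an illustration, the following sketch (our code, with hypothetical function names) implements Algorithm 1 for the vector least squares problem, using the closed-form oracle (5.3)-(5.4):

    import numpy as np

    def fw_linear_oracle(grad, k, p, alpha):
        """Solve (5.2): argmax <s, -grad> over the (k, p)-support ball of radius alpha."""
        u = -grad
        idx = np.argsort(np.abs(u))[::-1][:k]         # indices of the k largest |u_i|
        s = np.zeros(len(u))
        if np.isinf(p):
            s[idx] = alpha * np.sign(u[idx])          # subgradient choice (5.4)
            return s
        q = p / (p - 1.0)
        denom = (np.abs(u[idx]) ** q).sum() ** (1.0 / p)
        if denom > 0:
            s[idx] = alpha * np.sign(u[idx]) * np.abs(u[idx]) ** (q - 1.0) / denom   # (5.3)
        return s

    def frank_wolfe_ls(X, y, k, p, alpha, iters=200):
        """Algorithm 1 for min ||Xw - y||_2^2 subject to ||w||_(k,p) <= alpha."""
        w = np.zeros(X.shape[1])                      # feasible starting point
        for t in range(iters):
            grad = 2.0 * X.T @ (X @ w - y)
            s = fw_linear_oracle(grad, k, p, alpha)
            gamma = 2.0 / (t + 2.0)                   # standard step size
            w = (1.0 - gamma) * w + gamma * s
        return w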

5.1 Projection Operator

An alternative method to solve (5.1) in the vector case is to consider the equivalent problem

min{ f(w) + δ_C(w) : w ∈ ℝ^d },   (5.5)

where δ_C is the indicator function of the convex set C = { w ∈ ℝ^d : ‖w‖_{(k,p)} ≤ α }. Proximal gradient methods can be used to solve optimization problems of the form f + λg, where f is a convex loss function with Lipschitz continuous gradient, λ > 0 is a regularization parameter, and g is a convex function for which the proximity operator can be computed efficiently, see [Beck and Teboulle 2009, Nesterov 2007] and references therein. The proximity operator of g with parameter ρ > 0 is defined as prox_{ρg}(x) = argmin{ (1/2)‖x − w‖_2^2 + ρ g(w) : w ∈ ℝ^d }. The proximity operator for the squared k-support norm was computed by [Argyriou et al. 2012] and [McDonald et al. 2014], and for the k-support norm by [Chatterjee et al. 2014].

In the special case that g = δ_C, where C is a convex set, the proximity operator reduces to the projection operator onto C. For the (k, p)-support norm, in the case p = ∞ we can compute the projection onto its unit ball using the following result.

Proposition 6.

For every x ∈ ℝ^d, the projection z of x onto the unit ball of the (k, ∞)-norm is given by

z_i = sign(x_i) min( 1, (|x_i| − β)_+ ),   i ∈ [d],   (5.6)

where β = 0 if Σ_{i=1}^{d} min(1, |x_i|) ≤ k, otherwise β > 0 is chosen such that Σ_{i=1}^{d} min(1, (|x_i| − β)_+) = k. Furthermore, the projection can be computed in O(d log d) time.

Proof.

(Sketch) We solve the optimization problem

min{ (1/2)‖x − z‖_2^2 : ‖z‖_∞ ≤ 1, ‖z‖_1 ≤ k }.   (5.7)

We consider two cases. If Σ_{i=1}^{d} min(1, |x_i|) ≤ k, then the ℓ1 constraint is inactive, the problem decouples and we solve it componentwise. If Σ_{i=1}^{d} min(1, |x_i|) > k, we solve problem (5.7) by minimizing the Lagrangian function (1/2)‖x − z‖_2^2 + β(‖z‖_1 − k) over ‖z‖_∞ ≤ 1, with nonnegative multiplier β. This can be done componentwise, and at the optimum the constraint ‖z‖_1 ≤ k will be tight. Finally, both cases can be combined into the form of (5.6). The complexity follows by taking advantage of the monotonicity of β ↦ Σ_{i=1}^{d} min(1, (|x_i| − β)_+). ∎

We can use Proposition 6 to project onto the ball of radius α by a rescaling argument (see the appendix for details).
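
The following sketch (ours) implements the projection of Proposition 6 together with the rescaling to a ball of radius alpha; for simplicity it finds the multiplier beta by bisection rather than by the sorting-based O(d log d) procedure:

    import numpy as np

    def project_k_inf_ball(x, k, alpha=1.0, tol=1e-10):
        """Euclidean projection of x onto {z : ||z||_inf <= alpha, ||z||_1 <= k * alpha},
        the (k, inf)-support-norm ball of radius alpha (Proposition 6 plus rescaling)."""
        a = np.abs(x) / alpha                              # reduce to the unit ball
        if np.minimum(a, 1.0).sum() <= k:
            z = np.minimum(a, 1.0)                         # beta = 0: the l_1 constraint is inactive
        else:
            lo, hi = 0.0, a.max()                          # find beta with sum_i min(1, (a_i - beta)_+) = k
            while hi - lo > tol:
                beta = 0.5 * (lo + hi)
                if np.minimum(np.maximum(a - beta, 0.0), 1.0).sum() > k:
                    lo = beta
                else:
                    hi = beta
            z = np.minimum(np.maximum(a - hi, 0.0), 1.0)
        return alpha * np.sign(x) * z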

5.2 Matrix Problem

Given a data matrix Y ∈ ℝ^{m×n} for which we observe only a subset of entries, we consider the constrained optimization problem

min{ ‖Ω(W − Y)‖_F^2 : ‖W‖_{(k,p)} ≤ α, W ∈ ℝ^{m×n} },   (5.8)

where the operator Ω applied to a matrix sets the unobserved values to zero. As in the vector case, the Frank-Wolfe method can be applied to the matrix problem. Algorithm 1 is particularly convenient in this case as we only need to compute the k largest singular values and the corresponding singular vectors of the gradient, which can result in a computationally efficient algorithm. This is a direct consequence of Proposition 2 and von Neumann's trace inequality, see e.g. [Marshall and Olkin 1979, Ch. 9 Sec. H.1.h]. We obtain that the solution of the inner step is S^(t) = −Σ_{i=1}^{k} s_i u_i v_i⊤, where u_i and v_i are the top k left and right singular vectors of the gradient of the objective function in (5.8) evaluated at the current solution, whose k largest singular values we denote by σ_1 ≥ … ≥ σ_k, and the vector s is obtained from these singular values as per equations (5.3) and (5.4), for p ∈ (1, ∞) and p = ∞, respectively.
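
A corresponding sketch (ours) of the matrix oracle is given below; it computes a full SVD and truncates for clarity, whereas in practice only the top k singular triplets of the gradient are needed (e.g. via a truncated SVD routine):

    import numpy as np

    def fw_matrix_oracle(G, k, p, alpha):
        """argmin <S, G> over the spectral (k, p)-support ball of radius alpha,
        where G is the gradient of the objective at the current iterate (assumed non-zero)."""
        U, sigma, Vt = np.linalg.svd(G, full_matrices=False)
        U, sigma, Vt = U[:, :k], sigma[:k], Vt[:k, :]       # top-k singular triplets of G
        if np.isinf(p):
            s = alpha * np.ones(k)                          # cf. (5.4)
        else:
            q = p / (p - 1.0)
            s = alpha * sigma ** (q - 1.0) / (sigma ** q).sum() ** (1.0 / p)   # cf. (5.3)
        return -(U * s) @ Vt                                # minus sign: we minimize <S, G>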

Note also that the proximity operator of the norm and the Euclidean projection on the associated unit ball both require the full singular value decomposition to be performed. Indeed, the proximity operator of an orthogonally invariant norm ‖·‖ = g(σ(·)) at W is given by prox_{‖·‖}(W) = U diag( prox_g(σ(W)) ) V⊤, where U and V are the matrices formed by the left and right singular vectors of W, see e.g. [Argyriou et al. 2011, Prop. 3.1], and this requires the full decomposition.

dataset          norm      test error      k     p
rank 5, ρ=10%    trace     0.8415 (0.03)   -     -
                 k-supp    0.8343 (0.03)   3.3   -
                 kp-supp   0.8108 (0.05)   5.0
rank 5, ρ=15%    trace     0.6161 (0.03)   -     -
                 k-supp    0.6129 (0.03)   3.3   -
                 kp-supp   0.4262 (0.04)   5.0
rank 5, ρ=20%    trace     0.4453 (0.03)   -     -
                 k-supp    0.4436 (0.02)   3.5   -
                 kp-supp   0.2425 (0.02)   5.0
rank 5, ρ=30%    trace     0.1968 (0.01)   -     -
                 k-supp    0.1838 (0.01)   5.0   -
                 kp-supp   0.0856 (0.01)   5.0
Table 1: Matrix completion on a rank 5 matrix with flat spectrum. The improvement of the (k, p)-support norm over the k-support and trace norms is considerable (statistically significant).

6 Numerical Experiments

In this section we apply the spectral (k, p)-support norm to matrix completion (collaborative filtering), in which we want to recover a low rank, or approximately low rank, matrix from a small sample of its entries, see e.g. [Jaggi and Sulovsky 2010]. One prominent method of solving this problem is trace norm regularization: we look for a matrix which closely fits the observed entries and has a small trace norm (sum of singular values) [Jaggi and Sulovsky 2010, Mazumder et al. 2010, Toh and Yun 2011]. We apply the (k, p)-support norm to this framework and we investigate the impact of varying p. Next we compare the spectral (k, p)-support norm to the trace norm and the spectral k-support norm (p = 2) on both synthetic and real datasets. In each case we solve the optimization problem (5.8) using the Frank-Wolfe method as outlined in Section 5. We determine the values of k and p by validation, averaged over a number of trials. Specifically, we choose the optimal k and p (letting p range over a grid of finite values plus ∞), as well as the regularization parameter α, by validation over a grid. Our code is available from http://www0.cs.ucl.ac.uk/staff/M.Pontil/software.html.


Figure 1: Optimal p vs. decay parameter.

Figure 2: Optimal p fitted to matrix spectra with various decays.

Impact of p. A key motivation for the additional parameter p is that it allows us to tune the norm to the decay of the singular values of the underlying matrix. In particular, the variational formulation (3.1) suggests that as the spectrum of the true low rank matrix flattens out, larger values of p should be preferred.

We ran the method on a set of matrices of rank 12, with the decay of the non zero singular values governed by a decay parameter, for 26 values of that parameter, and we determined the corresponding optimal value of p. Figure 1 illustrates the optimal value of p as a function of the decay parameter. We clearly observe the negative slope: the steeper the decay of the spectrum, the smaller the optimal value of p. Figure 2 shows the spectrum and the optimal p for several decay values.

Note that p is never equal to 1, which is a special case in which the norm is independent of k and is equal to the trace norm. In each case the improvement of the spectral (k, p)-support norm over the k-support and trace norms is statistically significant.

Figure 3 illustrates the impact of the curvature p on the test error on synthetic and real datasets. We observe that the error levels off as p tends to infinity, so for these specific datasets the major gain is to be had for small values of p. The optimal value of p for both the real and synthetic datasets is statistically different from p = 2 (the k-support norm) and from p = 1 (the trace norm).

dataset      norm      test error   k     p
MovieLens    trace     0.2017       -     -
100k         k-supp    0.1990       1.9   -
             kp-supp   0.1921       2.0
Jester 1     trace     0.1752       -     -
             k-supp    0.1739       6.4   -
             kp-supp   0.1744       2.0
             kp-supp   0.1731       2.0   6.5
Jester 3     trace     0.1959       -     -
             k-supp    0.1942       2.1   -
             kp-supp   0.1841       2.0
Table 2: Matrix completion on real datasets. The improvement of the (k, p)-support norm over the k-support and trace norms is statistically significant.

Simulated Data. Next we compared the performance of the (k, p)-support norm to that of the k-support norm and the trace norm for a matrix with flat spectrum. As outlined above, as the spectrum of the true low rank matrix flattens out, larger values of p should be preferred. Each matrix is generated by combining left and right singular vector factors, obtained from a matrix with i.i.d. standard Gaussian entries, with a diagonal matrix of singular values having 5 non zero constant entries, and adding i.i.d. standard Gaussian noise. Table 1 illustrates the performance of the norms on a synthetic dataset of rank 5, with identical singular values, that is, a flat spectrum. In each regime the (k, p)-support norm outperforms the other norms by a substantial margin, with statistical significance. We followed the framework of [McDonald et al. 2014] and use ρ to denote the percentage of the data used as the training set. We further replicated the setting of [McDonald et al. 2014] for synthetic matrix completion, and found that the (k, p)-support norm outperformed the standard k-support norm, as well as the trace norm, at a statistically significant level (see Table 3 in the appendix for details).

We note that although the Frank-Wolfe method for the (k, p)-support norm does not generally converge as quickly as proximal methods (which are available in the case of the k-support norm [McDonald et al. 2016, McDonald et al. 2014, Chatterjee et al. 2014]), the computational cost can be mitigated using a continuation method. Specifically, given an ordered sequence of values of p, we can proceed sequentially, initializing each run with the previously computed solution. Empirically we tried this approach for a series of 30 values of p and found that the total computation time increased only moderately.
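
A minimal sketch (ours) of the continuation idea follows; here `solve` stands for any routine for problem (5.1) with a given p that accepts an initial point, such as the Frank-Wolfe sketch of Section 5.

    def continuation_path(solve, p_grid, w_init):
        """Warm-start a solver over an increasing grid of p values.
        For p' >= p the (k, p')-support ball contains the (k, p)-support ball,
        so each previous solution remains feasible for the next problem."""
        solutions, w = {}, w_init
        for p in sorted(p_grid):
            w = solve(p, w)          # e.g. Frank-Wolfe started from w
            solutions[p] = w
        return solutions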

Real Data. Finally, we applied the norms to real collaborative filtering datasets. We observe a subset of the (user, rating) entries of a matrix and predict the unobserved ratings, with the assumption that the true matrix is likely to have low rank. We report on the MovieLens 100k dataset (http://grouplens.org/datasets/movielens/), which consists of ratings of movies, and the Jester 1 and 3 datasets (http://goldberg.berkeley.edu/jester-data/), which consist of ratings of jokes. We followed the experimental protocol in [McDonald et al. 2014, Toh and Yun 2011], using the normalized mean absolute error [Toh and Yun 2011], and we implemented a final thresholding step as in [McDonald et al. 2014] (see the appendix for further details). The results are outlined in Table 2. The spectral (k, p)-support norm outperformed the trace norm and the spectral k-support norm, and the improvement is statistically significant (the standard deviations, not shown here, are small). In summary, the experiments suggest that the additional flexibility of the parameter p does allow the model to better fit both the sparsity and the decay of the true spectrum.

Figure 3: Test error vs. curvature p. Left axis: synthetic data (blue crosses); right axis: Jester 1 dataset (red circles).

7 Conclusion

We presented a generalization of the k-support norm, the (k, p)-support norm, where the additional parameter p is used to better fit the decay of the components of the underlying model. We determined the dual norm, characterized the unit ball and computed an explicit expression for the norm. As the norm is a symmetric gauge function, we further described the induced spectral (k, p)-support norm. We adapted the Frank-Wolfe method to solve regularization problems with the norm, and in the particular case p = ∞ we provided a fast computation of the projection operator. In numerical experiments we considered synthetic and real matrix completion problems and we showed that varying p leads to significant performance improvements. Future work could include deriving statistical bounds for the performance of the norms, and situating the norms in the framework of other structured sparsity norms which have recently been studied.

Appendix A Appendix

In this appendix, we provide proofs of the results stated in the main body of the paper, and we include experimental details and results that were not included in the paper for space reasons.

A.1 Proof of Proposition 2

For every u ∈ ℝ^d we have

‖u‖_{(k,p),*} = max{ ⟨u, w⟩ : w ∈ co{ card(w) ≤ k, ‖w‖_p ≤ 1 } } = max{ ⟨u, w⟩ : card(w) ≤ k, ‖w‖_p ≤ 1 } = max{ Σ_{i∈I_k} u_i w_i : Σ_{i∈I_k} |w_i|^p ≤ 1 } = ( Σ_{i∈I_k} |u_i|^q )^{1/q},

where the first equality uses the definition of the unit ball (3.2), and the second equality is true because the maximum of a linear functional on a compact set is attained at an extreme point of the set. The third equality follows by using the cardinality constraint, that is, we set w_i = 0 for i ∉ I_k. Finally, the last equality follows by Hölder's inequality in ℝ^k [Marshall and Olkin 1979, Ch. 16 Sec. D.1].

The second claim is a direct consequence of the cardinality constraint and Hölder's inequality in ℝ^k. ∎

To prove Proposition 3 we require the following auxiliary result. Let V be a finite dimensional vector space. Recall that a subset C of V is called balanced if λx ∈ C whenever x ∈ C and |λ| ≤ 1. Furthermore, C is called absorbing if for any x ∈ V we have x ∈ λC for some λ > 0.

Lemma 7.

Let C ⊆ V be a bounded, convex, balanced, and absorbing set. The Minkowski functional μ_C of C, defined, for every x ∈ V, as

μ_C(x) = inf{ λ > 0 : x ∈ λC },

is a norm on V.

Proof.

We give a direct proof that μ_C satisfies the properties of a norm. See also e.g. [Rudin 1991, §1.35] for further details. Clearly μ_C(x) ≥ 0 for all x ∈ V, and μ_C(0) = 0. Moreover, as C is bounded, μ_C(x) > 0 whenever x ≠ 0.

Next we show that μ_C is one-homogeneous. For every x ∈ V and α ≠ 0, note that

μ_C(αx) = inf{ λ > 0 : αx ∈ λC } = inf{ λ > 0 : x ∈ (λ/|α|)C } = |α| inf{ ρ > 0 : x ∈ ρC } = |α| μ_C(x),

where we have made a change of variable ρ = λ/|α| and used the fact that C is balanced.

Finally, we prove the triangle inequality. For every x, y ∈ V, if x ∈ λC and y ∈ ρC then, setting γ = λ + ρ, we have

(x + y)/γ = (λ/γ)(x/λ) + (ρ/γ)(y/ρ),

and since C is convex, then (x + y)/γ ∈ C, that is, x + y ∈ γC. We conclude that μ_C(x + y) ≤ μ_C(x) + μ_C(y). The proof is completed. ∎

Note that for such a set C, the unit ball of the induced norm is C. Furthermore, if ‖·‖ is a norm, then its unit ball satisfies the hypotheses of Lemma 7.

A.2 Proof of Proposition 3

Define the set

T_{k,p} = { W ∈ ℝ^{m×n} : rank(W) ≤ k, ‖σ(W)‖_p ≤ 1 }

and its convex hull A_{k,p} = co(T_{k,p}), and consider the Minkowski functional

μ(W) = inf{ λ > 0 : W ∈ λ A_{k,p} }.   (A.1)

We show that A_{k,p} is absorbing, bounded, convex and balanced, and it follows by Lemma 7 that (A.1) defines a norm on ℝ^{m×n} with unit ball equal to A_{k,p}. The set is clearly bounded, convex and balanced. To see that it is absorbing, let W ∈ ℝ^{m×n} have singular value decomposition U Σ V⊤, and let r = rank(W). If W is zero then clearly W ∈ A_{k,p}, so assume it is non zero.

For i ∈ [r], let Σ_i ∈ ℝ^{m×n} have (i, i) entry equal to σ_i(W), and all remaining entries zero. We then have

W = Σ_{i=1}^{r} U Σ_i V⊤.

Now for each i, rank(U Σ_i V⊤) = 1 ≤ k, and ‖σ(U Σ_i V⊤)‖_p = σ_i(W), so U Σ_i V⊤ / σ_i(W) ∈ T_{k,p} for any i. Furthermore, W/‖σ(W)‖_1 = Σ_{i=1}^{r} (σ_i(W)/‖σ(W)‖_1) (U Σ_i V⊤/σ_i(W)) and Σ_{i=1}^{r} σ_i(W)/‖σ(W)‖_1 = 1, that is, the vector of coefficients lies in the unit simplex in ℝ^r, so W/‖σ(W)‖_1 is a convex combination of elements of T_{k,p}, in other words W ∈ ‖σ(W)‖_1 A_{k,p}, and we have shown that A_{k,p} is absorbing. It follows that A_{k,p} satisfies the hypotheses of Lemma 7, and (A.1) defines a norm on ℝ^{m×n} with unit ball equal to A_{k,p}.

Since the constraints in T_{k,p} involve spectral functions, the sets T_{k,p} and A_{k,p} are invariant to left and right multiplication by orthogonal matrices. It follows that μ is a spectral function, that is, μ(W) is defined in terms of the singular values of W. By von Neumann's Theorem [Von Neumann 1937] the norm it defines is orthogonally invariant and we have

μ(W) = inf{ λ > 0 : σ(W) ∈ λ co{ w : card(w) ≤ k, ‖w‖_p ≤ 1 } } = ‖σ(W)‖_{(k,p)},

where we have used Equation (3.2), which states that C_{k,p} is the unit ball of the (k, p)-support norm. It follows that the norm defined by (A.1) is the spectral (k, p)-support norm, with unit ball given by A_{k,p}.

A.3 Proof of Proposition 6

We solve the optimization problem

min{ (1/2)‖x − z‖_2^2 : ‖z‖_∞ ≤ 1, ‖z‖_1 ≤ k }.   (A.2)

We consider two cases. If Σ_{i=1}^{d} min(1, |x_i|) ≤ k, then the problem decouples and we solve it componentwise. Specifically, we minimize (1/2)(x_i − z_i)^2 subject to |z_i| ≤ 1, and the solution is immediately given by z_i = sign(x_i) min(1, |x_i|).