Convex Sparse Matrix Factorizations

Convex Sparse Matrix Factorizations

Francis Bach, Julien Mairal, Jean Ponce
Willow Project-team
Laboratoire d’Informatique de l’Ecole Normale Supérieure
(CNRS/ENS/INRIA UMR 8548)
45, rue dÕUlm, 75230 Paris, France
Abstract

We present a convex formulation of dictionary learning for sparse signal decomposition. Convexity is obtained by replacing the usual explicit upper bound on the dictionary size by a convex rank-reducing term similar to the trace norm. In particular, our formulation introduces an explicit trade-off between size and sparsity of the decomposition of rectangular matrices. Using a large set of synthetic examples, we compare the estimation abilities of the convex and non-convex approaches, showing that while the convex formulation has a single local minimum, this may lead in some cases to performance which is inferior to the local minima of the non-convex formulation.

1 Introduction

Sparse decompositions have become prominent tools in signal processing [1], image processing [2], machine learning, and statistics [3]. Many relaxations and approximations of the associated minimum cardinality problems are now available, based on greedy approaches [4] or convex relaxations through the -norm [1, 3]. Active areas of research are the design of efficient algorithms to solve the optimization problems associated with the convex non differentiable norms (see, e.g., [5]), the theoretical study of the sparsifying effect of these norms [6, 7], and the learning of the dictionary directly from data (see, e.g., [8, 2]).

In this paper, we focus on the third problem—namely, we assume that we are given a matrix and we look for factorizations of the form , where and , that are close to and such that the matrix is sparse. This corresponds to decomposing vectors in (the rows of ) over a dictionary of size . The columns of are the dictionary elements (of dimension ), while the rows of are the decomposition coefficients of each data point. Learning sparse dictionaries from data has shown great promise in signal processing tasks, such as image or speech processing [2], and core machine learning tasks such as clustering may be seen as special cases of this framework [9].

Various approaches have been designed for sparse dictionary learning. Most of them consider a specific loss between entries of and , and directly optimize over and , with additional constraints on and  [10, 11]: dictionary elements, i.e., columns of , may or may not be constrained to unit norms, while a penalization on the rows of is added to impose sparsity. Various forms of jointly non-convex alternating optimization frameworks may then be used [10, 11, 2]. The main goal of this paper is to study the possibility and efficiency of convexifying these non-convex approaches. As with all convexifications, this leads to the absence of non-global local minima, and should allow simpler analysis. However, does it really work in the dictionary learning context? That is, does convexity lead to better decompositions?

While in the context of sparse decomposition with fixed dictionaries, convexification has led to both theoretical and practical improvements [6, 3, 7], we report both positive and negative results in the context of dictionary learning. That is, convexification sometimes helps and sometimes does not. In particular, in high-sparsity and low-dictionary-size situations, the non-convex fomulation outperforms the convex one, while in other situations, the convex formulation does perform better (see Section 5 for more details).

The paper is organized as follows: we show in Section 2 that if the size of the dictionary is not bounded, then dictionary learning may be naturally cast as a convex optimization problem; moreover, in Section 3, we show that in many cases of interest, this problem may be solved in closed form, shedding some light on what is exactly achieved and not achieved by these formulations. Finally, in Section 4, we propose a mixed - formulation that leads to both low-rank and sparse solutions in a joint convex framework. In Section 5, we present simulations on a large set of synthetic examples.

Notations    Given a rectangular matrix and , we denote by or its element indexed by the pair , by its -th column and by its -th row. Moreover, given a vector , we denote by its -norm, i.e., for , and . We also write a matrix as , where each .

2 Decomposition norms

We consider a loss which is convex with respect to the second variable. We assume in this paper that all entries of are observed and the risk of the estimate is equal to. Note that our framework extends in a straightforward way to matrix completion settings by summing only over observed entries [12].

We consider factorizations of the form ; in order to constrain and , we consider the following optimization problem:

(1)

where and are any norms on and (on the column space and row space of the original matrix ). This corresponds to penalizing each column of and . In this paper, instead of considering and separately, we consider the matrix and the set of its decompositions on the form , and in particular, the one with minimum sum of norms , , . That is, for , we consider

(2)

If is strictly smaller than the rank of , then we let . Note that the minimum is always attained if is larger than or equal to the rank of . Given , each pair is defined up to a scaling factor, i.e., may be replaced by ; optimizing with respect to leads to the following equivalent formulation:

(3)

Moreover, we may derive another equivalent formulation by constraining the norms of the columns of to one, i.e.,

(4)

This implies that constraining dictionary elements to be of unit norm, which is a common assumption in this context [11, 2], is equivalent to penalizing the norms of the decomposition coefficients instead of the squared norms.

Our optimization problem in Eq. (1) may now be equivalently written as

(5)

with any of the three formulations of in Eqs. (2)-(4). The next proposition shows that if the size of the dictionary is allowed to grow, then we obtain a norm on rectangular matrices, which we refer to as a decomposition norm. In particular, this shows that if is large enough the problem in Eq. (5) is a convex optimization problem.

Proposition 1

For all , the limit exists and is a norm on rectangular matrices.

Proof  Since given , is nonnegative and clearly nonincreasing with , it has a nonnegative limit when tends to infinity. The only non trivial part is the triangular inequality, i.e., . Let and let and be the two -optimal decompositions, i.e., such that and . Without loss of generality, we may asssume that . We consider , , we have and . We obtain the triangular inequality by letting tend to zero.  
Following the last proposition, we now let tend to infinity; that is, if we denote , we consider the following rank-unconstrained and convex problem:

(6)

However, there are three potentially major caveats that should be kept in mind:

Convexity and polynomial time    Even though the norm leads to a convex function, computing or approximating it may take exponential time—in general, it is not because a problem is convex that it can be solved in polynomial time. In some cases, however, it may be computed in closed form, as presented in Section 3, while in other cases, an efficiently computable convex lower-bound is available (see Section 4).

Rank and dictionary size    The dictionary size must be allowed to grow to obtain convexity and there is no reason, in general, to have a finite such that . In some cases presented in Section 3, the optimal is finite, but we conjecture that in general the required may be unbounded. Moreover, in non sparse situations, the rank of and the dictionary size are usually equal, i.e., the matrices and have full rank. However, in sparse decompositions, may be larger than the rank of , and sometimes even larger than the underlying data dimension (the corresponding dictionaries are said the be overcomplete).

Local minima    The minimization problem in Eq. (1), with respect to and , even with very large, may still have multiple local minima, as opposed to the one in , i.e., in Eq. (6), which has a single local minimum. The main reason is that the optimization problem defining from , i.e., Eq. (3), may itself have multiple local minima. In particular, it is to be constrasted to the optimization problem

(7)

which will turn out to have no local minima if is large enough (see Section 4.3 for more details).

Before looking at special cases, we compute the dual norm of (see, e.g., [13] for the definition and properties of dual norms), which will be used later.

Proposition 2 (Dual norm)

The dual norm , defined as

is equal to .

Proof  We have, by convex duality (see, e.g., [13]),

Let . If ,

while if , then

The result follows.  

3 Closed-form decomposition norms

We now consider important special cases, where the decomposition norms can be expressed in closed form. For these norms, with the square loss, the convex optimization problems may also be solved in closed form. Essentially, in this section, we show that in simple situations involving sparsity (in particular when one of the two norms or is the -norm), letting the dictionary size go to infinity often leads to trivial dictionary solutions, namely a copy of some of the rows of . This shows the importance of constraining not only the -norms, but also the -norms, of the sparse vectors , , and leads to the joint low-rank/high-sparsity solution presented in Section 4.

3.1 Trace norm: and

When we constrain both the -norms of and of , it is well-known, that is the sum of the singular values of , also known as the trace norm [12]. In this case we only need dictionary elements, but this number will turn out in general to be a lot smaller—see in particular[14] for rank consistency results related to the trace norm. Moreover, with the square loss, the solution of the optimization problem in Eq. (5) is , where is the singular value decomposition of . Thresholding of singular values, as well as its interpretation as trace norm minimization is well-known and well-studied. However, sparse decompositions (as opposed to simply low-rank decompositions) have shown to lead to better decompositions in many domains such as image processing (see, e.g., [8]).

3.2 Sum of norms of rows:

When we use the -norm for , whatever the norm on , we have:

which implies immediately that

That is, the decomposition norm is simply the sum of the norms of the rows. Moreover, an optimal decomposition is , where is a vector with all null components except at , where it is equal to one. In this case, each row of is a dictionary element and the decomposition is indeed extremely sparse (only one non zero coefficient).

In particular, when , we obtain the sum of the -norms of the rows, which leads to a closed form solution to Eq. (6) as for all . Also, when , we obtain the sum of the -norms of the rows, i.e, the -norm of all entries of the matrix, which leads to decoupled equations for each entry and closed form solution .

These examples show that with the -norm on the decomposition coefficients, these simple decomposition norms do not lead to solutions with small dictionary sizes. This suggests to consider a larger set of norms which leads to low-rank/small-dictionary and sparse solutions. However, those two extreme cases still have a utility as they lead to good search ranges for the regularization parameter for the mized norms presented in the next section.

4 Sparse decomposition norms

We now assume that we have , i.e, we use the -norm on the dictionary elements. In this situation, when , as shown in Section 3.2, the solution corresponds to a very sparse but large (i.e., large dictionary size ) matrix ; on the contrary, when , as shown in Section 3.1, we get a small but non sparse matrix . It is thus natural to combine the two norms on the decomposition coefficients. The main result of this section is that the way we combine them is mostly irrelevant and we can choose the combination which is the easiest to optimize.

Proposition 3

If the loss is differentiable, then for any function , such that is a norm, and which is increasing with respect to both variables, the solution of Eq. (6) for is the solution of Eq. (6) for , for a certain and a potentially different regularization parameter .

Proof  If we denote and its Fenchel conjugate [13], then the dual problem of Eq. (6) is the problem of maximizing such that . Since the loss is differentiable, the primal solution is entirely characterized by the dual solution . The optimality condition for the dual problem is exactly that the gradient of is equal to , where is one of the maximizers in the definition of the dual norm, i.e., in . In this case, we have in closed form, and is the maximizer of . With our assumptions on , these maximizers are the same as the ones subject to and for certain , . The optimality condition is thus independent of . We then select the function which is practical as it leads to simple lower bounds (see below).  

We thus now consider the norm defined as . We denote by the convex function defined on symmetric matrices as , for which we have .

In the definition of in Eq. (2), we can optimize with respect to in closed form, i.e.,

is attained at (the value is infinite if the span of the columns of is not included in the span of the columns of ). Thus the norm is equal to

(8)

Though is a convex function of , we currently don’t have a polynomial time algorithm to compute it, but, since is convex and homogeneous, . This leads to the following lower-bounding convex optimization problem in the positive semi-definite matrix :

(9)

This problem can now be solved in polynomial time [13]. This computable lower bound in Eq. (9) may serve two purposes: (a) it provides a good initialization to gradient descent or path following rounding techniques presented in Section 4.1; (b) the convex lower bound provides sufficient conditions for approximate global optimality of the non convex problems [13].

4.1 Recovering the dictionary and/or the decomposition

Given a solution or approximate solution to our problem, one may want to recover dictionary elements and/or the decomposition for further analysis. Note that (a) having one of them automatically gives the other one and (b) in some situations, e.g., denoising of through estimating , the matrices and are not explicitly needed.

We propose to iteratively minimize with respect to (by gradient descent) the following function, which is a convex combination of the true function in Eq. (8) and its upper bound in Eq. (9):

When this is exactly our convex lower bound applied defined in Eq. (9), for which there are no local minima in , although it is not a convex function of (see Section 4.3 for more details), while at , we get a non-convex function of , with potentially multiple local minima. This path following strategy has shown to lead to good local minima in other settings [15].

Moreover, this procedure may be seen as the classical rounding operation that follows a convex relaxation—the only difference here is that we relax a hard convex problem into a simple convex problem. Finally, the same technique can be applied when minimizing the regularized estimation problem in Eq. (6), and, as shown in Section 5, rounding leads to better performance.

4.2 Optimization with square loss

In our simulations, we will focus on the square loss as it leads to simpler optimization, but our decomposition norm framework could be applied to other losses. With the square loss, we can optimize directly with respect to (in the same way theat we could earlier for computing the norm itself); we temporarily assume that is known; we have:

with a minimum attained at . The minimum is a convex function of and we now have a convex optimization problem over positive semi-definite matrices, which is equivalent to Eq. (6):

(10)

It can be lower bounded by the following still convex, but now solvable in polynomial time, problem:

(11)

This fully convex approach will be solved within a globally optimal low-rank optimization framework (presented in the next section). Then, rounding operations similar to Section 4.1 may be used to improve the solution—note that this rounding technique takes into account and it thus preferable to the direct application of Section 4.1.

4.3 Low rank optimization over positive definite matrices

We first smooth the problem by using as an approximation of , and as an approximation of .

Following [16], since we expect low-rank solutions, we can optimize over low-rank matrices. Indeed, [16] shows that if is a convex function over positive semidefinite symmetric matrices of size , with a rank deficient global minimizer (i.e., of rank ), then the function defined over matrices has no local minima as soon as . The following novel proposition goes a step further for twice differentiable functions by showing that there is no need to know in advance:

Proposition 4

Let be a twice differentiable convex function over positive semidefinite symmetric matrices of size , with compact level sets. If the function defined over matrices has a local minimum at a rank-deficient matrix , then is a global minimum of .

Proof  Let . The gradient of is equal to and the Hessian of is such that . Since we have a local mimimum, which implies that . Moreover, by invariance by post-multiplying by an orthogonal matrix, without loss of generality, we may consider that the last column of is zero. We now consider all directions with first columns equal to zero and last column being equal to a given . The second order Taylor expansion of is

Since we have a local minima, we must have . Since is arbitrary, this implies that . Together with the convexity of and , this implies that we have a global minimum of  [13].  
The last proposition suggests to try a small , and to check that a local minimum that we can obtain with descent algorithms is indeed rank-deficient. If it is, we have a solution; if not, we simply increase and start again until turns out to be greater than .

Note that minimizing our convex lower bound in Eq. (7) by any descent algorithm in is different than solving directly Eq. (1): in the first situation, there are no (non-global) local minima, whereas there may be some in the second situation. In practice, we use a quasi-Newton algorithm which has complexity to reach a stationary point, but requires to compute the Hessian of size to check and potentially escape local minima.

4.4 Links with sparse principal component analysis

If we now consider that we want sparse dictionary elements instead of sparse decompositions, we exactly obtain the problem of sparse PCA [17, 18], where one wishes to decompose a data matrix into where the dictionary elements are sparse, and thus easier to interpret. Note that in our situation, we have seen that with , the problem in Eq. (1) is equivalent to Eq. (10) and indeed only depends on the covariance matrix .

This approach to sparse PCA is similar to the non convex formulations of [18] and is to be contrasted with the convex formulation of [17] as we aim at directly obtaining a full decomposition of with an implicit trade-off between dictionary size (here the number of principal components) and sparsity of such components. Most works consider one unique component, even though the underlying data have many more underlying dimensions, and deal with multiple components by iteratively solving a reduced problem. In the non-sparse case, the two approaches are equivalent, but they are not here. By varying and , we obtain a set of solutions with varying ranks and sparsities. We are currently comparing the approach of [18], which constrains the rank of the decomposition to ours, where the rank is penalized implicitly.

NoConv Conv-R Conv NoConv Conv-R Conv
1 10 10 2 -16.4 5.7 -9.0 1.9 -6.5 2.3 -19.8 2.3 -10.2 1.6 -7.1 2.0
2 20 10 2 -40.8 4.2 -11.6 2.6 -5.6 3.2 -45.5 2.0 -16.4 1.4 -7.0 1.3
3 10 20 2 -8.6 3.6 -9.0 1.8 -8.4 1.9 -15.0 2.7 -11.5 1.5 -10.5 1.5
4 20 20 2 -24.9 3.3 -13.0 0.7 -10.4 1.1 -40.9 2.2 -18.9 0.8 -14.8 0.7
5 10 40 2 -6.6 2.8 -8.9 1.5 -9.0 1.4 -7.6 2.6 -10.1 1.6 -9.9 1.6
6 20 40 2 -13.2 2.6 -12.3 1.4 -11.5 1.3 -25.4 3.0 -16.7 1.3 -15.6 1.4
7 10 10 4 1.7 3.9 -1.5 0.5 -0.2 0.2 -1.9 2.5 -1.7 0.6 -0.1 0.1
8 20 10 4 -16.7 5.9 -1.4 0.8 -0.0 0.0 -27.1 1.8 -3.0 0.7 0.0 0.0
9 10 20 4 2.2 2.4 -2.5 0.9 -1.7 0.8 2.0 2.9 -2.5 0.8 -1.2 1.0
10 20 20 4 -1.2 2.5 -3.1 1.1 -0.9 0.9 -12.1 3.0 -5.5 1.0 -1.6 1.0
11 10 40 4 3.5 3.0 -3.3 1.3 -3.3 1.5 2.6 0.9 -3.3 0.5 -3.3 0.5
12 20 40 4 3.7 2.3 -3.9 0.6 -3.6 0.8 -1.7 1.7 -6.3 0.9 -5.3 0.8
13 10 10 8 9.6 3.4 -0.1 0.1 0.0 0.0 7.2 3.0 -0.1 0.1 0.0 0.0
14 20 10 8 -1.6 3.7 0.0 0.0 0.0 0.0 -4.8 2.3 0.0 0.0 0.0 0.0
15 10 20 8 9.6 2.4 -0.4 0.4 -0.2 0.3 9.4 1.5 -0.4 0.4 -0.2 0.2
16 20 20 8 11.3 1.8 -0.2 0.2 -0.0 0.0 7.0 2.5 -0.4 0.3 -0.0 0.0
17 10 40 8 8.8 3.0 -0.8 0.7 -0.7 0.7 7.2 1.3 -0.7 0.4 -0.5 0.5
18 20 40 8 10.9 1.1 -0.9 0.6 -0.6 0.5 9.4 1.0 -1.0 0.4 -0.4 0.4
Table 1: Percentage of improvement in mean squared error, with respect to spectral denoising, for various parameters of the data generating process. See text for details.

5 Simulations

We have performed extensive simulations on synthetic examples to compare the various formulations. Because of identifiability problems which are the subject of ongoing work, it is not appropriate to compare decomposition coefficients and/or dictionary elements; we rather consider a denoising experiment. Namely, we have generated matrices as follows: select unit norm dictionary elements in uniformly and independently at random, for each , select indices in uniformly at random and form the -th row of with zeroes except for random normally distributed elements at the selected indices. Construct , where has independent standard normally distributed elements and (held fixed at ). The goal is to estimate from , and we compare the three following formulations on this task: (a) the convex minimization of Eq. (11) through techniques presented in Section 4.3 with varying and , denoted as Conv, (b) the rounding of the previous solution using techniques described in Section 4.1, denoted as Conv-R, and (c) the low-rank constrained problem in Eq. (1) with and with varying and , denoted as NoConv, and which is the standard method in sparse dictionary learning [8, 2, 11].

For the three methods and for each replication, we select the two regularization parameters that lead to the minimum value , and compute the relative improvement on using the singular value decomposition (SVD) of . If the value is negative, denoising is better than with the SVD (the more negative, the better). In Table 1, we present averages over 10 replications for various values of , , , and .

First, in these simulations where the decomposition coefficients are known to be sparse, penalizing by -norms indeed improves performance on spectral denoising for all methods. Second, as expected, the rounded formulation (Conv-R) does perform better than the non-rounded one (Conv), i.e., our rounding procedure allows to find “good” local minima of the non-convex problem in Eq. (1).

Moreover, in high-sparsity situations (, lines 1 to 6 of Table 1), we see that the rank-constrained formulation NoConv outperforms the convex formulations, sometimes by a wide margin (e.g., lines 1 and 2). This is not the case when the ratio becomes larger than 2 (lines 3 and 5). In the medium-sparsity situation (, lines 7 to 12), we observe the same phenomenon, but the non-convex approach is better only when the ratio is smaller than or equal to one. Finally, in low-sparsity situations (, lines 13 to 18), imposing sparsity does not improve performance much and the local minima of the non-convex approach NoConv really hurt performance. Thus, from Table 1, we can see that with high sparsity (small ) and small relative dictionary size of the original non noisy data (i.e., low ratio ), the non convex approach performs better. We are currently investigating theoretical arguments to support these empirical findings.

6 Conclusion

In this paper, we have investigated the possibility of convexifying the sparse dictionary learning problem. We have reached both positive and negative conclusions: indeed, it is possible to convexify the problem by letting the dictionary size explicitly grow with proper regularization to ensure low rank solutions; however, it only leads to better predictive performance for problems which are not too sparse and with large enough dictionaries. In the high-sparsity/small-dictionary cases, the non convex problem is empirically simple enough to solve so that our convexification leads to no gain.

We are currently investigating more refined convexifications and extensions to nonnegative variants [9], applications of our new decomposition norms to clustering [9], the possibility of obtaining consistency theorems similar to [14] for the convex formulation, and the application to the image denoising problem [2].

References

  • [1] Scott Shaobing Chen, David L. Donoho, and Michael A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33–61, 1999.
  • [2] M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Trans. Image Proc., 15(12):3736–3745, 2006.
  • [3] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of The Royal Statistical Society Series B, 58(1):267–288, 1996.
  • [4] S. Mallat and Z. Zhang. Matching pursuit with time-frequency dictionaries. IEEE Trans Sig. Proc., 41:3397–3415, 1993.
  • [5] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Ann. Stat., 32:407, 2004.
  • [6] E. J. Candès and T. Tao. Decoding by linear programming. IEEE Trans. Information Theory, 51(12):4203–4215, 2005.
  • [7] P. Zhao and B. Yu. On model selection consistency of Lasso. J. Mach. Learn. Res., 7:2541–2563, 2006.
  • [8] B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37:3311–3325, 1997.
  • [9] C. Ding, T. Li, and M. I. Jordan. Convex and semi-nonnegative matrix factorizations. Technical Report 60428, Lawrence Berkeley Nat. Lab., 2006.
  • [10] M. Aharon, M. Elad, and A. Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans. Sig. Proc., 54(11):4311–4322, 2006.
  • [11] H. Lee, Al. Battle, R. Raina, and A. Y. Ng. Efficient sparse coding algorithms. In NIPS, 2007.