We address the problem of minimizing a convex function over the space of large matrices with low rank. While this optimization problem is hard in general, we propose an efficient greedy algorithm and derive its formal approximation guarantees. Each iteration of the algorithm involves (approximately) finding the left and right singular vectors corresponding to the largest singular value of a certain matrix, which can be calculated in linear time. This leads to an algorithm which can scale to large matrices arising in several applications such as matrix completion for collaborative filtering and robust low rank matrix approximation.
Large-Scale Convex Minimization with a Low-Rank Constraint
Shai Shalev-Shwartz email@example.com
Alon Gonen firstname.lastname@example.org
School of Computer Science and Engineering, The Hebrew University of Jerusalem, ISRAEL
Ohad Shamir email@example.com
Microsoft Research New-England, USA
Our goal is to approximately solve an optimization problem of the form:
$$\min_{A \,:\, \mathrm{rank}(A) \le r} R(A), \qquad (1)$$
where $R : \mathbb{R}^{m \times n} \to \mathbb{R}$ is a convex and smooth function. This problem arises in many machine learning applications such as collaborative filtering (Koren et al., 2009), robust low rank matrix approximation (Ke & Kanade, 2005; Croux & Filzmoser, 1998; A. Baccini & Falguerolles, 1996), and multiclass classification (Amit et al., 2007). The rank constraint on $A$ is non-convex and therefore it is generally NP-hard to solve Equation (1) (this follows from (Natarajan, 1995; Davis et al., 1997)).
In this paper we describe and analyze an approximation algorithm for solving Equation (1). Roughly speaking, the proposed algorithm is based on a simple, yet powerful, observation: instead of representing a matrix $A$ using $m \times n$ numbers, we represent it using an infinite dimensional vector $\lambda$, indexed by all pairs $(u, v)$ taken from the unit spheres of $\mathbb{R}^m$ and $\mathbb{R}^n$ respectively. In this representation, low rank corresponds to sparsity of the vector $\lambda$.
Thus, we can reduce the problem given in Equation (1) to the problem of minimizing a vector function $f(\lambda)$ over the set of sparse vectors, $\|\lambda\|_0 \le r$. Based on this reduction, we apply a greedy approximation algorithm for minimizing a convex vector function subject to a sparsity constraint. At first glance, a direct application of this reduction seems impossible, since $\lambda$ is an infinite-dimensional vector, and at each iteration of the greedy algorithm one needs to search over the infinite set of the coordinates of $\lambda$. However, we show that this search problem can be cast as the problem of finding the first leading right and left singular vectors of a certain matrix.
After describing and analyzing the general algorithm, we show how to apply it to the problems of matrix completion and robust low-rank matrix approximation. As a side benefit, our general analysis yields a new sample complexity bound for matrix completion. We demonstrate the efficacy of our algorithm by conducting experiments on large-scale movie recommendation data sets.
As mentioned earlier, the problem defined in Equation (1) has many applications, and therefore it was studied in various contexts. A popular approach is to use the trace norm as a surrogate for the rank (e.g. (Fazel et al., 2002)). This approach is closely related to the idea of using the $\ell_1$ norm as a surrogate for sparsity, because low rank corresponds to sparsity of the vector of singular values and the trace norm is the $\ell_1$ norm of the vector of singular values. This approach has been extensively studied, mainly in the context of collaborative filtering. See for example (Cai et al., 2008; Candes & Plan, 2010; Candès & Recht, 2009; Keshavan et al., 2010; Keshavan & Oh, 2009).
While the trace norm encourages low rank solutions, it does not always produce sparse solutions. Generalizing recent studies in compressed sensing, several papers (e.g. (Recht et al., 2007; Cai et al., 2008; Candes & Plan, 2010; Candès & Recht, 2009; Recht, to appear)) give recovery guarantees for the trace norm approach. However, these guarantees rely on rather strong assumptions (e.g., it is assumed that the data is indeed generated by a low rank matrix, that certain incoherence assumptions hold, and for matrix completion problems, it requires the entries to be sampled uniformly at random). In addition, trace norm minimization often involves semi-definite programming, which usually does not scale well to large-scale problems.
In this paper we tackle the rank minimization directly, using a greedy selection approach, without relying on the trace norm as a convex surrogate. Our approach is similar to forward greedy selection approaches for optimization with a sparsity constraint (e.g. the MP (Mallat & Zhang, 1993) and OMP (Pati et al., 2002) algorithms), and in particular we extend the fully corrective forward greedy selection algorithm given in (Shalev-Shwartz et al., 2010). We also provide formal guarantees on the competitiveness of our algorithm relative to matrices with small trace norm.
Recently, (Lee & Bresler, 2010) proposed the ADMiRA algorithm, which also follows the greedy approach. However, the ADMiRA algorithm is different, as in each step it first chooses $2r$ components and then uses SVD to revert back to a rank $r$ matrix. This is more expensive than our algorithm, which chooses a single rank 1 matrix at each step. The difference between the two algorithms is somewhat similar to the difference between the OMP (Pati et al., 2002) algorithm for learning sparse vectors and the CoSaMP (Needell & Tropp, 2009) and SP (Dai & Milenkovic, 2008) algorithms. In addition, the ADMiRA algorithm is specific to the squared loss while our algorithm can handle any smooth loss. Finally, while ADMiRA comes with elegant performance guarantees, these rely on strong assumptions, e.g. that the matrix defining the quadratic loss satisfies a rank-restricted isometry property. In contrast, our analysis only assumes smoothness of the loss function.
The algorithm we propose is also related to Hazan’s algorithm (Hazan, 2008) for solving PSD problems, which in turn relies on the Frank-Wolfe algorithm (Frank & Wolfe, 1956) (see (Clarkson, 2008)), as well as to the follow-up paper of (Jaggi & Sulovskỳ, 2010), which applies Hazan’s algorithm for optimizing with trace-norm constraints. There are several important changes though. First, we tackle the problem directly and enforce neither PSDness of the matrix nor a bounded trace-norm. Second, our algorithm is "fully corrective", that is, it extracts all the information from existing components before adding a new component. These differences between the approaches are analogous to the difference between the Frank-Wolfe algorithm and fully corrective greedy selection, for minimizing over sparse vectors, as discussed in (Shalev-Shwartz et al., 2010). Finally, while each iteration of both methods involves approximately finding leading eigenvectors, in (Hazan, 2008) the quality of approximation should improve as the algorithm progresses, while our algorithm can always rely on the same constant approximation factor.
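In this section we describe our algorithm, which we call Greedy Efficient Component Optimization (or GECO for short). Let $A \in \mathbb{R}^{m \times n}$ be a matrix, and without loss of generality assume that $m \le n$. The SVD theorem states that $A$ can be written as $A = \sum_{i=1}^{m} \lambda_i u_i v_i^\top$, where $u_1, \ldots, u_m$ are members of $\mathcal{U} = \{u \in \mathbb{R}^m : \|u\| = 1\}$, $v_1, \ldots, v_m$ come from $\mathcal{V} = \{v \in \mathbb{R}^n : \|v\| = 1\}$, and $\lambda_1, \ldots, \lambda_m$ are scalars. To simplify the presentation, we assume that each real number is represented using a finite number of bits, therefore the sets $\mathcal{U}$ and $\mathcal{V}$ are finite sets. [Footnote 1: This assumption greatly simplifies the presentation but is not very limiting since we do not impose any restriction on the number of bits needed to represent a single real number. We note that the assumption is not necessary and can be waived by writing $A = \int u v^\top \, d\lambda(u, v)$, where $\lambda$ is a measure on $\mathcal{U} \times \mathcal{V}$, and from the SVD theorem, there is always a representation with $\lambda$ which is non-zero on finitely many points.] It follows that we can also write $A$ as $A = \sum_{(u,v) \in \mathcal{U} \times \mathcal{V}} \lambda_{u,v}\, u v^\top$, where $\lambda \in \mathbb{R}^{|\mathcal{U} \times \mathcal{V}|}$ and we index the elements of $\lambda$ using pairs $(u, v) \in \mathcal{U} \times \mathcal{V}$. Note that the representation of $A$ using a vector $\lambda$ is not unique, but from the SVD theorem, there is always a representation of $A$ for which the number of non-zero elements of $\lambda$ is at most $m$, i.e. $\|\lambda\|_0 \le m$ where $\|\lambda\|_0 = |\{(u,v) : \lambda_{u,v} \neq 0\}|$. Furthermore, if $\mathrm{rank}(A) \le r$ then there is a representation of $A$ using a vector $\lambda$ for which $\|\lambda\|_0 \le r$.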
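As an illustration of this representation, the following minimal Python sketch builds a matrix from a sparse map of coefficients over pairs of unit vectors and checks that the map from coefficients to matrix is linear. The names `matrix_from_lambda`, `lam`, and the example vectors are hypothetical and chosen only for this example; plain lists are used instead of a linear algebra library.

```python
from typing import Dict, List, Tuple

Vector = Tuple[float, ...]
Matrix = List[List[float]]


def matrix_from_lambda(lam: Dict[Tuple[Vector, Vector], float],
                       m: int, n: int) -> Matrix:
    """Return the sum over stored pairs (u, v) of lam[(u, v)] * u v^T."""
    a = [[0.0] * n for _ in range(m)]
    for (u, v), weight in lam.items():
        for i in range(m):
            for j in range(n):
                a[i][j] += weight * u[i] * v[j]
    return a


# Two unit-norm pairs: the coefficient vector has two non-zero coordinates,
# so the resulting matrix has rank at most 2.
u1, v1 = (1.0, 0.0), (1.0, 0.0, 0.0)
u2, v2 = (0.0, 1.0), (0.0, 0.6, 0.8)
lam = {(u1, v1): 3.0, (u2, v2): -2.0}
A = matrix_from_lambda(lam, m=2, n=3)

# Linearity check: doubling every coefficient doubles the matrix.
lam_doubled = {pair: 2.0 * weight for pair, weight in lam.items()}
A_doubled = matrix_from_lambda(lam_doubled, m=2, n=3)
```

The number of stored pairs plays the role of the sparsity of the coefficient vector, and it upper-bounds the rank of the matrix that is produced.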
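Given a (sparse) vector $\lambda \in \mathbb{R}^{|\mathcal{U} \times \mathcal{V}|}$ we define the corresponding matrix to be
$$A(\lambda) = \sum_{(u,v) \in \mathcal{U} \times \mathcal{V}} \lambda_{u,v}\, u v^\top .$$
Note that $A(\lambda)$ is a linear mapping. Given a function $R : \mathbb{R}^{m \times n} \to \mathbb{R}$, we define a function
$$f(\lambda) = R(A(\lambda)) = R\Big(\sum_{(u,v) \in \mathcal{U} \times \mathcal{V}} \lambda_{u,v}\, u v^\top\Big).$$
It is easy to verify that if $R$ is a convex function over $\mathbb{R}^{m \times n}$ then $f$ is convex over $\mathbb{R}^{|\mathcal{U} \times \mathcal{V}|}$ (since $f$ is a composition of $R$ over a linear mapping). We can therefore reduce the problem given in Equation (1) to the problem
$$\min_{\lambda \,:\, \|\lambda\|_0 \le r} f(\lambda). \qquad (2)$$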
While the optimization problem given in Equation (2) is over an arbitrarily large space, we next show that a forward greedy selection procedure can be implemented efficiently. The greedy algorithm starts with $\lambda = (0, \ldots, 0)$. At each iteration, we first find the vectors $(u, v)$ that maximize the magnitude of the partial derivative of $f(\lambda)$ with respect to $\lambda_{u,v}$. Assuming that $R$ is differentiable, and using the chain rule, we obtain:
$$\frac{\partial f(\lambda)}{\partial \lambda_{u,v}} = \langle u v^\top, \nabla R(A(\lambda)) \rangle = u^\top \nabla R(A(\lambda))\, v ,$$
where $\nabla R(A(\lambda))$ is the $m \times n$ matrix of partial derivatives of $R$ with respect to the elements of $A(\lambda)$. The vectors $u, v$ that maximize the magnitude of the above expression are the left and right singular vectors corresponding to the maximal singular value of $\nabla R(A(\lambda))$. Therefore, even though the number of elements in $\mathcal{U} \times \mathcal{V}$ is very large, we can still perform a greedy selection of one pair $(u, v)$ in an efficient way.
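The chain-rule identity used here can be checked numerically. The sketch below assumes, purely for illustration, the objective 0.5 * sum of squared differences to a fixed matrix `Y` (this choice, and all function names, are assumptions and not taken from the paper). It compares the bilinear form of the gradient with a central finite difference of the objective along a rank-one direction.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def sq_loss(a: Matrix, y: Matrix) -> float:
    """Illustrative objective: 0.5 * sum_ij (A_ij - Y_ij)^2."""
    return 0.5 * sum((a[i][j] - y[i][j]) ** 2
                     for i in range(len(a)) for j in range(len(a[0])))


def sq_loss_grad(a: Matrix, y: Matrix) -> Matrix:
    """Gradient of sq_loss with respect to A, which is A - Y."""
    return [[a[i][j] - y[i][j] for j in range(len(a[0]))]
            for i in range(len(a))]


def bilinear(u: Sequence[float], g: Matrix, v: Sequence[float]) -> float:
    """Compute u^T G v."""
    return sum(u[i] * g[i][j] * v[j]
               for i in range(len(u)) for j in range(len(v)))


def add_rank_one(a: Matrix, eta: float,
                 u: Sequence[float], v: Sequence[float]) -> Matrix:
    """Return A + eta * u v^T (moving a single coordinate of lambda by eta)."""
    return [[a[i][j] + eta * u[i] * v[j] for j in range(len(v))]
            for i in range(len(u))]


A = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0]]
Y = [[0.0, 1.0, 1.0], [2.0, 0.0, 1.0]]
u = [0.6, 0.8]          # unit vector in R^2
v = [0.0, 0.6, 0.8]     # unit vector in R^3

# Partial derivative predicted by the chain rule.
analytic = bilinear(u, sq_loss_grad(A, Y), v)

# Central finite difference of the objective along the direction u v^T.
h = 1e-5
numeric = (sq_loss(add_rank_one(A, h, u, v), Y)
           - sq_loss(add_rank_one(A, -h, u, v), Y)) / (2 * h)
```

Both quantities agree up to floating-point error, which is what makes the search over all pairs reducible to a singular vector computation on the gradient matrix.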
In some situations, even the calculation of the leading singular vectors might be too expensive. We therefore allow approximate maximization, and denote by $\mathrm{ApproxSV}(\nabla R(A(\lambda)), \tau)$ a procedure which returns vectors $u, v$ for which
$$u^\top \nabla R(A(\lambda))\, v \;\ge\; (1 - \tau) \max_{p \in \mathcal{U},\, q \in \mathcal{V}} p^\top \nabla R(A(\lambda))\, q .$$
[Footnote 2: An example of such a procedure is the power iteration method, which can implement ApproxSV in time $O(N \log(n) / \tau)$, where $N$ is the number of non-zero elements of $\nabla R(A(\lambda))$. See Theorem 3.1 in (Kuczyński & Woźniakowski, 1992). Our analysis shows that the value of $\tau$ has a mild effect on the convergence of GECO, and one can even choose a constant value like $\tau = 1/2$. This is in contrast to (Hazan, 2008; Jaggi & Sulovskỳ, 2010) which require the approximation parameter to decrease when the rank increases. Note also that the ApproxEV procedure described in (Hazan, 2008; Jaggi & Sulovskỳ, 2010) requires an additive approximation, while we require a multiplicative approximation.]
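Let $U$ and $V$ be matrices whose columns contain the vectors $u$ and $v$ we aggregated so far. The second step of each iteration of the algorithm sets $\lambda$ to be the solution of the following optimization problem:
$$\min_{\lambda} f(\lambda) \quad \text{s.t.} \quad \mathrm{supp}(\lambda) \subseteq \mathrm{span}(U) \times \mathrm{span}(V), \qquad (3)$$
where $\mathrm{supp}(\lambda) = \{(u, v) : \lambda_{u,v} \neq 0\}$, and $\mathrm{span}(U)$, $\mathrm{span}(V)$ are the linear spans of the columns of $U$, $V$ respectively.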
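We now describe how to solve Equation (3). Let $s$ be the number of columns of $U$ and $V$. Note that any vector $u \in \mathrm{span}(U)$ can be written as $U b_u$, where $b_u \in \mathbb{R}^s$, and similarly, any $v \in \mathrm{span}(V)$ can be written as $V b_v$. Therefore, if the support of $\lambda$ is in $\mathrm{span}(U) \times \mathrm{span}(V)$ we have that $A(\lambda)$ can be written as
$$A(\lambda) = \sum_{(u,v)} \lambda_{u,v}\, (U b_u)(V b_v)^\top = U \Big( \sum_{(u,v)} \lambda_{u,v}\, b_u b_v^\top \Big) V^\top .$$
Thus, any $\lambda$ whose support is in $\mathrm{span}(U) \times \mathrm{span}(V)$ yields a matrix $B(\lambda) = \sum_{(u,v)} \lambda_{u,v}\, b_u b_v^\top$. The SVD theorem tells us that the opposite direction is also true, namely, for any $B \in \mathbb{R}^{s \times s}$ there exists $\lambda$ whose support is in $\mathrm{span}(U) \times \mathrm{span}(V)$ that generates $B$ (and also $U B V^\top$). Denote $\tilde{R}(B) = R(U B V^\top)$; it follows that Equation (3) is equivalent to the following unconstrained optimization problem: $\min_{B \in \mathbb{R}^{s \times s}} \tilde{R}(B)$. It is easy to verify that $\tilde{R}$ is a convex function, and therefore can be minimized efficiently. Once we obtain the matrix $B$ that minimizes $\tilde{R}(B)$ we can use its SVD to generate the corresponding $\lambda$.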
In practice, we do not need to maintain $\lambda$ at all, but only to maintain matrices $U, V$ such that $A(\lambda) = U V^\top$. A summary of the pseudo-code is given in Algorithm 1. The runtime of the algorithm is as follows. Step 4 can be performed in time $O(N \log(n) / \tau)$, where $N$ is the number of non-zero elements of $\nabla R(U V^\top)$, using the power method (see Footnote 2). Since our analysis (given in the analysis section below) allows $\tau$ to be a constant (e.g. $1/2$), this means that the runtime is $O(N \log(n))$. The runtime of Step 6 depends on the structure of the function $R$. We specify it when describing specific applications of GECO in later sections. Finally, the runtime of Step 7 is at most $r^3$, and Step 8 takes $O(r^2 n)$.
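The approximate singular vector step can be sketched with plain power iteration. The version below is a minimal illustration under stated assumptions: it runs a fixed number of iterations (`num_iters`, a hypothetical parameter) from a random start instead of deriving the iteration count from a tolerance, it works on dense lists rather than a sparse gradient, and the name `approx_sv` is only a stand-in for the procedure described in the text.

```python
import math
import random
from typing import List, Tuple

Matrix = List[List[float]]


def mat_vec(g: Matrix, v: List[float]) -> List[float]:
    """Compute G v."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in g]


def mat_t_vec(g: Matrix, u: List[float]) -> List[float]:
    """Compute G^T u."""
    return [sum(g[i][j] * u[i] for i in range(len(g)))
            for j in range(len(g[0]))]


def normalize(x: List[float]) -> List[float]:
    norm = math.sqrt(sum(t * t for t in x))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [t / norm for t in x]


def approx_sv(g: Matrix, num_iters: int = 50,
              seed: int = 0) -> Tuple[List[float], List[float]]:
    """Approximate the leading left/right singular vectors of a non-zero G."""
    rng = random.Random(seed)
    v = normalize([rng.gauss(0.0, 1.0) for _ in range(len(g[0]))])
    for _ in range(num_iters):
        u = normalize(mat_vec(g, v))      # left vector from the current right vector
        v = normalize(mat_t_vec(g, u))    # right vector from the updated left vector
    u = normalize(mat_vec(g, v))
    return u, v


G = [[3.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
u, v = approx_sv(G)
top_value = sum(u[i] * G[i][j] * v[j] for i in range(2) for j in range(3))
```

Each iteration costs two matrix-vector products, so with a sparse gradient the per-iteration cost is proportional to its number of non-zero entries.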
GECO chooses $(u, v)$ to be the leading singular vectors, which are the maximizers of $u^\top \nabla R(A)\, v$ over the unit spheres of $\mathbb{R}^m$ and $\mathbb{R}^n$. Our analysis in the next section guarantees that this choice yields a sufficient decrease of the objective function. However, there may be a pair $(u, v)$ which leads to an even larger decrease in the objective value. Choosing such a direction can lead to improved performance. We note that our analysis in the next section still holds, as long as the direction we choose leads to a larger decrease in the objective value, relative to the decrease we can get from using the leading singular vectors. In the experiments section we describe a method that finds better directions.
Each iteration of GECO increases the rank by $1$. In many cases, it is possible to decrease the objective by replacing one of the components without increasing the rank. If we verify that this replacement step indeed decreases the objective (by simply evaluating the objective before and after the change), then the analysis we present in the next section remains valid. We now describe a simple way to perform a replacement. We start with finding a candidate pair $(u, v)$ and perform Steps 5-7 of GECO. Then, we approximate the matrix $B$ by zeroing its smallest singular value. Let $\tilde{B}$ denote this approximation. We next check if $R(U \tilde{B} V^\top)$ is strictly smaller than the previous objective value. If yes, we update $U, V$ based on $\tilde{B}$ and obtain that the rank of $U V^\top$ has not been increased while the objective has been decreased. Otherwise, we update $U, V$ based on $B$, thus increasing the rank, but our analysis tells us that we are guaranteed to sufficiently decrease the objective. If we restrict the algorithm to perform at most $O(1)$ attempted replacement steps between each rank-increasing iteration, then its runtime guarantee is only increased by an $O(1)$ factor, and all the convergence guarantees remain valid.
In some situations, rank constraint is not enough for obtaining good generalization guarantees and one can consider objective functions which contains additional regularization of the form , where is the vector of singular values of and is a vector function such as . For example, if , this regularization term is equivalent to Frobenius norm regularization of . In general, adding a convex regularization term should not pose any problem. A simple trick to do this is to orthonormalize the columns of and before Step 6. Therefore, for any , the singular values of equal the singular values of . Thus, we can solve the problem in Step 6 more efficiently while regularizing instead of the larger matrix .
Step 6 of GECO involves solving a problem with $i^2$ variables, where $i \le r$. When $r$ is small this is a reasonable computational effort. However, when $r$ is large, Steps 6-7 can be expensive. For example, in matrix completion problems, the complexity of Step 6 can scale with $r^6$. If runtime is important, it is possible to restrict $B$ to be a diagonal matrix, or in other words, we only optimize over the coefficients of $\lambda$ corresponding to $U$ and $V$ without changing the support of $\lambda$. Thus, in Step 6 we solve a problem with $i$ variables, and Step 7 is not needed. It is possible to verify that the analysis we give in the next section still holds for this variant.
In this section we give a competitive analysis for GECO. The first theorem shows that after performing $r$ iterations of GECO, its solution is not much worse than the solution of all matrices $\bar{A}$ whose trace norm [Footnote 3: The trace norm of a matrix is the sum of its singular values.] is bounded by a function of $r$. The second theorem shows that with additional assumptions, we can be competitive with matrices whose rank is at most . The proofs can be found in the long version of this paper.
To formally state the theorems we first need to define a smoothness property of the function .
Definition 1 (smoothness)
We say that is -smooth if for any and we have
where $e_{u,v}$ is the all-zeros vector except for a $1$ in the coordinate corresponding to $(u, v)$. We say that $R$ is $\beta$-smooth if the function $f(\lambda) = R(A(\lambda))$ is $\beta$-smooth.
Fix some . Assume that GECO (or one of its variants) is run with a -smooth function , a rank constraint , and a tolerance parameter . Let be its output matrix. Then, for all matrices with
we have that .
The previous theorem shows competitiveness with matrices of low trace norm. Our second theorem shows that with additional assumptions on the function we can be competitive with matrices of low rank as well. We need the following definition.
Definition 2 (strong convexity)
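Let $I \subset \mathcal{U} \times \mathcal{V}$. We say that $f$ is $\sigma$-strongly-convex over $I$ if for any $\lambda_1, \lambda_2$ whose support [Footnote 4: The support of $\lambda$ is the set of $(u, v)$ for which $\lambda_{u,v} \neq 0$.] is in $I$ we have
We say that $R$ is $\sigma$-strongly-convex over $I$ if the function $f(\lambda) = R(A(\lambda))$ is $\sigma$-strongly-convex over $I$.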
Assume that the conditions of Theorem 1 hold. Then, for any such that
and such that is -strongly-convex over the singular vectors of , we have that .
We discuss the implications of these theorems for several applications in the next sections.
Matrix completion is the problem of predicting the entries of some unknown target matrix $Y \in \mathbb{R}^{m \times n}$ based on a random subset of observed entries, $E \subset [m] \times [n]$. For example, in the famous Netflix problem, $m$ represents the number of users, $n$ represents the number of movies, and $Y_{i,j}$ is a rating user $i$ gives to movie $j$. One approach for learning the matrix $Y$ is to find a matrix $A$ of low rank which approximately agrees with $Y$ on the entries of $E$ (in mean squared error terms). Using the notation of this paper, we would like to minimize the objective
$$R(A) = \frac{1}{|E|} \sum_{(i,j) \in E} (A_{i,j} - Y_{i,j})^2$$
over low rank matrices $A$.
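We now specify GECO for this objective function. It is easy to verify that the $(i, j)$ element of $\nabla R(A)$ is $\frac{2}{|E|}(A_{i,j} - Y_{i,j})$ if $(i, j) \in E$ and $0$ otherwise. The number of non-zero elements of $\nabla R(A)$ is at most $|E|$, and therefore Step 4 of GECO can be implemented using the power method in time $O(|E| \log(n))$. Given matrices $U, V$, let $u_i$ be the $i$’th row of $U$ and $v_j$ be the $j$’th row of $V$. We have that the $(i, j)$ element of the matrix $U B V^\top$ can be written as $\langle \mathrm{vec}(B), \mathrm{vec}(u_i^\top v_j) \rangle$, where $\mathrm{vec}$ of a matrix is the vector obtained by taking all the elements of the matrix column wise. We can therefore rewrite $\tilde{R}(B)$ as $\frac{1}{|E|} \sum_{(i,j) \in E} \big( \langle \mathrm{vec}(B), \mathrm{vec}(u_i^\top v_j) \rangle - Y_{i,j} \big)^2$, which makes Step 6 of GECO a vanilla least squares problem over at most $r^2$ variables. The runtime of this step is therefore bounded by $O(r^6 + |E| r^4)$.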
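The reduction of the fully corrective step to least squares rests on writing each entry of the product of the left factor, the small square matrix, and the transposed right factor as an inner product between two vectorized matrices. The sketch below checks this identity on a toy example and assembles one least squares row per observed entry. All names (`vec`, `design_rows`, `observed`) and the toy data are hypothetical, and solving the resulting least squares problem is left to any standard solver.

```python
from typing import Dict, List, Sequence, Tuple

Matrix = List[List[float]]


def vec(mat: Matrix) -> List[float]:
    """Stack the columns of a matrix into a single vector (column-wise order)."""
    return [mat[p][q] for q in range(len(mat[0])) for p in range(len(mat))]


def outer(x: Sequence[float], y: Sequence[float]) -> Matrix:
    """Outer product x y^T of two row vectors given as sequences."""
    return [[xp * yq for yq in y] for xp in x]


def entry_of_ubvt(u_mat: Matrix, b: Matrix, v_mat: Matrix,
                  i: int, j: int) -> float:
    """Entry (i, j) of U B V^T, computed directly from row i of U and row j of V."""
    s = len(b)
    return sum(u_mat[i][p] * b[p][q] * v_mat[j][q]
               for p in range(s) for q in range(s))


def design_rows(u_mat: Matrix, v_mat: Matrix,
                observed: Dict[Tuple[int, int], float]):
    """One least squares row per observed entry: features and target value."""
    features, targets = [], []
    for (i, j), y_ij in observed.items():
        features.append(vec(outer(u_mat[i], v_mat[j])))
        targets.append(y_ij)
    return features, targets


U = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]   # 3 rows, 2 components
V = [[1.0, 0.0], [2.0, 1.0]]                 # 2 rows, 2 components
B = [[0.5, -1.0], [2.0, 0.0]]
observed = {(0, 1): 4.0, (2, 0): -1.0}       # toy observed ratings

features, targets = design_rows(U, V, observed)
# Predictions are linear in vec(B), so fitting B is ordinary least squares.
predictions = [sum(f * b for f, b in zip(row, vec(B))) for row in features]
```

Because the predictions are linear in the vectorized square matrix, the number of unknowns is the square of the current number of components, which is the source of the polynomial dependence on the rank in the runtime of this step.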
To apply our analysis for matrix completion we first bound the smoothness parameter.
For matrix completion the smoothness parameter is at most .
Proof For any and we can rewrite as
Taking expectation over we obtain:
, the proof follows.
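The expansion behind this proof can be checked numerically. The sketch below assumes that the objective is the mean squared error over a set of distinct observed entries (an assumption made for this illustration, as are all names used). Under that assumption the objective along a rank-one direction is an exact quadratic in the step size, and the coefficient of the squared step size is the average of the squared products of the direction's coordinates over the observed entries. The sketch does not commit to a particular normalization of the smoothness parameter.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]
Entries = List[Tuple[int, int]]


def mse(a: Matrix, y: Matrix, observed: Entries) -> float:
    """Mean squared error over the observed entries."""
    return sum((a[i][j] - y[i][j]) ** 2 for i, j in observed) / len(observed)


def grad_inner(a: Matrix, y: Matrix, observed: Entries,
               u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of the gradient of mse with the rank-one matrix u v^T."""
    return 2.0 * sum((a[i][j] - y[i][j]) * u[i] * v[j]
                     for i, j in observed) / len(observed)


def quad_coeff(observed: Entries,
               u: Sequence[float], v: Sequence[float]) -> float:
    """Average of (u_i * v_j)^2 over the observed entries."""
    return sum((u[i] * v[j]) ** 2 for i, j in observed) / len(observed)


A = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0]]
Y = [[0.0, 1.0, 1.0], [2.0, 0.0, 1.0]]
observed = [(0, 0), (0, 2), (1, 1)]
u, v, eta = [0.6, 0.8], [0.0, 0.6, 0.8], 0.7

shifted = [[A[i][j] + eta * u[i] * v[j] for j in range(3)] for i in range(2)]
lhs = mse(shifted, Y, observed)
rhs = (mse(A, Y, observed) + eta * grad_inner(A, Y, observed, u, v)
       + eta ** 2 * quad_coeff(observed, u, v))

# For unit vectors the coefficient is at most 1 / |E|.
general_coeff = quad_coeff(observed, u, v)

# For balanced vectors (entries of magnitude 1/sqrt(m) and 1/sqrt(n))
# the coefficient equals 1 / (m * n), whatever the observed set is.
u_bal = [1 / math.sqrt(2), -1 / math.sqrt(2)]
v_bal = [1 / math.sqrt(3), 1 / math.sqrt(3), -1 / math.sqrt(3)]
balanced_coeff = quad_coeff(observed, u_bal, v_bal)
```

The gap between the two coefficients is what motivates searching for balanced update directions in the experiments.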
Our general analysis therefore implies that for any , GECO can find a matrix with rank , such that .
Let us now discuss the implications of this result for the number of observed entries required for predicting all the entries of $Y$. Suppose that the entries are sampled i.i.d. from some unknown distribution , for all and . Denote the generalization error of a matrix $A$ by
Using generalization bounds for low rank matrices (e.g. (Srebro et al., 2005)), it is possible to show that for any matrix $A$ of rank at most $r$ we have that with high probability [Footnote 5: To be more precise, this bound requires that the elements of $A$ are bounded by a constant. But, since we can assume that the elements of $Y$ are bounded by a constant, it is always possible to clip the elements of $A$ to the range of the elements of $Y$ without increasing the generalization error.]
Combining this with our analysis for GECO, and optimizing , it is easy to derive the following:
Fix some matrix . Then, GECO can find a matrix such that with high probability over the choice of the entries in
Without loss of generality assume that $m \le n$. It follows that if is order of then order of entries suffice to learn the matrix $Y$. This matches recent learning-theoretic guarantees for distribution-free learning with the trace norm (Shalev-Shwartz & Shamir, 2011).
A very common problem in data analysis is finding a low-rank matrix which approximates a given matrix , namely solving , where is some discrepancy measure. For simplicity, assume that . When is the normalized Frobenius norm , this problem can be solved efficiently via SVD. However, due to the use of the Frobenius norm, this procedure is well-known to be sensitive to outliers.
One way to make the procedure more robust is to replace the Frobenius norm by a less sensitive norm, such as the $\ell_1$ norm (see for instance (A. Baccini & Falguerolles, 1996), (Croux & Filzmoser, 1998), (Ke & Kanade, 2005)). Unfortunately, there are no known efficient algorithms to obtain the global optimum of this objective function, subject to a rank constraint on $A$. However, using our proposed algorithm, we can efficiently find a low-rank matrix which approximately minimizes $d(A, Y)$. In particular, we can apply it to any convex discrepancy measure $d$, including robust ones such as the $\ell_1$ norm. The only technicality is that our algorithm requires $d$ to be smooth, which is not true in the case of the $\ell_1$ norm. However, this can be easily alleviated by working with smoothed versions of the $\ell_1$ norm, which replace the absolute value by a smooth approximation. One example is a Huber loss, defined as $L(x) = x^2/2$ for $|x| \le 1$, and $L(x) = |x| - 1/2$ otherwise.
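A smoothed absolute value of this kind can be sketched as follows. The code assumes the standard Huber form with a threshold parameter `delta` (default 1); the exact constants used in the paper are not recoverable from the text, so the threshold, the function names, and the averaging in `huber_discrepancy` are assumptions of this example.

```python
from typing import List

Matrix = List[List[float]]


def huber(a: float, b: float, delta: float = 1.0) -> float:
    """Standard Huber loss: quadratic near zero, linear in the tails."""
    x = abs(a - b)
    if x <= delta:
        return 0.5 * x * x
    return delta * (x - 0.5 * delta)


def huber_grad(a: float, b: float, delta: float = 1.0) -> float:
    """Derivative of huber(a, b) with respect to a; it is clipped to [-delta, delta]."""
    d = a - b
    if abs(d) <= delta:
        return d
    return delta if d > 0 else -delta


def huber_discrepancy(a: Matrix, y: Matrix) -> float:
    """Average Huber loss over all entries of two equally sized matrices."""
    m, n = len(a), len(a[0])
    return sum(huber(a[i][j], y[i][j])
               for i in range(m) for j in range(n)) / (m * n)


small = huber(0.5, 0.0)          # quadratic regime
large = huber(3.0, 0.0)          # linear regime
slope = huber_grad(-3.0, 0.0)    # gradient is clipped for outliers
avg = huber_discrepancy([[0.5, 3.0]], [[0.0, 0.0]])
```

Because the gradient of each entry is clipped, a single outlying entry contributes a bounded amount to the gradient matrix, which is the source of the robustness compared with the squared loss.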
The smoothness parameter of , where is the Huber loss, is at most .
Proof It is easy to verify that the smoothness parameter of is , since is upper bounded by the parabola , whose smoothness parameter is exactly . Therefore,
Taking the average over all entries, this implies that
Since the last term is at most , the result follows.
We therefore obtain:
Let be the Huber loss discrepancy as defined in Lemma 2. Then, for any matrix , GECO can find a matrix with and .
We evaluated GECO for the problem of matrix completion by conducting experiments on three standard collaborative filtering datasets: MovieLens100K, MovieLens1M, and MovieLens10M [Footnote 6: Available through www.grouplens.org.]. The different datasets contain ratings of users on movies, respectively. All the ratings are integers in . We partitioned each data set into training and testing sets as done in (Jaggi & Sulovskỳ, 2010).
We implemented GECO while applying two of the variants described earlier, as we explain in detail below. The first variant tries to find update vectors $(u, v)$ which lead to a larger decrease of the objective function relative to the leading singular vectors of the gradient matrix $\nabla R(A)$. Inspired by the proof of Theorem 1, we observe that the decrease of the objective function inversely depends on the smoothness of the scalar function $R(A + \eta\, u v^\top)$. We therefore would like to find a pair $(u, v)$ which on one hand has a large correlation with $\nabla R(A)$ and on the other hand yields a smooth scalar function $R(A + \eta\, u v^\top)$. The smoothness of is analyzed in Lemma 1 and is shown to be at most . Examining the proof lines more carefully, we see that for balanced vectors, i.e. $u_i = \pm 1/\sqrt{m}$, $v_j = \pm 1/\sqrt{n}$, we obtain a lower smoothness parameter of . Thus, a possible good update direction is to choose $(u, v)$ that maximizes $u^\top \nabla R(A)\, v$ over vectors of the form $u \in \{\pm 1/\sqrt{m}\}^m$, $v \in \{\pm 1/\sqrt{n}\}^n$. This is equivalent to maximizing $u^\top \nabla R(A)\, v$ over the $\ell_\infty$ balls of $\mathbb{R}^m$ and $\mathbb{R}^n$, which is unfortunately known to be NP-hard. Nevertheless, a simple alternate maximization approach is easy to implement and often works well. That is, fixing some $u$, we can see that $v = \mathrm{sign}(\nabla R(A)^\top u)/\sqrt{n}$ maximizes the objective, and similarly, fixing $v$ we have that $u = \mathrm{sign}(\nabla R(A)\, v)/\sqrt{m}$ is optimal. We therefore implement this alternate maximization at each step and find a candidate pair. As described earlier, we compare the decrease of loss as obtained by the leading singular vectors and by the candidate pair mentioned previously, and update using the pair which leads to a larger decrease of the objective. We remind the reader that although the candidate pair is obtained heuristically, our implementation is still provably correct and our guarantees from the analysis section still hold.
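The alternate maximization heuristic can be sketched as follows. This is a minimal illustration, not the authors' implementation: the starting point, the fixed number of rounds, the tie-breaking rule for the sign of zero, and the name `alternate_sign_maximization` are all assumptions made for the example.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def sign(x: float) -> float:
    """Sign with ties broken towards +1 (an arbitrary choice for this sketch)."""
    return 1.0 if x >= 0 else -1.0


def alternate_sign_maximization(
        g: Matrix, num_rounds: int = 10
) -> Tuple[List[float], List[float], List[float]]:
    """Heuristically maximize u^T G v over balanced sign vectors u and v."""
    m, n = len(g), len(g[0])
    v = [1.0 / math.sqrt(n)] * n
    u = [1.0 / math.sqrt(m)] * m
    values = []
    for _ in range(num_rounds):
        # With v fixed, the best balanced u follows the signs of G v.
        gv = [sum(g[i][j] * v[j] for j in range(n)) for i in range(m)]
        u = [sign(t) / math.sqrt(m) for t in gv]
        # With u fixed, the best balanced v follows the signs of G^T u.
        gtu = [sum(g[i][j] * u[i] for i in range(m)) for j in range(n)]
        v = [sign(t) / math.sqrt(n) for t in gtu]
        values.append(sum(u[i] * g[i][j] * v[j]
                          for i in range(m) for j in range(n)))
    return u, v, values


G = [[1.0, -2.0, 3.0],
     [-4.0, 5.0, -6.0]]
u, v, values = alternate_sign_maximization(G)
```

Each half-step can only increase the objective, so the recorded values are non-decreasing, but the procedure may stop at a local optimum; this is why the result is only used as a candidate that is compared against the leading singular vectors.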
In addition, we performed the replacement steps described earlier. For that purpose, let be the number of times we try to perform additional replacement steps for each rank. Each replacement attempt is done using the alternate maximization procedure described previously. After utilizing attempts of additional replacement steps, we force an increase of the rank. In our experiments, we set . Finally, we implemented the ApproxSV procedure using iterations of the power iteration method.
We compared GECO to a state-of-the-art method, recently proposed in (Jaggi & Sulovskỳ, 2010), which we denote as the JS algorithm. JS, similarly to GECO, iteratively increases the rank by computing a direction that maximizes some objective function and performing a step in that direction. See the discussion of related work above for more details. In Figure 1, we plot the root mean squared error (RMSE) on the test set as a function of the rank. As can be seen, GECO decreases the error much faster than the JS algorithm. This is expected; see again the discussion of related work. We observe that GECO achieves slightly larger test error on the small data set, slightly smaller test error on the medium data set, and the same error on the large data set. On the small data set, GECO starts to overfit when the rank increases beyond . The JS algorithm avoids this overfitting by constraining the trace-norm, but also starts overfitting after around iterations. On the other hand, on the medium data, the trace-norm constraint employed by the JS algorithm yields a higher estimation error, and GECO, which does not constrain the trace-norm, achieves a smaller error. In any case, GECO achieves very good results while using a rank of at most .
GECO is an efficient greedy approach for minimizing a convex function subject to a rank constraint. One of the main advantages of GECO is that each of its iterations involves running few (precisely, ) iterations of the power method, and therefore GECO scales to large matrices. In future work we intend to apply GECO to additional applications such as multiclass classification and learning fast quadratic classifiers.
Acknowledgements This work emerged from fruitful discussions with Tomer Baba, Barak Cohen, Harel Livyatan, and Oded Schwarz. The work is supported by the Israeli Science Foundation grant number 598-10.
- A. Baccini & Falguerolles (1996) A. Baccini, P. Besse and Falguerolles, A. A l1-norm pca and a heuristic approach. In E. Diday, Y. Lechevalier and Opitz, P. (eds.), Ordinal and Symbolic Data Analysis, pp. 359–368. Springer, 1996.
- Amit et al. (2007) Amit, Yonatan, Fink, Michael, Srebro, Nathan, and Ullman, Shimon. Uncovering shared structures in multiclass classification. In International Conference on Machine Learning, 2007.
- Cai et al. (2008) Cai, J.F., Candes, E.J., and Shen, Z. A singular value thresholding algorithm for matrix completion. preprint, 2008.
- Candes & Plan (2010) Candes, E.J. and Plan, Y. Matrix completion with noise. Proceedings of the IEEE, 98(6):925–936, 2010. ISSN 0018-9219.
- Candès & Recht (2009) Candès, E.J. and Recht, B. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009. ISSN 1615-3375.
- Clarkson (2008) Clarkson, K.L. Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. In Proceedings of the nineteenth annual ACM-SIAM symposium on Discrete algorithms, pp. 922–931, 2008.
- Croux & Filzmoser (1998) Croux, C. and Filzmoser, P. Robust factorization of a data matrix. In COMPASTAT, Proceedings in Computational Statistics, 1998.
- Dai & Milenkovic (2008) Dai, W. and Milenkovic, O. Subspace pursuit for compressive sensing: Closing the gap between performance and complexity, 2008.
- Davis et al. (1997) Davis, G., Mallat, S., and Avellaneda, M. Greedy adaptive approximation. Journal of Constructive Approximation, 13:57–98, 1997.
- Fazel et al. (2002) Fazel, M., Hindi, H., and Boyd, S.P. A rank minimization heuristic with application to minimum order system approximation. In American Control Conference, 2001. Proceedings of the 2001, volume 6, pp. 4734–4739. IEEE, 2002. ISBN 0780364953.
- Frank & Wolfe (1956) Frank, M. and Wolfe, P. An algorithm for quadratic programming. Naval Res. Logist. Quart., 3:95–110, 1956.
- Hazan (2008) Hazan, Elad. Sparse approximate solutions to semidefinite programs. In Proceedings of the 8th Latin American conference on Theoretical informatics, pp. 306–316, 2008.
- Jaggi & Sulovskỳ (2010) Jaggi, M. and Sulovskỳ, M. A simple algorithm for nuclear norm regularized problems. In ICML, 2010.
- Ke & Kanade (2005) Ke, Q. and Kanade, T. Robust l norm factorization in the presence of outliers and missing data by alternative convex programming. In CVPR, 2005.
- Keshavan & Oh (2009) Keshavan, R.H. and Oh, S. Optspace: A gradient descent algorithm on the grassman manifold for matrix completion. Arxiv preprint arXiv:0910.5260 v2, 2009.
- Keshavan et al. (2010) Keshavan, R.H., Montanari, A., and Oh, S. Matrix completion from a few entries. Information Theory, IEEE Transactions on, 56(6):2980–2998, 2010. ISSN 0018-9448.
- Koren et al. (2009) Koren, Yehuda, Bell, Robert M., and Volinsky, Chris. Matrix factorization techniques for recommender systems. IEEE Computer, 42(8):30–37, 2009.
- Kuczyński & Woźniakowski (1992) Kuczyński, J. and Woźniakowski, H. Estimating the largest eigenvalue by the power and Lanczos algorithms with a random start. SIAM journal on matrix analysis and applications, 13:1094, 1992.
- Lee & Bresler (2010) Lee, K. and Bresler, Y. Admira: Atomic decomposition for minimum rank approximation. Information Theory, IEEE Transactions on, 56(9):4402–4416, 2010. ISSN 0018-9448.
- Mallat & Zhang (1993) Mallat, S. and Zhang, Z. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41:3397–3415, 1993.
- Natarajan (1995) Natarajan, B. Sparse approximate solutions to linear systems. SIAM J. Computing, 25(2):227–234, 1995.
- Needell & Tropp (2009) Needell, D. and Tropp, J.A. CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis, 26(3):301–321, 2009. ISSN 1063-5203.
- Pati et al. (2002) Pati, YC, Rezaiifar, R., and Krishnaprasad, PS. Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition. In Signals, Systems and Computers, 1993. 1993 Conference Record of The Twenty-Seventh Asilomar Conference on, pp. 40–44. IEEE, 2002. ISBN 0818641207.
- Recht (to appear) Recht, B. A simpler approach to matrix completion. JMLR, to appear.
- Recht et al. (2007) Recht, B., Fazel, M., and Parrilo, P.A. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. preprint, 2007.
- Shalev-Shwartz & Shamir (2011) Shalev-Shwartz, Shai and Shamir, Ohad. Collaborative filtering with the trace norm: Learning, bounding, and transducing. In COLT, 2011.
- Shalev-Shwartz et al. (2010) Shalev-Shwartz, Shai, Zhang, Tong, and Srebro, Nathan. Trading accuracy for sparsity in optimization problems with sparsity constraints. SIAM Journal on Optimization, 20:2807–2832, 2010.
- Srebro et al. (2005) Srebro, N., Alon, N., and Jaakkola, T. Generalization error bounds for collaborative prediction with low-rank matrices. Advances In Neural Information Processing Systems, 17, 2005.
To prove the theorem we need the following key lemma, which generalizes a result given in (Shalev-Shwartz et al., 2010).
Assume that is -smooth. Let be two subsets of . Let be a minimizer of over all vectors with support in and let be a vector supported on . Assume that , denote , and let . Let . Then, there exists such that
Without loss of generality assume that (if for some we can set and without affecting the objective) and assume that (if this does not hold, let ). For any , let be the partial derivative of w.r.t. coordinate at and denote
Note that the definition of and our assumption above implies that
Therefore, for all we have
In addition, the smoothness assumption tells us that for all we have . Thus, for any we have
Combining the above we get
Multiplying both sides by and noting that
we get that
Since is a minimizer of over we have that for . Combining this with the fact that is supported on and is supported on we obtain that
From the convexity of we know that . Combining all the above we obtain
This holds for all and in particular for (which is positive). Thus,
Rearranging the above concludes our proof.
Equipped with the above lemma we are ready to prove Theorem 1.
Fix some and let be the vector of its singular values. Thus, and . For each iteration , denote , where is the value of at the beginning of iteration of GECO, before we increase the rank to be . Note that all the operations we perform in GECO or one of its variants guarantee that the loss is monotonically non-increasing. Therefore, if we are done. In addition, whenever we increase the rank by , the definition of the update implies that , where . Lemma 3 implies that