Reexamining Low Rank Matrix Factorization for Trace Norm Regularization

# Reexamining Low Rank Matrix Factorization for Trace Norm Regularization

Carlo Ciliberto
c.ciliberto@ucl.ac.uk
Dimitrios Stamos
d.stamos@cs.ucl.ac.uk
Massimiliano Pontil
m.pontil@cs.ucl.ac.uk
###### Abstract
11footnotetext: University College London, London, UK22footnotetext: Computational Statistics and Machine Learning - Istituto Italiano di Tecnologia, Genova, Italy

Trace norm regularization is a widely used approach for learning low rank matrices. A standard optimization strategy is based on formulating the problem as one of low rank matrix factorization which, however, leads to a non-convex problem. In practice this approach works well, and it is often computationally faster than standard convex solvers such as proximal gradient methods. Nevertheless, it is not guaranteed to converge to a global optimum, and the optimization can be trapped at poor stationary points. In this paper we show that it is possible to characterize all critical points of the non-convex problem. This allows us to provide an efficient criterion to determine whether a critical point is also a global minimizer. Our analysis suggests an iterative meta-algorithm that dynamically expands the parameter space and allows the optimization to escape any non-global critical point, thereby converging to a global minimizer. The algorithm can be applied to problems such as matrix completion or multitask learning, and our analysis holds for any random initialization of the factor matrices. Finally, we confirm the good performance of the algorithm on synthetic and real datasets.

## 1 Introduction

Learning low rank matrices is a problem of broad interest in machine learning and statistics, with applications ranging from collaborative filtering [1, 2], to multitask learning [3], to computer vision [4], and many more. A principled approach to tackle this problem is via suitable convex relaxations. Perhaps the most successful strategy in this sense is provided by trace (or nuclear) norm regularization [5, 3, 6, 7]. However, solving the corresponding optimization problem is computationally expensive for two main reasons. First, many commonly used algorithms require the computation of the proximity operator (e.g. [8]) which entails performing the singular value decomposition at each iteration. Second, and most importantly, the space complexity of convex solvers grows with the matrix size, which makes it prohibitive to employ them for large scale applications.

Due to the above shortcomings, practical algorithms for low rank matrix completion often use an explicit low rank matrix factorization to reduce the number of variables (see e.g. [2, 9] and references therein). In particular, a reduced variational form of the trace norm is used [7]. The resulting problem is however non-convex, and popular methods such as alternate minimization or alternate gradient descent may get struck at poor stationary points. Recent studies [10, 11] have shown that under certain conditions on the data generation process (underling low rank model, RIP, etc.) a particular version of the non-convex problem can be solved efficiently. However such conditions are not verifiable in real applications, and the problem of finding a global solution remains open.

In this paper we characterize the critical points of the non-convex problem and provide an efficient criterion to determine whether a critical point is also a global minimizer. Our analysis is constructive and suggests an iterative meta-algorithm that dynamically expands the parameter space to escape any non-global critical point, thereby converging to a global minimizer. We highlight the potential of the proposed meta-algorithm, by comparing its computational and statistical performance to two state of the art methods [12, 9].

The paper is organized as follows. In Sec. 2 we introduce the trace norm regularization problem and basic notions used throughout the paper. In Sec. 2.2 we present the low rank matrix factorization approach and set the main questions addressed in the paper. In Sec. 3 we present our analysis of the critical points of the low rank matrix factorization problem and the associated meta-algorithm. In Sec. 4 we report results from numerical experiments using our method. The Appendix contains proofs of the results, only stated in the main body of the paper, together with further auxiliary results and empirical observations.

## 2 Background and Problem Setting

In this work we study trace norm regularized problems of the form

 minimizeW∈Rn×m fλ(W),fλ(W)=ℓ(W)+λ ∥W∥∗ (1)

where is a twice differentiable convex function with Lipschitz continuous gradient, denotes the trace norm of a matrix , namely the sum of the singular values of , and is a positive parameter. Examples of relevant learning problems that can be formulated as Eq. (1) are:

Matrix Completion. In this setting we wish to recover a matrix from a small subset of its entries. A typical choice for is the square error, , where is the Frobenius norm, denotes the Hadamard product (i.e. the entry-wise product) between two matrices and is a binary matrix used to “mask” the entries of that are not available.

Multi-task Learning. Here, the columns of matrix are interpreted as the regression vectors of different learning tasks. Given datasets with and , for , we choose , where is a prescribed loss function (e.g. the square or the logistic loss).

Other examples which are captured by problem (1) include collaborative filtering with attributes [13] and multiclass classification [5].

Problem (1) can be solved by first-order optimization methods such as the proximal forward-backward (PFB) splitting [8]. Given a starting point , PFB produces a sequence with

 Wk+1=proxγλ∥⋅∥∗ (Wk−γ∇ℓ(Wk)) (2)

where and for any convex function , the associated proximity operator at is defined as . In particular, the proximity operator of the trace norm at corresponds to performing a soft-thresholding on the singular values of . That is, assuming a singular value decomposition (SVD) , where , have orthonormal columns, , with and , we have

 proxγλ∥⋅∥∗(W)=Udiag(hγλ(σ1),…,hγλ(σr))V⊤ (3)

where is the soft-thresholding operator, defined for as .

PFB guarantees that for a suitable choice of the descent step (e.g. with the Lipschitz constant of the gradient of ), the sequence converges to the global minimum of the problem with a rate of [8] (faster rates can be achieved using accelerated versions of the algorithm [14]). However, at each iteration PFB performs the SVD of an matrix via Eqs. (2) and (3), which requires operations, a procedure that becomes prohibitively expensive for large values of and . Other methods such as those based on Frank-Wolfe procedure (e.g. [15]) alleviate this cost but require more iterations. More importantly, these methods need to store in memory the iterates imposing an space complexity, a major bottleneck for large scale applications. However, trace norm regularization is typically applied to problems where the solution of problem (1) is assumed to be low-rank, namely of the form for some and . Therefore it would be ideal to have an optimization method capable to capture this aspect and consequently reduce the memory requirement to by keeping track of the two factor matrices and throughout the optimization rather than their product. This is the idea behind factorization-based methods, which have been observed to lead to remarkable performance in practice and are the subject of our investigation in this work.

### 2.2 Matrix Factorization Approach

Factorization methods build on the so-called variational form of the trace norm. Specifically, the trace norm of a matrix can be characterized as (see e.g. [16] or A.4 in the Appendix)

 ∥W∥∗=12pinf{∥A∥2F+∥B∥2F, : r∈N, A∈Rn×r, B∈Rm×r, W=AB⊤}. (4)

with the infimum always attained for . The above formulation leads to the following “factorized” version of the original optimization problem (1)

 minimizeA∈Rn×r,B∈Rm×r gλ,r(A,B),gλ,r(A,B)=ℓ(AB⊤)+λ2 (∥A∥2F+∥B∥2F) (5)

where is now a further hyperparameter of the problem. Clearly, and are tightly related and a natural question is whether minimizing the latter would allow to recover a solution of the original problem. The following well-known result (of which we provide a proof in the Appendix for completeness) shows that the two problems are indeed equivalent (for sufficiently large ).

{restatable}

[Equivalence between problems (1) and (5)]propositionPEquivalence Let be a global minimizer of in Eq. (1) with . Then, for every , every global minimizer of is such that

 gλ,r(A∗,B∗)=fλ(A∗B⊤∗)=fλ(W∗). (6)

The above proposition implies that for sufficiently large values of in Eq. (5), the optimization of and are equivalent. Therefore, we can minimize by finding a global minimizer for . This is a well-known approach to trace norm regularization (see e.g. [2, 1]) and can be extremely advantageous in practice. Indeed, we can leverage on a large body of smooth optimization literature to solve such a factorized problem [17]. As an example, if we apply Gradient Descent (GD) from a starting point , we obtain the sequence with

 Ak+1=Ak−γ(∇ℓ(AkB⊤k)Bk+λAk)Bk+1=Bk−γ(∇ℓ(AkB⊤k)⊤Ak+λBk). (7)

This approach is much more appealing than PFB from a computational perspective because: ) for small values of , the iterations at Eq. (7) are extremely fast since they mainly consists of matrix products and therefore require only operations; the space complexity of GD is , which may be remarkably smaller than the of PFB for small values of ; ) Even if for large values of , e.g. , every iteration has the same time complexity as PFB, we do not need to perform expensive operations such as the SVD of an matrix at every iteration. This dramatically reduces computational times in practice.

The strategy of minimizing instead of was originally proposed in [7] and its empirical advantages have been extensively documented in previous literature, e.g. [2]. However, this approach opens important theoretical and practical questions that have not been addressed by previous work:

• How to choose r? By subsection 2.2 we know that for suitably large values of the minimization of and are equivalent. However, a lower bound for such an cannot be recovered analytically from the functional itself and, so, it is not clear how to choose in practice.

• Global convergence. The function is not jointly convex in the two variables and . This opens the question of whether GD (or other optimization methods) converge to a global minimizer for sufficiently large values of .

Investigating such issues is the main focus of this work.

## 3 Analysis

In this section we study the questions outlined above and provide a meta-algorithm to minimize the function while incrementally searching for a rank for which subsection 2.2 is verified. Our analysis builds upon the following keypoints:

• (subsection 3.1) We characterize all critical points of , namely those points to which iterative optimization methods applied to Eq. (5) (e.g. GD) could in principle converge to.

• (subsection 3.1) We derive an efficient criterion to determine whether a critical point of is a global minimizer (typically an NP hard problem for non-convex functions).

• (subsection 3.2) We show that for any critical point of which is not a global minimizer, it is always possible to constructively find a descent direction for from the point .

• (subsection 3.2) By combining the above results we show that for , every critical point of is either a global minimizer or a so-called strict saddle point, namely a point where the Hessian of the target function has at least a negative direction. We can then appeal to [18] to show that descent methods such as GD avoid strict saddle points and hence convergence to a global minimizer.

The above discussion suggests a natural “meta-algorithm” (which is presented more formally in Sec. 3.3) to address the minimization of via while increasing incrementally:

1. Initialize . Choose and .

2. Starting from , converge to a critical point for .

3. If satisfies our criterion for global optimality (see subsection 3.1) stop, otherwise:

4. Perform a step in a descent direction for from to a point , ; Increase to and go back to Step .

From our analysis in the following, the procedure above is guaranteed to stop at most after iterations. However, subsection 2.2, together with our criterion for global optimality, suggests that this meta-algorithm could stop much earlier if admits a low-rank minimizer (which is indeed the case in our experiments). This has two main advantages: ) by exploring candidate ranks incrementally, we can expect significantly faster computations and convergence if our optimality criterion activates for and 2) we automatically recover the rank of a minimizer for without the need to perform expensive operations such as SVD.

{restatable}

remarkRVidal The meta-algorithm considered in this paper is related to the optimization strategy recently proposed in [19], where the authors study convex problems for which a non-convex “factorized” formulation exists, including the setting considered in this work as a special case. However, by adopting such a general perspective, the resulting optimization strategy is less effective when applied to the specific minimization of . In particular: the optimality criterion derived in [19] is only a sufficient but not necessary condition; the upper bound on is much larger than the one provided in this work, i.e. rather than ; convergence guarantees to a global optimum cannot be provided.

By focusing exclusively on the question of minimizing via its factorized form and, by leveraging on the specific structure of the problem, it is instead possible to provide a further analysis of the behavior of the proposed meta-algorithm.

### 3.1 Critical Points and a Criterion for Global Optimality

Since is a non-convex smooth function, in principle we can expect optimization algorithms based on first or second order methods to converge only to critical points, i.e. points for which . The following result provides a characterization of such critical points and plays a key role in our analysis. {restatable}[Characterization of Critical Points of ]propositionPCharacterization Let be a critical point of . Let and let and be two matrices with orthonormal columns corresponding to the left and right singular vectors of with singular value equal to . Then, there exists , such that and . This result is instrumental in deriving a necessary and sufficient condition to determine whether a stationary point for is actually a global minimizer. Indeed, since is convex, we can leverage on the first order condition for global optimality, stating that the matrix is a global minimizer for (and by subsection 2.2 also for ) if and only if the zero matrix belongs to its subdifferential (see e.g. [8]). Studying this inclusion leads to the following theorem. {restatable}[A Criterion for Global Optimality]theoremTCriterion Let be a critical point of . Then is a minimizer for if and only if . This result provides a natural strategy to determine whether a descent method minimizing has converged to a global minimizer, that is we evaluate the operator norm of the gradient, denoted , and then check whether it is larger than . For large matrices this operation can be performed efficiently by using approximation methods, e.g. power iteration [20]. Note that in general it is an NP-hard problem to determine whether a critical point of a non-convex function is actually a global minimizer [21]; it is only because of the relation with the convex function that in this case it is possible to perform such check in polynomial time.

### 3.2 Escape Directions and Global Convergence

In this section, we observe that for any critical point of which is not a global minimizer, it is always possible to either find a direction to escape from it or alternatively to increase by one and to find a decreasing direction for . This strategy is suggested by the following result. {restatable}[Escape Direction from Critical Points]propositionPEscape With the same notation of subsection 3.1, assume and . Then, is a so-called strict saddle point for , namely the Hessian of at  has at least one negative eigenvalue. In particular, there exists such that , and if and are the left and right singular vectors of  with singular value equal to , then decreases locally at along the direction . A direct consequence of the result above is that an optimization strategy can remain trapped only at global minimizers of or at critical points for which and have full rank (since we can always escape from rank deficient ones). If the latter happens, subsection 3.2 suggests the strategy adopted in this work, namely to increase the problem dimension to and consider the “inflated” point . Indeed, at such point, attains the same value as and it is straightforward to verify that is still a critical point for . Since matrices and have now rank , we can apply subsection 3.2 to find a direction along which decreases. This procedure will stop for since and we can therefore apply subsection 3.2 to always escape from critical points until we reach a global minimizer (this fact can be actually improved to hold also for as we see in the following).

Note however that in general, if the number of non-global critical points is infinite, it is not guaranteed that such strategy will converge to the global minimizer. However, since by subsection 3.2 every such critical point is a strict saddle point, we can leverage on previous results from the non-convex optimization literature (see [18] and references therein) in order to prove the following result. {restatable}[Convergence to Global Minimizers of ]theoremTConvergence Let . Then the set of starting points for which GD does not converge to a global minimizer of has measure zero. In particular, subsection 3.2 suggests to initialize the optimization method used to minimize by applying a small perturbation to the initial point via additive noise according to a distribution that is absolutely continuous with respect to the Lebesgue measure of (e.g. a Gaussian). This perturbation guarantees such initial point to not be in the set of points for which GD converges to strict saddle point and therefore that the meta-algorithm considered in this work converges to a global minimizer. We make this statement more precise in the next section.

### 3.3 A Meta-algorithm to Minimize fλ

We can now formally present the meta-algorithm outlined at the beginning of Sec. 3 to find a solution of the trace norm regularization problem (1) by minimizing in Eq. (5) for increasing values of .

Algorithm 1 proceeds by iteratively applying the descent method OptimizationAlgorithm (e.g. GD) to minimize while increasing the estimated rank one step at the time. Whenever the optimization algorithm converges to a critical point of (within a certain tolerance ), Algorithm 1 verifies whether the global optimality criterion has been activated (again within a certain tolerance ). If this is the case, is a global minimizer and we stop the algorithm. Otherwise we “inflate” the two factor matrices by one column each and repeat the procedure. The new column is initialized randomly, since by subsection 3.2 we know that we will not converge again to because it is not full rank. A more refined approach would be to choose and to be the singular vectors of associated to the highest singular value. For the sake of brevity we provide an example of such strategy in the Appendix. However note that we still need to apply a random perturbation to the step in the descent direction in order to invoke subsection 3.2 and be guaranteed to not converge to strict saddle points. As a direct corollary to subsection 3.2 we have

###### Corollary 1.

Algorithm 1 with GD as OptimizationAlgorithm converges to a point such that is a global minimizer for with probability .

### 3.4 Convergence Rates

In this section, we touch upon the question of convergence rates for optimization schemes applied to problem (5). For simplicity, we focus on GD, but our result generalizes to other first order methods. We provide upper bounds to the number of iterations required to guarantee that GD iterates are close to a critical point of the target function. By standard convex analysis results (see  [22]) it is known that GD applied to a differentiable convex function is guaranteed to have sublinear convergence of comparable to that of PFB. However, since is non-convex, here we need to rely on more recent results that investigated the application of first order methods to functions satisfying the so-called Kurdyka-Lojasiewicz (KL) inequality [23, 24].

###### Definition 2 (Kurdyka-Lojasiewicz Inequality).

A differentiable function is said to satisfy the Kurdyka-Lojasiewicz inequality at a critical point if there exists a neighborhood and constants and such that, for all ,

 γ|g(x)−g(x∗)|α<∥∇g(x)∥F. (8)

The KL inequality is a measure of how large is the gradient of the function in the neighborhood of a critical point. This allows us to derive convergence rates for GD methods that depend on the constant . In particular, as a corollary to [18] (see also [23]) we obtain the following result.

###### Corollary 3 (Convergence Rate of Gradient Descent).

Let a sequence produced by GD method applied to . If satisfies the KL inequality for some , then there exists a critical point of and constants , such that

 ∥(Ak,Bk)−(A∗,B∗)∥2F≤{Cbkif α∈(0,1/2],Ck−1−α2α−1if α∈(1/2,1). (9)

Furthermore, if convergence is achieved in a finite number of steps.

This result shows that depending on the constant appearing in the KL inequality, we can expect different convergence rates for GD applied to problem (5). Although, it is a challenging task to identify such constant or even provide an upper bound in specific instances, the class of functions satisfying the KL inequality is extremely large and includes both analytic and semi-algebraic functions, see e.g. [25]. In the Appendix, we argue that if a function is a semi-algebraic, then also such that is semi-algebraic. Therefore, in order to apply Cor. 3 it is sufficient that the error and the regularizer are semi-algebraic, a property which is verified by many commonly used error functions (or the associated loss functions, e.g. square, logistic) and regularizers (in particular the squared Frobenius norm).

## 4 Experiments

We report an empirical analysis of the meta-algorithm studied in this work. All experiments were conducted on an Intel Xeon E5-2697 V3 2.60Ghz CPU with 32GB RAM. We consider a matrix completion setting, with the loss , where is the matrix that we aim to recover and a binary matrix masking entries not available at training time. Below we briefly described the datasets used.

Synthetic. We generated a matrix as a rank matrix product of two matrices plus additive noise ; the entries for and were independently sampled from a standard normal distribution.

Movielens. This dataset [26] consists of different datasets for movie recommendation of increasing size. They all comprise a number of ratings (from 1 to 5) given by users on a database of movies, which are recorded as a matrix with missing entries. In this work we considered three such datasets of increasing size, namely Movielens () with thousand ratings from users on movies, Movielens () with million ratings, users and movies and Movielens (), with millions ratings, users and movies.

### 4.1 The Criterion for Global Optimality and the Estimated Rank

The result in subsection 3.1 provides a necessary and sufficient criterion to determine whether Algorithm 1 has achieved a global minimum for . A natural question is to ask whether, in practice, such criterion will be satisfied for a much smaller rank than the one at which we are guaranteed convergence, namely . To address this question we compared the solution achieved by our approach with the one obtained by iterative soft thresholding (ISTA) (or proximal forward backward, see e.g. [8]) on both synthetic and real datasets. Fig. 1 reports the value of for which our meta-algorithm satisfied the criterion for global optimality and compares it with the rank of the ISTA solution for different values of . For the Synthetic dataset (Fig. 1 Left ) we considered only of the generated matrix entries for the optimization of . For the real dataset (Fig. 1 Right ) we considered and sampled of each user’s available ratings. For both synthetic and real experiments our meta-algorithm recovers the same rank than as that found by ISTA. However, our algorithm reaches such rank incrementally, exploiting the low rank factorization. As we will see in the next section this results in a much faster convergence speed in practice.

### 4.2 Large Scale Matrix Completion

We compared the performance of our meta-algorithm with two state of the art methods, Active Newton (ALT) [12] and Alternating Least Squares Soft Impute (ASL-SI) [9] on the three Movielens datasets. We used and of each user’s ratings for training, validation and testing and repeated our experiments across separate trials to account for statistical variability in the sampling. Test error was measured in terms of the Normalized Mean Absolute Error (NMAE), namely the mean (entry-wise) absolute error on the test set, normalized by the maximum discrepancy between entires in . As a reference of the behavior of the different methods, Fig. 2, reports on a single trial the decrease of on training data and NMAE on the test set with respect to time for the best chosen by validation. All methods where run until convergence and attained the same value of the objective function and same test error in all our trials. However, as it can be noticed, our meta-algorithm and ALS-SI seem to attain a much lower test error during earlier iterations. To better investigate this aspect, Table. 1 reports results in terms of time, test error and estimated rank attained on average across the trials by the different methods at the iteration with smallest validation error. As it can be noticed our meta-algorithm is generally on par with its competitors in terms of test error while being relatively faster and recovering low-rank solutions. This highlights an interesting byproduct of the meta-algorithm considered in this work, namely that by exploring candidate ranks incrementally, the method allows to find potentially better factorizations than trace norm regularization both in terms of test error and estimated rank. This fact can be empirically observed also for different values of as we report in the Appendix.

## 5 Conclusions

We studied the convergence properties of low rank factorization methods for trace norm regularization. Key to our study is a necessary and sufficient condition for global optimality, which can be applied to any critical points of the non-convex problem. This condition together with a detailed analysis of the critical points lead us to propose a meta-algorithm for trace norm regularization, that incrementally expands the number of factors used by the non-convex solver. Although algorithms of this kind have been studied empirically for years, our analysis provides a fresh look and novel insights which can be used to confirm whether a global solution has been reached. Numerical experiments indicated that our optimality condition is useful in practice and the meta-algorithm is competitive with state-of-the art solvers. In the future it would be valuable to study improvements to our analysis, which would allow from one hand to derive precise rate of convergence for specific solvers used within the meta-algorithm and from another hand to study additional conditions under which our global optimality is guaranteed to activate immediately after the number of factors exceed the rank of the trace norm regularization minimizer.

## 6 Acknowledgments

We kindly thank Andreas Argyriou, Kevin Lai and Saverio Salzo for their helpful comments, corrections and suggestions. This work was supported in part by EPSRC grant EP/P009069/1.

## References

• [1] J. D. M. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In Proceedings of the 22nd international conference on Machine learning, pages 713–719, 2005.
• [2] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8), 2009.
• [3] A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008.
• [4] Z. Harchaoui, M. Douze, M. Paulin, M. Dudik, and J. Malick. Large-scale image classification with trace-norm regularization. In Proc. 2012 IEEE Conf. on Computer Vision and Pattern Recognition, pages 3386–3393, 2012.
• [5] Y. Amit, M. Fink, N. Srebro, and S. Ullman. Uncovering shared structures in multiclass classification. In Proceedings of the 24th International Conference on Machine Learning, 2007.
• [6] F. Bach. Consistency of trace norm minimization. Journal of Machine Learning Research, Vol. 8:1019–1048, 2008.
• [7] N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. Advances in Neural Information Processing Systems 17, 2005.
• [8] H. H. Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Canadian Mathematical Society, 2010.
• [9] T. Hastie, R. Mazumder, J. D. Lee, and R. Zadeh. Matrix completion and low-rank svd via fast alternating least squares. J. Mach. Learn. Res, 16(1):3367–3402, 2015.
• [10] R. Ge, J. D. Lee, and T. Ma. Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems 29, pages 2973–2981. 2016.
• [11] S. Bhojanapalli, B. Neyshabur, and N. Srebro. Global optimality of local search for low rank matrix recovery. In Advances in Neural Information Processing Systems, pages 3873–3881, 2016.
• [12] C.-J. Hsieh and P. Olsen. Nuclear norm minimization via active subspace selection. In International Conference on Machine Learning, pages 575–583, 2014.
• [13] J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. A new approach to collaborative filtering. Journal of Machine Learning Research, 10:803–826, 2009.
• [14] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sciences, 2(1):183–202, 2009.
• [15] M. Dudik, Z. Harchaoui, and J. Malick. Lifted coordinate descent for learning with trace-norm regularization. In Artificial Intelligence and Statistics, pages 327–336, 2012.
• [16] G. J. O. Jameson. Summing and Nuclear Norms in Banach Space Theory. Cambridge University Press, 1987.
• [17] D. P. Bertsekas. Nonlinear programming. Athena scientific Belmont, 1999.
• [18] J. D. Lee, M. Simchowitz, M. I. Jordan, and B. Recht. Gradient descent only converges to minimizers. In 29th Annual Conference on Learning Theory, pages 1246–1257, 2016.
• [19] B. D. Haeffele and R. Vidal. Global optimality in tensor factorization, deep learning, and beyond. arXiv preprint arXiv:1506.07540, 2015.
• [20] D. P Woodruff et al. Sketching as a tool for numerical linear algebra. Foundations and Trends® in Theoretical Computer Science, 10(1–2):1–157, 2014.
• [21] K. G. Murty and S. N. Kabadi. Some np-complete problems in quadratic and nonlinear programming. Mathematical programming, 39(2):117–129, 1987.
• [22] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
• [23] H. Attouch, J. Bolte, P. Redont, and A. Soubeyran. Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the kurdyka-łojasiewicz inequality. Mathematics of Operations Research, 35(2):438–457, 2010.
• [24] J. Bolte, A. Daniilidis, O. Ley, and L. Mazet. Characterizations of łojasiewicz inequalities: subgradient flows, talweg, convexity. Transactions of the American Mathematical Society, 362(6):3319–3363, 2010.
• [25] H. Attouch and J. Bolte. On the convergence of the proximal algorithm for nonsmooth functions involving analytic features. Mathematical Programming, 116(1):5–16, 2009.
• [26] F. M. Harper and J. A. Konstan. The movielens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):19, 2016.
• [27] A. S. Lewis. The convex analysis of unitarily invariant matrix functions. Journal of Convex Analysis, 2:173–183, 1995.

## Appendix

Here we collect some auxiliary results and we provide proofs of the results stated in the main body of the paper.

## Appendix A Auxiliary Results

The first lemma establishes the variational form for the trace norm; its proof can be found in [16].

###### Lemma A.4 (Variational Form of the Trace Norm).

For every and let . Let and let be the singular values of . Then

 ∥W∥∗=k∑i=1σi(W)=12pinf{ ∥A∥2F+∥B∥2F ∣∣ (A,B)∈Fr(W), r∈N}.

Furthermore if is a singular value decomposition (SVD) for , with , the infimum is attained for , , and .

Recall that if is proper convex function, its sub-differential at is the set

 ∂ϕ(x)={u:ϕ(x)+⟨u,y−x⟩≤ϕ(y), %for all y∈domain(ϕ)}.

The elements of are called the sub-gradients of at .

Let be the set of orthogonal matrices. A norm is called orthogonally invariant if, for every , and we have that or, equivalently , where is a symmetric gauge function (SGF), that is is a norm invariant to permutations and sign changes. An important example of orthogonally invariant norms are the -Schatten norms, , where is the -norm of a vector. In particular, for we have the trace, Frobenius, and spectral norms, respectively.

The following result is due to [27, Cor. 2.5].

###### Lemma A.5.

If is an orthogonally invariant and is the associated SGF, then for every , it holds that

 ∂∥W∥={Udiag(μ)V⊤ : U∈On, V∈Om, μ∈∂g(σ), W=Udiag(σ)V⊤}.

## Appendix B Proofs

For convenience of the reader, we restate the results presented in the main body of the paper.

\PEquivalence

*

###### Proof.

Let be a minimizer for of rank and let be a singular value decomposition of with and with orthonormal columns and diagonal with positive diagonal entires. Define and . By construction and therefore

 fλ(W∗)=fλ(A∗B⊤∗)=gλ,r∗(A∗,B∗).

Now, we prove that is a minimizer for . Suppose by contraddiction that there exist a couple such that . Define

 (¯A1,¯B1)=argminAB⊤=A1B⊤1∥A∥2F+∥B∥2F.

Then by A.4 we have

 ∥¯A1¯B⊤1∥∗=12(∥¯A1∥2F+∥¯B1∥2F)≤12(∥A1∥2F+∥B1∥2F)

and therefore

 fλ(¯A1¯B⊤1)=gλ,r∗(¯A1,¯B1)≤gλ,r∗(A1,B1)

which is clearly not possible since was a global minimizer for . ∎

\PCharacterization

*

###### Proof.

Let , be the matrices that correspond to a critical point of . We let

 ∇ℓ(A∗B∗⊤)=λU1V⊤1+U2Σ2V⊤2+U3Σ3V⊤3 (10)

be the breakdown SVD of the gradient of at , where is the diagonal matrix formed by the singular values strictly larger than and is the diagonal matrix formed by the singular values strictly smaller than (including those which are zero). For each we denote with the number of columns of and . Recall that the matrices and are both orthogonal.

Taking the derivatives of Eq. (5) w.r.t. and and setting them to zero gives the following optimality conditions for the critical points

 ∇ℓ(A∗B∗⊤)B∗+λA∗ =0 (11) ∇ℓ(A∗B∗⊤)⊤A∗+λB∗ =0. (12)

Solving Eq. (12) for and and replacing it in Eq. (11) yields that

 (−1λ2∇ℓ(A∗B∗⊤)∇ℓ(A∗B∗⊤)⊤+Im)A∗=0. (13)

By Eq. (10) . Using this in Eq. (13) and rearranging gives

 (Im−U1U⊤1−U2Σ22λ2U⊤2−U3Σ23λ2U⊤3)A∗=0

which we rewrite as

 (U2(I−Σ22λ2)U⊤2+U3(I−Σ23λ2)U⊤3)A∗=0.

Therefore, the columns of must be in the range of (i.e. orthogonal to and , namely

 A∗=U1C (14)

for some . Similarly we derive that

 B∗=V1D (15)

for some . Combining Eq. (10) with Eq. (15) we obtain that

 ∇ℓ(A∗B∗⊤)B∗=−λU1C.

Using this equation, Eq. (14) and Eq. (15) we rewrite Eq. (11) as . This implies that and, so, . ∎

\TCriterion

*

###### Proof.

Let . We need to show that or, equivalently

 −1λ∇ℓ(W∗)∈∂∥W∗∥∗. (16)

Let . As a special case of Lemma A.5 we have that holds true if and only if there exists a simultaneous singular value decomposition of the form , and , with denoting the spectrum of (namely the vector of singular values of arranged in non-increasing order). Recall that

 ∂∥σ∥1={z∈Rm:zi=1 if σi≠0, and zi∈[−1,1] otherwise}.

Using the same notation of subsection 3.1, consider the SVD

 Z=−U1V⊤1−U2Σ2λV⊤2−U3Σ3λV⊤3.

By subsection 3.1 we have that and , with . Now, let and let be the SVD of , with and matrices with orthonormal columns and diagonal with positive diagonal elements. Denote the orthonormal matrix obtained by completing with a matrix with orthonormal columns such that . Moreover denote as

Then,

 W∗=−U1CC⊤V1=(−U1P)Γ2(V1P)⊤=(−U1˜P)˜Γ2(V1˜P)⊤=(−˜U1)˜Γ2(˜V1)

with and . Note that since is orthonormal, . Moreover have columns orthogonal to and have columns orthogonal to . Consequently,

 Z=−U1V⊤1−U2Σ2λV⊤2−U3Σ3λV⊤3=−~U1~V⊤1−U2Σ2λV2−U3Σ3λV⊤3

is an alternative singular value decomposition for . Therefore, and have a simultaneous singular value decomposition and we can conclude that if and only if , namely as desired. ∎

\PEscape

*

###### Proof.

By Theorem 3.1, and , where and are the matrices of left and right singular vectors of and , for . Taking the SVD of , we rewrite

 A∗=U1PΓQ⊤,B∗=−V1PΓQ⊤. (17)

Since and are rank deficient and they have the same null space, we can choose such that . Let and the a left and right singular vector of with singular value equal to .

We consider a perturbation of the objective function in the direction