Learning of Generalized Low-Rank Models: A Greedy Approach

Abstract

Learning of low-rank matrices is fundamental to many machine learning applications. A state-of-the-art algorithm is the rank-one matrix pursuit (R1MP). However, it can only be used in matrix completion problems with the square loss. In this paper, we develop a more flexible greedy algorithm for generalized low-rank models whose optimization objective can be smooth or nonsmooth, general convex or strongly convex. The proposed algorithm has low per-iteration time complexity and fast convergence rate. Experimental results show that it is much faster than the state-of-the-art, with comparable or even better prediction performance.

1 Introduction

In many machine learning problems, the data samples can be naturally represented as low-rank matrices. For example, in recommender systems, the ratings matrix is low-rank as users (and items) tend to form groups. The prediction of unknown ratings is then a low-rank matrix completion problem [5]. In social network analysis, the network can be represented by a matrix with entries representing similarities between node pairs. Unknown links are treated as missing values and predicted as in matrix completion [7]. Low-rank matrix learning also has applications in image and video processing [6], multitask learning [1], multilabel learning [23], and robust matrix factorization [8].

The low-rank matrix optimization problem is NP-hard [18], and direct minimization is difficult. To alleviate this problem, one common approach is to factorize the target matrix X as a product of two low-rank matrices, i.e., X = UV^⊤, where U ∈ R^{m×k} and V ∈ R^{n×k} with k ≪ min(m, n). Gradient descent and alternating minimization are often used for optimization [22]. However, as the objective is not jointly convex in U and V, this approach can suffer from slow convergence [11].

Another approach is to replace the matrix rank by the nuclear norm (i.e., sum of its singular values). It is known that the nuclear norm is the tightest convex envelope of the matrix rank [5]. The resulting optimization problem is convex, and popular convex optimization solvers such as the proximal gradient algorithm [3] and Frank-Wolfe algorithm [12] can be used. However, though convergence properties can be guaranteed, singular value decomposition (SVD) is required in each iteration to generate the next iterate. This can be prohibitively expensive when the target matrix is large. Moreover, nuclear norm regularization often leads to biased estimation. Compared to factorization approaches, the obtained rank can be much higher and the prediction performance is inferior [15].
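To make the per-iteration SVD cost concrete, the following is a minimal sketch (ours, not from the paper) of the nuclear-norm proximal step, i.e., singular value soft-thresholding, which proximal-gradient solvers perform in every iteration; the full SVD is what becomes prohibitive for large matrices.

```python
# Minimal sketch (ours): the proximal step of the nuclear norm is singular
# value soft-thresholding, which requires a full SVD of the iterate.
import numpy as np


def svt(Z, tau):
    """Return argmin_X 0.5*||X - Z||_F^2 + tau*||X||_* (soft-threshold the singular values)."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt
```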

Recently, greedy algorithms have been explored for low-rank optimization [20]. The idea is similar to orthogonal matching pursuit (OMP) [17] in sparse coding. For example, the state-of-the-art in matrix completion is the rank-one matrix pursuit (R1MP) algorithm [25]. In each iteration, it performs an efficient rank-one SVD on a sparse matrix, and then greedily adds a rank-one matrix to the matrix estimate. Unlike other algorithms, which typically require many iterations, it takes only r iterations to obtain a rank-r solution. Its prediction performance is also comparable to or even better than that of other methods.

However, R1MP is designed only for matrix completion with the square loss. As recently discussed in [24], different loss functions may be required in different learning scenarios. For example, in link prediction, the presence or absence of a link is naturally represented by a binary variable, and the logistic loss is thus more appropriate. In robust matrix learning applications, the ℓ1 loss or Huber loss can be used to reduce sensitivity to outliers [6]. While computationally R1MP can be used with these loss functions, its convergence analysis is tightly based on OMP (and thus the square loss), and cannot be easily extended.

This motivates us to develop more general greedy algorithms that can be used in a wider range of low-rank matrix learning scenarios. In particular, we consider low-rank matrix optimization problems of the form

min_X F(X)  s.t.  rank(X) ≤ r,    (1)

where r is the target rank, and the objective F can be smooth or nonsmooth, (general) convex or strongly convex. The proposed algorithm is an extension of R1MP, and reduces to R1MP when F is the square loss. In general, when F is convex and Lipschitz-smooth, convergence is guaranteed with a rate of O(1/T). When F is strongly convex, this is improved to a linear rate. When F is nonsmooth, we obtain a rate of O(1/√T) for (general) convex objectives and O(log T / T) for strongly convex objectives. Experiments on large-scale data sets demonstrate that the proposed algorithms are much faster than the state-of-the-art, while achieving comparable or even better prediction performance.

Notation: The transpose of a vector or matrix is denoted by the superscript ⊤. For a matrix A (without loss of generality, we assume m ≥ n), its Frobenius norm is ‖A‖_F, its ℓ1 norm is ‖A‖_1 = Σ_{ij} |A_ij|, and its nuclear norm is ‖A‖_* = Σ_i σ_i(A), where the σ_i(A)'s are the singular values and σ_1(A) is the largest singular value. For two vectors x and y, the inner product is ⟨x, y⟩ = Σ_i x_i y_i; for two matrices A and B, ⟨A, B⟩ = Σ_{ij} A_ij B_ij. For a smooth function f, ∇f denotes its gradient; when f is convex but nonsmooth, ∂f(x) denotes its subdifferential at x. Moreover, given a 0/1 matrix Ω, [P_Ω(A)]_ij = A_ij if Ω_ij = 1, and 0 otherwise.

2 Review: Rank-One Matrix Pursuit

The rank-one matrix pursuit (R1MP) algorithm [25] is designed for matrix completion [5]. Given a partially observed matrix O, the indices of the observed entries are contained in a 0/1 matrix Ω, where Ω_ij = 1 if O_ij is observed and 0 otherwise. The goal is to find a low-rank matrix that is most similar to O at the observed entries. Mathematically, this is formulated as the following optimization problem:

min_X ‖P_Ω(X − O)‖_F²  s.t.  rank(X) ≤ r,    (2)

where r is the target rank. Note that the square loss has to be used in R1MP.

The key observation is that if X has rank r, it can be written as the sum of r rank-one matrices, i.e., X = Σ_{i=1}^{r} θ_i u_i v_i^⊤, where u_i ∈ R^m and v_i ∈ R^n. To solve (Equation 2), R1MP starts with an empty estimate. At the t-th iteration, the rank-one matrix u_t v_t^⊤ that is most correlated with the current residual R_t = P_Ω(O − X_{t−1}) is greedily added. It can be easily shown that u_t and v_t are the leading left and right singular vectors of R_t, and can be efficiently obtained from its rank-one SVD. After adding this new basis matrix, all coefficients of the current basis can be updated as

(θ_1, …, θ_t) = argmin_θ ‖P_Ω(Σ_{i=1}^{t} θ_i u_i v_i^⊤ − O)‖_F².    (3)

Because of the use of the square loss, this is a simple least-squares regression problem with a closed-form solution.

To save computation, R1MP also has an economic variant. This updates only two combination coefficients, one for the current estimate and one for the new rank-one update matrix:

(α_t, β_t) = argmin_{α, β} ‖P_Ω(α X_{t−1} + β u_t v_t^⊤ − O)‖_F².    (4)
The whole procedure is shown in Algorithm ?.

Note that each R1MP iteration is computationally inexpensive. Moreover, as the rank of the estimate is increased by one in each iteration, only r iterations are needed to obtain a rank-r solution. It can also be shown that the residual's norm decreases at a linear rate, i.e., ‖R_t‖_F ≤ γ^t ‖R_0‖_F for some γ ∈ (0, 1).
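For concreteness, below is a hedged Python/NumPy sketch of the R1MP loop described above (the paper's experiments use Matlab; variable and function names here are ours). It keeps the rank-one bases evaluated at the observed entries, performs a rank-one SVD of the sparse residual, and refits all coefficients by least squares.

```python
# Illustrative sketch of R1MP (not the authors' implementation). The observed
# entries of O are given by (rows, cols, vals); we assume rank >= 1.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds


def r1mp(shape, rows, cols, vals, rank):
    """Rank-one matrix pursuit: returns rank-one bases (u_i, v_i) and weights theta."""
    m, n = shape
    U, V = [], []                       # rank-one bases u_i v_i^T
    pred = np.zeros_like(vals)          # current estimate at the observed entries
    for _ in range(rank):
        # residual on the observed entries only (a sparse matrix)
        R = csr_matrix((vals - pred, (rows, cols)), shape=(m, n))
        u, s, vt = svds(R, k=1)         # rank-one SVD of the sparse residual
        U.append(u[:, 0]); V.append(vt[0])
        # closed-form least-squares refit of all combination coefficients:
        # basis i contributes u_i[rows] * v_i[cols] at the observed entries
        B = np.stack([u_i[rows] * v_i[cols] for u_i, v_i in zip(U, V)], axis=1)
        theta, *_ = np.linalg.lstsq(B, vals, rcond=None)
        pred = B @ theta
    return U, V, theta
```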

3 Low-Rank Matrix Learning with Smooth Objectives

Though R1MP is scalable, it can only be used for matrix completion with the square loss. In this section, we extend R1MP to problems with more general, smooth objectives. Specifically, we only assume that the objective F is convex and Lipschitz-smooth. This will be further extended to nonsmooth objectives in Section 4.

3.1 Proposed Algorithm

Let the matrix iterate at the t-th iteration be X_t. We follow the gradient direction of the objective F, and find the rank-one matrix that is most correlated with ∇F(X_t):

max_{‖u‖ = ‖v‖ = 1} |⟨u v^⊤, ∇F(X_t)⟩|.    (5)

Similar to [25], its optimal solution is given by u_t v_t^⊤, where u_t and v_t are the leading left and right singular vectors of ∇F(X_t). We then initialize the coefficient of this new rank-one update matrix using σ_t, the singular value corresponding to (u_t, v_t). Optionally, all the coefficients can be refined as

(θ_1, …, θ_t) = argmin_θ F(Σ_{i=1}^{t} θ_i u_i v_i^⊤).    (6)

As in R1MP, an economic variant is to update only the coefficients of the current estimate and the new rank-one matrix, i.e., set (θ_1, …, θ_t) to (α_t θ_1, …, α_t θ_{t−1}, β_t), where α_t and β_t are obtained as

(α_t, β_t) = argmin_{α, β} F(α X_{t−1} + β u_t v_t^⊤).    (7)
The whole procedure, which will be called “greedy low-rank learning” (GLRL), is shown in Algorithm ?. Its economic variant will be called EGLRL. Obviously, on matrix completion problems with the square loss, Algorithm ? reduces to R1MP.

Note that (Equation 6) and (Equation 7) are smooth minimization problems (with t and 2 variables, respectively). As the target matrix is low-rank, r should be small and thus (Equation 6) and (Equation 7) can be solved inexpensively. In the experiments, we use the popular limited-memory BFGS (L-BFGS) solver [16]. Empirically, fewer than five L-BFGS iterations are needed; preliminary experiments show that using more iterations does not improve performance.

Unlike R1MP, the coefficient refinement in step 5 is optional. The convergence results in Theorems ? and ? below still hold even when this step is not performed. However, as will be illustrated in Section 5.1, coefficient refinement is always beneficial in practice: it results in a larger reduction of the objective in each iteration, and thus a better rank-r model after r iterations.
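As an illustration of the gradient computation in step 3 and the optional refinement in step 5, here is a hedged Python sketch of one GLRL iteration for a generic smooth loss (our own code, not the authors' Matlab implementation): the loss and its entrywise derivative at the observed entries are supplied as functions, the new basis comes from a rank-one SVD of the sparse gradient, and all coefficients are refined with a few L-BFGS iterations.

```python
# Minimal sketch of one GLRL iteration for a smooth loss; names are ours.
# `loss(pred, vals)` and `grad(pred, vals)` act entrywise on observed entries.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds
from scipy.optimize import minimize


def glrl_step(loss, grad, rows, cols, vals, shape, U, V, theta):
    m, n = shape
    # basis evaluated at the observed entries: column i is u_i[rows] * v_i[cols]
    B = np.stack([u[rows] * v[cols] for u, v in zip(U, V)], axis=1) \
        if U else np.zeros((len(vals), 0))
    g = grad(B @ theta, vals)                      # entrywise gradient of the loss
    G = csr_matrix((g, (rows, cols)), shape=(m, n))
    u, s, vt = svds(G, k=1)                        # rank-one SVD of the sparse gradient
    U, V = U + [u[:, 0]], V + [vt[0]]              # new rank-one basis (sign handled by refit)
    B = np.column_stack([B, u[rows, 0] * vt[0][cols]])
    # optional refinement of all coefficients by a few L-BFGS iterations
    obj = lambda t: (loss(B @ t, vals).sum(), B.T @ grad(B @ t, vals))
    res = minimize(obj, np.append(theta, 0.0), jac=True, method="L-BFGS-B",
                   options={"maxiter": 5})
    return U, V, res.x
```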

3.2 Convergence

The analysis of R1MP is based on that of orthogonal matching pursuit [17]. This relies on the gradient coinciding with the (negative) residual, which holds only when F is the square loss. In contrast, our analysis of Algorithm ? is novel and applies to any Lipschitz-smooth F.

The following Proposition shows that the objective is decreasing in each iteration. Due to space limitations, all proofs are deferred to the end of the paper.

If F is strongly convex, a linear convergence rate can be obtained.

If F is only (general) convex, the following shows that Algorithm ? converges at a slower (sublinear) rate.

The square loss in (Equation 2) is only (general) convex and Lipschitz-smooth. From Theorem ?, one would expect GLRL to have only a sublinear O(1/T) convergence rate on matrix completion problems. However, our analysis can be refined in this special case. The following Theorem shows that a linear rate can indeed be obtained, which also agrees with Theorem 3.1 of [25].

3.3 Per-Iteration Time Complexity

The per-iteration time complexity of Algorithm ? is low. Here, we take the link prediction problem in Section 5.1 as an example. With the objective defined only on the observed entries of the link matrix, ∇F(X_t) is sparse, and its computation in step 3 takes O(‖Ω‖_1) time, where ‖Ω‖_1 is the number of observed entries. The rank-one SVD of ∇F(X_t) can be obtained by the power method [10] in O(‖Ω‖_1) time. Refining the coefficients using (Equation 6) takes O(t‖Ω‖_1) time at the t-th iteration. Thus, the total per-iteration time complexity of GLRL is O(t‖Ω‖_1), while that of EGLRL is O(‖Ω‖_1). In comparison, the state-of-the-art AIS-Impute algorithm [27] (with a convergence rate of O(1/T²)) takes O(r‖Ω‖_1) time in each iteration, whereas the alternating minimization approach in [7] (whose convergence rate is unknown) also takes O(r‖Ω‖_1) time per iteration.
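The rank-one SVD mentioned above can be computed with a few power iterations that only need matrix-vector products, so the cost is linear in the number of nonzeros of the sparse gradient. Below is a sketch (ours; the fixed iteration count is an assumption).

```python
# Power-method sketch (ours) for the leading singular pair of a matrix given
# only through matvec(x) = G x and rmatvec(y) = G^T y.
import numpy as np


def rank_one_svd(matvec, rmatvec, n, iters=10):
    """Return an approximate leading (sigma, u, v) of G, where G has n columns."""
    v = np.random.randn(n)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        u = matvec(v); u /= np.linalg.norm(u)
        v = rmatvec(u); v /= np.linalg.norm(v)
    sigma = u @ matvec(v)
    return sigma, u, v


# usage for a scipy sparse matrix G:
#   sigma, u, v = rank_one_svd(lambda x: G @ x, lambda y: G.T @ y, G.shape[1])
```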

3.4 Discussion

To learn the generalized low-rank model, [24] followed the common approach of factorizing the target matrix as a product of two low-rank matrices and then performing alternating minimization [22]. However, this may not be very efficient, and is much slower than R1MP on matrix completion problems [25]. More empirical comparisons will be presented in Section 5.1.

Similar to R1MP, greedy efficient component optimization (GECO) [20] is also based on greedy approximation, and it can be used with any smooth objective. However, GECO is even slower than R1MP [25]. Moreover, it does not have a convergence guarantee.

4 Low-Rank Matrix Learning with Nonsmooth Objectives

Depending on the application, different (convex) nonsmooth loss functions may be used in generalized low-rank matrix models [24]. For example, the ℓ1 loss is useful in robust matrix factorization [6], the scalene loss in quantile regression [13], and the hinge loss in multilabel learning [29]. In this section, we extend the GLRL algorithm, with simple modifications, to nonsmooth objectives.

4.1 Proposed Algorithm

As the objective is nonsmooth, one has to use a subgradient g_t ∈ ∂F(X_t) at the t-th iteration instead of the gradient in Section 3. Moreover, refining the coefficients as in (Equation 6) or (Equation 7) would now involve nonsmooth optimization, which is much harder. Hence, we do not optimize the coefficients. To ensure convergence, a sufficient reduction of the objective in each iteration is still required. To achieve this, instead of adding just a rank-one matrix, we add a rank-k matrix (where k may be greater than 1). This matrix should be most similar to g_t, and can be easily obtained as

Σ_{i=1}^{k} σ_i u_i v_i^⊤,    (8)

where the (u_i, v_i)'s are the leading left and right singular vector pairs of g_t, and the σ_i's are the corresponding singular values. The proposed procedure is shown in Algorithm ?. The stepsize η_t in step 3 is a suitably chosen diminishing stepsize.
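The sketch below illustrates the flavor of one such nonsmooth iteration (our simplification, not the authors' exact update or stepsize rule): a subgradient is approximated by its top-k SVD and a diminishing step is taken; the 1/(μt) schedule shown is a standard choice for strongly convex objectives and is an assumption on our part.

```python
# Hedged sketch of a simplified nonsmooth iteration (ours).
import numpy as np
from scipy.sparse.linalg import svds


def nonsmooth_glrl_step(X, subgrad, t, k=1, mu=1.0):
    """One step: X <- X - eta_t * (rank-k SVD approximation of a subgradient at X)."""
    G = subgrad(X)                      # sparse subgradient of F at X (scipy sparse matrix)
    u, s, vt = svds(G, k=min(k, min(G.shape) - 1))
    eta = 1.0 / (mu * t)                # assumed diminishing stepsize, t = 1, 2, ...
    return X - eta * (u * s) @ vt       # dense here for simplicity; rank grows by at most k
```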

4.2 Convergence

The following Theorem shows that when F is nonsmooth and strongly convex, Algorithm ? has a convergence rate of O(log T / T).

When F is only (general) convex, the following Theorem shows that the rate is reduced to O(1/√T).

For other convex nonsmooth optimization problems, the same O(log T / T) rate for strongly convex objectives and O(1/√T) rate for general convex objectives have also been observed [21]. However, their analysis is for different problems, and cannot be readily applied to our low-rank matrix learning problem here.

4.3 Per-Iteration Time Complexity

To study the per-iteration time complexity, we take the robust matrix factorization problem in Section 5.2 as an example. The main computations are in steps 4 and 6. In step 4, since the subgradient is sparse (nonzero only at the observed entries), computing it takes O(‖Ω‖_1) time. In step 6, the matrix involved is the sum of a sparse matrix and a low-rank matrix, i.e., it admits the so-called "sparse plus low-rank" structure [15]. This allows matrix-vector multiplications, and subsequently the rank-one SVD, to be performed much more efficiently: multiplying this matrix (or its transpose) with a vector costs time linear in ‖Ω‖_1 plus a term linear in (m + n) times the rank, and the rank-one SVD using the power method has the same order of cost. Assuming that s inner iterations are run at (outer) iteration t, the total cost is s times this amount. Typically, s is small (empirically, usually 1 or 2).
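To illustrate the two ingredients above (our code; all names are assumptions): the subgradient of the ℓ1 loss is a sparse sign matrix supported on the observed entries, and matrix-vector products with a "sparse plus low-rank" matrix S + UV^⊤ can be formed without ever building the dense matrix.

```python
# Sketch (ours): sparse l1 subgradient and "sparse plus low-rank" products.
import numpy as np
from scipy.sparse import csr_matrix


def l1_subgradient(pred_obs, vals, rows, cols, shape):
    """Sparse subgradient of sum_{(i,j) observed} |X_ij - O_ij| at the current prediction."""
    return csr_matrix((np.sign(pred_obs - vals), (rows, cols)), shape=shape)


def splr_matvec(S, U, V, x):
    """(S + U V^T) x, costing O(nnz(S) + (m + n) k) instead of O(m n)."""
    return S @ x + U @ (V.T @ x)


def splr_rmatvec(S, U, V, y):
    """(S + U V^T)^T y = S^T y + V (U^T y)."""
    return S.T @ y + V @ (U.T @ y)
```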

In comparison, though the ADMM algorithm in [14] has a faster convergence rate, it needs a full SVD in each iteration, which takes O(mn²) time. As for the Wiberg algorithm [8], its convergence rate is unknown, and a large linear program needs to be solved in each iteration. As will be seen in Section 5.2, this makes it much slower than GLRL.

5 Experiments

In this section, we compare the proposed algorithms with the state-of-the-art on link prediction and robust matrix factorization. Experiments are performed on a PC with an Intel i7 CPU and 32GB RAM. All algorithms are implemented in Matlab.

5.1 Social Network Analysis

Given a graph with n nodes and an incomplete adjacency matrix A ∈ {±1}^{n×n}, link prediction aims to recover a low-rank matrix X such that the signs of X_ij and A_ij agree on most of the observed entries. This can be formulated as the following optimization problem [7]:

min_X Σ_{(i,j)∈Ω} log(1 + exp(−A_ij X_ij))  s.t.  rank(X) ≤ r,    (10)

where Ω contains the indices of the observed entries. Note that (Equation 10) uses the logistic loss, which is more appropriate as the A_ij's are binary.
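For reference, a hedged sketch of the logistic loss in (Equation 10) and its entrywise derivative, which are the two callbacks a GLRL-style iteration (such as the sketch in Section 3.1) needs; the function names are ours.

```python
# Sketch (ours): logistic loss and its entrywise derivative for labels in {-1, +1}.
import numpy as np
from scipy.special import expit


def logistic_loss(pred, labels):
    """Elementwise log(1 + exp(-labels * pred))."""
    return np.logaddexp(0.0, -labels * pred)


def logistic_grad(pred, labels):
    """Elementwise derivative w.r.t. pred: -labels * sigmoid(-labels * pred)."""
    return -labels * expit(-labels * pred)
```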

The objective in (Equation 10) is convex and smooth. Hence, we compare the proposed GLRL (Algorithm ? with coefficient update step (Equation 6)) and its economic variant EGLRL (using coefficient update step (Equation 7)) with the following:

  1. AIS-Impute [27]: This is an accelerated proximal gradient algorithm with further speedup based on approximate SVD and the special “sparse plus low-rank” matrix structure in matrix completion;

  2. Alternating minimization (“AltMin”) [7]: This factorizes the target matrix as a product of two low-rank matrices, and then uses alternating gradient descent for optimization.

As a further baseline, we also compare with the GLRL variant that does not perform coefficient update. We do not compare with greedy efficient component optimization (GECO) [20], matrix norm boosting [30] and active subspace selection [11], as they have been shown to be slower than AIS-Impute and AltMin [27].

Experiments are performed on the Epinions and Slashdot data sets [7] (Table ?). Each row/column of the matrix corresponds to a user (users with fewer than two observations are removed). For Epinions, A_ij = 1 if user i trusts user j, and −1 otherwise. Similarly, for Slashdot, A_ij = 1 if user i tags user j as a friend, and −1 otherwise.

Signed network data sets used.
#rows #columns #observations
Epinions 42,470 40,700
Slashdot 30,670 39,196

As in [25], the number of power-method iterations is fixed to a small constant. Following [7], we use 10-fold cross-validation and fix the rank to 40. Note that AIS-Impute uses the nuclear-norm regularizer and does not explicitly constrain the rank; we select its regularization parameter so that its output rank is 40. To obtain a rank-40 solution, GLRL is simply run for 40 iterations. AIS-Impute and AltMin are stopped when the relative change in the objective falls below a small threshold. The output predictions are binarized by thresholding at zero. As in [7], the sign prediction accuracy is used as the performance measure.

Table 1 shows the sign prediction accuracy on the test set. All methods, except the GLRL variant that does not perform coefficient update, have comparable prediction performance. However, as shown in Figure ?, AltMin and AIS-Impute are much slower (as discussed in Section 3.3). EGLRL has the lowest per-iteration cost, and is also faster than GLRL.

Table 1: Testing sign prediction accuracy (%) on link prediction. The best and comparable results (according to the pairwise t-test at 95% confidence) are highlighted.
                     Epinions    Slashdot
AIS-Impute           93.3±0.1    84.2±0.1
AltMin               93.5±0.1    84.9±0.1
GLRL w/o coef upd    92.4±0.1    82.6±0.3
GLRL                 93.6±0.1    84.1±0.4
EGLRL                93.6±0.1    84.4±0.3


5.2 Robust Matrix Factorization

Instead of using the square loss, robust matrix factorization uses the ℓ1 loss to reduce sensitivity to outliers [6]. This can be formulated as the following optimization problem [14]:

min_X ‖P_Ω(X − O)‖_1  s.t.  rank(X) ≤ r,    (11)

where O is the observed ratings matrix. Note that the objective is only (general) convex, and its subgradients are bounded. Since there is no smooth component in the objective, AIS-Impute and AltMin cannot be used. Instead, we compare GLRL in Algorithm ? with the following:

  1. Alternating direction method of multipliers (ADMM) [14]: The rank constraint is replaced by the nuclear-norm regularizer, and ADMM [4] is then used to solve the resulting problem.

  2. The Wiberg algorithm [8]: It factorizes X into the product of two low-rank matrices and optimizes them by linear programming. Here, we use the linear programming solver in Matlab.

Experiments are performed on the MovieLens data sets (Table ?), which have been commonly used for evaluating recommender systems [25]. They contain ratings assigned by various users to movies. The setup is the same as in [25]: part of the ratings are randomly sampled for training, while the rest are used for testing. The ranks used for the 100K, 1M and 10M data sets are 10, 10 and 20, respectively. For performance evaluation, we use the mean absolute error (MABS) on the unobserved entries Ω̄:

MABS = (1/|Ω̄|) Σ_{(i,j)∈Ω̄} |X̄_ij − O_ij|,

where X̄ is the predicted matrix [8]. Experiments are repeated five times with random training/testing splits.
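A minimal sketch of this evaluation metric (our code; argument names are assumptions):

```python
# Sketch (ours): MABS over the held-out entries (rows[i], cols[i]).
import numpy as np


def mabs(pred, truth_vals, rows, cols):
    """Mean absolute error of the predicted matrix on the test entries."""
    return np.mean(np.abs(pred[rows, cols] - truth_vals))
```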

MovieLens data sets used in the experiments.
#users #movies #ratings
100K 943 1,682
1M 6,040 3,449
10M 69,878 10,677

Results are shown in Table 2. As can be seen, ADMM performs slightly better on the 100K data set, and GLRL is more accurate than Wiberg. However, ADMM is computationally expensive as SVD is required in each iteration; thus, it cannot be run on the larger 1M and 10M data sets. Figure ? shows the convergence of the MABS with CPU time. As can be seen, GLRL is the fastest, followed by Wiberg, while ADMM is the slowest.

Table 2: Testing MABS on the MovieLens data sets. The best results (according to the pairwise t-test at 95% confidence) are highlighted. ADMM does not converge in a reasonable time on the 1M and 10M data sets, and thus is not shown.
          100K          1M            10M
ADMM      0.717±0.004   —             —
Wiberg    0.726±0.001   0.728±0.006   0.715±0.005
GLRL      0.724±0.004   0.694±0.001   0.683±0.001

6 Conclusion

In this paper, we propose an efficient greedy algorithm for learning generalized low-rank models. Our algorithm is based on the state-of-the-art R1MP algorithm, but allows the optimization objective to be smooth or nonsmooth, (general) convex or strongly convex. Convergence analysis shows that the proposed algorithm has fast convergence rates that are comparable with those obtained on other (convex) smooth/nonsmooth optimization problems. Specifically, on smooth problems, it converges at a rate of O(1/T) for general convex objectives and at a linear rate for strongly convex objectives. On nonsmooth problems, it converges at a rate of O(1/√T) for general convex objectives and O(log T / T) for strongly convex objectives. Experimental results on link prediction and robust matrix factorization show that the proposed algorithm achieves comparable or better prediction performance than the state-of-the-art, but is much faster.

Acknowledgments

We thank Lu Hou for helpful discussions. This research was supported in part by the Research Grants Council of the Hong Kong Special Administrative Region (Grant 614513).


Proposition

Consider the t-th iteration. Construct an auxiliary point as

As the objective is Lipschitz-smooth, we have

Together with the above, we have

For the second term on the RHS of (Equation 12),

Note that σ_1 is the largest singular value of the gradient, and thus

Then (Equation 12) becomes

In Algorithm ?, the iterate is obtained by minimizing over (Equation 6) or (Equation 7). Once we warm-start using the previous solution, the above bound is ensured, and thus the Proposition holds.

Theorem

Since the objective is strongly convex,

The minimum is achieved at , and .

At the optimal solution, using Lemma ?:

Combining it with Proposition ?, we have

By induction over the iterations, we then obtain the Theorem.

Theorem

First, we show that the relevant sequence is upper-bounded.

From Proposition ?, we have

Summing (Equation 14) over the iterations, we obtain

Since the objective is lower-bounded, the sequence of objective values must converge and cannot diverge. As a result, there must exist a constant that upper-bounds it.

Now, we start to prove Theorem ?.

From convexity, and since the optimal solution attains the minimum, we have

Next, using the Cauchy-Schwarz inequality, we have

Then, from Proposition ?, there exists a constant such that the corresponding bound holds; combining it with (Equation 15), we have:

Combining the above inequality with Proposition ?,

Therefore, by Lemma B.2 of [19], the above sequence converges at a rate of

which proves the Theorem.

Theorem

When restricted to the observed entries in (Equation 2):

At the optimal solution, using the above inequality, we have

As a result, we get

Since the objective is Lipschitz-smooth, combining this with Proposition ? and (Equation 16), we obtain

By induction over the iterations, we then have

Thus, we obtain the Theorem, and the linear rate follows.

Theorem

The proof follows Theorem 5 of [9].

Rearranging terms, we have:

As the objective is strongly convex,

Summing over the iterations and using (Equation 17),

Recalling the choice of the stepsize, (Equation 19) becomes

where the constant can be chosen appropriately. The second term in (Equation 20) is simply bounded as

For the last term in (Equation 20), we then have

By definition and assumption, for the first iteration,

In Algorithm ?, we ensure

thus (Equation 22) becomes:

Now, using (Equation 21) and (Equation 23), we can bound (Equation 20) as

Thus, we obtain the convergence rate

Finally, note that

We then obtain the Theorem.

Theorem

The proof follows Theorem 6 of [9]. First, we introduce the following Proposition.

Now, we start to prove Theorem ?.

For the (general) convex case, we have

where the constant is chosen appropriately. Substituting the stepsize back into (Equation 25), we get