Fast Generalized Conditional Gradient Method with Applications to Matrix Recovery Problems

Fast Generalized Conditional Gradient Method with Applications to Matrix Recovery Problems

Dan Garber
dangar@technion.ac.il
   Atara Kaplan
ataragold@technion.ac.il
   Shoham Sabach
ssabach@technion.ac.il

Technion - Israel Institute of Technology
Abstract

Motivated by matrix recovery problems such as Robust Principal Component Analysis, we consider a general optimization problem of minimizing a smooth and strongly convex loss applied to the sum of two blocks of variables, where each block of variables is constrained or regularized individually. We present a novel Generalized Conditional Gradient method which is able to leverage the special structure of the problem to obtain faster convergence rates than those attainable via standard methods, under a variety of interesting assumptions. In particular, our method is appealing for matrix problems in which one of the blocks corresponds to a low-rank matrix and avoiding prohibitive full-rank singular value decompositions, which are required by most standard methods, is most desirable.

Importantly, while our initial motivation comes from problems which originated in statistics, our analysis does not impose any statistical assumptions on the data.

1 Introduction

In this paper we consider the following general convex optimization problem

(1)

where E is some finite linear space over the reals, is assumed to be smooth and strongly convex, while and are proper, lower semicontinuous and convex functions which can be thought of either as regularization functions, or indicator functions of certain convex feasible sets and .

Problem (1) captures several important problems of interest, perhaps the most well-studied is that of Robust Principal Component Analysis [3, 11, 8], in which the goal is to (approximately) decompose an input matrix into the sum of a low-rank matrix and a sparse matrix . The underlying optimization problem for Robust PCA can be written as (see for instance [8])

(2)

where denotes the Frobenius norm, denotes the nuclear norm, i.e., the sum of singular values which is a highly popular convex surrogate for low-rank penalty, and is the entry-wise -norm which is a well-known convex surrogate for entry-wise sparsity.

Other variants of interest of Problem (2) are when the data matrix is a corrupted covariance matrix, in which case it is reasonable to further constrain to be positive semidefinite, i.e., use the constraints and . In the case that is assumed to have several fully corrupted rows or columns, a popular alternative to the regularizer on the variable is to use either the norm (sum of norm of rows) in case of corrupted rows, or the norm (sum of norm of columns) in case of corrupted columns, as a regularizer/constraint [12]. Finally, moving beyond Robust PCA, a different choice of interest for the loss could be , where is a linear sensing operator such that is positive definite (so is strongly-convex).

In this paper we present an algorithm and novel analyses that builds on the special structure of Problem (1) which improve upon state-of-the-art complexity bounds, under several assumptions of interest. Importantly, key to all of our results is the ability to exploit the strong convexity of to obtain improved complexity bounds. Here it is important to note that while is assumed to be strongly convex, Problem (1) is in general not strongly convex in . This can already be observed when choosing , and , where . In this case, denoting the overall objective as , it is easily observed that the Hessian matrix of is given by , and hence is not full-rank.

The fastest known convergence rate for first-order methods applicable to Problem (1), is achievable via accelerated gradient methods such as Nesterov’s optimal method [9] and FISTA [2], which converge at a rate of . However, in the context of low-rank matrix optimization problems such as Robust PCA, these methods require to compute a full-rank singular value decomposition on each iteration to update the low-rank component which is often prohibitive for large scale instances. A different type of first-order methods is the Conditional Gradient Method (aka Frank-Wolfe algorithm) and variants of [7, 5, 6], which in the context of low-rank matrix optimization, simply requires to compute an approximated leading singular vector pair of (minus) the gradient direction on each iteration, i.e., a rank-one SVD, and hence are much more scalable, in this case, than projection/proximal based methods. However, the rate of convergence is slower, e.g., if both and are indicator functions for certain convex sets and , then the convergence rate of the conditional gradient method is of the form , where and denote the Euclidean diameter of the corresponding feasible sets.

Recently, two novel variants of the conditional gradient method for low-rank matrix optimization were suggested which enjoy faster convergence rates when the optimal solution has low-rank, which is indeed a key implicit assumption in such problems, while requiring to compute only a single low-rank SVD on each iteration [4, 1]. However, both of these new methods require the objective function to be strongly convex, which as we discussed above, does not hold in our case. Nevertheless, both our algorithm and analysis are inspired by these two works. In particular, we generalize the low-rank SVD approach of [1] to arbitrary regularizer/constraint in order to allow more efficient computations in our algorithm.

In another recent related work [8], which also servers as a motivation for this current work, the authors considered a variant of the conditional gradient method tailored for low-rank and robust matrix recovery problems such as (2), which combines standard conditional gradient updates of the low-rank variable (i.e., rank-one SVD) and proximal updates for the sparse noise component. However, both the worst-case convergence rate and running time do not improve over the standard conditional gradient method.

Finally, it is important to note that while efficient non-convex optimization-based algorithms for Robust PCA with provable guarantees is an active subject of research, see for instance [10, 13], such works are (a) not flexible as the general model (1), which allows for instance to impose a PSD constraint on the low-rank component or to consider various sparsity-promoting regularizers for the sparse component , and (b) all provable guarantees for such algorithms are heavily based on structural statistical assumptions on the input matrix (such as incoherence of the singular value decomposition of the low-rank component and assuming certain patterns of the sparse component), which can be quite limiting in practice. In this work on the other-hand, we are free of such assumptions completely.

To overcome the shortcomings of previous methods applicable to Problem (1), in this paper we present a new first-order method for tackling Problem (1). In particular we show that under several assumptions of interest, despite the fact that the objective in (1) is in general not strongly-convex, it is possible to leverage the strong convexity of towards obtaining better complexity results, while applying update steps that are scalable to large scale problems. Informally speaking, our main improved complexity bounds are as follows:

  1. In the case that both and are indicators of compact and convex sets (as in (2)), we obtain convergence rate of . In particular when is constrained, for example, via a low-rank promoting constraint, such as the nuclear-norm, our method requires on each iteration only a SVD computation of rank=, where is part of some optimal solution . This result improves (in terms of running time), in a wide regime of parameters, mainly when , over the conditional gradient method which converges with rate , and over accelerated gradient methods which require, in the context of low-rank matrix optimization problems, a full-rank SVD computation on each iteration.

  2. In the case that is an indicator of a strongly convex set (e.g., an ball for ), our method achieves a fast convergence rate of . As in the previous case, if is constrained/regularized via the nuclear norm, then our method only requires a SVD computation of rank=. To the best of our knowledge, this is the first result that combines an convergence rate and low-rank SVD computations in this setting. In particular, in the context of Robust PCA, such a result allows us to replace a traditional sparsity-promoting constraint of the form , with , for some small constant . Using the norm instead of the gives rise to a strongly convex feasible set and, as we demonstrate empirically in Section 3.2, may provide a satisfactory approximation to the constraint in terms of sparsity.

  3. In the case that either or are strongly convex (though not necessarily differentiable), our method achieves a linear convergence rate. In fact, we show that even if only one of the variables is regularized by a strongly convex function, then the entire objective (1) becomes strongly convex in . Here also, in the case of a nuclear norm constraint/regularization on one of the variables, we are able to leverage the use of only low-rank SVD computations. In the context of Robust PCA such a natural strongly convex regularizer may arise by replacing the regularization on with the elastic net regularizer, which combines both the norm and the squared norm, and serves as a popular alternative to the regularizer in LASSO.

A quick summary of the above results in the context of Robust PCA problems, such as Problem (2), is given in Table 1. See Section 3.2 in the sequel for a detailed discussion.

Cond. Grad. [7] FISTA [2] Algorithm 1
setting rate SVD rate SVD rate SVD
rank rank rank
(“high SNR regime”)
(“low SNR regime”)

Table 1: Comparison of convergence rates and iteration complexity bounds for Robust PCA problems (see (2)) with a input matrix . For all methods the computational bottleneck is a single SVD computation to update the variable , hence we focus on the rank of the SVD required. For clarity of presentation the results are given in simplified form. The dependence on in the rate for Algorithm 1 in the third row comes from the strong convexity parameter of the set w.r.t. the Frobenius norm.

2 Preliminaries

Throughout the paper we let E denote an arbitrary finite-dimensional normed vector space over where and denotes a pair of primal and dual norms over E.

2.1 Smoothness and strong convexity of functions and sets

Definition 1 (smooth function).

Let be a differentiable function over a convex set . We say that is -smooth over with respect to , if for all it holds that .

Definition 2 (strongly convex function).

Let be a differentiable function over a convex set . We say that is -strongly convex over with respect to , if it satisfies for all that .

The above definition combined with the first order optimality condition implies that for a differentiable and -strongly convex function , if , then for any it holds that .

This last inequality further implies that the magnitude of the gradient of at point , is at least of the order of the square-root of the objective value approximation error at , that is, . Indeed, this follows since

where the second inequality follows from Holder’s inequality and the third from the convexity of . Thus, at any point , it holds that

(3)
Definition 3 (strongly convex set).

We say that a convex set is -strongly convex with respect to if for any , any and any vector such that , it holds that . That is, contains a ball of of radius induced by the norm centered at .

For more details on strongly convex sets, examples and connections to optimization, we refer the reader to [5].

3 Algorithm and Results

As discussed in the introduction, in this paper we study efficient algorithms for the minimization model (1), where, throughout the paper, our blanket assumption is as follows

Assumption 1.
  • is -smooth and -strongly convex.

  • and are proper, lower semicontinuous and convex functions.

It should be noted that since (similarly for ) is assumed to be extended-valued function, it allows the inclusion of constraint through the indicator function of the corresponding constraint set. Indeed, in this case one will consider , where is a nonempty, closed and convex.

We now present the main algorithmic framework which will be used to derive all of our theoretical results.

1:  input: - sequence of step-sizes
2:   arbitrary point in , arbitrary point in
3:  for  do
4:     
5:      {in fact it suffices that for some optimal solution }
6:     
7:  end for
Algorithm 1 Generalized Conditional Gradient Method for Problem (1)

The algorithm presented above is based on three well-known corner stones: alternating minimization, conditional gradient and proximal gradient. Since Problem (1) involves two variables and , we update each one of them separately and differently in an alternating fashion. Indeed, the -variable is first updated using a conditional gradient step (see step (4)) and then the alternating idea comes into a play and we use the updated information in order to update the -variable using a proximal gradient step (see step (5)).

Remark: Note that a practical implementation of Algorithm 1 for a specific problem, such as Problem (2), may require to account for approximation errors in the computation of or , since exact computation is not always practically feasible. Such considerations which can be easily incorporated both into Algorithm 1 and our corresponding analyses (see examples in [7, 4, 1]), are beyond the scope of this current paper, and for the simplicity and clarity of presentation, we assume all such computations are precise.

3.1 Outline of main results

Let us denote by the optimal value of the optimization problem (1). In the sequel we prove the following three theorems on the performance of Algorithm 1. For clarity, below we present a concise version of the results. In section 4, in which we provide complete proofs for these theorems, we also restate them with complete detail. In all three theorems we assume that Assumption 1 holds true, and we bound the convergence rate of the sequence produced by Algorithm 1 with a suitable choice of step-sizes .

Theorem 1.

Assume that where is a nonempty, closed and convex subset of . There exists a (explicit) choice of step-sizes such that Algorithm 1 converges with rate .

Remark: Note that since and are in principle interchangeable, Theorem 1 implies a rate of . This improves over the a rate of achieved by standard analyses of projected/proximal gradient methods and the conditional gradient method.

Theorem 2.

Assume where is a nonempty, closed and convex subset of and , where is an strongly convex and closed subset of . There exists a (explicit) choice of step-sizes such that Algorithm 1 converges with rate .

Moreover, if there exists such that , then using a fixed step-size, Algorithm 1 converges with rate .

Theorem 3.

Assume that is strongly convex. Then, using a fixed step-size, Algorithm 1 converges with rate .

3.2 Putting our results in the context of Robust PCA problems

As discussed in the introduction, this work is mostly motivated by low-rank matrix optimization problems such as Robust PCA (see Problem (2)). Thus, towards better understanding of our results for this setting, we now briefly detail the applications to Problem (2). As often standard in such problems, we assume that the there exists an optimal solution such that the signal matrix is of rank at most , where 111our results could be easily extended to the case in which is nearly of rank , i.e., of distance much smaller than the required approximation accuracy to a rank matrix, however for the sake of clarity we simply assume that is of low-rank..

Using low-rank SVD computations:

note that the computation of in Algorithm 1, which is used to update the estimate , simply requires that satisfies

(4)

where we use the short notation . Since is assumed to have rank at most , it follows that

(5)

The solution to the minimization problem on the RHS of (5) is given simply by computing the rank- singular value decomposition of the matrix , and projecting the vector of singular values onto the ball of radius (which can be done in time). Thus, indeed the time to compute the update for on each iteration of Algorithm 1 is dominated by a single rank- SVD computation. This observation holds for all the following discussions in this section as well. This low-rank SVD approach was already suggested in the recent work [1], that studied smooth and strongly convex minimization over the nuclear-norm ball (which differs from our setting). Finally, we note that this low-rank update could be easily extended to the case in which is not constrained, but regularized via the nuclear norm, or when is also constrained to be positive semidefinite (e.g., when attempting to recover a low-rank covariance matrix).

Faster rates for low/high SNR regimes:

in the case that is an (say, unique) optimal solution of Problem (2) which satisfies , i.e., a high signal-to-noise ratio regime, we expect that , where and are the Euclidean diameters of the nuclear norm ball and the ball, respectively. In this case, the result of Theroem 1 is appealing since the convergence rate depends only on and not on as standard algorithms/analyses. In the opposite case, i.e., , which naturally corresponds to a low signal-to-noise ratio regime, since and are interchangeable in our setting, we can reverse their rolls in the optimization and get via Theorem 1 dependency only on . Moreover, now the nuclear-norm constrained variable (assuming the role of in Algorithm 1) is only updated via a conditional gradient update - i.e., with a rank one matrix which minimizes the inner product with minus the gradient direction, and hence can be computed using only a rank-one SVD computation on each iteration. In particular, statistical recovery results such as the seminal work [3], show that under suitable assumptions on the data, exact recovery is possible in both of these cases.

Replacing the constraint with a constraint:

the norm is traditionally used in Robust PCA to constrain/regularize the sparse noise component. The standard geometric intuition is that since the boundary of the -ball becomes sharp near the axes, this choice promotes sparse solutions. This property also holds for a -ball where is sufficiently close to 1. Thus, it might be reasonable to replace the constraint for with an constraint for some small constant , which results in a strongly convex feasible set for the variable (see [5]). Using Theorem 2, we will obtain an convergence rate instead of , practically without increasing the computational complexity per iteration (since is updated via a conditional gradient update and linear optimization over a -ball can be carried-out in linear time [5]).

In order to demonstrate the plausibility of using the norm instead of , in Table 2 we present results on synthetic data (similar to those done in [3]), which show that already for a moderate value of we obtain quite satisfactory recovery results.

Similarly, when dealing with fully corrupted rows, it might make sense to replace a constraint on (as proposed in [12]) with a constraint, which again gives rise to a strongly-convex set.

Table 2: Empirical results for solving Problem (2) with constraint on instead of . The input matrix is , where for with entries sampled i.i.d. from and . Every entry in is set i.i.d. to w.p. and to either or w.p. . For simplicity we set exact bounds , . First row gives the relative recovery error of the low-rank signal and the second row gives the relative recovery error of the sparse noise component, averaged over 10 i.i.d. experiments.

Replacing the regularizer with an elastic net regularizer:

in certain cases it may be beneficial to replace the constraint/regularization of the variable in (2) with an elastic net regularizer, i.e., to take . The elastic net is a popular alternative to the standard regularizer for problems such as LASSO (see, for instance, [14]). As opposed to the regularizer, the elastic net is strongly convex (though not differentiable). Thus, with such a choice for , by invoking Theorem 3, Algorithm 1 guarantees a linear convergence rate. We note that when using the elastic net regularizer, the computation of on each iteration of Algorithm 1 requires to solve the optimization problem:

where we again use the short notation . The optimization problem on the RHS of (3.2) admits a well known closed-form solution given by the shrinkage/soft-thresholding operator, which can be computed in linear time (i.e., time), see for instance [2].

4 Rate of Convergence Analysis

In this section we provide the proofs of all the three theorems mentioned in Section 3.1. We also restate the theorems with complete detail. To this end, we first prove the following technical result which forms the basis for the proofs of all stated theorems. Throughout this section and for the simplicity of the yet to come developments we denote, for all , , , and . Note that, using these notations we obviously have that . Similarly, for an optimal solution of problem (1) we denote and .

Proposition 1.

Assume that Assumption 1 holds true. Let be a sequence generated by Algorithm 1. Then, for all , we have that

(6)
Proof.

Fix . Observe that by the update rule of Algorithm 1, it holds that

Thus, since is -smooth, it follows that

Using the above inequality we can write,

(7)

where (a) follows from the convexity of and , while (b) follows from the choice of . Note that

(8)

where (a) follows from the inequality , and (b) follows from the definitions of and .

Finally, plugging the RHS of Eq. (4) into the RHS of Eq. (4) completes the proof of the proposition. ∎

We now prove Theorem 1. For convenience, we first restate the theorem.

Theorem 4.

Assume that Assumption 1 holds true. Assume further that where is a nonempty, closed and convex subset of . Let be a sequence generated by Algorithm 1 with the following step-sizes:

(9)

where , for satisfying . Then, it holds that

Proof.

From the choice of we have that

(10)

Now, using this in (6), we get that for all , that

(11)

where the last inequality follows from the fact that . On the other hand, from the strong convexity of we obtain that

Therefore, by combining these two inequalities we derive that

Subtracting from both sides of the inequality and by denoting , we obtain that,

Let us define , . Dividing both sides of the above inequality by , we obtain the recursion

(12)

Let be such that , which implies . Let , as defined in the theorem, and consider the step-size sequence , given in (9).

Consider now executing the first iterations using the step-size choice in (9), i.e., . Using Eq. (12), we have that

Thus, for , we obtain .

We now show that for all , it holds that . For the base case , we indeed have already showed that . Assuming the induction holds for some , using the recursion (12), the induction hypothesis, and the step-size choice in (9), , we have that

Hence, the induction implies that for all . The proof is completed by recalling that . ∎

Now, we turn to prove Theorem 2. Again, we first restate the theorem.

Theorem 5.

Assume that Assumption 1 holds true, where is a nonempty, closed and convex subset of and , where is an -strongly convex and closed subset of . Let be a sequence produced by Algorithm 1 using the step-size for all . Then, it holds that

Moreover, if there exists such that , then using a fixed step-size for all , guarantees that

Proof.

Fix some iteration and define the points and .

Note that since is an -strongly convex set, it follows from Definition 3 that . Moreover, from the optimal choice of we have that . Thus, we have that

(13)

where (a) follows from the fact that , and (b) follows from the definition of and Holder’s inequality.

Plugging Eq. (4) into Eq. (6), and recalling that (hence, ), we have that