Fast First-Order Methods for Stable Principal Component Pursuit†

†Research partially supported by ONR grant N000140310514, NSF Grant DMS 10-16571 and DOE Grant DE-FG02-08-25856.


N. S. Aybat, D. Goldfarb, G. Iyengar — IEOR Department, Columbia University. Emails: nsa2106@columbia.edu, goldfarb@columbia.edu, gi10@columbia.edu.
Abstract

The stable principal component pursuit (SPCP) problem is a non-smooth convex optimization problem, the solution of which has been shown both in theory and in practice to enable one to recover the low rank and sparse components of a matrix whose elements have been corrupted by Gaussian noise. In this paper, we first show how several existing fast first-order methods can be applied to this problem very efficiently. Specifically, we show that the subproblems that arise when applying optimal gradient methods of Nesterov, alternating linearization methods and alternating direction augmented Lagrangian methods to the SPCP problem either have closed-form solutions or have solutions that can be obtained with very modest effort. We then develop a new first-order algorithm, NSA, based on partial variable splitting. All but one of the methods analyzed require at least one of the non-smooth terms in the objective function to be smoothed and obtain an $\epsilon$-optimal solution to the SPCP problem in $O(1/\epsilon)$ iterations. NSA, which works directly with the fully non-smooth objective function, is proved to be convergent under mild conditions on the sequence of parameters it uses. Our preliminary computational tests show that the latter method, NSA, although its complexity is not known, is the fastest among the four algorithms described and substantially outperforms ASALM, the only existing method for the SPCP problem. To the best of our knowledge, this is the first time an algorithm with $O(1/\epsilon)$ iteration complexity and a per-iteration complexity equal to that of a singular value decomposition has been given for the SPCP problem.

1 Introduction

In [2, 12], it was shown that when the data matrix is of the form $D = X^0 + S^0$, where $X^0$ is a low-rank matrix, i.e. $\mathrm{rank}(X^0) \ll \min\{m,n\}$, and $S^0$ is a sparse matrix, i.e. $\|S^0\|_0 \ll mn$ ($\|\cdot\|_0$ counts the number of nonzero elements of its argument), one can recover the low-rank and sparse components of $D$ by solving the principal component pursuit problem

$$\min_{X\in\mathbb{R}^{m\times n}} \|X\|_* + \xi\,\|D-X\|_1, \qquad (1.1)$$

where $\xi > 0$ is a weighting parameter.

For $X\in\mathbb{R}^{m\times n}$, $\|X\|_*$ denotes the nuclear norm of $X$, which is equal to the sum of its singular values, $\|X\|_1 := \sum_{ij}|X_{ij}|$, and $\|X\|_2 := \sigma_{\max}(X)$, where $\sigma_{\max}(X)$ is the maximum singular value of $X$.
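All three norms reduce to singular values or entrywise sums, so they are straightforward to compute numerically. A minimal NumPy sketch (the helper names are ours, not the paper's):

```python
import numpy as np

def nuclear_norm(X):
    # ||X||_* : sum of the singular values of X.
    return np.linalg.svd(X, compute_uv=False).sum()

def l1_norm(X):
    # ||X||_1 : sum of the absolute values of the entries of X.
    return np.abs(X).sum()

def spectral_norm(X):
    # ||X||_2 : largest singular value of X.
    return np.linalg.svd(X, compute_uv=False).max()
```

For a diagonal matrix these reduce to sums and maxima of the absolute diagonal entries, which gives a quick sanity check.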

To be more precise, let $X^0\in\mathbb{R}^{m\times n}$ with $\mathrm{rank}(X^0) = r$ and let $X^0 = U\Sigma V^T$ denote the singular value decomposition (SVD) of $X^0$, where $U\in\mathbb{R}^{m\times r}$ and $V\in\mathbb{R}^{n\times r}$ have orthonormal columns. Suppose that for some $\mu > 0$, $U$ and $V$ satisfy

$$\max_i \|U^T e_i\|_2^2 \le \frac{\mu r}{m}, \qquad \max_i \|V^T e_i\|_2^2 \le \frac{\mu r}{n}, \qquad \|UV^T\|_\infty \le \sqrt{\frac{\mu r}{mn}}, \qquad (1.2)$$

where $e_i$ denotes the $i$-th unit vector.
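The incoherence conditions in (1.2) can be probed numerically. The sketch below is our own illustration (`incoherence_mu` is not from the paper): it computes the smallest $\mu$ for which the first two conditions in (1.2) hold, given factors $U$ and $V$ with orthonormal columns. This value is always at least $1$, since the squared row norms of $U$ average to $r/m$:

```python
import numpy as np

def incoherence_mu(U, V):
    # Smallest mu with max_i ||U^T e_i||_2^2 <= mu*r/m and
    # max_i ||V^T e_i||_2^2 <= mu*r/n; the i-th row of U is U^T e_i.
    m, r = U.shape
    n = V.shape[0]
    mu_u = (m / r) * (np.linalg.norm(U, axis=1) ** 2).max()
    mu_v = (n / r) * (np.linalg.norm(V, axis=1) ** 2).max()
    return max(mu_u, mu_v)

# Orthonormal factors of a random rank-5 matrix, obtained via QR.
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((50, 5)))
V, _ = np.linalg.qr(rng.standard_normal((40, 5)))
mu = incoherence_mu(U, V)
```

Random subspaces such as these are known to be incoherent with high probability, i.e. they admit a small $\mu$.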

Theorem 1.1

[2] Suppose $D = X^0 + S^0$, where $X^0\in\mathbb{R}^{m\times n}$ satisfies (1.2) for some $\mu > 0$, and the support set of $S^0$ is uniformly distributed. Then there are constants $c$, $\rho_r$, $\rho_s$ such that with probability of at least $1 - cn^{-10}$, the principal component pursuit problem (1.1) with $\xi = 1/\sqrt{n}$ exactly recovers $X^0$ and $S^0$ provided that

$$\mathrm{rank}(X^0) \le \rho_r\, m\, \mu^{-1} (\log(n))^{-2} \qquad\text{and}\qquad \|S^0\|_0 \le \rho_s\, mn. \qquad (1.3)$$

In [13], it is shown that the recovery is still possible even when the data matrix, $D = X^0 + S^0 + \zeta^0$, is corrupted with a dense error matrix, $\zeta^0$, such that $\|\zeta^0\|_F \le \delta$, by solving the stable principal component pursuit (SPCP) problem

$$(P):\quad \min_{X,S\in\mathbb{R}^{m\times n}} \left\{ \|X\|_* + \xi\,\|S\|_1 :\ \|X+S-D\|_F \le \delta \right\}. \qquad (1.4)$$

Specifically, the following theorem is proved in [13].

Theorem 1.2

[13] Suppose $D = X^0 + S^0 + \zeta^0$, where $X^0\in\mathbb{R}^{m\times n}$ satisfies (1.2) for some $\mu > 0$, and the support set of $S^0$ is uniformly distributed. If $X^0$ and $S^0$ satisfy (1.3), then for any $\zeta^0$ such that $\|\zeta^0\|_F \le \delta$, the solution, $(X^*, S^*)$, to the stable principal component pursuit problem (1.4) satisfies $\|X^*-X^0\|_F^2 + \|S^*-S^0\|_F^2 \le C\, mn\,\delta^2$ for some constant $C$ with high probability.

Principal component pursuit and stable principal component pursuit both have applications in video surveillance and face recognition. For existing algorithmic approaches to solving principal component pursuit see [2, 3, 6, 7, 13] and references therein. In this paper, we develop four different fast first-order algorithms to solve the SPCP problem $(P)$. The first two algorithms are direct applications of Nesterov's optimal algorithm [9] and the proximal gradient method of Tseng [11], which is inspired by both FISTA and Nesterov's infinite memory algorithms introduced in [1] and [9], respectively. We show that both algorithms can compute an $\epsilon$-optimal, feasible solution to $(P)$ in $O(1/\epsilon)$ iterations. The third and fourth algorithms apply an alternating direction augmented Lagrangian approach to an equivalent problem obtained by partial variable splitting. The third algorithm can compute an $\epsilon$-optimal, feasible solution to the problem in $O(1/\epsilon^2)$ iterations, which can easily be improved to $O(1/\epsilon)$ complexity. Given $\epsilon$, the first three algorithms all use suitably smoothed versions of at least one of the norms in the objective function. The fourth algorithm (NSA) works directly with the original non-smooth objective function and can be shown to converge to an optimal solution of $(P)$, provided that a mild condition on the increasing sequence of penalty multipliers holds. To the best of our knowledge, this is the first time an algorithm with $O(1/\epsilon)$ iteration complexity and a per-iteration complexity equal to that of a singular value decomposition has been given for the SPCP problem.

The only algorithm that we know of that has been designed to solve the SPCP problem is ASALM [10]. The results of our numerical experiments comparing the NSA algorithm with ASALM show that NSA is faster and also more robust to changes in problem parameters.

2 Proximal Gradient Algorithm with Smooth Objective Function

In this section we show that Nesterov's optimal algorithm [8, 9] for simple sets is efficient for solving $(P)$.

For fixed parameters $\mu > 0$ and $\nu > 0$, define the smooth functions $f_\mu$ and $g_\nu$ as follows:

$$f_\mu(X) = \max_{U\in\mathbb{R}^{m\times n}:\ \|U\|_2\le 1} \langle X, U\rangle - \frac{\mu}{2}\|U\|_F^2, \qquad (2.1)$$
$$g_\nu(S) = \max_{W\in\mathbb{R}^{m\times n}:\ \|W\|_\infty\le 1} \langle S, W\rangle - \frac{\nu}{2}\|W\|_F^2. \qquad (2.2)$$

Clearly, $f_\mu(X)$ and $g_\nu(S)$ closely approximate the non-smooth functions $\|X\|_*$ and $\|S\|_1$, respectively. Also let $\chi := \{(X,S)\in\mathbb{R}^{m\times n}\times\mathbb{R}^{m\times n} : \|X+S-D\|_F \le \delta\}$, and let $L$ be a Lipschitz constant for the gradient of the objective $f_\mu + \xi g_\nu$, where $L_\mu = 1/\mu$ and $L_\nu = 1/\nu$ are the Lipschitz constants for the gradients of $f_\mu$ and $g_\nu$, respectively. Then Nesterov's optimal algorithm [8, 9] for simple sets applied to the problem:
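Both maximizations have closed forms: (2.2) separates entrywise into one-dimensional problems whose value is the Huber function, and by von Neumann's trace inequality (2.1) applies the same Huber function to the singular values of $X$. A hedged NumPy sketch of this observation (the helper names are ours):

```python
import numpy as np

def huber(t, mu):
    # max_{|w|<=1} t*w - (mu/2)*w^2 = t^2/(2*mu) if |t| <= mu, else |t| - mu/2.
    t = np.abs(t)
    return np.where(t <= mu, t ** 2 / (2 * mu), t - mu / 2)

def f_mu(X, mu):
    # Smoothed nuclear norm (2.1): Huber applied to the singular values.
    return huber(np.linalg.svd(X, compute_uv=False), mu).sum()

def g_nu(S, nu):
    # Smoothed l1 norm (2.2): Huber applied entrywise.
    return huber(S, nu).sum()
```

In particular $0 \le \|S\|_1 - g_\nu(S) \le \nu\, mn/2$ holds uniformly, which is the kind of approximation bound that drives the choice of smoothing parameters of order $\epsilon$.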

$$\min_{X,S\in\mathbb{R}^{m\times n}} \left\{ f_\mu(X) + \xi\, g_\nu(S) :\ (X,S)\in\chi \right\}, \qquad (2.3)$$

is given by Algorithm 1.

Because of the simple form of the set $\chi$, it is easy to ensure that all iterates lie in $\chi$. Hence, Algorithm 1 enjoys the full $O(L/k^2)$ convergence rate of Nesterov's method. Thus, setting $\mu = \Omega(\epsilon)$ and $\nu = \Omega(\epsilon)$, Algorithm 1 computes an $\epsilon$-optimal and feasible solution to problem $(P)$ in $O(1/\epsilon)$ iterations. The two sets of iterates that need to be computed at each iteration of Algorithm 1 are solutions to an optimization problem of the form:

$$(P_s):\quad \min_{X,S\in\mathbb{R}^{m\times n}} \left\{ \frac{L}{2}\left(\|X-\tilde X\|_F^2 + \|S-\tilde S\|_F^2\right) + \langle Q_x, X\rangle + \langle Q_s, S\rangle :\ (X,S)\in\chi \right\}. \qquad (2.4)$$

The following lemma shows that the solution to problems of the form $(P_s)$ can be computed efficiently.

Lemma 2.1

The optimal solution $(X^*, S^*)$ to problem $(P_s)$ can be written in closed form as follows.

When $\delta > 0$,

$$X^* = \left(\frac{\theta^*}{L+2\theta^*}\right)\left(D - q_s(\tilde S)\right) + \left(\frac{L+\theta^*}{L+2\theta^*}\right) q_x(\tilde X), \qquad (2.5)$$
$$S^* = \left(\frac{\theta^*}{L+2\theta^*}\right)\left(D - q_x(\tilde X)\right) + \left(\frac{L+\theta^*}{L+2\theta^*}\right) q_s(\tilde S), \qquad (2.6)$$

where $q_x(\tilde X) := \tilde X - \frac{Q_x}{L}$, $q_s(\tilde S) := \tilde S - \frac{Q_s}{L}$, and

$$\theta^* = \max\left\{0,\ \frac{L}{2}\left(\frac{\|q_x(\tilde X)+q_s(\tilde S)-D\|_F}{\delta} - 1\right)\right\}. \qquad (2.7)$$

When $\delta = 0$,

$$X^* = \tfrac{1}{2}\left(D - q_s(\tilde S)\right) + \tfrac{1}{2}\, q_x(\tilde X) \qquad\text{and}\qquad S^* = \tfrac{1}{2}\left(D - q_x(\tilde X)\right) + \tfrac{1}{2}\, q_s(\tilde S). \qquad (2.8)$$

Proof. Suppose that $\delta > 0$. Writing the constraint in problem $(P_s)$, $\|X+S-D\|_F \le \delta$, as

$$\frac{1}{2}\|X+S-D\|_F^2 \le \frac{\delta^2}{2}, \qquad (2.9)$$

the Lagrangian function for (2.4) is given as

$$\mathcal{L}(X,S;\theta) = \frac{L}{2}\left(\|X-\tilde X\|_F^2 + \|S-\tilde S\|_F^2\right) + \langle Q_x, X-\tilde X\rangle + \langle Q_s, S-\tilde S\rangle + \frac{\theta}{2}\left(\|X+S-D\|_F^2 - \delta^2\right).$$

Therefore, the optimal solution $(X^*, S^*)$ and optimal Lagrangian multiplier $\theta^*$ must satisfy the Karush-Kuhn-Tucker (KKT) conditions:

1. $\|X^* + S^* - D\|_F \le \delta$,

2. $\theta^* \ge 0$,

3. $\theta^*\left(\|X^* + S^* - D\|_F^2 - \delta^2\right) = 0$,

4. $L(X^* - \tilde X) + Q_x + \theta^*(X^* + S^* - D) = 0$,

5. $L(S^* - \tilde S) + Q_s + \theta^*(X^* + S^* - D) = 0$.

Conditions 4 and 5 imply that $(X^*, S^*)$ satisfy (2.5) and (2.6), from which it follows that

$$X^* + S^* - D = \left(\frac{L}{L+2\theta^*}\right)\left(q_x(\tilde X) + q_s(\tilde S) - D\right). \qquad (2.10)$$

Case 1: $\|q_x(\tilde X) + q_s(\tilde S) - D\|_F \le \delta$

Setting $\theta^* = 0$, $X^* = q_x(\tilde X)$, and $S^* = q_s(\tilde S)$ clearly satisfies (2.5), (2.6) and conditions 1 (from (2.10)), 2 and 3. Thus, this choice of variables satisfies all five KKT conditions.

Case 2: $\|q_x(\tilde X) + q_s(\tilde S) - D\|_F > \delta$

Set $\theta^* = \frac{L}{2}\left(\frac{\|q_x(\tilde X)+q_s(\tilde S)-D\|_F}{\delta} - 1\right)$. Since $\|q_x(\tilde X)+q_s(\tilde S)-D\|_F > \delta$, $\theta^* > 0$; hence, condition 2 is satisfied. Moreover, for this value of $\theta^*$, it follows from (2.10) that $\|X^* + S^* - D\|_F = \delta$. Thus, KKT conditions 1 and 3 are satisfied.

Therefore, setting $X^*$ and $S^*$ according to (2.5) and (2.6), respectively, and setting

$$\theta^* = \max\left\{0,\ \frac{L}{2}\left(\frac{\|q_x(\tilde X)+q_s(\tilde S)-D\|_F}{\delta} - 1\right)\right\},$$

satisfies all five KKT conditions.

Now, suppose that $\delta = 0$. Since $X + S = D$ for all $(X,S)\in\chi$, problem $(P_s)$ can be written as

$$\min_{X\in\mathbb{R}^{m\times n}} \left\|X - \tilde X + \frac{Q_x}{L}\right\|_F^2 + \left\|D - X - \tilde S + \frac{Q_s}{L}\right\|_F^2,$$

which is also equivalent to the problem $\min_{X\in\mathbb{R}^{m\times n}} \|X - q_x(\tilde X)\|_F^2 + \|X - (D - q_s(\tilde S))\|_F^2$. Then (2.8) trivially follows from the first-order optimality conditions for this problem and the fact that $S^* = D - X^*$.
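The closed form of Lemma 2.1 is cheap to implement and verify. The sketch below is our own illustration (`solve_Ps` and its argument names are not from the paper); it returns $(X^*, S^*)$ from (2.5)–(2.8), and feasibility $\|X^* + S^* - D\|_F \le \delta$ can then be checked directly:

```python
import numpy as np

def solve_Ps(X_t, S_t, Qx, Qs, D, L, delta):
    # Closed-form solution of subproblem (P_s), per Lemma 2.1.
    qx = X_t - Qx / L          # gradient-step point in X
    qs = S_t - Qs / L          # gradient-step point in S
    if delta == 0:
        X = 0.5 * (D - qs) + 0.5 * qx                 # (2.8)
        return X, D - X
    r = np.linalg.norm(qx + qs - D)
    theta = max(0.0, 0.5 * L * (r / delta - 1.0))     # (2.7)
    a = theta / (L + 2.0 * theta)
    b = (L + theta) / (L + 2.0 * theta)
    return a * (D - qs) + b * qx, a * (D - qx) + b * qs   # (2.5)-(2.6)
```

When the gradient-step point is already feasible, $\theta^* = 0$ and the constraint is inactive; otherwise the residual is scaled exactly onto the boundary $\|X^* + S^* - D\|_F = \delta$.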

3 Proximal Gradient Algorithm with Partially Smooth Objective Function

In this section we show how the proximal gradient algorithm, Algorithm 3 in [11], can be applied to the problem

$$\min_{X,S\in\mathbb{R}^{m\times n}} \left\{ f_\mu(X) + \xi\,\|S\|_1 :\ (X,S)\in\chi \right\}, \qquad (3.1)$$

where $f_\mu$ is the smooth function defined in (2.1), whose gradient is Lipschitz continuous with constant $L_\mu = 1/\mu$. This algorithm is given in Algorithm 2.

Mimicking the proof in [11], it is easy to show that Algorithm 2, which uses the prox function $\frac{1}{2}\|X - \tilde X\|_F^2$, converges to the optimal solution of (3.1). Provided the initial iterate is feasible, the algorithm keeps all iterates in $\chi$ as in Algorithm 1, and hence it enjoys the full convergence rate of $O(L_\mu/k^2)$. Thus, setting $\mu = \Omega(\epsilon)$, Algorithm 2 computes an $\epsilon$-optimal, feasible solution of problem $(P)$ in $O(1/\epsilon)$ iterations.

The only thing left to show is that the optimization subproblems in Algorithm 2 can be solved efficiently. The subproblem that has to be solved at each iteration has the form:

$$(P_{ns}):\quad \min\left\{ \xi\|S\|_1 + \langle Q, X-\tilde X\rangle + \frac{\rho}{2}\|X-\tilde X\|_F^2 :\ (X,S)\in\chi \right\}, \qquad (3.2)$$

for some $\rho > 0$. Lemma 3.1 shows that these computations can be done efficiently.

Lemma 3.1

The optimal solution $(X^*, S^*)$ to problem $(P_{ns})$ can be written in closed form as follows.

When $\delta > 0$,

$$S^* = \operatorname{sign}(D - q(\tilde X)) \odot \max\left\{ |D - q(\tilde X)| - \frac{\xi(\rho+\theta^*)}{\rho\theta^*}\, E,\ 0 \right\}, \qquad (3.3)$$
$$X^* = \frac{\theta^*}{\rho+\theta^*}\,\left(D - S^*\right) + \frac{\rho}{\rho+\theta^*}\, q(\tilde X), \qquad (3.4)$$

where $q(\tilde X) := \tilde X - \frac{Q}{\rho}$, $E$ and $0$ are matrices with all components equal to ones and zeros, respectively, and $\odot$ denotes the componentwise multiplication operator. $\theta^* = 0$ if $\|D - q(\tilde X)\|_F \le \delta$; otherwise, $\theta^*$ is the unique positive solution of the nonlinear equation $\phi(\theta) = \delta$, where

$$\phi(\theta) := \left\| \min\left\{ \frac{\xi}{\theta}\, E,\ \frac{\rho}{\rho+\theta}\, |D - q(\tilde X)| \right\} \right\|_F. \qquad (3.5)$$

Moreover, $\theta^*$ can be efficiently computed in $O(mn\log(mn))$ time.

When $\delta = 0$,

$$S^* = \operatorname{sign}(D - q(\tilde X)) \odot \max\left\{ |D - q(\tilde X)| - \frac{\xi}{\rho}\, E,\ 0 \right\} \qquad\text{and}\qquad X^* = D - S^*. \qquad (3.6)$$

Proof. Suppose that $\delta > 0$. Let $(X^*, S^*)$ be an optimal solution to problem $(P_{ns})$ and $\theta^*$ denote the optimal Lagrangian multiplier for the constraint written as (2.9). Then the KKT optimality conditions for this problem are

1. $Q + \rho(X^* - \tilde X) + \theta^*(X^* + S^* - D) = 0$,

2. $G \in \partial\|S^*\|_1$ and $\xi G + \theta^*(X^* + S^* - D) = 0$,

3. $\|X^* + S^* - D\|_F \le \delta$,

4. $\theta^* \ge 0$,

5. $\theta^*\left(\|X^* + S^* - D\|_F^2 - \delta^2\right) = 0$.

From conditions 1 and 2, we have

$$\begin{bmatrix} (\rho+\theta^*)I & \theta^* I \\ \theta^* I & \theta^* I \end{bmatrix} \begin{bmatrix} X^* \\ S^* \end{bmatrix} = \begin{bmatrix} \theta^* D + \rho\, q(\tilde X) \\ \theta^* D - \xi G \end{bmatrix}, \qquad (3.7)$$

where $q(\tilde X) = \tilde X - \frac{Q}{\rho}$. From (3.7) it follows that

$$\begin{bmatrix} (\rho+\theta^*)I & \theta^* I \\ 0 & \left(\frac{\rho\theta^*}{\rho+\theta^*}\right) I \end{bmatrix} \begin{bmatrix} X^* \\ S^* \end{bmatrix} = \begin{bmatrix} \theta^* D + \rho\, q(\tilde X) \\ \frac{\rho\theta^*}{\rho+\theta^*}\,\left(D - q(\tilde X)\right) - \xi G \end{bmatrix}. \qquad (3.8)$$

From the second equation in (3.8), we have

$$\frac{\xi(\rho+\theta^*)}{\rho\theta^*}\, G + S^* + q(\tilde X) - D = 0. \qquad (3.9)$$

But (3.9) is precisely the first-order optimality condition for the “shrinkage” problem

$$\min_{S\in\mathbb{R}^{m\times n}} \left\{ \frac{\xi(\rho+\theta^*)}{\rho\theta^*}\,\|S\|_1 + \frac{1}{2}\|S + q(\tilde X) - D\|_F^2 \right\}.$$

Thus, $S^*$ is the optimal solution to the “shrinkage” problem and is given by (3.3). (3.4) follows from the first equation in (3.8), and it implies

$$X^* + S^* - D = \frac{\rho}{\rho+\theta^*}\,\left(S^* + q(\tilde X) - D\right). \qquad (3.10)$$

Therefore,

$$\begin{aligned} \|X^*+S^*-D\|_F &= \frac{\rho}{\rho+\theta^*}\,\|S^* + q(\tilde X) - D\|_F\\ &= \frac{\rho}{\rho+\theta^*}\,\Big\|\operatorname{sign}(D-q(\tilde X)) \odot \max\Big\{|D-q(\tilde X)| - \frac{\xi(\rho+\theta^*)}{\rho\theta^*}\,E,\ 0\Big\} - \left(D-q(\tilde X)\right)\Big\|_F\\ &= \frac{\rho}{\rho+\theta^*}\,\Big\|\max\Big\{|D-q(\tilde X)| - \frac{\xi(\rho+\theta^*)}{\rho\theta^*}\,E,\ 0\Big\} - |D-q(\tilde X)|\Big\|_F\\ &= \frac{\rho}{\rho+\theta^*}\,\Big\|\min\Big\{\frac{\xi(\rho+\theta^*)}{\rho\theta^*}\,E,\ |D-q(\tilde X)|\Big\}\Big\|_F\\ &= \Big\|\min\Big\{\frac{\xi}{\theta^*}\,E,\ \frac{\rho}{\rho+\theta^*}\,|D-q(\tilde X)|\Big\}\Big\|_F, \end{aligned} \qquad (3.11)$$

where the second equality uses (3.3). Now let $\phi:\mathbb{R}_{++}\to\mathbb{R}$ be defined as

$$\phi(\theta) := \left\| \min\left\{ \frac{\xi}{\theta}\, E,\ \frac{\rho}{\rho+\theta}\, |D - q(\tilde X)| \right\} \right\|_F. \qquad (3.12)$$

Case 1: $\|D - q(\tilde X)\|_F \le \delta$

$\theta^* = 0$, $X^* = q(\tilde X)$ and $S^* = 0$ trivially satisfy all the KKT conditions.

Case 2: $\|D - q(\tilde X)\|_F > \delta$

It is easy to show that $\phi(\theta)$ is a strictly decreasing function of $\theta$. Since $\lim_{\theta\searrow 0}\phi(\theta) = \|D - q(\tilde X)\|_F > \delta$ and $\lim_{\theta\to\infty}\phi(\theta) = 0$, there exists a unique $\theta^* > 0$ such that $\phi(\theta^*) = \delta$. Given $\theta^*$, $X^*$ and $S^*$ can then be computed from equations (3.3) and (3.4), respectively. Moreover, since $\phi(\theta^*) = \delta$ and $\theta^* > 0$, (3.11) implies that $X^*$, $S^*$ and $\theta^*$ satisfy the KKT conditions.

We now show that $\theta^*$ can be computed in $O(mn\log(mn))$ time. Let $a_{(1)} \le a_{(2)} \le \dots \le a_{(mn)}$ be the elements of the matrix $|D - q(\tilde X)|$ sorted in increasing order, which can be done in $O(mn\log(mn))$ time. Defining $a_{(0)} := 0$ and $a_{(mn+1)} := \infty$, we then have for all $0 \le j \le mn$ that

$$\frac{\rho}{\rho+\theta}\, a_{(j)} \le \frac{\xi}{\theta} \le \frac{\rho}{\rho+\theta}\, a_{(j+1)} \quad\Longleftrightarrow\quad \frac{a_{(j)}}{\xi} - \frac{1}{\rho} \le \frac{1}{\theta} \le \frac{a_{(j+1)}}{\xi} - \frac{1}{\rho}. \qquad (3.13)$$

For all $1 \le j \le mn$ define $\theta_j$ such that $\frac{1}{\theta_j} = \frac{a_{(j)}}{\xi} - \frac{1}{\rho}$, and let $J := \{ j : 1 \le j \le mn,\ \theta_j > 0 \}$. Then for all $j \in J$

$$\phi(\theta_j) = \sqrt{ \left(\frac{\rho}{\rho+\theta_j}\right)^2 \sum_{i=0}^{j} a_{(i)}^2 + (mn-j)\left(\frac{\xi}{\theta_j}\right)^2 }. \qquad (3.14)$$

Also define $\theta_0 := \infty$ and $\theta_{mn+1} := 0$, so that $\phi(\theta_0) = 0$ and $\phi(\theta_{mn+1}) = \|D - q(\tilde X)\|_F$. Note that $\{\theta_j\}$ contains all the points at which $\phi$ may not be differentiable for $\theta > 0$. Define $j^* := \max\{ j : \phi(\theta_j) \le \delta,\ 0 \le j \le mn \}$. Then $\theta^*$ is the unique solution of the system

$$\sqrt{ \left(\frac{\rho}{\rho+\theta}\right)^2 \sum_{i=0}^{j^*} a_{(i)}^2 + (mn-j^*)\left(\frac{\xi}{\theta}\right)^2 } = \delta \qquad\text{and}\qquad \theta > 0, \qquad (3.15)$$

since $\phi$ is continuous and strictly decreasing in $\theta$ for $\theta > 0$. Solving the equation in (3.15) requires finding the roots of a fourth-order polynomial (a.k.a. quartic function); therefore, one can compute $\theta^*$ using the algebraic solutions of quartic equations (as shown by Lodovico Ferrari in 1540), which requires $O(1)$ operations.

Note that if $j^* = mn$, then $\theta^*$ is the solution of the equation

$$\sqrt{ \left(\frac{\rho}{\rho+\theta^*}\right)^2 \sum_{i=1}^{mn} a_{(i)}^2 } = \delta, \qquad (3.16)$$

i.e. $\theta^* = \rho\left(\frac{\|D - q(\tilde X)\|_F}{\delta} - 1\right)$. Hence, we have proved that problem $(P_{ns})$ can be solved efficiently.
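The prefix-sum formula (3.14) is what makes the $O(mn\log(mn))$ search possible: after sorting, every breakpoint value $\phi(\theta_j)$ follows from cumulative sums rather than a fresh full scan. A small numerical check of (3.14) against the direct definition (3.5) (the helper names are ours; we assume $a_{(j)} > \xi/\rho$ so that $\theta_j > 0$):

```python
import numpy as np

def phi(theta, A, xi, rho):
    # Direct evaluation of (3.5), with A = |D - q(X_t)|.
    return np.linalg.norm(np.minimum(xi / theta, rho / (rho + theta) * A))

def phi_at_breakpoint(j, a_sorted, xi, rho):
    # phi(theta_j) via the sorted-prefix formula (3.14),
    # where 1/theta_j = a_(j)/xi - 1/rho (assumed positive here).
    theta_j = 1.0 / (a_sorted[j - 1] / xi - 1.0 / rho)
    head = (rho / (rho + theta_j)) ** 2 * np.sum(a_sorted[:j] ** 2)
    tail = (a_sorted.size - j) * (xi / theta_j) ** 2
    return np.sqrt(head + tail)
```

Cumulative sums of the sorted squared entries let all $mn$ breakpoint values be evaluated in one pass, after which only the bracketing interval needs the exact root computation.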

Now, suppose that $\delta = 0$. Since $X + S = D$ for all $(X,S)\in\chi$, problem $(P_{ns})$ can be written as

$$\min_{S\in\mathbb{R}^{m\times n}} \frac{\xi}{\rho}\,\|S\|_1 + \frac{1}{2}\left\|S - \left(D - q(\tilde X)\right)\right\|_F^2. \qquad (3.17)$$

Then (3.6) trivially follows from the first-order optimality conditions for the above problem and the fact that $X^* = D - S^*$.
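Putting Lemma 3.1 together, the sketch below (our own illustration, not the paper's implementation) recovers $(X^*, S^*)$ from (3.3)–(3.4). For simplicity it finds $\theta^*$ by bisection on the strictly decreasing $\phi$ rather than by the exact sort-plus-quartic procedure; both locate the same unique root of $\phi(\theta) = \delta$:

```python
import numpy as np

def solve_Pns(X_t, Q, D, rho, xi, delta):
    # Closed-form solution of (P_ns) for delta > 0, per Lemma 3.1.
    q = X_t - Q / rho
    B = D - q
    phi = lambda t: np.linalg.norm(np.minimum(xi / t, rho / (rho + t) * np.abs(B)))
    if np.linalg.norm(B) <= delta:
        return q, np.zeros_like(q)           # Case 1: theta* = 0
    lo, hi = 1e-12, 1.0                       # Case 2: bisection for phi(theta) = delta
    while phi(hi) > delta:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if phi(mid) > delta else (lo, mid)
    theta = 0.5 * (lo + hi)
    # Shrinkage step (3.3), then X* from (3.4).
    S = np.sign(B) * np.maximum(np.abs(B) - xi * (rho + theta) / (rho * theta), 0.0)
    X = theta / (rho + theta) * (D - S) + rho / (rho + theta) * q
    return X, S
```

By (3.11), the returned pair lands exactly on the boundary $\|X^* + S^* - D\|_F = \delta$ whenever the constraint is active.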

The following lemma will be used later in the analysis of the NSA algorithm. However, we give its proof here, since it uses some equations from the proof of Lemma 3.1. Let $\mathbb{1}_\chi(\cdot)$ denote the indicator function of the closed convex set $\chi$, i.e. if $(X,S)\in\chi$, then $\mathbb{1}_\chi(X,S) = 0$; otherwise, $\mathbb{1}_\chi(X,S) = \infty$.

Lemma 3.2

Suppose that $\delta > 0$. Let $(X^*, S^*)$ be an optimal solution to problem $(P_{ns})$ and $\theta^* \ge 0$ be an optimal Lagrangian multiplier such that $(X^*, S^*)$ and $\theta^*$ together satisfy the KKT conditions, 1–5, in the proof of Lemma 3.1. Then $(W^*, W^*) \in \partial\mathbb{1}_\chi(X^*, S^*)$, where $W^* := \theta^*(X^* + S^* - D)$.

Proof. Let $W^* := \theta^*(X^* + S^* - D)$; then from conditions 1 and 5 of the KKT optimality conditions in the proof of Lemma 3.1, we have $W^* = -\left(Q + \rho(X^* - \tilde X)\right)$ and

$$\|W^*\|_F = \theta^*\|X^* + S^* - D\|_F = \theta^*\left(\|X^* + S^* - D\|_F - \delta\right) + \theta^*\delta = \theta^*\delta. \qquad (3.18)$$

Moreover, for all $(X,S)\in\chi$, it follows from the definition of $\chi$ that $\|X + S - D\|_F \le \delta$. Thus, for all $(X,S)\in\chi$, we have $\langle W^*, \theta^*(X+S-D)\rangle \le \|W^*\|_F\,\theta^*\|X+S-D\|_F \le \theta^*\delta\,\|W^*\|_F = \|W^*\|_F^2$. Hence,

$$0 \ge \langle W^*, \theta^*(X+S-D) - W^*\rangle = \langle W^*, \theta^*(X - X^* + S - S^*)\rangle \qquad \forall\, (X,S)\in\chi. \qquad (3.19)$$

It follows from the proof of Lemma 3.1 that if $\|D - q(\tilde X)\|_F > \delta$, then $\theta^* > 0$, where $\theta^*$ is the unique positive solution of $\phi(\theta) = \delta$. Therefore, (3.19) implies that

$$0 \ge \langle W^*, X - X^* + S - S^*\rangle \qquad \forall\, (X,S)\in\chi. \qquad (3.20)$$

On the other hand, if $\|D - q(\tilde X)\|_F \le \delta$, then $\theta^* = 0$ and $W^* = 0$. Hence