A variational approach to stable principal component pursuit

# A variational approach to stable principal component pursuit

## Abstract

We introduce a new convex formulation for stable principal component pursuit (SPCP) to decompose noisy signals into low-rank and sparse representations. For numerical solutions of our SPCP formulation, we first develop a convex variational framework and then accelerate it with quasi-Newton methods. We show, via synthetic and real data experiments, that our approach offers advantages over the classical SPCP formulations in scalability and practical parameter selection.

## 1 Introduction

Linear superposition is a useful model for many applications, including nonlinear mixing problems. Surprisingly, we can perfectly distinguish multiple elements in a given signal using convex optimization as long as they are concise and look sufficiently different from one another. Popular examples include robust principal component analysis (RPCA) where we decompose a signal into low rank and sparse components and stable principal component pursuit (SPCP), where we also seek an explicit noise component within the RPCA decomposition. Applications include alignment of occluded images (Peng et al., 2012), scene triangulation (Zhang et al., 2011), model selection (Chandrasekaran et al., 2012), face recognition, and document indexing (Candès et al., 2011).

The SPCP formulation can be mathematically stated as follows. Given a noisy matrix , we decompose it as a sum of a low-rank matrix and a sparse matrix via the following convex program

 Missing dimension or its units for \kern ($\text{SPCP}_{\text{sum}}$) subject to∥L+S−Y∥F≤ε,

where the 1-norm and nuclear norm are given by where is the vector of singular values of . In (($\text{SPCP}_{\text{sum}}$)), the parameter controls the relative importance of the low-rank term vs. the sparse term , and the parameter accounts for the unknown perturbations in the data not explained by and .

When , (($\text{SPCP}_{\text{sum}}$)) is the “robust PCA” problem as analyzed by Chandrasekaran et al. (2009); Candès et al. (2011), and it has perfect recovery guarantees under stylized incoherence assumptions. There is even theoretical guidance for selecting a minimax optimal regularization parameter (Candès et al., 2011). Unfortunately, many practical problems only approximately satisfy the idealized assumptions, and hence, we typically tune RPCA via cross-validation techniques. SPCP further complicates the practical tuning due to the additional parameter .

To cope with practical tuning issues of SPCP, we propose the following new variant called “max-SPCP”:

 minimizeL,Smax(|||L|||∗,λmax∥S∥1) ($\text{SPCP}_{\text{max}}$) subject to∥L+S−Y∥F≤ε,

where acts similar to . Our work shows that this new formulation offers both modeling and computational advantages over (($\text{SPCP}_{\text{sum}}$)).

Cross-validation with (($\text{SPCP}_{\text{max}}$)) to estimate is significantly easier than estimating in (($\text{SPCP}_{\text{sum}}$)). For example, given an oracle that provides an ideal separation , we can use in both cases. However, while we can estimate , it is not clear how to choose from data. Such cross validation can be performed on a similar dataset, or it could be obtained from a probabilistic model.

Our convex approach for solving (($\text{SPCP}_{\text{sum}}$)) generalizes to other source separation problems (Baldassarre et al., 2013) beyond SPCP. Both (($\text{SPCP}_{\text{max}}$)) and (($\text{SPCP}_{\text{sum}}$)) are challenging to solve when the dimensions are large. We show in this paper that these problems can be solved more efficiently by solving a few (typically 5 to 10) subproblems of a different functional form. While the efficiency of the solution algorithms for (($\text{SPCP}_{\text{sum}}$)) relies heavily on the efficiency of the 1-norm and nuclear norm projections, the efficiency of our solution algorithm (($\text{SPCP}_{\text{max}}$)) is preserved for arbitrary norms. Moreover, (($\text{SPCP}_{\text{max}}$)) allows a faster algorithm in the standard case, discussed in Section 6.

## 2 A Primer on Spcp

The theoretical and algorithmic research on SPCP formulations (and source separation in general) is rapidly evolving. Hence, it is important to set the stage first in terms of the available formulations to highlight our contributions.

To this end, we illustrate (($\text{SPCP}_{\text{sum}}$)) and (($\text{SPCP}_{\text{max}}$)) via different convex formulations. Flipping the objective and the constraints in (($\text{SPCP}_{\text{max}}$)) and (($\text{SPCP}_{\text{sum}}$)), we obtain the following convex programs

 minimizeL,S12∥L+S−Y∥2F ($\text{flip-SPCP}_{\text{sum}}$) s.t.|||L|||∗+λsum∥S∥1≤τsum
 minimizeL,S12∥L+S−Y∥2F ($\text{flip-SPCP}_{\text{max}}$) Missing or unrecognized delimiter for \left
###### Remark 2.1.

The solutions of (($\text{flip-SPCP}_{\text{sum}}$)) and (($\text{flip-SPCP}_{\text{max}}$)) are related to the solutions of (($\text{SPCP}_{\text{sum}}$)) and (($\text{SPCP}_{\text{max}}$)) via the Pareto frontier by Aravkin et al. (2013a, Theorem 2.1). If the constraint is tight at the solution, then there exist corresponding parameters and , for which the optimal value of (($\text{flip-SPCP}_{\text{sum}}$)) and (($\text{flip-SPCP}_{\text{max}}$)) is , and the corresponding optimal solutions and are also optimal for (($\text{SPCP}_{\text{sum}}$)) and (($\text{SPCP}_{\text{max}}$)).

For completeness, we also include the Lagrangian formulation, which is covered by our new algorithm:

 minimizeL,SλL|||L|||∗+λS∥S∥1+12∥L+S−Y∥2F (Lag-SPCP)

Problems (($\text{flip-SPCP}_{\text{max}}$)) and (($\text{flip-SPCP}_{\text{sum}}$)) can be solved using projected gradient and accelerated gradient methods. The disadvantage of some of these formulations is that it may not be clear how to tune the parameters. Surprisingly, an algorithm we propose in this paper can solve (($\text{SPCP}_{\text{max}}$)) and (($\text{SPCP}_{\text{sum}}$)) using a sequence of flipped problems that specifically exploits the structured relationship cited in Remark 2.1. In practice, we will see that better tuning also leads to faster algorithms, e.g., fixing ahead of time to an estimated ‘noise floor’ greatly reduces the amount of required computation if parameters are to be selected via cross-validation.

Finally, we note that in some cases, it is useful to change the term to where is a linear operator. For example, let be a subset of the indices of a matrix. We may only observe restricted to these entries, denoted , in which case we choose . Most existing RPCA/SPCP algorithms adapt to the case but this is due to the strong properties of the projection operator . The advantage of our approach is that it seamlessly handles arbitrary linear operators . In fact, it also generalizes to smooth misfit penalties, that are more robust than the Frobenius norm, including the Huber loss. Our results also generalize to some other penalties on besides the 1-norm.

The paper proceeds as follows. In Section 3, we describe previous work and algorithms for SPCP and RPCA. In Section 4, we cast the relationships between pairs of problems (($\text{flip-SPCP}_{\text{sum}}$)), (($\text{SPCP}_{\text{sum}}$)) and (($\text{flip-SPCP}_{\text{max}}$)), (($\text{SPCP}_{\text{max}}$)) into a general variational framework, and highlight the product-space regularization structure that enables us solve the formulations of interest using corresponding flipped problems. We discuss computationally efficient projections as optimization workhorses in Section 5, and develop new accelerated projected quasi-Newton methods for the flipped and Lagrangian formulations in Section 6. Finally, we demonstrate the efficacy of the new solvers and the overall formulation on synthetic problems and a real cloud removal example in Section 7, and follow with conclusions in Section 8.

## 3 Prior Art

While problem (($\text{SPCP}_{\text{sum}}$)) with has several solvers (e.g., it can be solved by applying the widely known Alternating Directions Method of Multipliers (ADMM)/Douglas-Rachford method (Combettes & Pesquet, 2007)), the formulation assumes the data are noise free. Unfortunately, the presence of noise we consider in this paper introduces a third term in the ADMM framework, where the algorithm is shown to be non-convergent (Chen et al., 2013). Interestingly, there are only a handful of methods that can handle this case. Those using smoothing techniques no longer promote exactly sparse and/or exactly low-rank solutions (Aybat et al., 2013). Those using dual decomposition techniques require high iteration counts. Because each step requires a partial singular value decomposition (SVD) of a large matrix, it is critical that the methods only take a few iterations.

As a rough comparison, we start with related solvers that solve (($\text{SPCP}_{\text{sum}}$)) for . Wright et al. (2009a) solves an instance of (($\text{SPCP}_{\text{sum}}$)) with and a system in hours. By switching to the ((Lag-SPCP)) formulation, Ganesh et al. (2009) uses the accelerated proximal gradient method (Beck & Teboulle, 2009) to solve a matrix in under one hour. This is improved further in Lin et al. (2010) which again solves (($\text{SPCP}_{\text{sum}}$)) with using the augmented Lagrangian and ADMM methods and solves a system in about a minute. As a prelude to our results, our method can solve some systems of this size in about seconds (c.f., Fig. 1).

In the case of (($\text{SPCP}_{\text{sum}}$)) with , Tao & Yuan (2011) propose the alternating splitting augmented Lagrangian method (ASALM), which exploits separability of the objective in the splitting scheme, and can solve a system in about five minutes.

The partially smooth proximal gradient (PSPG) approach of Aybat et al. (2013) smooths just the nuclear norm term and then applies the well-known FISTA algorithm (Beck & Teboulle, 2009). Aybat et al. (2013) show that the proximity step can be solved efficiently in closed-form, and the dominant cost at every iteration is that of the partial SVD. They include some examples on video, lopsided matrices: or so, in about 1 minute). solving formulations in under half a minute.

The nonsmooth adaptive Lagrangian (NSA) algorithm of Aybat & Iyengar (2014) is a variant of the ADMM for (($\text{SPCP}_{\text{sum}}$)), and makes use of the insight of Aybat et al. (2013). The ADMM variant is interesting in that it splits the variable , rather than the sum or residual . Their experiments solve a 1500 1500 synthetic problems in between 16 and 50 seconds (depending on accuracy) .

Shen et al. (2014) develop a method exploiting low-rank matrix factorization scheme, maintaining . This technique has also been effectively used in practice for matrix completion (Lee et al., 2010; Recht & Ré, 2011; Aravkin et al., 2013b), but lacks a full convergence theory in either context. The method of (Shen et al., 2014) was an order of magnitude faster than ASALM, but encountered difficulties in some experiments where the sparse component dominated the low rank component in some sense. Mansour & Vetro (2014) attack the ($\text{SPCP}_{\text{sum}}$) formulation using a factorized approach, together with alternating solves between and . Non-convex techniques also include hard thresholding approaches, e.g. the approach of Kyrillidis & Cevher (2014). While the factorization technique may potentially speed up some of the methods presented here, we leave this to future work, and only work with convex formulations.

## 4 Variational Framework

Both of the formulations of interest (($\text{SPCP}_{\text{sum}}$)) and (($\text{SPCP}_{\text{max}}$)) can be written as follows:

 minϕ(L,S)s.t.ρ(L+S−Y)≤ε. (4.1)

Classic formulations assume to be the Frobenius norm; however, this restriction is not necessary, and we consider to be smooth and convex. In particular, can be taken to be the robust Huber penalty (Huber, 2004). Even more importantly, this formulation allows pre-composition of a smooth convex penalty with an arbitrary linear operator , which extends the proposed approach to a much more general class of problems. Note that a simple operator is already embedded in both formulations of interest:

 L+S=[II][LS]. (4.2)

Projection onto a set of observed indices is also a simple linear operator that can be included in . Operators may include different transforms (e.g., Fourier) applied to either or .

The main formulations of interest differ only in the functional . For (($\text{SPCP}_{\text{sum}}$)), we have

 ϕsum(L,S)=|||L|||∗+λsum∥S∥1,

while for (($\text{SPCP}_{\text{max}}$)),

 ϕmax(L,S)=max(|||L|||∗,λmax∥S∥1).

The problem class (4.1) falls into the class of problems studied by van den Berg & Friedlander (2008, 2011) for and by Aravkin et al. (2013a) for arbitrary convex . Making use of this framework, we can define a value function

 v(τ)=minL,Sρ(A(L,S)−Y)s.t. ϕ(L,S)≤τ, (4.3)

and use Newton’s method to find a solution to . The approach is agnostic to the linear operator (it can be of the simple form (4.2); include restriction in the missing data case, etc.).

For both formulations of interest, is a norm defined on a product space , since we can write

 ϕsum(L,S) =∥∥∥|||L|||∗λsum∥S∥1∥∥∥1, (4.4) ϕmax(L,S) =∥∥∥|||L|||∗λmax∥S∥1∥∥∥∞. (4.5)

In particular, both and are gauges. For a convex set containing the origin, the gauge is defined by

 γ(x∣C)=infλ{λ:x∈λC}. (4.6)

For any norm , the set defining it as a gauge is simply the unit ball . We introduce gauges for two reasons. First, they are more general (a gauge is a norm only if is bounded with nonempty interior and symmetric about the origin). For example, gauges trivially allow inclusion of non-negativity constraints. Second, definition (4.6) and the explicit set simplify the exposition of the following results.

In order to implement Newton’s method for (4.3), the optimization problem to evaluate must be solved (fully or approximately) to obtain . Then the parameter for the next (4.3) problem is updated via

 τk+1=τk−v(τ)−τv′(τ). (4.7)

Given , can be written in closed form using (Aravkin et al., 2013a, Theorem 5.2), which simplifies to

 v′(τ)=−ϕ∘(AT∇ρ(A(¯L,¯¯¯¯S)−Y)), (4.8)

with denoting the polar gauge to . The polar gauge is precisely , with

 C∘={v:⟨v,x⟩≤1∀x∈C}. (4.9)

In the simplest case, where is given by (4.2), and is the least squares penalty, the formula (4.8) becomes

 v′(τ)=−ϕ∘([¯¯¯¯L+¯¯¯¯S−Y¯¯¯¯L+¯¯¯¯S−Y]).

The main computational challenge in the approach outlined in (4.3)-(4.8) is to design a fast solver to evaluate . Section 6 does just this.

The key to RPCA is that the regularization functional is a gauge over the product space used to decompose into summands and . This makes it straightforward to compute polar results for both and .

###### Theorem 4.1 (Max-Sum Duality for Gauges on Product Spaces).

Let and be gauges on and , and consider the function

 g(x,y)=max{γ1(x),γ2(y)}.

Then is a gauge, and its polar is given by

 g∘(z1,z2)=γ∘1(z1)+γ∘2(z2).
###### Proof.

Let and denote the canonical sets corresponding to gauges and . It immediately follows that is a gauge for the set , since

 inf{λ≥0|(x,y)∈λC} =inf{λ|x∈λC1 and y∈λC2} =max{γ1(x),γ2(y)}.

By (Rockafellar, 1970, Corollary 15.1.2), the polar of the gauge of is the support function of , which is given by

 supx∈C1,y∈C2⟨(x,y),(z1,z2)⟩ =supx∈C1⟨x,z1⟩+supy∈C2⟨y,z2⟩ =γ∘1(z1)+γ∘2(z2).

This theorem allows us to easily compute the polars for and in terms of the polars of and , which are the dual norms, the spectral norm and infinity norm, respectively.

###### Corollary 4.2 (Explicit variational formulae for (($\text{SPCP}_{\text{sum}}$)) and (($\text{SPCP}_{\text{max}}$))).

We have

 ϕ∘sum(Z1,Z2) =max{|||Z1|||2,1λ% sum∥Z2∥∞} (4.10) ϕ∘max(Z1,Z2) Missing dimension or its units for \kern

where denotes the spectral norm (largest eigenvalue of ).

This result was also obtained by (van den Berg & Friedlander, 2011, Section 9), but is stated only for norms. Theorem 4.1 applies to gauges, and in particular now allows asymmetric gauges, so non-negativity constraints can be easily modeled.

We now have closed form solutions for in (4.8) for both formulations of interest. The remaining challenge is to design a fast solver for (4.3) for formulations (($\text{SPCP}_{\text{sum}}$)) and (($\text{SPCP}_{\text{max}}$)). We focus on this challenge in the remaining sections of the paper. We also discuss the advantage of (($\text{SPCP}_{\text{max}}$)) from this computational perspective.

## 5 Projections

In this section, we consider the computational issues of projecting onto the set defined by . For this is straightforward since the set is just the product set of the nuclear norm and norm balls, and efficient projectors onto these are known. In particular, projecting an matrix (without loss of generality let ) onto the nuclear norm ball takes operations, and projecting it onto the -ball can be done on operations using fast median-finding algorithms (Brucker, 1984; Duchi et al., 2008).

For , the projection is no longer straightforward. Nonetheless, the following lemma shows this projection can be efficiently implemented.

###### Proposition 5.1.

(van den Berg & Friedlander, 2011, Section 5.2) Projection onto the scaled -ball, that is, for some , can be done in time.

The proof of the proposition follows by noting that the solution can be written in a form depending only on a single scalar parameter, and this scalar can be found by sorting followed by appropriate summations. We conjecture that fast median-finding ideas could reduce this to in theory, the same as the optimal complexity for the -ball.

Armed with the above proposition, we state an important lemma below. For our purposes, we may think of as a vector in rather than a matrix in .

###### Lemma 5.2.

(van den Berg & Friedlander, 2011, Section 9.2) Let and , and let be any ordering of the elements of . Then the projection of onto the ball is , where is the projection onto the scaled -ball .

###### Sketch of proof.

We need to solve

 min{(L′,S′)∣ϕsum(L′,S′)≤1}12∥∥L′−L∥∥2F+12∥∥S′−S∥∥2F.

Alternatively, solve

 Missing dimension or its units for \kern

The inner minimization is equivalent to projecting onto the nuclear norm ball, and this is well-known to be soft-thresholding of the singular values. Since it depends only on the singular values, recombining the two minimization terms gives exactly a joint projection onto a scaled -ball. ∎

###### Remark 5.1.

All the references to the -ball can be replaced by the intersection of the -ball and the non-negative cone, and the projection is still efficient. As noted in Section 4, imposing non-negativity constraints is covered by the gauge results of Theorem 4.1 and Corollary 4.2. Therefore, both the variational and efficient computational framework can be applied to this interesting case.

## 6 Solving the Sub-Problem via Projected Quasi-Newton Methods

In order to accelerate the approach, we can use quasi-Newton (QN) methods since the objective has a simple structure.1 The main challenge here is that for the term, it is tricky to deal with a weighted quadratic term (whereas for , we can obtain a low-rank Hessian and solve it efficiently via coordinate descent).

We wish to solve (($\text{flip-SPCP}_{\text{max}}$)). Let be the full variable, so we can write the objective function as . To simplify the exposition, we take to be the matrix, but the presented approach applies to general linear operators (including terms like ). The matrix structure of and is not important here, so we can think of them as vectors instead of matrices.

The gradient is . For convenience, we use and

 ∇f(X)=(∇Lf(X)∇Sf(X))=AT(r(X)r(X)),rk≡r(Xk).

The Hessian is . We cannot simultaneously project onto their constraints with this Hessian scaling (doing so would solve the original problem!), since the Hessian removes separability. Instead, we use to approximate the cross-terms.

The true function is a quadratic, so the following quadratic expansion around is exact:

 f(L,S)=f(Xk) +⟨(∇Lf(Xk)∇Sf(Xk)),(L−LkS−Sk)⟩ +⟨(L−LkS−Sk),∇2f(L−LkS−Sk)⟩ =f(Xk) +⟨(rkrk),(L−LkS−Sk)⟩ +⟨(L−LkS−Sk),(1111)(L−LkS−Sk)⟩ =f(Xk) +⟨(rkrk),(L−LkS−Sk)⟩ +⟨(L−LkS−Sk),(L−Lk+S−SkL−Lk+S−Sk)⟩

The coupling of the second order terms, shown in bold, prevents direct 1-step minimization of , subject to the nuclear and 1-norm constraints. The FISTA (Beck & Teboulle, 2009) and spectral gradient methods (SPG) (Wright et al., 2009b) replace the Hessian with the upper bound , which solves the coupling issue, but potentially lose too much second order information. After comparing FISTA and SPG, we use the SPG method for solving (($\text{flip-SPCP}_{\text{sum}}$)). However, for (($\text{flip-SPCP}_{\text{max}}$)) (and for ((Lag-SPCP)), which has no constraints but rather non-smooth terms, which can be treated like constraints using proximity operators), the constraints are uncoupled and we can take a “middle road” approach, replacing

 ⟨(L−LkS−Sk),(L−Lk+S−SkL−Lk+S−Sk)⟩

with

 ⟨(L−LkS−Sk),(L−Lk+Sk−Sk−1Lk+1−Lk+S−Sk)⟩.

The first term is decoupled, allowing us to update , and then this is plugged into the second term in a Gauss-Seidel fashion. In practice, we also scale this second-order term with a number slightly greater than but less than (e.g., ) which leads to more robust behavior. We expect this “quasi-Newton” trick to do well when is similar to .

## 7 Numerical Results

The numerical experiments are done with the algorithms suggested in this paper as well as code from PSPG (Aybat et al., 2013), NSA (Aybat & Iyengar, 2014), and ASALM (Tao & Yuan, 2011)2. We modified the other software as needed for testing purposes. PSPG, NSA and ASALM all solve (($\text{SPCP}_{\text{sum}}$)), but ASALM has another variant which solves ((Lag-SPCP)) so we test this as well. All three programs also use versions of PROPACK from Larsen (1998); Becker & Candès (2008) to compute partial SVDs. Since the cost of a single iteration may vary among the solvers, we measure error as a function of time, not iterations. When a reference solution is available, we measure the (relative) error of a trial solution as . The benchmark is designed so the time required to calculate this error at each iteration does not factor into the reported times. Since picking stopping conditions is solver dependent, we show plots of error vs time, rather than list tables. All tests are done in Matlab and the dominant computational time was due to matrix multiplications for all algorithms; all code was run in the same quad-core  GHz i7 computer.

For our implementations of the (($\text{flip-SPCP}_{\text{max}}$)), (($\text{flip-SPCP}_{\text{sum}}$)) and ((Lag-SPCP)), we use a randomized SVD (Halko et al., 2011). Since the number of singular values needed is not known in advance, the partial SVD may be called several times (the same is true for PSPG, NSA and ASALM). Our code limits the number of singular values on the first two iterations in order to speed up calculation without affecting convergence. Unfortunately, the delicate projection involved in (($\text{flip-SPCP}_{\text{sum}}$)) makes incorporating a partial SVD to this setting more challenging, so we use Matlab’s dense SVD routine.

### 7.1 Synthetic test with exponential noise

We first provide a test with generated data. The observations with and were created by first sampling a rank matrix with random singular vectors (i.e., from the Haar measure) and singular values drawn from a uniform distribution with mean , and then adding exponential random noise (with mean equal to one tenth the median absolute value of the entries of ). This exponential noise, which has a longer tail than Gaussian noise, is expected to be captured partly by the term and partly by the term.

Given , the reference solution was generated by solving ((Lag-SPCP)) to very high accuracy; the values and were picked by hand tuning to find a value such that both and are non-zero. The advantage to solving ((Lag-SPCP)) is that knowledge of allows us to generate the parameters for all the other variants, and hence we can test different problem formulations.

With these parameters, was rank 17 with nuclear norm , had 54 non-zero entries (most of them positive) with norm , the normalized residual was , and , , , and .

Results are shown in Fig. 1. Our methods for (($\text{flip-SPCP}_{\text{max}}$)) and ((Lag-SPCP)) are extremely fast, because the simple nature of these formulations allows the quasi-Newton acceleration scheme of Section 6. In turn, since our method for solving (($\text{SPCP}_{\text{max}}$)) uses the variational framework of Section 4 to solve a sequence of (($\text{flip-SPCP}_{\text{max}}$)) problems, it is also competitive (shown in cyan in Figure 1). The jumps are due to re-starting the sub-problem solver with a new value of , generated according to (4.7).

Our proximal gradient method for (($\text{flip-SPCP}_{\text{sum}}$)), which makes use of the projection in Lemma 5.2, converges more slowly, since it is not easy to accelerate with the quasi-Newton scheme due to variable coupling, and it does not make use of fast SVDs. Our solver for (($\text{SPCP}_{\text{sum}}$)), which depends on a sequence of problems (($\text{flip-SPCP}_{\text{sum}}$)), converges slowly.

The ASALM performs reasonably well, which was unexpected since it was shown to be worse than NSA and PSPG in Aybat et al. (2013); Aybat & Iyengar (2014). The PSPG solver converges to the wrong answer, most likely due to a bad choice of the smoothing parameter ; we tried choosing several different values other than the default but did not see improvement for this test (for other tests, not shown, tweaking helped significantly). The NSA solver reaches moderate error quickly but stalls before finding a highly accurate solution.

### 7.2 Synthetic test from Aybat & Iyengar (2014)

We show some tests from the test setup of Aybat & Iyengar (2014) in the case. The default setting of was used, and then the NSA solver was run to high accuracy to obtain a reference solution . From the knowledge of , one can generate , but not and , and hence we did not test the solvers for ((Lag-SPCP)) in this experiment. The data was generated as , where was sampled by multiplication of and normal Gaussian matrices, had randomly chosen entries uniformly distributed within , and was white noise chosen to give a SNR of  dB. We show three tests that vary the rank from and the sparsity ranging from . Unlike Aybat & Iyengar (2014), who report error in terms of a true noiseless signal , we report the optimization error relative to .

For the first test (with and ), had rank and nuclear norm ; had of its elements nonzero and norm , and . The other parameters were , , , and . An interesting feature of this test is that while is low-rank, is nearly low-rank but with a small tail of significant singular values until number . We expect methods to converge quickly to low-accuracy where only a low-rank approximation is needed, and then slow down as they try to find a larger rank highly-accurate solution.

The results are shown in Fig. 2. Errors barely dip below (for comparison, an error of is achieved by setting ). The NSA and PSPG solvers do quite well. In contrast to the previous test, ASALM does poorly. Our methods for (($\text{flip-SPCP}_{\text{sum}}$)), and hence (($\text{SPCP}_{\text{sum}}$)), are not competitive, since they use dense SVDs. We imposed a time-limit of about one minute, so these methods only manage a single iteration or two. Our quasi-Newton method for (($\text{flip-SPCP}_{\text{max}}$)) does well initially, then takes a long time due to a long partial SVD computation. Interestingly, (($\text{SPCP}_{\text{max}}$)) does better than pure (($\text{flip-SPCP}_{\text{max}}$)). One possible explanation is that it chooses a fortuitous sequence of values, for which the corresponding (($\text{flip-SPCP}_{\text{max}}$)) subproblems become increasingly hard, and therefore benefit from the warm-start of the solution of the easier previous problem. This is consistent with empirical observations regarding continuation techniques, see e.g., (van den Berg & Friedlander, 2008; Wright et al., 2009b).

Figure 3 is the same test but with and , and the conclusions are largely similar.

### 7.3 Cloud removal

Figure 4 shows 15 images of size from the MODIS satellite,3 after some transformations to turn images from different spectral bands into one grayscale images. Each image is a photo of the same rural location but at different points in time over the course of a few months. The background changes slowly and the variability is due to changes in vegetation, snow cover, and different reflectance. There are also outlying sources of error, mainly due to clouds (e.g., major clouds in frames 5 and 7, smaller clouds in frames 9, 11 and 12), as well as artifacts of the CCD camera on the satellite (frame 4 and 6) and issues stitching together photos of the same scene (the lines in frames 8 and 10).

There are hundreds of applications for clean satellite imagery, so removing the outlying error is of great practical importance. Because of slow changing background and sparse errors, we can model the problem using the robust PCA approach. We use the (($\text{flip-SPCP}_{\text{max}}$)) version due to its speed, and pick parameters by using a Nelder-Mead simplex search. For an error metric to use in the parameter tuning, we remove frame from the data set (call it ) and set to be frames 2–15. From this training data , the algorithm generates and . Since is a matrix, it has far from full column span. Thus our error is the distance of from the span of , i.e., .

Our method takes about 11 iterations and 5 seconds, and uses a dense SVD instead of the randomized method due to the high aspect ratio of the matrix. Some results of the obtained outputs are in Fig. 5, where one can see that some of the anomalies in the original data frames are picked up by the term and removed from the term. Frame 4 has what appears to be a camera pixel error; frame 6 has another artificial error (that is, caused by the camera and not the scene); and frame 12 has cloud cover.

## 8 Conclusions

In this paper, we reviewed several formulations and algorithms for the RPCA problem. We introduced a new denoising formulation (($\text{SPCP}_{\text{max}}$)) to the ones previously considered, and discussed modeling and algorithmic advantages of denoising formulations (($\text{SPCP}_{\text{max}}$)) and (($\text{SPCP}_{\text{sum}}$)) compared to flipped versions (($\text{flip-SPCP}_{\text{max}}$)) and (($\text{flip-SPCP}_{\text{sum}}$)). In particular, we showed that these formulations can be linked using a variational framework, which can be exploited to solve denoising formulations using a sequence of flipped problems. For (($\text{flip-SPCP}_{\text{max}}$)), we proposed a quasi-Newton acceleration that is competitive with state of the art, and used this innovation to design a fast method for (($\text{SPCP}_{\text{max}}$)) through the variational framework. The new methods were compared against prior art on synthetic examples, and applied to a real world cloud removal application application using publicly available MODIS satellite data.

\pdfbookmark

[1]ReferencesrefSection

### Footnotes

1. We use “quasi-Newton” to mean an approximation to a Newton method and it should not be confused with methods like BFGS
2. PSPG, NSA and ASALM available from the experiment package at http://www2.ie.psu.edu/aybat/codes.html

### References

1. Aravkin, A. Y., Burke, J., and Friedlander, M. P. Variational properties of value functions. SIAM J. Optimization, 23(3):1689–1717, 2013a.
2. Aravkin, A. Y., Kumar, R., Mansour, H., Recht, B., and Herrmann, F. J. A robust SVD-free approach to matrix completion, with applications to interpolation of large scale data. 2013b.
3. Aybat, N., Goldfarb, D., and Ma, S. Efficient algorithms for robust and stable principal component pursuit. Computational Optimization and Applications, (accepted), 2013.
4. Aybat, N. S. and Iyengar, G. An alternating direction method with increasing penalty for stable principal component pursuit. Computational Optimization and Applications, (submitted), 2014.
5. Baldassarre, L., Cevher, V., McCoy, M., Tran Dinh, Q., and Asaei, A. Convexity in source separation: Models, geometry, and algorithms. Technical report, 2013.
6. Beck, A. and Teboulle, M. A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems. SIAM J. Imaging Sciences, 2(1):183–202, January 2009.
7. Becker, S. and Candès, E. Singular value thresholding toolbox, 2008. Available from http://svt.stanford.edu/.
8. Brucker, P. An O(n) algorithm for quadratic knapsack problems. Operations Res. Lett., 3(3):163 – 166, 1984.
9. Candès, E. J., Li, X., Ma, Y., and Wright, J. Robust principal component analysis? J. Assoc. Comput. Mach., 58(3):1–37, May 2011.
10. Chandrasekaran, V., Sanghavi, S., Parrilo, P. A., and Willsky, A. S. Sparse and low-rank matrix decompositions. In SYSID 2009, Saint-Malo, France, July 2009.
11. Chandrasekaran, V., Parrilo, P. A., and Willsky, A. S. Latent variable graphical model selection via convex optimization. Ann. Stat., 40(4):1935–2357, 2012.
12. Chen, Caihua, He, Bingsheng, Ye, Yinyu, and Yuan, Xiaoming. The direct extension of admm for multi-block convex minimization problems is not necessarily convergent. Optimization Online, 2013.
13. Combettes, P. L. and Pesquet, J.-C. A Douglas–Rachford Splitting Approach to Nonsmooth Convex Variational Signal Recovery. IEEE J. Sel. Topics Sig. Processing, 1(4):564–574, December 2007.
14. Duchi, J., Shalev-Shwartz, S., Singer, Y., and Chandra, T. Efficient projections onto the l1-ball for learning in high dimensions. In Intl. Conf. Machine Learning (ICML), pp. 272–279, New York, July 2008. ACM Press.
15. Ganesh, A., Lin, Z., Wright, J., Wu, L., Chen, M., and Ma, Y. Fast algorithms for recovering a corrupted low-rank matrix. In Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), pp. 213–215, Aruba, Dec. 2009.
16. Halko, N., Martinsson, P.-G., and Tropp, J. A. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review, 53(2):217–288, 2011.
17. Huber, P. J. Robust Statistics. John Wiley and Sons, 2 edition, 2004.
18. Kyrillidis, Anastasios and Cevher, Volkan. Matrix recipes for hard thresholding methods. Journal of Mathematical Imaging and Vision, 48(2):235–265, 2014.
19. Larsen, R. M. Lanczos bidiagonalization with partial reorthogonalization. Tech. Report. DAIMI PB-357, Department of Computer Science, Aarhus University, September 1998.
20. Lee, J., Recht, B., Salakhutdinov, R., Srebro, N., and Tropp, J.A. Practical large-scale optimization for max-norm regularization. In Neural Information Processing Systems (NIPS), Vancouver, 2010.
21. Lin, Z., Chen, M., and Ma, Y. The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. arXiv preprint arXiv:1009.5055, 2010.
22. Mansour, H. and Vetro, A. Video background subtraction using semi-supervised robust matrix completion. In To appear in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2014.
23. Peng, Y., Ganesh, A., Wright, J., Xu, W., and Ma, Y. RASL: Robust alignment by sparse and low-rank decomposition for linearly correlated images. IEEE Trans. Pattern Analysis and Machine Intelligence, 34(11):2233–2246, 2012.
24. Recht, Benjamin and Ré, Christopher. Parallel stochastic gradient algorithms for large-scale matrix completion. Math. Prog. Comput., pp. 1–26, 2011.
25. Rockafellar, R. T. Convex analysis. Princeton Mathematical Series, No. 28. Princeton University Press, Princeton, N.J., 1970.
26. Shen, Y., Wen, Z., and Zhang, Y. Augmented Lagrangian alternating direction method for matrix separation based on low-rank factorization. Optimization Methods and Software, 29(2):239–263, March 2014.
27. Tao, M. and Yuan, X. Recovering low-rank and sparse components of matrices from incomplete and noisy observations. SIAM J. Optimization, 21:57–81, 2011.
28. van den Berg, E. and Friedlander, M. P. Probing the Pareto frontier for basis pursuit solutions. SIAM J. Sci. Computing, 31(2):890–912, 2008. software: http://www.cs.ubc.ca/~mpf/spgl1/.
29. van den Berg, E. and Friedlander, M. P. Sparse optimization with least-squares constraints. SIAM J. Optimization, 21(4):1201–1229, 2011.
30. Wright, J., Ganesh, A., Rao, S., and Ma, Y. Robust principal component analysis: Exact recovery of corrupted low-rank matrices by convex optimization. In Neural Information Processing Systems (NIPS), 2009a.
31. Wright, S. J., Nowak, R. D., and Figueiredo, M. A. T. Sparse Reconstruction by Separable Approximation. IEEE Trans. Sig. Processing, 57(7):2479–2493, July 2009b.
32. Zhang, Z., Liang, X., Ganesh, A., and Ma, Y. TILT: Transform invariant low-rank textures. In Kimmel, R., Klette, R., and Sugimoto, A. (eds.), Computer Vision – ACCV 2010, volume 6494 of Lecture Notes in Computer Science, pp. 314–328. Springer, 2011.
You are adding the first comment!
How to quickly get a good reply:
• Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
• Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
• Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minumum 40 characters