# Robust PCA as Bilinear Decomposition with Outlier-Sparsity Regularization

###### Abstract

Principal component analysis (PCA) is widely used for dimensionality reduction, with well-documented merits in various applications involving high-dimensional data, including computer vision, preference measurement, and bioinformatics. In this context, the fresh look advocated here permeates benefits from variable selection and compressive sampling, to robustify PCA against outliers. A least-trimmed squares estimator of a low-rank bilinear factor analysis model is shown closely related to that obtained from an $\ell_0$-(pseudo)norm-regularized criterion encouraging sparsity in a matrix explicitly modeling the outliers. This connection suggests robust PCA schemes based on convex relaxation, which lead naturally to a family of robust estimators encompassing Huber’s optimal M-class as a special case. Outliers are identified by tuning a regularization parameter, which amounts to controlling sparsity of the outlier matrix along the whole robustification path of (group) least-absolute shrinkage and selection operator (Lasso) solutions. Beyond its neat ties to robust statistics, the developed outlier-aware PCA framework is versatile enough to accommodate novel and scalable algorithms to: i) track the low-rank signal subspace robustly, as new data are acquired in real time; and ii) determine principal components robustly in (possibly) infinite-dimensional feature spaces. Synthetic and real data tests corroborate the effectiveness of the proposed robust PCA schemes, when used to identify aberrant responses in personality assessment surveys, as well as unveil communities in social networks, and intruders from video surveillance data.

This work was supported by MURI (AFOSR FA9550-10-1-0567) grant. Part of the paper appeared in the Proc. of the 44th Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, Nov. 7-10, 2010.

The authors are with the Dept. of Electrical and Computer Engineering, University of Minnesota, 200 Union Street SE, Minneapolis, MN 55455. Tel/fax: (612)626-7781/625-2002; Emails: {mate0058,georgios}@umn.edu

Submitted: June 30, 2019

Index Terms: Robust statistics, principal component analysis, outlier rejection, sparsity, (group) Lasso.

EDICS Category: MLR-LEAR

## I Introduction

Principal component analysis (PCA) is the workhorse of high-dimensional data analysis and dimensionality reduction, with numerous applications in statistics, engineering, and the biobehavioral sciences; see, e.g., [J02]. Nowadays ubiquitous e-commerce sites, the Web, and urban traffic surveillance systems generate massive volumes of data. As a result, the problem of extracting the most informative, yet low-dimensional structure from high-dimensional datasets is of paramount importance [HTF09]. To this end, PCA provides least-squares (LS) optimal linear approximants in $\mathbb{R}^q$ to a data set in $\mathbb{R}^p$, for $q \leq p$. The desired linear subspace is obtained from the $q$ dominant eigenvectors of the sample data covariance matrix [J02].

Data obeying postulated low-rank models often include outliers as well, namely samples not adhering to those nominal models. Unfortunately, LS is known to be very sensitive to outliers [RL87, hr09], and this undesirable property is inherited by PCA as well [J02]. Early efforts to robustify PCA have relied on robust estimates of the data covariance matrix; see, e.g., [c80]. Related approaches are driven from statistical physics [xu95], and also from M-estimators [dlt03]. Recently, polynomial-time algorithms with remarkable performance guarantees have emerged for low-rank matrix recovery in the presence of sparse – but otherwise arbitrarily large – errors [clmw09, cspw]. This pertains to an ‘idealized robust’ PCA setup, since those entries not affected by outliers are assumed error free. Stability in reconstructing the low-rank and sparse matrix components in the presence of ‘dense’ noise has been reported in [zlwcm10, Outlier_pursuit]. A hierarchical Bayesian model was proposed to tackle the aforementioned low-rank plus sparse matrix decomposition problem in [bayes_rpca].

In the present paper, a robust PCA approach is pursued requiring minimal assumptions on the outlier model. A natural least-trimmed squares (LTS) PCA estimator is first shown closely related to an estimator obtained from an $\ell_0$-(pseudo)norm-regularized criterion, adopted to fit a low-rank bilinear factor analysis model that explicitly incorporates an unknown sparse vector of outliers per datum (Section II). As in compressive sampling [tropp06tit], efficient (approximate) solvers are obtained in Section III, by surrogating the $\ell_0$-norm of the outlier matrix with its closest convex approximant. This leads naturally to an M-type PCA estimator, which subsumes Huber’s optimal choice as a special case [fuchs99]. Unlike Huber’s formulation though, results here are not confined to a specific outlier contamination model. A tunable parameter controls the sparsity of the estimated outlier matrix, and the number of outliers as a byproduct. Hence, effective data-driven methods to select this parameter are of paramount importance, and systematic approaches are pursued by efficiently exploring the entire robustification (a.k.a. homotopy) path of (group-) Lasso solutions [HTF09, yl06grouplasso]. In this sense, the method here capitalizes on but is not limited to sparse settings where outliers are sporadic, since one can examine all sparsity levels along the robustification path. The outlier-aware generative data model and its sparsity-controlling estimator are quite general, since minor modifications discussed in Section III-C enable robustifying linear regression [uspacor], dictionary learning [Dictionary_learning_SP_mag_10, mairal10], and K-means clustering as well [HTF09, pf_vk_gg_clustering]. Section IV deals with further modifications for bias reduction through nonconvex regularization, and automatic determination of the reduced dimension $q$.

Beyond its neat ties to robust statistics, the developed outlier-aware PCA framework is versatile enough to accommodate scalable robust algorithms to: i) track the low-rank signal subspace, as new data are acquired in real time (Section V); and ii) determine principal components in (possibly) infinite-dimensional feature spaces, thus robustifying kernel PCA [ssm97], and spectral clustering as well [HTF09, p. 544] (Section VI). The vast literature on non-robust subspace tracking algorithms includes [yang95, mairal10], and [balzano_tracking]; see also [GRASTA] for a first-order algorithm that is robust to outliers and incomplete data. Relative to [GRASTA], the online robust (OR-) PCA algorithm of this paper is a second-order method, which minimizes an outlier-aware exponentially-weighted LS estimator of the low-rank factor analysis model. Since the outlier and subspace estimation tasks decouple nicely in OR-PCA, one can readily devise a first-order counterpart when minimal computational loads are at a premium. In terms of performance, online algorithms are known to be markedly faster than their batch alternatives [balzano_tracking, GRASTA], e.g., in the timely context of low-rank matrix completion [Recht_SIAM_2010, Recht_Parallel_2011]. While the focus here is not on incomplete data records, extensions to account for missing data are immediate and will be reported elsewhere.

In Section VII, numerical tests with synthetic and real data corroborate the effectiveness of the proposed robust PCA schemes, when used to identify aberrant responses from a questionnaire designed to measure the Big-Five dimensions of personality traits [BF_chapter], as well as unveil communities in a (social) network of college football teams [network_data], and intruders from video surveillance data [dlt03]. Concluding remarks are given in Section VIII, while a few technical details are deferred to the Appendix.

Notation: Bold uppercase (lowercase) letters will denote matrices (column vectors). Operators $(\cdot)'$, $\mathrm{tr}(\cdot)$, $\mathrm{median}(\cdot)$, and $\odot$ will denote transposition, matrix trace, median, and Hadamard product, respectively. Vector $\mathrm{diag}(\mathbf{M})$ collects the diagonal entries of $\mathbf{M}$, whereas the diagonal matrix $\mathrm{diag}(\mathbf{v})$ has the entries of $\mathbf{v}$ on its diagonal. The $\ell_p$-norm of $\mathbf{x} \in \mathbb{R}^n$ is $\|\mathbf{x}\|_p := (\sum_{i=1}^n |x_i|^p)^{1/p}$ for $p \geq 1$; and $\|\mathbf{M}\|_F := \sqrt{\mathrm{tr}(\mathbf{M}\mathbf{M}')}$ is the matrix Frobenius norm. The $n \times n$ identity matrix will be represented by $\mathbf{I}_n$, while $\mathbf{0}_n$ will denote the $n \times 1$ vector of all zeros, and $\mathbf{0}_{n \times p} := \mathbf{0}_n \mathbf{0}_p'$. Similar notation will be adopted for vectors (matrices) of all ones. The $i$-th vector of the canonical basis in $\mathbb{R}^n$ will be denoted by $\mathbf{e}_i$, $i = 1, \ldots, n$.

## II Robustifying PCA

Consider the standard PCA formulation, in which a set of data $\mathcal{T}_x := \{\mathbf{x}_n\}_{n=1}^N$ in the $p$-dimensional Euclidean input space is given, and the goal is to find the best rank-$q$ linear approximation to the data in $\mathcal{T}_x$; see e.g., [J02]. Unless otherwise stated, it is assumed throughout that the value of $q$ is given. One approach to solving this problem, is to adopt a low-rank bilinear (factor analysis) model

$$\mathbf{x}_n = \mathbf{m} + \mathbf{U}\mathbf{s}_n + \mathbf{e}_n, \qquad n = 1, \ldots, N \tag{1}$$

where $\mathbf{m} \in \mathbb{R}^p$ is a location (mean) vector; matrix $\mathbf{U} \in \mathbb{R}^{p \times q}$ has orthonormal columns spanning the signal subspace; $\{\mathbf{s}_n\}_{n=1}^N$ are the so-termed principal components, and $\{\mathbf{e}_n\}_{n=1}^N$ are zero-mean i.i.d. random errors. The unknown variables in (1) can be collected in $\mathcal{P} := \{\mathbf{m}, \mathbf{U}, \{\mathbf{s}_n\}_{n=1}^N\}$, and they are estimated using the LS criterion as

$$\hat{\mathcal{P}} = \arg\min_{\mathcal{P}} \sum_{n=1}^N \|\mathbf{x}_n - \mathbf{m} - \mathbf{U}\mathbf{s}_n\|_2^2, \qquad \text{s.t. } \mathbf{U}'\mathbf{U} = \mathbf{I}_q. \tag{2}$$

PCA in (2) is a nonconvex optimization problem due to the bilinear terms $\mathbf{U}\mathbf{s}_n$, yet a global optimum can be shown to exist; see e.g., [yang95]. The resulting estimates are $\hat{\mathbf{m}} = \bar{\mathbf{x}} := N^{-1}\sum_{n=1}^N \mathbf{x}_n$ and $\hat{\mathbf{s}}_n = \hat{\mathbf{U}}'(\mathbf{x}_n - \bar{\mathbf{x}})$; while $\hat{\mathbf{U}}$ is formed with columns equal to the $q$ dominant right singular vectors of the $N \times p$ data matrix $\mathbf{X} := [\mathbf{x}_1, \ldots, \mathbf{x}_N]'$ [HTF09, p. 535]. The principal components (entries of) $\hat{\mathbf{s}}_n$ are the projections of the centered data points $\{\mathbf{x}_n - \bar{\mathbf{x}}\}_{n=1}^N$ onto the signal subspace. Equivalently, PCA can be formulated based on maximum variance, or, minimum reconstruction error criteria; see e.g., [J02].
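In code, the SVD recipe just described can be sketched as follows; this is a minimal NumPy illustration of (2), with a function name and interface of our choosing rather than any reference implementation:

```python
import numpy as np

def pca(X, q):
    """Classical PCA fit of model (1): location, subspace, principal components.

    X is N x p with one datum per row; q is the target subspace dimension.
    Returns (m, U, S) with U (p x q) orthonormal and S (N x q).
    """
    m = X.mean(axis=0)                      # sample-mean estimate of the location
    Xc = X - m                              # centered data
    # the q dominant right singular vectors span the signal subspace
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    U = Vt[:q].T                            # p x q, orthonormal columns
    S = Xc @ U                              # projections onto the subspace
    return m, U, S
```

For noise-free rank-$q$ data the reconstruction $\mathbf{m} + \mathbf{S}\mathbf{U}'$ is exact, which makes the routine easy to sanity-check.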

### II-A Least-trimmed squares PCA

Given training data $\mathcal{T}_y := \{\mathbf{y}_n\}_{n=1}^N$ possibly contaminated with outliers, the goal here is to develop a robust estimator of $\mathcal{P}$ that requires minimal assumptions on the outlier model. Note that there is an explicit notational differentiation between: i) the data in $\mathcal{T}_x$ which adhere to the nominal model (1); and ii) the given data in $\mathcal{T}_y$ that may also contain outliers, i.e., those not adhering to (1). Building on LTS regression [RL87], the desired robust estimate $\hat{\mathcal{P}}_{LTS}$ for a prescribed coverage $\nu \leq N$ can be obtained via the following LTS PCA estimator [cf. (2)]

$$\hat{\mathcal{P}}_{LTS} = \arg\min_{\mathcal{P}} \sum_{n=1}^{\nu} r^2_{[n]}(\mathcal{P}), \qquad \text{s.t. } \mathbf{U}'\mathbf{U} = \mathbf{I}_q \tag{3}$$

where $r^2_{[n]}(\mathcal{P})$ is the $n$-th order statistic among the squared residual norms $r^2_1(\mathcal{P}), \ldots, r^2_N(\mathcal{P})$, and $r_n(\mathcal{P}) := \|\mathbf{y}_n - \mathbf{m} - \mathbf{U}\mathbf{s}_n\|_2$. The so-termed coverage $\nu$ determines the breakdown point of the LTS PCA estimator [RL87], since the $N - \nu$ largest residuals are absent from the estimation criterion in (3). Beyond this universal outlier-rejection property, the LTS-based estimation offers an attractive alternative to robust linear regression due to its high breakdown point and desirable analytical properties, namely $\sqrt{N}$-consistency and asymptotic normality under mild assumptions [RL87].

###### Remark 1 (Robust estimation of the mean)

In most applications of PCA, data in $\mathcal{T}_x$ are typically assumed zero mean. This is without loss of generality, since nonzero-mean training data can always be rendered zero mean, by subtracting the sample mean $\bar{\mathbf{x}}$ from each $\mathbf{x}_n$. In modeling zero-mean data, the vector $\mathbf{m}$ in (1) can obviously be neglected. When outliers are present however, data in $\mathcal{T}_y$ are not necessarily zero mean, and it is unwise to center them using the non-robust sample mean estimator which has a breakdown point equal to zero [RL87]. Towards robustifying PCA, a more sensible approach is to estimate $\mathbf{m}$ robustly, and jointly with $\mathbf{U}$ and the principal components $\{\mathbf{s}_n\}_{n=1}^N$.

Because (3) is a nonconvex optimization problem, a nontrivial issue pertains to the existence of the proposed LTS PCA estimator, i.e., whether or not (3) attains a minimum. Fortunately, the answer is in the affirmative as asserted next.

###### Property 1

The LTS PCA estimator is well defined, since (3) has (at least) one solution.

Existence of $\hat{\mathcal{P}}_{LTS}$ can be readily established as follows: i) for each subset of $\mathcal{T}_y$ with cardinality $\nu$ (there are $\binom{N}{\nu}$ such subsets), solve the corresponding PCA problem to obtain a unique candidate estimator per subset; and ii) pick $\hat{\mathcal{P}}_{LTS}$ as the one among all candidates with the minimum cost.

Albeit conceptually simple, the solution procedure outlined under Property 1 is combinatorially complex, and thus intractable except for small sample sizes $N$. Algorithms to obtain approximate LTS solutions in large-scale linear regression problems are available; see e.g., [RL87].
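For very small $N$, the enumeration under Property 1 can nevertheless be carried out verbatim. The NumPy sketch below (helper names are ours, for illustration only) fits PCA on every size-$\nu$ subset and keeps the candidate attaining the smallest trimmed cost in (3):

```python
import numpy as np
from itertools import combinations

def pca_fit(Y, q):
    # plain PCA on the rows of Y: location estimate and orthonormal basis
    m = Y.mean(axis=0)
    _, _, Vt = np.linalg.svd(Y - m, full_matrices=False)
    return m, Vt[:q].T

def lts_cost(Y, m, U, nu):
    # trimmed cost of (3): sum of the nu smallest squared residual norms
    Yc = Y - m
    r2 = np.sum((Yc - Yc @ U @ U.T) ** 2, axis=1)
    return np.sort(r2)[:nu].sum()

def lts_pca_bruteforce(Y, q, nu):
    # Property 1 procedure: one PCA per size-nu subset, keep the candidate
    # with the smallest trimmed cost (tractable only for tiny N)
    best_cost, best = np.inf, None
    for idx in combinations(range(Y.shape[0]), nu):
        m, U = pca_fit(Y[list(idx)], q)
        c = lts_cost(Y, m, U, nu)
        if c < best_cost:
            best_cost, best = c, (m, U, set(idx))
    return best_cost, best
```

On seven collinear points plus one gross outlier, the winning subset is the clean one and the trimmed cost is (numerically) zero, illustrating the outlier-rejection property.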

### II-B $\ell_0$-norm regularization for robustness

Instead of discarding large residuals, the alternative approach here explicitly accounts for outliers in the low-rank data model (1). This becomes possible through the vector variables $\{\mathbf{o}_n\}_{n=1}^N$, one per training datum $\mathbf{y}_n$, which take the value $\mathbf{o}_n \neq \mathbf{0}_p$ whenever datum $n$ is an outlier, and $\mathbf{o}_n = \mathbf{0}_p$ otherwise. Thus, the novel outlier-aware factor analysis model is

$$\mathbf{y}_n = \mathbf{m} + \mathbf{U}\mathbf{s}_n + \mathbf{o}_n + \mathbf{e}_n, \qquad n = 1, \ldots, N \tag{4}$$

where $\mathbf{o}_n$ can be deterministic or random with unspecified distribution. In the under-determined linear system of equations (4), both $\mathcal{P}$ as well as the $N \times p$ outlier matrix $\mathbf{O} := [\mathbf{o}_1, \ldots, \mathbf{o}_N]'$ are unknown. The percentage of outliers dictates the degree of sparsity (number of zero rows) in $\mathbf{O}$. Sparsity control will prove instrumental in efficiently estimating $\mathbf{O}$, rejecting outliers as a byproduct, and consequently arriving at a robust estimator of $\mathcal{P}$. To this end, a natural criterion for controlling outlier sparsity is to seek the estimator [cf. (2)]

$$\{\hat{\mathcal{P}}, \hat{\mathbf{O}}\} = \arg\min_{\mathcal{P}, \mathbf{O}} \sum_{n=1}^N \|\mathbf{y}_n - \mathbf{m} - \mathbf{U}\mathbf{s}_n - \mathbf{o}_n\|_2^2 + \lambda_0 \|\mathbf{O}\|_0, \qquad \text{s.t. } \mathbf{U}'\mathbf{U} = \mathbf{I}_q \tag{5}$$

where $\lambda_0 \geq 0$ is a tuning parameter, and $\|\mathbf{O}\|_0$ denotes the nonconvex $\ell_0$-(pseudo)norm that is equal to the number of nonzero rows of $\mathbf{O}$. Vector (group) sparsity in the rows of $\hat{\mathbf{O}}$ can be directly controlled by tuning the parameter $\lambda_0$.

As with compressive sampling and sparse modeling schemes that rely on the $\ell_0$-norm [tropp06tit], the robust PCA problem (5) is NP-hard [Natarajan_NP_duro]. In addition, the sparsity-controlling estimator (5) is intimately related to LTS PCA, as asserted next.

Proposition 1: If $\{\hat{\mathcal{P}}, \hat{\mathbf{O}}\}$ minimizes (5) with $\lambda_0$ chosen such that $\|\hat{\mathbf{O}}\|_0 = N - \nu$, then $\hat{\mathcal{P}}_{LTS} = \hat{\mathcal{P}}$.

###### Proof:

Given $\lambda_0$ such that $\|\hat{\mathbf{O}}\|_0 = N - \nu$, the goal is to characterize $\hat{\mathcal{P}}$ as well as the positions and values of the nonzero rows of $\hat{\mathbf{O}}$. Note that because $\|\hat{\mathbf{O}}\|_0 = N - \nu$, the last term in the cost of (5) is constant, hence inconsequential to the minimization. Upon defining $\hat{\mathbf{r}}_n := \mathbf{y}_n - \hat{\mathbf{m}} - \hat{\mathbf{U}}\hat{\mathbf{s}}_n$, it is not hard to see from the optimality conditions that the rows of $\hat{\mathbf{O}}$ satisfy

$$\hat{\mathbf{o}}_n = \begin{cases} \mathbf{0}_p, & \|\hat{\mathbf{r}}_n\|_2 \leq \sqrt{\lambda_0} \\ \hat{\mathbf{r}}_n, & \|\hat{\mathbf{r}}_n\|_2 > \sqrt{\lambda_0} \end{cases}, \qquad n = 1, \ldots, N. \tag{6}$$

This is intuitive, since for those $N - \nu$ nonzero $\hat{\mathbf{o}}_n$ the best thing to do in terms of minimizing the overall cost is to set $\hat{\mathbf{o}}_n = \hat{\mathbf{r}}_n$, and thus null the corresponding squared-residual terms in (5). In conclusion, for the chosen value of $\lambda_0$ it holds that the $N - \nu$ largest squared residuals effectively do not contribute to the cost in (5).

To determine $\hat{\mathcal{P}}$ and the row support of $\hat{\mathbf{O}}$, one alternative is to exhaustively test all $\binom{N}{N-\nu}$ admissible row-support combinations. For each one of these combinations (indexed by $j$), let $\mathcal{S}_j \subset \{1, \ldots, N\}$ be the index set describing the row support of $\mathbf{O}_j$, i.e., $\mathbf{o}_{j,n} \neq \mathbf{0}_p$ if and only if $n \in \mathcal{S}_j$; and $|\mathcal{S}_j| = N - \nu$. By virtue of (6), the corresponding candidate $\hat{\mathcal{P}}_j$ solves $\min_{\mathcal{P}} \sum_{n \notin \mathcal{S}_j} \|\mathbf{y}_n - \mathbf{m} - \mathbf{U}\mathbf{s}_n\|_2^2$ subject to $\mathbf{U}'\mathbf{U} = \mathbf{I}_q$, while $\hat{\mathcal{P}}$ is the one among all $\{\hat{\mathcal{P}}_j\}$ that yields the least cost. Recognizing the aforementioned solution procedure as the one for LTS PCA outlined under Property 1, it follows that $\hat{\mathcal{P}}_{LTS} = \hat{\mathcal{P}}$. \qed

The importance of Proposition 1 is threefold. First, it formally justifies model (4) and its estimator (5) for robust PCA, in light of the well documented merits of LTS [RL87]. Second, it further solidifies the connection between sparsity-aware learning and robust estimation. Third, problem (5) lends itself naturally to efficient (approximate) solvers based on convex relaxation, the subject dealt with next.

## III Sparsity-Controlling Outlier Rejection

Recall that the row-wise $\ell_2$-norm sum $\sum_{n=1}^N \|\mathbf{o}_n\|_2$ of matrix $\mathbf{O}$ is the closest convex approximation of $\|\mathbf{O}\|_0$. This property motivates relaxing problem (5) to

$$\{\hat{\mathcal{P}}, \hat{\mathbf{O}}\} = \arg\min_{\mathcal{P}, \mathbf{O}} \sum_{n=1}^N \|\mathbf{y}_n - \mathbf{m} - \mathbf{U}\mathbf{s}_n - \mathbf{o}_n\|_2^2 + \lambda_2 \sum_{n=1}^N \|\mathbf{o}_n\|_2, \qquad \text{s.t. } \mathbf{U}'\mathbf{U} = \mathbf{I}_q. \tag{7}$$

The nondifferentiable $\ell_2$-norm regularization term encourages row-wise (vector) sparsity on the estimator of $\mathbf{O}$, a property that has been exploited in diverse problems in engineering, statistics, and machine learning [HTF09]. A noteworthy representative is the group Lasso [yl06grouplasso], a popular tool for joint estimation and selection of grouped variables in linear regression.

It is pertinent to ponder on whether problem (7) still has the potential of providing robust estimates in the presence of outliers. The answer is positive, since it is shown in the Appendix that (7) is equivalent to an M-type estimator

$$\hat{\mathcal{P}} = \arg\min_{\mathcal{P}} \sum_{n=1}^N \rho_v(\mathbf{y}_n - \mathbf{m} - \mathbf{U}\mathbf{s}_n), \qquad \text{s.t. } \mathbf{U}'\mathbf{U} = \mathbf{I}_q \tag{8}$$

where $\rho_v : \mathbb{R}^p \to \mathbb{R}$ is a vector extension to Huber’s convex loss function [hr09]; see also [kgg10], and

$$\rho_v(\mathbf{r}) = \begin{cases} \|\mathbf{r}\|_2^2, & \|\mathbf{r}\|_2 \leq \lambda_2/2 \\ \lambda_2\|\mathbf{r}\|_2 - \lambda_2^2/4, & \|\mathbf{r}\|_2 > \lambda_2/2. \end{cases} \tag{9}$$

M-type estimators (including Huber’s) adopt a fortiori an $\epsilon$-contaminated probability distribution for the outliers, and rely on minimizing the asymptotic variance of the resultant estimator for the least favorable distribution of the $\epsilon$-contaminated class (asymptotic min-max approach) [hr09]. The assumed degree of contamination specifies the tuning parameter $\lambda_2$ in (9) (and thus the threshold for deciding the outliers in M-estimators). In contrast, the present approach is universal in the sense that it is not confined to any assumed class of outlier distributions, and can afford a data-driven selection of the tuning parameter. In a nutshell, M-estimators can be viewed as a special case of the present formulation only for a specific choice of $\lambda_2$, which is not obtained via a data-driven approach, but from distributional assumptions instead.

All in all, the sparsity-controlling role of the tuning parameter $\lambda_2$ in (7) is central, since model (4) and the equivalence of (7) with (8) suggest that $\lambda_2$ is a robustness-controlling constant. Data-driven approaches to select $\lambda_2$ are described in detail under Section III-B. Before delving into algorithmic issues to solve (7), a couple of remarks are in order.

###### Remark 2 ($\ell_1$-norm regularization for entry-wise outliers)

In computer vision applications where robust PCA schemes are particularly attractive, one may not wish to discard the entire (vectorized) images $\mathbf{y}_n$, but only specific pixels deemed as outliers [dlt03]. This can be accomplished by replacing $\sum_{n=1}^N \|\mathbf{o}_n\|_2$ in (7) with $\|\mathbf{O}\|_1 := \sum_{n=1}^N \sum_{i=1}^p |o_{ni}|$, a Lasso-type regularization that encourages entry-wise sparsity in $\hat{\mathbf{O}}$.

###### Remark 3 (Outlier rejection)

From the equivalence between problems (7) and (8), it follows that those data points deemed as containing outliers are not completely discarded from the estimation process. Instead, their effect is downweighted as per Huber’s loss function [cf. (9)]. Nevertheless, explicitly accounting for the outliers in $\hat{\mathbf{O}}$ provides the means of identifying and removing the contaminated data altogether, and thus possibly re-running PCA on the outlier-free data.

### III-A Solving the relaxed problem

To optimize (7) iteratively for a given value of $\lambda_2$, an alternating minimization (AM) algorithm is adopted which cyclically updates $\mathbf{m}(k)$, $\mathbf{S}(k)$, $\mathbf{U}(k)$, and $\mathbf{O}(k)$ per iteration $k = 1, 2, \ldots$. AM algorithms are also known as block-coordinate-descent methods in the optimization parlance; see e.g., [Bertsekas_Book_Nonlinear, Tseng_cd_2001]. To update each of the variable groups, (7) is minimized while fixing the rest of the variables to their most up-to-date values. While the overall problem (7) is not jointly convex with respect to (w.r.t.) $\{\mathbf{m}, \mathbf{S}, \mathbf{U}, \mathbf{O}\}$, fixing all but one of the variable groups yields subproblems that are efficiently solved, and attain a unique solution.

Towards deriving the updates at iteration $k$ and arriving at the desired algorithm, note first that the mean update is $\mathbf{m}(k) = N^{-1}\sum_{n=1}^N [\mathbf{y}_n - \mathbf{o}_n(k-1)]$. Next, form the centered and outlier-compensated data matrix $\mathbf{Y}_o(k) := \mathbf{Y} - \mathbf{O}(k-1) - \mathbf{1}_N \mathbf{m}'(k)$. The principal components are readily given by $\mathbf{S}(k) = \mathbf{Y}_o(k)\mathbf{U}(k-1)$.

Continuing the cycle, $\mathbf{U}(k)$ solves

$$\min_{\mathbf{U}} \|\mathbf{Y}_o(k) - \mathbf{S}(k)\mathbf{U}'\|_F^2, \qquad \text{s.t. } \mathbf{U}'\mathbf{U} = \mathbf{I}_q$$

a constrained LS problem also known as reduced-rank Procrustes rotation [zouPCA]. The minimizer is given in analytical form in terms of the left and right singular vectors of $\mathbf{Y}_o'(k)\mathbf{S}(k)$ [zouPCA, Thm. 4]. In detail, one computes the SVD of $\mathbf{Y}_o'(k)\mathbf{S}(k) = \mathbf{W}(k)\mathbf{D}(k)\mathbf{Z}'(k)$ and updates $\mathbf{U}(k) = \mathbf{W}(k)\mathbf{Z}'(k)$. Next, the minimization of (7) w.r.t. $\mathbf{O}$ is an orthonormal group Lasso problem. As such, it decouples across rows giving rise to $N$ $\ell_2$-norm regularized subproblems, namely

$$\mathbf{o}_n(k) = \arg\min_{\mathbf{o} \in \mathbb{R}^p} \|\mathbf{r}_n(k) - \mathbf{o}\|_2^2 + \lambda_2\|\mathbf{o}\|_2, \qquad n = 1, \ldots, N$$

where $\mathbf{r}_n(k) := \mathbf{y}_n - \mathbf{m}(k) - \mathbf{U}(k)\mathbf{s}_n(k)$. The respective solutions are given by (see e.g., [puig_hero])

$$\mathbf{o}_n(k) = \frac{\mathbf{r}_n(k)\left(\|\mathbf{r}_n(k)\|_2 - \lambda_2/2\right)_+}{\|\mathbf{r}_n(k)\|_2}, \qquad n = 1, \ldots, N \tag{10}$$

where $(\cdot)_+ := \max(\cdot, 0)$. For notational convenience, these parallel vector soft-thresholded updates are denoted under Algorithm 1 by a thresholding operator that sets the entire outlier vector $\mathbf{o}_n(k)$ to zero whenever $\|\mathbf{r}_n(k)\|_2$ does not exceed $\lambda_2/2$, on par with the group-sparsifying property of group Lasso. Interestingly, this is the same rule used to decide if datum $\mathbf{y}_n$ is deemed an outlier, in the equivalent formulation (8) which involves Huber’s loss function. Whenever an $\ell_1$-norm regularizer is adopted as discussed in Remark 2, the only difference is that updates (10) boil down to soft-thresholding the scalar entries of $\mathbf{r}_n(k)$.
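The row-wise operator in (10) is one line of NumPy; below, `R` stacks the residuals $\mathbf{r}_n(k)$ as rows, and the helper's name is ours for illustration:

```python
import numpy as np

def row_soft_threshold(R, lam2):
    """Per-row minimizer of ||r - o||_2^2 + lam2 * ||o||_2, as in (10).

    Shrinks each row's norm by lam2/2, zeroing rows whose norm falls
    below lam2/2 (the group-Lasso thresholding rule).
    """
    norms = np.linalg.norm(R, axis=1, keepdims=True)
    scale = np.maximum(1.0 - lam2 / (2.0 * np.maximum(norms, 1e-12)), 0.0)
    return scale * R
```

A row of norm 5 with `lam2 = 2` is scaled by 0.8 (its norm drops by the threshold 1), while any row of norm below 1 is set exactly to zero.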

The entire AM solver is tabulated under Algorithm 1, indicating also the recommended initialization. Algorithm 1 is conceptually interesting, since it explicitly reveals the intertwining between the outlier identification process, and the PCA low-rank model fitting based on the outlier compensated data .

The AM solver is also computationally efficient. Computing the $N \times p$ matrix $\mathbf{Y}_o(k)$ requires $\mathcal{O}(Np)$ operations per iteration, and equally costly is to obtain the residuals $\{\mathbf{r}_n(k)\}_{n=1}^N$. The cost of computing the SVD of the $p \times q$ matrix $\mathbf{Y}_o'(k)\mathbf{S}(k)$ is of order $\mathcal{O}(pq^2)$, while the rest of the operations including the row-wise soft-thresholdings to yield $\mathbf{O}(k)$ are linear in both $N$ and $p$. In summary, the total cost of Algorithm 1 is roughly $\mathcal{O}(kNpq)$, where $k$ is the number of iterations required for convergence (typically a few iterations suffice). Because $q$ is typically small, Algorithm 1 is attractive computationally both under the classic setting where $N > p$, and $p$ is not large; as well as in high-dimensional data settings where $p \gg N$, a situation typically arising e.g., in microarray data analysis.

Because each of the optimization problems in the per-iteration cycles has a unique minimizer, and the nondifferentiable regularization only affects one of the variable groups ($\mathbf{O}$), the general results of [Tseng_cd_2001] apply to establish convergence of Algorithm 1.

### III-B Selection of $\lambda_2$: robustification paths

Selecting $\lambda_2$ controls the number of outliers rejected. But this choice is challenging because existing techniques such as cross-validation are not effective when outliers are present [RL87]. To this end, systematic data-driven approaches were devised in [uspacor], which e.g., require a rough estimate of the percentage of outliers, or, robust estimates of the nominal noise variance that can be obtained using median absolute deviation (MAD) schemes [hr09]. These approaches can be adapted to the robust PCA setting considered here, and leverage the robustification paths of (group-)Lasso solutions [cf. (7)], which are defined as the solution paths $\hat{\mathbf{o}}_n(\lambda_2)$, $n = 1, \ldots, N$, for all values of $\lambda_2 \geq 0$. As $\lambda_2$ decreases, more vectors $\hat{\mathbf{o}}_n(\lambda_2) \neq \mathbf{0}_p$ enter the model, signifying that more of the training data are deemed to contain outliers.

Consider then a grid of values of $\lambda_2$ in the interval $[\lambda_{\min}, \lambda_{\max}]$, evenly spaced on a logarithmic scale. Typically, $\lambda_{\max}$ is chosen as the minimum $\lambda_2$ value such that $\hat{\mathbf{O}} = \mathbf{0}_{N \times p}$, while $\lambda_{\min} = \epsilon\lambda_{\max}$ with $\epsilon \ll 1$, say. Because Algorithm 1 converges quite fast, (7) can be efficiently solved over the grid of values for $\lambda_2$. In the order of hundreds of grid points can be easily handled by initializing each instance of Algorithm 1 (per value of $\lambda_2$) using warm starts [HTF09]. This means that multiple instances of (7) are solved for a sequence of decreasing $\lambda_2$ values, and the initialization of Algorithm 1 per grid point corresponds to the solution obtained for the immediately preceding value of $\lambda_2$ in the grid. For sufficiently close values of $\lambda_2$, one expects that the respective solutions will also be close (the row support of $\hat{\mathbf{O}}$ will most likely not change), and hence Algorithm 1 will converge after few iterations.
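The grid traversal with warm starts can be sketched as follows; the inner loop is the AM cycle of Algorithm 1, the endpoint heuristics follow the text, and the function name and grid constants are illustrative choices of this sketch:

```python
import numpy as np

def robustification_path(Y, q, n_grid=20, n_iter=30):
    """Sample the robustification path of (7) on a log-spaced lam2 grid.

    Warm starts: each grid point is initialized at the previous solution.
    Returns the grid and the number of rows flagged as outliers per point.
    """
    N, p = Y.shape
    m = Y.mean(axis=0)
    _, _, Vt = np.linalg.svd(Y - m, full_matrices=False)
    U = Vt[:q].T
    # lam_max: smallest lam2 for which O = 0, i.e. twice the largest residual norm
    R = Y - m - (Y - m) @ U @ U.T
    lam_max = 2 * np.linalg.norm(R, axis=1).max()
    grid = np.logspace(np.log10(lam_max), np.log10(1e-4 * lam_max), n_grid)
    O = np.zeros((N, p))
    n_outliers = []
    for lam2 in grid:                      # decreasing lam2; warm-started O, U, m
        for _ in range(n_iter):
            m = (Y - O).mean(axis=0)
            Yc = Y - O - m
            S = Yc @ U
            W, _, Zt = np.linalg.svd(Yc.T @ S, full_matrices=False)
            U = W @ Zt
            R = Y - m - S @ U.T
            nrm = np.linalg.norm(R, axis=1, keepdims=True)
            O = np.maximum(1 - lam2 / (2 * np.maximum(nrm, 1e-12)), 0) * R
        n_outliers.append(int((np.linalg.norm(O, axis=1) > 1e-9).sum()))
    return grid, n_outliers
```

At $\lambda_{\max}$ no rows are flagged, while for the smallest grid values essentially every datum enters the outlier matrix, tracing the path from fully trusted to fully discounted data.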

Based on the samples of the robustification paths and the prior knowledge available on the outlier model (4), a couple of alternatives are also possible for selecting the ‘best’ value of in the grid. A comprehensive survey of options can be found in [uspacor].

Number of outliers is known: By direct inspection of the robustification paths one can determine the range of values for $\lambda_2$, such that the number of nonzero rows in $\hat{\mathbf{O}}$ equals the known number of outliers sought. Zooming in to the interval of interest, and after discarding the identified outliers, $K$-fold cross-validation methods can be applied to determine the ‘best’ $\lambda_2$.

Nominal noise covariance matrix is known: Given the noise covariance matrix $\mathbf{\Sigma} := E[\mathbf{e}_n\mathbf{e}_n']$, one can proceed as follows. Consider the residual estimates $\{\hat{\mathbf{r}}_n(\lambda_g)\}$ obtained using (7) after sampling the robustification path at each grid point $\lambda_g$. Next, pre-whiten those residuals corresponding to training data not deemed as containing outliers; i.e., form the set $\{\bar{\mathbf{r}}_n(\lambda_g) := \mathbf{\Sigma}^{-1/2}\hat{\mathbf{r}}_n(\lambda_g) : \hat{\mathbf{o}}_n(\lambda_g) = \mathbf{0}_p\}$, and find the sample covariance matrices $\{\hat{\mathbf{\Sigma}}(\lambda_g)\}$. The winner $\lambda^*$ corresponds to the grid point minimizing an absolute variance deviation criterion, namely $\lambda^* := \arg\min_{\lambda_g} |\mathrm{tr}[\hat{\mathbf{\Sigma}}(\lambda_g)] - p|$.

### III-C Connections with robust linear regression, dictionary learning, and clustering

Previous efforts towards robustifying linear regression have pointed out the equivalence between M-type estimators and $\ell_1$-norm regularized regression [fuchs99], and capitalized on this neat connection under a Bayesian framework [rao_robust]. However, they have not recognized the link to LTS via convex relaxation of the $\ell_0$-norm in (5). The treatment here goes beyond linear regression by considering the PCA framework, which entails a more challenging bilinear factor analysis model. Linear regression is subsumed as a special case, when matrix $\mathbf{U}$ is not necessarily tall but assumed known, while $\mathbf{m} = \mathbf{0}_p$.

As an alternative to PCA, it is possible to devise dimensionality reduction schemes when the data admit a sparse representation over a perhaps unknown basis. Such sparse representations comprise only a few elements (atoms) of the overcomplete basis (a.k.a. dictionary) to reconstruct the original data record. Thus, each datum is represented by a coefficient vector whose effective dimensionality (number of nonzero coefficients) is smaller than that of the original data vector. Recently, the dictionary learning paradigm offers techniques to design a dictionary over which the data assume a sparse representation; see e.g., [Dictionary_learning_SP_mag_10] for a tutorial treatment. Dictionary learning schemes are flexible, in the sense that they utilize training data to learn an appropriate overcomplete basis customized for the data at hand [mairal10, Dictionary_learning_SP_mag_10].

However, as in PCA the criteria adopted typically rely on a squared-error loss function as a measure of fit, which is known to be very sensitive to outliers [RL87, hr09]. Interestingly, one can conceivably think of robustifying dictionary learning via minor modifications to the framework described so far. For instance, with the same matrix notation used in e.g., (5), one seeks to minimize

$$\min_{\mathbf{B}, \mathbf{S}, \mathbf{O}} \|\mathbf{Y} - \mathbf{S}\mathbf{B}' - \mathbf{O}\|_F^2 + \lambda_2 \sum_{n=1}^N \|\mathbf{o}_n\|_2 + \lambda_1 \sum_{n=1}^N \|\mathbf{s}_n\|_1. \tag{11}$$

Different from the low-rank outlier-aware model adopted for PCA [cf. (4)], here the dictionary $\mathbf{B}$ is fat, with column vectors that are no longer orthogonal but still constrained to have unit $\ell_2$-norm. (This constraint is left implicit in (11) for simplicity.) Moreover, one seeks a sparse vector $\mathbf{s}_n$ to represent each datum $\mathbf{y}_n$, in terms of a few atoms of the learnt dictionary $\mathbf{B}$. This is why (11) includes an additional sparsity-promoting $\ell_1$-norm regularization on $\{\mathbf{s}_n\}_{n=1}^N$, that is not present in (7). Sparsity is thus present both in the representation coefficients $\mathbf{s}_n$, as well as in the outliers $\mathbf{o}_n$.

Finally, it is shown here that a generative data model for K-means clustering [HTF09] can share striking similarities with the bilinear model (1). Consequently, the sparsity-controlling estimator (7) can be adapted to robustify the K-means clustering task too [pf_vk_gg_clustering]. Consider for instance that the data in $\mathcal{T}_y$ come from $q$ clusters, each of which is represented by a centroid $\mathbf{c}_j \in \mathbb{R}^p$, $j = 1, \ldots, q$. Moreover, for each input vector $\mathbf{y}_n$, K-means introduces the unknown membership variables $s_{nj}$, $j = 1, \ldots, q$, where $s_{nj} = 1$ whenever $\mathbf{y}_n$ comes from cluster $j$, and $s_{nj} = 0$ otherwise. Typically, the membership variables are also constrained to satisfy $\sum_{n=1}^N s_{nj} > 0$ (no empty clusters), and $\sum_{j=1}^q s_{nj} = 1$ (single cluster membership). Upon defining $\mathbf{C} := [\mathbf{c}_1, \ldots, \mathbf{c}_q]$ and the membership vectors $\mathbf{s}_n := [s_{n1}, \ldots, s_{nq}]'$, a pertinent model for hard K-means clustering assumes that input vectors can be expressed as $\mathbf{y}_n = \mathbf{C}\mathbf{s}_n + \mathbf{o}_n + \mathbf{e}_n$, where $\mathbf{o}_n$ and $\mathbf{e}_n$ are as in (4). Because the aforementioned constraints imply $\|\mathbf{s}_n\|_0 = 1$, if $\mathbf{y}_n$ belongs to cluster $j$, then $\mathbf{C}\mathbf{s}_n = \mathbf{c}_j$ and in the absence of outliers one effectively has $\mathbf{y}_n = \mathbf{c}_j + \mathbf{e}_n$. Based on this data model, a natural approach towards robustifying K-means clustering solves [pf_vk_gg_clustering]

$$\min_{\mathbf{C}, \mathbf{S}, \mathbf{O}} \|\mathbf{Y} - \mathbf{S}\mathbf{C}' - \mathbf{O}\|_F^2 + \lambda_2 \sum_{n=1}^N \|\mathbf{o}_n\|_2, \qquad \text{s.t. } s_{nj} \in \{0, 1\}, \ \sum_{j=1}^q s_{nj} = 1. \tag{12}$$

Recall that in the robust PCA estimator (7), the subspace matrix $\mathbf{U}$ is required to be orthonormal and the principal components $\{\mathbf{s}_n\}$ are unrestrained. In the clustering context however, the centroid columns of $\mathbf{C}$ are free optimization variables, whereas the cluster membership variables $\{\mathbf{s}_n\}$ adhere to the constraints in (12). Suitable relaxations to tackle the NP-hard problem (12) have been investigated in [pf_vk_gg_clustering].
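A naive alternating scheme for (12) — hard assignments, centroid refits on outlier-compensated data, and group soft-thresholding of per-datum residuals — can be sketched as below. This is only an illustration with names and a deterministic initialization of our choosing; [pf_vk_gg_clustering] tackles (12) through principled relaxations instead:

```python
import numpy as np

def robust_kmeans(Y, k, lam2, n_iter=30):
    """Alternating sketch of the robust K-means estimator (12)."""
    N, p = Y.shape
    C = Y[:k].copy()                 # deterministic centroid init (illustrative)
    O = np.zeros((N, p))
    labels = np.zeros(N, dtype=int)
    for _ in range(n_iter):
        Z = Y - O                    # outlier-compensated data
        D = ((Z[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        labels = np.argmin(D, axis=1)          # hard membership update
        for j in range(k):                     # centroid refit per cluster
            if np.any(labels == j):
                C[j] = Z[labels == j].mean(axis=0)
        R = Y - C[labels]                      # residual w.r.t. assigned centroid
        nrm = np.linalg.norm(R, axis=1, keepdims=True)
        O = np.maximum(1 - lam2 / (2 * np.maximum(nrm, 1e-12)), 0) * R
    return C, labels, O
```

With two tight clusters and one gross outlier, the outlier's row of `O` is the only nonzero one, so the centroids are estimated from clean data.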

## IV Further Algorithmic Issues

### IV-A Bias reduction through nonconvex regularization

Instead of substituting $\|\mathbf{O}\|_0$ in (5) by its closest convex approximation, namely $\sum_{n=1}^N \|\mathbf{o}_n\|_2$, letting the surrogate function be nonconvex can yield tighter approximations, and improve the statistical properties of the estimator. In rank minimization problems for instance, the logarithm of the determinant of the unknown matrix has been proposed as a smooth surrogate to the rank [fazel_log_det]; an alternative to the convex nuclear norm in e.g., [Recht_SIAM_2010]. Nonconvex penalties such as the smoothly clipped absolute deviation (SCAD) have been also adopted to reduce bias [scad], present in uniformly weighted $\ell_1$-norm regularized estimators such as (7) [HTF09, p. 92]. In the context of sparse signal reconstruction, the $\ell_0$-norm of a vector was surrogated in [candes_l0_surrogate] by the logarithm of the geometric mean of its elements; see also [nacho_universal].

Building on this last idea, consider approximating (5) by the nonconvex formulation

$$\min_{\mathcal{P}, \mathbf{O}} \sum_{n=1}^N \|\mathbf{y}_n - \mathbf{m} - \mathbf{U}\mathbf{s}_n - \mathbf{o}_n\|_2^2 + \lambda_0 \sum_{n=1}^N \log(\|\mathbf{o}_n\|_2 + \delta), \qquad \text{s.t. } \mathbf{U}'\mathbf{U} = \mathbf{I}_q \tag{13}$$

where the small positive constant $\delta$ is introduced to avoid numerical instability. Since the surrogate term $\log(\|\mathbf{o}_n\|_2 + \delta)$ in (13) is concave, the overall minimization problem is nonconvex and admittedly more complex to solve than (7). Local methods based on iterative linearization of $\log(\|\mathbf{o}_n\|_2 + \delta)$ around the current iterate $\mathbf{o}_n(k)$, are adopted to minimize (13). Skipping details that can be found in [kgg10], application of the majorization-minimization technique to (13) leads to an iteratively-reweighted version of (7), whereby weighted penalties $\lambda_2 w_n(k)\|\mathbf{o}_n\|_2$ are used for updating $\mathbf{O}(k)$ in Algorithm 1. Specifically, per $k$ one updates

$$\mathbf{o}_n(k) = \frac{\mathbf{r}_n(k)\left(\|\mathbf{r}_n(k)\|_2 - \lambda_2 w_n(k)/2\right)_+}{\|\mathbf{r}_n(k)\|_2}, \qquad n = 1, \ldots, N$$

where the weights are given by $w_n(k) = (\|\mathbf{o}_n(k-1)\|_2 + \delta)^{-1}$. Note that the thresholds $\lambda_2 w_n(k)/2$ vary both across rows (indexed by $n$), and across iterations. If the value of $\|\mathbf{o}_n(k-1)\|_2$ is small, then in the next iteration the regularization term $\lambda_2 w_n(k)\|\mathbf{o}_n\|_2$ has a large weight, thus promoting shrinkage of that entire row vector to zero. If $\|\mathbf{o}_n(k-1)\|_2$ is large, the cost in the next iteration downweighs the regularization, and places more importance on the LS component of the fit.

All in all, the idea is to start from the solution of (7) for the ‘best’ $\lambda_2$, which is obtained using Algorithm 1. This initial estimate is refined after running a few iterations of the iteratively-reweighted counterpart to Algorithm 1. Extensive numerical tests suggest that even a couple of iterations of this second-stage refinement suffice to yield improved estimates $\hat{\mathbf{O}}$, in comparison to those obtained from (7). The improvements can be attributed to bias reduction – and its positive effect with regards to outlier support estimation – also achieved by similar weighted norm regularizers proposed for linear regression [HTF09, p. 92].
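Per refinement iteration, only the thresholds change; the following sketch computes them, assuming the weight rule $w_n(k) = 1/(\|\mathbf{o}_n(k-1)\|_2 + \delta)$ described above (function name and default `delta` are illustrative):

```python
import numpy as np

def reweighted_thresholds(O_prev, lam2, delta=1e-6):
    """Per-row thresholds lam2 * w_n / 2 for the reweighted O update.

    Rows previously shrunk to zero get a huge weight (and threshold),
    while rows carrying large outliers are barely penalized.
    """
    w = 1.0 / (np.linalg.norm(O_prev, axis=1) + delta)   # w_n = 1/(||o_n|| + delta)
    return lam2 * w / 2.0
```

Feeding these thresholds back into the row-wise soft-thresholding step of Algorithm 1 yields the refinement stage discussed above.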

### Iv-B Automatic rank determination: from nuclear- to Frobenius-norm regularization

Recall that $q$ is the dimensionality of the subspace where the outlier-free data in (1) are assumed to live, or equivalently, the rank of the noise-free data matrix. So far, $q$ was assumed known and fixed. This is reasonable in e.g., compression/quantization, where a target distortion-rate tradeoff dictates the maximum $q$. In other cases, the physics of the problem may render $q$ known. This is indeed the case in array processing for direction-of-arrival estimation, where $q$ is the dimensionality of the so-termed signal subspace, and is given by the number of plane waves impinging on a uniform linear array; see e.g., [yang95].

Other applications however, call for signal processing tools that can determine the ‘best’ $q$, as well as robustly estimate the underlying low-dimensional subspace from data $\mathcal{T}_y$. Noteworthy representatives for this last kind of problems include unveiling traffic volume anomalies in large-scale networks [morteza_gonzalo_gg_asilomar11], and automatic intrusion detection from video surveillance frames [dlt03, clmw09], just to name a few. A related approach in this context is (stable) principal components pursuit (PCP) [zlwcm10, Outlier_pursuit], which solves

$$\min_{\mathbf{X},\,\mathbf{O}}\ \frac{1}{2}\|\mathbf{Y}-\mathbf{X}-\mathbf{O}\|_F^2+\lambda_*\|\mathbf{X}\|_*+\lambda_2\|\mathbf{O}\|_{2,1}\qquad\qquad(14)$$

with the objective of reconstructing the low-rank matrix X, as well as the row-sparse matrix of outliers O, in the presence of dense noise with known variance.^{1}^{1}1 Actually, [zlwcm10] considers entrywise outliers and adopts an ℓ1-norm regularization on O. Note that ‖X‖_* denotes the matrix nuclear norm, defined as the sum of the singular values of X, while ‖O‖_{2,1} stands for the sum of the ℓ2-norms of the rows of O. The same way that the ℓ2,1-norm regularization promotes sparsity in the rows of O, the nuclear norm encourages a low-rank X since it effects sparsity in the vector of singular values of X. Upon solving the convex optimization problem (14), it is possible to obtain the low-dimensional subspace using the SVD of X. Interestingly, (14) does not fix (or require knowledge of) the rank of X a priori, but controls it through the tuning parameter λ_*. Adopting a Bayesian framework, a similar problem was considered in [bayes_rpca].
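To make the roles of the two regularizers concrete, the sketch below minimizes (14) by exact block-coordinate steps: singular value thresholding (the proximal operator of the nuclear norm) for the low-rank block, and row-wise shrinkage for the ℓ2,1-norm block. The alternating solver itself is our illustrative choice, not the algorithm of [zlwcm10].

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: prox of tau * (nuclear norm)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def row_shrink(M, tau):
    """Row-wise soft thresholding: prox of tau * (sum of row l2-norms)."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    return np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0) * M

def pcp_am(Y, lam_star, lam2, n_iter=50):
    """Alternating-minimization sketch of (14): each block is minimized
    exactly via its proximal operator, so the objective never increases."""
    X, O = np.zeros_like(Y), np.zeros_like(Y)
    for _ in range(n_iter):
        X = svt(Y - O, lam_star)      # low-rank block
        O = row_shrink(Y - X, lam2)   # row-sparse outlier block
    return X, O
```

Since both subproblems are solved exactly, the objective of (14) is nonincreasing along the iterations.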

Instead of assuming that the rank r is known, suppose that only an upper bound ρ ≥ r is given. Then, the class of feasible noise-free low-rank matrix components X of the data in (1) admits a factorization X = PQ′, where P and Q are N × ρ and p × ρ matrices, respectively. Building on the ideas used in the context of finding minimum-rank solutions of linear matrix equations [Recht_SIAM_2010], a novel alternative approach to robustifying PCA is to solve

$$\min_{\mathbf{P},\,\mathbf{Q},\,\mathbf{O}}\ \frac{1}{2}\|\mathbf{Y}-\mathbf{P}\mathbf{Q}'-\mathbf{O}\|_F^2+\frac{\lambda_*}{2}\left(\|\mathbf{P}\|_F^2+\|\mathbf{Q}\|_F^2\right)+\lambda_2\|\mathbf{O}\|_{2,1}\qquad(15)$$

Different from (14) and (7), a Frobenius-norm regularization on both P and Q is adopted to control the dimensionality of the estimated subspace. Relative to (7), the subspace factor Q in (15) is not constrained to be orthonormal. It is certainly possible to include the mean vector in the cost of (15), as well as an ℓ1-norm regularization for entrywise outliers. The main motivation behind choosing the Frobenius-norm regularization comes from the equivalence of (14) with (15), as asserted in the ensuing result which adapts [Recht_SIAM_2010, Lemma 5.1] to the problem formulation considered here.

###### Proof:

Because rank(X) ≤ ρ, the relevant feasible subset of (14) can be re-parametrized as X = PQ′, where P and Q are N × ρ and p × ρ matrices, respectively. For every such triplet {P, Q, O}, the objective of (15) is no smaller than the one of (14), since it holds that [Recht_SIAM_2010]

$$\|\mathbf{X}\|_*=\min_{\{\mathbf{P},\mathbf{Q}\}:\ \mathbf{X}=\mathbf{P}\mathbf{Q}'}\ \frac{1}{2}\left(\|\mathbf{P}\|_F^2+\|\mathbf{Q}\|_F^2\right).\qquad\qquad(16)$$

One can show that the gap between the objectives of (14) and (15) vanishes at P = UΣ^{1/2} and Q = VΣ^{1/2}, where X = UΣV′ is the SVD of the minimizer X of (14). Therefore, from the previous arguments it follows that (14) and (15) attain the same global minimum objective, which completes the proof. \qed
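A quick numerical sanity check of the variational characterization of the nuclear norm invoked in the proof: building the factors from the SVD makes the Frobenius surrogate match the nuclear norm exactly.

```python
import numpy as np

# For X = P Q' with P = U sqrt(S) and Q = V sqrt(S) taken from the SVD of X,
# 0.5 * (||P||_F^2 + ||Q||_F^2) equals the sum of the singular values of X.
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))
U, s, Vt = np.linalg.svd(X, full_matrices=False)
P = U @ np.diag(np.sqrt(s))
Q = Vt.T @ np.diag(np.sqrt(s))
nuclear = s.sum()
surrogate = 0.5 * (np.linalg.norm(P, 'fro')**2 + np.linalg.norm(Q, 'fro')**2)
assert np.allclose(P @ Q.T, X)       # valid factorization
assert np.isclose(nuclear, surrogate)  # gap vanishes at the SVD factors
```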

Even though problem (15) is nonconvex, the number of optimization variables is reduced from 2Np in (14) to (N + p)ρ + Np, which becomes significant when ρ is in the order of a few dozen and both N and p are large. Also note that the dominant Np-term in the variable count of (15) is due to O, which is sparse and can be efficiently handled. While the factorization X = PQ′ could have also been introduced in (14) to reduce the number of unknowns, the cost in (15) is separable and much simpler to optimize using e.g., an AM solver comprising the iterations tabulated as Algorithm 2. The decomposability of the Frobenius-norm regularizer has been recently exploited for parallel processing across multiple processors when solving large-scale matrix completion problems [Recht_Parallel_2011], and to unveil network anomalies [morteza_gonzalo_gg_asilomar11].
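The following is a minimal sketch of an AM solver of the form just described (not a reproduction of the paper's Algorithm 2): with O and one factor fixed, each factor update is a ridge regression in closed form, while the O update is row-wise soft thresholding. Variable names follow the text.

```python
import numpy as np

def robust_pca_bilinear(Y, rho, lam_star, lam2, n_iter=50, seed=0):
    """Alternating minimization for (15): ridge updates for P and Q,
    row-wise shrinkage for the outlier matrix O. rho is the rank bound."""
    rng = np.random.default_rng(seed)
    N, p = Y.shape
    P = rng.standard_normal((N, rho))
    Q = rng.standard_normal((p, rho))
    O = np.zeros_like(Y)
    I = lam_star * np.eye(rho)
    for _ in range(n_iter):
        R = Y - O
        P = R @ Q @ np.linalg.inv(Q.T @ Q + I)    # ridge update, Q and O fixed
        Q = R.T @ P @ np.linalg.inv(P.T @ P + I)  # ridge update, P and O fixed
        E = Y - P @ Q.T
        norms = np.linalg.norm(E, axis=1, keepdims=True)
        O = np.maximum(1.0 - lam2 / np.maximum(norms, 1e-12), 0.0) * E
    return P, Q, O
```

Each step solves its block subproblem exactly, so the cost of (15) is nonincreasing; the Frobenius regularizer keeps the factor updates well conditioned even when ρ exceeds the true rank.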

Because (15) is a nonconvex optimization problem, most solvers one can think of will at best provide convergence guarantees to a stationary point that may not be globally optimal. Nevertheless, simulation results in Section VII demonstrate that Algorithm 2 is effective in providing good solutions most of the time, which is somewhat expected since there is quite a bit of structure in (15). Formally, the next proposition adapted from [morteza_gonzalo_gg_asilomar11, Prop. 1] provides a sufficient condition under which Algorithm 2 yields an optimal solution of (14). For a proof of a slightly more general result, see [morteza_gonzalo_gg_asilomar11].

## V Robust Subspace Tracking

E-commerce and Internet-based retailing sites, the World Wide Web, and video surveillance systems generate huge volumes of data, which far exceed the ability of modern computers to analyze them in real time. Furthermore, data are generated sequentially in time, which motivates updating previously obtained learning results rather than re-computing new ones from scratch each time a new datum becomes available. This calls for low-complexity real-time (adaptive) algorithms for robust subspace tracking.

One possible adaptive counterpart to (7) is the exponentially-weighted LS (EWLS) estimator found by

$$\min_{\mathbf{U},\,\{\mathbf{s}_\tau,\mathbf{o}_\tau\}}\ \sum_{\tau=1}^{n}\beta^{\,n-\tau}\left[\frac{1}{2}\|\mathbf{y}_\tau-\mathbf{U}\mathbf{s}_\tau-\mathbf{o}_\tau\|_2^2+\lambda_2\|\mathbf{o}_\tau\|_2\right]\qquad(17)$$

where β ∈ (0, 1] is a forgetting factor. In this context, n should be understood as a temporal variable, indexing the instants of data acquisition. Note that in forming the EWLS estimator (17) at time n, the entire history of data is incorporated in the real-time estimation process. Whenever β < 1, past data are exponentially discarded, thus enabling operation in nonstationary environments. Adaptive estimation of sparse signals has been considered in e.g., [da_jn_gg_2010] and [mairal10].

Towards deriving a real-time, computationally efficient, and recursive (approximate) solver of (17), an AM scheme will be adopted in which iterations coincide with the time scale of data acquisition. Per time instant n, a new datum y_n is acquired and the corresponding pair of decision variables {s[n], o[n]} is updated via

$$\{\mathbf{s}[n],\mathbf{o}[n]\}=\arg\min_{\mathbf{s},\,\mathbf{o}}\ \frac{1}{2}\|\mathbf{y}_n-\mathbf{U}[n-1]\mathbf{s}-\mathbf{o}\|_2^2+\lambda_2\|\mathbf{o}\|_2.\qquad\qquad(18)$$

As per (18), only o[n] is updated at time n, rather than the whole (growing with time) matrix of outliers that minimization of (17) would dictate; see also [mairal10] for a similar approximation.

Because (18) is a smooth optimization problem w.r.t. s, from the first-order optimality condition the principal component update is s[n] = (U′[n−1]U[n−1])^{−1}U′[n−1](y_n − o[n]). Interestingly, this resembles the projection approximation adopted in [yang95], and can only be evaluated after o[n] is obtained. To this end, plug s[n] back into (18) to obtain o[n] via a particular instance of the group Lasso estimator

$$\mathbf{o}[n]=\arg\min_{\mathbf{o}}\ \frac{1}{2}\left\|\left(\mathbf{I}_p-\mathbf{U}[n-1](\mathbf{U}'[n-1]\mathbf{U}[n-1])^{-1}\mathbf{U}'[n-1]\right)(\mathbf{y}_n-\mathbf{o})\right\|_2^2+\lambda_2\|\mathbf{o}\|_2\qquad(19)$$

with a single group of size equal to the data dimension p. The cost in (19) is non-differentiable at the origin and, different from e.g., ridge regression, it does not admit a closed-form solution. Upon defining

$$\mathbf{R}_n:=\mathbf{I}_p-\mathbf{U}[n-1](\mathbf{U}'[n-1]\mathbf{U}[n-1])^{-1}\mathbf{U}'[n-1]\qquad\qquad(20)$$

$$\mathbf{r}_n:=\mathbf{R}_n\mathbf{y}_n\qquad\qquad(21)$$

one can recognize o[n] as the result of applying to the pair {R_n, r_n} the multidimensional shrinkage-thresholding operator introduced in [puig_hero]. In particular, as per [puig_hero, Corollary 2] it follows that

$$\mathbf{o}[n]=\left(\mathbf{R}_n'\mathbf{R}_n+\mu_n\mathbf{I}_p\right)^{-1}\mathbf{R}_n'\mathbf{r}_n\qquad\qquad(22)$$

where the parameter μ_n > 0 is such that the resulting o[n] satisfies the scalar equation

$$\mu_n\left\|\left(\mathbf{R}_n'\mathbf{R}_n+\mu_n\mathbf{I}_p\right)^{-1}\mathbf{R}_n'\mathbf{r}_n\right\|_2=\lambda_2.\qquad\qquad(23)$$

Remarkably, one can easily determine whether o[n] = 0_p, by forming R′_n r_n and checking whether ‖R′_n r_n‖_2 ≤ λ_2. This will be the computational burden incurred to solve (19) for most n, since outliers are typically sporadic and one would expect o[n] = 0_p most of the time. When datum y_n is deemed an outlier, o[n] ≠ 0_p, and one needs to carry out the extra line search in (23) to determine μ_n and o[n] as per (22); further details can be found in [puig_hero]. Whenever an ℓ1-norm outlier regularization is adopted, the resulting counterpart of (19) can be solved using e.g., coordinate descent [da_jn_gg_2010], or the Lasso variant of least-angle regression (LARS) [mairal10].
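A minimal sketch of this operator, under the stated optimality conditions: return zero when the cheap check passes; otherwise locate μ by bisection (our choice of line search) and evaluate the closed form. The function and its arguments are illustrative, not an implementation from [puig_hero].

```python
import numpy as np

def group_shrink_threshold(R, r, lam, tol=1e-10):
    """Multidimensional shrinkage-thresholding: solves
    min_o 0.5*||r - R o||^2 + lam*||o||_2. Returns 0 when ||R'r|| <= lam;
    otherwise finds mu > 0 with mu*||(R'R + mu I)^{-1} R'r|| = lam."""
    RtR, Rtr = R.T @ R, R.T @ r
    if np.linalg.norm(Rtr) <= lam:      # cheap zero-check: no outlier declared
        return np.zeros(R.shape[1])
    I = np.eye(R.shape[1])
    o_of = lambda mu: np.linalg.solve(RtR + mu * I, Rtr)
    g = lambda mu: mu * np.linalg.norm(o_of(mu))  # increasing in mu
    lo, hi = 0.0, 1.0
    while g(hi) < lam:                  # bracket the root
        hi *= 2.0
    while hi - lo > tol * (1.0 + hi):   # bisection line search for mu
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) < lam else (lo, mid)
    return o_of(0.5 * (lo + hi))
```

When R is the identity the operator reduces to ordinary vector soft thresholding, which provides an easy correctness check.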

Moving on, the subspace update is given by the exponentially-weighted LS estimate

$$\mathbf{U}[n]=\arg\min_{\mathbf{U}}\ \sum_{\tau=1}^{n}\beta^{\,n-\tau}\|\mathbf{y}_\tau-\mathbf{U}\mathbf{s}[\tau]-\mathbf{o}[\tau]\|_2^2$$

and can be efficiently obtained from U[n−1], via a recursive LS update leveraging the matrix inversion lemma; see e.g., [yang95]. Note that the orthonormality constraint on U[n] is not enforced here, yet the deviation from orthonormality is typically small, as observed in [yang95]. Still, if orthonormal principal directions are required, an extra orthonormalization step can be carried out per iteration, or once at the end of the process. Finally, the mean estimate is obtained recursively as the exponentially-weighted average of the outlier-compensated data y_n − o[n]. The resulting online robust (OR-)PCA algorithm and its initialization are summarized under Algorithm 3, where the mean and its update have been omitted for brevity.
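A sketch of the recursive LS subspace update via the matrix inversion lemma, assuming the outlier-compensated datum y and score s are already available (names and the interface are ours): `Phi` tracks the inverse of the exponentially weighted score correlation matrix, so each update costs only rank-one arithmetic.

```python
import numpy as np

def rls_subspace_update(U, Phi, y, s, beta):
    """One recursive-LS update of the subspace estimate U, with Phi the
    running inverse of the beta-weighted score correlation matrix, updated
    via the matrix inversion (Sherman-Morrison) lemma."""
    Phi_s = Phi @ s
    Phi = (Phi - np.outer(Phi_s, Phi_s) / (beta + s @ Phi_s)) / beta
    g = Phi @ s                          # gain vector
    U = U + np.outer(y - U @ s, g)       # rank-one correction to the subspace
    return U, Phi
```

With noiseless data and β = 1 the recursion reproduces the batch LS subspace fit, which is the sanity check used below.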

For the batch case where all data in Y are available for joint processing, two data-driven criteria to select the sparsity-controlling parameter λ_2 were outlined in Section III-B. However, none of these sparsity-controlling mechanisms can be run in real time, and selecting λ_2 for subspace tracking via OR-PCA is challenging. One possibility to circumvent this problem is to select λ_2 once during a short initialization (batch) phase of OR-PCA, and retain its value for the subsequent time instants. Specifically, the initialization phase of OR-PCA entails solving (7) using Algorithm 1, with a typically small batch of data. At the end of this phase, the criteria in Section III-B are adopted to find the ‘best’ λ_2, and thus obtain the subspace estimate required to initialize the OR-PCA iterations.

Convergence analysis of the OR-PCA algorithm is beyond the scope of the present paper; convergence is instead only confirmed via simulations. The numerical tests in Section VII also show that, in the presence of outliers, the novel adaptive algorithm outperforms existing non-robust alternatives for subspace tracking.

## Vi Robustifying Kernel PCA

Kernel (K)PCA is a generalization of (linear) PCA, seeking principal components in a feature space nonlinearly related to the input space where the data live [ssm97]. KPCA has been shown effective in performing nonlinear feature extraction for pattern recognition [ssm97]. In addition, connections between KPCA and spectral clustering [HTF09, p. 548] motivate well the novel KPCA method developed in this section, to robustly identify cohesive subgroups (communities) from social network data.

Consider a nonlinear function φ that maps elements from the input space to a feature space H of arbitrarily large – possibly infinite – dimensionality. Given the transformed data {φ(y_n)}, the proposed approach to robust KPCA fits the model

$$\phi(\mathbf{y}_n)=\mathbf{m}+\mathbf{U}\mathbf{s}_n+\mathbf{o}_n+\mathbf{e}_n,\qquad n=1,\ldots,N\qquad\qquad(24)$$

by solving

$$\min_{\mathbf{m},\,\mathbf{U},\,\mathbf{S},\,\mathbf{O}}\ \frac{1}{2}\|\boldsymbol{\Phi}-\mathbf{1}_N\mathbf{m}'-\mathbf{S}\mathbf{U}'-\mathbf{O}\|_F^2+\lambda_2\|\mathbf{O}\|_{2,1},\qquad\boldsymbol{\Phi}:=[\phi(\mathbf{y}_1),\ldots,\phi(\mathbf{y}_N)]'.\qquad(25)$$

It is certainly possible to adopt the criterion (7) as well, but (25) is chosen here for simplicity in exposition. Except for the principal components’ matrix S, both the data and the unknowns in (25) are now vectors/matrices of generally infinite dimension. In principle, this challenges the optimization task, since it is impossible to store, or perform updates of, such quantities directly. For the same reasons, assuming zero-mean data φ(y_n), or the possibility of mean compensation for that matter, cannot be taken for granted here [cf. Remark 1]. Thus, it is important to explicitly consider the estimation of the mean vector m.

Interestingly, this hurdle can be overcome by endowing H with the structure of a reproducing kernel Hilbert space (RKHS), where inner products between any two members of H boil down to evaluations of the reproducing kernel k, i.e., ⟨φ(x), φ(z)⟩ = k(x, z). Specifically, it is possible to form the N × N kernel matrix K := ΦΦ′, with entries k(y_i, y_j), without directly working with the vectors in H. This so-termed kernel trick is the crux of most kernel methods in machine learning [HTF09], including kernel PCA [ssm97]. The problem of selecting k (and indirectly φ) will not be considered here.
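The kernel trick can be illustrated with a short sketch: the kernel matrix collects all feature-space inner products without ever forming the (possibly infinite-dimensional) feature vectors. The Gaussian kernel and its bandwidth parameter `gamma` are illustrative choices, not prescribed by the text.

```python
import numpy as np

def rbf_kernel_matrix(Y, gamma=1.0):
    """N x N kernel matrix K with K[i, j] = exp(-gamma * ||y_i - y_j||^2):
    all feature-space inner products, computed from input-space data only."""
    sq = np.sum(Y**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * Y @ Y.T  # pairwise squared distances
    return np.exp(-gamma * np.maximum(d2, 0.0))     # clamp tiny negatives
```

Any valid reproducing kernel yields a symmetric positive semidefinite K, which is all a kernelized solver needs to operate.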

Building on these ideas, it is shown in the sequel that Algorithm 2 can be kernelized to solve (25) at affordable computational complexity, with memory storage requirements that do not depend on the dimensionality of H.