A Dictionary-Based Generalization of Robust PCA Part I: Study of Theoretical Properties
We consider the decomposition of a data matrix assumed to be a superposition of a low-rank matrix and a component which is sparse in a known dictionary, using a convex demixing method. We consider two sparsity structures for the sparse factor of the dictionary sparse component, namely entry-wise and column-wise sparsity, and provide a unified analysis, encompassing both the undercomplete and the overcomplete dictionary cases, to show that the constituent matrices can be successfully recovered under relatively mild conditions on incoherence, sparsity, and rank. We corroborate our theoretical results with empirical evaluations in terms of phase transitions in rank and sparsity, in comparison to related techniques. An investigation of a specific application in hyperspectral imaging is included in an accompanying paper.
Leveraging the structure of a given dataset is at the heart of all machine learning and data analysis tasks. A priori knowledge about the structure often makes the problem well-posed, leading to improvements in the solutions. Perhaps the most common of these, one that is often encountered in practice, is approximate low-rankness of the dataset, which is exploited by the popular principal component analysis (PCA). The low-rank structure encapsulates the model assumption that the data in fact spans a lower-dimensional subspace than the ambient dimension of the data. However, in a number of applications, the data may not be inherently low-rank, but may be decomposed as a superposition of a low-rank component and a component which has a sparse representation in a known dictionary. This scenario is particularly interesting in target identification applications [2, 3], where a priori knowledge of the target signatures (the dictionary) can be leveraged for localization.
In this work, we analyze a matrix demixing problem where a data matrix M ∈ R^{n×m} is formed via a superposition of a low-rank component L of rank r, for r ≪ min(n, m), and a dictionary sparse part DS. Here, the matrix D ∈ R^{n×d} is an a priori known dictionary, and S ∈ R^{d×m} is an unknown sparse coefficient matrix. Specifically, we will study the following model for M:

M = L + DS, (1)

and identify the conditions under which the components L and DS can be successfully recovered given M and D by solving appropriate convex formulations.
We consider the demixing problem described above for two different sparsity models on the coefficient matrix S. First, we consider the case where S has at most s non-zero entries in total (entry-wise sparse case), and second, the case where S has s_c non-zero columns (column-wise sparse case). To this end, we develop the conditions under which solving

min_{L, S} ||L||_* + λ_e ||S||_1 subject to M = L + DS (D-RPCA(E))

for the entry-wise sparsity case, and

min_{L, S} ||L||_* + λ_c ||S||_{1,2} subject to M = L + DS (D-RPCA(C))

for the column-wise sparse case, will recover L and S for regularization parameters λ_e and λ_c, respectively, given the data M and the dictionary D. Here, the known dictionary D can be overcomplete (fat, i.e., d > n) or undercomplete (thin, i.e., d ≤ n).
Here, “D-RPCA” refers to “Dictionary based Robust Principal Component Analysis”, while the qualifiers “E” and “C” indicate the entry-wise and column-wise sparsity patterns, respectively. In addition, ||·||_*, ||·||_1, and ||·||_{1,2} refer to the nuclear norm, the ℓ1-norm of the vectorized matrix, and the ℓ1,2-norm (the sum of the ℓ2-norms of the columns), respectively, which serve as convex relaxations of rank, entry-wise sparsity, and column-wise sparsity, respectively.
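As a concrete, purely illustrative sketch of the entry-wise formulation, the following NumPy implementation solves min ||L||_* + λ||S||_1 subject to M = L + DS via a linearized ADMM. The solver design, step sizes, and parameter choices here are our own assumptions; the paper's focus is on recovery guarantees, not on a particular algorithm.

```python
import numpy as np

def svt(X, tau):
    # singular value thresholding: proximal operator of tau * nuclear norm
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ (np.maximum(s - tau, 0.0)[:, None] * Vt)

def soft(X, tau):
    # entry-wise soft thresholding: proximal operator of tau * l1 norm
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def drpca_e(M, D, lam, rho=1.0, iters=300):
    """Linearized ADMM sketch for
       min ||L||_* + lam * ||S||_1  subject to  M = L + D S."""
    n, m = M.shape
    d = D.shape[1]
    L, S, Y = np.zeros((n, m)), np.zeros((d, m)), np.zeros((n, m))
    eta = rho * np.linalg.norm(D, 2) ** 2   # majorization constant for the S step
    for _ in range(iters):
        L = svt(M - D @ S + Y / rho, 1.0 / rho)
        R = M - L - D @ S + Y / rho          # scaled residual
        S = soft(S + (rho / eta) * (D.T @ R), lam / eta)
        Y = Y + rho * (M - L - D @ S)        # dual ascent on the constraint
    return L, S
```

The S update is a single proximal-gradient (linearized) step because the dictionary couples the entries of S; for D = I this reduces to the usual RPCA ADMM updates.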
These two types of sparsity patterns capture different structural properties of the dictionary sparse component. The entry-wise sparsity model allows individual data points to span low-dimensional subspaces while still allowing the dataset to span the entire space. In the case of a column-wise sparse coefficient matrix S, the component DS is also column-wise sparse. Therefore, this model effectively captures structured corruptions (which depend upon the dictionary D) in the otherwise low-rank columns of the data matrix M. Note that the non-zero columns of S are not themselves restricted to be sparse in the column-wise sparsity model.
A wide range of problems can be expressed in the form described in (1). Perhaps the most celebrated of these is principal component analysis (PCA), which can be viewed as a special case of (1) with the matrix S set to zero. Next, in the absence of the component L, the problem reduces to that of sparse recovery [4, 5, 6]; see the references therein for an overview of related works. Further, the popular framework of Robust PCA tackles the case where the dictionary D is an identity matrix [8, 9]; variants include [10, 11, 12, 13].
The model described in (1) is also closely related to one considered in prior work on detecting network traffic anomalies, which explores the overcomplete dictionary setting. However, the analysis therein applies to the case where the dictionary is overcomplete with orthogonal rows, and the coefficient matrix has a small number of non-zero elements per row and column, which may be restrictive assumptions in some applications.
In particular, for the entry-wise case, the model shown in (1) is well suited to a number of applications. For example, it can be used for target identification in hyperspectral imaging [2, 3], and in topic modeling applications to identify documents with certain properties. We analyze and demonstrate the application of this model to a hyperspectral demixing task in an application-focused extension of this work. Further, in source separation tasks, a variant of this model was used for singing voice separation in [17, 18]. In addition, we can also envision source separation tasks where L is not low-rank, but can in turn be modeled as being sparse in a known or unknown dictionary.
For the column-wise setting, model (1) is also closely related to outlier identification [21, 22, 23, 24], which is motivated by a number of contemporary “big data” applications. Here, the column-wise sparse matrix DS, whose non-zero columns are called outliers in this regime, may itself be of interest and can be used for identifying malicious responses in collaborative filtering applications, finding anomalous patterns in network traffic, or estimating visually salient regions of images [27, 28, 29].
I-B Our Contributions
As described above, we propose and analyze a dictionary based generalization of robust PCA as shown in (1). Here, we consider two distinct sparsity patterns of S, i.e., entry-wise and column-wise sparsity, arising from different structural assumptions on the dictionary sparse component. Our specific contributions for each sparsity pattern are summarized below.
Entry-wise case: We make the following contributions towards guaranteeing the recovery of L and S via the convex optimization problem D-RPCA(E). First, we analyze the thin case (i.e., d ≤ n), where we assume that the matrix S has at most s non-zero elements globally, i.e., ||S||_0 ≤ s. Next, for the fat case, we first extend existing analysis to eliminate the orthogonality constraint on the rows of the dictionary D. Further, we relax the sparsity constraints previously required on the rows and columns of the sparse coefficient matrix S, to study the case where S has at most s non-zero elements globally, with at most k non-zero elements per column. Hence, we provide a unified analysis for both the thin and the fat cases, making the model (1) amenable to a wide range of applications.
Column-wise case: We propose and analyze a dictionary based generalization of robust PCA, specifically of Outlier Pursuit (OP), wherein the coefficient matrix S admits a column-sparse structure whose non-zero columns can be viewed as “outliers”.
Note that, in this case, there is an inherent ambiguity regarding the recovery of the true component pair (L, DS) corresponding to the low-rank part and the dictionary sparse component, respectively. Specifically, any pair (L', S') satisfying L' + DS' = M, where L' and L have the same column space, and S' and S have identical column support, is a solution of D-RPCA(C). To this end, we develop sufficient conditions under which solving the convex optimization D-RPCA(C) recovers the column space of the low-rank component L, while identifying the outlier columns of S. Here, the difference between D-RPCA(C) and OP is the inclusion of the known dictionary D.
Next, we demonstrate how a priori knowledge of the dictionary helps us identify the corrupted columns, via phase transitions in rank and sparsity for recovery of the outlier columns. Specifically, we show that in comparison to OP, D-RPCA(C) works for potentially higher ranks of L when the number of outlier columns is a fixed proportion of the total number of columns m.
The thin dictionary case – an interesting result: When the dictionary is thin, i.e., d ≤ n, one can envision a pseudo-inverse based technique, wherein we pre-multiply both sides of (1) with the Moore-Penrose pseudo-inverse D† (this is not applicable in the fat case due to the non-trivial null space of the dictionary). This operation leads to a formulation which resembles the robust PCA (RPCA) [8, 9] model for the entry-wise case and Outlier Pursuit (OP) for the column-wise case, i.e.,

D†M = D†L + S.
An interesting finding of our work is that although this transformation algebraically reduces the entry-wise and column-wise sparsity cases to the Robust PCA and OP settings, respectively, the specific model assumptions of Robust PCA and OP may not hold for all choices of dictionary size d and rank r. Specifically, we find that when r is comparable to d, this pre-multiplication may not lead to a “low-rank” D†L. This suggests that the notion of “low” or “high” rank is relative to the maximum possible rank of D†L, which in this case is d. Therefore, if r ≥ d, D†L can be full rank, and the low-rank assumptions of RPCA and OP may no longer hold. As a result, these two models (the pseudo-inversed case and the current work) cannot be used interchangeably for the thin dictionary case. We corroborate this via the experimental evaluations presented in Section V. The code is made available at github.com/srambhatla/Dictionary-based-Robust-PCA, and the results are therefore reproducible.
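The rank-inflation effect described above is easy to reproduce numerically; the dimensions below are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 50, 10, 40
r = d                                    # rank comparable to dictionary size
D = rng.standard_normal((n, d))          # thin dictionary, full column rank
L = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))  # rank-r matrix
Dp = np.linalg.pinv(D)                   # Moore-Penrose pseudo-inverse
# Dp @ L is a d x m matrix; its rank generically equals min(d, r) = d,
# i.e., it is full rank, so the "low-rank" premise of RPCA/OP fails here.
print(np.linalg.matrix_rank(Dp @ L))
```

For r much smaller than d, the same product remains genuinely low-rank, which is why the pseudo-inverse reduction only sometimes agrees with RPCA/OP assumptions.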
The rest of the paper is organized as follows. We formalize the problem, introduce the notation, and describe various considerations on the structure of the component matrices in Section II. In Section III, we present our main theorems for the entry-wise and column-wise cases along with discussion on the implication of the results, followed by an outline of the analysis in Section IV. Numerical evaluations are provided in Section V. Finally, we summarize our contributions and conclude this discussion in Section VI with insights on future work.
Notation: Given a matrix X, we use ||X|| for the spectral norm, where ||X|| denotes the maximum singular value of the matrix; ||X||_* for the nuclear norm; ||X||_1 for the ℓ1-norm of the vectorized matrix; and ||X||_{1,2} for the sum of the ℓ2-norms of the columns. Here, X_{ij} denotes the (i, j)-th element of X, and e_k denotes the canonical basis vector with 1 at the k-th location.
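The three regularizing norms just listed are straightforward to compute; a small NumPy sketch (function names are ours):

```python
import numpy as np

def nuclear_norm(X):
    # sum of singular values: convex surrogate for rank
    return np.linalg.svd(X, compute_uv=False).sum()

def l1_norm(X):
    # l1 norm of the vectorized matrix: surrogate for entry-wise sparsity
    return np.abs(X).sum()

def l12_norm(X):
    # sum of column l2 norms: surrogate for column-wise sparsity
    return np.linalg.norm(X, axis=0).sum()
```

For a diagonal matrix all three coincide with the sum of the absolute diagonal entries, which makes a convenient sanity check.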
We start by formalizing the problem set-up and introducing the model parameters pertinent to our analysis. We begin our discussion with our notion of optimality for the two sparsity modalities; we also summarize the notation in Table I in the appendix.
II-A Optimality of the Solution Pair
For the entry-wise case, we aim to recover the low-rank component L and the sparse coefficient matrix S, given the dictionary D and data M generated according to the model described in (1). Recall that s is the global sparsity, and k denotes the number of non-zero entries in a column of S when the dictionary is fat.
In the column-wise sparsity setting, due to the inherent ambiguity in the model (1), as discussed in Section I-B, we can only hope to recover the column space of the low-rank matrix and the identities of the non-zero columns of the sparse matrix. Therefore, in this case, any solution in the Oracle Model (defined below) is deemed optimal.
Definition D.1 (Oracle Model for Column-wise Sparsity Case).
Let the pair (L, S) be the matrices forming the data M as per (1). Then, any pair (L', S') is in the Oracle Model if L' + DS' = M, P_U(L') = L', and P_C(S') = S' hold simultaneously, where P_U and P_C are projections onto the column space of L and the column support of S, respectively.
II-B Conditions on the Dictionary
We require that the dictionary D satisfy the generalized frame property (GFP), defined as follows.
A matrix D satisfies the generalized frame property (GFP) on a set of vectors R if, for any fixed non-zero vector v ∈ R, we have

α_L ||v||² ≤ ||Dv||² ≤ α_U ||v||²,

where α_L and α_U are the lower and upper generalized frame bounds, with 0 < α_L ≤ α_U < ∞.
The GFP shown above is met as long as the vectors in R are not in the null space of the matrix D, and D has a finite α_U. Therefore, in the thin dictionary setting, for both the entry-wise and column-wise cases, R can be the entire space, and the GFP is satisfied as long as D has full column rank. For example, D being a frame suffices; see the literature for a brief overview of frames.
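For a thin, full-column-rank D, the GFP holds on the entire space with frame bounds given by the extreme squared singular values of D. The following NumPy check illustrates this; the dimensions are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(1)
D = rng.standard_normal((30, 8))                   # thin; full column rank w.h.p.
svals = np.linalg.svd(D, compute_uv=False)
alpha_L, alpha_U = svals[-1] ** 2, svals[0] ** 2   # generalized frame bounds
v = rng.standard_normal(8)                         # arbitrary test vector
energy = np.linalg.norm(D @ v) ** 2
# check: alpha_L * ||v||^2 <= ||D v||^2 <= alpha_U * ||v||^2
```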
On the other hand, for the fat dictionary setting, we need the set R to be structured such that the GFP is met for both the entry-wise and column-wise sparsity cases. Specifically, for the entry-wise sparsity case, we also require that the frame bounds α_L and α_U be close to each other. To this end, we assume that D satisfies the restricted isometry property (RIP) with a restricted isometry constant (RIC) δ in this case, so that α_L = 1 − δ and α_U = 1 + δ.
II-C Relevant Subspaces
We now define the subspaces relevant to our discussion. In what follows, let the pair (L, S) denote the solution to D-RPCA(E) in the entry-wise sparse case. Further, for the column-wise sparse setting, let (L', S') denote a solution pair in the oracle model as defined in D.1, obtained by solving D-RPCA(C).
For the low-rank matrix L, let the compact singular value decomposition (SVD) be defined as

L = U Σ V^T,

where U and V contain the left and right singular vectors of L, respectively, and Σ is the diagonal matrix with the singular values on its diagonal. Here, the matrices U and V each have orthonormal columns, and the non-negative entries of Σ are arranged in descending order. We define T as the linear subspace consisting of matrices spanning the same row or column space as L, i.e.,

T = { U X^T + Y V^T : X ∈ R^{m×r}, Y ∈ R^{n×r} }.
Next, let Ω (Ω_c for the column-wise sparsity setting) be the space spanned by matrices with the same non-zero support (column support, denoted C) as S, and let 𝒟 denote the space spanned by the dictionary sparse component under our model, defined as

𝒟 = { DX : X ∈ Ω }.

Here, C denotes the index set containing the non-zero column indices of S in the column-wise case.
Also, we denote the corresponding complements of the spaces described above by appending ‘⊥’. In addition, we use a calligraphic ‘P_S’ to denote the projection operator onto a subspace S, and ‘P_S’ to denote the corresponding projection matrix. For instance, we define P_U and P_V as the projection operators corresponding to the column space and row space of the low-rank component L. Therefore, for a given matrix X,

P_U(X) = P_U X and P_V(X) = X P_V,

where P_U = U U^T and P_V = V V^T. With this, the projection operators onto, and orthogonal to, the subspace T are respectively defined as

P_T(X) = P_U X + X P_V − P_U X P_V and P_{T⊥}(X) = (I − P_U) X (I − P_V).
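The projection onto T and onto its complement take the closed forms standard in the RPCA literature, P_T(X) = P_U X + X P_V − P_U X P_V and P_{T⊥}(X) = (I − P_U) X (I − P_V); a NumPy sketch with illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, r = 12, 10, 3
L = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))   # rank-r matrix
U, _, Vt = np.linalg.svd(L, full_matrices=False)
U, Vt = U[:, :r], Vt[:r, :]           # compact SVD factors
PU, PV = U @ U.T, Vt.T @ Vt           # projection matrices P_U, P_V

def P_T(X):
    # project X onto T: matrices sharing L's column space or row space
    return PU @ X + X @ PV - PU @ X @ PV

def P_T_perp(X):
    # project X onto the orthogonal complement of T
    return (np.eye(n) - PU) @ X @ (np.eye(m) - PV)
```

By construction P_T fixes L, is idempotent, and together with P_T_perp splits any matrix exactly.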
II-D Incoherence Measures and Other Parameters
We employ various notions of incoherence to identify the conditions under which our procedures succeed. To this end, we first define the incoherence parameter μ, which characterizes the relationship between the low-rank part L and the dictionary sparse part DS, as
The parameter μ measures the degree of similarity between the low-rank part and the dictionary sparse component. Here, a larger μ implies that the dictionary sparse component is close to the low-rank part, while a small μ indicates otherwise. In addition, we also define a second parameter as
which measures the similarity between the orthogonal complement of the column space of L and the dictionary D.
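As a simple numerical proxy (not the paper's exact definition of these parameters), one can gauge how much each dictionary column leans into the column space of L via the cosine of the angle between the column and span(U):

```python
import numpy as np

def colspace_coherence(D, U):
    # cosine of the angle between each column of D and span(U);
    # U is assumed to have orthonormal columns
    proj = U @ (U.T @ D)
    return np.linalg.norm(proj, axis=0) / np.linalg.norm(D, axis=0)

U = np.eye(4)[:, :2]                  # column space = first two coordinates
D = np.array([[1.0, 0.0],
              [0.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
# first column lies inside span(U) -> value 1; second is orthogonal -> value 0
```

Values near 1 correspond to the problematic regime where the dictionary mimics the low-rank part; values near 0 to the incoherent, easy regime.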
The next two measures of incoherence can be interpreted as a way to identify cases where, for L with SVD L = U Σ V^T: (a) U resembles the dictionary D, and/or (b) V resembles the sparse coefficient matrix S. In these cases, the low-rank part may mimic the dictionary sparse component. To this end, we define the following quantities to measure these properties, respectively, as
Here, the first parameter achieves its upper bound when a dictionary element is exactly aligned with the column space of L. Moreover, the second achieves its upper bound when the row space of L is “spiky”, i.e., a certain row of V is 1-sparse, meaning that a column of L is supported by (can be expressed as a linear combination of) a single column of U. The lower bound is attained when the row space is “spread out”, i.e., each column of L is a linear combination of all columns of U. In general, recovery of the two components is easier when these incoherence parameters are closer to their lower bounds. Further, for notational convenience, we define the constants
Here, the first constant involves the maximum absolute entry of V, which measures how close the columns of L are to its singular vectors. Similarly, for the column-wise case, the second constant measures the closeness of the columns of L to its singular vectors under a different metric (the maximum column-wise ℓ2-norm).
III Main Results
We present the main results corresponding to each sparsity structure of in this section.
III-A Exact Recovery for Entry-wise Sparsity Case
Our main result establishes the existence of a regularization parameter λ_e for which solving the optimization problem D-RPCA(E) will recover the components L and S exactly. To this end, we will show that such a λ_e belongs to a non-empty interval [λ_e^min, λ_e^max], with λ_e^min and λ_e^max defined as
where C is a constant that captures the relationship between the different model parameters, and is defined as
and is defined as
Given these definitions, we formalize the theorem for the entry-wise case as follows; its corresponding analysis is provided in Section IV-A.
Suppose M = L + DS, where rank(L) = r and S has at most s non-zeros, i.e., ||S||_0 ≤ s. Given the parameters defined in (2), (4), and (5), any λ_e ∈ [λ_e^min, λ_e^max] with λ_e^min, λ_e^max defined in (6), and a dictionary D obeying the generalized frame property D.2 with frame bounds [α_L, α_U], solving D-RPCA(E) will recover the matrices L and S if the following conditions hold:
For d ≤ n, R may contain the entire space, and the sparsity s obeys
For d > n, for a constant c, R consists of all sparse vectors, and s obeys
Theorem 1 establishes sufficient conditions for the existence of a λ_e that guarantees recovery of (L, S) for both the thin and the fat cases. The conditions on the sparsity s dictated by (7) and (8), for the thin and fat cases, respectively, arise from ensuring that λ_e^min ≤ λ_e^max. Further, ensuring a valid regularization parameter translates to the following sufficient condition on the rank r in terms of the sparsity s,
for the recovery of (L, S). This relationship matches our empirical evaluations and will be revisited in Section V-A.
We note that for both the thin and fat dictionary cases, the conditions are closely related to the incoherence measures between the low-rank part L, the dictionary D, and the sparse component S. In general, smaller sparsity, rank, and incoherence parameters suffice to ensure the recovery of the components for a particular problem. This is in line with our intuition – the more distinct the two components, the easier it should be to tease them apart. Moreover, we observe that the theorem imposes an upper bound on the global sparsity s. This bound is similar to results in related deterministic analyses, and is due to the deterministic nature of our analysis with respect to the locations of the non-zero elements of S.
III-B Exact Recovery for Column-wise Sparsity Case
Recall that we consider the oracle model in this case, as described in D.1, owing to the intrinsic ambiguity in the recovery of (L, S); see our discussion in Section I-B. To demonstrate recoverability, the following lemma establishes sufficient conditions for the existence of an optimal pair (L', S'). The proof is provided in Appendix A-B.
Given , , and , any pair satisfies and if .
Analogous to the entry-wise case, we will show the existence of a non-empty interval for the regularization parameter λ_c, for which solving D-RPCA(C) recovers an optimal pair (L', S') as per Lemma 2. Here, for a constant C, λ_c^min and λ_c^max are defined as
Then, our main result for the column-wise case is as follows, and its analysis is provided in Section IV-B.
Suppose M = L + DS, with (L, S) defining the oracle model D.1, where S has s_c non-zero columns. Given the parameters defined in (2), (3), (4), and (5), and any λ_c ∈ [λ_c^min, λ_c^max] with λ_c^min, λ_c^max defined in (10), solving D-RPCA(C) will recover a pair of components (L', S') if the set R is structured such that the dictionary D obeys the generalized frame property D.2 with frame bounds [α_L, α_U].
Theorem 3 states the conditions under which the solution to the optimization problem D-RPCA(C) will be in the oracle model defined in D.1. The condition on the column sparsity s_c is a result of the constraint that λ_c^min ≤ λ_c^max. Similar to (9), requiring a valid regularization parameter leads to the following sufficient condition on the rank r in terms of the sparsity s_c,
Moreover, suppose that α_L ≈ α_U, which can easily be met by a tight frame when d ≤ n, or by a RIP-type condition when d > n. Further, if the relevant incoherence parameter is a constant, the resulting upper bound on the rank is of the same order as the corresponding upper bound in Outlier Pursuit (OP).
IV Proof of Main Results
IV-A Proof of Theorem 1
We use a dual certificate construction procedure to prove the main result in Theorem 1; the proofs of all lemmata used here are provided in Appendix A-A. To this end, we start by constructing a dual certificate for the convex problem D-RPCA(E). We first state the conditions the dual certificate needs to satisfy via the following lemma.
If there exists a dual certificate satisfying
then the pair (L, S) is the unique solution of D-RPCA(E).
We now proceed with the construction of a dual certificate which satisfies the conditions (C1)-(C4) outlined by Lemma 4. Using an analysis similar to prior work, we construct the dual certificate as
for arbitrary . The condition (C1) is readily satisfied by our choice of . For (C2), we substitute the expression for to arrive at
we can write (12) as . Further, we can vectorize the equation above as . Let be a length vector containing elements of corresponding to the support of . Now, note that can be represented in terms of a Kronecker product as follows,
On defining , we have . Further, let denote the rows of that correspond to support of , and let correspond to the remaining rows of . Using these definitions and results, we have . Thus, for conditions (C1) and (C2) to be satisfied, we need
Here, the following result ensures the existence of the inverse.
If and , satisfies the bound .
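The vectorization used in the construction above relies on the standard identity vec(DS) = (I_m ⊗ D) vec(S) for column-major vectorization, which can be sanity-checked numerically (dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, m = 5, 3, 4
D = rng.standard_normal((n, d))
S = rng.standard_normal((d, m))

vec = lambda X: X.flatten(order="F")    # column-major (stack columns)
lhs = vec(D @ S)
rhs = np.kron(np.eye(m), D) @ vec(S)    # vec(D S) = (I_m kron D) vec(S)
```

Restricting the Kronecker matrix to the rows/columns indexed by the support of S gives the submatrix whose invertibility the lemma above guarantees.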
Now, we look at condition (C3). This is where our analysis departs from prior work; we write
where we have used the facts noted above. Now, as D† is the pseudo-inverse of D, i.e., D†D = I, we have that ||D†|| = 1/σ_min(D), where σ_min(D) is the smallest singular value of D. Therefore, we have
The following lemma establishes an upper bound on .
satisfies the bound .
Now, we move on to finding conditions under which (C4) is satisfied by our dual certificate. For this, we will bound the relevant norm of the certificate. Our analysis follows a procedure similar to that employed in deriving (16) in prior work, reproduced here for completeness. First, by the definition of the certificate and properties of the norm, we have
We now focus on simplifying the term . By definition of , and using the fact that , we have , which implies
where we have used the result shown in (13).
Now, since can be written as
Now, using the following upper bound on ,
satisfies the bound .
and on defining
where we have the following bound for .
satisfies the bound , where
where and is defined as
By simplifying (18), we arrive at the lower bound for λ_e as in (6), from which (C4) holds. Gleaning from the expressions for λ_e^min and λ_e^max, we observe the conditions required for the existence of a λ_e that can recover the desired matrices. This completes the proof. ∎
Characterizing the interval for λ_e: In the previous section, we characterized λ_e^min and λ_e^max based on the dual certificate construction procedure. For the recovery of the true pair (L, S), we require the interval [λ_e^min, λ_e^max] to be non-empty, i.e.,
Conditions for thin D:
To simplify the analysis, we assume, without loss of generality, that the frame bounds are related by a constant factor, i.e., α_U ≤ c α_L for a constant c. With this assumption in mind, we will analyze the following cases for the global sparsity s.
Case 1: . For this case is given by
From (19), we have , which leads to
As per the GFP of D.2, we also require that . Therefore we arrive at
Further, since , we require the numerator to be positive, and since the lower bound on , we have
which also implies . Now, the condition implies
Since the R.H.S. of this inequality is upper bounded (the bound being achieved when the incoherence parameters are zero), this condition on s is satisfied by our assumption.
Case 2: . For this case, we have
From (19), we have
Again, due to the requirement that the interval be non-empty, following a similar argument as in the previous case, we conclude that