# The Use of Mutual Coherence to Prove ℓ1/ℓ0-Equivalence in Classification Problems

Chelsea Weaver (current address: Amazon Web Services, Seattle, WA), Naoki Saito
Department of Mathematics
University of California, Davis
One Shields Avenue
Davis, California, 95616, United States
###### Abstract

We consider the decomposition of a signal over an overcomplete set of vectors. Minimization of the $\ell^1$-norm of the coefficient vector can often retrieve the sparsest solution (so-called “$\ell^1/\ell^0$-equivalence”), a generally NP-hard task, and this fact has powered the field of compressed sensing. Wright et al.’s sparse representation-based classification (SRC) applies this relationship to machine learning, wherein the signal to be decomposed represents the test sample and columns of the dictionary are training samples. We investigate the relationships between $\ell^1$-minimization, sparsity, and classification accuracy in SRC. After proving that the tractable, deterministic approach to verifying $\ell^1/\ell^0$-equivalence fundamentally conflicts with the high coherence between same-class training samples, we demonstrate that $\ell^1$-minimization can still recover the sparsest solution when the classes are well-separated. Further, using a nonlinear transform so that sparse recovery conditions may be satisfied, we demonstrate that approximate (not strict) equivalence is key to the success of SRC.

###### keywords:
sparse representation, representation-based classification, mutual coherence, compressed sensing
###### Msc:
[2016] 00-01, 99-00
journal: Applied and Computational Harmonic Analysis

## 1 Introduction

The decomposition of a given signal or sample over a pre-determined set of vectors is a technique often used in signal processing and pattern recognition. We can store a signal by decomposing it over a fixed basis and keeping only the largest coefficients; in linear regression, predictions are made by estimating parameters via least-squared error using the training data. In the case that the system is underdetermined, so that an infinite number of representations of the signal or sample exist, regularization is often used to make the problem well-posed. The question, naturally, is how to choose the type of regularization used, so that the representation is well-suited to the task at hand and can be found efficiently.

In compressed sensing, a fairly recent advancement in signal processing, it is assumed that a vector of signal measurements is represented using an overcomplete set of vectors (often called a dictionary) and that the (unknown) coefficient vector is sparse. Obtaining this sparse solution vector is the key to recovering the complete signal in a way that requires fewer measurements than traditional methods can:cs (). Thus, to determine the unknown coefficients, an appropriate regularization term should enforce sparsity, i.e., seek the solution requiring the fewest nonzero coefficients. Determining tractable methods for solving such optimization problems is the core of compressed sensing, as minimizing the $\ell^0$-“norm” (which counts the number of nonzero coefficients) is NP-hard in general. However, in addition to successful greedy methods such as orthogonal matching pursuit tro:omp (), it was found that sparse regularization can, in many circumstances, be replaced with minimization of the $\ell^1$-norm (which sums the coefficient magnitudes) to the same effect. That is, under certain conditions, minimization of the $\ell^1$-norm is equivalent to sparse regularization, hence the term “$\ell^1/\ell^0$-equivalence”. Though it still requires an iterative algorithm, this relaxation to $\ell^1$-minimization reduces the optimization problem to a linear program, which can be solved efficiently. A large body of work (see, for example, the seminal papers by Candès and Tao can:decode () and Donoho don:cs ()) shows that, under certain conditions, $\ell^1$-minimization exactly recovers the sparsest solution, and analogous results hold in the case of noisy data. We review some of these results in Section 2.2.

A technique similar to that used in compressed sensing has been successfully applied to tasks in pattern recognition. The popular classification method sparse representation-based classification (SRC) wri:src (), proposed by Wright et al. in 2009, classifies a given test sample by decomposing it over an overcomplete set of training samples so that the $\ell^1$-norm of the coefficient vector is minimized. The test sample is assigned to the class with the most contributing coefficients (in terms of reconstruction). By minimizing the $\ell^1$-norm, the goal is that the sparsest such representation will be found (as in compressed sensing), and that this will automatically produce nontrivial nonzero coefficients at training samples in the same class as the test sample, rendering correct classification. Similar approaches have been used in dimensionality reduction qiao:spp (), semi-supervised learning chen:l1graph (), and clustering chen:l1graph ().

In this paper, we investigate the role of sparsity in SRC, specifically, the two-fold question of: (i) whether or not $\ell^1/\ell^0$-equivalence can be achieved in practice, i.e., whether $\ell^1$-minimization reliably produces the sparsest solution in the classification context; and (ii) whether this equivalence is necessary for good classification performance. The inherent problem with (i) is that practically-implementable recovery conditions under which $\ell^1$-minimization is guaranteed to find the sparsest solution require that the vectors in the dictionary be incoherent, or in some way “spread out” in space. These guarantees hold with high probability, for example, on dictionaries of vectors that are randomly-generated from certain probability distributions and dictionaries consisting of randomly-selected rows of the discrete Fourier transform matrix can:rob (); don:und (); can:decode (). Obviously, unlike these examples, data samples in the same class are often highly-correlated. In fact, strong inner-class similarity generally makes the data easier to classify.

Our contributions in this paper are the following:

1. We show that the fundamental assumptions of SRC are in direct contradiction with applicable and tractable sparse recovery guarantees. It follows that the experimental success of SRC should not automatically imply the usefulness of sparsity in this framework.

2. Using a randomly-generated database designed to model facial images, we show that $\ell^1$-minimization can still recover the sparsest solution on highly-correlated data, provided that the classes are sufficiently well-separated. Thus the lack of an implementable equivalence guarantee does not automatically imply lack of equivalence in SRC, at least on certain databases.

3. We investigate the feasibility and implementation of a nonlinear transform that maximally spreads out the training samples in each class while maintaining the dataset’s class structure. Though there are strict limitations on the design of such a transform, which we describe in detail in Section 7, we demonstrate that the higher-dimensional space can allow for the application of equivalence guarantees while still allowing us to classify the dataset. This renders a method for examining the relationship between classification accuracy and the sparsity of the coefficient vector in SRC, and how close this is to the (provably) sparsest solution. We demonstrate that approximate (and not strict) equivalence between the $\ell^1$-minimized solution and the sparsest solution is the key to the success of SRC.

The paper is organized as follows: We begin by motivating and reviewing the basics of compressed sensing and sparsity recovery guarantees in Section 2, and we give an overview of SRC in Section 3. In Section 4, we formally describe the conflict between $\ell^1$-recovery guarantees and classification data, and in Section 5, we rigorously assess the applicability of these recovery guarantees in the classification context. Section 6 presents empirical findings relating sparse recovery and highly-correlated data. In Section 7, we investigate the feasibility of a nonlinear data transform that forces the aforementioned recovery guarantees to hold, and the insights that can be gained from this procedure. We conclude this paper in Section 8.

## 2 Compressed Sensing and Recovery Guarantees

In this section, we detail the motivation behind $\ell^1/\ell^0$-equivalence and state practically-implementable equivalence theorems.

### 2.1 Motivation from Compressed Sensing

Suppose that we wish to collect information about (i.e., sample or take measurements of) a continuous signal and then send or store this information in an efficient manner. For example, the signal could be a sound wave or an image. Also suppose that a good approximation of the original signal must later be recovered. According to the Nyquist/Shannon sampling theorem, we must sample the signal at a rate of at least twice its maximum frequency in order to be able to reconstruct it exactly shan:samp_thm (). But in some applications, doing so may be expensive or even impossible.

In the circumstances that we are able to take many measurements of the signal to obtain its discrete analog $f \in \mathbb{R}^N$, one efficient method of compressing it is the following procedure: Let the columns of $\Psi \in \mathbb{R}^{N \times N}$ form an orthonormal basis for $\mathbb{R}^N$, and suppose that $f$ has a sparse representation in this basis, i.e., that we can write $f = \Psi\alpha$, where $\alpha \in \mathbb{R}^N$ is sparse. Setting all but the $k$ largest (in absolute value) entries of $\alpha$ to 0 in order to obtain $\alpha_k$, it can be shown that $\Psi\alpha_k$ gives the best $k$-term least squares approximation of $f$ in this basis. Clearly, the sparser $\alpha$ is, the better the approximation of $f$ we obtain, and in the case that $\alpha$ has no more than $k$ nonzero coefficients, we recover $f$ exactly. This is the basic idea behind so-called transform coding, the most popular example of which is the JPEG image compression standard pen:jpeg (), which uses the discrete cosine transform as the sparsifying basis $\Psi$.

The problem with this procedure is that it is inefficient to collect all $N$ samples if we are only going to throw most (all but $k$) of them away when the signal is compressed. This is the motivation behind compressed sensing, originally proposed by Candès and Tao can:decode () and Donoho don:cs () (see also Candès and Tao’s work can:near_opt () and the paper by Candès et al. can:sta ()). Let $\Phi \in \mathbb{R}^{m \times N}$ be a sensing or measurement matrix with $m < N$, and consider the underdetermined system

$$y_0 := \Phi f = \Phi\Psi\alpha = X\alpha$$

for sparse $\alpha$, where we have set $X := \Phi\Psi$. Using $\|\alpha\|_0$ to denote the number of nonzero coordinates of $\alpha$ (hence the terminology “$\ell^0$-‘norm’ ”; observe that $\|\cdot\|_0$ is only a pseudonorm because it does not satisfy homogeneity), we would ideally recover $f$ by solving the optimization problem

$$\alpha_0 := \arg\min_{\alpha \in \mathbb{R}^N} \|\alpha\|_0 \quad \text{subject to} \quad X\alpha = y_0 \tag{1}$$

and setting $f = \Psi\alpha_0$. Unfortunately, solving Eq. (1) is NP-hard. When $X$ satisfies certain conditions and when $\alpha_0$ is sufficiently sparse, however, the solution to Eq. (1) can be found by solving the $\ell^1$-minimization problem

$$\alpha_1 := \arg\min_{\alpha \in \mathbb{R}^N} \|\alpha\|_1 \quad \text{subject to} \quad X\alpha = y_0. \tag{2}$$

This was a riveting finding, as the optimization problem in Eq. (2) is convex and can be solved efficiently. It has been shown that, under certain conditions (e.g., when the columns of are uniformly random on the sphere ), this procedure produces an approximation of that is as good as that of its best -term approximation don:cs (). Further, theoretical and experimental results demonstrate that in many situations, the number of measurements needed to recover is significantly less than and can be much lower than the number required by the Nyquist/Shannon theorem. For example, when the measurement matrix contains i.i.d. Gaussian entries, then exact recovery of via -minimization can be achieved (with high probability) in only measurements, where can:decode ().
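To make the linear-programming view of Eq. (2) concrete, the following is a minimal sketch, assuming SciPy's `linprog` with the HiGHS backend as the solver (the source does not name a solver). The function name `l1_minimize` and the random test problem (sizes, seed) are hypothetical choices for illustration. It uses the standard split of $\alpha$ into nonnegative parts $u - v$, so that the objective becomes linear.

```python
import numpy as np
from scipy.optimize import linprog

def l1_minimize(X, y0):
    """Solve min ||alpha||_1 subject to X @ alpha = y0 (Eq. (2)) as a linear program."""
    m, N = X.shape
    # Write alpha = u - v with u, v >= 0; at an optimum ||alpha||_1 = sum(u) + sum(v).
    c = np.ones(2 * N)
    A_eq = np.hstack([X, -X])
    res = linprog(c, A_eq=A_eq, b_eq=y0, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:N] - res.x[N:]

# Hypothetical test problem: Gaussian dictionary with unit-norm columns, 4-sparse ground truth.
rng = np.random.default_rng(0)
m, N, k = 40, 100, 4
X = rng.standard_normal((m, N))
X /= np.linalg.norm(X, axis=0)
alpha_true = np.zeros(N)
alpha_true[rng.choice(N, size=k, replace=False)] = rng.standard_normal(k)
y0 = X @ alpha_true

alpha_1 = l1_minimize(X, y0)
print("l1 norm of solution:", np.abs(alpha_1).sum())
print("distance to sparse ground truth:", np.linalg.norm(alpha_1 - alpha_true))
```

Whether the printed distance is close to zero depends on the dictionary and the sparsity level; the sketch only guarantees that `alpha_1` is feasible and has $\ell^1$-norm no larger than that of `alpha_true`.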

Even more astoundingly, similar results hold in the presence of noise. Suppose that the noiseless vector is replaced with , for a vector of errors satisfying . It follows that under certain conditions (see Section 2.2.1), the -minimization problem

$$\alpha_{1,\epsilon} := \arg\min_{\alpha \in \mathbb{R}^N} \|\alpha\|_1 \quad \text{subject to} \quad \|X\alpha - y\|_2 \leq \epsilon \tag{3}$$

is guaranteed to recover a coefficient vector approximating the ground truth sparse vector (the solution to Eq. (1)) with don:sta (). The constant depends on properties of the matrix and the sparsity level .

A popular application of compressed sensing is magnetic resonance imaging (MRI), in which the measurement matrix consists of randomly-selected rows of the discrete Fourier transform in don:mri (). Other applications abound in the areas of data acquisition and compression, including sensor networks xia:sens (), seismology her:seis (), and single pixel cameras dua:sin_pix ().

### 2.2 Recovery Guarantees

The conditions under which $\ell^1$-minimization can guarantee exact or approximate recovery of the sparsest solution (e.g., conditions under which the solutions to Eq. (1) and Eq. (2) are equal, i.e., $\ell^1/\ell^0$-equivalence holds) are called recovery guarantees. These conditions concern the incoherence (or spread) of the vectors in the dictionary. Essentially, recovery guarantees cannot be applied when the vectors are too correlated. A prototypical example is that if the dataset contains two copies of the same vector (i.e., a pair of maximally-correlated vectors), then the minimum $\ell^1$-norm solution may contain a nonzero coefficient at either one of the copies or at a combination of the two. Contrast this with the sparsest solution, which would never contain nonzero coefficients at both copies.
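As a small worked illustration of the duplicate-vector example (a hypothetical dictionary, not one from the experiments), suppose the first two unit-norm columns of $X$ coincide, $x_1 = x_2$, and take $y_0 = x_1$. For every $t \in [0,1]$,

$$\alpha^{(t)} := (t,\ 1-t,\ 0,\ \ldots,\ 0)^T \quad \text{satisfies} \quad X\alpha^{(t)} = y_0 \quad \text{and} \quad \|\alpha^{(t)}\|_1 = 1.$$

Any feasible $\alpha$ obeys $1 = \|y_0\|_2 \leq \sum_i |\alpha_i|\,\|x_i\|_2 = \|\alpha\|_1$, so each $\alpha^{(t)}$ is an $\ell^1$-minimizer, whereas $\|\alpha^{(t)}\|_0 = 1$ only for $t \in \{0, 1\}$.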

There are various ways of measuring the incoherence in a dictionary, each leading to its own theory relating the solutions of Eq. (1) and Eq. (2) (or its noisy version, Eq. (3)). In this paper, we focus primarily on recovery guarantees stated in terms of mutual coherence, and we review mutual coherence-based recovery guarantees below. Unlike other approaches, the mutual coherence method is both tractable and deterministic, as we subsequently discuss.

To make the problem more general, we no longer explicitly assume the use of a sparsifying transform matrix and consider the general system $y_0 = X\alpha$, for $X \in \mathbb{R}^{m \times N}$ with $m < N$.

#### 2.2.1 Recovery Guarantees in Terms of Mutual Coherence

###### Definition 2.1.

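Given a matrix $X \in \mathbb{R}^{m \times N}$ with normalized columns (so that $\|x_i\|_2 = 1$ for $i = 1, \ldots, N$), the mutual coherence of $X$, denoted $\mu(X)$, is given by

$$\mu(X) := \max_{1 \leq i \neq j \leq N} |\langle x_i, x_j \rangle|. \tag{4}$$

Note that mutual coherence costs $O(N^2 m)$ to compute.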
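Definition 2.1 can be sketched directly in NumPy. The function name `mutual_coherence` and the two small example matrices are hypothetical; the computation forms the Gram matrix of the normalized columns and takes its largest off-diagonal magnitude.

```python
import numpy as np

def mutual_coherence(X):
    """Largest absolute inner product between distinct unit-normalized columns (Eq. (4))."""
    Xn = X / np.linalg.norm(X, axis=0)   # normalize columns, as Definition 2.1 assumes
    G = np.abs(Xn.T @ Xn)                # N x N Gram matrix
    np.fill_diagonal(G, 0.0)             # exclude the i = j terms
    return float(G.max())

# Orthonormal columns are perfectly incoherent.
mu_identity = mutual_coherence(np.eye(3))

# Adding a column that lies between two others raises the coherence.
X_example = np.array([[1.0, 0.0, 1.0],
                      [0.0, 1.0, 1.0]])
mu_example = mutual_coherence(X_example)
```

The Gram matrix product is the dominant cost of this computation.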

###### Theorem 2.1 (Donoho and Elad don:osr () ; Gribonval and Nielsen grib:union ()).

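Let $X \in \mathbb{R}^{m \times N}$, $m < N$, have normalized columns and mutual coherence $\mu(X)$. If $\alpha$ satisfies $X\alpha = y_0$ with

$$\|\alpha\|_0 < \frac{1}{2}\left(1 + \frac{1}{\mu(X)}\right), \tag{5}$$

then $\alpha$ is the unique solution to the $\ell^1$-minimization problem in Eq. (2).

This means that if $\ell^1$-minimization finds a solution with fewer than $\frac{1}{2}\left(1 + \frac{1}{\mu(X)}\right)$ nonzeros, then it is necessarily the sparsest solution and so $\ell^1/\ell^0$-equivalence holds.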
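In practice, this observation gives a simple after-the-fact check: count the nonzeros of a computed solution and compare with the right-hand side of Eq. (5). A minimal sketch follows, assuming a numerical threshold `tol` for deciding which coefficients count as nonzero; the function name `coherence_certificate`, the threshold, and the example dictionary are hypothetical.

```python
import numpy as np

def coherence_certificate(X, alpha, tol=1e-10):
    """Return (nnz, bound, certified) for the sufficient condition in Eq. (5)."""
    Xn = X / np.linalg.norm(X, axis=0)        # Theorem 2.1 assumes unit-norm columns
    G = np.abs(Xn.T @ Xn)
    np.fill_diagonal(G, 0.0)
    mu = G.max()                              # mutual coherence, Eq. (4)
    bound = np.inf if mu == 0 else 0.5 * (1.0 + 1.0 / mu)
    nnz = int(np.sum(np.abs(alpha) > tol))    # numerical stand-in for ||alpha||_0
    return nnz, bound, bool(nnz < bound)

# Example dictionary: the standard basis of R^3 plus the normalized all-ones vector.
X = np.column_stack([np.eye(3), np.ones(3) / np.sqrt(3)])
one_sparse = np.array([1.0, 0.0, 0.0, 0.0])
two_sparse = np.array([1.0, 1.0, 0.0, 0.0])

print(coherence_certificate(X, one_sparse))
print(coherence_certificate(X, two_sparse))
```

Here $\mu(X) = 1/\sqrt{3}$, so the bound is roughly 1.37: only 1-sparse representations are certified. A failed check does not mean the solution is not the sparsest, since the condition is only sufficient.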

Given noise tolerance and approximation error bound , the following theorem by Donoho et al. gives conditions for -equivalence in the noisy setting:

###### Theorem 2.2 (Donoho, Elad, and Temlyakov don:sta ()).

Let , , have normalized columns and mutual coherence . Suppose there exists an ideal noiseless signal such that and

$$\|\alpha\|_0 = k \leq \frac{1}{4}\left(1 + \frac{1}{\mu(X)}\right). \tag{6}$$

Then is the unique sparsest representation of over . Further, suppose that we only observe with . Then we have

$$\|\alpha_{1,\epsilon} - \alpha_0\|_2^2 \leq \frac{(\epsilon + \zeta)^2}{1 - \mu(X)(4k - 1)}, \tag{7}$$

where is the solution to Eq. (3).

That is, if the ideal sparse vector is sparse enough and the mutual coherence of is small enough, -minimization will give us a solution close to , with “how close” depending on the sparsity level , mutual coherence , noise tolerance , and approximation error bound .

Something can also be said regarding the support of in the noisy setting:

###### Theorem 2.3 (Donoho, Elad, Temlyakov don:sta ()).

Suppose that , where , and . Suppose that (so ). Set

$$\gamma := \frac{\sqrt{1 - \beta}}{1 - 2\beta}. \tag{8}$$

Then given the solution to Eq. (3) with exaggerated error tolerance where , we have that .

This says that when the mutual coherence is very small relative to the sparsity level, the solution to Eq. (3) has the same support as the sparsest solution . (Observe that is indeed the sparsest solution by Theorem 2.1, since .) Since and , is required in Theorem 2.3.

#### 2.2.2 Other Recovery Guarantees

There are methods of proving $\ell^1/\ell^0$-equivalence that do not involve mutual coherence. For example, those using the restricted isometry constant involve a quantification of how close any set of columns of $X$ is to being an orthonormal basis can:rip (); cai:rip (), and other guarantees use the smallest number of linearly dependent columns of $X$, defined as the spark of $X$ don:osr (). However, these approaches are generally not tractable in deterministic settings; their usefulness is largely limited to applications in which $X$ is a random matrix with known (with high probability) restricted isometry constant or spark.

Alternatively, if we desire stochastic results, there are other recovery guarantees involving versions of mutual incoherence. When applied to random matrices, these guarantees are generally stronger than those in Theorems 2.1 and 2.2 (in terms of requiring fewer measurements and/or less sparsity of the solution vector). For example, Candès and Plan can:ripless () provide conditions that guarantee recovery (with high probability) of sparse and approximately sparse solutions in the case that the rows of the dictionary are sampled independently from certain probability distributions. These conditions are in terms of incoherence defined as an upper bound on the squared norms of the rows of $X$ (either deterministically or stochastically), and require an isotropy property can:ripless (). In the case that the probability distribution has mean $0$, this property states that the covariance matrix of the probability distribution is equal to the identity matrix. In another paper can:ripless2 (), Candès and Plan guarantee probabilistic recovery in terms of a condition on mutual coherence (as defined in Definition 2.1) that is satisfied with high probability on certain random matrices. These recovery guarantees allow for the sparsity level in the case of these random matrices to be notably larger than in Eq. (5) in Theorem 2.1. We also mention the results by Tropp tro:ran_sub () concerning recovery in terms of mutual coherence and the extreme singular values of randomly-chosen subsets of dictionary columns.

If we do not assume that classification data are drawn from a particular probability distribution, then these stochastic results either do not apply or are intractable to compute. Thus Donoho et al.’s theorems discussed in Section 2.2.1 are the best tool we have to prove $\ell^1/\ell^0$-equivalence given an arbitrary (possibly large) matrix of training data. That said, it is important to note that these mutual coherence theorems produce what are generally considered to be fairly loose bounds on the sparsity level, given experimental results and cases for which restricted isometry constants are known (has:sta, Chap. 10).

## 3 Sparse Representation-Based Classification

We next review Wright et al.’s application of the $\ell^1$-norm/sparsity relationship to classification. In reviewing the compressed sensing framework, we referred to our underdetermined system using the notation $y_0 = X\alpha$ (or $y$ in place of $y_0$, if the represented signal was expected to be noisy), for $X \in \mathbb{R}^{m \times N}$. To differentiate the classification context, let $X_{\mathrm{tr}} \in \mathbb{R}^{m \times N_{\mathrm{tr}}}$ be the matrix of training samples, and let $y \in \mathbb{R}^m$ be an arbitrary test sample.

SRC solves

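$$\alpha^* := \arg\min_{\alpha \in \mathbb{R}^{N_{\mathrm{tr}}}} \|\alpha\|_1 \quad \text{subject to} \quad y = X_{\mathrm{tr}}\alpha. \tag{9}$$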

Alternatively, in the case of noise in which an exact representation may not be desirable (see the discussion at the beginning of Section 5), one can solve the regularized optimization problem

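$$\alpha^* := \arg\min_{\alpha \in \mathbb{R}^{N_{\mathrm{tr}}}} \left\{ \frac{1}{2}\|y - X_{\mathrm{tr}}\alpha\|_2^2 + \lambda\|\alpha\|_1 \right\}. \tag{10}$$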

Here, $\lambda$ controls the trade-off between error in the approximation and the sparsity of the coefficient vector.

For a classification problem with $L$ classes, define the indicator function $\delta_l : \mathbb{R}^{N_{\mathrm{tr}}} \to \mathbb{R}^{N_{\mathrm{tr}}}$, $l = 1, \ldots, L$, to set all coordinates corresponding to training samples not in class $l$ to 0 (and to act as the identity on all remaining coordinates). After obtaining $\alpha^*$ from Eq. (9) or (10), the class label of $y$ is predicted using

$$\mathrm{class\_label}(y) = \arg\min_{1 \leq l \leq L} \left\| y - X_{\mathrm{tr}}\,\delta_l(\alpha^*) \right\|_2. \tag{11}$$

As mentioned in the introduction, it is assumed that by constraining the number of nonzero representation coefficients, nonzeros will occur at training samples most similar to the test sample, and thus Eq. (11) will reveal the correct class. This works as follows: It is assumed that each class manifold is a linear subspace spanned by its set of training samples, so that if the number of classes is large with regard to , there exists a sparse (in terms of the entire training set) representation of using training samples in its ground truth class. The coefficient vector is an attempt at finding this class representation, and Eq. (11) is used to allow for a certain amount of error.

In essence, SRC classifies $y$ to the class that contributes the most to its sparse (via $\ell^1$-minimization) representation (or approximation, if Eq. (10) is used). SRC is summarized in Algorithm 1.
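The two SRC steps, solving Eq. (10) and then applying the class-wise residual rule in Eq. (11), can be sketched as follows. This is an illustrative sketch only: the solver is plain ISTA (iterative soft-thresholding), which is swapped in here for simplicity and is not named in the source, and the function names `soft_threshold`, `lasso_ista`, `src_predict`, the default `lam`, the iteration count, and the toy data are all hypothetical.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Approximately minimize 0.5 * ||y - X a||_2^2 + lam * ||a||_1 (Eq. (10)) by ISTA."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2    # 1 / Lipschitz constant of the smooth part
    a = np.zeros(X.shape[1])
    for _ in range(n_iter):
        a = soft_threshold(a - step * (X.T @ (X @ a - y)), step * lam)
    return a

def src_predict(X_tr, labels, y, lam=0.01):
    """Assign y to the class with the smallest class-wise residual (Eq. (11))."""
    labels = np.asarray(labels)
    X_tr = X_tr / np.linalg.norm(X_tr, axis=0)          # unit-norm training samples
    alpha = lasso_ista(X_tr, y, lam)
    classes = np.unique(labels)
    # delta_l(alpha): keep coefficients of class l, zero out all others.
    residuals = [np.linalg.norm(y - X_tr @ np.where(labels == c, alpha, 0.0))
                 for c in classes]
    return classes[int(np.argmin(residuals))]

# Toy data: two classes clustered around the first and second coordinate axes.
X_tr = np.array([[1.0,  1.0, 0.1, -0.1],
                 [0.1, -0.1, 1.0,  1.0],
                 [0.0,  0.0, 0.0,  0.0]])
labels = np.array([0, 0, 1, 1])
pred_a = src_predict(X_tr, labels, np.array([1.0, 0.05, 0.0]))
pred_b = src_predict(X_tr, labels, np.array([0.05, 1.0, 0.0]))
```

In this toy example each test sample lies in the span of one class's training samples, so the residual for that class is small while the other is close to $\|y\|_2$.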

## 4 The Conflict

In classification problems, samples from the same class may be highly correlated. As demonstrated in Table 1, the mutual coherence (as defined in Eq. (4)) of a training matrix is often quite large.

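When $\mu(X_{\mathrm{tr}}) \approx 1$, the mutual coherence bound in Theorem 2.1 becomes

$$\|\alpha\|_0 < \frac{1}{2}\left(1 + \frac{1}{\mu(X_{\mathrm{tr}})}\right) \approx 1.$$

Since $\|\alpha\|_0$ denotes the number of nonzero coefficients in the representation of $y$ over $X_{\mathrm{tr}}$, it will never satisfy $\|\alpha\|_0 < 1$. Thus we cannot use Theorem 2.1 to prove $\ell^1/\ell^0$-equivalence in SRC, for example, on the databases used in Table 1.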

It follows that the “theory” behind sparse representation-based methods for learning (like SRC) is missing a significant piece. In the next three sections, we aim to provide insight into the following three questions:

1. Can Theorem 2.1 ever be used to prove $\ell^1/\ell^0$-equivalence in SRC?

2. Regardless of theoretical guarantees, is $\ell^1$-minimization finding the sparsest solution in practice in SRC?

3. What is the role of sparsity in SRC’s classification performance?

## 5 Mutual Coherence Equivalence and Classification

In this section, we identify cases in which the condition given in Eq. (5) from Theorem 2.1 provably does not hold, and thus we cannot use Theorem 2.1 to prove $\ell^1/\ell^0$-equivalence. We also discuss analogous results in the noisy case, i.e., Eq. (6) in Theorem 2.2. In particular, we are concerned with the applicability of these theorems for classification problems.

Before we begin, we take a moment to clarify notation:

• In discussing compressed sensing in Section 2, we used $y_0$ to refer to a clean measurement vector and $y$ to refer to its noisy version. In contrast, in this section and in Section 7, $y$ may represent either a clean or noisy measurement vector, or an arbitrary test sample (as it does in Algorithm 1). We do this because, in the context of representation-based classification, there are reasons other than noise in the test sample for allowing the equality $y = X\alpha$ to hold only approximately: the training data could also be corrupted, or we may want to relax the assumption that class manifolds are linear subspaces (perhaps this is only approximately, or locally, the case). Additionally, it is difficult to determine the amount of noise in test samples in real-world problems. To keep the situation general and to avoid confusion, we will only differentiate between $y_0$ and $y$ when we explicitly consider $y$ to be $y_0$ plus a noise vector, as in Donoho et al.’s Theorems 2.2 and 2.3.

When we explicitly consider data from a classification problem, we will use the subscript “tr.” That is, in the general compressed sensing representation $y = X\alpha$, we set $X = X_{\mathrm{tr}}$ when we want to denote a matrix of training samples, and when this is done, it is assumed that $y$ specifically designates a test sample.

• For the underdetermined system , we have already seen several instantiations of the coefficient vector . We denoted the sparsest coefficient vector, i.e., the solution to the -minimization problem given in Eq. (1), by , and we used and to denote the coefficient vectors found using -minimization (in particular, the solutions to Eq. (2) and Eq. (3), respectively). In contrast, denotes the solution to the SRC optimization problem (the solution to Eq. (9) or (10)). It is possible to have or , depending on the optimization problem used in SRC and the amount of noise in the test sample. In particular, if Eq. (9) is used in SRC, and if Eq. (10) is used and the test sample satisfies with .

### 5.1 Preliminary Results

We will use the following lemma which gives a lower-bound on mutual coherence in the underdetermined setting:

###### Lemma 5.1 (Welch Welch (), Rosenfeld ros:gram ()).

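For $X \in \mathbb{R}^{m \times N}$ with normalized columns and $N > m$, we have that

$$\mu(X) \geq \sqrt{\frac{N - m}{m(N - 1)}}. \tag{12}$$

It is straightforward to show that Lemma 5.1 implies that $\mu(X) \geq 1/m$, since the lower bound in Eq. (12) monotonically increases in $N$ for $N > m$, with a minimum value of $1/m$ attained at $N = m + 1$. Thus to have even a chance of Theorem 2.1 or 2.2 holding, we must have

$$\|\alpha\|_0 < \frac{1}{c}\left(1 + \frac{1}{\mu(X)}\right) \leq \frac{1}{c}(1 + m), \tag{13}$$

where $c = 2$ in the noiseless case and $c = 4$ in the noisy case.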
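The lower bound in Eq. (12) and its consequence for Eq. (13) are easy to check numerically. The sketch below is illustrative only; the function names `welch_bound` and `mutual_coherence`, the random Gaussian dictionaries, the seed, and the sizes are hypothetical choices.

```python
import numpy as np

def welch_bound(m, N):
    """Lower bound on mutual coherence from Eq. (12), valid for N > m."""
    return np.sqrt((N - m) / (m * (N - 1)))

def mutual_coherence(X):
    """Mutual coherence of Eq. (4), after normalizing the columns."""
    Xn = X / np.linalg.norm(X, axis=0)
    G = np.abs(Xn.T @ Xn)
    np.fill_diagonal(G, 0.0)
    return float(G.max())

rng = np.random.default_rng(3)
m = 5
results = []
for N in (6, 12, 50):
    mu = mutual_coherence(rng.standard_normal((m, N)))
    # Columns: N, observed coherence, Welch bound, right-hand side of Eq. (5).
    results.append((N, mu, welch_bound(m, N), 0.5 * (1.0 + 1.0 / mu)))

for row in results:
    print("N=%d  mu=%.3f  bound=%.3f  sparsity limit=%.3f" % row)
```

For every such dictionary the sparsity limit in the last column cannot exceed $(1 + m)/2$, as Eq. (13) states for the noiseless case.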

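We next consider the smallest possible value of the number of nonzeros $\|\alpha\|_0$ in any classification problem representation $y = X_{\mathrm{tr}}\alpha$. Let us assume that the test sample is not a scalar multiple of any training sample. It follows that $\|\alpha\|_0 \geq 2$. Thus in order for Theorem 2.1 or 2.2 to hold, we must have

$$2 \leq \|\alpha\|_0 < \frac{1}{c}\left(1 + \frac{1}{\mu(X_{\mathrm{tr}})}\right) \;\Rightarrow\; \mu(X_{\mathrm{tr}}) < \frac{1}{2c - 1} \;\Rightarrow\; \mu(X_{\mathrm{tr}}) < \begin{cases} 1/3, & \text{noiseless case} \\ 1/7, & \text{noisy setting.} \end{cases}$$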

Note that these upper bounds for are very small compared to the values of in Table 1. These findings produce the following small-scale result:

###### Proposition 5.1.

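Suppose that $X_{\mathrm{tr}} \in \mathbb{R}^{m \times N_{\mathrm{tr}}}$ has normalized columns and $N_{\mathrm{tr}} > m$. If $m \leq 3$ and $y$ is not a scalar multiple of any training sample, then the inequality in Eq. (5) with $X = X_{\mathrm{tr}}$ does not hold. That is, we cannot use Theorem 2.1 to prove $\ell^1/\ell^0$-equivalence in SRC.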

###### Proof.

By Lemma 5.1, we must have that $\mu(X_{\mathrm{tr}}) \geq 1/m \geq 1/3$. An analogous statement holds in the noisy setting (Theorem 2.2) for $m \leq 7$. ∎

### 5.2 Main Result

###### Proposition 5.2 (Main Result).

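Suppose that the sparsest representation of $y$ over the dictionary $X$ is given by $y = \sum_{i=1}^{k} \alpha_{j_i} x_{j_i}$ with all $\alpha_{j_i} \neq 0$. Set $\tilde{N}$ to be the number of columns of $X$ contained in

$$\tilde{\mathcal{X}} := \operatorname{span}\{x_{j_1}, \ldots, x_{j_k}\},$$

where clearly $\tilde{N} \geq k$. If $\tilde{N} > k$, then the inequality in Eq. (5) does not hold. That is, we cannot use Theorem 2.1 to prove $\ell^1/\ell^0$-equivalence.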

###### Proof.

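Suppose that $\tilde{N} > k$. Then there are more than $k$ dictionary elements in the subspace $\tilde{\mathcal{X}} = \operatorname{span}\{x_{j_1}, \ldots, x_{j_k}\}$. Since the vectors $x_{j_1}, \ldots, x_{j_k}$ are linearly independent (because otherwise, $y$ could be expressed more sparsely), the dimension of $\tilde{\mathcal{X}}$ is exactly $k$.

Define $\tilde{X} \in \mathbb{R}^{m \times \tilde{N}}$ to be the matrix of the dictionary elements contained in $\tilde{\mathcal{X}}$. Let the singular value decomposition of $\tilde{X}$ be given by $\tilde{X} = U \Sigma V^T$, and set $U_k$ to contain the first $k$ columns of $U$, $V_k$ to contain the first $k$ columns of $V$, and $\Sigma_k$ to contain the first $k$ columns and rows of $\Sigma$. Because $\tilde{X}$ has rank $k$, we can alternatively write

$$\tilde{X} = U_k \Sigma_k V_k^T.$$

The matrix $U_k^T \tilde{X}$ has the same mutual coherence as $\tilde{X}$, since they have the same Gram matrices:

$$\begin{aligned}
(U_k^T \tilde{X})^T (U_k^T \tilde{X}) &= \tilde{X}^T U_k U_k^T \tilde{X} \\
&= (U_k \Sigma_k V_k^T)^T U_k U_k^T (U_k \Sigma_k V_k^T) \\
&= V_k \Sigma_k^T U_k^T U_k U_k^T U_k \Sigma_k V_k^T \\
&= V_k \Sigma_k^T U_k^T U_k \Sigma_k V_k^T \\
&= (U_k \Sigma_k V_k^T)^T (U_k \Sigma_k V_k^T) \\
&= \tilde{X}^T \tilde{X}.
\end{aligned}$$

By Lemma 5.1, applied to $U_k^T \tilde{X} \in \mathbb{R}^{k \times \tilde{N}}$, we have that

$$\mu(X) \geq \mu(\tilde{X}) = \mu(U_k^T \tilde{X}) \geq \sqrt{\frac{\tilde{N} - k}{k(\tilde{N} - 1)}} \geq \sqrt{\frac{(k+1) - k}{k((k+1) - 1)}} = \frac{1}{k}.$$

Thus the bound on $k$ in Theorem 2.1 requires that

$$k < \frac{1}{2}\left(1 + \frac{1}{\mu(X)}\right) \leq \frac{1}{2}(1 + k) \;\Rightarrow\; k < 1, \tag{14}$$

which contradicts $k$ being a natural number. ∎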
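The coherence step of this argument can be illustrated numerically. The sketch below builds a hypothetical dictionary in which $k + 1$ columns lie in a common $k$-dimensional subspace and checks that $\mu(X) \geq 1/k$, so that $k < \frac{1}{2}(1 + 1/\mu(X))$ fails. All sizes, the seed, and the variable names are assumptions for illustration; the sketch does not verify that any particular representation is the sparsest one.

```python
import numpy as np

rng = np.random.default_rng(1)
m, N, k = 10, 20, 3

# Orthonormal basis of a random k-dimensional subspace of R^m.
basis = np.linalg.qr(rng.standard_normal((m, k)))[0]
# k + 1 dictionary columns inside that subspace, plus N - (k + 1) generic columns.
in_subspace = basis @ rng.standard_normal((k, k + 1))
others = rng.standard_normal((m, N - (k + 1)))
X = np.hstack([in_subspace, others])
X /= np.linalg.norm(X, axis=0)            # unit-norm columns

G = np.abs(X.T @ X)
np.fill_diagonal(G, 0.0)
mu = G.max()                              # mutual coherence, Eq. (4)
bound = 0.5 * (1.0 + 1.0 / mu)            # right-hand side of Eq. (5)

print("mu(X) =", mu, " 1/k =", 1.0 / k)
print("k < bound ?", k < bound)
```

Because the $k + 1$ in-subspace columns alone already force $\mu(X) \geq 1/k$, the final comparison is false regardless of how the remaining columns are chosen.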

We present several corollaries to Proposition 5.2. The first is a consequence applicable to any -minimization problem, regardless of whether or not the dictionary elements have class structure:

###### Corollary 5.1 (Consequence for general ℓ1-minimization).

If a measurement vector $y$ is not at all sparse over the dictionary $X \in \mathbb{R}^{m \times N}$, i.e., if every representation of $y$ requires no fewer than $m$ dictionary elements, then the condition in Eq. (5) from Theorem 2.1 does not hold.

###### Proof.

Because the dimension of the subspace $\tilde{\mathcal{X}}$ (the span defined in Proposition 5.2) is actually $m$, every dictionary element is contained in $\tilde{\mathcal{X}}$. ∎

Corollary 5.1 illustrates the importance of choosing a dictionary that awards a sparse representation of $y$ in any application of $\ell^1$-minimization, including compressed sensing.

The following corollary follows from the proof of Proposition 5.2:

###### Corollary 5.2.

Let with , and let be any positive integer such that . If any set of linearly independent columns of spans an additional, distinct column of , then the bound

$$k < \frac{1}{2}\left(1 + \frac{1}{\mu(X)}\right)$$

does not hold.

Of course, this bound will not hold for any larger values of $k$, either. This means that if we can find an integer $k$ satisfying the conditions of Corollary 5.2, then any attempt to prove $\ell^1/\ell^0$-equivalence using Theorem 2.1 will require $y = X\alpha$ with $\|\alpha\|_0 < k$. (Footnote: Corollary 5.2 can alternatively be proven using the equivalence theorem involving spark; see the work of Donoho and Elad don:osr ().)

The following corollary is an explicit consequence for dictionaries consisting of training samples:

###### Corollary 5.3 (Consequence for Class-Structured Dictionaries).

Suppose that is a test sample with , and define . If adding to the set of training samples does not increase its mutual coherence, that is, if for all , i.e., , then we cannot have both that (i) and (ii) .

###### Proof.

If we can write for , then the (linearly independent) training samples with nonzero coefficients in the representation span a -dimensional subspace containing . Setting in Corollary 5.2, we have that

$$k \nless \frac{1}{2}\left(1 + \frac{1}{\mu(X)}\right) = \frac{1}{2}\left(1 + \frac{1}{\mu}\right).$$

On the other hand, if

$$k < \frac{1}{2}\left(1 + \frac{1}{\mu(X)}\right) = \frac{1}{2}\left(1 + \frac{1}{\mu}\right)$$

for some positive integer , then also by Corollary 5.2, it must be the case that is not contained in the subspace spanned by any linearly independent distinct columns of , i.e., columns of . Thus we cannot write for any satisfying . ∎

It might initially seem that the hypothesis of Corollary 5.3 is unlikely to hold. However, if one assumes that the data is sampled randomly with test samples having the same distribution as the training samples in their ground truth classes, then the hypothesis that becomes much more probable. We discuss this further in Section 7.

Our final corollary determines conditions under which the bound in Eq. (5) from Theorem 2.1 is theoretically incompatible with the explicit assumptions made in SRC wri:src (). We review these assumptions briefly:

###### Assumption 1 (Linear Subspaces).

The ground truth class manifolds of the given dataset are linear subspaces.

###### Assumption 2 (Spanning Training Set).

The training matrix contains sufficient samples in each class to span the corresponding linear subspace.

###### Corollary 5.4 (Consequence for SRC).

Suppose that the SRC Assumptions 1 and 2 hold. Let have ground truth class , and suppose that the number of class training samples, , is large, i.e., , for the dimension of the linear subspace representing the class manifold. Then there exists a test sample which requires the maximum number of class training samples to represent it. If this representation of is its sparsest representation over the dictionary , then the condition in Eq. (5) from Theorem 2.1 cannot hold. Thus we cannot use Theorem 2.1 to prove -equivalence in SRC.

Corollary 5.4 says that if we have a surplus of class training samples (i.e., more than enough to span the class subspace), then, provided that the “class representations” (representations of the test samples in terms of their ground truth classes) truly are the sparsest representations of the test samples over the training set (as argued by the SRC authors wri:src ()), there will be some test samples for which Theorem 2.1 cannot hold. These test samples are exactly those requiring class training samples in their representations. In general, such test samples must exist; otherwise, the dimension of the class subspace would be less than . To reiterate, if everything we want to happen in SRC actually happens (large class sizes, sparse class representations), then we cannot consistently use Theorem 2.1 to prove -equivalence.

On a more positive note, the assumptions in SRC make it possible to estimate whether or not the conditions of Proposition 5.2 hold. Though these conditions are difficult to check in general (if we knew the sparsest solution of $y$ over the dictionary, then we would not need to use $\ell^1$-minimization to find it), the linear subspace assumption in SRC gives us a heuristic for doing so. We could potentially estimate the dimension of each class (using a method such as multiscale SVD mag:msvd () or DANCo cer:dan (), for example) and compare this with the number of training samples in that class. If the latter is larger than the former, then we expect that Theorem 2.1 cannot be applied for some test samples.
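A rough version of this heuristic can be sketched as follows. Note the substitution: instead of multiscale SVD or DANCo, the sketch estimates the class dimension by the numerical rank of the class training matrix (a plain SVD with a relative threshold), which is only sensible when the linear subspace assumption holds nearly exactly. The function name `class_dimension_surplus`, the threshold `rel_tol`, and the synthetic class are hypothetical.

```python
import numpy as np

def class_dimension_surplus(X_class, rel_tol=1e-8):
    """Estimate the class subspace dimension and compare it with the class size.

    Returns (estimated dimension d, number of class samples, samples > d).
    """
    s = np.linalg.svd(X_class, compute_uv=False)
    d = int(np.sum(s > rel_tol * s[0]))       # numerical rank as a dimension estimate
    n_samples = X_class.shape[1]
    return d, n_samples, n_samples > d

# Synthetic class: 8 training samples drawn from a 3-dimensional subspace of R^20.
rng = np.random.default_rng(2)
basis = np.linalg.qr(rng.standard_normal((20, 3)))[0]
X_class = basis @ rng.standard_normal((3, 8))

d_est, n_class, surplus = class_dimension_surplus(X_class)
print(d_est, n_class, surplus)
```

When `surplus` is true, the class has more training samples than its estimated dimension, which is the situation in which Theorem 2.1 is expected to be inapplicable for some test samples.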

In typical applications, we must deal with noisy data. Thus we should consider the application of Theorem 2.2 instead of Theorem 2.1. But this is immediate: Since the mutual coherence condition is stricter in the case of noise, the consequences of Proposition 5.2 and the above corollaries hold whenever the conditions are assumed to hold on the clean version of the data. In particular, Theorem 2.2 requires the existence of a clean test sample (even if it is unknown to us) that satisfies with . Under the hypothesis of Corollary 5.3 (setting ), such a cannot exist.

In concluding this section, we stress that the mutual coherence conditions in Theorems 2.1 and 2.2 are sufficient, but not necessary, for $\ell^1/\ell^0$-equivalence. Thus it is possible for $\ell^1$-minimization to find (or closely approximate) the sparsest solution even when the conditions of these theorems do not hold. Whether or not this happens in the context of SRC is the topic of the next section.

## 6 Equivalence on Highly-Coherent Data

In this section, we investigate whether sparsity is reliably achieved via $\ell^1$-minimization on highly-correlated data, such as class-structured databases.

### 6.1 Inspiration

We are inspired by the data model and subsequent work of Wright and Ma wri:dense () (see also the work of Wright et al. wri:srcv ()), which produces an $\ell^1/\ell^0$-equivalence guarantee for dictionaries containing vectors assumed to model facial images. We summarize their result briefly.

Previous work has shown that the set of facial images of a fixed subject (person) under varying illumination conditions forms a convex cone, called an illumination cone, in pixel space geo:illum (); bel:what (). Wright and Ma demonstrate that in fact the set of facial images under varying illuminations over all subjects combined exhibits this cone structure. For example, they show that this is the case for the entire set of (raw) samples from the Extended Yale B Face Database geo:illum (). Further, this cone becomes extremely narrow, i.e., a “bouquet,” as the number of pixels grows large wri:dense (). These findings reiterate that class-structured data, particularly face databases, are highly-coherent.

Lee et al. lee:linss () showed that any image from the illumination cone can be expressed as a linear combination of just a few images of the same subject under varying lighting conditions. In other words, illumination cones are well-approximated by linear subspaces. Thus the SRC condition that class manifolds are (approximately) linear subspaces presumably holds for databases made up of facial images under varying lighting conditions. Given a facial image $y$ that may be occluded or corrupted by noise, $y$ can thus be expressed as

$$y = X_{\mathrm{tr}}\alpha_0 + z_0, \tag{15}$$

given that certain requirements are satisfied in the sampling of the training data. By the above model, $\alpha_0$ is assumed to be non-negative (a result of the illumination cone model wri:srcv (); geo:illum ()) and sparse, containing nonzeros at training samples that represent the same subject as $y$ (i.e., are in the same class). Additionally, $z_0$ is an (unknown) error vector with nonzeros in only a fraction of its coordinates; i.e., the model assumes that only a portion of the pixels are occluded or corrupted wri:srcv (). Note that this is not quite the same situation as in the condition for $\ell^1/\ell^0$-equivalence in the noisy setting given in Theorem 2.2. One difference is that in Eq. (15) above, $z_0$ is bounded in terms of $\ell^0$-norm (sparsity) with no limit on $\ell^2$-norm (magnitude), whereas in Theorem 2.2, the noise is bounded in terms of magnitude but not sparsity.

The goal, as one might expect, is to recover $\alpha_0$ from Eq. (15). In the SRC paper wri:src (), Wright et al. use $\ell^1$-minimization to do this. In particular, they solve

$$(\hat{\alpha}_1, z_1) := \arg\min \|\alpha\|_1 + \|z\|_1 \quad \text{subject to} \quad y = X_{\mathrm{tr}}\alpha + z, \tag{16}$$

and they show that this version of SRC produces very good classification results on occluded or corrupted facial images. (Again, note that is different from both and discussed earlier, as there is a sparsity constraint instead of an $\ell^2$-norm bound on the noise component $z$.)
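For concreteness, a problem of the form in Eq. (16) can be posed as a linear program by stacking the coefficient and error vectors and splitting the result into positive and negative parts. The sketch below is only illustrative: it uses SciPy's `linprog` rather than the solvers used in the cited works, and the function name `l1_min_with_error` and the synthetic data are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def l1_min_with_error(X, y):
    """Solve min ||alpha||_1 + ||z||_1 subject to X @ alpha + z = y.

    Writes w = [alpha; z] = u - v with u, v >= 0 and minimizes sum(u) + sum(v)
    subject to [X, I] @ (u - v) = y.
    """
    m, N = X.shape
    B = np.hstack([X, np.eye(m)])      # augmented dictionary [X, I]
    n = N + m
    res = linprog(np.ones(2 * n), A_eq=np.hstack([B, -B]), b_eq=y,
                  bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    w = res.x[:n] - res.x[n:]
    return w[:N], w[N:]

# Synthetic example (assumed sizes): sparse nonnegative alpha0, sparse error z0.
rng = np.random.default_rng(1)
m, N = 30, 40
X = rng.standard_normal((m, N))
X /= np.linalg.norm(X, axis=0)
alpha0 = np.zeros(N)
alpha0[:3] = [0.5, 1.0, 0.8]
z0 = np.zeros(m)
z0[[4, 17]] = [2.0, -1.5]
y = X @ alpha0 + z0

alpha_hat, z_hat = l1_min_with_error(X, y)
print("objective:", np.abs(alpha_hat).sum() + np.abs(z_hat).sum())
```

The minimizer is always feasible and has objective value no larger than that of the pair used to generate the data; whether it coincides with that pair is exactly the equivalence question discussed in this section.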

In a later paper, Wright et al. wri:srcv () correctly note that the usual $\ell^1/\ell^0$-equivalence theorems do not hold on the highly-correlated data in $X_{\mathrm{tr}}$, and so it cannot be determined whether or not the $\ell^1$-minimized solution in Eq. (16) is equal to (what is assumed to be) the true sparsest solution $\alpha_0$. Fortunately, Wright and Ma wri:dense () proved a theorem that gives sufficient conditions for this equivalence under an assumed model (called the bouquet model) of facial images; see also Wright et al.’s version wri:srcv (). To state the theorem, we will need the following definition:

###### Definition 6.1 (Proportional Growth wri:dense ()).

A sequence of signal-error problems , for , exhibits proportional growth with parameters , , and , if , , and .

It follows that is the redundancy factor in the dictionary and and control the sparsity of and , respectively. Here, is assumed to be small and may depend on and .

We are now in a position to state Wright and Ma’s main theorem:

###### Theorem 6.1 (Wright and Ma wri:dense ()).

Fix any and . Suppose that is distributed according to the bouquet model given by

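$$X = [x_1, \ldots, x_N] \in \mathbb{R}^{m \times N}, \quad x_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}\left(\mu, (\nu^2/m) I_m\right), \quad \|\mu\|_2 = 1, \quad \|\mu\|_\infty \le C_\mu m^{-1/2}, \quad C_\mu \ge 1 \tag{17}$$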

for $\nu$ sufficiently small. Also suppose that the sequence of signal-error problems for exhibits proportional growth with parameters , , and . Suppose further that is a uniform random subset of size , and that with entries of i.i.d. (independent of ) and . Lastly assume that $m$ is sufficiently large. Then with probability at least in , , and , for all with and any with sign vector and support , we have

$$(\alpha_0, z_0) = \arg\min_{\alpha, z} \|\alpha\|_1 + \|z\|_1 \quad \text{subject to} \quad X\alpha + z = X\alpha_0 + z_0.$$

Here, is a numerical constant and and are positive constants (independent of ) which depend on , , and . By “$\nu$ sufficiently small” and “$m$ sufficiently large,” Wright and Ma mean that there exist constants and (independent of ) such that and , respectively (see footnote 3 below). This theorem illustrates that $\ell^1/\ell^0$-equivalence can provably hold in the classification of highly-coherent data via a random database model.

Footnote 3: The relationship between and is not explicitly stated, but it makes sense that by the proportional growth assumption. Further, if , then since , we can likely alternatively write .

###### Remark 6.1.

Despite its applicability to highly-coherent data, Theorem 6.1 does not prove that $\ell^1/\ell^0$-equivalence holds in SRC. First of all, the theorem requires that $m$ be sufficiently large, which may not be the case, especially when feature extraction is used. Second, the model in Theorem 6.1 does not explicitly deal with class-structured data. A true face recognition model should account for the individual subjects, with samples in the same class being (on average) more correlated than those from different classes. Thus our model should contain “sub-bouquets” (i.e., the classes) inside the larger bouquet.

### 6.2 Experiments

With these changes in mind, we design a random database model that will allow us to study the relationship between sparsity and $\ell^1$-minimization on highly-coherent and class-structured data, such as the images used in face recognition. First, we specify the dimension , the number of classes , and the number of samples , in each training class. We require that so that the resulting dictionary of training samples leads to an underdetermined system. We then randomly generate training data with an increasing amount of cone/bouquet structure as well as class structure, along with a test sample (with known sparse coefficient vector $\alpha_0$) generated as a linear combination of training samples from a single class. We run a fixed number of trials of the experiment at each of 11 increasing values of coherence (we call these stages) and determine at which stages $\ell^1$-minimization can closely (or exactly) recover $\alpha_0$.

#### 6.2.1 Experimental Setup

For each generated training set $X_{\mathrm{tr}}$, we set the (clean) test sample $y_0$ to be a random vector in the positive span of the class 1 data. That is, we set

$$y_0 := \alpha^{(1)}_1 x^{(1)}_1 + \ldots + \alpha^{(1)}_{N_0} x^{(1)}_{N_0},$$

where and , . We then define

$$\alpha_0 := \left[\alpha^{(1)}_1, \ldots, \alpha^{(1)}_{N_0}, 0, \ldots, 0\right]^{T} \in \mathbb{R}^{N_{\mathrm{tr}}}.$$

Given this setup, we want to see if $\ell^1$-minimization will recover $\alpha_0$, i.e., if the solution

$$\alpha_1 := \arg\min_{\alpha \in \mathbb{R}^{N_{\mathrm{tr}}}} \|\alpha\|_1 \quad \text{subject to} \quad X_{\mathrm{tr}}\alpha = y_0$$

is equal to $\alpha_0$. Note that for large , $\alpha_0$ can be viewed as a sparse vector.
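A minimal sketch of one trial of this setup, in the unstructured case where training samples are drawn uniformly from the unit sphere, might look as follows. The sizes, the uniform distribution of the positive coefficients, and the use of SciPy's `linprog` as the $\ell^1$ solver are all assumptions made for illustration; they are not the specifications or the solver used in our experiments.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(X, y):
    """min ||alpha||_1 subject to X @ alpha = y, via alpha = u - v with u, v >= 0."""
    n = X.shape[1]
    res = linprog(np.ones(2 * n), A_eq=np.hstack([X, -X]), b_eq=y,
                  bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:n] - res.x[n:]

rng = np.random.default_rng(0)
m, n_classes, N0 = 20, 10, 4           # illustrative sizes with m < N_tr
N_tr = n_classes * N0

# Normalized Gaussian columns are uniformly distributed on the unit sphere.
X_tr = rng.standard_normal((m, N_tr))
X_tr /= np.linalg.norm(X_tr, axis=0)

# Test sample in the positive span of class 1 (the first N0 columns).
alpha0 = np.zeros(N_tr)
alpha0[:N0] = rng.uniform(0.1, 1.0, size=N0)   # assumed coefficient distribution
y0 = X_tr @ alpha0

alpha1 = basis_pursuit(X_tr, y0)
print("relative error:", np.linalg.norm(alpha1 - alpha0) / np.linalg.norm(alpha0))
```

A relative error near zero indicates that the $\ell^1$ solution coincides with the generating coefficient vector for that trial.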

In Stage 1 of our model, the training data has no class or cone structure and is randomly generated on the unit sphere . It has been shown experimentally that, for and sufficiently large, an $\ell^1$-minimization solution with no more than nonzeros is enough to ensure it is the sparsest solution with high probability don:und (). Thus we expect to see exact recovery in Stage 1 for values of , , and satisfying these requirements.

To add both bouquet and class (or sub-bouquet) structure to the training set in subsequent stages, we define the cone mean and the class means . At Stage , , we set and then modify , where effectively increases the cone mean from as increases. Next, each class mean is randomly generated depending on as follows: For each class , we sample from for (so that each class mean becomes increasingly close to the cone mean) and then modify , . Lastly, to generate the training samples in class , we sample from and then modify , . Figure 1 shows an example of Stage with , , and .

We perform experiments using four different specifications for the triples , as shown in Table 2. By design, we have that in our experiments (though we will also briefly look at the case that ). Note that: (i) the inequality is satisfied for each of the specifications in Table 2; and (ii) these numbers are similar to what we might expect to see in classification of a face database (after some method of feature extraction is applied, as is generally required by SRC for face classification).

#### 6.2.2 Experimental Results: No Noise

Accuracy of recovery: We consider the following quantities for evaluating the success of $\ell^1$-recovery:

• The average normalized $\ell^2$-error

$$\mathrm{err}_{\ell^2} := \|\alpha_1 - \alpha_0\|_2 / \|\alpha_0\|_2 \tag{18}$$

between the $\ell^1$-minimized solution $\alpha_1$ and $\alpha_0$,

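• The average number of nonzeros of $\alpha_1$ occurring at training samples not in class 1 (we call these “off-support” nonzeros, because they are nonzeros not in the support of $\alpha_0$), divided by the total number of nonzeros. That is, let $\alpha_1^{\text{off-supp}}$ be the result of setting all entries in $\alpha_1$ that are in class 1 to zero. Then this error is defined as

$$\mathrm{err}_{\mathrm{supp}} := \frac{\|\alpha_1^{\text{off-supp}}\|_0}{\|\alpha_1\|_0},$$

• Since $\mathrm{err}_{\mathrm{supp}}$ does not provide information regarding the size of the off-support nonzero coefficients, we also consider

$$\mathrm{err}_{\mathrm{supp}(\ell^2)} := \frac{\|\alpha_1^{\text{off-supp}}\|_2}{\|\alpha_1\|_2} \quad \text{and} \quad \mathrm{err}_{\mathrm{supp}(\ell^1)} := \frac{\|\alpha_1^{\text{off-supp}}\|_1}{\|\alpha_1\|_1},$$

• The average mutual coherence of the training set, .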
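The quantities in the list above can be computed per trial along the following lines. This is a sketch only: the function names, the numerical threshold `tol` used to decide which entries count as nonzero, and the toy vectors are assumptions, and mutual coherence is taken in its standard form as the largest absolute inner product between distinct normalized columns.

```python
import numpy as np

def recovery_errors(alpha1, alpha0, class1_idx, tol=1e-8):
    """Per-trial recovery errors; assumes alpha1 has at least one nonzero entry."""
    off = alpha1.copy()
    off[class1_idx] = 0.0              # keep only the off-support entries

    def nnz(v):
        # Entries with magnitude above tol are counted as nonzeros (assumption).
        return int(np.sum(np.abs(v) > tol))

    return {
        "err_l2": np.linalg.norm(alpha1 - alpha0) / np.linalg.norm(alpha0),
        "err_supp": nnz(off) / nnz(alpha1),
        "err_supp_l2": np.linalg.norm(off) / np.linalg.norm(alpha1),
        "err_supp_l1": np.linalg.norm(off, 1) / np.linalg.norm(alpha1, 1),
    }

def mutual_coherence(X):
    """Largest absolute inner product between distinct normalized columns of X."""
    Xn = X / np.linalg.norm(X, axis=0, keepdims=True)
    G = np.abs(Xn.T @ Xn)
    np.fill_diagonal(G, 0.0)
    return float(G.max())

# Toy example: class 1 occupies the first two coordinates.
alpha0 = np.array([1.0, 2.0, 0.0, 0.0])
alpha1 = np.array([1.0, 2.0, 0.0, 0.5])
errs = recovery_errors(alpha1, alpha0, class1_idx=[0, 1])
coh = mutual_coherence(np.array([[1.0, 0.0, 1.0],
                                 [0.0, 1.0, 1.0]]))
print(errs, coh)
```

Averaging these values over the trials at each stage gives the reported quantities.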

It is informative to consider the effect that the support error quantities would (hypothetically) have on the classification performance of SRC. Recall that, in the case that the clean test sample is known, SRC computes the class residuals , , and assigns to the class with the smallest residual. Thus if , , and are small, we expect that SRC will have an easier time classifying the test sample correctly (recall that these quantities measure the residual from the correct class ). For example, if all the support error quantities are 0, then and it follows that the class 1 residual and for . This corresponds to the ideal classification scenario.

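We compute the averages of $\mathrm{err}_{\ell^2}$, $\mathrm{err}_{\mathrm{supp}}$, $\mathrm{err}_{\mathrm{supp}(\ell^2)}$, $\mathrm{err}_{\mathrm{supp}(\ell^1)}$, and the mutual coherence over 1000 trials at each stage, using the $\ell^1$-minimization algorithm HOMOTOPY don:hom (); asif:hom () with error/sparsity trade-off parameter (to force near-exactness in the approximation). The results are shown in Figure 2.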

Considering that $\mathrm{err}_{\mathrm{supp}}$ records any off-support nonzeros, regardless of how small, the results are quite good. In many cases, $\ell^1$-minimization was able to recover the exact solution on highly-correlated data, and when errors in the support occurred, they were generally small.

We see two different things happening at either end of the Stage axis. At Stage 1, we see support errors in every database except DB-4 (the low-redundancy case). Further, there are nonzero values of , , and