Compressive Source Separation:
Theory and Methods for Hyperspectral Imaging
Abstract
With the development of numbers of high resolution data acquisition systems and the global requirement to lower the energy consumption, the development of efficient sensing techniques becomes critical. Recently, Compressed Sampling (CS) techniques, which exploit the sparsity of signals, have allowed to reconstruct signal and images with less measurements than the traditional Nyquist sensing approach. However, multichannel signals like Hyperspectral images (HSI) have additional structures, like interchannel correlations, that are not taken into account in the classical CS scheme.
In this paper we exploit the linear mixture of sources model, that is the assumption that the multichannel signal is composed of a linear combination of sources, each of them having its own spectral signature, and propose new sampling schemes exploiting this model to considerably decrease the number of measurements needed for the acquisition and source separation. Moreover, we give theoretical lower bounds on the number of measurements required to perform reconstruction of both the multichannel signal and its sources. We also proposed optimization algorithms and extensive experimentation on our target application which is HSI, and show that our approach recovers HSI with far less measurements and computational effort than traditional CS approaches.
I Introduction
A Hyperspectral Image (HSI) is a collection of hundreds of images that have been acquired simultaneously in narrow and adjacent spectral bands, typically by airborne sensors. HSI are produced by expensive spectrometers that sample the light reflected from a twodimensional area. An HSI data set is thus a “cube” with two spatial and one spectral dimensions. Hyperspectral imagery has many applications including environmental monitoring, agriculture planning or mineral exploration. The plurality of channels in HSI makes it possible to discriminate among the various materials that make up a geographical area: each of them is represented by a unique spectral signature. Accordingly, HSI are often processed via clustering or source separation methods to obtain segmentation maps locating and labeling the various materials appearing in the image. Unfortunately, having multiple channels comes at a price: the sheer volume of data makes acquisition, transmission, storage and analysis of HSI computationally very challenging. Therefore, the problem addressed in this paper is to reduce the complexity of manipulating HSI via a suitable compression or dimensionality reduction technique.
In this context the emerging Compressive sensing (CS) theory, which addresses the problem of recovering signals from few linear measurements, seems ideally suited [1, 2]. The main assumption underlying CS is that the signal is sparse or compressible when expressed in a convenient basis. A signal is said sparse in a basis if it is a linear combination of only basis vectors of . The signal is said sparse when and compressible if the coefficient’s magnitudes, when sorted, have a fast powerlaw decay, meaning that the signal has few large coefficients and many small coefficients. The recent literature abounds with examples of sparse models for signals and images.
While the CScommunity has mostly focused on 1D or 2D signals, few works have been done on higher dimensional signals, in particular multiarray signals such as HSI. Extensions of wavelets basis for 3D data have been proposed [3] and rather generic sparse models have been exploited in [4, 5] for designing innovative compressive hyperspectral imagers. However, multiarray signals such as HSI have usually some structures that go beyond the sparsity assumption. Indeed, HSI can be interpreted as a mixture of sources, each of them having a specific spectral signature. This model is widely used for unmixing HSI [6, 7, 8, 9, 10], that is extracting, form the HSI, each source and their respective spectral signatures.
The main focus of this paper is to exploit, beyond the sparsity assumption, an additional structured model, the linear mixture model, so as to reconstruct and separate the sources of multiarray signals assuming we know their spectra (or mixing parameters) as side information. Note that this hypothesis is validated in many applications where the elements or materials composing the data are known and their spectra tabulated. This idea was first introduced in two of our conference papers [11, 12]. In this paper, we introduce and analyze a new sampling scheme, which exploits this structured model, and that has the following important properties:

the number of measurements, or samples, does not scale with the number of channels,

the recovery results do not depend on the conditioning of the mixing matrix (as long as the mixing spectra are linearly independent).
We propose new algorithms for HSI compressive source separation (CSS), that is source separation and data reconstruction from compressed measurements, which are based on exploiting the linear mixture structure and TV, or regularization. We establish that sources can be efficiently separated directly on the compressed measurements, i.e avoiding to run a source separation algorithm on this highdimensional raw data, thereby eliminating this important bottleneck and providing a rather striking example of compressed domain data processing. We provide theoretical guaranties and intensive experiments which show that, with this approach, we can reconstruct a multiarray signal from compressed measurements with a far better accuracy than traditional CS approaches. For example, we are able to reconstruct HSI datasets with only relative error from of measurements and less than of data transmission, with an algorithm that is more than times faster. While the main target application of this paper is HSI, our model and the theoretical analysis is general and could be applied to other multiarray signals like e.g. Positive Emission Tomography (PET) or distributed sensing.
The remainder of this paper is structured as follows. The necessary background and notations are first introduced in Section II. We then propose, in Section III, two acquisition schemes that exploit the prior knowledge of the mixing parameters so as to perform a decorrelation step. In Section IV, we provide theoretical guarantees for both source identification and data reconstruction. We determine the number of CS measurements sufficient for robust source identification and signal reconstruction as a function of the sparsity of the sources, sampling SNR and the conditioning of their corresponding mixture parameters. In Section V we discuss in further details the application of our acquisition and recovery schemes for HSI. We introduce different recovery algorithms that we compare with the classical methods, for various CS acquisition schemes on two sets of HSI. Finally, in the spirit of reproducible research, the code and data needed to reproduce the experimental sections of this paper is openly available at http://infoscience.epfl.ch/record/180911.
Ii Background and Notations
Iia CS of Multichannel Signals
We represent a multichannel signal with a matrix where is the number of channels and is the dimension of signal in each channel. The CS acquisition protocol of a multichannel signal is a linear mapping of into a CS measurement vector contaminated by the measurement noise :
When the signal is effectively compressed. The main goal of CS is to recover the signal from the fewest amount of measurements . Note that any linear mapping can be written in matrix form , where and is the vectorized form of matrix :
(1) 
In order to recover , we can search for the sparsest vector which is consistent with the measurement error, leading to the following minimization problem:
(2) 
where is an upper bound on the norm of the noise vector (i.e. ), denotes the quasinorm of a vector (i.e., the number of its nonzero coefficients). Unfortunately, this combinatorial minimization problem is NPhard in general [13, 14]. However, there are two tractable alternatives to solve problem (2): The convex relaxation leading to minimization and greedy algorithms such as matching pursuits (MP) [13] or Iterative Hard Thresholding (IHT) [15]. Both types of approaches provide conditions on the matrix and on the sparsity such that the recovered solution coincides with the original signal , and consequently also with the solution of (2).
The minimization approach consists in solving the following nonsmooth convex optimization problem called Basis Pursuit DeNoising (BPDN):
(3) 
where denotes the norm, which is equal to the sum of the absolute values of the vector entries, denotes the or Euclidean norm.
It has been shown in [1, 2, 16] that approximating the sparse recovery problem by the minimization (3) can stably recover the sparse original solution (i.e. sparse signal per channel) whenever satisfies the socalled restricted isometry property (RIP). This result guarantees that sparse signals can be perfectly recovered from noisefree measurements and that the recovery process is robust to the presence of noise. The computation of the isometry constants for a given matrix is prohibitive in practice, but certain classes of matrices, such as matrices with independent Gaussian or Bernoulli entries, obey the RIP condition with high probability (see Theorem 5.2 in [17]) as long as:
(4) 
for a fixed constant .
IiB Sparse Regularization of a Multichannel Signal
Usually the data is not directly sparse, but sparse in a basis . In that case, the regularization approach consists in solving the following problem which generalizes problem (3):
(5) 
with . Stable reconstruction by solving problem (5) is guaranteed as long as the matrix satisfies the RIP. When the data is a multichannel image, a classical basis is a block diagonal orthonormal basis ^{1}^{1}1 is the identity matrix and denotes the matrix Kronecker product. where denotes a proper 2dimensional wavelet basis.
Another classical approach to regularize the data (specially images) is the total variation (TV) penalty [18], which tends to generate images with piecewise smooth regions and sharp boundaries. Replacing the norm with the norm on each channel of the multichannel in problem (5) leads to the Total Variation DeNoising (TVDN) problem:
(6) 
IiC The Linear Mixture Model
One of the most practical setups of a multichannel signal is when the multichannel data matrix is derived by a sparse linear mixture model as follows:
(7) 
Here, denotes the source matrix whose th column contains the proportion of the source at each pixel. Each source is mixed with the corresponding column of the mixing matrix in order to generate the full multichannel data. Each column of contains the spectrum of the corresponding source. The observed signal in any channel is thus a linear combination of source signals:
IiD Mixing Parameters as Side Information for Multichannel CS Recovery
In certain multichannel signal acquisition setups the mixing parameters are known at both decoder and encoder sides. In particular, this is the case in many remote sensing applications where the spectra of common materials are tabulated. Such prior efficiently restricts the degrees of freedom of the entire data matrix to the sparse coefficients of the underlying sources. Indeed, we will show that, when we know the mixing parameters , the inverse problem consisting in recovering the multichannel signal from the measurements in (1) is equivalent to the problem of recovering the sources from the following measurements:
(8) 
with . The source coefficients can then be recovered by solving a convex optimization problem such as (5), where is replaced by and the multichannel signal can be reconstructed by applying the mixing matrix to the recovered source matrix according to the linear mixture model (7). This approach has the advantage of solving two problems: i) source separation directly from the compressive measurements, ii) data compressive sampling via source separation or, equivalently, via a particular structured sparse model.
Iii Compressive Multichannel Signal Acquisition Schemes
If the multichannel signal follows the linear mixture model (7), the knowledge of the mixing matrix can be used efficiently. The sparse source coefficients can be directly recovered from the measurements. In this section we introduce a decorrelation mechanism, applied at the acquisition process or as a postprocessing step, which has two main advantages: first it leads to strong dimensionality reduction and secondly it improves the conditioning of the recovery problem.
Iiia Multichannel Recovery via Source Recovery
When we know the mixing matrix , and thanks to the property () of the Kronecker product, the sampling equation (1) (in the noise free case) can be written as:
(9) 
Then, the regularization approach for the recovery of the whole data consists in finding the sparsest coefficients vector of the sources vector in a basis , where e.g. is a block diagonal orthonormal basis, through the following minimization:
(10) 
This corresponds to a “synthesis” formulation of BPDN using a basis . The “analysis” formulation, which is equivalent to the synthesis one when is a basis but different when is a redundant dictionary, consists in solving the following problem with respect to the sources instead of its coefficients:
(11) 
where is the adjoint of the operator .
IiiB Decorrelation Scheme
We have seen in section IIA, that the conditions to recover the signal from the noisy measurements depend on properties (such as RIP) of the sensing matrix . We introduce a particular structure for the sampling matrix which benefits from the available knowledge of the mixture parameters and incorporates data decorrelation into the compressive acquisition.
IiiB1 Decorrelating Multichannel CS Acquisition
The decorrelation mechanism consists of applying the MoorePenrose pseudo inverse matrix in order to remove the underlying dependencies among CS measurements. We therefore propose the following sampling matrix:
(12) 
The main sampling matrix is generated from a smallersize core sampling matrix . Note that CS imposes .
The total number of measurements is . Applying the sampling matrix of (12) on multichannel data results in the following CS measurements:
(14) 
The third equality comes from the following property: , and is a block diagonal matrix whose diagonal blocks are populated with :
As we can observe in (14) and thanks to the specific structure of the sampling matrix, the mixing parameters are discarded from the formulation and each source (each column of ) is directly subsampled by the same matrix .
IiiB2 Uniform Multichannel CS Acquisition
In many practical setups the acquisition scheme can not be arbitrarily chosen and is rather determined by various constraints posed by the physics of the signals and the implementation technology. Certain acquisition systems such as Rice’s singlepixel hyperspectral imager [4] are using a universal random matrix to sample independently data in each channel. In this case, acquisition models such as (12), which require interchannel interactions for compressed sampling, simply cannot be implemented. Here, the sampling matrix in (1) is block diagonal with blocks (each applies on a certain channel) that are populated by a unique matrix (similarly as in (14)):
(15) 
The total number of measurements is then . Reshaping and correspondingly into matrices (the measurement matrix) and (the noise matrix) leads to the following equivalent formulation:
IiiB3 Decorrelationbased Uniform Sampling
A decorrelation step similar to the one introduced in Section IIIB1 can be applied on the CS measurements. It consists in multiplying the rows of the measurement matrix by and reducing the dimensionality of to an matrix as follows:
where, . By reshaping and into the vectors and , we can observe that the outcome of such decorrelationbased uniform sampling leads to an expression similar to (14) i.e.,
(16) 
This decorrelating scheme favorably reduces the dimension of the data: at the acquisition stage, the total number of samples is but at the transmission and decoding stages the number of samples is only .
For the decorrelating sampling schemes described in section IIIB1 and IIIB3, the minimization (e.g. the ”synthesis” problem (10)) of section IIIA takes the following form:
(17) 
which, in the noiseless case can be decoupled into independent minimizations, each of them corresponding to a certain source compressed by a universal matrix . In Section IV we provide the theoretical analysis of such recovery scheme for various acquisition schemes.
Iv Main Theoretical Analysis
Compressive sparse source recovery is closely related to the problem of compressed sensing with redundant dictionaries [19, 20]. Indeed, the later problem has the same formulation as in (10) by replacing by an overcomplete dictionary matrix. The first part of this section provides an overview of the CS literature on redundant dictionaries. In the second part, we derive new performance bounds that extend the former results for a larger class of dictionaries. In the third part, we cast the sparse source separation problem as a particular case of CS recovery using redundant dictionaries and we give a bound on the performance of the minimization for each of the considered CS acquisition schemes (dense, uniform and decorrelated).
Iva Compressed Sensing and Redundant Dictionaries
Let be a vector that is sparse in a dictionary (i.e., with, ). The minimization approach for recovering (equivalently ) from the compressive measurements consists in solving:
(18) 
where, . Note that in this section is a sampling matrix of size and the dictionary typically contains a large number of columns ().
It has been shown in [1, 2] that the minimization (18) can stably recover the original solution whenever satisfies the restricted isometry property (RIP). More precisely, if for all sparse vectors the following RIP property holds:
(19) 
with the RIP constant of order , , then the solution to (18) satisfies the following error bound:
(20) 
for some positive constants , and where is the best sparse approximation of . Now the question is how many CS measurements are sufficient so that satisfies the RIP ? It has been shown in [19] that, for a certain class of random sampling matrices (e.g., with i.i.d. Gaussian, Bernoulli or subgaussian elements), with very high probability the RIP constant is bounded by:
(21) 
If is an orthonormal basis, then and becomes another subgaussian matrix with a similar distribution as for and thus (21) holds with equality i.e., .
Considering the recovery condition using minimization (i.e., ) and the bound in (21), we can conclude that must satisfy RIP with the following constant:
(22) 
Moreover, using the JohnsonLindenstrauss lemma, it has been shown that (see Theorem 5.2 in [17]) a random matrix whose elements are drawn independently at random from Gaussian, Bernoulli or subgaussian distributions satisfies RIP as long as we have:
(23) 
for a constant depending on the RIP constant of i.e., the higher , the smaller . If is not a unitary matrix, becomes a positive constant and the more coherent the columns of , the larger its RIP constant. Therefore, there is a tradeoff for compressed sensing using redundant dictionaries: redundancy can result in a more compact representations of the signals i.e., smaller , and thus less measurements are required for CS recovery using (18). Meanwhile, too much redundancy can lead to an awfully large constant in (23) implying that more CS measurements are required to overcome the uncertainties brought by over completeness.
IvB Performance Bounds for Compressed Sensing using AsymmetricRIP Dictionaries
In Section IVC we will show that applying the classical RIP based analysis results in conditions that are too restrictive to guaranty the source recovery. Therefore in this part and in order to overcome such limitations, we derive a new theoretical performance bound that uses different notions of RIP. We begin by introducing the notions of the asymmetric restricted isometry property (ARIP) and the restricted condition number of a dictionary .
Definition 1.
For a positive integer , an matrix satisfies the asymmetric restricted isometry property, if for all sparse the following inequalities hold:
(24) 
where, and are correspondingly the largest and the smallest constants for which the inequalities above hold. The restricted condition number of is defined as:
(25) 
In addition, we use a different notion of RIP for the compression matrix , namely, the Dictionary Restricted Isometry Property (DRIP), proposed by Candes et al. in [20]:
Definition 2.
For a positive integer , a matrix satisfies the DRIP adapted to a dictionary as long as for all sparse vectors the following inequalities hold:
(26) 
The DRIP constant is the smallest constant for which the property above holds.
This definition extends the classical RIP (which deals with signals that are sparse in the canonical basis) to linear mappings that are able to stably embed all low dimensional subspaces spanned by every columns of a redundant dictionary .
As in [20], we suppose that is an matrix drawn at random from certain distributions that satisfy the following concentration bound for any vector :
(27) 
for some constants and that are only depending on . Then, will satisfy the DRIP for any dictionary with overwhelming probability if
Remark 1.
Matrices whose elements are independently drawn at random from Gaussian, Bernoulli or (in general) subgaussian distributions satisfy the concentration bound in (27) and therefore satisfy DRIP for any dictionary as long as .
Based on these definitions we establish the following theorem in order to bound the performance of the minimization in (18):
Theorem 1.
Given a matrix that satisfies the DRIP adapted to a dictionary , with the constant where , then the solution to (18) obeys the following bound:
(28) 
for some positive constants .
The proof of this theorem is given in Appendix. Using Remark 1, the following result is straightforward:
Corollary 1.
Comparing to the bound (23) based on the classical RIP analysis, we see that (29) features the same scalingorder for the number of measurements. In addition, for both types of analysis the constant factors grow as the atoms of the dictionary become more coherent and therefore, more CS measurement are required.
Note that this result requires neither nor the dictionary to satisfy the classical RIP. In the next section, we apply these results to guaranty the performance of the minimization approach (10) for source identification and in particular, for the case where is not wellconditioned.
IvC Theoretical Guaranties for Source Recovery using Minimization
Sparse source recovery from compressive measurements using minimization (10) is a particular case of the compressed sensing problem using dictionaries (18). Indeed, for the source recovery problem, and the dictionary matrix are replaced respectively with and , and consequently, and . The only difference here is that is a tall matrix (i.e., ) due to its specific construction and the assumption of having few number of sources (i.e., ). Though there is no redundancy in in terms of the number of columns, there is uncertainty at the sparse decoder because of coherent columns. The following lemma which has been proven in [3] (see Lemma 2 in [3]) shows that the conditioning of is directly related to the conditioning of the underlying mixture parameters i.e., intuitively, if the columns of become coherent, so become the columns of .
Lemma 1.
For matrices with restricted isometry constants respectively, we have:
(30) 
Since the RIP constant of any orthonormal basis is zero (e.g., ), and since is an orthogonal matrix, we can deduce the following bound on the RIP constant of by applying Lemma 1:
(31)  
(32) 
For one can use (31) (which then holds with equality), and more generally (32) for any . Note that (32) follows by the definition of the RIP constant and it only holds if is properly normalized so that and . ^{2}^{2}2This can be done by dividing and multiplying by , respectively.
Moreover, due to the properties of the extreme singular values of the Kronecker product of two matrices:
and according to Definition 1, we can bound the restricted condition number of as follows:
(33) 
where, (without subscript) denotes the standard definition of the condition number of a matrix. With those descriptions, the performance of the sparse source recovery using (10) can be easily characterized by any of the previous types of performance bound of sections IVA and IVB.
According to the standard definition of the RIP for the matrix , we can bound its restricted condition number as follows:
Recall that, the classical RIP based analysis in section IVA requires (in order to have in (22)), which implies , or consequently . This severely restricts the application of such analysis for a limited class of relatively wellconditioned mixture parameters.
To address this limitation, we use the second theoretical analysis based on the DRIP of the compression matrix presented in section IVB. The following theorem is a corollary of Theorem 1:
Theorem 2.
Comparing to Theorem 1, is replaced by and is set to which satisfies the requirement of Theorem 1 i.e., according to (33) we have . As we can see, this analysis is valid for a much wider range of condition number namely, . ^{3}^{3}3As for an identity matrix always satisfies (i.e. there is no advantage by replacing the full Nyquist sampling with CS), Theorem 2 becomes useful only when we have which for the value of in the theorem implies .
Now, if we use this approximation to recover the multichannel data i.e., , the reconstruction error can be bounded using (34) and the following inequality:
(35)  
Theorem 2 indicates as the sufficient condition for the sparse source recovery. In the following we investigate the implication of this condition for the previously mentioned acquisition schemes to bound the number of CS measurements.
IvC1 Dense Random Sampling
IvC2 Uniform Random Sampling
The same type of analysis indicates a very poor performance for the uniform random acquisition scheme described in section IIIB2. The corresponding sampling matrix has a blockdiagonal form . Here, we assume that the core compression matrix that separately applies to each channel is an matrix whose elements are drawn independently at random from Gaussian, Bernoulli or subgaussian distributions.
According to the theoretical analysis provided in section IVA, the sufficient condition for source recovery via (10) is which, by considering (32) can be rephrased as:
For a compression matrix with this structure and by using Lemma 1 we can deduce . Now similarly as for the bound (23), satisfies the RIP with the constant above (and so does ) as long as or equivalently,
(37) 
The constant depends on the conditioning of the mixture matrix . When the columns of are very coherent, the extreme singular values spread away from each other and becomes large. As a consequence, (or equivalently ) must satisfy RIP for a smaller constant which, as discussed earlier in section IVA, implies to be large and more CS measurements are required for source recovery.
IvC3 Decorrelating Random Sampling
When a decorrelation step is incorporated into the compressive acquisition process, is discarded in the recovery formulation, and then we can use the standard RIP analysis in [1, 2] to evaluate the source recovery performance. Therefore, if satisfies the RIP with a constant , then the solution to (17) obeys the following error bound:
where the constants are the same as in (20).
Now, since is a block diagonal matrix, we can proceed along the exact same steps as for the uniform sampling scheme (Section IVC2) to bound the minimum number of CS measurements such that satisfies the RIP:
Unlike the previous measurement bounds for the nondecorrelating sampling schemes, here is a fixed constant independent of the mixture matrix . Consequently, the total number of CS measurements used for source recovery is:
(38) 
Note that, for a noiseless sampling scenario () the minimization (17) can be decoupled into independent minimizations, each of them corresponding to a sparse recovery of a certain source. Now, if we assume that each source has exactly nonzero coefficients, then a perfect recovery can be guaranteed as long as which, for a matrix drawn form the previouslymentioned distributions, implies that and consequently:
(39) 
Comparing to (38) where is roughly proportional to , here the measurement bound improves by a factor and it is mainly proportional to the sparsity level of all sources.
IvD Conclusions on the Theoretical Bounds
Consider a multichannel data derived by the linear mixture (7) of sources, each having a sparse representation i.e. is sparse. Table I summarizes the scalingorders of the number of CS measurements sufficient for an exact data reconstruction for different noiseless random acquisition schemes and sparse recovery approaches.
CS Acquisition Scheme  Dense  Dense  Uniform  Decorrelating 
CS Recovery Approach  BPDN  SS  SS  SS 
CS measurements  
Constant depends on    Yes  Yes  No 
As we can observe, compressed sensing via source recovery using (10) once it is coupled with a proper CS acquisition (i.e., Dense i.i.d. subgaussian , or a random decorrelating sampling scheme as in sections IIIB1 and IIIB3) leads to a significantly improved bound compared to standard methods such as BPDN. More remarkably, the number of CS measurements turns out to be independent of the number of channels.
Finally note that the measurement bound for the sourceseparationbased reconstruction approach, which uses a nondecorrelating random compression matrix, depends on the conditioning of the mixture parameters via the constant factor in (36). Therefore, when the columns of are highly coherent, the condition number of becomes relatively large, and so does . This limitation can be circumvented thanks to the decorrelating acquisition scheme.
V Applications in Compressive Hyperspectral Imagery
Compressed sensing is particularly promising for hyperspectral imagery where the acquisition procedure is very costly. This type of images can be approximated by a linear mixture model as in (7) where each spatial pixel is populated with a very few number of materials (i.e. sources). In this regard, is a matrix whose columns are source images (vectorized 2D images) indicating the percentage of each material in one of the spatial pixels, and therefore
(40) 
Moreover, is a matrix whose columns contain the spectral signatures of the corresponding sources of . Note that in some particular applications and specially when the spatial resolution is high enough, the source images become disjoint, meaning that each spatial pixel contains only one material and .
The two key priors that will be essential for compressive source identification are the following: i) Each source image contains piecewise smooth variations along the spatial domain, implying a sparse representation in a wavelet basis, or sparsity of its gradient, and ii) each spatial pixel is a nonnegative linear combination of a small number of sources.
In the next two sections we introduce two classes of source separation based recovery approaches that are particularly adapted to hyperspectral compressive imagery.
Va Compressive HSI Source Separation via Convex Minimization
According to our earlier assumptions, source images are spatially piecewise smooth, which means the coefficients of are sparse in a 2dimensional wavelet basis . We conveniently rephrase this representation in a vectorized form with as described in Section IIB.
Taking into account the sparsity of and by incorporating specific assumptions such as (40) and nonnegativity we can extend the minimization approach in (10) as follows:
(41)  
subject to  
Where, denotes an all one dimensional vector. The first constraint is the same as the fidelity constraint in (10). The last two constraints impose the elementwise nonnegativity of and the “percentage” normalization (40) i.e., each row of belongs to the positive face of the simplex in . Minimizing the norm together with the last two constraints (that is equivalent to an additional norm constraint) gives solutions that contain both desired sorts of sparsity: i) along the 2D wavelet coefficients of and, ii) along each row of .
Note that the theoretical analysis given in Section IVC can also apply here to bound the performance of (41). Although we bound the error similarly as for (10), one can naturally expect a much better performance for (41) thanks to the two additional constrains.
Alternatively, problem (41) can be formulated in a more general “analysis” formulation with an analysis sparsity prior :
(42)  
subject to  
which is equivalent to (41) when and is a square and invertible operator. Another efficient analysis prior for image regularization is the Total Variation which can be applied on each source image of the HSI with the prior: . The problem formulation (42) is general and includes the decorrelating schemes discussed in sections IIIB1 and IIIB3. Indeed inserting the matrix of (12) in (42) leads to the following fidelity term while the other terms remain unchanged.
In the next Section we provide an iterative algorithm for solving problem (42). When sources are disjoint, it is also possible to add a hard thresholding postprocessing step that sets the maximum coefficient of each row of equal to one and set to zero the other coefficients.
VB The PPXA Algorithm for Compressive Source Separation
The Parallel Proximal Splitting Algorithm (PPXA) [21] is an iterative method for minimizing an arbitrarily finite sum of lower semicontinuous (l.s.c.) convex functions. Each of the iteration consists in computing the proximity operator of all functions (which can be done in parallel), averaging their results and updating the solution until convergence. The proximity operator of a function is defined as [21]:
(43) 
For solving (42) with PPXA, we rewrite it as the minimization of the sum of three l.s.c. convex functions:
(44) 
with , and and where is the indicator function of a convex set defined as:
and the convex sets are respectively, the set of matrices that satisfy the fidelity constraint , and the set of matrices whose rows belong to the standard simplex in . The template of the PPXA algorithm that solves (44) and hence (42) is given in Algorithm 1. We now derive the proximity operator of each function . Note that the definition of the proximity operator in (43) naturally extends for matrices by replacing the norm with the Frobenius norm.
For , a standard calculation shows that
(45) 
which is the soft thresholding operator applied on the wavelet coefficients of . The proximity operator of can be decoupled and computed in parallel for each of the sources via an efficient implementation proposed by [22]. By definition, the proximal operator of an indicator function is the orthogonal projection of onto the corresponding set . The projection onto the standard simplex can be done in one iteration using the method proposed by Duchi et al. [23]. For a general implicit operator , the projector onto can be computed using a forward backward scheme as proposed in [24]. This projection usually has the dominant computational complexity of the algorithm because of costly subiterations. However if the decorrelating sampling scheme is used and is a tight frame (i.e., for a constant ), then according to the semiorthogonal linear transform property of proximity operators [21], the orthogonal projection onto has the following explicit form:
(46) 
with .
VC Compressive HSI Source Separation via Iterative Hard Thresholding
If the source images are disjoint, the following nonconvex minimization can be alternatively used for recovering the sparse wavelet coefficients of the sources:
(47)  
subject to  
where the operator returns the offdiagonal elements of matrix , and the norm constraint on imposes the wavelet coefficients to be sparse. The second constraint imposes the orthogonality of the wavelet coefficients which is a consequence of the source disjointness. The two last constraints are the same as in (41).
Despite its convex objective term, (47) has multiple nonconvex constraints and is therefore a nonconvex problem. We propose an algorithm similar to the Iterative Hard Thresholding (IHT) algorithm [15] to approximate the solution of (47). At each iteration the current solution is updated by a gradient descent step followed by a hard thresholding step that selects the largest wavelet coefficients of . In addition the three last constraints of (47) are applied sequentially:

First, a procedure inspired by the orthogonal matrix procrustes is applied to diagonalize . Let be a diagonal matrix where for we have
Since for disjoint sources we have , then a good orthogonal matrix that would approximate and keeps the energy of the current estimate of each source image proportional to that of the previous estimate would be through the following singular value decomposition .

Second, the current solution is projected onto the standard simplex as in [23].
The description of the this algorithm can be found in Algorithm 2. Note that the gradient of the objective functional