Sparse and Nonnegative BSS for Noisy Data
Abstract
Nonnegative blind source separation (BSS) has raised interest in various fields of research, as testified by the wide literature on the topic of nonnegative matrix factorization (NMF). In this context, it is fundamental that the sources to be estimated present some diversity in order to be efficiently retrieved. Sparsity is known to enhance such contrast between the sources while producing very robust approaches, especially to noise. In this paper we introduce a new algorithm in order to tackle the blind separation of nonnegative sparse sources from noisy measurements. We first show that sparsity and nonnegativity constraints have to be carefully applied on the soughtafter solution. In fact, improperly constrained solutions are unlikely to be stable and are therefore suboptimal. The proposed algorithm, named nGMCA (nonnegative Generalized Morphological Component Analysis), makes use of proximal calculus techniques to provide properly constrained solutions. The performance of nGMCA compared to other stateoftheart algorithms is demonstrated by numerical experiments encompassing a wide variety of settings, with negligible parameter tuning. In particular, nGMCA is shown to provide robustness to noise and performs well on synthetic mixtures of real NMR spectra.
I Introduction
In many applications, such as nuclear magnetic resonance (NMR) spectrometry or massspectrometry, measurements are often made of mixtures of physical components which can be identified by their specific spectrum. Discriminating between these elementary components or sources can be made by acquiring several measurements at different times or locations in order to observe different mixtures, yielding multispectral data.
In this context, blind source separation (BSS) aims at recovering the spectra from measurements in which the components sources are mixed up together in an unknown way. The instantaneous linear mixture model assumes that the measurements are linear mixture of sources with spectra . In other words, there exist mixture coefficients such that:
(1) 
where the vectors are added in order to account for noise and model imperfections. This mixing model can be conveniently rewritten in matrix form:
(2) 
where:

is the number of measurements.

is the number of samples of the sources.

is the number of sources.

is the measurements matrix in which each row is a measurement.

the unknown source matrix in which each row is a spectrum/source.

the unknown mixing matrix which defines the contribution of each source to the measurements.

is an unknown noise matrix accounting for instrumental noise and/or model imperfections.
With the notation (Frobenius norm for ) and for independent and identically distributed Gaussian noise, the maximumlikelihood estimate is then provided by the standard problem:
(3) 
However, this problem presents an infinite number of solutions which are not necessarily of any interest with respect to a given application. It is therefore standard to constrain and/or so as to limit the search to minima with a desired structure. In this article, we focus on the assumption that both and are nonnegative, yielding Nonnegative Matrix Factorization (NMF).
The nonnegativity assumption arises naturally in many applications, including text mining [1], clustering [2], audio processing [3] and spectrometry [4]. Indeed, the sources in can for instance be mass or power spectra, which are nonnegative, and the mixtures can represent concentrations, which cannot be negative either. The problem is written under the constrained form:
(4) 
can be the distance such as in (3) or any divergence taking into account other priors on the noise [3, 5, 6]. In this article, we focus on Gaussian noise and therefore use the distance.
The first publications dealing with this type of problems comes from Paatero & Tapper [7] and Lee & Seung [5] who provided convergent gradient descent type algorithms. They emphasized on the importance of the nonnegativity constraint which yields a “partbased” representation of the data. Indeed, only additive/constructive, and not subtractive/destructive combinations are then allowed in order to represent the data.
NMF shares a couple of indeterminacies with other BSS problems:

permutations: if is a permutation, so that one cannot hope to recover the order of the rows of and columns of .

scales: if is diagonal with strictly positive diagonal coefficients, so that if is a solution of (4), is also solution.
However, unlike problem (3) for which a solution can be conveniently obtained through SVD decomposition for instance, NMF is NPHard [8] and can present numerous local minima. For this reason, additional constraints or priors information can be helpful to recover the sought sources. In the next sections, we impose a sparse prior on the sources, together with their nonnegativity.
Contributions
This article details and extends the preliminary work in [9] with more complete experimental study and a more advanced understanding of the algorithms. Specifically, noise is accurately accounted for. It illustrates the proposed approach on realistic data as well. In Section II, we first give a review of the stateoftheart NMF methods and sparse NMF techniques proposed so far. Our approach aims at obtaining a better management of the noise through an ad hoc regularization strategy. For that purpose, we introduce in Section III an extension of the sparse BSS algorithm GMCA (Generalized Morphological Component Analysis) to tackle the problem of sparse and nonnegative BSS from noisy data. This extension is motivated by the robustness of GMCA to additive noise contamination.
We further show that a special care has to be taken to constrain both the sparsity and the nonnegativity of the signals. More specifically, improperly constrained methods can lead to unstable and suboptimal solutions. We therefore introduce the nonnegative GMCA (nGMCA) which has the ability to provide exact solutions of the constrained and penalized subproblems, thanks to proximal calculus methods. Last but not least, the proposed algorithm is a simple to use technique since it barely requires any parameter tuning.
In Section IV the proposed approach is compared with standard NMF and sparse NMF algorithms. Exhaustive experiments show that nGMCA allows for better robustness to noise, large numbers of sources and the conditioning of the mixing matrix, and therefore outperforms most algorithms for a broad variety of settings. Finally, in section V, we illustrate how nGMCA performs also well to separate out synthetic NMR spectra.
Ii Stateoftheart NMF methods
Iia Standard NMF algorithms
The minimization of the cost function from Problem (4) is generally solved by alternately updating and . Indeed, while this problem is not convex, the subproblem in writes:
(5) 
and the equivalent subproblem in are often convex and their minimization is therefore much easier. These alternating updates are then repeated for a large number of iterations.
The first converging NMF algorithm, designed by Lee & Seung [5, 10], updates and with a weighted gradient descent. The weights insure that the gradient steps do not increase the cost function in (4) and keep and nonnegative, since the update can be recast as a pointwise product of nonnegative matrices. With the elementwise matrix multiplication and the elementwise matrix division, the update rule for and in the case of a least square cost function can be written as follows:
(6) 
The multiplicative update rule is usually considered as a standard because of its convenience, with no parameter to set. As well, it was the first convergent algorithm proposed to solve the NMF problem. However, it has been shown to be slow and that the monotone decrease of the algorithm does not insure convergence to a local minimum [11, 12]. Firstorder methods are also used in projected/proximal gradient descent algorithms [13], interior point gradient [14], and quasiNewton algorithms [15].
Another standard approach is the Alternating Least Square (ALS) algorithm explained in [7], which solves exactly the unconstrained cost function with a pseudoinverse and projects the result on the nonnegativity constraint:
(7) 
with the operator . This algorithm is also widely used due to its easy implementation and its efficiency in decreasing the cost function. However, it does not necessarily converge. In Hierarchical ALS (HALS, [16, 17]), the columns of and rows of are processed one by one. This yields a simple and fast optimization process to solve the constrained subproblems.
It is however possible to solve exactly the constrained subproblems of type (5) at each iteration as follows:
(8) 
In [18] for instance, Lin uses a projected gradient descent subroutine to solve the subproblems. Guan et al. later provided a faster firstorder method [19].
It has to be mentioned that other approaches based on geometrical methods have also been investigated to solve NMF problems[20, 21, 22]; these approaches are however generally very sensitive to noise.
As was previously stated, nonnegativity is not always sufficient to recover the actual sources and mixing matrix. In nonnegative ICA [23], one additionally enforces the independence of the sources. However, this approach is also sensitive to noise. Sparsity, on the other hand, has been shown to provide robustness to noise. We give a short introduction about this prior in the next section, before presenting sparse NMF algorithms.
IiB Sparsity & NMF
IiB1 An Introduction to Sparsity
Sparsity constraints have already proved their efficiency to solve a very wide range of inverse problems (see [24] and references therein). In the context of BSS, sparsity has been shown to increase the diversity between the sources which greatly helps their separation [25, 26, 27]. In the wide sense, a sparse signal is such that its information content is concentrated into only a few large nonzero coefficients, or can be well approximated in such a way. The sparsity of a signal however depends on the basis or dictionary in which it is expressed. For instance, a sine wave will be sparse in the Fourier domain since it can be encoded with one coefficient in this domain, while in the direct domain most of its coefficients are nonzero: the more a basis captures the structures of a signal, the sparser will be in such a basis. In this article we will however only focus on sparsity in the direct domain.
IiB2 Sparse NMF Algorithms
In many applications the nonnegativity and sparsity of the sources arise naturally as for instance in MS or NMR spectroscopy. Recent works have emphasized the fact that this knowledge can indeed help perform more relevant factorizations [2, 28, 29].
In [2], Kim & Park have proposed to formulate the NMF problem as:
(9) 
In this equation, the sparsityenforcing regularizer favors solutions where a single source dominates at each sample. However, it does not enforce the intrinsic sparsity of each of the sources. The authors made use of an activeset method to solve this problem. This technique can solve exactly each constrained subproblem in a similar way than in Problem (8). However, it is not clear how the parameters and must be set. The authors provide an implementation on their website^{1}^{1}1http://www.cc.gatech.edu/~hpark/nmfsoftware.php.
In [15], Zdunek & Cichocki used a similar regularization term for but with decreasing during the algorithms in order to be more robust to local minima, and without exactly solving the subproblems.
In [29], Hoyer used a sparse regularization of the form that uniformly enforce the sparsity of . The author later used a different type of sparse prior in [30] defined for some vector as follows:
(10) 
This sparseness function goes from 1 when is perfectly sparse —only 1 active coefficient— to 0 when it is not at all —all coefficients active, with the same value. The idea is therefore to optionally impose a chosen level of sparsity for the sources and/or the mixtures:
(11) 
This problem is solved by using projected gradient descent steps followed by a projection on the sparseness constraint if the constraint is active and with a multiplicative update otherwise. However, the constraint is directly related to the expected sparseness level of the sources which is not necessarily known beforehand. Furthermore, hardconstraining the sparseness level may make the solution very dependent on the sparseness parameters and . An implementation of this algorithm is available online^{2}^{2}2http://www.cs.helsinki.fi/u/phoyer/software.html.
Sparse HALS [16, 17] aims at solving Problem (4) with a sparsity penalization of the form . In the case of an data fidelity term, this problem can be rewritten as:
(12) 
In this algorithm, the columns of and lines of are updated one by one. As each subproblem admits a straightforward analytic solution and no matrix inversion is required, the HALS algorithm turns to be a simple and fast NMF solver. In a recent implementation of the HALS [31, 32]^{3}^{3}3https://sites.google.com/site/nicolasgillis/code, the parameter of is automatically managed in order to obtain a required sparsity rate (defined as the ratio of coefficients smaller than times the largest coefficient).
It is important to note that none of these algorithms makes use of the sparse prior explicitly to deal with additive noise; therefore they may not be robust in case of noise contamination.
Iii Nonnegative Generalized Morphological Component Analysis
Iiia A first naive extension
In the last decade the use of sparsity in the field of BSS has been widely explored. In [27, 33], the authors have introduced a sparsityenforcing BSS technique coined Generalized Morphological Component Analysis (GMCA) which has been shown to be effective at separating out sparse signals from noisy data. Morphological diversity has been defined in [33] as a mean to characterize separable sources based on their geometrical structures or morphologies : separable sources with different morphologies do not share the same significant coefficients in a given sparse representation. When sparsity holds in the direct domain this means that the entries of each source with the most significant amplitudes should be different. This does not mean that their supports are disjoint but rather their most significant elements should be disjoint.
The objective of this paper is to extend this sparsityenforcing BSS algorithm to deal with nonnegative mixtures. Following [27], the GMCA with an additional nonnegative constraint estimates the mixing matrix and the sources by minimizing the following optimization problem:
(13) 
The pseudonorm counts nonnull coefficients in and therefore limits their number, thus enforcing the sparsity of . In the vein of Alternating Least Squares, GMCA alternately and iteratively estimates the unconstrained least square solution and projects on the nonnegativity constraint, with an additional thresholding step for the sources in order to keep only the most significant coefficients. These updates are provided in lines 6 and 7 of Algorithm 1 where the hardthresholding operator is defined as follows:
(14) 
It has been emphasized in [27] that one crucial feature of GMCA is the use of a decreasing threshold . At the beginning, this parameter is first set to a high value and then decreases throughout the iterative down to a final value that depends on the noise level. Simulated annealing has already inspired decreasing and regularizations in NMF [15, 34]. The motivation behind a decreasing threshold is however different:

first estimating the mixing matrix from the entries of the sources that have the highest amplitude and thus likely to belong to only one source.

help removing the smallest coefficients which are more sensitive to noise contamination.
In the same way as in the original GMCA, for a source with index , the threshold is set to where is an empirical estimator (the median absolute deviation) of the source noise variation. is chosen at each iteration in order to obtain a linear increase of the number of active coefficients in , so as to refine the estimation while maintaining some continuity. The final , , is usually taken in the range as a tradeoff between sufficient denoising and correct separation. Indeed, for a sparse signal contaminated by i.i.d. Gaussian noise with standard deviation , thresholding at 3 rejects noise samples with probability 0.99. Still, in BSS, too large a final threshold could leave some leakage between the sources. This nGMCA will be considered as a naive extension of GMCA and used as a reference algorithm. It will be coined naive nonnegative GMCA (nGMCA).
Limitations
It is very important to notice that this algorithm, as based on alternating projected leastsquares, is only a proxy which generally does not converge to a stable couple , hence the “naive” denomination. This means that the solution given by this type of algorithm may not be stable and be suboptimal. The solution may not provide the sparsest nonnegative sources. This is due to the fact that it deals with the data fidelity term and the constraints in a completely independent way, thus not exactly solving the subproblems such as in Problem (8).
Next, we therefore propose an alternative algorithm which exactly solves the nonnegatively constrained and penalized subproblems and should allow for more robust and stable solutions.
IiiB nGMCA
We first propose to tackle the sparse nonnegative BSS problem using an sparse regularization with such as in (13). This then amounts to solving the following optimization problem:
(15) 
where is the characteristic function of the nonnegative orthant, which enforces the nonnegative constraints. The characteristic function of the nonnegative orthant is defined as:
(16) 
In the previous section, we emphasized that a naive approach based on projected leastsquare does not necessarily provide a stable and thus optimal solution to the problem. To go beyond the aforementioned naive extension of GMCA, one has to alternatively and exactly minimize the constrained subproblems in and so as to obtain stable solutions with the sought structure. Let’s first have a look at the subproblem in ; assuming is fixed, the sources are estimated as follows:
(17) 
This problem minimizes a function which can be split into the sum of a differentiable quadratic term and two nonsmooth and nondifferentiable terms: the norm and the characteristic function of the nonnegative orthant . The choice provides the closest convex surrogate to the sparse norm and in this case, the problem admits a unique solution which, however, cannot be formulated explicitly. Fortunately, thanks to recent advances in proximal calculus and splitting techniques, efficient algorithms can be designed to solve this type of problems.
IiiB1 Proximal calculus and splitting methods
Proximal splitting methods [35] aim at minimizing convex functions which may not be differentiable. The idea of these methods is to split the cost function into the sum of several convex functions which are alternately locally minimized. Many problems can be decomposed into the general form:
(18) 
where is a smooth and convex data fidelity term; and a convex regularization term which may not be differentiable.
Since is differentiable, it can be locally minimized with a gradient descent step. When is not, but if it is convex, proper and lower semicontinuous, one can define the proximal operator of :
(19) 
In this case, the following process called forwardbackward splitting algorithm (FBS, see [35]):
(20) 
has been shown to converge to the solution of Problem (18) if is Lipschitz.
The update of is made by solving the following problem:
(21) 
which can be recast in the general form (18) by defining and . In this case, the gradient of is trivially equal to: , it is Lipschitz with where is the matrix spectral norm. The proximal operator of is the projector onto the nonnegative orthant . The FBS then reduces to a projected gradient algorithm similar to the updates in [18].
Similarly, the update of in Problem (17) with can be solved with FBS, with . The proximal operator of this function also takes an explicit form usually termed skewedposition softthresholding operator:
(22) 
where the softthresholding operator is defined as:
(23) 
This softthresholding, induced by the norm, is wellknown to introduce a bias. It is therefore customary to use . In the FBS this is made by replacing the softthresholding operator with the hardthresholding. Rigorously, it is not a proximal operator since is not convex nor semicontinuous; this means that there is no convergence guarantee of the forwardbackward splitting algorithm when hardthresholding is used.
IiiB2 Description of nGMCA
In the same vein as the naive nGMCA introduced in the previous section, and are alternately updated with the exception that, now, each subproblem is solved exactly, which guarantees that the solution is stable and has the sought structure. The main steps of the algorithm are given in Algorithm 2. One must notice that exactly solving each subproblem is costly since it requires subiterations at each step. Fortunately, it has been recently showed that the speed of convergence of the FBS algorithm can be greatly improved by using the multistep techniques introduced by Nesterov [36]. Both steps have the algorithm make use of an accelerated version of the FBS algorithm [37]. A detailed description of how this accelerated algorithm is used to update is given in appendix A.
A special care has to be given to renormalizations since the regularization tends to reduce the norm of . More specifically, since the norm of can keep increasing to compensate the reduction of the norm of , the algorithm can converge to the degenerate solution and . In the algorithm, the columns of are therefore renormalized to unity before updating . This also assigns to the coefficients of their overall importance in the estimation of .
Following the general thresholding strategy used in GMCA and its extensions, the threshold decreases from step to step. However, the strategy used in this version nGMCA differs from the one used in the naive approach. In the former, the threshold is applied to the sources as defined by their leastsquare estimate. On the contrary the threshold in nGMCA applies at each gradient descent step. The update rule of in the subiterations of nGMCA (without the acceleration) can be written as follows:
(24) 
with containing only ones. Iterative softthresholding therefore operates on the gradient and not directly on the source values like in the naive nGMCA. Also, unlike with hardthresholding, a variation of affects all active coefficients. Our strategy consists in starting with a large parameter which forces the coefficients of to be nonincreasing in the first iteration. The threshold is then linearly decreased in order to refine the solution while preserving continuity, down to where is this time an estimate of the noise level in the row of the gradient.
We also implemented a version of nGMCA using hardthresholding aiming at solving Problem (15) with an pseudonorm instead of the norm for the regularization. The superscript and are specified to differentiate between respectively the hard and the softthresholding versions.
Iv Numerical experiments
In this section we first compare the introduced algorithms and classical algorithms on noiseless data, in order to better understand their behaviors, and then we benchmark the GMCAbased algorithms with stateoftheart sparse algorithms on noisy data. The settings of the simulations and the evaluation methodology are described in the following sections.
Iva Settings
Reference matrices and coefficients are uniformly generated respectively from the distribution of and , where:

is a Bernoulli random variable with activation parameter , i.e. it equals 1 with probability and 0 otherwise.

is a centered and reduced Generalized Gaussian random variable with shape parameter .
In practice and control 2 kinds of sparsity. The Bernoulli parameter affects the number of actual zeros in and and therefore exact sparsity. On the other hand, selects the sharpness of the distribution of , which pdf is proportional to (with and dependent on the standard deviation which is fixed to 1 here). As special cases, for , is a Gaussian random variable, and with it is a Laplacian random variable. With , is considered as approximately sparse —the generated signals become sparser when decreases.
In the experiments and unless stated otherwise, the standard settings will be , , with measurements of samples.
IvB Evaluation of the Results
In order to evaluate and compare the algorithms, a scale and permutation invariant criterion is needed. This criterion has to be well adapted to measure the reconstruction performance. In many applications, the signal of interests are the sources. More precisely, it has to be noticed that noisereducing priors are applied on the sources only. This implies that a good estimate of the sources should be the least contaminated by noise and interferences from the other sources. Moreover, in a noisy setting, a perfectly estimated mixing matrix do not necessarily yield a good estimate of the sources: indeed a slightly degraded mixing matrix may be preferred if it leads to less noisy sources. These points make criteria based on this variable not adequate to measure a good separation. In the next, we will focus on estimating the separation performance using a criterion on the sources .
In [38], Vincent et al. have proposed different criteria to evaluate the performance of blind source separation techniques. In the noisy setting, they propose separating each estimated source into the sum of several components:
with the projection of on the reference source it estimates; and , in its orthogonal space, respectively standing for interferences with other sources, contamination with noise and contamination with algorithm artifacts. They design an SNRtype energy ratio criterion named Source Distortion Ratio (SDR):
(25) 
As stated in [38], this criterion is a global performance measure taking into account all the elements of the reconstruction, i.e. a correct separation (low ), efficient denoising (low ) and little artifacts left by the algorithm (low ). Also, this criterion has the advantage of being scaleinvariant. In the next, the SDR will be used to evaluate the separation performance of the proposed technique with respect to stateoftheart methods.
Because of the permutation invariance, one cannot know a priori which estimated source stands for which reference source. Reference and estimated sources are therefore paired onetoone in order to obtain the best mean SDR. The SDR on (coined SDR) has be used in the experiments below in order to assess the performances of the algorithms. In the following, the SDR has been evaluated from several MonteCarlo simulations; the number of simulations will be given in each figure’s caption.
IvC Behavioral Study: Noiseless Data
In this first experimental section, nGMCA, nGMCA and nGMCA are tested on data with a large activation rate for () and therefore not the very sparse sources for which they are designed. These settings are not favorable for the nGMCA algorithms, which allows to emphasize the differences of behavior between them. Since the problem reduces to an exact factorization when there is no noise, is set to 0, hence the final threshold in this case is identical for all the algorithms. ALS [7], the Multiplicative Update [10] and (nonsparse) accelerated HALS [31] are also performed as standard algorithms to play the role of references. The maximum number of iterations were set to large enough numbers in order to assure the convergence of all the algorithms. Precisely, the number of iterations is set to 5000 for HALS, 40,000 for the Multiplicative Update, 500 for ALS and the GMCAbased algorithms, with a maximum of 80 subiterations in nGMCA and nGMCA).
IvC1 Summary of the experiments

Figure 1: this benchmark shows the influence of the number of sources on the estimation of . This parameter is of course important since the more numerous they are, the more difficult they are to separate. nGMCA outperforms ALS, while with , the final iterations are identical for both algorithms. The thresholding strategy of the first stage of the iterations therefore proves its efficiency for the separation. Though nGMCA is not performing as well as nGMCA and nGMCA with few sources, it is much more robust than all the algorithms with large numbers of sources. This is further detailed in paragraph IVC3.

Figure 2: this figure exhibits the evolution of the cost function throughout the iterations, for 40 sources and a large activation rate (). In the refinement phase, the sparsity parameter is left constant at its final value in order to observe the convergence of the algorithms and the possibility to enhance the reconstruction. nGMCA converges to a lower value than nGMCA while nGMCA does not converge at all. An explanation is provided in paragraph IVC2.

Figure 3: this benchmark shows the influence of coefficients distribution on the reconstruction. Modifying the parameter is a way to make its distribution more or less sparse. A sparser yields less correlated columns and hence a better conditioning of the problem (table 4) which simplifies the separation. nGMCA tends to be more robust than the other algorithms to illconditioned mixtures.
0.50 0.70 0.90 1.20 1.60 2.00 3.00 4.00 cond() 6.90 8.36 9.43 10.7 12.0 12.9 14.3 15.3 Fig. 4: Conditioning of with respect to
(, average of 48 simulations)
IvC2 Properly accounting for the constraints
The differences of performance between nGMCA and nGMCA for large numbers of sources and large activation rates in figure 1 can be understood by observing the evolution of the cost function during the iterations (figure 2). Properly applied nonnegativity and sparsity constraints can help refining the reconstruction of and once the sources have been sufficiently disambiguated. Indeed, since nGMCA does not exactly solve the constrained cost function, it does not necessarily neither converge to a minimum nor lead to a stable solution, while nGMCA does.
IvC3 Vs
The explanation of the differences between nGMCA and nGMCA lies in the properties of hard and softthresholding. This is summarized in figure 5 which shows thresholding applied to three twodimensional points . When a point —a column of — has two large coefficients, i.e. when the point is in the quadrant, it suffers a bias with softthresholding while it remains untouched with hardthresholding. On the other hand, the shift induced by softthresholding increases the ratio of its larger coefficient over its smaller one, which helps reinforce the affectation of a point to the direction of its larger coefficient (in this case: ). With the thresholded point, this means here that . While the lesser bias created by hardthresholding can lead to better accuracy, reinforcing the affectation of each point to a direction with softthresholding leads to better separation of the sources.
With few sources, both nGMCA and nGMCA separate correctly as displays figure 1. The bias is therefore costly for nGMCA since it leads to some compensation behaviors: with 80% activation rate, nearly every coefficient suffers the bias and one source tends to compensate for all the others’ with a positive offset. This can be seen on source 5 in figure 7. The offset correlates with all the ground truth sources as can be observed from the correlation matrix in figure 6: estimated source gathers all the thresholded coefficients of the other estimated sources and is therefore affected by the interferences.
On the other hand, the effect of softthresholding on the coefficients amplitude helps giving more weight to larger coefficients which is essential for the separation of sources from illconditioned mixtures as shown in figure 1 with a large number of sources ; and in figure 3 with a large for instance.
IvD Noisy Data
In this section, noise is added to the data and the input of the algorithms is therefore , with a Gaussian matrix with independent and uniformly distributed coefficients. In the experiments, the amount of noise is given in term of SNR on the data , SNR. nGMCA, nGMCA and nGMCA are compared with Hoyer’s [30], Kim & Park’s algorithms [2] and sparse accelerated HALS [32], which are competitive, publicly available algorithms and take sparsity into account in different ways (paragraph IIB2). There is no straightforward way to set Kim & Park’s algorithm parameters and we therefore used the default parameters of the implementation. It is then left running until convergence. For Hoyer’s algorithm, no prior is applied on and the sparsity ratio of is optimally tuned using the ground truth sources, since there is no automatic way to set the parameters. The sparsity level for the sparse accelerated HALS is also provided from the ground truth sources and both algorithms are left running for 5000 iterations in order to assure convergence. In all this section, in the nGMCA algorithms, as an effective tradeoff between noise removal and good separation of the sources.
The comparisons also include an oracle which solves the nonnegatively constrained inversion problem in using the groundtruth sources :
(26) 
The sparsity parameter is set to such as in nGMCA. This oracle stands for the optimal which could be recovered by nGMCA if the uncontaminated mixtures were known. Of course, since the mixture are not known in practice in BSS, the oracle yields unachievable results, but it provides a reference line for the comparisons and a limit for the progression margin of the reconstructions.
IvD1 Summary of the experiments

Figures 8, 9 and 10: these benchmarks show the reconstruction results for 15 sources with activation rate of 10% (figures 8 and 9) and 30% (figure 10) —the lower the activation rate, the better the sparse prior— with a varying level of noise contamination in the data. Figure 9 and 10 display the loss in SDR compared to the oracle in order to facilitate the visualization. In both cases, nGMCA is less sensitive to noise and outperforms the other algorithms.

Figures 11: this benchmark shows the same experience than the previous ones but with a larger activation rate (50%) which is less favorable to the GMCAbased algorithms. nGMCA remains better in most settings. Hoyer’s algorithm performs similarly to nGMCA for very noisy data but it is important to remember that in our experiment, Hoyer’s algorithm and sparse accelerated HALS are provided with the ground truth sparsity ratios, which would not be available with such precision in practice. For cleaner data, nGMCA and nGMCA begin to overtake nGMCA, which corroborates the results of the previous section for noiseless data with large activation rates and few sources (figure 1).

Figure 12: This benchmarks provides the reconstruction results for noisy data (15dB), 15 sources, a low activation rate (30%) and a varying number of measurements . The lower the number of measurements, the more difficult the reconstruction is, since the redundancy can help denoising and discriminating between the sources. While we have exhibited results for a large number of measurements so far, this shows that nGMCA also compares favorably with other algorithms when the number of measurements is more restrained.

Figure 13: This benchmarks provides the reconstruction results for sparse (30% activation rate) and noisy (15dB) data, and a varying number of sources . The complexity of the separation rises with the number of sources hence the reconstruction results decrease with it for all algorithms, but in any case, nGMCA performs best for all the values.
IvD2 About the initialization and the separation
Figure 14 provides the same results as figure 10 but compares nGMCA and nGMCA with version of them which are initialized with and and hence, with a perfect separation from the start. The initialized nGMCA is also provided with the exact noise standard deviation. The difference in term of reconstruction quality between the regular algorithms and their optimally initialized version is extremely small. This shows that the automatic estimation of the noise level within nGMCA is appropriate, and that the initialization of nGMCA and nGMCA is robust.
IvE Conclusion of the Experiments  the Compromise
Remember that the estimated sources (see [38]) can be decomposed as follows:
(27) 
Any BSS algorithm must minimize at the same time interferences, noise and artifacts in order to achieve good performance. These three terms are strongly affected by the sparsity prior:

Interferences: They intervene when the sources are not correctly, or not completely, separated. The term is computed as the projection on all the sources but the target. Sparsity, as a measure of diversity, can greatly help getting a correct separation of the sources, hence keeping this term relatively small. However, it can still create interferences when the sparse model for the sources departs from their actual structure, such as in figure 7. Interferences then originate from an imperfect source prior and/or badly separated sources.

Noise: is the part of the reconstruction that projects on the noise but not the sources. Since the Gaussian noise studied in this article spreads uniformly on all the coefficients, while sparse sources concentrate their energy on few coefficients, the thresholding effect implied by and regularizations significantly denoises the estimates. This reduces the importance of the noise term and therefore helps obtaining better reconstructions.

Artifacts: for a given source, the artifacts gathers the residues which are neither explained by the other sources nor the noise. We observed that the softthresholding operator introduces a bias which is the main contributor to the artifacts. Again, this term will increase when the sparsity level of the sources decreases: in such a case, the sparsity prior is not as well suited to constrain the morphology of the sources. It is also important to notice that even when the sources are very sparse, improperly constrained solutions are more akin to be contaminated with a higher level of artifacts.
All these aspects interact with each other. As shown in this section, nGMCA provides an effective tradeoff between noise, interferences and bias. Indeed, through the experiments, we show that nGMCA outperforms other algorithms in most scenarios, according to the SDR criterion which takes into account these three origins of reconstruction deterioration.
For low activation rate (high sparsity), nGMCA performs definitely better than the other algorithms for a large range of noise levels (figures 8 and 10) while in the extreme noiseless case it performs quite reasonably. In this setting, the sparsityenforcing prior plays its role at: i) getting a good separation process with respect to other priors (such as the pseudonorm); this helps reducing the interferences, ii) correctly denoising the sources; this tends to lower the noise contribution and artifacts.
nGMCA is noticeably quite robust to departures from the sparsity assumptions: it performs reasonably well with large activation rates (figures 1 and 11) but at the cost of a slight bias of the estimated sources (figures 6 and 7) which tends to increase the contribution of the artifacts.
Additionally, the nGMCA algorithm provides good separation performance for a large range of numbers of sources (figure 13) as well as for illconditioned problems arising from a lack of observations in figure 12, or from correlated mixing directions in figure 3. These results can be explained by the good separation power of the regularizer with an appropriate tuning of the regularization parameter, in order to disentangle sparse sources, together with the appropriate implementation of the nonnegativity constraints.
V Application
In physical applications, molecules can be identified by their specific Nuclear Magnetic Resonance (NMR) spectra. In this section, we simulate more realistic data, using NMR spectra of real molecules. These spectra are well adapted to the current settings since they are very sparse. The information about the peaks can be found in the Spectral Database for Organic Compounds, SDBS^{4}^{4}4http://riodb01.ibase.aist.go.jp/sdbs/cgibin/cre_index.cgi. The spectra were convoluted with a Laplacian with width at half maximum of 3 samples, in order to account for the acquisition imperfections. The number of samples is set to . is made of real spectra such as the ones displayed in figure 15. Some sources can exhibit strong normalized scalar product, such as cholesterol and menthone spectra for instance (0.67). The mixing coefficients of are simulated in the same way as in the previous section (, ). The observed data is where is an i.i.d. Gaussian noise matrix. An example of measurement where the lactose spectrum is particularly strong is provided in figure 16.
In figure 17, the number of measurements is limited to the number of sources, i.e. , which occurs in some applications; and the curves show the influence of noise in the data. With so few measurements, denoising becomes more important, while at the same time the noise is underestimated by the algorithm since the problem is less constrained. To compensate this behavior, is this time set to 2.
nGMCA fails to obtain suitable results. Indeed, in this setting the conditioning of the problem is extremely poor: and nGMCA is not able to converge. On the other hand, nGMCA performs from 3 to 5dB better than all the other algorithms. This shows once again that nGMCA is particularly robust for a large variety of settings. An example of reconstruction is given in figure 18, where nGMCA is able to identify more peaks that sparse accelerated HALS. Its reconstruction is however not completely noiseless, since there is always a tradeoff to find between denoising, separation and bias.
In figure 19, the number of measurements varies from 15 to 90. Since the conditioning greatly improves for larger numbers of measurements, nGMCA results increase very quickly. But in any case, although nGMCA and sparse accelerated HALS obtain similar results to nGMCA when there are enough measurements, nGMCA still performs better than all the other tested algorithms in most of the settings.
Vi Software
Following the philosophy of reproducible research [39], the algorithms introduced in this article will be available at http://www.cosmostat.org/GMCALab.
Vii Conclusion
In this paper we have introduced a new algorithm, nGMCA, to tackle the problem of sparse nonnegative BSS from noisy mixtures. Inspired by a recent sparse BSS algorithm coined GMCA, several extensions have been explored which imply that a rigorous handling of both sparse and nonnegative constraints are essential to avoid instabilities and suboptimal solutions. In particular, one extension estimates both a mixing and a source matrix by exactly solving the nonnegatively constrained and penalized subproblems, using proximal techniques. Extensive comparisons have been carried out with stateoftheart algorithms on synthetic data; these experiments show that this nGMCA extension is robust to noise contamination thanks to a dedicated thresholding strategy, with negligible parameter tuning. The experiments also show that it performs well for a wide variety of settings, including problems with highly correlated mixture directions, few observations or a large number of sources. Finally, the nGMCA algorithm yields highly competitive results on synthetic mixtures of real NMR spectra.
In this article however, the sparsity of the sources only held in the direct or sample domain. Future work will focus on extending nGMCA to deal with the more general setting where the sources are still nonnegative in the direct domain, but are sparse in a different signal representation.
Appendix A Resolution of the subproblems
Acknowledgment
J.B. was supported by the French National Agency for Research (ANR) 11ASTR03402MultID. We thank the associate editor and the anonymous reviewers for their help in improving the clarity and quality of the paper. We also thank Daniel Machado for his help in writing the paper.
References
 [1] M. W. Berry and M. Browne, “Email Surveillance Using Nonnegative Matrix Factorization,” Computational & Mathematical Organization Theory, vol. 11, no. 3, pp. 249–264, 2005.
 [2] J. Kim and H. Park, “Sparse Nonnegative Matrix Factorization for Clustering,” Georgia Institute of Technology, Tech. Rep. GTCSE0801, 2008.
 [3] C. Févotte, N. Bertin, and J.L. Durrieu, “Nonnegative matrix factorization with the ItakuraSaito divergence: With application to music analysis,” Neural Computation, vol. 21, no. 3, pp. 793–830, 2009.
 [4] R. Dubroca, C. Junot and A. Souloumiac, “Weighted NMF for highresolution mass spectrometry analysis,” Proceedings of EUSIPCO, pp. 1806–1810, 2012.
 [5] D. D. Lee and H. S. Seung, “Learning the parts of objects by nonnegative matrix factorization, ” Nature, vol. 401, no. 6755, pp. 788–791, 1999.
 [6] A. Cichocki, R. Zdunek, and S.i. Amari, “Csiszárs divergences for nonnegative matrix factorization: Family of new algorithms,” Independent Component Analysis and Blind Signal Separation, vol. 3889, no. 1, pp. 32–39, 2006.
 [7] P. Paatero and U. Tapper, “Positive matrix factorization: A nonnegative factor model with optimal utilization of error estimates of data values,” Environmetrics, vol. 5, no. 2, pp. 111–126, 1994.
 [8] S. A. Vavasis, “On the Complexity of Nonnegative Matrix Factorization,” SIAM Journal on Optimization, vol. 20, no. 3, pp. 1364–1377, 2009.
 [9] J. Rapin, J. Bobin, A. Larue and J.L. Starck, “Robust NonNegative Matrix Factorization for Multispectral Data with Sparse Prior,” Proceedings of ADA7, 2012.
 [10] D. D. Lee and H. S. Seung, “Algorithms for nonnegative matrix factorization,” Advances in neural information processing systems, vol. 13, no. 1, pp. 556–562, 2001.
 [11] E. F. Gonzalez and Y. Zhang, “Accelerating the LeeSeung Algorithm for Nonnegative Matrix Factorization,” Dept. Comput. Appl. Math., Rice University, Tech. Rep. TR0502, pp. 1–13, 2005.
 [12] M. Berry, M. Browne, A. Langville, V. Pauca, and R. Plemmons, “Algorithms and applications for approximate nonnegative matrix factorization,” Computational Statistics & Data Analysis, vol. 52, no. 1, pp. 155–173, 2007.
 [13] Y. Xu, “Alternating proximal gradient method for nonnegative matrix factorization,” arXiv:1112.5407v1, December 2011.
 [14] M. Merritt and Y. Zhang, “InteriorPoint Gradient Method for LargeScale Totally Nonnegative Least Squares Problems,” Journal of Optimization Theory and Applications, vol. 126, no. 1, pp. 191–202, 2005.
 [15] R. Zdunek and A. Cichocki, “Nonnegative matrix factorization with constrained secondorder optimization,” Signal Processing, vol. 87, no. 8, pp. 1904–1916, 2007.
 [16] A. Cichocki, R. Zdunek, and S.i. Amari, “Hierarchical ALS Algorithms for Nonnegative Matrix and 3D Tensor Factorization,” in Proceedings of ICA, 2007, pp. 169–176.
 [17] A. Cichocki, R. Zdunek, A. H. Phan, and S.i. Amari, Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multiway Data Analysis and Blind Source Separation. John Wiley & Sons, Ltd, 2009.
 [18] C.J. Lin, “Projected gradient methods for nonnegative matrix factorization,” Neural Computation, vol. 19, no. 10, pp. 2756–2779, 2007.
 [19] N. Guan, D. Tao, Z. Luo, and B. Yuan, “NeNMF: An Optimal Gradient Method for Nonnegative Matrix Factorization,” IEEE Transactions on Signal Processing, vol. 60, no. 6, pp. 2882–2898, 2012.
 [20] M. BabaieZadeh, A. Mansour, C. Jutten, and F. Marvasti, “A Geometric Approach for Separating Several Speech Signals.” in Proceedings of ICA, 2004, pp. 798–806.
 [21] W. S. Ouedraogo, A. Souloumiac, M. Jaïdane, and C. Jutten, “Geometrical method using simplicial cones for overdetermined nonnegative blind source separation: application to real PET images,” in Proceedings of LVA/ICA, 2012, pp. 494–501.
 [22] N. Gillis and S. A. Vavasis, “Fast and Robust Recursive Algorithms for Separable Nonnegative Matrix Factorization,” arXiv:1208.1237, 2012.
 [23] M. D. Plumbley and E. Oja, “A Nonnegative PCA Algorithm for Independent Component Analysis,” IEEE Transactions on Neural Networks, vol. 15, no. 1, pp. 66–76, 2004.
 [24] J.L. Starck, F. Murtagh, and M.J. Fadili, Sparse Image and Signal Processing  Wavelets, Curvelets, Morphological Diversity. Cambridge University Press, 2010.
 [25] M. Zibulevsky and B. A. Pearlmutter, “Blind Source Separation by Sparse Decomposition in a Signal Dictionary,” Neural Computation, vol. 13, no. 4, pp. 863–882, 2001.
 [26] Y. Li, A. Cichocki, S.i. Amari, S. Shishkin, J. Cao, and F. Gu, “Sparse Representation and Its Applications in Blind Source Separation,” NIPS, 2003.
 [27] J. Bobin, J.L. Starck, J. Fadili, and Y. Moudden, “Sparsity and morphological diversity in blind source separation,” IEEE Transactions on Image Processing, vol. 16, no. 11, pp. 2662–2674, 2007.
 [28] J. Eggert and E. Körner, “Sparse coding and NMF,” International Joint Conference on Neural Networks, vol. 2, no. 4, pp. 2529–2533, 2004.
 [29] P. O. Hoyer, “Nonnegative sparse coding,” Proceedings of NNSP, pp. 557–565, 2002.
 [30] P. O. Hoyer, “Nonnegative matrix factorization with sparseness constraints,” Journal of Machine Learning Research, vol. 5, no. 5, pp. 1457–1469, 2004.
 [31] N. Gillis and F. Glineur, “Accelerated Multiplicative Updates and Hierarchical ALS Algorithms for Nonnegative Matrix Factorization,” Neural Computation, vol. 24, no. 4, pp. 1085–1105, 2012.
 [32] N. Gillis, “Sparse and Unique Nonnegative Matrix Factorization Through Data Preprocessing,” Journal of Machine Learning Research, vol. 13, 2012.
 [33] J. Bobin, J.L. Starck, Y. Moudden, and J. Fadili, “Blind Source Separation : the Sparsity Revolution,” Advances in Imaging and Electron Physics, vol. 152, no. January, pp. 1–82, 2008.
 [34] A. Cichocki and R. Zdunek, “Regularized Alternating Least Squares Algorithms for Nonnegative Matrix/Tensor Factorization,” in Proceedings of the ISNN, 2007, pp. 793–802.
 [35] P. L. Combettes and V. R. Wajs, “Signal recovery by proximal forwardbackward splitting,” Multiscale Modeling Simulation, vol. 4, no. 4, pp. 1168–1200, 2005.
 [36] Y. Nesterov, “Gradient methods for minimizing composite objective function,” Mathematical Programming, vol. 140, no. 1, pp. 125–161, 2013.
 [37] A. Beck and M. Teboulle, “A Fast Iterative ShrinkageThresholding Algorithm for Linear Inverse Problems,” SIAM Journal on Imaging Sciences, vol. 2, no. 1, p. 183, 2009.
 [38] E. Vincent, R. Gribonval, and C. Févotte, “Performance measurement in blind audio source separation,” IEEE Transactions on Audio, Speech & Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006.
 [39] J. Buckheit and D. Donoho, “WaveLab and Reproducible Research.” SpringerVerlag, 1995, pp. 55–81.