# Semi-blind Source Separation via Sparse Representations and Online Dictionary Learning

###### Abstract

This work examines a semi-blind single-channel source separation problem. Our specific aim is to separate one source whose local structure is approximately known, from another a priori unspecified background source, given only a single linear combination of the two sources. We propose a separation technique based on local sparse approximations along the lines of recent efforts in sparse representations and dictionary learning. A key feature of our procedure is the online learning of dictionaries (using only the data itself) to sparsely model the background source, which facilitates its separation from the partially-known source. Our approach is applicable to source separation problems in various application domains; here, we demonstrate the performance of our proposed approach via simulation on a stylized audio source separation task.

Index Terms: Source separation, sparse representations, dictionary learning

## I Introduction

The blind source separation (BSS) problem entails separating a collection of signals, each comprised of a superposition of some unknown sources, into their constituent components. A canonical example of the BSS task arises in the so-called cocktail party problem in audio processing, and a number of methods have been proposed to address this problem. Perhaps the most well-known among these is independent component analysis (ICA) [1], where the sources are assumed to be independent non-Gaussian random vectors. Other source separation approaches entail more classical matrix factorization techniques like principal component analysis (PCA) [2], or, when appropriate for the underlying data, non-negative matrix factorization (NMF) [3] and non-negative sparse coding (NNSC) [4, 5].

Here we focus on a slightly different, and often more challenging, setting: the so-called single channel source separation problem, where only a single mixture of the source signals is observed. Single channel source separation problems require some additional a priori knowledge about the sources and their structure in order to perform separation [6, 7, 8, 9]. Here, we assume that the local structure of one of the source signals is approximately known (in a manner described in more detail below), and our aim is to separate this partially known source from an unknown “background” source.

Our separation approach is based on local sparse approximations of the mixture data. A novel feature of our proposed method is in our representation of the unknown background source – we describe a technique for learning (from the data itself) a model that sparsely represents the unknown background source, using tools from the emerging literature on dictionary learning (see, e.g., [10, 11, 12]).

While our proposed approach may find utility in source separation tasks in a variety of application domains, our effort here is motivated by an audio processing application in law enforcement scenarios where electroshock devices are utilized to induce temporary incapacitation. A key forensic task in these scenarios is to determine, from audio data recorded by the device itself, whether the resistive load encountered by the device is in a qualitatively “high” or “low” state (corresponding, respectively, to settings where the device is, or is not, delivering current to a human suspect). We demonstrate our proposed approach on a stylized version of this task, where our approach is employed to separate the audio corresponding to a nominally periodic (up to random timing jitter) and approximately known (up to the resistive load ambiguities) signal from otherwise unknown, but often highly structured, background audio. In forensics applications the separated audio signal could subsequently be used to classify the state of the resistive load, though here we demonstrate only the separation task, since accurate separation would facilitate accurate classification.

The remainder of this paper is organized as follows. In Section II we state our problem and describe our proposed approach in the context of related existing works, and a detailed description of our proposed method is provided in Section III. We provide an experimental performance evaluation motivated by the aforementioned audio forensics application in Section IV, and provide some concluding discussion and remarks in Section V.

## II Problem Statement and Related Works

Although the algorithm we develop here may be applied in any of a number of separation tasks, we use the stylized audio source separation example to fix notation and explain our approach. Let $x \in \mathbb{R}^n$ represent our observed data, and suppose that $x$ may be decomposed as a sum of two sources, one of which ($x_p$) exhibits local structure that is partially or approximately known, and the other ($x_b$) is unknown. In our motivating audio application, for example, $x$ is comprised of samples of an underlying continuous time waveform, and we consider $x_p$ to be samples of a source that is a nominally regular repetition of one of a small number of prototype signals. Our aim is to separate the sources $x_p$ and $x_b$ from observations of $x$, which may be noisy or otherwise corrupted.

Our proposed approach is based on the principle of local sparse approximations. In order to state our overall problem in generality, we describe an equivalent model for our data that facilitates the local analysis inherent to our approach. Let us suppose that $m$ is an integer that divides $n$ evenly, such that $n/m = q$, an integer. Then $x$ may be represented equivalently as a matrix $X \in \mathbb{R}^{m \times q}$:

$$X = X_p + X_b, \tag{1}$$

where $X_p$ is a matrix whose columns are non-overlapping length-$m$ segments of $x_p$, and similarly for $X_b$. The goal of our effort is, in essence, to separate $X$ into its constituent matrices $X_p$ and $X_b$.
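The block arrangement described above can be sketched in a few lines of NumPy. This is only an illustration: the helper names `to_segments` and `from_segments` are hypothetical, and the toy signal stands in for real audio samples.

```python
import numpy as np

def to_segments(x, m):
    """Arrange a length-n signal into an m x q matrix of non-overlapping length-m columns."""
    x = np.asarray(x, dtype=float)
    if x.size % m != 0:
        raise ValueError("segment length m must divide the signal length n")
    q = x.size // m
    # Column j holds samples j*m, ..., (j+1)*m - 1 of the signal.
    return x.reshape(q, m).T

def from_segments(X):
    """Inverse of to_segments: concatenate the columns back into one signal."""
    return X.T.reshape(-1)

x = np.arange(12.0)          # toy signal with n = 12
X = to_segments(x, m=4)      # m = 4, so q = 3
```

Because the segments do not overlap, the mapping is invertible, so separating the matrix into two components is equivalent to separating the original signal.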

As alluded to above, our separation approach entails leveraging local structure in each of the components of $X$. Our main contribution comes in the form of a procedure that, given our “partial” information about the columns of $X_p$, enables us to learn, in an online fashion and from the data itself, a dictionary $D_b$ such that columns of $X_b$ are accurately expressed as linear combinations of (a small number of) columns of $D_b$. In a broader sense, our work is related to some classical approximation approaches as well as several recent works on matrix decomposition. We briefly describe these background and related efforts in matrix decomposition here, in an effort to put our main contribution in context.

### II-A Related Works

#### II-A1 Low Rank and Robust Low Rank Approximation

Consider the model (1) and suppose that the columns of $X_p$ can each be represented as a linear combination of some $r$ linearly independent vectors, implying that $X_p$ is a matrix of rank $r$. Now, different separation techniques may be employed depending on our assumptions about $X_b$. Perhaps the simplest case is where the elements of $X_b$ are iid zero-mean Gaussian random variables; in this case, the problem amounts to a denoising problem, which can be solved using ideas from low-rank matrix approximation. In particular, it is well-known that the approximation obtained via the truncated (to rank $r$) singular value decomposition (SVD) of $X$ is a solution of the optimization

$$\widehat{X}_p = \arg\min_{L} \ \|X - L\|_F^2 \quad \text{s.t.} \quad \mathrm{rank}(L) \le r, \tag{2}$$

where $\mathrm{rank}(L)$ is the function that returns the rank of $L$, and the notation $\|\cdot\|_F^2$ denotes the squared Frobenius norm, which is the sum of squares of the elements of the matrix.
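As a brief illustration of the truncated-SVD estimate, the following sketch builds a synthetic rank-2 matrix, adds small Gaussian noise, and keeps the two leading singular components. The sizes, noise level, and the function name `truncated_svd` are assumptions made for the example only.

```python
import numpy as np

def truncated_svd(X, r):
    """Best rank-r approximation of X in Frobenius norm (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Keep the r leading singular triplets and discard the rest.
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

rng = np.random.default_rng(0)
L_true = rng.standard_normal((20, 2)) @ rng.standard_normal((2, 30))  # rank-2 component
X = L_true + 0.01 * rng.standard_normal((20, 30))                     # plus Gaussian noise
L_hat = truncated_svd(X, r=2)
rel_err = np.linalg.norm(L_hat - L_true) / np.linalg.norm(L_true)
```

With dense, small-amplitude noise the relative error stays small; the estimate degrades once the perturbation is sparse and large, which is the setting that motivates robust alternatives.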

It is well-known that certain (non-Gaussian) forms of interference may cause the accuracy of estimators of the low-rank component obtained via truncated SVD to degrade significantly. This is the case, for example, when $X_b$ is comprised of sparse large (in amplitude) impulsive noise, or contains a few columns that may be construed as outliers in the low-rank model for $X_p$. Numerous extensions of traditional PCA to these settings have been proposed in the literature; we mention here several recent efforts in robust PCA [13, 14] which model $X_b$ as a sparse matrix, and aim to simultaneously estimate both the low-rank $X_p$ and the sparse $X_b$ by solving the convex optimization

$$\{\widehat{X}_p, \widehat{X}_b\} = \arg\min_{L, S} \ \|L\|_* + \lambda \|S\|_1 \quad \text{s.t.} \quad X = L + S, \tag{3}$$

where $\lambda > 0$ is a regularization parameter. Here $\|L\|_*$ denotes the nuclear norm of $L$, which is the sum of the singular values of $L$. The nuclear norm is a convex relaxation of the non-convex rank function $\mathrm{rank}(L)$. The notation $\|S\|_1$ here denotes the sum of the absolute values of the entries of $S$ (essentially the $\ell_1$ norm of a vectorized version of $S$), which is a convex relaxation of the non-convex $\ell_0$ quasinorm that counts the number of nonzeros of $S$.
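The two convex penalties above, and the proximal operators that iterative solvers for such problems typically rely on, can be written compactly. The sketch below is not a robust PCA solver; it only shows the building blocks, and all function names are hypothetical.

```python
import numpy as np

def nuclear_norm(L):
    """Sum of the singular values of L."""
    return np.linalg.svd(L, compute_uv=False).sum()

def l1_norm(S):
    """Sum of the absolute values of the entries of S."""
    return np.abs(S).sum()

def soft_threshold(S, tau):
    """Proximal operator of tau * ||.||_1: shrink every entry toward zero by tau."""
    return np.sign(S) * np.maximum(np.abs(S) - tau, 0.0)

def singular_value_threshold(L, tau):
    """Proximal operator of tau * ||.||_*: soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(L, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

M = np.diag([3.0, 1.0])
shrunk = singular_value_threshold(M, 2.0)   # singular values (3, 1) become (1, 0)
```

Soft-thresholding entries promotes sparsity, while soft-thresholding singular values promotes low rank; alternating the two is the core of many augmented-Lagrangian schemes for this kind of decomposition.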

#### II-A2 Low Rank Plus Sparse in a Known Dictionary

A useful extension of the robust PCA approach arises in the case where $X_b$ is not itself sparse, but possesses a sparse representation in some known dictionary or basis. One example is the case where the background source is locally smooth, implying it can be sparsely represented using a few low-frequency discrete cosine transform or Fourier basis elements. Formally, suppose that for some known matrix $D_b$, we have that $X_b = D_b A_b$, where the columns of $A_b$ are sparse. The components of $X$ can be estimated by solving the following optimization [15]

$$\{\widehat{X}_p, \widehat{A}_b\} = \arg\min_{L, A_b} \ \|L\|_* + \lambda \|A_b\|_1 \quad \text{s.t.} \quad X = L + D_b A_b. \tag{4}$$

Note that an estimate of $X_b$ may be obtained directly as $\widehat{X}_b = D_b \widehat{A}_b$. This approach assumes (implicitly) a priori knowledge of a dictionary that sparsely represents the background signal, which may be a restrictive assumption in practice.

#### II-A3 Morphological Component Analysis

A more general model arises when $X_p$ is not low-rank, but instead, its columns are also sparsely represented in a known dictionary. Suppose that $X_p$ and $X_b$ are sparsely represented in some known dictionaries $D_p$ and $D_b$, such that $X_p = D_p A_p$ and $X_b = D_b A_b$, and that the columns of $A_p$ and $A_b$ are sparse. Such models were employed in recent work on Morphological Component Analysis (MCA) [16, 17, 18], which aimed to separate a signal into its component sources based on structural differences codified in the columns of the known dictionaries. The MCA decomposition can be accomplished by solving the following optimization

$$\{\widehat{A}_p, \widehat{A}_b\} = \arg\min_{A_p, A_b} \ \|X - D_p A_p - D_b A_b\|_F^2 + \lambda_p \|A_p\|_1 + \lambda_b \|A_b\|_1 \tag{5}$$

for some $\lambda_p, \lambda_b > 0$, where the estimates of $X_p$ and $X_b$ are formed as $\widehat{X}_p = D_p \widehat{A}_p$ and $\widehat{X}_b = D_b \widehat{A}_b$, respectively. When $A_p$ and $A_b$ are each comprised of a single column, this optimization is equivalent to the so-called Basis Pursuit (or more specifically, Basis Pursuit Denoising) technique [19], which formed a foundation of much of the recent work in sparse approximation. Note that this approach also assumes a priori knowledge of a dictionary that sparsely represents the background.
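To make the MCA idea concrete, the following sketch separates a single-column mixture of a few cosine atoms and a few spikes, using an orthonormal DCT basis and the identity as the two known dictionaries. It uses plain iterative soft-thresholding (ISTA) on an $\ell_1$-penalized least-squares objective with a factor of 1/2 on the quadratic term (a rescaling of the regularization parameter) and a single shared penalty weight. The solver choice, signal, and parameter values are assumptions for illustration, not the procedure used in the experiments.

```python
import numpy as np

def dct_basis(n):
    """Orthonormal DCT-II basis; column k is the k-th cosine atom."""
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    C[0, :] /= np.sqrt(2.0)
    return C.T

def soft(V, tau):
    """Entrywise soft-thresholding."""
    return np.sign(V) * np.maximum(np.abs(V) - tau, 0.0)

def mca_ista(x, D1, D2, lam, n_iter=3000):
    """Approximately minimize 0.5*||x - [D1 D2] a||^2 + lam*||a||_1 and split the result."""
    D = np.hstack([D1, D2])
    step = 1.0 / np.linalg.norm(D, 2) ** 2      # 1 / Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        a = soft(a - step * (D.T @ (D @ a - x)), step * lam)
    k1 = D1.shape[1]
    return D1 @ a[:k1], D2 @ a[k1:]

n = 128
D_dct, D_id = dct_basis(n), np.eye(n)
smooth = 5.0 * D_dct[:, 3] - 4.0 * D_dct[:, 10]   # sparse in the DCT dictionary
spikes = np.zeros(n)
spikes[20], spikes[45] = 6.0, -5.0                # sparse in the identity dictionary
smooth_hat, spikes_hat = mca_ista(smooth + spikes, D_dct, D_id, lam=0.1)
```

The separation works here because the two dictionaries are mutually incoherent: no cosine atom looks like a spike, so each component is cheap to represent only in its own dictionary.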

### II-B Our Contribution: “Semi-blind” Morphological Component Analysis

Our focus here is similar to the MCA approach above, but we assume only one of the dictionaries, say $D_p$, is known. In this case, the MCA approach becomes a semi-blind separation problem in which we also try to learn a dictionary $D_b$ to represent the unknown signal. Our main contribution comes in the form of a “Semi-Blind” MCA procedure, designed to solve the following modified form of the MCA decomposition

$$\{\widehat{A}_p, \widehat{A}_b, \widehat{D}_b\} = \arg\min_{A_p, A_b, D_b} \ \|X - D_p A_p - D_b A_b\|_F^2 + \lambda_p \|A_p\|_1 + \lambda_b \|A_b\|_1, \tag{6}$$

where the columns of the learned $D_b$ are constrained in some way (e.g., so that the norms of all columns are bounded by a fixed constant). This modeling approach forms the basis of the remainder of this paper.

## III Semi-blind MCA

As described above, our model assumes that the data matrix $X$ can be expressed as the superposition of two component matrices, $X_p$ and $X_b$. Further, we assume that each of the component matrices possesses a sparse representation in some dictionary, such that $X_p = D_p A_p$ and $X_b = D_b A_b$, where $D_p$ is known a priori. Our essential aim, then, is to identify an estimate $\widehat{A}_p$ of the coefficient matrix $A_p$ and estimates $\widehat{D}_b$ and $\widehat{A}_b$ of the matrices $D_b$ and $A_b$. Our estimates of the separated components are then given by $\widehat{X}_p = D_p \widehat{A}_p$ and $\widehat{X}_b = \widehat{D}_b \widehat{A}_b$.

We propose an approach to solve (6) that is based on alternating minimization, summarized here as Algorithm 1. Let $\lambda_1, \lambda_2, \lambda_3 > 0$ be user specified regularization parameters. Our initial estimate of the coefficients $\widehat{A}_p$, corresponding to the coefficients of $X_p$ in the known dictionary $D_p$, is obtained via a LASSO-type approach

$$\widehat{A}_p = \arg\min_{A_p} \ \|X - D_p A_p\|_F^2 + \lambda_1 \|A_p\|_1, \tag{7}$$

or other comparable sparse modeling approach, such as orthogonal matching pursuit (OMP) [20]. We then proceed in an iterative fashion, as outlined in the following subsections, for a few iterations or until some appropriate convergence criterion is satisfied. It should be noted that the lack of joint convexity makes it difficult to make global optimality claims for our proposed approach. In this sense, the overall performance may vary depending on the particular initialization strategy used.

### III-A Dictionary learning stage

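Given the estimate $\widehat{A}_p$, we can essentially “subtract” the current estimate of $X_p$ from $X$, and apply a dictionary learning step to identify estimates of the unknown dictionary $D_b$ and the corresponding coefficients $A_b$. In other words, we solve

$$\{\widehat{D}_b, \widehat{A}_b\} = \arg\min_{D_b, A_b} \ \|X - D_p \widehat{A}_p - D_b A_b\|_F^2 + \lambda_2 \|A_b\|_1, \tag{8}$$

with the columns of $D_b$ constrained as described for (6). Now, given the estimate $\widehat{D}_b$, we update our current estimate of the overall dictionary $\widehat{D} = [D_p \ \widehat{D}_b]$. We then update the overall coefficient matrix by solving another sparse approximation problem, as described next.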

### III-B Sparse approximation stage

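Given our current estimate $\widehat{D}$ of the overall dictionary, we update the corresponding coefficient matrices by solving the following LASSO-like problem:

$$\widehat{A} = \begin{bmatrix} \widehat{A}_p \\ \widehat{A}_b \end{bmatrix} = \arg\min_{A} \ \|X - \widehat{D} A\|_F^2 + \lambda_3 \|A\|_1. \tag{9}$$

Now, we extract the submatrix $\widehat{A}_p$ from $\widehat{A}$, and repeat the overall processing (beginning with the dictionary learning step). These steps are iterated until some appropriate convergence criterion is satisfied.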
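The alternation between the dictionary learning stage and the sparse approximation stage can be sketched as follows. This is a minimal illustration under several assumptions, not the implementation used for the reported results: the sparse coding steps use plain ISTA with a factor of 1/2 on the quadratic term, the dictionary update is a simple batch block-coordinate pass in the spirit of [12] rather than a truly online method, the background dictionary is initialized at random, and every function name, parameter name, and default value is hypothetical.

```python
import numpy as np

def soft(V, tau):
    """Entrywise soft-thresholding (tau may be a scalar or a per-row column vector)."""
    return np.sign(V) * np.maximum(np.abs(V) - tau, 0.0)

def lasso_ista(X, D, lam, n_iter=200):
    """Approximate min_A 0.5*||X - D A||_F^2 + (weighted) l1 penalty by ISTA, from A = 0."""
    L = np.linalg.norm(D, 2) ** 2                 # Lipschitz constant of the smooth term
    A = np.zeros((D.shape[1], X.shape[1]))
    for _ in range(n_iter):
        A = soft(A - D.T @ (D @ A - X) / L, lam / L)
    return A

def update_dictionary(R, A, D, n_pass=3, eps=1e-10):
    """Block-coordinate update of D for fixed codes A; columns stay in the unit ball."""
    G, B = A @ A.T, R @ A.T
    D = D.copy()
    for _ in range(n_pass):
        for j in range(D.shape[1]):
            if G[j, j] < eps:                     # atom unused by the current codes
                continue
            u = D[:, j] + (B[:, j] - D @ G[:, j]) / G[j, j]
            D[:, j] = u / max(np.linalg.norm(u), 1.0)
    return D

def semi_blind_mca(X, D_p, k_b, lam_p=0.1, lam_b=0.1, n_outer=5, n_dl=3, seed=0):
    rng = np.random.default_rng(seed)
    k_p = D_p.shape[1]
    D_b = rng.standard_normal((X.shape[0], k_b))
    D_b /= np.linalg.norm(D_b, axis=0)            # random unit-norm initial atoms
    lam = np.r_[np.full(k_p, lam_p), np.full(k_b, lam_b)][:, None]
    A_p = lasso_ista(X, D_p, lam_p)               # initialization, cf. (7)
    for _ in range(n_outer):
        R = X - D_p @ A_p                         # subtract the current estimate of X_p
        for _ in range(n_dl):                     # dictionary learning stage, cf. (8)
            A_b = lasso_ista(R, D_b, lam_b)
            D_b = update_dictionary(R, A_b, D_b)
        A = lasso_ista(X, np.hstack([D_p, D_b]), lam)   # sparse approximation stage, cf. (9)
        A_p, A_b = A[:k_p], A[k_p:]
    return D_p @ A_p, D_b @ A_b, D_b

# Toy data: a sparse combination of known atoms plus a sinusoidal background.
rng = np.random.default_rng(1)
m, q = 16, 60
D_p = np.eye(m)[:, :4]                            # toy "known" dictionary
X_p = D_p @ (rng.standard_normal((4, q)) * (rng.random((4, q)) < 0.3))
t = np.arange(m)[:, None]
X_b = np.sin(2 * np.pi * 0.1 * t + rng.uniform(0, 2 * np.pi, q))
X = X_p + X_b
Xp_hat, Xb_hat, D_b_hat = semi_blind_mca(X, D_p, k_b=6)
```

Because the overall problem is not jointly convex, the output depends on the initialization of the background dictionary and of the coefficients; the fixed iteration counts here stand in for a proper convergence check.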

## IV Experimental Evaluation: An Audio Forensics Source Separation Example

We demonstrate the performance of our approach on a stylized version of the audio separation task described in the introduction, which is motivated by forensic examination of audio obtained during law enforcement events where electroshock devices are utilized. For the sake of this example, we suppose that the electroshock devices discharge approximately times per second (a nominal period of ms), and the waveforms generated by the device during discharge take one of two different forms depending on the level of resistive load encountered by the device. The collected audio corresponds to the nominally periodic discharge of the device, superimposed with background noise. Our aim is to separate this superposition into its components.

Figure 1 shows a segment of the signals used in the simulation. We simulate the form of the nominally periodic signals ($x_p$), shown in Figure 1 (a), using two distinct exponentially decaying sinusoids, corresponding to the use of two series RLC circuits with different resistance parameters to model the loaded and open circuit states. We form the overall signal $x_p$ by concatenating a sequence of randomly-selected versions of these two prototype signals, each of which is subject to a few samples of random timing offset or jitter, in order to model the non-idealities of an actual electroshock device¹. A speech signal², shown in Figure 1 (b), was used to model the background source $x_b$. We simulate the overall raw audio data $x$ as a linear combination of $x_p$, $x_b$ and zero-mean random Gaussian noise (Figure 1 (c) depicts the ideal case). The data matrix $X$ is then formed from the signal $x$ as discussed in Section II, using non-overlapping segments with samples each.

¹ The timing offsets for each prototype signal were selected randomly from a collection of distinct values from a symmetric interval ms in duration centered at the nominal pulse occurrence time.

² Speech samples obtained from the VoxForge Speech Corpus: www.voxforge.org/home
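A signal of this general form can be generated as in the sketch below. Every numeric value (sampling rate, period, pulse length, frequencies, decay rates, jitter range, noise level) is a hypothetical placeholder rather than a setting from the experiments, and a low-frequency sinusoid stands in for the speech background.

```python
import numpy as np

def decaying_sinusoid(n_samples, fs, freq_hz, decay_rate):
    """Samples of exp(-decay_rate * t) * sin(2*pi*freq_hz*t), with t = k / fs."""
    t = np.arange(n_samples) / fs
    return np.exp(-decay_rate * t) * np.sin(2 * np.pi * freq_hz * t)

def jittered_pulse_train(prototypes, period, n_periods, max_jitter, rng):
    """Place one randomly chosen prototype per period, offset by a random jitter."""
    x = np.zeros(period * n_periods)
    labels = rng.integers(len(prototypes), size=n_periods)
    for i, lab in enumerate(labels):
        pulse = prototypes[lab]
        # Nominal onset is max_jitter samples into the period; jitter is symmetric about it.
        start = i * period + max_jitter + rng.integers(-max_jitter, max_jitter + 1)
        x[start:start + len(pulse)] = pulse
    return x, labels

rng = np.random.default_rng(0)
fs, period, n_periods = 8000, 400, 50                  # hypothetical values
protos = [decaying_sinusoid(100, fs, 1000.0, 300.0),   # stand-in for one load state
          decaying_sinusoid(100, fs, 1000.0, 900.0)]   # stand-in for the other
x_p, labels = jittered_pulse_train(protos, period, n_periods, max_jitter=5, rng=rng)
x_b = 0.5 * np.sin(2 * np.pi * 120.0 * np.arange(x_p.size) / fs)   # toy background
sigma = 0.01                                           # noise standard deviation
x = x_p + x_b + sigma * rng.standard_normal(x_p.size)
X = x.reshape(n_periods, period).T                     # one period per column
```

Returning the `labels` alongside the signal keeps the ground-truth prototype for each period available, which is what a downstream classifier of the load state would be scored against.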

Now, we form the dictionary $D_p$ by incorporating shifts of the nominal prototype pulses, corresponding to distinct shifts of each pulse, and employ the semi-blind MCA (SBMCA) approach (discussed in Section III) to separate the background audio from the approximately known periodic portion. To evaluate whether, and to what extent, our proposed dictionary learning step aids in the separation relative to a similar approach that uses a fixed dictionary/basis to represent the unknown background source, we compare our proposed approach with two versions of MCA which use fixed bases (the standard DCT basis or the identity basis) to form the dictionary $D_b$. We refer to these techniques as “MCA-DCT” and “MCA-Identity,” respectively³. Further, for this example we evaluate our approach relative to a “time-frequency masking” approach, which is a standard method in the audio source separation literature. Specifically, we compare our approach with an approach based on spectral separation using NNSC and NMF, and denote this approach as NNSC (Spectral)⁴.

³ We use the estimate $\widehat{X}_p$ obtained via the MCA-DCT procedure to initialize our approach, as follows: we apply one step of orthogonal matching pursuit (OMP) [20] on the estimate of $X_p$ obtained via MCA-DCT to form the initial (one component per column) estimate $\widehat{A}_p$ for the SBMCA algorithm.

⁴ Let $Y$ denote the matrix whose columns are the non-negative frequency-domain amplitude spectra of the corresponding columns of $X$. Owing to symmetry, we retain in $Y$ only the amplitudes corresponding to positive frequencies. Now, a classical NNSC-based source separation approach identifies $W$ and $H$ with nonnegative elements to minimize $\|Y - WH\|_F^2 + \lambda \|H\|_1$ (when $\lambda = 0$ this is just NMF). We reconstruct the time-domain estimate of $x_p$ using the spectral information from the two columns of $W$ having largest total contribution across rows of $H$, and corresponding phase information from the original mixture (the estimate of $x_b$ is obtained similarly, using the remaining columns of $W$).
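A dictionary of shifted prototype pulses can be assembled as in the following sketch. The prototype shapes, block length, and number of shifts are hypothetical, and normalizing each atom to unit norm is an assumption of this example rather than a detail stated above.

```python
import numpy as np

def shifted_pulse_dictionary(prototypes, m, shifts):
    """Stack unit-norm, zero-padded copies of each prototype at each integer shift."""
    atoms = []
    for pulse in prototypes:
        for s in shifts:
            atom = np.zeros(m)
            stop = min(m, s + len(pulse))
            atom[s:stop] = pulse[: stop - s]      # truncate pulses that pass the block end
            atoms.append(atom / np.linalg.norm(atom))
    return np.column_stack(atoms)

t = np.arange(20)
protos = [np.exp(-0.2 * t) * np.cos(0.9 * t),     # stand-in prototype 1
          np.exp(-0.5 * t) * np.cos(0.9 * t)]     # stand-in prototype 2
D_p = shifted_pulse_dictionary(protos, m=64, shifts=range(8))
```

Including one atom per admissible shift lets a jittered pulse be represented by a single coefficient, so the index of the active atom identifies both the prototype and its timing offset.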

Table I lists the best achievable reconstruction SNRs (in dB)⁵ for MCA-DCT, MCA-Identity and NNSC (Spectral), obtained by choosing the parameters which give the best performance for $x_p$ and $x_b$ separately. For the aforementioned methods, different parameters may have been utilized to obtain the reconstruction SNRs of each signal component, even for the same method and same noise level; in other words, the SNRs listed may not be jointly achievable from a single implementation of these procedures. However, in the case of SBMCA, the reported SNR values are achieved by clairvoyantly tuning the value(s) of the regularization parameters and the number of dictionary elements to give the lowest error via a single implementation. At any rate, the proposed approach outperforms each of the other approaches (MCA-based, as well as the classical spectral domain separation) in each of the settings examined, although in one setting its margin over MCA-DCT is small (17.14 dB versus 17.10 dB).

⁵ For a signal $x$ and its estimate $\widehat{x}$, the SNR is computed as $10 \log_{10}\left(\|x\|_2^2 / \|x - \widehat{x}\|_2^2\right)$.
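For reference, a reconstruction SNR in dB can be computed as below. The formula used, signal energy divided by error energy on a decibel scale, is the conventional definition and is assumed here; the function name is hypothetical.

```python
import numpy as np

def reconstruction_snr_db(x, x_hat):
    """10*log10(||x||^2 / ||x - x_hat||^2) for a signal x and its estimate x_hat."""
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    return 10.0 * np.log10(np.sum(x ** 2) / np.sum((x - x_hat) ** 2))

snr = reconstruction_snr_db([3.0, 4.0], [3.0, 3.5])   # 10*log10(25 / 0.25)
```

On this scale a 3 dB gain corresponds to roughly halving the error energy.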

A second, perhaps more illustrative, performance comparison is shown in Figure 2, which depicts the histogram of normalized (i.e., per-sample) errors per block, measured using a vector norm, for each method⁶. We observe from the distribution of errors across blocks that the SBMCA procedure (Figure 2 (a-d)) results in a larger number of blocks with lower errors as compared to the other approaches. This feature may be of primary importance in the motivating audio forensics application, where classifying each period of the nominally periodic signal $x_p$ as one of the two prototype signals is of interest.

⁶ Panels (a), (e), (i), and (m) represent the histogram of normalized error-per-block for $x_p$, and panels (b), (f), (j), and (n) represent the histogram of normalized error-per-block for $x_b$, via SBMCA, MCA-DCT, MCA-Identity, and NNSC (Spectral), respectively, at the first of the two noise levels considered. Panels (c), (g), (k), and (o) represent the histogram of normalized error-per-block for $x_p$, and panels (d), (h), (l), and (p) represent the histogram of normalized error-per-block for $x_b$, via SBMCA, MCA-DCT, MCA-Identity, and NNSC (Spectral), respectively, at the second noise level.
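The per-block error statistic behind such a histogram can be sketched as follows. The choice of norm is left as a parameter `p` because it is an assumption of this example, and the function name and toy matrices are hypothetical.

```python
import numpy as np

def per_block_errors(X, X_hat, p=2):
    """Per-sample error of each block: ||X[:, j] - X_hat[:, j]||_p divided by block length."""
    m = X.shape[0]
    return np.linalg.norm(X - X_hat, ord=p, axis=0) / m

X = np.zeros((4, 3))                 # three blocks of four samples each
X_hat = np.zeros((4, 3))
X_hat[:, 0] = 1.0                    # block 0: every sample off by 1
X_hat[0, 2] = 2.0                    # block 2: one sample off by 2
errs = per_block_errors(X, X_hat)
counts, edges = np.histogram(errs, bins=4)
```

Looking at the distribution over blocks, rather than a single aggregate SNR, shows whether errors are spread evenly or concentrated in a few badly reconstructed periods.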

Table I: Reconstruction SNRs (in dB).

| Noise | | | | |
|---|---|---|---|---|
| Method / Signal | | | | |
| SBMCA | 23.84 | 29.33 | 20.33 | 17.14 |
| MCA-DCT | 20.53 | 26.05 | 18.44 | 17.10 |
| MCA-Identity | 11.91 | 16.25 | 11.83 | 11.96 |
| NNSC (Spectral) | 8.89 | 13.12 | 6.57 | 11.77 |

Figure 2: Histograms of normalized error-per-block for each method. Rows: SBMCA, panels (a)-(d); MCA-DCT, panels (e)-(h); MCA-Identity, panels (i)-(l); NNSC (Spectral), panels (m)-(p).

## V Discussion and Conclusions

We conclude with a few additional comments related to our experimental demonstration, and more broadly, to the philosophy underlying our proposed model-based separation strategy. In the context of audio processing, spectral source separation approaches (based on variants of NNSC and NMF) remain among the most popular techniques. Indeed, prior efforts have examined semi-blind separation techniques that are qualitatively very similar to the approach proposed here, which aim to separate a source with known spectral content or properties from another unknown source, whose local spectral representation is learned online from the mixture data itself [21, 22]. Generally speaking, however, spectral-domain separation approaches find most utility in settings where the signals are, in effect, nearly orthogonal in the frequency domain, or at least in cases where the spectral overlap is minimal. Significant spectral overlap between the signals being separated is widely noted as a challenging setting for these traditional methods (e.g., as noted in [21]).

On the other hand, sparse modeling and dictionary learning approaches implicitly allow for separation on the basis of overcomplete representations of each source, and ultimately of the mixture overall. This additional modeling flexibility offers the promise that even signals with overlapping frequency domain amplitude spectra may still be separated, provided other appropriate notions of structure (sparsity in appropriate dictionaries) are employed. Further, such modeling approaches are not restricted to the frequency domain; separation in the time domain is a viable approach if that is the domain in which the structure of the sources is most easily modeled.

Overall, encouraged by the experimental investigation here, we feel that semi-blind “dictionary based” separation strategies may find applications in other domains (e.g., in image or video processing) provided the structure is modeled in the appropriate domains or representations. We defer the further examination of our approach in these other application domains to future efforts.

## References

- [1] A. Hyvärinen and E. Oja, “Independent component analysis: Algorithms and applications,” Neural Networks, vol. 13, no. 4-5, pp. 411–430, 2000.
- [2] I.T. Jolliffe, Principal Component Analysis, Springer Verlag, 1986.
- [3] D. D. Lee and H. S. Seung, “Learning the parts of objects by non-negative matrix factorization,” Nature, vol. 401, no. 6755, pp. 788–791, 1999.
- [4] P. O. Hoyer, “Non-negative sparse coding,” in Proc. IEEE Workshop on Neural Networks for Signal Processing, 2002, pp. 557–565.
- [5] P. O. Hoyer, “Non-negative matrix factorization with sparseness constraints,” Journal of Machine Learning Research, vol. 5, pp. 1457–1469, 2004.
- [6] G. J. Jang and T. W. Lee, “A Maximum Likelihood Approach to Single-Channel Source Separation,” Journal of Machine Learning Research, vol. 4, pp. 1365–1392, Dec. 2003.
- [7] T. P. Jung, S. Makeig, C. Humphries, T. W. Lee, M. J. McKeown, V. Iragui, and T. J. Sejnowski, “Removing Electroencephalographic Artifacts by Blind Source Separation,” Psychophysiology, vol. 37, pp. 163–178, 2000.
- [8] M. N. Schmidt and R. K. Olsson, “Single-channel Speech Separation using Sparse Non-negative Matrix Factorization,” International Conference on Spoken Language Processing (INTERSPEECH), 2006.
- [9] M.E. Davies and C.J. James, “Source Separation using Single Channel ICA,” Signal Processing, vol. 87, no. 8, pp. 1819 – 1832, 2007.
- [10] B. A. Olshausen and D. J. Field, “Sparse Coding with an Overcomplete Basis Set: A Strategy Employed by V1?,” Vision Research, vol. 37, no. 23, pp. 3311–3325, 1997.
- [11] M. Aharon, M. Elad, and A. Bruckstein, “K-SVD: Design of Dictionaries for Sparse Representation,” in Proceedings of SPARS'05, pp. 9–12, 2005.
- [12] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, “Online Learning for Matrix Factorization and Sparse Coding,” Journal of Machine Learning Research, vol. 11, pp. 19–60, 2010.
- [13] E. J. Candès, X. Li, Y. Ma, and J. Wright, “Robust Principal Component Analysis?,” Journal of the ACM, vol. 58, no. 3, pp. 11, 2011.
- [14] V. Chandrasekaran, S. Sanghavi, P. A. Parrilo, and A. S. Willsky, “Rank-Sparsity Incoherence for Matrix Decomposition,” Society for Industrial and Applied Mathematics (SIAM) Journal on Optimization, vol. 21, no. 2, pp. 572–596, 2011.
- [15] M. Mardani, G. Mateos, and G. B. Giannakis, “Recovery of Low-Rank Plus Compressed Sparse Matrices with Application to Unveiling Traffic Anomalies,” CoRR, 2012, Online: arXiv:1204.6537v1 [cs.IT].
- [16] J. L. Starck, M. Elad, and D. L. Donoho, “Image Decomposition via the Combination of Sparse Representations and a Variational Approach,” IEEE Transactions on Image Processing, vol. 14, no. 10, pp. 1570–1582, 2005.
- [17] D. L. Donoho and G. Kutyniok, “Microlocal Analysis of the Geometric Separation Problem,” CoRR, vol. abs/1004.3006, 2010.
- [18] J. Bobin, J. L. Starck, J. Fadili, Y. Moudden, and D. L. Donoho, “Morphological Component Analysis: An Adaptive Thresholding Strategy,” IEEE Transactions on Image Processing, vol. 16, no. 11, pp. 2675–2681, 2007.
- [19] S. Chen and D. Donoho, “Basis Pursuit,” in Conference Record of the Twenty-Eighth Asilomar Conference on Signals, Systems and Computers. IEEE, 1994, vol. 1, pp. 41–44.
- [20] Y. C. Pati, R. Rezaiifar, and P. S. Krishnaprasad, “Orthogonal Matching Pursuit: Recursive Function Approximation with Applications to Wavelet Decomposition,” in Proceedings of the 27th Annual Asilomar Conference on Signals, Systems and Computers, 1993, pp. 40–44.
- [21] P. Smaragdis, B. Raj, and M. Shashanka, “Supervised and semi-supervised separation of sounds from single-channel mixtures,” in Independent Component Analysis and Signal Separation, pp. 414–421. Springer, 2007.
- [22] Z. Duan, G. J. Mysore, and P. Smaragdis, “Online PLCA for real-time semi-supervised source separation,” in Proc. International Conference on Latent Variable Analysis / Independent Component Analysis, 2012.