Semi-Supervised Speech Enhancement in Envelope and Details Subspaces
Abstract
In this study, we propose a modulation subspace (MS) based single-channel speech enhancement framework, in which the spectrogram of noisy speech is decoupled as the product of a spectral envelope subspace and a spectral details subspace. This decoupling provides a way to specifically target the noises that most degrade intelligibility. Two supervised low-rank and sparse decomposition schemes are developed in the spectral envelope subspace to obtain a robust recovery of speech components. A Bayesian formulation of nonnegative factorization is used to learn the speech dictionary from the spectral envelope subspace of clean speech samples. In the spectral details subspace, a standard robust principal component analysis is implemented to extract the speech components. The validation results show that, compared with four speech enhancement algorithms, including MMSE-SPP, NMF-RPCA, RPCA, and LARC, the proposed MS-based algorithms achieve satisfactory performance in improving perceptual quality and, especially, speech intelligibility.
I Introduction
Speech enhancement is a common topic in speech processing and in the front end of speech recognition. In general, the goal of speech enhancement is to improve both perceived quality and intelligibility by reducing residual noise while minimizing speech distortion. Speech with higher quality is more comfortable to listen to, while higher intelligibility corresponds to lower word recognition error rates. Most existing speech enhancement algorithms achieve high speech quality but relatively poor performance on speech intelligibility [1].
Speech intelligibility is mainly governed by the vocal tract, whose contribution appears in the spectrum as the spectral envelope [2]. The fact that mel-frequency cepstrum coefficients (MFCC) derived from the spectral envelope are effective in automatic speech recognition [3] further demonstrates the importance of the spectral envelope for intelligibility improvement. In general, human speech production is an acoustic convolution in the time domain, which corresponds in the frequency domain to the vocal excitation (harmonics) being modulated by the vocal tract (formants):
(1) $\mathbf{S} = \mathbf{S}_E \circ \mathbf{S}_D$

where the envelope $\mathbf{S}_E$ as the 'carrier' modulates the fine structure $\mathbf{S}_D$, and $\circ$ is the Hadamard product. Equivalently, we can decompose the noisy speech spectrogram as $\mathbf{Y} = \mathbf{Y}_E \circ \mathbf{Y}_D$, where $\mathbf{Y}_E$ is the envelope matrix and $\mathbf{Y}_D$ is the details matrix. In the cepstrum domain, along the pseudo-frequency axis, the vocal tract, as the identity of the speech, is located in the low pseudo-frequency region, while the pitch is concentrated in the high pseudo-frequency components [4]. This physical modulation naturally translates into correlations in the time-frequency (TF) domain. Although several works take the inter-frame correlation into account, estimating the clean spectrogram by operating directly on the spectrogram provides little insight into the long-term correlation, which is generally reflected by the envelope. This may also explain why conventional spectrogram estimation sheds little light on intelligibility improvement. Therefore, decoupling the formants and pitch into different subspaces allows the correlations to be processed independently and, as a result, may provide a better approximation of the envelope matrix.
To recover the speech spectrogram, our approach is to extract the speech components (i.e., $\mathbf{S}_E$ and $\mathbf{S}_D$) from the two subspaces $\mathbf{Y}_E$ and $\mathbf{Y}_D$ separately. Considering that previous studies [5] show that supervised methods perform quite well in subspace speech enhancement, we propose a semi-supervised framework combining dictionary and non-dictionary based low-rank and sparse decomposition (LSD). In the envelope subspace $\mathbf{Y}_E$, driven by the motivation of improving intelligibility and the fact that speech bases may heavily overlap with the convex hull of noise bases, two specifically designed algorithms, referred to as two-layer LSD (TLSD-MS) and single-layer LSD (SLSD-MS), are introduced and compared to implement the speech extraction. An offline-trained speech envelope dictionary is utilized in both TLSD-MS and SLSD-MS. In $\mathbf{Y}_D$, a general unsupervised LSD is used to obtain the speech components $\mathbf{S}_D$. The spectrogram of the estimated speech is then obtained as the element-wise product of the two extracted submatrices.
I-A Related Work
Our proposed speech enhancement algorithm can be categorized as modulation subspace based semi-supervised LSD. Modulation domain based source separation technologies are mainly developed from the knowledge that the spectrogram of speech can be described as a time-varying weighted sum of component modulations [6]. By exploiting the intrinsic decomposition through well-posed convex optimization, low-rank and sparse analysis overcomes the high sensitivity of conventional principal component analysis (PCA) to large corruptions.
Low-Rank and Sparse Decomposition
The idea of applying LSD to separate speech from background noise derives from the intrinsic data structure of the noisy speech spectrogram [7]: background noise usually demonstrates low spectral diversity, whereas speech is more transient and variable. Specific constraints (e.g., masking threshold, noise rank, and block-wise restrictions) [8, 9] have been incorporated to optimize the decomposition. LSD has also been implemented in the wavelet packet transform domain [10, 11], in which the speech components are concentrated so as to be more sparse.
In many relevant cases, a single spectral model is insufficient to describe the speech signal, because speech with long-term repeated structure can also demonstrate low-rank characteristics as well as sparsity. The coexistence of low-rank and sparse properties in speech requires a more comprehensive constraint to reflect its spectral structure. Chen [12] utilized a modified robust PCA (RPCA) optimization function, in which an offline-trained speech spectral dictionary is employed and outlying entries are subjected to a minimal energy restriction. Duan introduced an online-learned dictionary to implement nonnegative spectrogram decomposition [5]. Yang proposed an LSD strategy that combines dictionaries for speech and noise [13]. Although supervised RPCA relying on a pre-learned dictionary is, to some extent, quite similar to dictionary-based nonnegative matrix factorization (NMF) [14] and sparse coding [15], it specifically imposes rank constraints on the background noise spectra and is more flexible and effective for nonstationary noise cancellation.
Modulation Domain based Source Separation
In a speech enhancement framework, the time and frequency modulations in the spectrogram are intuitively represented as correlations among neighboring spectral magnitudes. These correlations have frequently been employed as prior knowledge to improve either noise power estimation [16] or speech magnitude estimation [17]. Typically, by incorporating 1-D smoothing coefficients [18] or a 2-D averaging window [19] imposed on the spectrogram, significant improvements in speech quality can be achieved by taking the correlations into account.
Instead of locally introducing correlations derived from the modulations in the spectrogram, a more straightforward way is to decouple the modulation by transforming into the cepstral domain. By utilizing pseudo-frequencies, Deng [20] conducted a conditional minimum mean square error (MMSE) estimation in the cepstrum domain, and the results showed that it is a noise-robust feature selection approach. Breithaupt [21] proposed a smoothing scheme for the higher cepstral coefficients, in which recursive temporal smoothing is applied only to the fine spectral structure. Gerkmann enforced the statistical estimation in the temporal cepstral domain and obtained a more accurate speech presence probability estimate [22]. Veisi and Sameti introduced hidden Markov models in the mel-frequency domain [23], and the results indicated a significant improvement in noise cancellation. Different from cepstrum based algorithms, Paliwal [24, 25] proposed a frame-wise transformation along the time axis, and in the resulting modulation domain the clean speech is obtained with a conventional speech estimator.
I-B Method Overview
The proposed method of decomposing the spectrogram into two modulation subspaces has two major advantages: 1) the decoupling intrinsically incorporates the correlations existing in the speech spectrogram; 2) it strengthens the acoustic characteristics of the speech components in each subspace, making them more distinctive compared with the noise components. To obtain the two modulation subspaces, a cepstrum based modulation inverse (CMI) transform is applied. It first obtains the cepstrogram by applying an element-wise logarithm followed by the discrete Fourier transform (DFT); window functions are then used to separate the envelope and details subspaces in the cepstrum domain; finally, the inverse Fourier transform is applied to obtain the two modulation subspaces.
In each modulation subspace, LSD is implemented to extract the speech components. Since the spectral envelope subspace varies slowly, noise components in this subspace share more spectral bases with speech components than those in the spectral details subspace. Therefore, supervised LSD is implemented in the spectral envelope subspace, for which two decomposition strategies adapted to different types of noise are proposed. In the spectral details subspace, the speech components show a highly regular structure (i.e., fine structure), while the noise is assumed to be low-rank; a typical unsupervised RPCA method can therefore effectively extract the speech spectral details. Specifically, for unvoiced segments, the supervised LSD in the envelope subspace and the unsupervised LSD in the details subspace both work as efficiently at minimizing residual noise as general LSD approaches conducted in the spectrogram domain. Moreover, since the details subspace provides a more concentrated speech structure, it yields better results than general spectrogram decomposition. The implementation procedure is shown in Fig. 1.
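The CMI steps described above (element-wise logarithm, DFT along the frequency axis, cepstral masking, inverse transform, exponential) can be sketched in a few lines. The sketch below is our own minimal illustration, not the paper's implementation; the function name `cmi_decompose` and the cutoff argument `q_c` are our choices, and the mask is kept conjugate-symmetric so the two subspaces multiply back to the original spectrogram exactly.

```python
import numpy as np

def cmi_decompose(Y, q_c):
    """Split a magnitude spectrogram Y (freq x time) into envelope and
    details subspaces by masking the cepstrum along the frequency axis.
    q_c is the pseudo-frequency (quefrency) cutoff in bins (our notation)."""
    C = np.fft.fft(np.log(Y), axis=0)      # element-wise log, then DFT -> cepstrum
    n = Y.shape[0]
    mask_lo = np.zeros(n)
    mask_lo[:q_c] = 1.0                    # low-quefrency bins -> envelope
    mask_lo[n - q_c + 1:] = 1.0            # mirror bins, keeps result real-valued
    # inverse transform each masked cepstrum, then undo the logarithm
    Y_E = np.exp(np.real(np.fft.ifft(C * mask_lo[:, None], axis=0)))
    Y_D = np.exp(np.real(np.fft.ifft(C * (1.0 - mask_lo)[:, None], axis=0)))
    return Y_E, Y_D
```

Because the low and high masks sum to one in every cepstral bin, `Y_E * Y_D` reproduces `Y` exactly, mirroring the Hadamard-product model of the framework.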
I-C Contribution of Our Work
By decoupling the spectral envelope and spectral details subspaces, LSD is implemented in both subspaces in this study. The contributions of the proposed algorithms can be summarized as follows:

A unified acoustic-model based framework is proposed, which naturally inherits the correlations demonstrated in the speech spectrogram. In other words, decoupling the spectral envelope and details subspaces helps to independently reduce the distortions in these two uncorrelated subspaces.

New semi-supervised speech enhancement algorithms are proposed based on the two modulation subspaces, which provide robust features for dictionary learning.

Two different LSD schemes are developed to adapt to different background noises in the spectral envelope subspace.

The proposed algorithms provide highly efficient and robust solutions to single-channel speech enhancement, and the comprehensive evaluation results demonstrate significant improvements in speech quality, and particularly in intelligibility, compared with existing state-of-the-art algorithms.
The rest of the paper is organized as follows: the modulation subspace framework is presented in Section II. Section III describes the two proposed semi-supervised LSD algorithms. In Section IV, the experiments and results are presented for the developed approaches. Finally, Section V concludes the study.
II Modulation Subspace
For a speech signal corrupted by additive noise, i.e., $x(t) = s(t) + n(t)$, the squared magnitude spectrum is given as

(2) $|X(k,l)|^2 = |S(k,l) + N(k,l)|^2$

where $k$ and $l$ are the frequency and frame indices. By collecting $|X(k,l)|^2$, $|S(k,l)|^2$, and $|N(k,l)|^2$ into the matrices $\mathbf{Y}$, $\mathbf{S}$, and $\mathbf{N}$, it can further be presented as

(3) $\mathbf{Y} = \mathbf{S} + \mathbf{N}$

where $\mathbf{N}$ includes the speech-noise cross-term $2\,\mathrm{Re}\{S(k,l)N^{*}(k,l)\}$. Considering the spectral envelope and spectral details subspaces, the noisy speech model in the spectrogram domain is given as

(4) $\mathbf{Y} = \mathbf{Y}_E \circ \mathbf{Y}_D = (\mathbf{S}_E + \mathbf{N}_E) \circ (\mathbf{S}_D + \mathbf{N}_D)$

in which the noisy speech spectrogram matrix $\mathbf{Y}$ is the element-wise product of the spectral envelope matrix $\mathbf{Y}_E$ and the spectral details matrix $\mathbf{Y}_D$, the same as the definition in (1). $\mathbf{N}_E$ and $\mathbf{N}_D$ are the relative noise components in the two subspaces; note that $\mathbf{N}_E$ is not necessarily equal to $\mathbf{N}_D$. To conduct the decomposition in each subspace and extract $\mathbf{S}_E$ and $\mathbf{S}_D$, both subspaces $\mathbf{Y}_E$ and $\mathbf{Y}_D$ must be obtained from the noisy speech spectrogram $\mathbf{Y}$. For a clean speech spectrogram, applying window functions in the cepstral domain effectively separates the spectral envelope and details subspaces; however, noise greatly affects the boundaries of the two subspaces in the cepstral domain [26].
II-A Two Modulation Subspaces Decomposition
In this study, the proposed CMI is applied to obtain the two modulation subspaces. The Hilbert transform, a typical approach to demodulation, is also used as a comparison. Unlike the Hilbert transform, CMI requires no assumption on the speech details structure. In the CMI method, the spectral envelope and details matrices of the noisy speech can be written as:

(5) $\mathbf{Y}_E = \exp\big(\mathbf{F}^{-1}\mathbf{M}_L\,\mathbf{F}\log(\mathbf{Y})\big)$

(6) $\mathbf{Y}_D = \exp\big(\mathbf{F}^{-1}\mathbf{M}_H\,\mathbf{F}\log(\mathbf{Y})\big)$

in which $\exp(\cdot)$ and $\log(\cdot)$ are defined element-wise, and $\mathbf{F}$ refers to the DFT matrix of size $K$, where $K$ is the number of input signal data points. $\mathbf{M}_L$ and $\mathbf{M}_H$ are the mask matrices that select the low pseudo-frequency components and high pseudo-frequency components in the cepstrum domain, defined as $\mathbf{M}_L = \mathrm{diag}(\mathbf{m}_L)$ and $\mathbf{M}_H = \mathrm{diag}(\mathbf{m}_H)$ with $\mathbf{m}_L + \mathbf{m}_H = \mathbf{1}$, where $\mathbf{1}$ is a $K$-element column array with unity values. $\mathbf{m}_L$ and $\mathbf{m}_H$ are $K$-element column arrays, given as
(7) $m_L(q) = \begin{cases} 1, & q \le q_c \ \text{or} \ q \ge K - q_c \\ 0, & \text{otherwise} \end{cases}$

where the index $q$ is the pseudo-frequency in the cepstrum domain and $q_c$ is the cutoff; the frequency corresponding to bin $q$ is $f_s/q$, where $f_s$ is the sampling frequency.
The subspace matrices obtained by the two approaches are shown in Fig. 2. In the spectral details subspace, the Hilbert transform clearly produces a considerably irregular speech distribution when the speech deviates from an ideal sinusoidal model. In comparison, the proposed CMI obtains a periodic alignment of the speech components.
In CMI, the masking matrix decides the energy distribution between the two subspaces. To effectively extract the speech components from the two subspaces, the cutoff should be optimized to achieve a noise trade-off between them. As shown in Fig. 3, the two peak magnitudes in the cepstrum domain corresponding to the spectral envelope and the spectral details (i.e., fine structure) are located in the low and high pseudo-frequency regions (as marked in Fig. 3), respectively. Different types of noise translate into different distributions in the cepstral domain: babble noise, with its speech-like structure, can produce 'fake' peaks, while steady Gaussian noise, with a flat spectral envelope, presents a strong low pseudo-frequency peak.
Along the pseudo-frequency axis, the cutoff determines the SNR in each subspace. According to Parseval's theorem, the SNR can be calculated in the cepstral domain instead of the TF domain, as the ratio of speech energy to noise energy within the retained pseudo-frequency band. Suppose the speech components are ideally concentrated in two narrow bands (as shown in Fig. 3), centered at the envelope peak and the fine-structure peak with their respective bandwidths. To ensure that the major energy of the envelope and the fine structure can be separated, the cutoff should lie between the two bands. Moreover, as the cutoff increases, more noise energy is retained in the envelope subspace, and its SNR decreases. Thus, to obtain a higher SNR, the cutoff should be as small as possible, yet large enough to maintain the spectral envelope subspace energy. Typically, the vocal range (i.e., the fundamental frequency) of human speech is about 85 Hz to 300 Hz [6]. Accordingly, the lower boundary of the details subspace should include the 300 Hz fundamental frequency. Hence, we have the relationship
(8) 
The sampling frequency is 16 kHz. In practice, when the cutoff is selected in the range of 25-35 cycles, the results are comparable. In our study, it is set to 30.
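As a small numerical check of the cutoff choice, the usual cepstral convention maps bin $q$ to a periodic spectral component of $f_s/q$ Hz; this mapping and the names below are our reading of the section, not the paper's code.

```python
fs = 16000            # sampling frequency (Hz), as in the paper's setup
f0_max = 300          # upper end of the typical fundamental-frequency range (Hz)

def quefrency_of(freq, fs=16000):
    """Cepstral bin index corresponding to a periodic component at `freq` Hz
    (our reading of the quefrency-frequency mapping)."""
    return fs / freq

# The highest expected pitch sits near bin fs/300 ~ 53, so a cutoff of
# 25-35 (the paper uses 30) keeps all pitch information in the details
# subspace while retaining the envelope energy below it.
q_c = 30
assert q_c < quefrency_of(f0_max)
```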
II-B Low-Rank and Sparse Characteristics of the Two Subspaces
In the last section, we demonstrated that different cutoff choices lead CMI to different spectral subspaces. When the cutoff varies between the envelope peak and the details peak along the pseudo-frequency axis in the cepstrum domain, the SNRs of the two subspaces change. With a given cutoff within the range in (8), the CMI procedure is equivalent to a transform by a circulant matrix, and the entries of the envelope subspace matrix are given by

(9)

Similarly, for the details subspace,

(10)

The clean speech subspace matrices take the same form as (9) and (10), with the clean spectrogram used in place of the noisy one. Strictly proving that one subspace matrix has a higher rank than another is almost impossible. However, because small values in these matrices have no significant impact on the recovery of speech components, we can focus on the principal components that are most influential for noise cancellation. Singular value decomposition (SVD) is therefore utilized to demonstrate the approximate rank properties. The numerical results are shown in Fig. 4.
The experimental results clearly demonstrate that the singular values of the noise matrices (i.e., $\mathbf{N}_E$ and $\mathbf{N}_D$) decrease faster than those of the speech matrices (i.e., $\mathbf{S}_E$ and $\mathbf{S}_D$). After cutting off small singular values by thresholding, both noise matrices show lower rank than the corresponding speech matrices. This conclusion further justifies the implementation of low-rank decomposition.
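The thresholded-SVD rank comparison above can be illustrated with a toy experiment; `effective_rank`, the threshold value, and the synthetic matrices are our own choices, standing in for the subspace matrices of Fig. 4.

```python
import numpy as np

def effective_rank(X, tau=0.01):
    """Number of singular values above tau times the largest one
    (the thresholding idea described in the text; tau is our choice)."""
    s = np.linalg.svd(X, compute_uv=False)
    return int(np.sum(s > tau * s[0]))

rng = np.random.RandomState(1)
# 'noise-like': a few spectral shapes repeated over time -> low rank
noise_like = rng.rand(64, 4) @ rng.rand(4, 100)
# 'speech-like': rapidly changing content -> many significant singular values
speech_like = rng.rand(64, 100)
```

On such examples the noise-like matrix keeps at most 4 significant singular values while the speech-like one keeps many more, matching the qualitative behavior reported for the two subspaces.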
From the sparsity perspective, speech components are generally more spectrally diverse than noise components [12]. This conclusion also applies in the two proposed modulation subspaces. The spectral envelope subspace reflects the low pseudo-frequency components and is apparently 'smoother' than the spectral details subspace. Such 'smoothness' can be regarded as less spectral basis diversity. Specifically, both $\mathbf{S}_E$ and $\mathbf{N}_E$ lie in this 'smooth' subspace, which means their spectral convex hulls overlap with a high ratio. This intuitive assumption is evident in Fig. 5, where the speech and noise components are projected onto the principal axes (i.e., eigenvectors). The principal directions are extracted from the clean speech spectra, and the 3 largest and the 3 subsequent eigenvectors are used as 3-dimensional support bases displayed in the same space. As shown in Fig. 5(a)(c), in the spectral envelope subspace the speech and noise bases are severely overlapped, whereas in the spectral details subspace the noise bases are concentrated and easily separated from the speech bases, as shown in Fig. 5(b)(d). Therefore, it is easier to separate speech components from noise components in the spectral details domain than in the spectral envelope domain.
Based on the energy distribution in the cepstrum domain, as discussed for the subspace SNRs in II-A, the spectral envelope subspace demonstrates a higher SNR than the spectral details subspace. The heavily overlapped spectral bases of the speech and noise components in the spectral envelope subspace require a supervised decomposition approach. In contrast, in the spectral details subspace, the speech components can be considered as a summation of several harmonics within narrow frequency bands, while the noise components in this subspace are generally random; as a result, a general RPCA based decomposition can be used there to separate the speech and noise components.
II-C Noise in the Spectral Envelope Subspace
As discussed in II-B, noise components share more bases with speech components in the spectral envelope subspace than in the spectral details subspace. Therefore, we further investigated how the low-rank and sparse characteristics of different types of noise affect the separation of speech and noise components in the spectral envelope subspace. In this study, 25 noise samples (as listed in Table I) obtained from several databases, including NOISEX-92, the IEEE database, and NOIZEUS, are used. The spectrograms of these noise samples are decomposed into the spectral envelope subspace, as shown in Fig. 6 (left), and denoted $\mathbf{N}_E$. For each noise sample, $\mathbf{N}_E$ has an approximation consisting of linear combinations of speech bases in the spectral envelope subspace. This projection is given as a nonnegative least squares optimization
(11) $\min_{\mathbf{A} \ge 0} \left\| \mathbf{N}_E - \mathbf{D}\mathbf{A} \right\|_F^2$
where the activation matrix $\mathbf{A}$ (as shown in Fig. 6 (right)) is a linear transformation of the noise matrix into the space of the speech dictionary $\mathbf{D}$. Both $\mathbf{N}_E$ and $\mathbf{A}$ show low-rank and sparse characteristics. This means that when the LSD is conducted, there is a trade-off in distributing the two parts of the noise components into the speech subspace.
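The projection onto the dictionary is a nonnegative least squares problem; a minimal numpy sketch using multiplicative updates (our choice of solver, with hypothetical names) is:

```python
import numpy as np

def nnls_project(D, N, n_iter=1000, eps=1e-12):
    """Approximate argmin_{A >= 0} ||N - D A||_F^2 by multiplicative
    updates (a simple stand-in for a dedicated NNLS solver).
    D and N are assumed entrywise nonnegative."""
    A = np.full((D.shape[1], N.shape[1]), 0.5)   # positive initialization
    for _ in range(n_iter):
        # Lee-Seung style update for the Euclidean cost with D fixed
        A *= (D.T @ N) / (D.T @ D @ A + eps)
    return A
```

The update keeps every entry of the activation matrix nonnegative by construction, which is why it is a convenient sketch for this projection.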
To quantitatively describe the impact of different noises on speech enhancement, two indices, referred to as the coherence ratio and the sparsity-to-low-rank ratio (SLR), are proposed to explain the general selection criteria for the decomposition algorithm and parameters. The coherence ratio measures the coherence between the noise components and the speech components, and is given as
(12) $C = \max_{i,j} \dfrac{|\mathbf{d}_i^{T}\mathbf{n}_j|}{\|\mathbf{d}_i\|_2 \,\|\mathbf{n}_j\|_2}$
where $\mathbf{d}_i$ denotes the $i$th column of $\mathbf{D}$ and $\mathbf{n}_j$ is the $j$th column of $\mathbf{N}_E$. Two matrices are considered less coherent if $C$ is small. The normalized $C$ values of the 25 noise samples are summarized in Table I, in which babble noise is strongly coherent with the speech dictionary ($C$ value of 0.85), and white Gaussian noise is much less coherent ($C$ value of 0.35). In order to minimize the errors caused by incorrect distribution of noise components, the SLR $R$ is proposed for a better understanding of a noise's energy distribution over low-rank and sparse components. Accordingly, $R$ is described as
(13) $R = \dfrac{\|\mathbf{X}\|_1}{\|\mathbf{X}\|_*}$
where the $\ell_1$ norm is defined as $\|\mathbf{X}\|_1 = \sum_{i,j} |X_{ij}|$, and the nuclear norm is $\|\mathbf{X}\|_* = \sum_i \sigma_i$, in which $\sigma_i$ is the $i$th singular value. We calculate the $R$ values of the noise samples in both the spectral envelope subspace and the activation matrix $\mathbf{A}$, corresponding to $R_E$ and $R_A$, respectively.
The normalized $R$ values (as summarized in Table I), ranging from 0 to 1, indicate which part of the matrix dominates in an RPCA based decomposition. $R_E$ represents the $R$ value in the envelope subspace, and $R_A$ represents the $R$ value in the speech dictionary space. Different types of noise have different $R$ values, which indicate the degree to which they can be decomposed into the speech subspace. For instance, Gaussian noise shows the highest sparsity ($R_E$ = 1) in the spectral envelope subspace and a considerably low rank ($R_A$ = 0.06) in the speech dictionary space. In contrast, Volvo noise has the highest sparsity ($R_A$ = 1) in the speech dictionary space, but is relatively low-rank ($R_E$ = 0.39) in the spectral envelope subspace. Generally, under the assumption that speech is sparse, different decomposition strategies can be proposed to conform to the SLR. To develop a robust LSD, considering the $R$ value of the noise can facilitate the separation of speech and noise components in the spectral envelope subspace.
Noise  C  R_E  R_A
Car 60mph  0.57  0.31  0.34 
Cafeteria babble  0.82  0.54  0.07 
Babble  0.85  0.55  0.11 
Construction Crane  0.81  0.54  0.02 
Inside Flight  0.53  0.44  0.66 
Street Downtown  0.79  0.56  0.13 
Street  0.81  0.56  0.31 
Construction Drilling  0.86  0.58  0.21 
F16  0.83  0.92  0.44 
Inside Train 1  0.84  0.53  0.15 
Inside Train 2  0.68  0.22  0.39 
Train1  0.40  0.75  0.55 
Train2  0.22  0.62  0.52 
PC Fan  0.27  0.31  0.84 
SSN  1  0.69  0.22 
Volvo  0.01  0.39  1 
Water Cooler  0.86  0.56  0.30 
Machinegun  0.52  0.01  0.01 
Leopard  0.36  0.31  0.81 
Subway  0.84  0.85  0.26 
Airport  0.77  0.47  0.16 
Restaurant  0.85  0.54  0.16 
Exhibition  0.74  0.89  0.21 
White Gaussian  0.35  1  0.06 
Pink  0.85  0.87  0.23 

Noise sources: IEEE database; NOISEX-92; NOIZEUS; simulated.
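The two indices can be computed directly from their definitions; the exact formulas below follow our reading of the coherence ratio (largest normalized column inner product) and the SLR (entrywise $\ell_1$ norm over nuclear norm).

```python
import numpy as np

def coherence(D, N):
    """Largest normalized inner product between columns of the dictionary D
    and columns of the noise-envelope matrix N (our reading of Eq. (12))."""
    Dn = D / np.linalg.norm(D, axis=0)
    Nn = N / np.linalg.norm(N, axis=0)
    return float(np.max(np.abs(Dn.T @ Nn)))

def slr(X):
    """Sparsity-to-low-rank ratio: entrywise l1 norm over nuclear norm
    (our reading of Eq. (13))."""
    return float(np.abs(X).sum() / np.linalg.svd(X, compute_uv=False).sum())
```

Sanity checks match the intuition behind the indices: a noise matrix built from dictionary columns is maximally coherent (value 1), the identity matrix has SLR 1, and a dense rank-1 matrix of all ones has SLR equal to its side length, i.e., high "sparse-energy" relative to its rank.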
III Semi-Supervised Decomposition in Modulation Subspace
In this study, two different supervised LSD schemes are proposed in the spectral envelope subspace, both applied to obtain a robust recovery of the speech components. In the spectral details subspace, a standard unsupervised RPCA is implemented.
III-A Model Description
In both the spectral envelope and spectral details subspaces, LSD is conducted. A basic RPCA model for both subspaces can be written as

(14) $\min_{\mathbf{L}, \mathbf{S}} \|\mathbf{L}\|_* + \lambda \|\mathbf{S}\|_1 \quad \text{s.t.} \quad \mathbf{Y}_\ast = \mathbf{L} + \mathbf{S}$

where $\ast \in \{E, D\}$ refers to the spectral envelope subspace and the spectral details subspace, respectively. As discussed in Section II, noise components share more bases with speech components in the spectral envelope subspace than in the spectral details subspace. In other words, it is harder to distinguish noise from speech in the spectral envelope subspace without prior information. Therefore, an offline-trained speech dictionary in the spectral envelope subspace is employed to separate $\mathbf{S}_E$ from $\mathbf{N}_E$.
Spectral envelope subspace model
In a supervised decomposition, the speech component is commonly regarded as a sparse activation of a global speech dictionary. The underlying reason is that the speech components in each local segment can be expressed by a few bases within a global speech dictionary space. A sparsity constraint can be enforced on the activation matrix to extract the speech components. Meanwhile, the low-rank constraint implemented in the spectral envelope subspace can effectively pick out the noise components that are not within the speech dictionary space, i.e., that are incoherent with the speech components. However, the noise components that are coherent with the speech dictionary bases may not be excluded from the speech components. As a result, the activation matrix of the speech dictionary is mixed with some outlying entries, which represent the noise approximation. As discussed in Section II, as the SLR varies, the noise components show different probabilities of being decomposed into the speech subspace. Therefore, we propose two different supervised decompositions in the spectral envelope subspace, referred to as two-layer LSD (TLSD-MS) and single-layer LSD (SLSD-MS). Both decompositions are based on the consideration of reducing the incoherent noise components.
The proposed TLSD-MS is straightforward for incoherent noise cancellation: after a first-layer LSD in the spectral envelope subspace, a second-layer LSD is implemented on the activation matrix of the speech dictionary. The first-layer LSD can be written as
(15) $\min_{\mathbf{L}, \mathbf{A}} \|\mathbf{L}\|_* + \lambda \|\mathbf{A}\|_1 \quad \text{s.t.} \quad \mathbf{Y}_E = \mathbf{L} + \mathbf{D}\mathbf{A}$
where the spectral envelope matrix $\mathbf{Y}_E$ is decomposed as the sum of a low-rank matrix $\mathbf{L}$, reflecting the less spectrally diverse noise components, and the product of the speech dictionary $\mathbf{D}$ and its sparse activation matrix $\mathbf{A}$ [27]. The speech component $\mathbf{D}\mathbf{A}$ either retains some noise residuals or becomes highly distorted as the noise constraint coefficient $\lambda$ is relaxed or tightened. A natural remedy is to conduct a second-layer LSD on the obtained $\mathbf{A}$. Accordingly, this translates into the following optimization problem
(16) $\min_{\mathbf{A}_s, \mathbf{A}_n} \|\mathbf{A}_n\|_* + \lambda \|\mathbf{A}_s\|_1 \quad \text{s.t.} \quad \mathbf{A} = \mathbf{A}_s + \mathbf{A}_n$
where $\mathbf{A}_n$ is the activation matrix of the noise components in the speech dictionary space. The rationale of (16) is that even in the activation matrix, the speech components are still more diverse than the noise components, as explained in Fig. 5(a) and (c). TLSD-MS is quite efficient when the noise components have a small $R_E$ and a large $R_A$ (e.g., the Volvo noise listed in Table I).
When the background noise shows a large $R_E$, the first layer of TLSD-MS may decompose the noise components into the speech term with high probability. Moreover, the second layer of TLSD-MS only works well for noises with a high $R_A$ in the speech dictionary space. Hence, another decomposition scheme, SLSD-MS, is proposed as
(17) $\min_{\mathbf{L}, \mathbf{A}} \|\mathbf{L}\|_{*,p} + \lambda \|\mathbf{A}\|_1 \quad \text{s.t.} \quad \mathbf{Y}_E = \mathbf{L} + \mathbf{D}\mathbf{A}$
where $\mathbf{A}$ and $\mathbf{L}$ have the same definitions as in TLSD-MS, reflecting the coherent and incoherent noise components, respectively. $\|\cdot\|_{*,p}$ is a modified nuclear norm defined through a mapping of the singular values, as in [28]. The conventional nuclear norm over-penalizes large singular values and consequently may find only a biased solution. In the p-type norm, when $0 < p < 1$, a tighter rank approximation can be obtained. Specifically, this tight-rank approximation can treat each singular value equally in the optimization.
Spectral details subspace model
In the spectral details subspace, the typical RPCA decomposition is given by
(18) $\min_{\mathbf{L}_D, \mathbf{S}_D} \|\mathbf{L}_D\|_* + \lambda \|\mathbf{S}_D\|_1 \quad \text{s.t.} \quad \mathbf{Y}_D = \mathbf{L}_D + \mathbf{S}_D$
where $\mathbf{S}_D$ is considered as the approximation to the speech details and $\mathbf{L}_D$ corresponds to the noise. This unsupervised LSD can effectively separate the speech components from the noise components in the spectral details subspace. Because the speech components in the spectral details subspace show periodic structure (i.e., the fine structure shown in Fig. 2(b)), the conventional RPCA works well for speech extraction in this subspace.
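The RPCA decomposition used in this subspace can be sketched with the standard inexact augmented Lagrangian iteration: singular value thresholding for the low-rank part and element-wise soft thresholding for the sparse part. This is an illustrative implementation under common default parameters, not the authors' code.

```python
import numpy as np

def rpca(Y, lam=None, rho=1.4, n_iter=100):
    """Robust PCA by inexact ALM: decompose Y = L + S with L low-rank and
    S sparse. A compact sketch of the standard solver."""
    m, n = Y.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))  # usual default
    shrink = lambda X, t: np.sign(X) * np.maximum(np.abs(X) - t, 0.0)
    mu = 1.25 / np.linalg.svd(Y, compute_uv=False)[0]           # initial penalty
    L = np.zeros_like(Y); S = np.zeros_like(Y); Z = np.zeros_like(Y)
    for _ in range(n_iter):
        # low-rank update: singular value thresholding
        U, s, Vt = np.linalg.svd(Y - S + Z / mu, full_matrices=False)
        L = (U * shrink(s, 1.0 / mu)) @ Vt
        # sparse update: element-wise soft thresholding
        S = shrink(Y - L + Z / mu, lam / mu)
        Z += mu * (Y - L - S)          # dual ascent on the constraint
        mu = min(mu * rho, 1e7)        # adaptive penalty growth
    return L, S
```

On a synthetic mixture of a low-rank matrix and a few large sparse outliers, this iteration recovers both parts to high accuracy, which is the property the details-subspace decomposition relies on.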
III-B Dictionary Learning
NMF is a reliable method for obtaining a speech dictionary [29], and it can be given as
(19) $\min_{\mathbf{D} \ge 0,\, \mathbf{A} \ge 0} \; D_c(\mathbf{Y} \,\|\, \mathbf{D}\mathbf{A}) + \mu\, g(\mathbf{D}, \mathbf{A})$
where $D_c(\cdot\|\cdot)$ is a cost function, $g(\cdot)$ is an optional regularization term, and $\mu$ is the regularization weight. The minimization of (19) is performed under the nonnegativity constraints on $\mathbf{D}$ and $\mathbf{A}$. Commonly used cost functions include the Euclidean distance, the Bregman divergence, and the negative likelihood in probabilistic NMFs. In this study, a Bayesian NMF is used to learn the speech dictionary from the spectral envelope subspace of clean speech samples. Accordingly, the input matrix is assumed to be stochastic. To perform the NMF $\mathbf{Y} \approx \mathbf{D}\mathbf{A}$, the following model is considered:
(20) $Y_{ij} = \sum_k Z_{ikj}, \qquad Z_{ikj} \sim \mathrm{Po}(z;\, D_{ik}A_{kj}), \qquad \mathrm{Po}(z;\theta) = \theta^{z} e^{-\theta} / z!$
where $Z_{ikj}$ are latent variables, $\mathrm{Po}(\cdot)$ denotes the Poisson distribution, and $z!$ is the factorial of $z$. Before conducting the dictionary learning procedure proposed in [14], the speech samples are trimmed according to the syllable boundaries in the time domain. In this way, spectral interference from the inter-utterance intervals, whose structure is quite similar to noise, is excluded.
In practice, three approaches (i.e., least angle regression with coherence (LARC), efficient sparse coding [30], and Bayesian NMF) were applied to learn the clean speech dictionary in the spectral envelope subspace. Among the dictionaries learned by the three approaches, the Bayesian NMF based dictionary showed the best speech enhancement results. Therefore, Bayesian NMF was selected as the dictionary learning method in this study.
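The full Bayesian treatment is beyond a short example, but the maximum-likelihood counterpart of the Poisson model, NMF minimizing the generalized Kullback-Leibler divergence with multiplicative updates, can stand in as a sketch of the factorization step (function name and settings are ours, not the paper's).

```python
import numpy as np

def kl_nmf(Y, k, n_iter=300, eps=1e-9, seed=0):
    """NMF Y ~ D @ A by multiplicative updates for the generalized KL
    divergence, the maximum-likelihood counterpart of a Poisson
    observation model (a simplified stand-in for Bayesian NMF)."""
    rng = np.random.RandomState(seed)
    m, n = Y.shape
    D = rng.rand(m, k) + eps
    A = rng.rand(k, n) + eps
    for _ in range(n_iter):
        DA = D @ A + eps
        # update activations, then bases (standard Lee-Seung KL updates)
        A *= (D.T @ (Y / DA)) / (D.sum(axis=0)[:, None] + eps)
        DA = D @ A + eps
        D *= ((Y / DA) @ A.T) / (A.sum(axis=1)[None, :] + eps)
    return D, A
```

After training on clean envelope spectra, the columns of `D` would play the role of the speech dictionary bases used in (15) and (17).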
III-C Algorithms
To solve the two-layer optimization problem in (15) and (16), the augmented Lagrangian method (ALM) is employed. For the first layer, the optimization solution is given as
(21) 
where an auxiliary variable is introduced and assumed to be equal to the dictionary term $\mathbf{D}\mathbf{A}$. After obtaining the activation matrix, a standard RPCA based decomposition is implemented in the second layer
(22) 
The proposed SLSD-MS presented in (17) can be solved by the alternating direction method of multipliers (ADMM) or ALM. However, both methods require introducing two auxiliary variables to solve (17), and expensive matrix inversions are required in each iteration. Accordingly, a recently developed method called the linearized alternating direction method with adaptive penalty (LADMAP) is applied in this study. The augmented Lagrangian function is
(23) 
where the introduced extra variable is assumed to be equal to the low-rank term $\mathbf{L}$. Specifically, to update $\mathbf{L}$, a subproblem is proposed as
(24) 
Here we extend the method proposed in [28] to the LADMAP approach. Therefore, the conventional singular value thresholding operator can be redefined as
(25) 
where the factorization is any singular value decomposition. Considering that the thresholding value includes the singular value itself, an iterative approach is applied to yield the converged solution. Accordingly, the shrinkage operator has a closed-form solution at the $k$th inner iteration
(26) 
As a relaxation of the $\ell_0$ norm, the $\ell_1$ norm is the summation of the absolute values of all the entries. It may therefore suppress speech components, since speech components generally present higher intensity in the activation matrix than noise. To tighten the sparsity constraint, a special strategy with respect to energy concentration is proposed for (21)-(27), given as
(28) 
where an energy threshold value is set to concentrate the decomposition. The introduced energy threshold acts as a penalty on the sparsity relaxation and also helps to distinguish the speech components from the noise components in the activation matrix.
Generally, (21), (22), and (27) can be solved by ALM, and (23) is solved by LADMAP. With some algebra, the corresponding updating schemes of TLSD-MS and SLSD-MS are outlined in Algorithm 1 and Algorithm 2, respectively.
Input: noisy speech envelope matrix Y, offline-trained dictionary, parameters, and tolerances.
Initialize: set maxIter and Terminate = False; initialize all variables to zero.
Output: optimal activation coefficient matrix
Input: noisy speech envelop matrix Y and offline-trained dictionary , parameters and , , , and .
Initialize: Set maxIter, tolerance , and Terminate = False. Initialize , , , , and to zero.
Output: optimal activation coefficient matrix
 denotes the shrinkage operator , and denotes the singular value thresholding operator given by , where [31]. The procedure of Algorithm 1 can be implemented for both (22) and (27), in which and are updated as , and and are updated as . In Algorithm 2, is the modified singular value thresholding operator defined in (25), in which the shrinkage operator is applied in an inner iteration until convergence. The iteration limit 'maxIter' is set to 500 in our numerical experiments. The parameter is set to 1.2 to update the thresholding value . In the singular value threshold , is an SNR-related constant.
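As a sketch of the standard RPCA decomposition used in the spectral details subspace (the second layer in (22)), the widely used inexact ALM iteration [27] alternates singular value thresholding for the low-rank part with entrywise shrinkage for the sparse part. The following minimal NumPy version illustrates the scheme; the parameter choices (mu initialization, rho) follow common defaults and are our assumptions, not values taken from the paper.

```python
import numpy as np

def rpca_ialm(Y, lam=None, tol=1e-7, max_iter=500):
    """Standard RPCA via the inexact augmented Lagrange multiplier
    method: decompose Y into a low-rank part L plus a sparse part S."""
    m, n = Y.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))      # usual RPCA weight
    norm_Y = np.linalg.norm(Y, "fro")
    mu = 1.25 / np.linalg.norm(Y, 2)        # common initialization
    rho = 1.5                               # penalty growth factor
    L = np.zeros_like(Y); S = np.zeros_like(Y); Z = np.zeros_like(Y)
    for _ in range(max_iter):
        # L-update: singular value thresholding of Y - S + Z/mu
        U, s, Vt = np.linalg.svd(Y - S + Z / mu, full_matrices=False)
        L = U @ np.diag(np.maximum(s - 1.0 / mu, 0.0)) @ Vt
        # S-update: entrywise shrinkage
        T = Y - L + Z / mu
        S = np.sign(T) * np.maximum(np.abs(T) - lam / mu, 0.0)
        # dual and penalty updates
        R = Y - L - S
        Z += mu * R
        mu *= rho
        if np.linalg.norm(R, "fro") / norm_Y < tol:
            break
    return L, S
```

On a synthetic low-rank-plus-sparse matrix in the exact-recovery regime, this iteration typically recovers both components to high accuracy within a few dozen iterations.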
IV Experiments and Results
In this section, the proposed speech enhancement algorithms, TLSDMS and SLSDMS, are evaluated and compared with other state-of-the-art speech enhancement algorithms. In Section IV-A, a direct comparison of the proposed algorithms in the modulation subspaces and the complete spectrum is presented. In Section IV-B, noises with different coherence ratios are used to obtain a better understanding of the merits of the TLSDMS and SLSDMS algorithms. Evaluations via benchmark metrics and intelligibility indexes are presented in Sections IV-C and IV-D, respectively.
In our simulation, all speech and noise signals are downsampled to 16 kHz, and the DFT is implemented using a frame length of 512 samples and 50%-overlapped Hann windows. We select 600 samples from the IEEE database [32] for the noise reduction evaluation. The signal synthesis is performed using the overlap-and-add procedure. Twenty-three noise samples selected from an environmental and industrial noise database [33], plus two simulated noises (white Gaussian and pink noise), are used at various input SNRs (-10 dB to 10 dB). Each dictionary consists of 750 basis vectors learned from 150 selected speech samples.
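The analysis/synthesis setup above (16 kHz, 512-sample Hann frames with 50% overlap, overlap-and-add resynthesis) can be sketched with SciPy's STFT and its inverse; the random placeholder signal stands in for a speech utterance and is our own choice.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000        # signals downsampled to 16 kHz
nperseg = 512     # 512-sample frames
noverlap = 256    # 50% overlap with a Hann window (COLA-compliant)

# one second of noise as a stand-in for a noisy utterance
x = np.random.default_rng(0).standard_normal(fs)

# analysis: complex spectrogram of the signal
f, t, X = stft(x, fs=fs, window="hann", nperseg=nperseg, noverlap=noverlap)

# ... enhancement would modify the magnitude of X here ...

# synthesis: overlap-and-add reconstruction
_, x_rec = istft(X, fs=fs, window="hann", nperseg=nperseg, noverlap=noverlap)
```

With a Hann window at 50% overlap the constant-overlap-add condition holds, so the unmodified round trip reconstructs the input essentially exactly.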
IV-A Comparison of the Proposed Algorithms in Modulation Subspaces and the Complete Spectrum
In order to evaluate the benefits of the separated subspaces, the proposed TLSDMS and SLSDMS are applied to both the modulation subspaces and the complete spectrum. For the complete-spectrum speech recovery, the dictionary is also trained using Bayesian NMF, in the TF domain instead of the envelop subspace. The selected 50 speech samples do not overlap with the test utterances. White Gaussian, pink, and Volvo noises are used as the additive background noise at various SNRs (-10, -5, 0, 5, and 10 dB). Four objective metrics, perceptual evaluation of speech quality (PESQ), segmental SNR (SegSNR) [33], signal-to-distortion ratio (SDR), and the hearing-aid speech quality index (HASQI), are used to quantitatively evaluate the performance of the two LSD algorithms in the different subspaces.
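For concreteness, the SegSNR metric listed above is commonly computed as the average of frame-wise SNRs clamped to a fixed range; the following is a minimal sketch, where the frame length and the [-10, 35] dB clamp are common conventions rather than values specified in this paper.

```python
import numpy as np

def seg_snr(clean, enhanced, frame_len=512, smin=-10.0, smax=35.0):
    """Segmental SNR: frame-wise SNR in dB, clamped to [smin, smax]
    and averaged over frames (a common convention)."""
    n = min(len(clean), len(enhanced))
    snrs = []
    for start in range(0, n - frame_len + 1, frame_len):
        s = clean[start:start + frame_len]
        e = enhanced[start:start + frame_len]
        num = np.sum(s ** 2)
        den = np.sum((s - e) ** 2)
        if num == 0.0 or den == 0.0:
            continue  # skip silent or perfectly reconstructed frames
        snrs.append(np.clip(10.0 * np.log10(num / den), smin, smax))
    return float(np.mean(snrs))
```

The clamping prevents a few nearly noise-free (or nearly silent) frames from dominating the average, which is why SegSNR tracks perceived quality better than a global SNR.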
As shown in Fig. 7, SLSDMS generally achieves performance similar to TLSDMS with respect to the PESQ, SegSNR, and SDR metrics. With fine-tuned parameters, the two LSD algorithms show almost the same capability in improving speech quality. In terms of speech perception, a higher HASQI value indicates speech that can be recognized with a lower error rate. The results clearly demonstrate that the modulation subspace based decompositions (indexed by blue and red ) of both algorithms yield higher HASQI values on average than the complete-spectrum based decompositions. This indicates that decoupling the envelop subspace and the details subspace specifically helps to improve speech intelligibility, even though it shows no significant advantage over the complete spectrum on the quality metrics.
IV-B Coherent and Incoherent Noise Reduction by the Proposed Algorithms
In this section, different types of noise samples are used to test the proposed TLSDMS and SLSDMS algorithms. Eight noise samples, including car, babble, construction crane, jet F16, fan, Volvo, machine-gun, and white Gaussian noise, are used to represent various coherence ratios and SLRs (as shown in Fig. 6 and Table I).
As shown in Fig. 8, the proposed algorithms achieve the best performance on Volvo noise, because this noise sample has the lowest coherence ratio (i.e., ). Generally, the performance of both algorithms decreases as the coherence ratio increases. For instance, the averaged performance on machine-gun noise () is higher than that on babble noise (), which is comparable with the results on crane noise (). The reason is that a high coherence ratio causes ambiguity in speech extraction.
The results in Fig. 8 also reveal that not only the coherence ratio but also the SLR plays a role in noise cancellation in the proposed speech enhancement framework. The averaged performance on jet F16 noise () is clearly better than that on babble noise (), even though the two noise samples have almost the same coherence ratios. The same situation occurs for machine-gun noise () and car noise (). A potential explanation is that jet F16 and machine-gun noise exhibit either a lower rank () in the spectral envelop subspace or higher sparsity (, ) in both the spectral envelop subspace and the speech dictionary space. These characteristics make the background noise more distinguishable, and by utilizing the SLR, the proposed algorithms can impose low-rank and sparse constraints to successfully separate the speech and noise components.
IV-C Assessment via Speech Quality Metrics
The performance of the proposed TLSDMS and SLSDMS algorithms is compared with that of four state-of-the-art speech enhancement algorithms: MMSESPP [34], NMFRPCA [12], RPCA [35], and LARC [15]. The evaluation is implemented across 25 noise samples at various SNRs (-10, -5, 0, 5, and 10 dB). Each benchmark algorithm is fine-tuned to achieve its best performance.
Figure 9 shows the performance evaluation by three speech quality metrics: the source-to-distortion ratio (SDR), source-to-interference ratio (SIR), and source-to-artifact ratio (SAR) from the BSS-Eval toolbox [36]. SDR measures the overall quality of the enhanced speech, whereas SIR and SAR reflect the amount of noise reduction and the inverse of the speech distortion, respectively. The results show that both proposed LSDMS algorithms outperform the four other algorithms with respect to all three metrics. In addition, the SLSDMS algorithm demonstrates slightly better performance than the TLSDMS algorithm. The underlying reason could be that a large portion of the selected noise samples have low SLR values, which suits the SLSDMS algorithm.
Evaluation results using SegSNR and PESQ are shown in Fig. 10. Both the TLSDMS and SLSDMS algorithms outperform the four other algorithms in terms of PESQ and SegSNR, and this superiority is especially significant when compared with the unsupervised methods (i.e., MMSESPP and RPCA) at low SNRs. As a supervised technique, NMFRPCA also achieves good performance at low SNRs, since dictionary-based speech recovery techniques can more effectively extract the speech components when speech is severely corrupted by background noise. Specifically, the two proposed algorithms utilize the structural characteristics of the speech and noise spectrograms in the two modulation subspaces, and successfully avoid the general issues of dictionary-based speech enhancement, such as overfitting and speech-like noise.
IV-D Assessment via Speech Intelligibility Metrics
To evaluate the performance of the proposed algorithms on the intelligibility of enhanced speech, three popular indexes, the hearing-aid speech quality index (HASQI) [37], the normalized covariance metric (NCM) [38], and short-time objective intelligibility (STOI) [39], are employed in this section. HASQI is well suited to capturing quality when speech is subjected to a wide variety of distortions; it can accurately predict speech intelligibility ratings and can generally be regarded as an improved version of the coherence speech intelligibility index (CSII). NCM is similar to the speech transmission index (STI): it computes the STI as a weighted sum of transmission index values determined from the envelops of the probe and response signals in each frequency band. STOI is also applied to validate the short-time segments of the enhanced speech. All three metrics are expected to have a monotonic relation with subjective speech intelligibility, where a higher value denotes more intelligible speech.
Figure 11 shows the speech intelligibility evaluations of the proposed algorithms compared with the four other state-of-the-art algorithms at various SNRs. Both proposed algorithms demonstrate superiority on all three intelligibility metrics. At low SNRs (-10 and -5 dB), significant intelligibility improvements are achieved by the proposed algorithms over the benchmark algorithms, whereas the intelligibility improvements of the two other supervised algorithms, NMFRPCA and LARC, degrade greatly. One reason is that our proposed algorithms learn dictionaries from the spectral envelop subspace, which avoids interference from the spectral details subspace; as the noise level increases, the energy in the spectral details subspace can otherwise produce a biased approximation of both the speech and noise components. Another explanation is that our supervised algorithms mainly focus on the recovery of the spectral envelop of speech, which is directly associated with speech intelligibility.
In addition, the SLSDMS algorithm demonstrates better intelligibility improvements than the TLSDMS algorithm, because, for noise with a small R, the tighter low-rank constraint imposed by SLSDMS works better to separate the low-rank noise components from the sparse speech components in the spectral envelop subspace.
V Conclusion
In this paper, we proposed a novel modulation subspace based speech enhancement framework. An acoustic model referred to as formant-and-pitch was applied to obtain the spectral envelop and spectral details subspaces, in which supervised and unsupervised low-rank and sparse decompositions were implemented, respectively. To obtain the speech dictionary in the spectral envelop subspace, BNMF was utilized to inherently capture the temporal dependencies. Two different LSD schemes were developed in the spectral envelop subspace; by imposing different forms of norm to constrain rank and sparsity, the two approaches aimed to be adaptive to various background noises. The performance of the two developed algorithms was compared with that of four existing speech enhancement algorithms, including MMSESPP [34], NMFRPCA [12], RPCA [35], and LARC [15]. Results showed that our algorithms not only performed robustly under different background noises, but also achieved marked improvements in speech perceptual quality with respect to various metrics. In addition, considerably robust performance was demonstrated for different speech dictionaries obtained from several databases in the spectral envelop subspace. Results showed that the MS based LSD approaches demonstrated significant improvements in speech intelligibility when compared with other state-of-the-art algorithms.
VI Acknowledgments
This research was supported in part by the Illinois Clean Coal Institute (ICCI) with funds made available by the State of Illinois.
References
 P. C. Loizou and G. Kim, “Reasons why current speech-enhancement algorithms do not improve speech intelligibility and suggested solutions,” Audio, Speech, and Language Processing, IEEE Transactions on, vol. 19, no. 1, pp. 47–56, 2011.
 M. R. Schroeder and B. S. Atal, “Code-excited linear prediction (CELP): High-quality speech at very low bit rates,” in Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP’85., vol. 10. IEEE, 1985, pp. 937–940.
 G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” Signal Processing Magazine, IEEE, vol. 29, no. 6, pp. 82–97, 2012.
 A. V. Oppenheim and R. W. Schafer, “From frequency to quefrency: A history of the cepstrum,” Signal Processing Magazine, IEEE, vol. 21, no. 5, pp. 95–106, 2004.
 Z. Duan, G. J. Mysore, and P. Smaragdis, “Speech enhancement by online nonnegative spectrogram decomposition in nonstationary noise environments.” in INTERSPEECH, 2012, pp. 595–598.
 T. M. Elliott and F. E. Theunissen, “The modulation transfer function for speech intelligibility,” PLoS Comput Biol, vol. 5, no. 3, p. e1000302, 2009.
 P. Sun and J. Qin, “Low-rank and sparsity analysis applied to speech enhancement via online estimated dictionary,” IEEE Signal Processing Letters, vol. 23, no. 12, pp. 1862–1866, 2016.
 Y.-H. Yang, “On sparse and low-rank matrix decomposition for singing voice separation,” in Proceedings of the 20th ACM international conference on Multimedia. ACM, 2012, pp. 757–760.
 C. Sun, Q. Zhu, and M. Wan, “A novel speech enhancement method based on constrained low-rank and sparse matrix decomposition,” Speech Communication, vol. 60, pp. 44–55, 2014.
 A. Röbel, F. Villavicencio, and X. Rodet, “On cepstral and all-pole based spectral envelope modeling with unknown model order,” Pattern Recognition Letters, vol. 28, no. 11, pp. 1343–1350, 2007.
 A. Bouzid, N. Ellouze et al., “Speech enhancement based on wavelet packet of an improved principal component analysis,” Computer Speech & Language, vol. 35, pp. 58–72, 2016.
 Z. Chen and D. P. Ellis, “Speech enhancement by sparse, low-rank, and dictionary spectrogram decomposition,” in Applications of Signal Processing to Audio and Acoustics (WASPAA), 2013 IEEE Workshop on. IEEE, 2013, pp. 1–4.
 Y.-H. Yang, “Low-rank representation of both singing voice and music accompaniment via learned dictionaries,” in ISMIR, 2013, pp. 427–432.
 N. Mohammadiha, P. Smaragdis, and A. Leijon, “Supervised and unsupervised speech enhancement using nonnegative matrix factorization,” Audio, Speech, and Language Processing, IEEE Transactions on, vol. 21, no. 10, pp. 2140–2151, 2013.
 C. D. Sigg, T. Dikk, and J. M. Buhmann, “Speech enhancement using generative dictionary learning,” Audio, Speech, and Language Processing, IEEE Transactions on, vol. 20, no. 6, pp. 1698–1712, 2012.
 T. Gerkmann and R. C. Hendriks, “Unbiased MMSE-based noise power estimation with low complexity and low tracking delay,” Audio, Speech, and Language Processing, IEEE Transactions on, vol. 20, no. 4, pp. 1383–1393, 2012.
 I. Cohen, “Noise spectrum estimation in adverse environments: Improved minima controlled recursive averaging,” Speech and Audio Processing, IEEE Transactions on, vol. 11, no. 5, pp. 466–475, 2003.
 R. Martin, “Noise power spectral density estimation based on optimal smoothing and minimum statistics,” Speech and Audio Processing, IEEE Transactions on, vol. 9, no. 5, pp. 504–512, 2001.
 T. Gerkmann, C. Breithaupt, and R. Martin, “Improved a posteriori speech presence probability estimation based on a likelihood ratio with fixed priors,” Audio, Speech, and Language Processing, IEEE Transactions on, vol. 16, no. 5, pp. 910–919, 2008.
 L. Deng, J. Droppo, and A. Acero, “Estimating cepstrum of speech under the presence of noise using a joint prior of static and dynamic features,” Speech and Audio Processing, IEEE Transactions on, vol. 12, no. 3, pp. 218–233, 2004.
 C. Breithaupt, T. Gerkmann, and R. Martin, “Cepstral smoothing of spectral filter gains for speech enhancement without musical noise,” Signal Processing Letters, IEEE, vol. 14, no. 12, pp. 1036–1039, 2007.
 T. Gerkmann, M. Krawczyk, and R. Martin, “Speech presence probability estimation based on temporal cepstrum smoothing,” in Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on. IEEE, 2010, pp. 4254–4257.
 H. Veisi and H. Sameti, “Speech enhancement using hidden Markov models in mel-frequency domain,” Speech Communication, vol. 55, no. 2, pp. 205–220, 2013.
 K. Paliwal, K. Wójcicki, and B. Schwerin, “Single-channel speech enhancement using spectral subtraction in the short-time modulation domain,” Speech Communication, vol. 52, no. 5, pp. 450–475, 2010.
 K. Paliwal, B. Schwerin, and K. Wójcicki, “Speech enhancement using a minimum mean-square error short-time spectral modulation magnitude estimator,” Speech Communication, vol. 54, no. 2, pp. 282–305, 2012.
 J. P. Openshaw and J. Mason, “On the limitations of cepstral features in noise,” in Acoustics, Speech, and Signal Processing, 1994. ICASSP-94., 1994 IEEE International Conference on, vol. 2. IEEE, 1994, pp. II–49.
 E. J. Candès, X. Li, Y. Ma, and J. Wright, “Robust principal component analysis?” Journal of the ACM (JACM), vol. 58, no. 3, p. 11, 2011.
 Z. Kang, C. Peng, and Q. Cheng, “Robust PCA via nonconvex rank approximation,” in Data Mining (ICDM), 2015 IEEE International Conference on. IEEE, 2015, pp. 211–220.
 J. F. Gemmeke, T. Virtanen, and A. Hurmalainen, “Exemplar-based sparse representations for noise robust automatic speech recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2067–2080, 2011.
 H. Lee, A. Battle, R. Raina, and A. Y. Ng, “Efficient sparse coding algorithms,” in Advances in neural information processing systems, 2006, pp. 801–808.
 L. Zhuang, H. Gao, Z. Lin, Y. Ma, X. Zhang, and N. Yu, “Non-negative low rank and sparse graph for semi-supervised learning,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 2328–2335.
 P. C. Loizou, Speech enhancement: theory and practice. CRC press, 2013.
 Y. Hu and P. C. Loizou, “Evaluation of objective quality measures for speech enhancement,” IEEE Transactions on audio, speech, and language processing, vol. 16, no. 1, pp. 229–238, 2008.
 R. C. Hendriks, T. Gerkmann, and J. Jensen, “DFT-domain based single-microphone noise reduction for speech enhancement: A survey of the state of the art,” Synthesis Lectures on Speech and Audio Processing, vol. 9, no. 1, pp. 1–80, 2013.
 P.-S. Huang, S. D. Chen, P. Smaragdis, and M. Hasegawa-Johnson, “Singing-voice separation from monaural recordings using robust principal component analysis,” in Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on. IEEE, 2012, pp. 57–60.
 E. Vincent, R. Gribonval, and C. Févotte, “Performance measurement in blind audio source separation,” IEEE transactions on audio, speech, and language processing, vol. 14, no. 4, pp. 1462–1469, 2006.
 J. M. Kates and K. H. Arehart, “The hearing-aid speech quality index (HASQI) version 2,” Journal of the Audio Engineering Society, vol. 62, no. 3, pp. 99–117, 2014.
 J. Ma, Y. Hu, and P. C. Loizou, “Objective measures for predicting speech intelligibility in noisy conditions based on new band-importance functions,” The Journal of the Acoustical Society of America, vol. 125, no. 5, pp. 3387–3405, 2009.
 C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “An algorithm for intelligibility prediction of time–frequency weighted noisy speech,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2125–2136, 2011.