# Deep Speech Denoising with Vector Space Projections

## Abstract

We propose an algorithm to denoise speakers from a single microphone in the presence of non-stationary and dynamic noise. Our approach is inspired by the recent success of neural network models separating speakers from other speakers and singers from instrumental accompaniment. Unlike prior art, we leverage embedding spaces produced with source-contrastive estimation, a technique derived from negative sampling in natural language processing, while simultaneously obtaining a continuous inference mask. Our embedding space directly optimizes for the discrimination of speaker and noise by jointly modeling their characteristics. This space is generalizable in that it is not speaker or noise specific and is capable of denoising speech even if the model has not seen the speaker in the training set. Parameters are trained with dual objectives: one that promotes a selective bandpass filter that eliminates noise at time-frequency positions where it exceeds signal power, and another that proportionally splits time-frequency content between signal and noise. We compare to state-of-the-art algorithms as well as traditional sparse non-negative matrix factorization solutions. The resulting algorithm avoids severe computational burden by providing a more intuitive and easily optimized approach, while achieving competitive accuracy.

Jeffrey Hetherly, Paul Gamble, Maria Barrios, Cory Stephenson, Karl Ni

Lab41, In-Q-Tel

{jhetherly, pgamble, mbarrios, cstephenson, kni}@iqt.org

Index Terms: deep learning, speech, speaker denoising, non-stationary processes

## 1 Introduction

Signal denoising has been a problem in multiple media for over a century, with applications ranging from acoustic speech processing and image processing to seismic data analysis and other modalities. For each application, approaches have evolved over the span of several decades, ranging from traditional statistical signal processing such as Wiener and Kalman filtering to wavelet theory and specific instances of matrix factorization. While effective for locally as well as wide-sense stationary signals, and despite their storied history, these efforts have seen less success with more dynamic, in-the-wild noise owing to their limited algorithmic capacity.

Dynamic noise represents many real-world speech situations, and solutions that have impressive results primarily focus on hardware: array processing efforts [1] in the form of SONAR, RADAR, and Synthetic Aperture sensing. These methods solve the problem by using the inputs of multiple sensors to process the source of interest. Unfortunately, much contemporary media is recorded on a phone, and the same methods cannot be easily extended to the monaural case [2], in which audio is recorded from a single microphone.

In previous approaches to the monaural problem, assumptions are explicitly made on signal and noise attributes, or specifications have enabled some control over environment and listening devices. The more general case of single-track speech recordings in noisy or reverberant rooms has become increasingly common due to the proliferation of inexpensive portable devices with recording capabilities such as cellphones and laptops. In such cases, no assumptions can be made about speech or noise attributes, the nature of the environment, or the location of microphones.

Over the last decade, machine learning approaches have begun to see success in this scenario. In particular, adaptation of familiar matrix factorization techniques [3] to the processing of time-frequency representations of audio signals has proven useful. However, these methods can be difficult to make performant [4], and in many cases additional complexity is required to model source characteristics accurately.

More complex sources can be modeled with the inclusion of a priori knowledge regarding their characteristics. This can be empirically derived from a large corpus of training data provided the model in use has a high capacity. In recent years, neural network approaches have reemerged in the mainstream of machine learning research, and several papers [5] have adapted them to the general speech denoising problem. Among these methods, recurrent neural networks in particular have shown the most promise in modeling acoustic time series [6, 7], especially when applied to time-dependent spectral features.

One challenging aspect of neural network approaches is the development of cost functions. For speech signals in particular, the computational complexity of the cost function is important, as the timescales associated with speech contain many samples. Additionally, if the goal is to separate sources (e.g., a speaker and a noise source), the cost function must be invariant to different permutations of the recovered sources since the ordering is arbitrary. The proposed approach automates the featurization of speakers and the characterization of noise using an efficient permutation-invariant sampling technique.

Building upon previous work, we propose an algorithm to directly optimize a vector space that isolates specific source characteristics. Such an approach is closely related to work in natural language processing [8]; here, we create an embedding space in which speakers and noise sources are explicitly contrasted with each other. The conjecture is that such an intuitive approach will provide better discrimination. We call the proposed algorithm source-contrastive estimation (SCE), as distinguished from noise-contrastive estimation [9]. Our vector space is independent of source type and offers a significant speedup at training time compared to state-of-the-art deep-clustering-based approaches. Additionally, we further improve upon our model with the mask inference approach detailed in [7].

The remainder of this paper describes our approach to finding optimal vector spaces using SCE and applies the technique to synthetically mixed noisy speech. We review state-of-the-art techniques, some of which we leverage, in Sec. 2. The approach is then described in Sec. 3, with implementation details in Sec. 4. Experimental results are shown in Sec. 5, followed by a summary and discussion of future work.

## 2 Related Work

Considerable work has been done in speech processing for as long as recordings have been made. Traditional methods are rooted in signal processing theory, and a large number of them use some type of matrix factorization [10, 11]. In particular, sparse non-negative matrix factorization (SNMF) is shown to be effective at extracting non-stationary noise sources in [4, 12, 3]. SNMF constructs a set of spectral basis functions from training data and linearly combines these with a set of learned weights to reconstruct the spectral features of the desired signal. Sparsity is typically enforced by an $\ell_1$-norm constraint on the learned weights, scaled by a multiplicative hyper-parameter. As will be shown in Sec. 5, linear methods such as these lack the algorithmic capacity to compete with more modern techniques.
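To make the SNMF description concrete, the following is a minimal NumPy sketch of sparse NMF with multiplicative updates. It assumes a Euclidean reconstruction cost and a simple column normalization of the basis matrix, which is not necessarily the variant used in [4]; the function name `sparse_nmf` and the parameter `sparsity` are illustrative choices, not taken from the cited work.

```python
import numpy as np

def sparse_nmf(V, n_basis, sparsity=0.1, n_iter=100, eps=1e-9, seed=0):
    """Factor a non-negative spectrogram V (freq x frames) as W @ H.

    Assumption: Euclidean cost plus an l1 penalty (weight `sparsity`) on H.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    W = rng.random((n_freq, n_basis)) + eps   # spectral basis functions
    H = rng.random((n_basis, n_frames)) + eps  # learned weights (activations)
    for _ in range(n_iter):
        # The sparsity term in the denominator shrinks small activations.
        H *= (W.T @ V) / (W.T @ W @ H + sparsity + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        # Keep basis columns at unit norm; rescale H so W @ H is unchanged.
        norms = np.linalg.norm(W, axis=0) + eps
        W /= norms
        H *= norms[:, None]
    return W, H
```

Larger values of `sparsity` push more activations toward zero at the expense of reconstruction error, which is the trade-off the hyper-parameter controls.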

### 2.1 Convolutional Denoising Autoencoder

Autoencoders have been used to successfully remove noise and to isolate single sources from audio signals [13]. At a high level, autoencoders learn to featurize inputs (usually referred to as encoding) and then reconstruct them as outputs (decoding). This approach is well suited to denoising because a model is forced to preferentially build representations of the non-noise components of its input.

Closely related are convolutional autoencoders, typically used for denoising images [14, 15]. These models make use of convolutional layers during encoding and deconvolutional layers during decoding. Convolutional denoising autoencoders (DAE) applied to audio signals represented via spectrogram (using STFT) operate similarly, though many of these approaches have problems generalizing to unseen signals. Moreover, convolutional autoencoders are an architectural construct still dependent on their cost functions, a challenge that defines how they perform in the context of denoising, where the common -norm may not be sufficiently descriptive.

### 2.2 Neural Network-Based Source Embeddings

Recent successes in monaural audio source separation and denoising have relied on learned embedding vectors [6, 16, 17, 7]. The primary advantage of learning embedding vectors is that they bypass the so-called permutation problem, in which the output of a learning algorithm must be permuted to account for the unordered nature of the target sources [18]. Additionally, the number of sources to be separated and denoised can be arbitrary with an appropriate clustering technique (although this depends on how inference is performed).

The embedding model we propose in the following sections most resembles the combination of deep clustering [6] and mask inference (DC+MI) found in [7], though with a vastly reduced cost function. The DC+MI network learns embeddings given the spectral magnitude of the mixed audio sample using a series of four bi-directional LSTMs (BLSTM). In addition to clustering those embeddings to create a binary mask as in [6], a learned non-linear transformation is used to directly translate the embeddings into a ratio mask. This has the advantage of limiting some of the artifacts inherent to applying a binary mask. However, this comes at the cost of fixing the number of sources to two. Clustering on an arbitrary number of sources can still be performed on the embeddings, but only a binary mask can be constructed from these clusters.

## 3 Approach

The approach used in this paper combines the mask inference capabilities of cited literature in Sec. 2 with the flexibility of SCE to remove dynamic, non-stationary noise sources from speech for monaural audio signals.

### 3.1 Datasets

Our task is to isolate speech from a mixture of dynamic noise and speech using a monaural audio signal. All denoising algorithms are trained and evaluated on a mixture of the LibriSpeech [19] and UrbanSound8K [20] datasets. LibriSpeech provides high-quality audio recordings of isolated English speech from both male and female speakers, and UrbanSound8K provides recordings from ten non-stationary noise classes. Two two-second clips from each dataset are added at various SNRs to create the noisy-speech data. The SNR is continuously varied between -5 and 5 dB during the training phase for all but the SNMF algorithm, for which speech and noise are fed in separately. No impulse response convolution is used, so as to focus solely on removing non-stationary sources of noise.
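Mixing at a prescribed SNR amounts to rescaling the noise clip before adding it to the speech clip. The sketch below shows one way to do this in NumPy; the function name `mix_at_snr` is illustrative, and the use of mean power over the whole clip as the power estimate is an assumption rather than a detail given above.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech` so that the mix has the requested SNR in dB.

    Assumption: power is measured as the mean square over the whole clip.
    """
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Choose gain so that 10*log10(speech_power / (gain**2 * noise_power)) == snr_db.
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    scaled_noise = gain * noise
    return speech + scaled_noise, scaled_noise
```

Drawing `snr_db` uniformly from the interval [-5, 5] for each training example would reproduce the continuously varied SNR described above.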

For the training, validation, and in-set testing of each algorithm we use the train-clean-100 set of audio readings from the LibriSpeech dataset, which provides approximately 100 hours of speech evenly split between female and male speakers. For out-of-set testing we use the dev-clean set from LibriSpeech. Although all noise types from UrbanSound8K are used for training, separate noise files from each noise type are reserved for training, validation, and testing.

### 3.2 Model

Our model for denoising monaural signals operates on the assumption that linearly mixed speech and noise can be well-separated into individual sources . In this context, a source is either a speaker or a particular type of noise. For a given source in a speaker-noise mix, our model masks the magnitude response. This mask filters out information from time-frequency bins in the short-time Fourier transform (STFT), , that do not belong to a given source, while passing those time-frequency bins that do.

Typically, the predicted mask for the source is implemented as either a ratio or in our case, a binary mask. We let , where , being the total number of sources in our training set and being the number to be mixed. To set our masks, if source is the loudest in time frequency bin , then , and otherwise.

Similar to natural language processing embedding techniques like word2vec [8], a given word embedding can represent specific words. Instead of a word embedding, we use a speaker embedding, similarly optimized via two vector spaces. The first vector space is an input embedding that implicitly defines a speaker, and it is not associated with anyone in particular. We also have an output embedding that explicitly trains to a corpus of known speakers. Then, when performing inference to input vector space, it is possible to generalize to any possible speaker by clustering our neural network outputs. In our notation, the input and output vector spaces for a given sample are implemented as tensors with an embedding space of , labeled as and , respectively. The columns of either tensor have dimensions (hidden units) and denote the vectors associated with a speaker’s likeness.

To train and generate our embeddings, we use a recurrent neural network regression to . To compare to [6] and [7], we use a total of four BLSTM layers, and we have a dense layer that is convolved over the output 2D vector produced by the final BLSTM. This final layer of source embeddings is also fed through a non-linear transform as in [7] to yield the ratio mask.

Let our loss for every time frequency bin for sample be denoted as . Then,

(1) |

Here, is the set of sources sampled for mix , and is a single source from the subset. The total loss for the batch of size over all frequencies and time is thus,

(2) |

Intuitively, the output of the neural network at time is and the output vector is an embedding for source at frequency . Say that source is louder than source at time frequency bin for sample . Then we would ideally like the correlation between the embedding produced by our neural network and the vector for source 1 to be high. That is to say, we would like . Simultaneously, the correlation between and the vector for source 2 should be low, since these two vectors should be anti-correlated if they are sufficiently different. That is to say, we would like . Mathematically speaking, we are pulling our embedding towards our source vector and pushing it away from non-source vectors . Which sources to attribute appropriate correlation/anti-correlation to is determined by the label , which will be in the former case and in the latter. It is important to note that we can save on both computation and accuracy by optimizing only those sources that are in , which in our case will have two elements (one speaker and one noise source).
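Because the typeset forms of Eqs. 1 and 2 are not preserved in this copy, the following LaTeX block gives one form that is consistent with the description above. It is a reconstruction under stated assumptions, not the authors' exact notation: $V^{\mathrm{in}}_{b,t,f}$ denotes the network embedding for sample $b$ at time $t$ and frequency $f$, $V^{\mathrm{out}}_{s}$ the output vector for source $s$, $S_b$ the set of sources sampled for mix $b$, $\sigma$ the logistic sigmoid, and $Y_{b,t,f,s} \in \{+1, -1\}$ a label that is positive when source $s$ is the loudest in that bin. Any normalization constant in the total loss is likewise unknown and omitted.

```latex
% Assumed per-bin loss (cf. Eq. 1): pull the embedding toward the loudest
% source's vector and push it away from the other sampled sources.
\mathcal{L}_{b,t,f} = -\sum_{s \in S_b}
  \log \sigma\!\left( Y_{b,t,f,s}\,
  \left\langle V^{\mathrm{in}}_{b,t,f},\, V^{\mathrm{out}}_{s} \right\rangle \right)

% Assumed total loss over a batch of size B (cf. Eq. 2).
\mathcal{L} = \sum_{b=1}^{B} \sum_{t=1}^{T} \sum_{f=1}^{F} \mathcal{L}_{b,t,f}
```

Under this form, a large positive inner product with the loudest source and a large negative inner product with every other sampled source both drive the loss toward zero, matching the pull and push behavior described above.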

Additionally, during inference we do not use the output vector space. While this does further reduce computation, the main intention is to allow out-of-set sources. In fact, even though we may train on mixes with fewer sources, we can perform inference in situations where there are arbitrary numbers of sources.

Our algorithm (denoted SCE+MI) is implemented in Tensorflow, v1.4 [21], with an architecture consisting of four BLSTM layers of 500 units each. These are followed by a fully connected layer that maps the output of the fourth BLSTM layer to the input vector space. The BLSTM layers use tanh nonlinearities, and the fully connected layer is linear. For a batch of inputs , the output of the four BLSTM layers . While the final (embedding) layer of the neural network is technically a fully-connected linear layer, it is implemented as a convolution over the output tensor with a filter . The output of the convolution can then be reshaped to give the input vector space . The vector-space output is fed through what is effectively a 1D convolution along the embedding dimension with a softmax that yields the final ratio mask output. This implementation allows the model to be run for arbitrary input , which is useful at inference time.
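The two output stages described above can be sketched in NumPy as follows. This is an illustration only: the BLSTM stack is omitted, random matrices stand in for trained weights, and the names `embed_and_mask`, `w_embed`, and `w_mask` are hypothetical. The mask head is modeled as a per-bin linear map over the embedding dimension followed by a softmax over sources, which is our reading of "effectively a 1D convolution along the embedding dimension".

```python
import numpy as np

def embed_and_mask(blstm_out, w_embed, w_mask):
    """Map BLSTM outputs to per-bin embeddings and a ratio mask.

    blstm_out: (batch, time, hidden) output of the last BLSTM layer
    w_embed:   (hidden, freq * emb_dim) linear embedding layer
    w_mask:    (emb_dim, n_sources) mask-inference weights (assumed shape)
    """
    batch, time, _ = blstm_out.shape
    emb_dim = w_mask.shape[0]
    freq = w_embed.shape[1] // emb_dim
    # Linear layer applied at every time step, then reshaped per T-F bin.
    v_in = (blstm_out @ w_embed).reshape(batch, time, freq, emb_dim)
    # Softmax over sources gives a ratio mask that sums to one in each bin.
    logits = v_in @ w_mask
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    mask = np.exp(logits)
    mask /= mask.sum(axis=-1, keepdims=True)
    return v_in, mask
```

Because the same weights are applied at every time step, the function accepts inputs with any number of frames, which mirrors the variable-length property noted above.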

For efficient evaluation of the cost function of Eq. 1 across batches, the sources vectors for sources only represented in each batch are assembled into a tensor . The ordering of the speakers in must match the ordering used in , but is otherwise arbitrary. To efficiently compute the dot products in Eq. 1 with broadcasting, we expand the dimensions appropriately.

This gives an output of the dot product operation as a tensor , which is compatible with the labels and so they can be multiplied together elementwise to give the argument of the sigmoid in Eq. 1. The remaining portion of the cost function is easily evaluated.
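The broadcasting step can be illustrated with a short NumPy sketch. The tensor shapes, the +1/-1 label convention, and the mean reduction are assumptions made for illustration; the function name `sce_loss` is hypothetical and the original TensorFlow implementation may differ.

```python
import numpy as np

def sce_loss(v_in, v_out, labels):
    """Contrastive loss between per-bin embeddings and source vectors.

    v_in:   (batch, time, freq, emb_dim) network embeddings
    v_out:  (batch, n_sources, emb_dim) vectors of the sources in each mix
    labels: (batch, time, freq, n_sources), +1 for the loudest source, -1 otherwise
    """
    # Expand dimensions so every T-F embedding meets every source vector.
    dots = np.sum(v_in[:, :, :, None, :] * v_out[:, None, None, :, :], axis=-1)
    # -log(sigmoid(x)) == log(1 + exp(-x)), computed stably with logaddexp.
    return np.mean(np.logaddexp(0.0, -labels * dots))
```

Only the sources present in each mix appear in `v_out`, so the cost scales with the number of mixed sources rather than with the number of sources in the training set.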

Our batch size is during training. The input tensors have dimensions and label tensors are , where is the length of total time steps per sample and are the number of frequency bins used.

## 4 Experiments

In all experiments, signals are resampled and scaled to , zero mean and unit standard deviation, from which the short time Fourier transform (STFT) spectrograms are extracted with a Hanning window of 512 and 256 stride length. We use audio clips of approximately two seconds which, when combined with the STFT operation, yield input features of dimension (frequencies by time frames). The complex phases were saved separately for use in post separation processing. Separate spectrograms for the signal from each speaker and noise ( for ) were computed for training and evaluation purposes, while the total spectrogram was computed by the elementwise sum for a speaker and noise with IDs .
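A minimal NumPy sketch of this preprocessing is given below, assuming a Hann window of 512 samples and a hop of 256 samples as stated above. The sample rate is not assumed, frames are taken without padding (so trailing samples that do not fill a frame are dropped), and the helper names `normalize` and `stft` are illustrative.

```python
import numpy as np

def normalize(signal):
    """Scale a waveform to zero mean and unit standard deviation."""
    return (signal - signal.mean()) / signal.std()

def stft(signal, win_length=512, hop=256):
    """Return a complex spectrogram of shape (frames, win_length // 2 + 1)."""
    window = np.hanning(win_length)
    n_frames = 1 + (len(signal) - win_length) // hop
    frames = np.stack([
        signal[i * hop: i * hop + win_length] * window
        for i in range(n_frames)
    ])
    return np.fft.rfft(frames, axis=1)
```

The magnitude `np.abs(spec)` feeds the model, while the phase `np.angle(spec)` can be stored separately and reattached after masking.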

The magnitudes of the spectrograms were then passed through a square root nonlinearity and percent normalized. This is similar to the procedure suggested in [22]; however, we obtained better results with a square root rather than a logarithmic nonlinearity. Source labels are assigned to each T-F bin by giving one value to the signal that is loudest at that time and frequency, and a different value to all other sources.
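The feature and label construction can be sketched as follows. The label values of +1 for the loudest source and -1 for the others are an assumption (chosen because the labels multiply the argument of a sigmoid in the cost function), the percent normalization step is omitted because its exact definition is not given here, and the function names are illustrative.

```python
import numpy as np

def sqrt_features(mix_spec):
    """Square-root compression of the mixture magnitude spectrogram."""
    return np.sqrt(np.abs(mix_spec))

def loudest_source_labels(source_mags, pos=1.0, neg=-1.0):
    """Label each T-F bin by its loudest source.

    source_mags: (n_sources, time, freq) magnitudes of the isolated sources
    returns:     (time, freq, n_sources) with `pos` for the loudest, else `neg`
    """
    loudest = np.argmax(source_mags, axis=0)
    n_sources = source_mags.shape[0]
    one_hot = loudest[..., None] == np.arange(n_sources)
    return np.where(one_hot, pos, neg)
```

Setting `pos=1.0` and `neg=0.0` instead would produce the ideal binary mask for each source from the same comparison.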

### 4.1 Algorithm Comparisons

We compare three approaches against the proposed work: a linear matrix factorization method (SNMF), a denoising autoencoder (DAE), and a hybrid deep clustering/mask inference architecture (DC+MI). SNMF is adopted from [4] with the optimal hyperparameter settings found therein and trained on 10000 two-second audio clips of noise and speakers. To aid in training SNMF we removed portions of each spectrogram based on a log max-amplitude threshold for each time frame. This threshold was found to isolate spoken words while trimming the surrounding empty audio.

Our comparison convolutional DAE is based on [23] and consists of 15 convolutional layers followed by 15 deconvolutional layers. Each layer contains 128 5x5 filters with ReLU activation and constant input size. Skip connections are employed between every other pair of matched convolutional and deconvolutional layers. The model was trained using Adam RMSprop with Nesterov momentum and a learning rate of 5e-5.

The DC+MI network is an implementation of the architecture found in [7]. We use the same optimal set of hyperparameters as they do except that our loss function for the mask inference head uses the true spectral component rather than a proxy.

Comparisons of the performance of each algorithm are quantified by improvement in the source-to-distortion ratio (SDR). Each algorithm is evaluated on how well it improves the SDR metric over a range of input SNRs and for each noise type.
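As a rough illustration of the metric, the sketch below computes a simplified SDR and its improvement over the unprocessed mixture. This is an assumption-laden stand-in: it treats everything other than the reference as distortion and does not perform the projection-based decomposition used by standard BSS evaluation toolkits, so its values will not match those toolkits exactly. The function names are illustrative.

```python
import numpy as np

def sdr_db(reference, estimate, eps=1e-12):
    """Simplified SDR in dB: reference energy over residual energy."""
    residual = reference - estimate
    return 10.0 * np.log10((np.sum(reference ** 2) + eps)
                           / (np.sum(residual ** 2) + eps))

def sdr_improvement(reference, mixture, estimate):
    """SDR of the denoised estimate minus SDR of the noisy mixture."""
    return sdr_db(reference, estimate) - sdr_db(reference, mixture)
```

For example, an estimate whose residual noise amplitude is one tenth of that in the mixture yields an improvement of 20 dB under this definition.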

### 4.2 Reconstruction

At inference time for our model and the deep-clustering head of DC+MI, a signal consisting of an unknown mixture of sources is preprocessed as described in the previous subsection, giving a complex T-F estimate of a single source signal. An input feature is generated and fed through the model to obtain the embedding vectors. A k-means clustering is then performed on the vectors in order to generate a labeling prediction in which each T-F element is associated with a cluster label. Here an element of the labeling for a given cluster is 1 if the associated vector belongs to that cluster, and 0 otherwise. These labelings can then be used as masks to reconstruct a source from each of the clusters. T-F representations of the inferred sources are calculated as the element-wise multiplication of the input spectrogram with the inferred labeling.

(3) |

The source spectrogram is then converted (using the inverse STFT) into a source waveform, completing the inference process.
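The clustering and masking steps can be sketched in NumPy as below. The small k-means loop is a plain Lloyd iteration written out only to keep the example self-contained; the function names, the random initialization, and the fixed iteration count are assumptions, and the inverse STFT is left out.

```python
import numpy as np

def kmeans_labels(vectors, n_clusters=2, n_iter=20, seed=0):
    """Assign each row of `vectors` to one of `n_clusters` clusters."""
    rng = np.random.default_rng(seed)
    centers = vectors[rng.choice(len(vectors), n_clusters, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(vectors[:, None, :] - centers[None, :, :], axis=-1)
        assign = np.argmin(dists, axis=1)
        for c in range(n_clusters):
            if np.any(assign == c):  # leave empty clusters where they are
                centers[c] = vectors[assign == c].mean(axis=0)
    return assign

def cluster_masks(embeddings, n_clusters=2):
    """Turn (time, freq, emb_dim) embeddings into binary masks per cluster."""
    n_time, n_freq, emb_dim = embeddings.shape
    assign = kmeans_labels(embeddings.reshape(n_time * n_freq, emb_dim), n_clusters)
    return np.stack([(assign == c).reshape(n_time, n_freq)
                     for c in range(n_clusters)]).astype(float)

def apply_masks(mix_spec, masks):
    """Element-wise product of each mask with the complex mixture spectrogram."""
    return masks * mix_spec[None]
```

Since every T-F bin belongs to exactly one cluster, the masked spectrograms sum back to the input mixture, and each one can be passed to an inverse STFT to obtain a waveform.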

The output of the mask inference head of SCE+MI and DC+MI, and that of SNMF, is a ratio mask that, when multiplied element-wise with the original spectrogram, yields the respective speaker and noise sources. These ratio masking techniques have the potential to produce higher-quality audio than binary masking, as the T-F bins can be shared amongst sources (as is actually the case).

Our contribution, replicated research [7, 4], and evaluation code can be found at `http://github.com/lab41/magnolia`.

## 5 Results

The results from our experiments on a hold-out set of mixes are summarized in Figs. 1(a) and 1(b). The performance of the mask inference head of SCE+MI is on par with that of DC+MI (+13 dB at an input SNR of [-5,-4] dB), while the clustering performance of SCE (+11.5 dB) is slightly better than the clustering of the SCE+MI and DC+MI algorithms. Thus, SCE may be more desirable when the number of sources to be separated is arbitrary. The improvements in SDR are greatest for more statistically stationary noise sources and for inputs with lower SNR. This can be explained by the fact that at higher input SNRs the signal is already quite prominent, so there is less room for improvement. The performance of the deep learning-based methods is relatively consistent across input SNRs, while SNMF sees more dramatic differences.

## 6 Conclusions

We show that SCE with mask inference gives improved reconstruction performance for dynamic noise source denoising. Mask inference performs well (on average, +12 dB in SDR) regardless of the clustering loss it is coupled with. SCE showed the best clustering performance (on average, +11 dB in SDR). This indicates that, when denoising in the presence of an arbitrary number of sources, SCE may give better accuracy.

At present, the training objective related to the embedding space (SCE or DC) is not perfectly aligned with the k-means clustering performed at inference. Immediate future work on incorporating the k-means objective into the training procedure [24] could improve clustering performance.

### References

- J. Sanz-Robinson, L. Huang, T. Moy, W. Rieutort-Louis, Y. Hu, S. Wagner, J. C. Sturm, and N. Verma, “Robust blind source separation in a reverberant room based on beamforming with a large-aperture microphone array,” in ICASSP. IEEE, 2016, pp. 440–444. [Online]. Available: http://dblp.uni-trier.de/db/conf/icassp/icassp2016.html
- L. K. Hansen and K. B. Petersen, “Monaural ICA of white noise mixtures is hard,” in Proceedings of ICA’2003 Fourth Int. Symp. on Independent Component Analysis and Blind Signal Separation, Nara, Japan, April 4, 2003, pp. 815–820. [Online]. Available: http://www2.imm.dtu.dk/pubdb/p.php?1650
- N. Mohammadiha, P. Smaragdis, and A. Leijon, “Supervised and unsupervised speech enhancement using nonnegative matrix factorization,” CoRR, vol. abs/1709.05362, 2017. [Online]. Available: http://arxiv.org/abs/1709.05362
- J. Le Roux, F. J. Weninger, and J. R. Hershey, “Sparse nmf–half-baked or well done?” Mitsubishi Electric Research Labs (MERL), Cambridge, MA, USA, Tech. Rep., no. TR2015-023, 2015.
- A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, “Wavenet: A generative model for raw audio,” CoRR, vol. abs/1609.03499, 2016. [Online]. Available: http://arxiv.org/abs/1609.03499
- J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, “Deep clustering: Discriminative embeddings for segmentation and separation,” in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 31–35.
- Y. Luo, Z. Chen, J. R. Hershey, J. Le Roux, and N. Mesgarani, “Deep Clustering and Conventional Networks for Music Separation: Stronger Together,” ArXiv e-prints, Nov. 2016.
- T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2013, pp. 3111–3119. [Online]. Available: http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf
- M. Gutmann and A. Hyvärinen, “Noise-contrastive estimation: A new estimation principle for unnormalized statistical models,” in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010, pp. 297–304.
- H. Kameoka, Non-negative Matrix Factorization and Its Variants for Audio Signal Processing. Tokyo: Springer Japan, 2016, pp. 23–50. [Online]. Available: https://doi.org/10.1007/
- T. Virtanen, “Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 3, pp. 1066–1074, March 2007.
- M. N. Schmidt, “Speech separation using non-negative features and sparse non-negative matrix factorization,” in Computer Speech and Language, 2008, submitted. [Online]. Available: http://www.imm.dtu.dk/pubdb/p.php?5377
- X. Feng, Y. Zhang, and J. Glass, “Speech feature denoising and dereverberation via deep autoencoders for noisy reverberant speech recognition,” in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 1759–1763.
- E. M. Grais and M. D. Plumbley, “Single Channel Audio Source Separation using Convolutional Denoising Autoencoders,” ArXiv e-prints, Mar. 2017.
- P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion,” Journal of Machine Learning Research, vol. 11, no. Dec, pp. 3371–3408, 2010.
- Y. Isik, J. Le Roux, Z. Chen, S. Watanabe, and J. R. Hershey, “Single-channel multi-speaker separation using deep clustering,” CoRR, vol. abs/1607.02173, 2016. [Online]. Available: http://arxiv.org/abs/1607.02173
- Z. Chen, Y. Luo, and N. Mesgarani, “Deep attractor network for single-microphone speaker separation,” CoRR, vol. abs/1611.08930, 2016. [Online]. Available: http://arxiv.org/abs/1611.08930
- D. Yu, M. Kolbæk, Z. Tan, and J. Jensen, “Permutation invariant training of deep models for speaker-independent multi-talker speech separation,” CoRR, vol. abs/1607.00325, 2016. [Online]. Available: http://arxiv.org/abs/1607.00325
- V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an asr corpus based on public domain audio books,” in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 5206–5210.
- J. Salamon, C. Jacoby, and J. P. Bello, “A dataset and taxonomy for urban sound research,” in 22nd ACM International Conference on Multimedia (ACM-MM’14), Orlando, FL, USA, Nov. 2014.
- M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015, software available from tensorflow.org. [Online]. Available: http://tensorflow.org/
- Y. Wang, A. Narayanan, and D. Wang, “On training targets for supervised speech separation,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 22, no. 12, pp. 1849–1858, 2014.
- X.-J. Mao, C. Shen, and Y.-B. Yang, “Image Restoration Using Convolutional Auto-encoders with Symmetric Skip Connections,” ArXiv e-prints, Jun. 2016.
- B. Yang, X. Fu, N. D. Sidiropoulos, and M. Hong, “Towards K-means-friendly Spaces: Simultaneous Deep Learning and Clustering,” ArXiv e-prints, Oct. 2016.