On Psychoacoustically Weighted Cost Functions Towards Resource-Efficient Deep Neural Networks for Speech Denoising
We present a psychoacoustically enhanced cost function to balance network complexity and perceptual performance of deep neural networks for speech denoising. While training the network, we add perceptual weights to the ordinary mean-squared error to emphasize the contribution of the most audible frequency bins while ignoring the error from inaudible bins. To generate the weights, we employ psychoacoustic models to compute the global masking threshold from the clean speech spectra. We then evaluate the speech denoising performance of our perceptually guided neural network using both objective and perceptual sound quality metrics, testing on network structures ranging from shallow and narrow to deep and wide. The experimental results showcase our method as a valid approach for infusing perceptual significance into deep neural network operations. In particular, the perceptually sensible performance gains seen in simple network topologies indicate that the proposed method can lead to resource-efficient speech denoising implementations in small devices without degrading the perceived signal fidelity.
Kai Zhen, Aswin Sivaraman, Jongmo Sung, Minje Kim

This work was supported by an Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (2017-0-00072, Development of Audio/Video Coding and Light Field Media Fundamental Technologies for Ultra Realistic Tera-media).
Indiana University, School of Informatics, Computing, and Engineering, Bloomington, IN
Electronics and Telecommunications Research Institute, Daejeon, South Korea
Index Terms— Network compression, psychoacoustic model, speech enhancement, deep neural networks, resource-efficient machine learning
1 Introduction
Deep Neural Networks (DNNs) have seen rapidly growing usage in audio signal processing, improving the state of the art in source separation, noise reduction, and speech enhancement. In many of these studies, the improved quality of the recovered sources relies greatly on enlarged model complexity. For example, a network with two hidden layers of 300 units each showed speech separation performance more than 1 dB better, in terms of Signal-to-Distortion Ratio (SDR), than a traditional dictionary-based separation model (where the dictionaries are learned in advance by Non-negative Matrix Factorization (NMF) [1, 2]). Another recent example is a DNN in which both the phase and magnitude of the source are effectively predicted. If such a network were represented as a weight matrix in a linear transformation, the required number of floating-point operations would easily exceed a few million. It has also been shown that, for standard feed-forward networks, a larger structure (3.1M parameters) outperforms a smaller one (644K parameters) by 1.17 dB. As expected, network complexity is predominantly increased in favor of improved performance. However, the enlarged structure can become a bottleneck when implementing the DNN in a small device with limited resources (e.g., power and memory), especially under a stringent requirement for real-time speech enhancement.
As DNNs increase in size and resource usage, neural network compression has grown into a lively research area. Carefully pruning some of the units can reduce the size of the network. Lowering the quantization level of the network parameters, by reducing the number of bits used to represent each parameter, is another way to compress the network. For example, recent studies report that binary or ternary quantization schemes do not significantly reduce accuracy on famous benchmark classification tasks [7, 8, 9, 10]. However, those general-purpose compression techniques do not exploit the audio-specific characteristics of the problem. For example, as argued in the Perceptual Evaluation methods for Audio Source Separation (PEASS) toolkit, standard energy-based objective metrics such as Signal-to-Noise Ratio (SNR) and its variants are not the best way to judge the perceptual quality of audio signals.
In this paper we claim that a neural network for audio enhancement can be optimized further if it exploits human perception. For example, legacy digital audio coding schemes leverage principles of sound perception to reduce the coded signal’s data rate; we refer to these perceptual coding schemes as Psychoacoustic Models (PAM). By discarding inaudible tones and allowing more quantization noise in the least audible portions of the audio spectrum, PAM reduces the bit rate while minimizing degradation of the overall signal fidelity. The psychoacoustic concepts most relevant to the design of a perceptual audio coder are the phenomena of hearing thresholds and auditory masking. These concepts have been empirically modeled and can be used in conjunction with time-frequency analysis to identify the perceptually significant (i.e., tonal and non-tonal) components of the signal. Once identified, auditory irrelevancies, which are either masked or unheard, can be removed. Because the psychoacoustics literature is diverse in its findings, different modern perceptual audio codecs have adopted their own PAM. We incorporate PAM-1, popularized by the ISO/IEC MPEG-1 standard, without loss of generality.
Although PAM has been commonly used in traditional audio coding, it has not yet been applied to neural network compression. Recent work by Liu et al. on a perceptually weighted DNN for speech enhancement shows a promising amount of improvement in both signal quality and intelligibility. However, that network uses a perceptual weight model based on a sigmoid function applied to the signal, which prioritizes high-energy components of the target speech’s spectral power. Although this assumption is valid in the case of speech, unlike PAM it does not hold across all types of audio. In this regard, our work generalizes and greatly expands upon the findings of Liu et al.
In this paper, we present a perceptually weighted cost function to train a DNN that is structurally simpler, but conducts perceptually comparable speech denoising. To do this, we will generate meaningful weights based on the global masking threshold of our training data as prescribed by PAM-1, and then harmonize the weights with the mean-squared error. We evaluate denoising results from various network architectures and show that the proposed method leads to a more condensed network topology without losing the perceptual quality of the recovered speech.
2.1 Conventional Mask Learning Networks
The input to the mask learning model is the magnitude spectra $|\mathbf{X}|$ of the noisy utterances, which approximates the mixture of speech and noise defined in the complex domain. (We assume all data matrices are magnitude spectrograms; the exact mixture is defined in the complex time-frequency domain.) Rectified Linear Units (ReLU) are common as the activation function $g$ to avoid the vanishing gradient problem. For the $l$-th hidden layer, the feed-forward process is defined as follows:

$$\mathbf{z}^{(l)} = \mathbf{m}^{(l)} \odot g\big(\tanh(\mathbf{W}^{(l)})\,\mathbf{z}^{(l-1)} + \tanh(\mathbf{b}^{(l)})\big), \qquad (1)$$

where $l$ indexes the layers, with $l=0$ as the special case for the input layer (i.e., $\mathbf{z}^{(0)}$ stands for the input), and $\mathbf{W}^{(l)}$ and $\mathbf{b}^{(l)}$ are the weights and bias, respectively. $\odot$ denotes the Hadamard product. In practice, we find it useful to conduct a smooth weight clipping by applying the hyperbolic tangent function to each weight and bias in (1), which bounds them within the range of $-1$ to $1$. Dropout is applied to the output of a layer before it is fed to the next layer; for dropout, we use $\mathbf{m}^{(l)}$ as the masking matrix for the $l$-th layer, whose elements are binary values drawn from a Bernoulli distribution with a fixed keep probability.
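The feed-forward step described above can be sketched in NumPy as follows. This is a minimal illustration of the described computation, not the authors' implementation; the function name `dense_layer` and its arguments are placeholders.

```python
import numpy as np

def dense_layer(z_prev, W, b, keep_prob=1.0, rng=None):
    """One feed-forward layer: tanh-clipped affine transform, ReLU, dropout."""
    # Smoothly clip weights and bias into (-1, 1), as described in the text.
    a = np.tanh(W) @ z_prev + np.tanh(b)
    z = np.maximum(a, 0.0)                  # ReLU activation
    if keep_prob < 1.0:                     # dropout: Bernoulli mask on the output
        rng = np.random.default_rng() if rng is None else rng
        m = rng.binomial(1, keep_prob, size=z.shape)
        z = m * z
    return z
```

Because the clipping is applied inside the forward pass, gradients still flow through the unclipped parameters during training, which is what makes the clipping "smooth" compared to a hard projection.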
The speech denoising networks are trained to predict the Ideal Ratio Mask (IRM), which can mask the input to recover the speech spectrogram. Hence, we employ the logistic function $\sigma$ as the final layer activation, modifying (1) for the last layer $L$ as follows:

$$\hat{\mathbf{M}} = \sigma\big(\tanh(\mathbf{W}^{(L)})\,\mathbf{z}^{(L-1)} + \tanh(\mathbf{b}^{(L)})\big).$$
From this we define the mean-squared error function between the IRM $\mathbf{M}$ and the network output $\hat{\mathbf{M}}$:

$$\mathcal{E}(\mathbf{M}\,\|\,\hat{\mathbf{M}}) = \frac{1}{DN}\sum_{d=1}^{D}\sum_{n=1}^{N}\big(M_{d,n} - \hat{M}_{d,n}\big)^2, \qquad (2)$$

where $D$ and $N$ are the number of input dimensions and the number of samples in the training data, respectively.
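The error function above is the standard per-element mean-squared error, which a short sketch makes concrete (the function name is illustrative):

```python
import numpy as np

def mse(M, M_hat):
    """Mean-squared error averaged over all D x N mask entries."""
    D, N = M.shape
    return np.sum((M - M_hat) ** 2) / (D * N)
```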
2.2 Psychoacoustic Models
Incorporation of a psychoacoustic model is essential to constructing a weight matrix that steers the cost function toward the signal components of greatest perceptual significance. We utilize a simplified version of PAM-1, considering only the tonal signal components and ignoring the non-tonal (i.e., noise-like) signal components. PAM-1 computes the global masking threshold for all frames of the input spectrogram. The threshold is computed first by performing a Sound Pressure Level (SPL) normalization of the training spectrogram to determine the signal’s Power Spectral Density (PSD) $\mathbf{P}$ in dB-SPL; the SPL normalization constant is fixed. Then, the model identifies the tonal maskers, ignoring those which fall below the absolute threshold of hearing (ATH). A spreading function is used to generate masking curves for each tone. The combination of these individual masking curves plus the ATH yields the global masking threshold $\mathbf{T}$. This implementation of PAM-1 is detailed in the literature. Fig. 1 showcases the various components of the PAM-1 model for an example frame from the training data set.
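Two of the ingredients above are simple enough to sketch directly: the ATH curve (using Terhardt's common approximation) and the SPL-normalized PSD. The normalization constant `pn_db = 90.302` is the value used in standard PAM-1 implementations and is an assumption here, as are the function names; tonal-masker detection and the spreading function are omitted from this sketch.

```python
import numpy as np

def absolute_threshold_of_hearing(f_hz):
    """Terhardt's approximation of the ATH in dB-SPL (f in Hz)."""
    f = np.asarray(f_hz, dtype=float) / 1000.0      # convert to kHz
    return (3.64 * f ** -0.8
            - 6.5 * np.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)

def spl_normalized_psd(S_mag, pn_db=90.302):
    """SPL-normalized PSD in dB: P = PN + 10*log10(|S|^2)."""
    return pn_db + 20.0 * np.log10(np.maximum(S_mag, 1e-12))
```

The ATH is lowest near the ear's most sensitive region (around 3-4 kHz) and rises steeply toward low frequencies, which the sketch reproduces.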
3 Proposed Perceptual Weighting
The proposed method reformulates a given ordinary cost function using perceptual weights derived from the masking curves computed by PAM-1. Using the training clean speech signal’s power spectral density $\mathbf{P}$ and its corresponding global masking threshold $\mathbf{T}$, both in dB-SPL, we define a perceptual weight matrix $\mathbf{H}$ which is applied to the network cost function (2):

$$\mathbf{H} = \max\!\Big(0,\ \log_{10}\big(10^{\mathbf{P}/10} \oslash 10^{\mathbf{T}/10}\big)\Big). \qquad (3)$$
Therefore, $H_{f,t}$ is the log ratio between the signal power and the masking threshold, rescaled from dB-SPL. Division is carried out in an element-wise fashion. The intuition behind this weight matrix definition can be understood by observing Fig. 1. For any signal energy in frequency bin $f$ of the $t$-th time frame, if the signal’s power is greater than its masking threshold, i.e., $P_{f,t} > T_{f,t}$, this tone must be audible. In Fig. 1, the audible regions are those where the blue line is higher than the green dotted line. On the other hand, if the power of the source spectrum is lower than the threshold, the region is masked and inaudible. With this understanding, we define weights bounded below by 0, whose smaller extreme says that the masking threshold is very large and any sound, such as the reconstruction error at that time-frequency bin, is not audible. Conversely, a large weight value means that the source spectral component is large and audible even considering the masking threshold; therefore, the system should not create much error there. Now that the weight matrix encodes the perceptual importance of all time-frequency bins, we combine it with the original MSE function:

$$\mathcal{E}_{\mathbf{H}}(\mathbf{M}\,\|\,\hat{\mathbf{M}}) = \frac{1}{DN}\sum_{d=1}^{D}\sum_{n=1}^{N} H_{d,n}\big(M_{d,n} - \hat{M}_{d,n}\big)^2. \qquad (4)$$
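The weight computation and the weighted error described above can be sketched as follows. This is an illustrative reading of the text (the floor at zero for masked bins is an assumption made explicit here), not the authors' exact code:

```python
import numpy as np

def perceptual_weights(P_db, T_db):
    """Log ratio of linear signal power to masking threshold, floored at 0.

    P_db and T_db are the PSD and global masking threshold in dB-SPL;
    masked bins (P < T) receive zero weight.
    """
    return np.maximum(0.0,
                      np.log10(10.0 ** (P_db / 10.0) / 10.0 ** (T_db / 10.0)))

def weighted_mse(M, M_hat, H):
    """Perceptually weighted mean-squared error over a D x N mask."""
    D, N = M.shape
    return np.sum(H * (M - M_hat) ** 2) / (D * N)
```

Note that the log ratio simplifies to $(P_{f,t} - T_{f,t})/10$, so a bin whose power exceeds its threshold by 10 dB receives a weight of 1.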
Figure 2 shows the training signal’s power spectral density compared to the perceptual weight matrix for a short speech signal. Note that the weight matrix roughly follows the spectral density while suppressing weaker areas with near-zero weights.
The expected benefit of this weighting scheme is that the neural network prediction can enjoy a relaxed version of the error function that underweights the less audible output dimensions. As a result, the network can focus on the narrower set of perceptually important output dimensions, an easier optimization task that a smaller, compressed network can also solve with similar perceptual quality.
4 Experimental Setup
Data Preparation: The noisy dataset is constructed by mixing utterances from the TIMIT corpus with ten non-stationary noise types used in prior work. Speakers are randomly selected from the training set with equal gender probability, and each utterance is mixed with one of the ten noise types. The resulting noisy utterances are used for training. The same procedure is used to generate noisy utterances for validation and for the test set. Noise signals used for training do not overlap with those used for the test mixtures.
Sources are normalized to a fixed mixture SNR. The Short-Time Fourier Transform (STFT) with a Hann window and overlapping frames is used for all spectrogram computation. The complex spectrograms of the clean signal ($\mathbf{S}$) and the background noise ($\mathbf{N}$) are created for the training and validation sets, and the input mixture to be denoised is acquired by adding $\mathbf{S}$ and $\mathbf{N}$. For the larger networks (with 1024 or 2048 hidden units per layer), three consecutive spectra are concatenated and vectorized to form a longer input vector; vectorizing consecutive frames is common practice to provide contextual information to fully connected neural networks. For the smaller networks (with 128 or 512 hidden units per layer), the individual frames are the input. The mini-batch size is 256 throughout the training procedure.
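The feature preparation above, magnitude STFT followed by concatenation of consecutive frames, can be sketched as below. The window length and hop here are placeholders, not the paper's (unspecified) settings, and the function names are illustrative:

```python
import numpy as np

def magnitude_spectrogram(x, n_fft=1024, hop=512):
    """Magnitude STFT with a Hann window; returns a (n_fft//2+1, frames) array."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames, axis=1), axis=0))

def concatenate_frames(X, context=3):
    """Stack `context` consecutive spectra into one long column per frame,
    providing temporal context to a fully connected network."""
    D, N = X.shape
    return np.stack([X[:, t:t + context].reshape(-1, order="F")
                     for t in range(N - context + 1)], axis=1)
```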
The energy-based Ideal Ratio Mask (IRM) gives us a nonnegative real-valued masking matrix $\mathbf{M}$. The source spectrogram is recovered by multiplying the mask with the mixture spectrogram, i.e., $\hat{\mathbf{S}} = \mathbf{M} \odot \mathbf{X}$. We chose the IRM as our target signal, but the proposed perceptual weighting can be used for other targets, such as the source magnitude spectra, without loss of generality.
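One common definition of the energy-based IRM, consistent with the description above, is the fraction of each time-frequency bin's energy that belongs to speech; the sketch below uses that definition as an assumption, with illustrative names:

```python
import numpy as np

def ideal_ratio_mask(S_mag, N_mag):
    """Energy-based IRM: speech energy / total energy per bin, in [0, 1]."""
    eps = 1e-12                          # guard against division by zero
    return S_mag ** 2 / (S_mag ** 2 + N_mag ** 2 + eps)

def apply_mask(M, X_mag):
    """Recover the speech magnitude spectrogram by masking the mixture."""
    return M * X_mag
```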
Parameter Settings: As we seek a condensed network structure, we limit the maximum number of hidden layers, each of which can have 128, 512, 1024, or 2048 units, yielding the different network topologies assessed in this experiment. The weights per layer are initialized from a truncated normal distribution with a fixed standard deviation, divided by the square root of the size of the layer. Learning rates are chosen per model topology. For the dropout rate on the input layer, we keep most of the units active in most models, with a lower keep probability for the concatenated input layer. Dropout rates for the hidden layers are set according to layer width. Due to limited space, the paper only shows results from the best-fit parameter configuration found via validation. Each choice of model structure is trained with two different error functions: with and without the perceptual weights.
Model Training: The model is trained in TensorFlow using the Adam optimizer with a fixed number of epochs. The training time varies across the different network structures and parameter settings, but the proposed perceptual weighting does not notably reduce training time.
5 Experimental Results
We use the BSS_Eval toolbox to evaluate the objective sound quality, as well as the STOI and PEASS toolkits for the perceptual evaluation. These separation metrics are used to compare the denoised speech obtained from the proposed neural network using either the conventional MSE cost function (2) or the perceptually weighted MSE cost function (4).
5.1 Objective Quality Assessment
We use the BSS_Eval toolbox to objectively compare the denoising quality of the two groups of models. In particular, we consider three measures: SIR for the ratio of the source to the remaining interference, SAR for the amount of artifacts introduced during the separation process, and SDR to reflect the overall source separation performance. These measures are calculated for each reconstructed utterance and are presented in Fig. 3 as averages over the test speech signals, weighted by signal length.
The weighted models overlook noise below the global masking threshold, focusing instead on the audible noise that affects human speech perception. Because of this, the model does not objectively denoise the utterance as aggressively, resulting in a slightly lower SIR (Fig. 3 (b)) than the unweighted model. However, the perceptual models add barely any artifacts in comparison to the unweighted models (Fig. 3 (c)). Similarly, the output SDR of networks utilizing the PAM weights is still comparable to that of conventional denoising networks (Fig. 3 (a)). Irrespective of whether the cost function is weighted, there is a trend: wider and deeper neural networks yield better objective separation quality. This trend justifies the popularity of deep learning for speech denoising tasks. However, we argue that the perceptual quality varies with model complexity in a different pattern.
5.2 Perceptual Quality Assessment
PEASS measures the perceptual quality of the denoised speech signals. In the case of single-channel source separation there are three metrics: the Overall, Interference-related, and Artifact-related Perceptual Scores (OPS, IPS, and APS). These perceptual scores complement the objective measures SDR, SIR, and SAR, respectively. For all network topologies considered, the proposed perceptual weighting yields higher OPS (Fig. 3 (d)). While SDR is highly contingent on network complexity, OPS is well maintained even by shallow networks. As discussed in Section 5.1, PAM-weighted networks do not consider all frequency bins equally; therefore, the IPS is reasonably lower than that of non-PAM models (Fig. 3 (e)). However, artifacts are also dampened by the perceptual weighting, which leads to higher APS (Fig. 3 (f)). Overall, the PEASS evaluation suggests that the proposed perceptual weighting leads to compressed network structures that do not suffer from as much perceptual performance drop as the ones that minimize an unweighted MSE cost function.
We additionally verify the effect of the proposed weighted cost function on Short-Time Objective Intelligibility (STOI) scores. In Fig. 4, we see that the average STOI score of the weighted models is marginally higher, which reassures us of the stability of the perceptual weighting scheme. However, we remain conservative in asserting that the proposed method improves speech intelligibility, a claim which can be researched further by performing subjective evaluation of audio quality with frameworks such as MUSHRA.
6 Conclusion
In this paper, we proposed a psychoacoustically weighted cost function that leads to a more efficient network structure for speech denoising tasks. Such efficient networks have fewer parameters, so their implementations are hardware-friendly, especially in resource-constrained environments, while they maintain comparable perceptual quality. In the future, we plan to investigate the effect of the proposed perceptual weighting scheme on other kinds of network compression, such as quantization schemes for the parameters.
References
-  D. D. Lee and H. S. Seung, “Learning the parts of objects by non-negative matrix factorization,” Nature, vol. 401, pp. 788–791, 1999.
-  D. D. Lee and H. S. Seung, “Algorithms for non-negative matrix factorization,” in Advances in Neural Information Processing Systems (NIPS). 2001, vol. 13, MIT Press.
-  P. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, “Joint optimization of masks and deep recurrent neural networks for monaural source separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 12, pp. 2136–2147, Dec 2015.
-  Florian Mayer, Donald S Williamson, Pejman Mowlaee, and DeLiang Wang, “Impact of phase estimation on single-channel speech separation based on time-frequency masking,” The Journal of the Acoustical Society of America, vol. 141, no. 6, pp. 4668–4679, 2017.
-  Jonathan Le Roux, John R. Hershey, and Felix Weninger, “Deep NMF for speech separation,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Apr. 2015.
-  S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding,” in Proceedings of the International Conference on Learning Representations (ICLR), 2016.
-  K. Hwang and W. Sung, “Fixed-point feedforward deep neural network design using weights +1, 0, and -1,” in 2014 IEEE Workshop on Signal Processing Systems (SiPS), Oct 2014.
-  M. Kim and P. Smaragdis, “Bitwise neural networks,” in International Conference on Machine Learning (ICML) Workshop on Resource-Efficient Machine Learning, Jul 2015.
-  I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized neural networks,” in Advances in Neural Information Processing Systems, 2016, pp. 4107–4115.
-  M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “XNOR-Net: ImageNet classification using binary convolutional neural networks,” arXiv preprint arXiv:1603.05279, 2016.
-  Valentin Emiya, Emmanuel Vincent, Niklas Harlander, and Volker Hohmann, “Subjective and objective quality assessment of audio source separation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2046–2057, 2011.
-  E. Vincent, C. Fevotte, and R. Gribonval, “Performance measurement in blind audio source separation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006.
-  Marina Bosi and Richard E. Goldberg, Introduction to Digital Audio Coding and Standards, Kluwer Academic Publishers, Norwell, MA, USA, 2002.
-  Ted Painter and Andreas Spanias, “Perceptual coding of digital audio,” Proceedings of the IEEE, vol. 88, no. 4, pp. 451–515, 2000.
-  Qingju Liu, Wenwu Wang, Philip Jackson, and Yan Tang, “A perceptually-weighted deep neural network for monaural speech enhancement in various background noise conditions,” in Proceedings of the 2017 25th European Signal Processing Conference (EUSIPCO), 2017.
-  A. Narayanan and D. L. Wang, “Ideal ratio mask estimation using deep neural networks for robust speech recognition,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), May 2013, pp. 7092–7096.
-  Zhiyao Duan, Gautham J. Mysore, and Paris Smaragdis, “Online PLCA for real-time semi-supervised source separation,” LVA/ICA, vol. 7191, pp. 34–41, 2012.
-  Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al., “TensorFlow: A system for large-scale machine learning,” in OSDI, 2016, vol. 16, pp. 265–283.
-  Cees H. Taal, Richard C. Hendriks, Richard Heusdens, and Jesper Jensen, “A short-time objective intelligibility measure for time-frequency weighted noisy speech,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2010, pp. 4214–4217.
-  ITU-R Recommendation BS.1534-1, “Method for the subjective assessment of intermediate quality level of coding systems (MUSHRA),” International Telecommunication Union, Geneva, 2001.