Joint Estimation of Reverberation Time and Direct-to-Reverberation Ratio from Speech using Auditory-Inspired Features
Blind estimation of acoustic room parameters such as the reverberation time T60 and the direct-to-reverberation ratio (DRR) remains a challenging task, especially when the estimate must be obtained from reverberant speech signals. In this work, a novel approach is proposed for joint estimation of T60 and DRR from wideband speech in noisy conditions. 2D Gabor filters arranged in a filterbank are exploited for extracting features, which are then used as input to a multi-layer perceptron (MLP). The MLP output neurons correspond to specific pairs of (T60, DRR) estimates; the output is integrated over time, and a simple decision rule results in our estimate. The approach is applied to single-microphone fullband speech signals provided by the Acoustic Characterization of Environments (ACE) Challenge. Our approach outperforms the baseline systems with median errors of close-to-zero and -1.5 dB for the T60 and DRR estimates, respectively, while the calculation of estimates is 5.8 times faster compared to the baseline.
Feifei Xiong, Stefan Goetze, Bernd T. Meyer
Fraunhofer Institute for Digital Media Technology IDMT,
Project Group Hearing, Speech and Audio Technology (HSA), Oldenburg, Germany
Medizinische Physik, Carl von Ossietzky University Oldenburg, Germany
Cluster of Excellence Hearing4all, Carl von Ossietzky University Oldenburg, Germany
Index Terms— Reverberation time, direct-to-reverberation ratio, 2D Gabor features, multi-layer perceptron, ACE Challenge
1 Introduction
The acoustic characteristics of a room have been shown to be important for predicting speech quality and intelligibility, which is relevant to speech enhancement [1] as well as to automatic speech recognition (ASR) [2]. The reverberation time T60 and the direct-to-reverberation ratio (DRR) are two important acoustic parameters. Traditionally, T60 and DRR are obtained from a measured room impulse response (RIR) [3]. However, measuring the corresponding RIRs is impractical or even impossible in most applications. Consequently, the demand for blind T60 and DRR estimation directly from speech and audio signals is increasing.
A number of approaches for blind T60 estimation have been proposed earlier: Based on the spectral decay distribution of the reverberant signal, T60 is determined in [4] by estimating the decay rate in each frequency band. A noise-robust version is presented in [5]. In [6], blind T60 estimation is achieved by a statistical model of the sound decay characteristics of reverberant speech. Inspired by this, [7] uses a pre-selection mechanism to detect plausible decays and a subsequent application of a maximum-likelihood criterion to estimate T60 with a low computational complexity. Alternatively, motivated by the progress that has been achieved using artificial neural networks in machine learning tasks, [8] proposed a method to estimate T60 blindly from reverberant speech using trained neural networks, for which short-term root-mean-square values of speech signals were used as the network input. The approach in [8] was also extended in [9] to estimate various acoustic room parameters using the low-frequency envelope spectrum. Our work [10] proposed a multi-layer perceptron using spectro-temporal modulation features to estimate T60. A comparison of energies at high and low modulation frequencies, the so-called speech-to-reverberation modulation energy ratio (SRMR), which is highly correlated with T60 and DRR, is evaluated in [11].
The approaches mentioned so far use a single audio channel for obtaining the estimate; the majority of blind off-the-shelf DRR estimators, however, rely on multi-channel data. An approach to estimate DRR based on a binaural input signal, from which the direct component is eliminated by an equalization-cancellation operation, was proposed in [12]. Another method using an octagonal microphone array has been presented in [13], where a spatial coherence matrix for the mixture of a direct and diffuse sound field was employed to estimate DRR using a least-squares criterion. In [14], an analytical expression was derived for the relationship between the DRR and the binaural magnitude-squared coherence function. A null-steering beamformer is employed in [15] to estimate the DRR with a two-element microphone array.
Motivated by the fact that the amount of perceived reverberation depends on both T60 and DRR, we propose a novel approach to simultaneously and blindly estimate these parameters. In our previous work [10, 16], we found spectro-temporal modulation features obtained by a 2D Gabor filterbank to be strongly and non-linearly correlated with reverberation parameters. We refer to these features as auditory Gabor features, since the filters used for extraction resemble the spectro-temporal receptive fields in the auditory cortex of mammals [17], i.e., it is likely that our auditory system is explicitly tuned to such patterns. The Gabor features are used as input to an artificial neural network, i.e. a multi-layer perceptron (MLP), which is trained for blind estimation of the parameter pair (T60, DRR). The evaluation of performance focuses on the Acoustic Characterization of Environments (ACE) Challenge [18] evaluation test set in fullband mode with a single microphone.
The remainder of this paper is organized as follows: Section 2 introduces the blind estimator based on the 2D Gabor features and an MLP classifier. The detailed experimental procedure is described in Section 3 according to the ACE Challenge regulations. The results and discussion are presented in Section 4 for the proposed estimator with the ACE evaluation test set, and Section 5 concludes the paper.
2 Blind Estimator
An overview of the estimation process is presented in Figure 1: In a first step, reverberant signals are converted to spectro-temporal Gabor filterbank features [19, 20] to capture information relevant for room parameter estimation. An MLP is trained to map the input pattern to pairs of parameters (T60, DRR), where the label information is taken from the (T60, DRR) values of the available RIRs. Since the MLP generates one estimate per time step, we obtain an utterance-based estimate by simple temporal averaging and subsequent selection of the output neuron with the highest average activation (winner-takes-all), as illustrated in Figure 2. The noisy reverberant speech signal is constructed from clean (anechoic) speech convolved with a measured RIR, plus additive noise.
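The temporal averaging and winner-takes-all step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `utterance_estimate`, the three toy classes and the posterior values are all hypothetical.

```python
def utterance_estimate(frame_posteriors, class_pairs):
    """Average per-frame MLP outputs over time and pick the winning class.

    frame_posteriors: list of frames, each a list with one activation per class.
    class_pairs: list of (T60, DRR) labels, one per output neuron (hypothetical).
    """
    n_frames = len(frame_posteriors)
    n_classes = len(class_pairs)
    # Temporal averaging: mean activation of every output neuron.
    mean_activation = [
        sum(frame[c] for frame in frame_posteriors) / n_frames
        for c in range(n_classes)
    ]
    # Winner-takes-all: neuron with the highest average activation.
    winner = max(range(n_classes), key=mean_activation.__getitem__)
    return class_pairs[winner], mean_activation


# Toy example with three made-up classes (T60 in ms, DRR in dB).
pairs = [(300, 5), (500, 2), (700, -1)]
posteriors = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.6, 0.3],
]
estimate, mean_act = utterance_estimate(posteriors, pairs)
print(estimate)  # (500, 2)
```

Note that the first frame alone would favour the first class; averaging over the utterance is what lets the second class win.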
Gabor features are generated by applying 2D Gabor filters to log-mel-spectrograms. The filters are localized spectro-temporal patterns with a high sensitivity towards amplitude modulations. Each filter is the product of a sinusoidal carrier and a Hann envelope, both defined over the (mel-)spectral and temporal frame indices.
The Hann envelope is characterized by its window lengths and center indices along the spectral and temporal dimensions. The periodicity of the sinusoidal carrier is defined by its radian frequencies along the two dimensions, which allow the Gabor filters to be tuned to particular directions of spectro-temporal modulation. The purely diagonal Gabor filters shown in Figure 3 were found to result in the maximal sensitivity to the reverberation effect and are therefore used here to construct the Gabor features for the estimator. Each log-mel-spectrogram is filtered with the 48 filters of this filterbank, which cover temporal modulations from 2.4 to 25 Hz and spectral modulations from -0.25 to 0.25 cycles/channel, respectively.
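To make the carrier-times-envelope structure concrete, the following sketch builds one complex 2D Gabor filter. The exact parameterisation is an assumption (a complex exponential carrier and a separable Hann window centred on the filter); the names `hann`, `gabor_filter`, `omega_k`, `omega_n`, `width_k` and `width_n` are illustrative and not taken from the paper.

```python
import cmath
import math


def hann(x, center, width):
    """Hann envelope of the given width centred on `center` (zero outside)."""
    d = x - center
    if abs(d) >= width / 2:
        return 0.0
    return 0.5 + 0.5 * math.cos(2 * math.pi * d / width)


def gabor_filter(omega_k, omega_n, width_k, width_n):
    """Complex 2D Gabor filter g[k][n] = carrier(k, n) * envelope(k, n).

    omega_k, omega_n: radian frequencies along the spectral / temporal axis.
    width_k, width_n: Hann window lengths (assumed even) along each axis.
    """
    k0, n0 = width_k // 2, width_n // 2  # centre indices
    return [
        [
            cmath.exp(1j * (omega_k * (k - k0) + omega_n * (n - n0)))
            * hann(k, k0, width_k) * hann(n, n0, width_n)
            for n in range(width_n + 1)
        ]
        for k in range(width_k + 1)
    ]


# A diagonal filter: both modulation frequencies are non-zero.
g = gabor_filter(omega_k=math.pi / 4, omega_n=math.pi / 8, width_k=8, width_n=16)
print(len(g), len(g[0]))  # 9 17
```

Setting one of the two frequencies to zero would give a purely spectral or purely temporal filter; the diagonal case keeps both non-zero.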
3 Experimental Setup
3.1 ACE Challenge
The ACE Challenge provides a development (Dev) dataset for algorithm fine-tuning and an evaluation (Eval) dataset for the final algorithm test. The task is to blindly estimate two acoustic parameters, i.e. T60 and DRR, from noisy and reverberant speech. Two modes, i.e. fullband and subband (1/3-octave ISO [21], since T60 and DRR are both frequency-dependent parameters), and six microphone configurations, i.e. a single microphone (Single) and microphone arrays with two (Laptop), three (Mobile), five (Cruciform), eight (Linear), and thirty-two (Spherical) microphones, were introduced. The dataset was generated using anechoic speech convolved with RIRs measured in real rooms, with additive noise recorded under the same conditions. Three types of noise signals, i.e. ambient, babble and fan noise, were added to generate the noisy reverberant dataset. For the Dev dataset, the signal-to-noise ratios (SNRs) were chosen to be 0, 10 and 20 dB, while for Eval, the SNRs were -1, 12 and 18 dB. The Dev dataset comprises approximately 120 h of audio from all multi-microphone scenarios. Each test set from Eval contains 4500 utterances categorized by these 3 noise types and 3 SNRs. For this paper, we focus on fullband T60 and DRR estimation in the single-microphone scenario. Our approach is also applicable to multi-microphone scenarios by selecting any channel of the speech data.
The ground truth values of T60 and DRR were provided by the ACE Challenge. The ground truth T60 is based on the energy decay curve computed from the RIRs using the Schroeder integral [22], to which the method proposed in [23] is applied to estimate T60. This method is shown to be more reliable under all conditions than the standard method according to ISO3382 [24]. The ground truth DRR is estimated using the method of , where the direct path is determined by the 8 ms around the maximum found using an equalization filter [25].
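As a rough illustration of how such ground truth values can be derived from an RIR, the sketch below computes the Schroeder energy decay curve, a T60 from a plain least-squares line fit, and a DRR with an 8 ms direct-path window. Several points are assumptions rather than the challenge's procedure: the line fit over -5 to -25 dB stands in for the more robust fitting method cited above, the 8 ms window is taken as centred on the absolute peak, and all function names are hypothetical.

```python
import math


def schroeder_edc_db(rir):
    """Energy decay curve in dB via backward (Schroeder) integration."""
    edc, running = [0.0] * len(rir), 0.0
    for i in range(len(rir) - 1, -1, -1):
        running += rir[i] * rir[i]
        edc[i] = running
    total = edc[0]
    # Trailing zero-energy samples are dropped to avoid log10(0).
    return [10 * math.log10(e / total) for e in edc if e > 0]


def t60_from_edc(edc_db, fs, lo=-5.0, hi=-25.0):
    """Fit a line to the EDC between `lo` and `hi` dB and extrapolate to -60 dB."""
    pts = [(i / fs, v) for i, v in enumerate(edc_db) if hi <= v <= lo]
    mt = sum(t for t, _ in pts) / len(pts)
    mv = sum(v for _, v in pts) / len(pts)
    slope = (sum((t - mt) * (v - mv) for t, v in pts)
             / sum((t - mt) ** 2 for t, _ in pts))  # dB per second
    return -60.0 / slope


def drr_db(rir, fs, window_ms=8.0):
    """DRR with the direct path taken as `window_ms` centred on the absolute peak."""
    peak = max(range(len(rir)), key=lambda i: abs(rir[i]))
    half = int(round(window_ms / 2 * fs / 1000))
    start, stop = max(0, peak - half), min(len(rir), peak + half + 1)
    direct = sum(h * h for h in rir[start:stop])
    reverb = sum(h * h for h in rir) - direct
    return 10 * math.log10(direct / reverb)


# Synthetic exponentially decaying RIR with a known T60 of 0.5 s.
FS = 8000
T60_TRUE = 0.5
decay = 3 * math.log(10) / T60_TRUE  # amplitude decay giving -60 dB energy at T60
rir = [(-1) ** n * math.exp(-decay * n / FS) for n in range(FS)]
t60 = t60_from_edc(schroeder_edc_db(rir), FS)
drr = drr_db(rir, FS)
print(round(t60, 3))  # 0.5
```

A measured RIR has a noise floor that bends the decay curve, which is why a plain line fit is less reliable in practice than it is on this synthetic example.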
The MLP shown in Figure 1 was implemented with the open-source Kaldi ASR toolkit [26], compiled for use with an NVIDIA Tesla K20c GPU with 5 GB of memory. It has 3 layers: The number of neurons in the input layer is 600, i.e. the dimension of the 2D diagonal Gabor features (cf. Figure 3), which are calculated in Matlab. The temporal context considered by the MLP is limited to 1 frame, i.e. no splicing is applied. The number of hidden units is a free parameter that was optimized given the amount of training data and set to 8192 units, and the number of output neurons corresponds to the number of (T60, DRR) pairs, i.e. 100 as defined in the following (also cf. Figure 2).
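The forward pass of such a three-layer network can be sketched in a few lines. The activation functions are assumptions (sigmoid hidden units and a softmax output are common choices but are not stated in the text), the weights are random placeholders rather than trained values, and the hidden layer is reduced from 8192 to 32 units purely to keep the toy example fast.

```python
import math
import random


def mlp_forward(x, w1, b1, w2, b2):
    """One hidden layer (sigmoid, assumed) followed by a softmax output (assumed)."""
    hidden = [
        1.0 / (1.0 + math.exp(-(sum(w * v for w, v in zip(row, x)) + b)))
        for row, b in zip(w1, b1)
    ]
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(w2, b2)]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# 600 inputs and 100 outputs as in the text; 32 hidden units instead of 8192.
N_IN, N_HIDDEN, N_OUT = 600, 32, 100
rng = random.Random(0)
w1 = [[rng.gauss(0, 0.05) for _ in range(N_IN)] for _ in range(N_HIDDEN)]
b1 = [0.0] * N_HIDDEN
w2 = [[rng.gauss(0, 0.05) for _ in range(N_HIDDEN)] for _ in range(N_OUT)]
b2 = [0.0] * N_OUT
frame = [rng.random() for _ in range(N_IN)]  # stand-in for one Gabor feature vector
posterior = mlp_forward(frame, w1, b1, w2, b2)
print(len(posterior))  # 100
```

Each call yields one activation per (T60, DRR) class for a single frame, which is the quantity that is later averaged over the utterance.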
3.3 Speech Database
The ACE database was recorded by different individuals who were reading different text materials in English. Here, we used the TIMIT corpus [27] to generate the training data for the MLP, since TIMIT contains recordings of phonetically-balanced prompted English speech and a total of 6300 sentences (approximately 5.4 h). To avoid a strong mismatch between training and test data (which is likely to hurt MLP classification performance), we added the ACE Dev dataset to the training data. In order to match the size of the Dev dataset (approximately 120 h), thereby balancing the two sets, TIMIT utterances were convolved with the collected RIRs circularly, which resulted in approximately 117 h of TIMIT training data. The sampling rate of all signals is 16 kHz.
3.4 RIR Database
To cover a wide range of RIRs that occur in real-life scenarios, we use several open-source RIR databases such as MARDY [28], the AIR database [29], the REVERB Challenge [30] and SMARD [31]. Further, we also recorded several RIRs in two regular office rooms in our group. Figure 4 shows the distribution of (T60, DRR) values from the collected RIRs, as well as from the ACE Dev and Eval datasets. The (T60, DRR) ground truth values of the collected RIRs were calculated based on the methods described in Section 3.1. Due to the lack of the corresponding equalization filters for the source, the absolute peak position is considered as the maximum to determine the direct path for the DRR calculations.
An MLP has a limited number of output neurons, which limits the resolution for the target estimate. We chose a resolution based on the distribution of training RIRs, with the aim of obtaining a sufficient number of observations for each (T60, DRR) pair, which is 100 ms for T60 and 1 dB for DRR (cf. Figure 4, where one bounding box represents one class). The boundaries of T60 are 100 ms and 900 ms, with DRR ranging from -6 dB to 15 dB. With these boundaries and the chosen resolution, 76 classes are obtained for the collected RIRs (light blue boxes), and 51 classes are obtained from the ACE Dev dataset (light red boxes). These classes are partially overlapping (light yellow boxes), i.e. 27 classes are common to both sets, and result in a total of 100 classes.
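The mapping from continuous (T60, DRR) values to a finite set of classes can be sketched as below. The resolution and boundaries are those stated above; everything else is an assumption, in particular the placement of bin edges (half-open bins starting at the lower boundary), the clipping of out-of-range values to the edge bins, and the helper names.

```python
T60_MIN_MS, T60_MAX_MS, T60_STEP_MS = 100, 900, 100
DRR_MIN_DB, DRR_MAX_DB, DRR_STEP_DB = -6, 15, 1


def quantize(value, lo, hi, step):
    """Index of the half-open bin [lo + i*step, lo + (i+1)*step) holding `value`."""
    n_bins = int((hi - lo) / step)
    idx = int((value - lo) // step)
    return min(max(idx, 0), n_bins - 1)  # clip out-of-range values to the edge bins


def pair_to_class(t60_ms, drr_db):
    """Grid cell (T60 bin, DRR bin) for one RIR."""
    return (quantize(t60_ms, T60_MIN_MS, T60_MAX_MS, T60_STEP_MS),
            quantize(drr_db, DRR_MIN_DB, DRR_MAX_DB, DRR_STEP_DB))


def build_class_table(pairs):
    """Map each occupied grid cell to an output-neuron index."""
    occupied = sorted({pair_to_class(t, d) for t, d in pairs})
    return {cell: i for i, cell in enumerate(occupied)}


# Made-up RIR parameters: the first two fall into the same cell.
rir_params = [(320, 4.2), (380, 4.9), (750, 1.0), (1293, 4.96)]
table = build_class_table(rir_params)
print(len(table))  # 3
```

Only cells that actually contain training RIRs receive an output neuron, which is why the number of classes is smaller than the full grid.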
3.5 Noise Signals
The ACE noise signals were recorded in the same acoustic conditions as the RIR measurement, i.e., the noise captured by the microphone is reverberated. Hence, the noise signals combined with our extended RIRs should be reverberated as well. Since the original noise signals were not available in the context of the challenge, we created noise signals with similar characteristics as the original ambient, babble and fan noise.
Ambient noise was created by mixing recorded car noise and pink noise to obtain a colored noise with high energy in the low frequencies (as the original ambient noise).
To create babble noise, we mixed clean speech signals (two male, two female speakers) from the WSJCAM0 [32] corpus.
A fan noise was recorded in an almost anechoic chamber to obtain the last noise type.
Subsequently, the noise signals were added to the anechoic speech at SNRs of 0, 10 and 20 dB (mimicking the procedure for the ACE Dev dataset), which were then convolved with the collected RIRs.
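The order of operations described above (scale the noise to a target SNR, add it to the anechoic speech, then convolve the mixture with an RIR) can be sketched as follows. The signals are synthetic stand-ins, the function names are hypothetical, and the noise is assumed to be at least as long as the speech.

```python
import math


def scale_noise_to_snr(speech, noise, snr_db):
    """Scale `noise` so that the speech-to-noise power ratio equals `snr_db`."""
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(v * v for v in noise) / len(noise)
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [gain * v for v in noise]


def convolve(signal, rir):
    """Direct-form linear convolution (slow, but dependency-free)."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(rir):
            out[i + j] += s * h
    return out


def make_training_signal(speech, noise, rir, snr_db):
    """Add noise to anechoic speech at `snr_db`, then reverberate the mixture."""
    scaled = scale_noise_to_snr(speech, noise[:len(speech)], snr_db)
    mixture = [s + v for s, v in zip(speech, scaled)]
    return convolve(mixture, rir)


# Toy signals standing in for anechoic speech, a noise recording and an RIR.
speech = [math.sin(0.1 * n) for n in range(400)]
noise = [math.cos(1.7 * n) for n in range(400)]
rir = [1.0, 0.0, 0.5, 0.25]
y = make_training_signal(speech, noise, rir, snr_db=10)
print(len(y))  # 403
```

Because the noise is added before the convolution, it is reverberated together with the speech, matching the requirement that the noise captured by the microphone is reverberant.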
4 Results and Discussion
The estimation error is used for analysis and is defined as the difference between the estimated value and the ground truth value, i.e. in s for T60 and in dB for DRR. For comparison, the methods proposed in  and in [11] are employed as baselines to blindly estimate T60 and DRR, respectively. Note that the blind DRR estimator in [11] requires a mapping function between the overall SRMR from the 5th to the 8th channel and the DRR (both expressed in dB), which is obtained from the ACE Dev Single dataset.
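A small sketch of how the estimation error and its summary statistics can be computed is given below. The choice of the 25th and 75th percentiles as the lower and upper percentiles is an assumption, and the values are made up for illustration only.

```python
import statistics


def error_summary(estimates, ground_truth):
    """Signed errors (estimate minus ground truth) with median and quartiles."""
    errors = [e - g for e, g in zip(estimates, ground_truth)]
    lower, median, upper = statistics.quantiles(errors, n=4)
    return {"lower": lower, "median": median, "upper": upper}


# Hypothetical T60 values in ms for five utterances.
t60_est = [500, 700, 300, 900, 600]
t60_true = [400, 800, 300, 1200, 500]
summary = error_summary(t60_est, t60_true)
print(summary["median"])  # 0.0
```

A negative error means the parameter was underestimated, which is the sign convention used in the discussion of the results.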
As seen in Figure 5, the proposed method in general outperforms the baseline approaches. For T60, the baseline method works better in slightly noisy environments with an SNR of 18 dB, while its performance degrades at lower SNRs. The proposed method is more robust to additive noise, presumably because the statistical model is trained on noisy reverberant speech with various SNRs. The median values of the T60 error are close to 0 ms for all conditions (3 noise types and 3 SNRs), and the upper and lower percentiles are within 250 ms, which indicates that the proposed method is capable of providing accurate blind T60 estimation. In addition, far fewer outliers are obtained compared to the baseline method. The same trend can be observed for DRR, for which the baseline produces large errors for both median and percentiles, particularly in the low-SNR situations.
The DRR is underestimated by approximately 1.5 dB (median error of about -1.5 dB). This could be explained by the limited resolution of estimates (100 ms for T60, 1 dB for DRR) and the mismatch between the data range of the training data on the one hand and of the Eval dataset on the other: As shown in Figure 4, T60 values from 1100 ms to 1400 ms are not covered by the training data at all. A detailed post analysis showed that underestimates of T60 that arise from this mismatch go along with underestimates of DRR; for instance, a test sample with a ground truth of (1293 ms, 4.96 dB) was estimated to be (750 ms, 1 dB). It appears that the underestimated reverberation effect caused by an underestimate of T60 is somehow compensated by the corresponding underestimate of DRR. Further, the mismatches of the SNRs and the noise signals might also lead to estimation errors, and it seems that such mismatches affect the DRR estimate more strongly than the T60 estimate.
Additionally, the proposed estimator is tested with the ACE multi-microphone data, where only one channel (here the first channel, ch1) is selected to perform the same estimation process. The overall trend of the estimation results shown in Figure 6 is similar to the previous results, which serves as verification of our approach on a different (and larger) test set. Again, the median values of the T60 error are near 0 ms and the percentiles are within 250 ms, and the median values of the DRR error are between -1 dB and -2 dB with 2.5 dB percentiles. Consistent performance across noise types and SNRs indicates the importance of exploiting training data with a high amount of variability for a discriminative model in order to achieve robustness in adverse conditions.
The computational cost of our approach is quantified in terms of the real-time factor (RTF), defined as the ratio between the time taken to process a sentence and the length of the sentence. Two components in our approach contribute most to the overall complexity, i.e., the calculation of Gabor features and the forward-run of the neural net (cf. Figure 4). For optimization of the first component, the 2D convolution of spectrograms with Gabor filters was replaced by multiplication with a universal matrix. Since the proposed MLP estimator operates on a GPU (cf. Section 3.2), the computational complexity is measured in frames per second (FPS) with a frame length of 25 ms and a frame shift of 10 ms. A rough transfer from FPS to RTF can be computed by RTF ≈ 1 / (FPS × 10 ms). With an average GPU speed of 23736 FPS, an average RTF of 0.0042 is obtained. In summary, the average RTF of the proposed estimator for the single-microphone scenario (4500 utterances) is (providing both T60 and DRR), while the RTFs of the baseline T60 estimator  and DRR estimator [11] are 0.0483 and 0.3101, respectively.
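The reported GPU figures are consistent with reading the FPS-to-RTF conversion as one frame per 10 ms of audio, so that the RTF is the reciprocal of FPS times the frame shift. The sketch below states this reading explicitly; it is an interpretation that reproduces the quoted number, and the function name is hypothetical.

```python
def rtf_from_fps(fps, frame_shift_s=0.010):
    """Real-time factor for a processing speed given in frames per second.

    Assumes one frame per `frame_shift_s` seconds of audio, so one second of
    audio contains 1 / frame_shift_s frames.
    """
    return 1.0 / (fps * frame_shift_s)


print(round(rtf_from_fps(23736), 4))  # 0.0042
```

An RTF below 1 means the processing is faster than real time; at 0.0042, one second of speech takes roughly 4 ms for the MLP forward pass.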
5 Conclusion
This contribution presented a novel method for estimating T60 and DRR in a blind and joint way using an MLP for classification. It has been shown that the proposed method is capable of accurately estimating T60 and DRR in the context of the ACE Challenge using single-microphone, fullband speech signals. The estimation errors of T60 and DRR cover a relatively small range of 250 ms and 2.5 dB with corresponding median values of nearly 0 ms and -1.5 dB on average, respectively. Furthermore, compared to the baseline approaches that estimate only one of T60 or DRR at a time, the computational complexity of the proposed estimator is significantly lower, since the signal processing for feature extraction and the forward-run of the neural net are not very demanding in terms of computational cost, and since T60 and DRR are estimated simultaneously.
References
- [1] P. A. Naylor and N. D. Gaubitch, Eds., Speech Dereverberation. London: Springer, 2010.
- [2] A. Sehr, “Reverberation Modeling for Robust Distant-Talking Speech Recognition,” Ph.D. dissertation, Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany, Oct. 2009.
- [3] H. Kuttruff, Room Acoustics, 4th ed. London: Spon Press, 2000.
- [4] J. Y. C. Wen, E. A. P. Habets, and P. A. Naylor, “Blind Estimation of Reverberation Time based on the Distribution of Signal Decay Rates,” in IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), Apr. 2008, pp. 329–332.
- [5] J. Eaton, N. D. Gaubitch, and P. A. Naylor, “Noise-Robust Reverberation Time Estimation using Spectral Decay Distributions with Reduced Computational Cost,” in IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), Vancouver, Canada, May 2013, pp. 161–165.
- [6] R. Ratnam, D. L. Jones, B. C. Wheeler, J. W. D. O’Brien, C. R. Lansing, and A. S. Feng, “Blind Estimation of Reverberation Time,” J. Acoust. Soc. Amer., vol. 114, no. 5, pp. 2877–2892, 2003.
- [7] H. Löllmann, E. Yilmaz, M. Jeub, and P. Vary, “An Improved Algorithm for Blind Reverberation Time Estimation,” in Proc. Int. Workshop on Acoustic Echo and Noise Control (IWAENC), Tel Aviv, Israel, Aug. 2010.
- [8] T. J. Cox, F. Li, and P. Darlington, “Extracting Room Reverberation Time from Speech Using Artificial Neural Networks,” J. Acoust. Soc. Amer., vol. 94, no. 4, pp. 219–230, 2001.
- [9] P. Kendrick, T. J. Cox, F. F. Li, Y. Zhang, and J. A. Chambers, “Monaural Room Acoustic Parameters from Music and Speech,” J. Acoust. Soc. Amer., vol. 124, no. 1, pp. 278–287, 2008.
- [10] F. Xiong, S. Goetze, and B. T. Meyer, “Blind Estimation of Reverberation Time based on Spectro-Temporal Modulation Filtering,” in IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), Vancouver, Canada, May 2013, pp. 443–447.
- [11] T. H. Falk and W.-Y. Chan, “Temporal Dynamics for Blind Measurement of Room Acoustical Parameters,” IEEE Trans. Instrum. Meas., vol. 59, no. 4, pp. 978–989, Apr. 2010.
- [12] Y. C. Lu and M. Cooke, “Binaural Distance Perception based on Direct-to-Reverberant Energy Ratio,” in Proc. Int. Workshop on Acoustic Echo and Noise Control (IWAENC), 2008.
- [13] Y. Hioka, K. Niwa, S. Sakauchi, K. Furuya, and Y. Haneda, “Estimating Direct-to-Reverberant Energy Ratio using D/R Spatial Correlation Matrix Model,” IEEE Trans. Audio, Speech, Lang. Process., vol. 19, no. 8, pp. 2374–2384, 2011.
- [14] M. Kuster, “Estimating the Direct-to-Reverberant Energy Ratio from the Coherence between Coincident Pressure and Particle Velocity,” J. Acoust. Soc. Amer., vol. 130, no. 6, pp. 3781–3787, 2011.
- [15] J. Eaton, A. H. Moore, P. A. Naylor, and J. Skoglund, “Direct-to-Reverberant Ratio Estimation using a Null-Steered Beamformer,” in IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), Brisbane, Australia, Apr. 2015, pp. 46–50.
- [16] F. Xiong, S. Goetze, and B. T. Meyer, “Estimating Room Acoustic Parameters for Speech Recognizer Adaptation and Combination in Reverberant Environments,” in IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), Florence, Italy, May 2014.
- [17] A. Qiu, C. E. Schreiner, and M. A. Escabí, “Gabor Analysis of Auditory Midbrain Receptive Fields: Spectrotemporal and Binaural Composition,” Journal of Neurophysiology, vol. 90, no. 1, pp. 456–476, Jul. 2003.
- [18] J. Eaton, N. D. Gaubitch, A. H. Moore, and P. A. Naylor, “The ACE Challenge - Corpus Description and Performance Evaluation,” in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, New York, USA, Oct. 2015.
- [19] B. T. Meyer, S. V. Ravuri, M. R. Schädler, and N. Morgan, “Comparing Different Flavors of Spectro-Temporal Features for ASR,” in Proc. Interspeech, Florence, Italy, Aug. 2011, pp. 1269–1272.
- [20] M. R. Schädler, B. T. Meyer, and B. Kollmeier, “Spectro-Temporal Modulation Subspace-Spanning Filter Bank Features for Robust Automatic Speech Recognition,” J. Acoust. Soc. Amer., vol. 131, no. 5, pp. 4134–4151, 2012.
- [21] ANSI S1.1-1986 (ASA 65-1986), Specifications for Octave-Band and Fractional-Octave-Band Analog and Digital Filters, Std., 1993.
- [22] M. R. Schroeder, “New Method of Measuring Reverberation Time,” J. Acoust. Soc. Amer., vol. 37, no. 3, pp. 409–412, 1965.
- [23] M. Karjalainen, P. Antsalo, A. Mäkivirta, T. Peltonen, and V. Välimäki, “Estimation of Modal Decay Parameters from Noisy Response Measurements,” Journal Audio Eng. Soc., vol. 11, pp. 867–878, 2002.
- [24] ISO-3382, Acoustics-Measurement of the Reverberation Time of Rooms with Reference to Other Acoustical Parameters, Std., 2009.
- [25] S. Mosayyebpour, H. Sheikhzadeh, T. Gulliver, and M. Esmaeili, “Single-Microphone LP Residual Skewness-based Inverse Filtering of the Room Impulse Response,” IEEE Trans. Audio, Speech, Lang. Process., vol. 20, no. 5, pp. 1617–1632, 2012.
- [26] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlíček, Y. Qian, P. Schwarz, J. Silovský, G. Stemmer, and K. Veselý, “The Kaldi Speech Recognition Toolkit,” in IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Big Island, HI, USA, Jul. 2011.
- [27] J. Garofolo, L. Lamel, W. Fisher, J. Fiscus, D. Pallett, N. Dahlgren, and V. Zue, “TIMIT Acoustic-Phonetic Continuous Speech Corpus LDC93S1,” in Linguistic Data Consortium (LDC), 1994.
- [28] J. Wen, N. D. Gaubitch, E. Habets, T. Myatt, and P. A. Naylor, “Evaluation of Speech Dereverberation Algorithms using the (MARDY) Database,” in Proc. Intl. Workshop Acoust. Echo Noise Control (IWAENC), Paris, France, 2006.
- [29] M. Jeub, M. Schäfer, and P. Vary, “A Binaural Room Impulse Response Database for the Evaluation of Dereverberation Algorithms,” in Proc. of Int. Conf. on Digital Signal Processing, Santorini, Greece, July 2009, pp. 1–4.
- [30] K. Kinoshita, M. Delcroix, T. Yoshioka, T. Nakatani, E. Habets, R. Haeb-Umbach, V. Leutnant, A. Sehr, W. Kellermann, R. Maas, S. Gannot, and B. Raj, “The REVERB Challenge: A Common Evaluation Framework for Dereverberation and Recognition of Reverberant Speech,” in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, Oct. 2013.
- [31] J. K. Nielsen, J. R. Jensen, S. H. Jensen, and M. G. Christensen, “The Single- and Multichannel Audio Recordings Database (SMARD),” in Proc. Int. Workshop Acoustic Signal Enhancement (IWAENC), Antibes - Juan les Pins, France, Sep. 2014, pp. 40–44.
- [32] T. Robinson, J. Fransen, D. Pye, J. Foote, and S. Renals, “WSJCAM0: A British English Speech Corpus for Large Vocabulary Continuous Speech Recognition,” in IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), Detroit, Michigan, USA, May 1995, pp. 81–84.