# Improving Reverberant Speech Training Using Diffuse Acoustic Simulation

## Abstract

We present an efficient and realistic geometric acoustic simulation approach for generating and augmenting training data in speech-related machine learning tasks. Our physically-based acoustic simulation method is capable of modeling occlusion, specular and diffuse reflections of sound in complicated acoustic environments, whereas the classical image method can only model specular reflections in simple room settings. We show that by using our synthetic training data, the same neural networks gain significant performance improvement on real test sets in far-field speech recognition by 1.58% and keyword spotting by 21%, without fine-tuning using real impulse responses.

Zhenyu Tang, Lianwu Chen, Bo Wu, Dong Yu, Dinesh Manocha

University of Maryland; Tencent AI Lab

{zhy, dm}@cs.umd.edu, {lianwuchen, lambowu, dyu}@tencent.com

Keywords: reverberation, diffuse reflection, speech recognition, data augmentation, acoustic simulation

## 1 Introduction

Over the past few years, deep learning approaches have gained significant ground in the speech community, surpassing the performance of many classical machine learning models in a variety of related sub-fields. State-of-the-art deep neural networks (DNNs) are powerful tools for exploiting variable-length contextual information embedded in noisy speech sequences. ^{1}

**Main contribution** To overcome the limitations of existing simulation methods and better augment the training data, we propose an efficient and realistic geometric acoustic simulation approach that models occlusion, specular reflections, and diffuse reflections, in which sound energy is scattered in random directions instead of following an ideal specular path. We sample 5000 different acoustic room configurations and use our method to simulate far-field sound propagation in each room. The speech training data is generated by convolving over 1500 hours of clean speech utterances with randomly selected simulated room impulse responses (RIRs) and adding environmental noise. We train two different models independently, based on 1D/2D convolution + long short-term memory (LSTM) structures, for an automated speech recognition (ASR) task and a keyword spotting (KWS) task, and then evaluate them on different real-world data. Using our method, we observe an absolute accuracy improvement of 1.58% on the ASR task and a relative equal error rate reduction of 21% on the KWS task.

The rest of the paper is organized as follows. In Section 2 we explain our ray-tracing based geometric acoustic simulation algorithm. We describe several speech training benchmarks in Section 3 and present our results in Section 4.

## 2 Acoustic Simulation

### 2.1 Impulse Response Modeling

Acoustic simulation engines have been used in computer-aided design (CAD), theoretical research, the game industry, and many other fields. The goal of a simulation is usually to observe how the sound pressure at one position changes over time when a sound source is placed at another position in space. The impulse response (IR) is the most common way to describe sound propagation between two points in a fixed environment: it gives the response over time at the listener location to an impulse emitted by a point source at the source location. In practice, an IR can be measured by exciting an impulse using a shotgun as a sound source; the sound pressure is then recorded at the target receiver location. From a first-principles view, the propagation of sound waves follows the acoustic wave equation [12], which describes the sound pressure variation in both the spatial and temporal domains and is the foundation of wave-based solvers. There are several ways to implement wave-based solvers, including Finite Element Methods (FEM), Boundary Element Methods (BEM), finite-difference time domain (FDTD) approaches [13], and Adaptive Rectangular Decomposition (ARD) methods [14]. Wave-based techniques yield the most accurate results, but are only feasible for low frequencies and small scenes because their cost grows rapidly as the spatial and temporal resolution increases.

When the wavelength of the sound is smaller than the size of the obstacles in the environment, the sound wave can be treated in the form of a ray, which is the key idea of geometric methods. Typical geometric methods include the image method [15], path tracing methods [16, 17, 18, 19], and beam or frustum tracing methods [20, 21]. Our method is based on efficient Monte Carlo path tracing [22].

### 2.2 Sound Propagation

From the perspective of geometric methods, there are two types of reflections that can occur at a rigid surface: specular reflections and diffuse reflections. Specular reflections occur at mostly flat and uniform surfaces, where the outgoing angle of the sound ray equals the incident angle, as shown in Fig. 1(a); this is the law of reflection in geometric optics. However, real-world object surfaces usually do not completely satisfy the specular condition and scatter sound energy in all directions according to Lambert’s cosine law. These are called diffuse reflections and are illustrated in Fig. 1(b). IRs are constructed by accumulating sound energy from both specular and diffuse reflection paths with the correct time delay and energy decay, which can be calculated from the total length of the path. Conventionally, an IR is decomposed into three parts: direct response, early reflections, and late reverberation. The direct response is determined by the visibility between the source and listener. Early reflections are mostly due to specular reflections, whereas the late reverberation is caused by diffuse reflections. A typical IR energy distribution is shown in Fig. 2.
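As a rough illustration of how path lengths translate into an IR, the following sketch bins the energy of a set of reflection paths into a time histogram. It is not the simulator used in this paper: the function name `accumulate_energy_ir`, the speed of sound of 343 m/s, the inverse-square spreading term, and the per-bounce absorption model are all assumptions made for the example.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def accumulate_energy_ir(paths, sample_rate=16000, duration=0.5):
    """Bin the energy of reflection paths into a time histogram.

    paths: iterable of (total_length_m, absorptions), where absorptions is a
    list with one absorption coefficient in [0, 1] per surface bounce.
    Returns a list of energies, one per sample of the impulse response.
    """
    n_bins = int(round(duration * sample_rate))
    ir = [0.0] * n_bins
    for length, absorptions in paths:
        # Time delay follows directly from the total path length.
        delay = length / SPEED_OF_SOUND
        idx = int(round(delay * sample_rate))
        if idx >= n_bins:
            continue  # path arrives after the end of the simulated IR
        # Assumed decay model: spherical spreading over the path length ...
        energy = 1.0 / (4.0 * math.pi * length ** 2)
        # ... times the fraction of energy that survives each bounce.
        for a in absorptions:
            energy *= 1.0 - a
        ir[idx] += energy
    return ir


# A direct path and one reflection that is twice as long and loses half
# of its energy at the wall.
example_ir = accumulate_energy_ir([(3.43, []), (6.86, [0.5])],
                                  sample_rate=1000, duration=0.1)
```

In this toy case the reflection lands ten samples after the direct sound and carries one eighth of its energy: a factor of four from the doubled path length and a factor of two from absorption.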

### 2.3 Image Method

The image method is currently the most widely used method in the speech community for generating RIRs in various learning-based tasks [23]. It is based on the principle of specular reflections, where all reflection paths can be constructed by mirroring sound sources with respect to the reflecting plane, as shown in Fig. 3. A source will be mirrored multiple times depending on the desired order of reflections. Because it accounts only for specular paths, the image method fails to model the late reverberation part of an IR, which is caused by diffuse reflections. Computationally, for a scene with one source, $m$ reflective surfaces, and reflection order $d$, the number of candidate image sources, and hence the time complexity, grows as $O(m^d)$, which is prohibitive for simulations at high orders or scene complexities.
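The mirroring step can be sketched for the simplest case, an axis-aligned shoebox room with one corner at the origin. The helper names below are hypothetical, the speed of sound is an assumed constant, and `candidate_image_count` uses the standard upper bound for a general polyhedral scene (each image may be mirrored across every surface except the one that created it) only to show how quickly the count grows with the reflection order.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed


def first_order_images(source, room):
    """Mirror a source across the six walls of a shoebox room.

    source: (x, y, z) in metres; room: (length, width, height) in metres,
    with walls at 0 and at room[axis] along each axis.
    """
    images = []
    for axis in range(3):
        for wall in (0.0, room[axis]):
            image = list(source)
            # Reflecting a coordinate about the plane at `wall`.
            image[axis] = 2.0 * wall - source[axis]
            images.append(tuple(image))
    return images


def first_order_delays(source, listener, room):
    """Arrival time of each first-order specular reflection, in seconds."""
    return [math.dist(image, listener) / SPEED_OF_SOUND
            for image in first_order_images(source, room)]


def candidate_image_count(num_surfaces, max_order):
    """Upper bound on image sources to generate and verify up to max_order."""
    return sum(num_surfaces * (num_surfaces - 1) ** (k - 1)
               for k in range(1, max_order + 1))


images = first_order_images((1.0, 2.0, 1.5), (4.0, 5.0, 3.0))
counts = [candidate_image_count(6, order) for order in (1, 2, 3)]  # [6, 36, 186]
```

Even for six walls the candidate count rises from 6 to 186 within three orders, which is why high-order image-method simulations become expensive.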

### 2.4 Diffuse Acoustic Simulation

Diffuse reflections occur when sound energy is scattered into non-specular directions. Diffuse reflections are widely observed in the real world and have been shown to be important for modeling sound fields in room environments [24, 25, 26]. Diffuse acoustic simulations correctly model not only the specular but also the diffuse sound field.

We propose our geometric acoustic simulation (GAS) method for this purpose. In contrast to the image method, our method is based on stochastic path tracing, illustrated in Fig. 4: sound paths are randomly traced in all directions and each path follows either specular or diffuse reflections. We explicitly define the scattering coefficient between 0 and 1, which denotes the proportion of sound energy that is diffusely reflected at a surface (0 means perfectly specular and 1 means perfectly diffuse). Specifically, the sound energy reflected at a surface point $x$ to direction $\Omega_{\mathrm{out}}$ is computed by integrating the incoming energy over a hemisphere $S$ centered at $x$ on the surface:

$$E_{\mathrm{out}}(x, \Omega_{\mathrm{out}}) = \int_{S} E_{\mathrm{in}}(x, \Omega_{\mathrm{in}})\, p(\Omega_{\mathrm{in}}, \Omega_{\mathrm{out}}) \cos\theta \, d\Omega_{\mathrm{in}} \tag{1}$$

where $\theta$ is the incident angle, $\Omega_{\mathrm{in}}$ is the incoming direction, and $p(\Omega_{\mathrm{in}}, \Omega_{\mathrm{out}})$ is the probability distribution function that describes the probability of generating the sound path from $\Omega_{\mathrm{in}}$ to $\Omega_{\mathrm{out}}$, which is generic enough to include both specular and diffuse reflections. In practice, Eq. 1 is recursive and can only be solved numerically using Monte Carlo integration. The diffuse reflection paths are generated by tracing random rays from the source, the listener, or both [27]. A large number of ray samples is required for the solution to converge. The complexity of Monte Carlo path tracing is governed by the total number of rays traced to solve Eq. 1 and by the number of surfaces in the scene. One of its computational advantages over the image method is that most of the invalid paths that the image method generates, verifies, and rejects are never considered in path tracing, so the number of surfaces does not greatly impact the efficiency of path tracing. This allows us to compute both early reflections and late reverberation efficiently.
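One common way to realize a scattering coefficient inside a stochastic path tracer is to pick, at every bounce, a diffuse direction with probability equal to the coefficient and the mirror direction otherwise. The sketch below follows that scheme; it is an assumption about how such a tracer could be written, not the implementation behind the results in this paper, and all function names are illustrative.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return tuple(x / norm for x in v)


def reflect_specular(d, n):
    """Mirror the incoming direction d about the unit surface normal n."""
    k = 2.0 * dot(d, n)
    return tuple(di - k * ni for di, ni in zip(d, n))


def sample_diffuse(n, rng):
    """Draw an outgoing direction with cosine-weighted (Lambertian) density."""
    # A uniformly distributed unit vector, offset by the unit normal and
    # renormalized, is cosine-distributed over the hemisphere around n.
    while True:
        u = tuple(rng.gauss(0.0, 1.0) for _ in range(3))
        if dot(u, u) > 1e-12:
            break
    u = normalize(u)
    v = tuple(ni + ui for ni, ui in zip(n, u))
    if dot(v, v) < 1e-12:  # u pointed almost exactly against the normal
        return tuple(n)
    return normalize(v)


def scatter(d, n, scattering, rng):
    """Choose the next ray direction at a surface hit.

    scattering: coefficient in [0, 1]; 0 is perfectly specular,
    1 is perfectly diffuse. Returns (direction, kind).
    """
    if rng.random() < scattering:
        return sample_diffuse(n, rng), "diffuse"
    return reflect_specular(d, n), "specular"


rng = random.Random(0)
normal = (0.0, 0.0, 1.0)
incoming = normalize((1.0, 0.0, -1.0))
direction, kind = scatter(incoming, normal, 0.3, rng)
```

Averaged over many rays, a coefficient of 0.3 sends roughly 30% of the bounces into diffuse directions, which matches the definition of the coefficient as the diffusely reflected proportion of energy.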

In a far-field speech simulation setting, we define an acoustic room by its length, width, and height. Acoustic absorption and scattering coefficients can be defined for each surface element (triangular mesh); the scattering coefficient determines the relative strength of diffuse reflection. After specifying the sound source and receiver locations within the room, our simulation generates an RIR. Detailed configurations are in Section 3.1. One speech-related problem that has benefited from more accurate simulations is the direction-of-arrival estimation task [28]. We argue that using a more accurate geometric acoustic simulation that faithfully models the late reverberation for general speech-related training will lead to better performance in learning-based models.

## 3 Training with Acoustic Simulation

To evaluate our proposed approach, we conduct far-field automated speech recognition and keyword spotting experiments and then compare our approach with the popular image method. Both experiments are reverberant speech training tasks in which the test set is always real-world noisy reverberant speech recordings, but the training set can consist of clean speech or synthetic reverberant speech generated by either the image method or our geometric acoustic simulation.

### 3.1 Impulse Response Generation

We consider a 6-microphone circular array with a diameter of 7 cm. The speakers and the microphone array are randomly located in the room, at least 0.3 m away from the walls. Both the image method and the geometric sound simulation method were used to simulate impulse responses for 5000 randomly generated room configurations with sizes (length-width-height) ranging from 3m-3m-2.5m to 8m-10m-6m. The distance between the speaker and the microphones ranges from 0.5 m to 6 m. The reverberation time T60 is sampled in the range of 0.05 s to 0.5 s. In total, there are two IR sets, each with 5000 IRs, generated with the image method and the geometric sound simulation method, respectively. The IRs were used for data augmentation in the ASR and KWS tasks.
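The sampling procedure can be sketched as follows. The numeric ranges come from the description above; everything else is an assumption made for illustration: uniform sampling of each quantity, rejection sampling for the distance constraint, a horizontal array, measuring the speaker distance from the array center, and the function name `sample_room_config`.

```python
import math
import random

ROOM_MIN = (3.0, 3.0, 2.5)    # length, width, height in metres
ROOM_MAX = (8.0, 10.0, 6.0)
WALL_MARGIN = 0.3             # minimum distance to any wall, metres
ARRAY_RADIUS = 0.035          # 7 cm diameter circular array
NUM_MICS = 6


def sample_room_config(rng, max_tries=1000):
    """Draw one room, one speaker position, and one microphone array."""
    room = tuple(rng.uniform(lo, hi) for lo, hi in zip(ROOM_MIN, ROOM_MAX))
    t60 = rng.uniform(0.05, 0.5)
    # Pad the array center so that every microphone respects the wall margin.
    pad = WALL_MARGIN + ARRAY_RADIUS
    for _ in range(max_tries):
        center = tuple(rng.uniform(pad, dim - pad) for dim in room)
        speaker = tuple(rng.uniform(WALL_MARGIN, dim - WALL_MARGIN)
                        for dim in room)
        # Keep only placements whose speaker-to-array distance is in range.
        if 0.5 <= math.dist(center, speaker) <= 6.0:
            break
    else:
        raise RuntimeError("no valid speaker/array placement found")
    mics = [(center[0] + ARRAY_RADIUS * math.cos(2.0 * math.pi * k / NUM_MICS),
             center[1] + ARRAY_RADIUS * math.sin(2.0 * math.pi * k / NUM_MICS),
             center[2])
            for k in range(NUM_MICS)]
    return {"room": room, "t60": t60, "speaker": speaker,
            "array_center": center, "mics": mics}


rng = random.Random(0)
configs = [sample_room_config(rng) for _ in range(5)]
```

Each returned dictionary holds what a simulator would need as input for one RIR; how T60 is mapped to surface absorption coefficients is left out because the text does not specify it.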

### 3.2 Automated Speech Recognition

#### Data

The training corpus consists of two sets: (i) a clean corpus of 1.5 million clean speech utterances, about 1500 hours in total, and (ii) a noisy far-field training set simulated from the clean corpus by adding reverberation and mixing in various environmental noises with SNRs ranging from 0 to 24 dB. For each IR generation method, the corresponding noisy far-field training set was generated using the IRs described in Section 3.1, and the first channel of the simulated data was used as the input to the ASR system. The clean speech was first used to train the acoustic model, and then both the clean speech and the simulated noisy speech were used to fine-tune the model. Depending on which of the two IR simulation methods was used to generate the noisy training set, we obtained two acoustic models, one for the image method and one for the GAS method. The dataset sizes for clean, the image method, and the GAS method are the same. The testing corpus contains 2000 utterances of real far-field recordings from 48 speakers; each utterance is 5 seconds long on average and the whole set is about 3 hours. The data was recorded in 5 different rooms with sizes of about 4m-4m-3m. The distances between the microphones and the speaker are randomly set to 0.5 m, 1 m, 3 m, or 5 m, and the SNR ranges from 5 to 20 dB, with background noise from an air conditioner or a fan.
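The augmentation step described above (add reverberation, then mix in noise at a chosen SNR) can be sketched with plain Python lists. The function names are hypothetical, the uniform SNR draw and the truncation of the reverberant signal to the clean length are assumptions, and a real pipeline would use an FFT-based convolution rather than this direct loop.

```python
import math
import random


def convolve(signal, rir):
    """Direct-form linear convolution of a signal with an impulse response."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(rir):
            out[i + j] += s * h
    return out


def power(x):
    """Mean squared amplitude."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that speech power / noise power matches snr_db."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    gain = math.sqrt(power(speech) / (power(noise) * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]


def augment(clean, rir, noise, rng=None):
    """Reverberate a clean utterance and add noise at a random SNR."""
    rng = rng or random.Random()
    reverberant = convolve(clean, rir)[:len(clean)]  # keep original length
    snr_db = rng.uniform(0.0, 24.0)  # SNR range used for the training sets
    return mix_at_snr(reverberant, noise, snr_db)


rng = random.Random(0)
clean = [rng.gauss(0.0, 1.0) for _ in range(400)]
noise = [rng.gauss(0.0, 1.0) for _ in range(400)]
toy_rir = [1.0, 0.0, 0.0, 0.5, 0.0, 0.25]  # made-up IR, not a simulated one
noisy = augment(clean, toy_rir, noise, rng)
```

Only the choice of `rir` differs between the image-method and GAS training sets in this scheme; the clean speech, noise, and SNR range stay the same.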

#### Model Configuration

The framework of the ASR system is shown in Fig. 5 and consists of feature extraction, an acoustic model [29], and a decoder. 40-dimensional Mel filter bank features were computed with a 25-ms window length and a 10-ms hop size and combined with their first- and second-order differences to form a 120-dimensional vector. After normalization, the feature vector of the current frame is concatenated with those of the 5 preceding and 5 subsequent frames, resulting in an input vector of dimension 1320 = 120 × (5 + 1 + 5). The acoustic model contains two 2-dimensional convolutional layers, each with a kernel size of (3, 3) and a stride of (1, 1), followed by a maxpooling layer with a kernel size of (2, 2) and a stride of (2, 2), then five LSTM layers, each with 1024 hidden units and peepholes, and finally one fully connected layer plus a softmax layer. Batch normalization is applied after each CNN and LSTM layer to accelerate convergence and improve model generalization. We use context-dependent (CD) phonemes as the output units, which form 12000 classes in our Chinese ASR system. The Adam optimizer was adopted with an initial learning rate of 0.0001. A 5-gram language model with a size of 190 GB was used. The vocabulary size was 280K and the training corpus was collected from news, blogs, messages, encyclopedias, etc.
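The context-stacking arithmetic can be checked with a small sketch. The function name `stack_context` is illustrative, and repeating the first or last frame at the utterance boundaries is an assumed padding choice that the text does not specify.

```python
def stack_context(frames, left, right):
    """Concatenate every frame with its neighbouring frames.

    frames: list of per-frame feature vectors (lists of floats).
    left, right: number of preceding and subsequent frames to append.
    Boundary frames are padded by repeating the first or last frame.
    """
    n = len(frames)
    stacked = []
    for t in range(n):
        row = []
        for k in range(t - left, t + right + 1):
            # Clamp the index so edges reuse the nearest valid frame.
            row.extend(frames[min(max(k, 0), n - 1)])
        stacked.append(row)
    return stacked


# 120-dimensional frames: 40 filter bank values plus first- and
# second-order differences.
frames = [[float(t)] * 120 for t in range(20)]
asr_input = stack_context(frames, left=5, right=5)
assert len(asr_input[0]) == 120 * (5 + 1 + 5) == 1320
```

The same helper with `left=10, right=5` yields 120 × 16 = 1920 values per frame.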

### 3.3 Keyword Spotting

#### Data

The original training corpus contains 2500 hours of clean speech data, including 1250 hours of target keyword “Hi, Liu Bei” and 1250 hours of negative speech samples. The corresponding multi-channel reverberant data was simulated using each IR generation method. Noises with SNRs ranging from 0 to 24dB were also added into the augmented speech. The 2500 hours of simulated reverberant data are used for model training. The test corpus contains 8000 utterances with target keyword randomly selected from real user data from smart-speakers in a typical living room scenario, as well as 33 hours of negative samples from different categories, including music, TV noise, chatter, and other indoor noises. The 6-channel microphone signals were processed by an MVDR beamformer [30], and the output enhanced mono-channel signal was used for keyword spotting.

#### Model Configuration

The framework of the keyword spotting system, which is similar to [7], is shown in Fig. 6 and comprises feature extraction, a classification model, and a posterior handling module. The 40-dimensional Mel filter bank features were computed with a 25-ms window length and a 10-ms hop size, and then combined with the first- and second-order differences to form a 120-dimensional frame feature. The current frame feature was concatenated with those of the 10 preceding frames and 5 subsequent frames, resulting in an input vector of dimension 1920 = 40 × 3 × (10 + 1 + 5). The classification model contains one 1D CNN layer [31] with a kernel size of 4, followed by a maxpooling layer with a kernel size of 3. The output of the CNN is passed to two LSTM layers (256 hidden units each) and then to a softmax layer with 4 output classes (3 words + 1 garbage). Cross entropy is used for loss calculation. The outputs were then passed through a posterior handling module to obtain decisions. The final keyword score is defined as the largest product of the smoothed posteriors in an input sliding window, subject to the constraint that the individual words “fire” in the same order as specified in the keyword.
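A minimal sketch of such a posterior handling module is given below, under several assumptions: the garbage class has already been dropped so that each frame holds one posterior per keyword word, smoothing is a moving average over recent frames, and "in order" means each word must fire on a strictly later frame than the previous one. The function names and window lengths are hypothetical.

```python
def smooth(posteriors, w_smooth):
    """Moving average of each word posterior over the last w_smooth frames."""
    out = []
    for t in range(len(posteriors)):
        span = posteriors[max(0, t - w_smooth + 1):t + 1]
        out.append([sum(col) / len(span) for col in zip(*span)])
    return out


def ordered_score(window):
    """Largest product of word posteriors with words firing in order.

    window: list of frames, each a list of posteriors for words 0..K-1.
    best[k] holds the best product for words 0..k over the frames seen so far.
    """
    n_words = len(window[0])
    best = [0.0] * n_words
    for frame in window:
        # Go from the last word to the first so that best[k - 1] still
        # refers to earlier frames only (strict ordering in time).
        for k in reversed(range(n_words)):
            prev = 1.0 if k == 0 else best[k - 1]
            best[k] = max(best[k], prev * frame[k])
    return best[-1]


def keyword_score(posteriors, w_smooth=3, w_max=100):
    """Maximum ordered score over all sliding windows of up to w_max frames."""
    smoothed = smooth(posteriors, w_smooth)
    return max(ordered_score(smoothed[max(0, t - w_max + 1):t + 1])
               for t in range(len(smoothed)))


in_order = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.1, 0.1, 0.7]]
reversed_order = [[0.05, 0.05, 0.9], [0.1, 0.8, 0.1], [0.7, 0.1, 0.1]]
assert ordered_score(in_order) > ordered_score(reversed_order)
```

A detection would then be declared when `keyword_score` exceeds a threshold, and sweeping that threshold is what traces out the operating points from which an equal error rate is read.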

## 4 Results and Analysis

Table 1 shows the character accuracy of the ASR systems achieved with the clean acoustic model (Clean), the noisy acoustic model based on the image method (Noisy_IM), and the noisy acoustic model based on the geometric sound simulation method (Noisy_GAS). We collected 2K real-world test utterances that are corrupted by reverberation and noise to evaluate the IR methods. Compared with the “Clean” setup, the “Noisy_IM” setup improved the system performance significantly, from 31.18% to 59.96%, by adding simulated noisy training data. Our proposed approach outperformed the image method, increasing the accuracy from 59.96% to 61.54% (1.58% absolute), which illustrates the benefit of the proposed realistic geometric sound simulation approach.

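Table 1: Character accuracy of the ASR systems.

| Model | Character accuracy (%) |
| --- | --- |
| Clean | 31.178 |
| Noisy_IM | 59.961 |
| Noisy_GAS | 61.540 |

Table 2: Equal error rate of the keyword spotting systems.

| Model | EER (%) |
| --- | --- |
| Noisy_IM | 1.48 |
| Noisy_GAS | 1.17 |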

The equal error rates (EERs) of the keyword spotting systems are shown in Table 2. We achieve an EER of 1.17% when the augmented training data is generated using the geometric sound simulation method and 1.48% when it is generated using the image method. This translates to a 21% relative EER reduction. In these experiments, the input to the keyword spotting system is the enhanced speech from an MVDR beamformer. This indicates that the proposed IRs are robust to multichannel signal processing algorithms.
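The two headline numbers are computed differently: the ASR gain is an absolute difference in character accuracy, while the KWS gain is a reduction relative to the image-method EER. The short check below reproduces both from the table values; the helper names are illustrative.

```python
def absolute_gain(baseline, proposed):
    """Difference in percentage points (higher is better)."""
    return proposed - baseline


def relative_reduction(baseline, proposed):
    """Fractional reduction of an error metric (lower is better)."""
    return (baseline - proposed) / baseline


asr_gain = absolute_gain(59.961, 61.540)        # character accuracy, %
kws_reduction = relative_reduction(1.48, 1.17)  # equal error rate, %

print(f"ASR absolute accuracy gain: {asr_gain:.2f} points")   # 1.58 points
print(f"KWS relative EER reduction: {kws_reduction:.1%}")     # 20.9%
```

The relative EER reduction of 20.9% is what the text rounds to 21%.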

In both experiments, we carefully controlled the training and evaluation conditions so that the only difference is the RIR simulation method. Because our simulation faithfully models diffuse sound reflections, the domain gap between synthetic training data and real data is further reduced, and we therefore observe significant accuracy gains.

## 5 Discussion and Future Work

In this paper, we described a geometric acoustic simulation method that simulates both the specular and the diffuse soundfields for reverberant speech training. On the speech recognition and keyword spotting tasks, we showed that the proposed approach outperformed the popular image method, where the gain is mostly attributable to the more realistic simulation of reverberation and diffuse reflections. One limitation of this work is that neither method can model low-frequency or diffraction phenomena. A partial solution would be to compensate RIRs at low-frequency bands [32].

Although we demonstrated the efficacy of the proposed approach mainly on speech recognition and keyword spotting tasks, we believe a similar performance improvement can be achieved on tasks such as source localization [33], speech separation, and the cocktail party problem [1, 2], all of which can benefit from data-driven techniques and are future research directions. The proposed approach is thus of wide interest, especially because it can significantly reduce the effort of collecting training data under real-usage scenarios.

### Footnotes

- This work is supported in part by ARO grant W911NF-18-1-0313, NSF grant #1910940, Tencent, Adobe, Facebook and Intel. The authors thank Jie Chen and Dan Su from Tencent for their help with the ASR and KWS systems. Project website https://gamma.umd.edu/pro/speech/asr

### References

1. J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, “Deep clustering: Discriminative embeddings for segmentation and separation,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 31–35.
2. D. Yu, M. Kolbæk, Z.-H. Tan, and J. Jensen, “Permutation invariant training of deep models for speaker-independent multi-talker speech separation,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 241–245.
3. F. Seide, G. Li, and D. Yu, “Conversational speech transcription using context-dependent deep neural networks,” in Twelfth annual conference of the international speech communication association, 2011.
4. G. E. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition,” IEEE Transactions on audio, speech, and language processing, vol. 20, no. 1, pp. 30–42, 2011.
5. W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, and G. Zweig, “The microsoft 2016 conversational speech recognition system,” in IEEE International Conference on Acoustics, 2017.
6. D. Yu and J. Li, “Recent progresses in deep learning based acoustic models,” IEEE/CAA Journal of automatica sinica, vol. 4, no. 3, pp. 396–409, 2017.
7. G. Chen, C. Parada, and G. Heigold, “Small-footprint keyword spotting using deep neural networks,” 05 2014, pp. 4087–4091.
8. R. Prabhavalkar, R. Alvarez, C. Parada, P. Nakkiran, and T. N. Sainath, “Automatic gain control and multi-style training for robust small-footprint keyword spotting with deep neural networks,” 2015.
9. M. L. Seltzer, Y. Dong, and Y. Wang, “An investigation of deep neural networks for noise robust speech recognition,” in IEEE International Conference on Acoustics, 2013.
10. C. Kim, A. Misra, K. Chin, T. Hughes, A. Narayanan, T. N. Sainath, and M. Bacchiani, “Generation of large-scale simulated utterances in virtual rooms to train deep-neural networks for far-field speech recognition in google home,” in Interspeech, 2017.
11. M. Doulaty, R. Rose, and O. Siohan, “Automatic optimization of data perturbation distributions for multi-style training in speech recognition,” in Spoken Language Technology Workshop, 2017.
12. R. P. Feynman, R. B. Leighton, and M. Sands, The Feynman lectures on physics, Vol. I: The new millennium edition: mainly mechanics, radiation, and heat. Basic books, 2011, vol. 1.
13. S. Sakamoto, A. Ushiyama, and H. Nagatomo, “Numerical analysis of sound propagation in rooms using the finite difference time domain method,” The Journal of the Acoustical Society of America, vol. 120, no. 5, pp. 3008–3008, 2006.
14. N. Raghuvanshi, R. Narain, and M. C. Lin, “Efficient and accurate sound propagation using adaptive rectangular decomposition,” Visualization and Computer Graphics, IEEE Transactions on, vol. 15, no. 5, pp. 789–801, 2009.
15. J. B. Allen and D. A. Berkley, “Image method for efficiently simulating small-room acoustics,” The Journal of the Acoustical Society of America, vol. 65, no. 4, pp. 943–950, 1979.
16. M. T. Taylor, A. Chandak, L. Antani, and D. Manocha, “Resound: interactive sound rendering for dynamic virtual environments,” in Proceedings of the 17th ACM international conference on Multimedia. ACM, 2009, pp. 271–280.
17. M. Taylor, A. Chandak, Q. Mo, C. Lauterbach, C. Schissler, and D. Manocha, “Guided multiview ray tracing for fast auralization,” IEEE Transactions on Visualization and Computer Graphics, vol. 18, no. 11, pp. 1797–1810, 2012.
18. C. Schissler and D. Manocha, “Interactive sound propagation and rendering for large multi-source scenes,” ACM Transactions on Graphics (TOG), vol. 36, no. 1, p. 2, 2016.
19. C. Schissler and D. Manocha, “Interactive sound rendering on mobile devices using ray-parameterized reverberation filters,” arXiv preprint arXiv:1803.00430, 2018.
20. T. Funkhouser, I. Carlbom, G. Elko, G. Pingali, M. Sondhi, and J. West, “A beam tracing approach to acoustic modeling for interactive virtual environments,” in Proceedings of the 25th annual conference on Computer graphics and interactive techniques. ACM, 1998, pp. 21–32.
21. A. Chandak, C. Lauterbach, M. Taylor, Z. Ren, and D. Manocha, “Ad-frustum: Adaptive frustum tracing for interactive sound propagation,” IEEE Transactions on Visualization and Computer Graphics, vol. 14, no. 6, pp. 1707–1722, 2008.
22. J. T. Kajiya, “The rendering equation,” in ACM SIGGRAPH computer graphics, vol. 20, no. 4. ACM, 1986, pp. 143–150.
23. T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur, “A study on data augmentation of reverberant speech for robust speech recognition,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 5220–5224.
24. M. Hodgson, “Evidence of diffuse surface reflections in rooms,” The Journal of the Acoustical Society of America, vol. 89, no. 2, pp. 765–771, 1991.
25. B.-I. Dalenbäck, M. Kleiner, and P. Svensson, “A macroscopic view of diffuse reflection,” Journal of the Audio Engineering Society, vol. 42, no. 10, pp. 793–807, 1994.
26. Z. Tang, N. J. Bryan, D. Li, T. R. Langlois, and D. Manocha, “Scene-aware audio rendering via deep acoustic analysis,” arXiv preprint arXiv:1911.06245, 2019.
27. C. Cao, Z. Ren, C. Schissler, D. Manocha, and K. Zhou, “Interactive sound propagation with bidirectional path tracing,” ACM Transactions on Graphics (TOG), vol. 35, no. 6, p. 180, 2016.
28. Z. Tang, J. Kanu, K. Hogan, and D. Manocha, “Regression and classification for direction-of-arrival estimation with convolutional recurrent neural networks,” in Interspeech, 2019.
29. T. N. Sainath, O. Vinyals, A. Senior, and H. Sak, “Convolutional, long short-term memory, fully connected deep neural networks,” in IEEE International Conference on Acoustics, 2015.
30. O. Hoshuyama and A. Sugiyama, “Robust adaptive beamforming,” IEEE Transactions on Acoustics Speech & Signal Processing, vol. 35, no. 10, pp. 1365–1376, 2008.
31. O. Abdel-Hamid, A. R. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, “Convolutional neural networks for speech recognition,” IEEE/ACM Transactions on Audio Speech & Language Processing, vol. 22, no. 10, pp. 1533–1545, 2014.
32. Z. Tang, H.-Y. Meng, and D. Manocha, “Low-frequency compensated synthetic impulse responses for improved far-field speech recognition,” arXiv preprint arXiv:1910.10815, 2019.
33. R. Takeda and K. Komatani, “Sound source localization based on deep neural networks with directional activate function exploiting phase information,” in 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2016, pp. 405–409.