Improving Reverberant Speech Training Using Diffuse Acoustic Simulation

Abstract

We present an efficient and realistic geometric acoustic simulation approach for generating and augmenting training data in speech-related machine learning tasks. Our physically-based acoustic simulation method is capable of modeling occlusion and both specular and diffuse reflections of sound in complicated acoustic environments, whereas the classical image method can only model specular reflections in simple room settings. We show that, using our synthetic training data, the same neural networks achieve significant performance improvements on real test sets: a 1.58% absolute accuracy gain in far-field speech recognition and a 21% relative equal-error-rate reduction in keyword spotting, without fine-tuning on real impulse responses.

Zhenyu Tang, Lianwu Chen, Bo Wu, Dong Yu, Dinesh Manocha
University of Maryland; Tencent AI Lab
{zhy, dm}@cs.umd.edu, {lianwuchen, lambowu, dyu}@tencent.com

Keywords: reverberation, diffuse reflection, speech recognition, data augmentation, acoustic simulation

1 Introduction

Over the past few years, deep learning approaches have gained significant ground in the speech community, surpassing the performance of many classical machine learning models in a variety of related sub-fields. State-of-the-art deep neural networks (DNNs) are powerful tools for exploiting variable-length contextual information embedded in noisy speech sequences. Well-known applications of DNN techniques in speech include Microsoft Cortana®, Apple Siri®, Google Now®, and Amazon Alexa®. These applications usually integrate several fundamental speech tasks such as speech enhancement and separation [1, 2], automated speech recognition (ASR) [3, 4, 5, 6], and keyword spotting (KWS) [7, 8]. Another important factor behind the success of DNNs in these tasks is the huge amount of annotated speech corpora made available by research groups and large companies. Deep learning theory indicates that having more training examples is crucial to reducing the generalization error of trained models in real test cases [9].

However, the majority of popular speech corpora were recorded under relatively ideal conditions, i.e., anechoic speech with negligible noise and environmental reverberation. When training models for real-world applications, it is common to distort the clean speech by adding noise and reverberation as a pre-processing step to augment the training data [10, 11]. Reverberation is a characteristic effect of a particular acoustic environment and can be described by impulse responses (IRs) or frequency responses. In practice, both recorded IRs and synthetic IRs have been convolved with clean speech, and significant improvements in model accuracy have been observed from this type of data augmentation. Nevertheless, a performance gap remains when the application is deployed under conditions that do not match the training conditions: IRs pre-recorded in a limited number of environments may not generalize well to the countless acoustic conditions encountered in the real world. Shrinking this gap by collecting more real-world IRs is inefficient, because recording IRs is not a trivial task and requires professional equipment and trained personnel.

An alternative and cost-effective approach is to simulate room impulse responses (RIRs) using acoustic simulators. A simple RIR simulator takes in the room geometry, the source and listener positions, and the surface absorption/reflection properties, and generates an RIR for each source-listener pair. One classical approach is the image method (IM), which models specular sound reflections in rectangular rooms and has been shown to work well in some tasks. However, a notable drawback of this method is its over-simplification of room acoustics: it ignores diffuse reflections, which are very common in real-world environments, and it does not handle occlusion. These limitations make the image method less realistic for data augmentation, especially in applications where late reverberation plays a significant role.

Main contribution: To overcome the limitations of existing simulation methods and better augment the training data, we propose an efficient and realistic geometric acoustic simulation approach that models occlusion, specular reflections, and diffuse reflections, in which sound energy can be scattered in random directions rather than following an ideal specular path. We sample 5000 different acoustic room configurations and use our method to simulate far-field sound propagation in each room. The speech training data is generated by randomly convolving over 1500 hours of clean speech utterances with the simulated RIRs and adding environmental noise. We independently train two different models based on 1D/2D convolution + long short-term memory (LSTM) structures for an ASR task and a KWS task, and then evaluate them on real-world data. Our method improves accuracy by 1.58% absolute on ASR and reduces the equal error rate by 21% relative on KWS.
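As a concrete illustration of this augmentation step (a minimal sketch only; the function and variable names below are hypothetical and not part of our actual pipeline), a reverberant-noisy training utterance can be produced by convolving clean speech with a simulated RIR and mixing in noise at a randomly drawn SNR:

import numpy as np
from scipy.signal import fftconvolve

def augment_utterance(clean, rir, noise, snr_db):
    """Convolve clean speech with an RIR and add noise at a target SNR.
    `clean`, `rir`, and `noise` are 1-D float arrays at the same sample rate."""
    reverberant = fftconvolve(clean, rir, mode="full")[: len(clean)]
    noise = np.resize(noise, reverberant.shape)            # loop/trim noise to length
    speech_power = np.mean(reverberant ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return reverberant + gain * noise

# Example: draw a random SNR in [0, 24] dB, as used for our training sets.
# noisy = augment_utterance(clean, rir, noise, np.random.uniform(0.0, 24.0))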

The rest of the paper is organized as follows. In Section 2 we explain our ray-tracing based geometric acoustic simulation algorithm. We describe several speech training benchmarks in Section 3 and present our results in Section 4.

2 Acoustic Simulation

2.1 Impulse Response Modeling

Acoustic simulation engines have been used in computer-aided design (CAD), theoretical research, the game industry, and many other fields. The simulation goal is usually to observe how the sound pressure at one position changes over time when a sound source is placed at some other position in space. The IR is the most common way to describe sound propagation between two points in a fixed environment, so we use $h(t; \mathbf{x}_s, \mathbf{x}_l)$ to denote the IR at time $t$ from a point source at location $\mathbf{x}_s$ to a listener at location $\mathbf{x}_l$. In practice, an IR can be measured by producing an impulsive excitation (e.g., a gunshot) at the source position and recording the resulting sound pressure at the target receiver location. From a first-principles view, the propagation of sound waves follows the acoustic wave equation [12], which describes the sound pressure variation in both the spatial and temporal domains and is the foundation of wave-based solvers. There are several ways to implement wave-based solvers, including finite element methods (FEM), boundary element methods (BEM), finite-difference time-domain (FDTD) approaches [13], and adaptive rectangular decomposition (ARD) methods [14]. Wave-based techniques yield the most accurate results, but are only feasible for low frequencies and small scenes because they do not scale well with spatial and temporal resolution.
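For reference, the lossless, source-free acoustic wave equation mentioned above can be written for the sound pressure field $p(\mathbf{x}, t)$ and speed of sound $c$ as

\[
\frac{\partial^2 p(\mathbf{x}, t)}{\partial t^2} = c^2 \nabla^2 p(\mathbf{x}, t),
\]

which wave-based solvers discretize in both space and time.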

When the wavelength of the sound is smaller than the size of the obstacles in the environment, the sound wave can be treated in the form of a ray, which is the key idea of geometric methods. Typical geometric methods include the image method [15], path tracing methods [16, 17, 18, 19], and beam or frustum tracing methods [20, 21]. Our method is based on efficient Monte Carlo path tracing [22].

Figure 1: Two types of sound reflection at a surface: (a) specular reflections and (b) diffuse reflections. Both phenomena are frequency dependent.

2.2 Sound Propagation

From the perspective of geometric methods, two types of reflections can occur at a rigid surface: specular reflections and diffuse reflections. Specular reflections occur at mostly flat and uniform surfaces, where the outgoing angle of the sound ray equals the incident angle, as shown in Fig. 1(a), following the law of reflection from geometric optics. However, real-world surfaces usually do not completely satisfy the specular condition and instead scatter sound energy in all directions according to Lambert's cosine law, which is called diffuse reflection, as illustrated in Fig. 1(b). IRs are constructed by accumulating sound energy from both specular and diffuse reflection paths with the correct time delay and energy decay, both of which can be calculated from the total length of the path. Conventionally, an IR is decomposed into three parts: the direct response, early reflections, and late reverberation. The direct response is determined by the visibility between the source and listener. Early reflections are mostly due to specular reflections, whereas late reverberation is caused by diffuse reflections. A typical IR energy distribution is shown in Fig. 2.

Figure 2: Energy distribution of an impulse response in time. Our goal is to accurately model the late reverberation effects in simulated IRs.
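To make the IR construction described above concrete, the following minimal sketch (hypothetical names; a single frequency band; speed of sound 343 m/s) accumulates traced path contributions into a discrete energy IR with the correct time delays:

import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def accumulate_energy_ir(paths, fs=16000, ir_length_s=0.5):
    """Build a discrete energy impulse response from traced sound paths.
    `paths` is an iterable of (path_length_m, energy) pairs gathered from
    both specular and diffuse reflection paths."""
    ir = np.zeros(int(ir_length_s * fs))
    for path_length, energy in paths:
        delay = path_length / SPEED_OF_SOUND      # propagation delay in seconds
        idx = int(round(delay * fs))              # arrival sample index
        if idx < len(ir):
            ir[idx] += energy                     # accumulate energy at the correct delay
    return ir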

2.3 Image Method

The image method is currently the most widely used method in the speech community for generating RIRs for various learning-based tasks [23]. It is based on the principle of specular reflection, where all reflection paths are constructed by mirroring sound sources with respect to the reflecting planes, as shown in Fig. 3. A source is mirrored multiple times depending on the desired reflection order. Because only specular paths are modeled, the image method fails to capture the late reverberation part of an IR. Computationally, for a scene with one source, $N$ reflective surfaces, and reflection order $K$, the time complexity is $O(N^K)$, which is prohibitive for high reflection orders or complex scenes.

Figure 3: Construction and validation of image paths. The source $S$ is mirrored into 5 image sources, marked $S_1$ through $S_5$, by 5 planes. A sound path is connected from each image source to the listener $L$. A path validation step then checks whether each image path intersects the plane that generated its image source. The path from $S_1$ to $L$ does not intersect plane 1; it is therefore infeasible and rejected. The other 4 image paths are valid and can be used to compute the IR analytically.
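For reference, a minimal sketch of the image method for a shoebox (rectangular) room is given below. It assumes a single frequency-independent reflection coefficient shared by all walls and omits the path-validation step of Fig. 3, since every image path is valid in a convex rectangular room; it is an illustration of the algorithm and its cost, not the exact implementation we compare against.

import itertools
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def image_method_ir(room, src, mic, beta=0.9, order=10, fs=16000, ir_len_s=0.5):
    """Simplified shoebox image method (specular reflections only).
    `room`, `src`, `mic` are 3-element sequences in meters; `beta` is a single
    pressure reflection coefficient applied to every wall."""
    ir = np.zeros(int(ir_len_s * fs))
    mic = np.asarray(mic, dtype=float)
    rng = range(-order, order + 1)
    for nx, ny, nz in itertools.product(rng, rng, rng):
        for px, py, pz in itertools.product((0, 1), repeat=3):
            # image source obtained by repeated mirroring across the room walls
            img = np.array([2 * nx * room[0] + (1 - 2 * px) * src[0],
                            2 * ny * room[1] + (1 - 2 * py) * src[1],
                            2 * nz * room[2] + (1 - 2 * pz) * src[2]])
            refl = abs(2 * nx - px) + abs(2 * ny - py) + abs(2 * nz - pz)  # wall hits
            dist = np.linalg.norm(img - mic)
            idx = int(round(dist / SPEED_OF_SOUND * fs))
            if idx < len(ir):
                ir[idx] += (beta ** refl) / max(dist, 1e-3)  # 1/r spreading + wall loss
    return ir

Even in this shoebox special case, the number of image sources grows cubically with the reflection order; for general scenes with $N$ planes the growth is exponential ($O(N^K)$), which is the computational limitation noted above.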

2.4 Diffuse Acoustic Simulation

Diffuse reflections occur when sound energy is scattered into non-specular directions. They are widely observed in the real world and have been shown to be important for modeling sound fields in room environments [24, 25, 26]. Diffuse acoustic simulation correctly models not only the specular but also the diffuse sound field.

We propose our geometric acoustic simulation (GAS) method for this purpose. In contrast to the image method, our method is based on stochastic path tracing, illustrated in Fig. 4: sound paths are traced in random directions, and each path undergoes either specular or diffuse reflections. We explicitly define a scattering coefficient $s \in [0, 1]$, which denotes the proportion of sound energy that is diffusely reflected at a surface (0 means perfectly specular and 1 means perfectly diffuse). Specifically, the sound energy leaving a surface point $\mathbf{x}'$ in direction $\omega_o$ is computed by integrating the incoming energy over the hemisphere $\Omega$ centered at $\mathbf{x}'$ on the surface:

\[
L_o(\mathbf{x}', \omega_o) = \int_{\Omega} f(\mathbf{x}', \omega_i, \omega_o)\, L_i(\mathbf{x}', \omega_i)\, \cos\theta_i \, d\omega_i, \tag{1}
\]

where $\theta_i$ is the incident angle, $\omega_i$ is the incoming direction, and $f(\mathbf{x}', \omega_i, \omega_o)$ is the probability distribution function describing the probability of a sound path being reflected from $\omega_i$ to $\omega_o$, which is generic enough to include both specular and diffuse reflections. In practice, Eq. 1 is recursive and can only be solved numerically using Monte Carlo integration. The diffuse reflection paths are generated by tracing random rays from the source, the listener, or both [27]. A large number of ray samples is required for the solution to converge. The complexity of Monte Carlo path tracing is $O(M \log N)$, where $M$ is the total number of rays traced to solve Eq. 1 and $N$ is the number of surfaces in the scene. One of its computational advantages over the image method is that most of the invalid paths that are generated, verified, and rejected in the image method are never considered in path tracing, so the number of surfaces does not greatly impact its efficiency. This allows us to compute both early reflections and late reverberation efficiently.
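A minimal sketch of the per-bounce reflection step used in such a stochastic path tracer is shown below (hypothetical names and assumptions: the surface's scattering coefficient decides between a specular bounce and a cosine-weighted Lambertian bounce, matching the model described above):

import numpy as np

def reflect(direction, normal, scattering, rng):
    """One stochastic reflection: diffuse (Lambertian, cosine-weighted) with
    probability `scattering`, specular otherwise. `direction` is the unit
    incoming ray direction and `normal` the unit surface normal.
    `rng` is a numpy Generator, e.g. np.random.default_rng()."""
    if rng.random() < scattering:
        # cosine-weighted hemisphere sampling around the surface normal
        u1, u2 = rng.random(), rng.random()
        r, phi = np.sqrt(u1), 2.0 * np.pi * u2
        local = np.array([r * np.cos(phi), r * np.sin(phi), np.sqrt(1.0 - u1)])
        # build an orthonormal frame (t, b, normal) and rotate the sample into it
        t = np.cross(normal, [1.0, 0.0, 0.0])
        if np.linalg.norm(t) < 1e-6:
            t = np.cross(normal, [0.0, 1.0, 0.0])
        t /= np.linalg.norm(t)
        b = np.cross(normal, t)
        return local[0] * t + local[1] * b + local[2] * normal
    # mirror (specular) reflection about the surface normal
    return direction - 2.0 * np.dot(direction, normal) * normal

# In the tracer, each ray's energy is attenuated by (1 - absorption) per bounce,
# and a contribution is recorded in the IR whenever a ray reaches the listener.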

In a far-field speech simulation setting, we define an acoustic room by its length, width, and height. Acoustic absorption and scattering coefficients can be defined for each surface element of the triangular mesh, and these coefficients determine the relative strength of specular versus diffuse reflection. After specifying the sound source and receiver locations within the room, our simulation generates an RIR. Detailed configurations are given in Section 3.1. One speech-related problem that has already benefited from more accurate simulation is direction-of-arrival estimation [28]. We argue that using a more accurate geometric acoustic simulation that faithfully models late reverberation for general speech-related training will lead to better performance in learning-based models.

Figure 4: Monte Carlo path tracing for solving the sound transport problem. Ray samples are generated in random directions from the source $S$. Reflections upon hitting a surface are simulated by generating subsequent random rays while conserving the total energy. Once a ray intersects the listener $L$, its energy is accumulated into the IR.

3 Training with Acoustic Simulation

To evaluate our proposed approach, we conduct far-field automated speech recognition and keyword spotting experiments and then compare our approach with the popular image method. Both experiments are reverberant speech training tasks in which the test set is always real-world noisy reverberant speech recordings, but the training set can consist of clean speech or synthetic reverberant speech generated by either the image method or our geometric acoustic simulation.

3.1 Impulse Response Generation

We consider a 6-microphone circular array with a 7 cm diameter; the speakers and the microphone array are randomly located in the room, at least 0.3 m away from the walls. Both the image method and our geometric acoustic simulation (GAS) method were employed to simulate impulse responses for 5000 randomly generated room configurations, with sizes (length × width × height) ranging from 3 m × 3 m × 2.5 m to 8 m × 10 m × 6 m. The distance between the speaker and the microphones ranges from 0.5 m to 6 m, and the reverberation time T60 is sampled from the range 0.05 s to 0.5 s. In total, there are two IR sets, each with 5000 IRs, generated with the image method and the GAS method, respectively. These IRs were used for data augmentation in the ASR and KWS tasks.
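The random sampling of room configurations described above can be sketched as follows (the helper below is illustrative only; the bookkeeping in our actual pipeline differs, and absorption/scattering coefficients would additionally be derived from the sampled T60):

import numpy as np

rng = np.random.default_rng(0)

def sample_room_config():
    """Draw one far-field configuration within the ranges of Section 3.1."""
    length, width, height = rng.uniform(3.0, 8.0), rng.uniform(3.0, 10.0), rng.uniform(2.5, 6.0)
    t60 = rng.uniform(0.05, 0.5)                      # target reverberation time (s)
    margin = 0.3                                      # keep array/speaker off the walls
    center = np.array([rng.uniform(margin, length - margin),
                       rng.uniform(margin, width - margin),
                       rng.uniform(margin, height - margin)])
    # 6-microphone circular array with a 7 cm diameter (3.5 cm radius)
    angles = np.arange(6) * np.pi / 3.0
    mics = center + 0.035 * np.stack([np.cos(angles), np.sin(angles), np.zeros(6)], axis=1)
    # speaker 0.5 m to 6 m from the array center, re-sampled until the constraint holds
    while True:
        speaker = np.array([rng.uniform(margin, length - margin),
                            rng.uniform(margin, width - margin),
                            rng.uniform(margin, height - margin)])
        if 0.5 <= np.linalg.norm(speaker - center) <= 6.0:
            break
    return {"room": (length, width, height), "t60": t60, "mics": mics, "speaker": speaker}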

3.2 Automated Speech Recognition

Data

The training corpus consists of two sets: (i) a clean corpus of 1.5 million clean speech utterances, amounting to about 1500 hours in total, and (ii) a noisy far-field training set simulated from the clean corpus by adding reverberation and mixing in various environmental noises with SNRs ranging from 0 to 24 dB. For each IR generation method, the corresponding noisy far-field training set was generated using the IRs described in Section 3.1, and the first channel of the simulated data was used as the input to the ASR system. The clean speech was first used to train the acoustic model, and then both the clean speech and the simulated noisy speech were used to fine-tune the model. Depending on which of the two IR simulation methods was used to generate the noisy training set, we obtained two acoustic models, one for the image method and one for the GAS method; the dataset sizes for the clean, image-method, and GAS-method sets are the same. The testing corpus contains 2000 utterances of real far-field recordings from 48 speakers; each utterance is 5 seconds long on average, and the whole set is about 3 hours. The data were recorded in 5 different rooms of about 4 m × 4 m × 3 m. The distances between the microphones and the speaker were randomly set to 0.5 m, 1 m, 3 m, or 5 m, and the SNR ranges from 5 to 20 dB with the background noise of an air conditioner or a fan.

Figure 5: The framework of our ASR system used for evaluations.

Model Configuration

The framework of the ASR system is shown in Fig. 5 and consists of feature extraction, an acoustic model [29], and a decoder. 40-dimensional Mel filter bank features were computed with a 25-ms window length and a 10-ms hop size and combined with their first- and second-order differences to form a 120-dimensional vector. After normalization, the feature vector of the current frame is concatenated with those of the 5 preceding and 5 subsequent frames, resulting in an input vector of dimension 1320 = 120 × (5 + 1 + 5). The acoustic model contains two 2-dimensional convolutional layers, each with a kernel size of (3, 3) and a stride of (1, 1), followed by a maxpooling layer with a kernel size of (2, 2) and a stride of (2, 2), then five LSTM layers, each with 1024 hidden units and peepholes, and finally one fully-connected layer plus a softmax layer. Batch normalization is applied after each CNN and LSTM layer to accelerate convergence and improve model generalization. We use context-dependent (CD) phonemes as the output units, which form 12000 classes in our Chinese ASR system. The Adam optimizer was adopted with an initial learning rate of 0.0001. A 5-gram language model of 190 GB was used; its vocabulary size was 280K, and its training corpus was collected from news, blogs, messages, encyclopedias, etc.
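A condensed PyTorch-style sketch of this acoustic model is given below. It is an approximation: the number of convolution channels is not specified in the text and is assumed here, peephole connections are omitted (torch.nn.LSTM does not provide them), and the per-layer batch normalization is only partially reproduced.

import torch
import torch.nn as nn

class AcousticModel(nn.Module):
    """Sketch of the CNN+LSTM acoustic model in Fig. 5. Input: (batch, T, 1320),
    where each frame is 11 context frames x 120 fbank+delta features,
    viewed as 3 channels (static, delta, delta-delta) of an 11 x 40 map."""
    def __init__(self, n_classes=12000):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )
        self.lstm = nn.LSTM(input_size=32 * 5 * 20, hidden_size=1024,
                            num_layers=5, batch_first=True)
        self.out = nn.Linear(1024, n_classes)      # CD-phoneme classes; softmax applied in the loss

    def forward(self, x):                          # x: (B, T, 1320)
        b, t, _ = x.shape
        x = x.view(b * t, 3, 11, 40)
        x = self.conv(x)                           # -> (B*T, 32, 5, 20)
        x, _ = self.lstm(x.view(b, t, -1))
        return self.out(x)                         # per-frame logits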

3.3 Keyword Spotting

Data

The original training corpus contains 2500 hours of clean speech data, including 1250 hours of the target keyword "Hi, Liu Bei" and 1250 hours of negative speech samples. The corresponding multi-channel reverberant data was simulated using each IR generation method, and noises with SNRs ranging from 0 to 24 dB were added to the augmented speech. The 2500 hours of simulated reverberant data are used for model training. The test corpus contains 8000 utterances with the target keyword, randomly selected from real user data from smart speakers in a typical living-room scenario, as well as 33 hours of negative samples from different categories, including music, TV noise, chatter, and other indoor noises. The 6-channel microphone signals were processed by an MVDR beamformer [30], and the enhanced mono-channel output was used for keyword spotting.

Figure 6: The framework of our KWS system used for evaluations.

Model Configuration

The framework of the keyword spotting system, which is similar to [7], is shown in Fig. 6 and comprises feature extraction, a classification model, and a posterior handling module. The 40-dimensional Mel filter bank features were computed with a 25-ms window length and a 10-ms hop size and combined with their first- and second-order differences to form a 120-dimensional frame feature. The current frame feature was concatenated with the 10 preceding and 5 subsequent frames, resulting in an input vector of dimension 1920 = 40 × 3 × (10 + 1 + 5). The classification model contains one 1D CNN layer [31] with a kernel size of 4, followed by a maxpooling layer with a kernel size of 3. The output of the CNN is passed to two LSTM layers (256 hidden units each) and then to a softmax layer with 4 output classes (3 keyword words + 1 garbage). Cross entropy is used as the loss function. The outputs are then passed through a posterior handling module to obtain detection decisions. The final keyword score is defined as the largest product of the smoothed posteriors within an input sliding window, subject to the constraint that the individual words fire in the same order as specified in the keyword.
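A condensed sketch of the classifier and the posterior handling is given below. It is hedged: the CNN channel count, the exact way the 16-frame context is consumed, and the smoothing/window lengths are assumptions, and the word-order constraint is omitted for brevity.

import numpy as np
import torch
import torch.nn as nn

class KWSModel(nn.Module):
    """Sketch of the 1D CNN + LSTM keyword classifier in Fig. 6."""
    def __init__(self, n_channels=64, n_classes=4):        # 3 keyword words + 1 garbage
        super().__init__()
        self.conv = nn.Conv1d(120, n_channels, kernel_size=4)
        self.pool = nn.MaxPool1d(kernel_size=3)
        self.lstm = nn.LSTM(n_channels, 256, num_layers=2, batch_first=True)
        self.out = nn.Linear(256, n_classes)

    def forward(self, feats):                               # feats: (B, 120, T) fbank + deltas
        x = self.pool(torch.relu(self.conv(feats)))         # 1D convolution + maxpool in time
        x, _ = self.lstm(x.transpose(1, 2))                 # -> (B, T', 256)
        return torch.softmax(self.out(x), dim=-1)           # per-frame word posteriors

def keyword_score(posteriors, smooth=30, window=100):
    """Posterior handling in the spirit of [7]: smooth the posteriors, then score a
    sliding window by the geometric mean of the per-word maxima (column 0 = garbage)."""
    p = np.asarray(posteriors)                               # (T, 4)
    kernel = np.ones(smooth) / smooth
    sm = np.stack([np.convolve(p[:, k], kernel, mode="same") for k in range(p.shape[1])], axis=1)
    best = 0.0
    for t in range(max(len(sm) - window + 1, 1)):
        w = sm[t:t + window, 1:]                             # keyword columns only
        best = max(best, float(np.prod(w.max(axis=0)) ** (1.0 / w.shape[1])))
    return best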

4 Results and Analysis

Table 1 shows the character accuracy of the ASR systems obtained with the clean acoustic model (Clean), the noisy acoustic model based on the image method (Noisy_IM), and the one based on the geometric acoustic simulation method (Noisy_GAS). We evaluate the IR generation methods on 2000 real-world test utterances that are corrupted by reverberation and noise. Compared with the Clean setup, the Noisy_IM setup improved system performance significantly by adding simulated noisy training data. Our proposed approach further outperformed the image method, increasing the accuracy from 59.96% to 61.54%, demonstrating the benefit of the more realistic geometric acoustic simulation.

Model       Character accuracy (%)
Clean       31.178
Noisy_IM    59.961
Noisy_GAS   61.540

Table 1: Character accuracy of ASR systems. Our GAS method achieves the highest accuracy, outperforming IM by 1.58% absolute.

Model       Equal error rate (%)
Noisy_IM    1.48
Noisy_GAS   1.17

Table 2: Equal error rates of KWS systems. Our GAS method has the lowest equal error rate, a 21% relative reduction compared with IM.

The equal error rates (EERs) of the keyword spotting systems are shown in Table 2. We achieve an EER of 1.17% when the augmented training data is generated with the geometric acoustic simulation method, versus 1.48% with the image method, a 21% relative EER reduction. In these experiments, the input to the keyword spotting system is the enhanced speech from an MVDR beamformer, indicating that the benefit of the proposed IRs carries over to multichannel signal processing front-ends.

In both experiments, we carefully controlled the training and evaluation conditions, where the only difference is the RIR simulation method. Due to our faithful simulation of diffuse sound reflections, the domain gap between synthetic training data and real data is further reduced and therefore we observe significant accuracy gains.

5 Discussion and Future Work

In this paper, we described a geometric acoustic simulation method that simulates both the specular and the diffuse sound fields for reverberant speech training. On speech recognition and keyword spotting tasks, we showed that the proposed approach outperforms the popular image method, with the gain mostly attributable to the more realistic simulation of late reverberation and diffuse reflections. One limitation of this work is that neither method can model low-frequency wave effects or diffraction. A partial solution would be to compensate the RIRs in low-frequency bands [32].

Although we demonstrated the efficacy of the proposed approach mainly on speech recognition and keyword spotting tasks, we believe similar performance improvements can be achieved on tasks such as source localization [33], speech separation, and the cocktail party problem [1, 2], all of which can benefit from data-driven techniques and are future research directions. The proposed approach is thus of wide interest, especially because it can significantly reduce the effort of collecting training data under real-usage scenarios.

Footnotes

  1. This work is supported in part by ARO grant W911NF-18-1-0313, NSF grant #1910940, Tencent, Adobe, Facebook and Intel. The authors thank Jie Chen and Dan Su from Tencent for their help with the ASR and KWS systems. Project website https://gamma.umd.edu/pro/speech/asr

References

  1. J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, “Deep clustering: Discriminative embeddings for segmentation and separation,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2016, pp. 31–35.
  2. D. Yu, M. Kolbæk, Z.-H. Tan, and J. Jensen, “Permutation invariant training of deep models for speaker-independent multi-talker speech separation,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2017, pp. 241–245.
  3. F. Seide, G. Li, and D. Yu, “Conversational speech transcription using context-dependent deep neural networks,” in Twelfth annual conference of the international speech communication association, 2011.
  4. G. E. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition,” IEEE Transactions on audio, speech, and language processing, vol. 20, no. 1, pp. 30–42, 2011.
  5. W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, and G. Zweig, “The microsoft 2016 conversational speech recognition system,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.
  6. D. Yu and J. Li, “Recent progresses in deep learning based acoustic models,” IEEE/CAA Journal of automatica sinica, vol. 4, no. 3, pp. 396–409, 2017.
  7. G. Chen, C. Parada, and G. Heigold, “Small-footprint keyword spotting using deep neural networks,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 4087–4091.
  8. R. Prabhavalkar, R. Alvarez, C. Parada, P. Nakkiran, and T. N. Sainath, “Automatic gain control and multi-style training for robust small-footprint keyword spotting with deep neural networks,” 2015.
  9. M. L. Seltzer, Y. Dong, and Y. Wang, “An investigation of deep neural networks for noise robust speech recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013.
  10. C. Kim, A. Misra, K. Chin, T. Hughes, A. Narayanan, T. N. Sainath, and M. Bacchiani, “Generation of large-scale simulated utterances in virtual rooms to train deep-neural networks for far-field speech recognition in google home,” in Interspeech, 2017.
  11. M. Doulaty, R. Rose, and O. Siohan, “Automatic optimization of data perturbation distributions for multi-style training in speech recognition,” in Spoken Language Technology Workshop, 2017.
  12. R. P. Feynman, R. B. Leighton, and M. Sands, The Feynman lectures on physics, Vol. I: The new millennium edition: mainly mechanics, radiation, and heat.   Basic books, 2011, vol. 1.
  13. S. Sakamoto, A. Ushiyama, and H. Nagatomo, “Numerical analysis of sound propagation in rooms using the finite difference time domain method,” The Journal of the Acoustical Society of America, vol. 120, no. 5, pp. 3008–3008, 2006.
  14. N. Raghuvanshi, R. Narain, and M. C. Lin, “Efficient and accurate sound propagation using adaptive rectangular decomposition,” Visualization and Computer Graphics, IEEE Transactions on, vol. 15, no. 5, pp. 789–801, 2009.
  15. J. B. Allen and D. A. Berkley, “Image method for efficiently simulating small-room acoustics,” The Journal of the Acoustical Society of America, vol. 65, no. 4, pp. 943–950, 1979.
  16. M. T. Taylor, A. Chandak, L. Antani, and D. Manocha, “Resound: interactive sound rendering for dynamic virtual environments,” in Proceedings of the 17th ACM international conference on Multimedia.   ACM, 2009, pp. 271–280.
  17. M. Taylor, A. Chandak, Q. Mo, C. Lauterbach, C. Schissler, and D. Manocha, “Guided multiview ray tracing for fast auralization,” IEEE Transactions on Visualization and Computer Graphics, vol. 18, no. 11, pp. 1797–1810, 2012.
  18. C. Schissler and D. Manocha, “Interactive sound propagation and rendering for large multi-source scenes,” ACM Transactions on Graphics (TOG), vol. 36, no. 1, p. 2, 2016.
  19. C. Schissler and D. Manocha, “Interactive sound rendering on mobile devices using ray-parameterized reverberation filters,” arXiv preprint arXiv:1803.00430, 2018.
  20. T. Funkhouser, I. Carlbom, G. Elko, G. Pingali, M. Sondhi, and J. West, “A beam tracing approach to acoustic modeling for interactive virtual environments,” in Proceedings of the 25th annual conference on Computer graphics and interactive techniques.   ACM, 1998, pp. 21–32.
  21. A. Chandak, C. Lauterbach, M. Taylor, Z. Ren, and D. Manocha, “Ad-frustum: Adaptive frustum tracing for interactive sound propagation,” IEEE Transactions on Visualization and Computer Graphics, vol. 14, no. 6, pp. 1707–1722, 2008.
  22. J. T. Kajiya, “The rendering equation,” in ACM SIGGRAPH computer graphics, vol. 20, no. 4.   ACM, 1986, pp. 143–150.
  23. T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur, “A study on data augmentation of reverberant speech for robust speech recognition,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2017, pp. 5220–5224.
  24. M. Hodgson, “Evidence of diffuse surface reflections in rooms,” The Journal of the Acoustical Society of America, vol. 89, no. 2, pp. 765–771, 1991.
  25. B.-I. Dalenbäck, M. Kleiner, and P. Svensson, “A macroscopic view of diffuse reflection,” Journal of the Audio Engineering Society, vol. 42, no. 10, pp. 793–807, 1994.
  26. Z. Tang, N. J. Bryan, D. Li, T. R. Langlois, and D. Manocha, “Scene-aware audio rendering via deep acoustic analysis,” arXiv preprint arXiv:1911.06245, 2019.
  27. C. Cao, Z. Ren, C. Schissler, D. Manocha, and K. Zhou, “Interactive sound propagation with bidirectional path tracing,” ACM Transactions on Graphics (TOG), vol. 35, no. 6, p. 180, 2016.
  28. Z. Tang, J. Kanu, K. Hogan, and D. Manocha, “Regression and classification for direction-of-arrival estimation with convolutional recurrent neural networks,” in Interspeech, 2019.
  29. T. N. Sainath, O. Vinyals, A. Senior, and H. Sak, “Convolutional, long short-term memory, fully connected deep neural networks,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015.
  30. O. Hoshuyama and A. Sugiyama, “Robust adaptive beamforming,” IEEE Transactions on Acoustics Speech & Signal Processing, vol. 35, no. 10, pp. 1365–1376, 2008.
  31. O. Abdel-Hamid, A. R. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, “Convolutional neural networks for speech recognition,” IEEE/ACM Transactions on Audio Speech & Language Processing, vol. 22, no. 10, pp. 1533–1545, 2014.
  32. Z. Tang, H.-Y. Meng, and D. Manocha, “Low-frequency compensated synthetic impulse responses for improved far-field speech recognition,” arXiv preprint arXiv:1910.10815, 2019.
  33. R. Takeda and K. Komatani, “Sound source localization based on deep neural networks with directional activate function exploiting phase information,” in 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP).   IEEE, 2016, pp. 405–409.