Recurrent Models for Auditory Attention in Multi-Microphone Distance Speech Recognition

Abstract

Integration of multiple microphone data is one of the key ways to achieve robust speech recognition in noisy environments or when the speaker is located at some distance from the input device. Signal processing techniques such as beamforming are widely used to extract a speech signal of interest from background noise. These techniques, however, are highly dependent on prior spatial information about the microphones and the environment in which the system is being used. In this work, we present a neural attention network that directly combines multi-channel audio to generate phonetic states without requiring any prior knowledge of the microphone layout or any explicit signal preprocessing for speech enhancement. We embed an attention mechanism within a Recurrent Neural Network (RNN) based acoustic model to automatically tune its attention to a more reliable input source. Unlike traditional multi-channel preprocessing, our system can be optimized towards the desired output in one step. Although attention-based models have recently achieved impressive results on sequence-to-sequence learning, no attention mechanisms have previously been applied to learn potentially asynchronous and non-stationary multiple inputs. We evaluate our neural attention model on the CHiME-3 challenge task, and show that the model achieves comparable performance to beamforming using a purely data-driven method.

1 Introduction

Many real-world speech recognition applications, including teleconferencing, robotics and in-car spoken dialog systems, must deal with speech from distant microphones in noisy environments. When a human voice is captured with far-field microphones in these environments, the audio signal is severely degraded by reverberation and background noise. This makes the distant speech recognition task far more challenging than near-field speech recognition, which is commonly used for voice-based interaction today.

Acoustic signals from multiple microphones can be used to enhance recognition accuracy due to the availability of additional spatial information. Many researchers have proposed techniques to efficiently integrate inputs from multiple distant microphones. The most representative multi-channel processing technique is the beamforming approach [33], which generates an enhanced single output signal by aligning multiple signals through digital delays that compensate for the different path lengths of the input signals. However, the performance of beamforming is highly dependent on prior information about the microphone locations and the location of the target source. For downstream tasks such as speech recognition, this preprocessing step is suboptimal because it is not directly optimized towards the final objective of interest: speech recognition accuracy [29].

Over the past few years, deep neural networks (DNNs) have been successfully applied to acoustic models in speech recognition [28]. Other works [17] have shown that DNNs can learn suitable representations for distant speech recognition by directly using multi-channel input. These approaches, however, simply concatenated acoustic features from multiple microphones without considering the spatial properties of acoustic signal propagation, or used convolutional neural networks (CNNs) to implicitly account for spatial relationships between channels [26].

Recently, an “attention mechanism” in neural networks has been proposed to address the problem of learning variable-length input and output sequences [1]. At each output step, the previous output history is used to generate an attention vector over the input sequence. This attention vector enables models to learn to focus on specific parts of their input. These attention-equipped frameworks have shown very promising results on many challenging tasks involving variable-length inputs and outputs, including machine translation [1], parsing [35], image captioning [36] and conversational modeling [34]. For speech recognition specifically, [5] attempted to align the input features and the desired character sequence using an attention mechanism. However, no attention mechanisms have previously been applied to learning to integrate multiple inputs.

In this work, we propose a novel attention-based model that learns from misaligned and non-stationary multiple input sources for distant speech recognition. We embed an attention mechanism within a Recurrent Neural Network (RNN) based acoustic model to automatically tune its attention to the more reliable of the misaligned and non-stationary input sources at each output step. The attention module is trained jointly with the acoustic model and optimized towards phonetic state accuracy. Our attention module is unique in that it 1) deals with the problem of integrating multiple sources of differing quality and alignment, and 2) exploits spatial information between multiple sources to accelerate learning of auditory attention. Our system plays a similar role to traditional multi-channel preprocessing through a deep neural network architecture, but bypasses the limitations of preprocessing, which requires an expensive, separate step and depends on prior information.

Through a series of experiments on the CHiME-3 [14] dataset, we show that our proposed approach improves recognition accuracy in various types of noisy environments. We also compare our approach with the beamforming technique [14]. The paper is organized as follows: in Section 2 we describe our proposed attention-based model; in Section 3 we evaluate the performance of our model; finally, in Section 4 we draw conclusions.

2 Model

In this section, we describe our neural attention model, which allows neural networks to focus more on reliable input sources across different temporal locations. We formulate the proposed framework with applications in multi-channel distant speech recognition. While there has been some recent work on end-to-end neural speech recognition systems - from speech directly to transcripts [9] - our model is based on typical hybrid DNN-HMM frameworks [21], wherein the acoustic model estimates hidden Markov model (HMM) state posteriors, because we focus on dealing with the re-weighted input representation of misaligned multiple input sources.

Given a set of input sequences X = {X^1, ..., X^N}, where X^i is the input sequence from the i-th microphone, i ∈ {1, ..., N}, our system computes a corresponding sequence of HMM acoustic states, y = (y_1, ..., y_T). We model each output y_t at time t as a conditional distribution over the previous outputs y_<t and the multiple inputs X_t at time t using the chain rule:

P(y | X) = ∏_t P(y_t | y_<t, X_t)

Our system consists of two subnetworks: an attention network and an acoustic model network. The attention network is an attention-equipped Recurrent Neural Network (RNN) that learns to determine and focus on reliable channels and temporal locations among the candidate multiple input sequences, and produces re-weighted inputs based on the learned attention. These re-weighted inputs are passed to the second subnetwork, a Long Short-Term Memory (LSTM) acoustic model (LSTM-AM) that estimates the probability of the output HMM state. Figure 1 visualizes our overall model with these two components. We describe each component in more detail in Sections 2.1 and 2.2.

2.1 Attention mechanism for multiple sources

The challenge we attempt to address with the neural attention mechanism is the problem of misaligned multiple input sources with non-stationary quality over time. Specifically, in multi-channel distant speech recognition, the arrival time of each channel is different because the acoustic path length of each signal differs according to the location of the microphone. This results in the misalignment of input features. These differences in arrival time are even greater when the space between microphones is larger. Even worse, signal quality across channels can also vary over time because the speaker and interfering noise sources may keep changing. Figure 1 describes the asynchronous arrival of multiple inputs due to acoustic path length differences.

We now introduce an attention mechanism to cope with the misaligned input problem, and formulate the attention function. At every output step t, the function produces a re-weighted input representation, given the candidate input set: a subsequence of W time frames from each channel. As proposed by [2], we perform similar windowing to limit the temporal locations explored, for computational efficiency and scalability. We limit the range of attention to W = 7 time frames. In our experiments, longer windows had little impact on overall performance; they would mainly benefit setups with microphones placed further apart from each other.

For re-weighting the input, the attention network predicts an attention weight matrix at each output step t. Unlike previous attention mechanisms, we produce a weight matrix rather than a vector, because our attention mechanism additionally identifies which channel, in a given time step, is more relevant. The attention weight matrix is therefore N (number of channels) by W (number of candidate input frames); here it is a 5 x 7 matrix. Attention weights are calculated based on four different information sources: 1) the attention history, 2) the content of the candidate sequences, 3) the decoding history, and 4) additional spatial information between multiple microphones, based on the phase difference information corresponding to the candidate inputs. The following three formulations describe the attention function:

Specifically, the first formulation (Equation 1) computes an N x W energy matrix by the following equation:
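The original parameterization is not reproduced here; as a hedged sketch, an additive-attention energy in the style of [1] could combine the four information sources listed above as follows (the symbols v, W_a, W_x, W_h, and W_p are our illustrative names, not the paper's):

```latex
e_t[c, w] \;=\; v^{\top} \tanh\!\big(
    W_a\, a_{t-1}       % 1) attention history
  \;+\; W_x\, x^{c}_{t+w} % 2) content of the candidate frame (channel c, offset w)
  \;+\; W_h\, h_{t-1}     % 3) decoding history
  \;+\; W_p\, p^{c}_{t+w} % 4) phase-based spatial information
\big)
```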

where the energy computation uses learned parameter matrices and a learned parameter vector. Once we compute the energy at time t, we obtain the attention weights by normalizing the energies so that they are non-negative and sum to one (the second formulation). Finally, the re-weighted output is generated by calculating the product of the attention weights and the candidate inputs (the third formulation). Typically, elements from the input candidates would be selected by a weighted sum. However, we only calculate the product of the weights and the candidate inputs, without summing over them, in order to avoid losing information.
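The normalization and re-weighting steps can be sketched in NumPy as follows. This is a minimal illustration, not the paper's implementation: the energies would come from the learned scoring network described above, and are random here.

```python
import numpy as np

def softmax(e):
    """Normalize energies so attention weights are non-negative and sum to 1."""
    z = np.exp(e - e.max())
    return z / z.sum()

def attend(candidates, energy):
    """Re-weight an N-channel, W-frame candidate window.

    candidates: (N, W, D) log-filterbank features for N channels, W frames.
    energy:     (N, W) unnormalized attention energies (random here; in the
                model they come from the learned scoring network).
    Returns the (N, W, D) element-wise re-weighted inputs; unlike a weighted
    sum, this keeps every frame rather than collapsing the window.
    """
    a = softmax(energy)               # (N, W); all entries sum to 1
    return candidates * a[:, :, None] # broadcast each weight over the features

# Toy example: 5 channels, 7-frame window, 40-dim features.
rng = np.random.default_rng(0)
x = rng.standard_normal((5, 7, 40))
e = rng.standard_normal((5, 7))
out = attend(x, e)
print(out.shape)  # (5, 7, 40)
```

Keeping the full (N, W, D) product preserves per-frame information for the downstream LSTM, which is the design choice the paragraph above motivates.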

To accelerate the learning of the attention mechanism, we use additional spatial information based on analysis of differences in arrival time. It is generally assumed that the human auditory system can localize multiple sounds and attend to the desired signal using information from the interaural time difference (ITD) [31]. A previous study [15] attempted to emulate human binaural processing and estimate ITD indirectly by comparing the phase difference between two microphones at each frequency. The authors identified time-frequency components “close” to the speaker based on the estimated ITD. Similarly, we use the phase difference between two microphones to infer spatial information. The following equations are used to compute the phase difference between two microphones i and j, where i ≠ j:

From these equations, we calculate the phase differences of each time-frequency bin for each pair of microphones. In our work, we use 256 frequency bins for 25 ms windows. The phase feature is calculated for every pair of channels, and the network then accepts the phase features corresponding to the candidate inputs as an additional input.
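A minimal sketch of the per-pair phase-difference feature, assuming 16 kHz audio so that a 25 ms window is 400 samples, and a 512-point FFT so that 256 usable frequency bins remain (the paper does not state the FFT size; that choice is ours):

```python
import numpy as np

def phase_difference(ch_i, ch_j, frame_len=400, n_fft=512):
    """Per-frequency phase difference between two channels for one frame.

    ch_i, ch_j: time-domain samples of one 25 ms frame from microphones
                i and j (400 samples at 16 kHz, an assumed sample rate).
    Returns 256 phase differences in radians, wrapped to (-pi, pi].
    """
    window = np.hanning(frame_len)
    spec_i = np.fft.rfft(ch_i * window, n=n_fft)
    spec_j = np.fft.rfft(ch_j * window, n=n_fft)
    # Angle of the cross-spectrum equals phase(i) - phase(j) per bin.
    return np.angle(spec_i * np.conj(spec_j))[:256]

# A pure tone delayed by 5 samples shows a frequency-dependent phase lag.
t = np.arange(405)
tone = np.sin(2 * np.pi * 1000 * t / 16000)
pd = phase_difference(tone[:400], tone[5:405])
print(pd.shape)  # (256,)
```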

2.2 LSTM Acoustic Model

Our next subnetwork serves as a typical RNN-based acoustic model, except that it accepts the re-weighted input instead of the original input. We use a Long Short-Term Memory RNN (LSTM) [13], which has been successfully applied to speech recognition tasks due to its ability to handle long-term dependencies. The LSTM contains special units called memory blocks in the recurrent hidden layer, and each block has memory cells with three special gates (input, output, and forget) that control the flow of information.

In our work, we use a simplified version of an LSTM without peephole connections and biases to reduce the computational expense of learning the standard LSTM models. Although LSTMs have many variations for enhancing their performance, such as BLSTM [7], LSTMP [27], and PBLSTM [4], in our work, we focus on verifying an additional attention mechanism with a simple LSTM architecture, instead of improving LSTM acoustic modeling overall.

The acoustic model maps the re-weighted input sequence produced by the attention mechanism to an output sequence by calculating the network unit activations iteratively from t = 1 to T, using the following equations:

where the W terms denote weight matrices, and σ is the logistic sigmoid function. i, f, o, and c are the input gate, forget gate, output gate, and cell activation vectors, respectively. Finally, the output is used to predict the current HMM state label by a softmax (Equation 3). The output is also used to predict the next attention matrix, as well as serving as the next hidden state of the LSTM.
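The simplified gate equations above (no peepholes, no biases) can be sketched as a single recurrence step. This is an illustrative NumPy implementation under our own naming, not the paper's code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W):
    """One step of the simplified LSTM (no peephole connections, no biases).

    x:      (D,) re-weighted input for this frame.
    h_prev: (H,) previous hidden state; c_prev: (H,) previous cell state.
    W:      dict of (H, D+H) weight matrices for the input/forget/output
            gates and the cell candidate (keys "i", "f", "o", "c"; ours).
    """
    z = np.concatenate([x, h_prev])
    i = sigmoid(W["i"] @ z)   # input gate
    f = sigmoid(W["f"] @ z)   # forget gate
    o = sigmoid(W["o"] @ z)   # output gate
    g = np.tanh(W["c"] @ z)   # cell candidate
    c = f * c_prev + i * g    # new cell state
    h = o * np.tanh(c)        # new hidden state (fed to the softmax layer)
    return h, c

rng = np.random.default_rng(0)
D, H = 40, 8
W = {k: 0.1 * rng.standard_normal((H, D + H)) for k in "ifoc"}
h, c = lstm_step(rng.standard_normal(D), np.zeros(H), np.zeros(H), W)
print(h.shape, c.shape)  # (8,) (8,)
```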

3 Experiments

3.1 Dataset

We evaluated the performance of our architecture on the CHiME-3 task. The CHiME-3 [14] task is automatic speech recognition for a multi-microphone tablet device in everyday environments: a cafe, a street junction, public transport, and a pedestrian area. There are two types of datasets: REAL and SIMU. The REAL data consists of 6-channel recordings. 12 US English speakers were asked to read sentences from the WSJ0 corpus [6] while using the multi-microphone tablet. They were encouraged to adjust their reading positions, so that the target distance kept changing over time. The simulated data, SIMU, was generated by mixing clean utterances from WSJ0 into background recordings. To verify our method in a real noisy environment, we first chose not to use the simulated dataset but rather only the REAL dataset, with 5 channels from the five microphones located at the corners of the tablet, about 10 cm to 20 cm away from each other (we excluded one microphone, which faced backward on the tablet device). We then evaluated our system on the full CHiME-3 dataset, MULTI, including REAL and SIMU.

3.2 System Training

All the networks were trained on the 1,600-utterance (about 2.9 hours) REAL dataset and then on the 8,738-utterance (about 18 hours) MULTI dataset. The data was represented as 25 ms frames of 40-dimensional log-filterbank energy features computed every 10 ms. We produced 1,992 HMM state labels from a GMM-HMM system trained on near-field microphone data, and these state labels were used in all subsequent experiments. We use one LSTM layer with 512 cells. The weights in all the networks were initialized uniformly in the range (-0.03, 0.03), and the initial attention weights were initialized uniformly across their dimensions. The learning rate was set to 0.4 and decayed after two epochs during training; all models converged stably at learning rates in the range 1e-04 to 5e-04. To avoid the exploding gradient problem, we limited the norm of the gradient to 1 [23]. Apart from gradient clipping, we did not constrain the activations or the weights.
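The gradient-norm limit described above is standard global-norm clipping [23]; a minimal sketch (function name ours):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale all gradients so their combined L2 norm is at most max_norm,
    following Pascanu et al. [23]; the paper uses max_norm = 1."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total > max_norm:
        scale = max_norm / total
        grads = [g * scale for g in grads]
    return grads

# Two parameter gradients whose global norm (sqrt(170) ~ 13) exceeds 1.
g1, g2 = np.full(10, 3.0), np.full(5, -4.0)
clipped = clip_by_global_norm([g1, g2], max_norm=1.0)
norm = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(round(norm, 6))  # 1.0
```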

During training, we evaluated frame accuracies (i.e., phone state labeling accuracy of acoustic frames) on the development sets of 1,640 utterances in REAL and 3,280 utterances in MULTI. The trained models were evaluated in a speech recognition system on a test set of 1,320 utterances. For all the decoding experiments, we used a beam of size 18 and lattices of size 10. There is a mismatch between the Kaldi baseline [25] and our results because we did not perform sequence training (sMBR) or language model rescoring (5-gram rescoring or RNNLM). The inputs for all networks were log-filterbank features, stacked across the 5 channels and then across 7 frames (+3/-3 context).
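The channel-then-frame stacking can be sketched as follows. Edge handling is an assumption (the paper does not specify it); here the first and last frames are repeated:

```python
import numpy as np

def stack_features(feats, context=3):
    """Stack features across channels, then across +/-`context` frames
    (the +3/-3 stacking described above).

    feats: (N, T, D) log-filterbank features for N channels over T frames.
    Returns (T, N * D * (2*context + 1)) network inputs; edges are padded
    by repeating the first/last frame (our assumption).
    """
    N, T, D = feats.shape
    per_frame = feats.transpose(1, 0, 2).reshape(T, N * D)  # channel stacking
    padded = np.pad(per_frame, ((context, context), (0, 0)), mode="edge")
    # Concatenate the 2*context+1 shifted copies along the feature axis.
    return np.concatenate(
        [padded[k:k + T] for k in range(2 * context + 1)], axis=1)

# 5 channels, 100 frames, 40-dim features -> 5 * 40 * 7 = 1400-dim inputs.
x = np.random.default_rng(0).standard_normal((5, 100, 40))
print(stack_features(x).shape)  # (100, 1400)
```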

3.3 Results

In Table 1 and Table 2, we summarize word error rates (WERs) obtained on the subsets of the CHiME-3 task. ALSTM is our proposed model, which has an attention mechanism for multiple inputs as described in Section 2.1; ALSTM (with phase) additionally uses phase information.

As our baselines, we built three models on the REAL dataset, using the same simple version of the LSTM architecture that we described in Section 2.2 with three different inputs. LSTM (Preprocessing 5 noisy-channel) was trained on the enhanced signal from 5 noisy channels. We obtained the enhanced signal from the beamforming toolkit provided by the CHiME-3 organizers [14]. LSTM (single noisy-channel) was trained on a single noisy channel, and LSTM (5 noisy-channel) used the concatenated 5 noisy channels. We also built LSTM (Preprocessing 5 noisy-channel) on the MULTI dataset.

As expected, LSTM (Preprocessing 5 noisy-channel) provided a substantial improvement in WER compared to LSTM (single noisy-channel) and LSTM (5 noisy-channel), showing 13.3% and 5.0% relative improvements in WER, respectively. We also found that the model that simply concatenated the 5 features across microphones did not perform well: it showed poorer results than even the model trained with single-microphone data. This result underscores the importance of integrating channels based on analysis of differences in arrival times.

Our model with the attention mechanism provided a significant improvement in WER compared to LSTM (5 noisy-channel): ALSTM (with phase) achieved a 17% relative error rate reduction on the evaluation set, and ALSTM a 13% relative reduction. These results suggest that we can leverage the attention mechanism to integrate multiple channels efficiently. To ensure that the improvement of the system came from our time-channel attention mechanism, we compared our model to a model with an attention mechanism across time only, on single-channel input. This comparison model improved accuracy by 3%, a lower gain than that achieved by the time-channel attention mechanism.

We also found that the additional phase information helps learn the attention: WER improved by 4.6% relative. Compared with LSTM (Preprocessing 5 noisy-channel), our proposed model achieved performance comparable to beamforming without any preprocessing. Although ALSTM performs slightly below LSTM (Preprocessing 5 noisy-channel), ALSTM (with phase) obtained a 4.0% relative error rate reduction over it. When we used LSTM-AM with the additional phase features as direct inputs, without any attention mechanism, they had a negative influence on learning. Thus, using the phase features for the attention mechanism is more effective than using them as direct inputs to the acoustic model.

We also evaluated the models on the MULTI dataset. We found that our system outperformed LSTM (Preprocessing 5 noisy-channel) by 5%, and the gain from the time-channel attention mechanism increased.

We then analyzed the computational aspects of our system. As the multi-microphone integration is performed as part of the acoustic model computation, we found it to be more computationally efficient than performing beamforming followed by an LSTM acoustic model. On our development machine (Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz), the proposed multi-microphone model with attention and phase ran at 0.08x real-time, significantly faster than beamforming followed by acoustic model computation, which ran at 0.6x real-time.

4 Conclusions

We proposed an attention-based model (ALSTM) that uses asynchronous and non-stationary inputs from multiple channels to generate outputs. For a distant speech recognition task, we embedded a novel attention mechanism within an RNN-based acoustic model to automatically tune its attention to a more reliable input source. We presented our results on the CHiME-3 task and found that ALSTM showed a substantial improvement in WER. Our model achieved comparable performance to beamforming without any prior knowledge of the microphone layout or any explicit preprocessing.

Our work suggests a way to build a more efficient ASR system by bypassing preprocessing. Our findings suggest that this approach is likely to do well on tasks that need to exploit misaligned and non-stationary inputs from multiple sources, such as multimodal problems and sensor fusion. We believe that our attention framework can improve such tasks by maximizing the benefits of using inputs from multiple sources.

Acknowledgments

The authors would like to acknowledge Richard M. Stern and William Chan for their valuable and constructive suggestions. This research was supported by LGE.

References

1. Neural machine translation by jointly learning to align and translate.
Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. arXiv preprint arXiv:1409.0473
2. End-to-end attention-based large vocabulary speech recognition.
Bahdanau, Dzmitry, Chorowski, Jan, Serdyuk, Dmitriy, Brakel, Philemon, and Bengio, Yoshua. arXiv preprint arXiv:1508.04395
3. Multi-source tdoa estimation in reverberant audio using angular spectra and clustering.
Blandin, Charles, Ozerov, Alexey, and Vincent, Emmanuel. Signal Processing
4. Listen, attend and spell.
Chan, William, Jaitly, Navdeep, Le, Quoc V, and Vinyals, Oriol. arXiv preprint arXiv:1508.01211
5. End-to-end continuous speech recognition using attention-based recurrent nn: First results.
Chorowski, Jan, Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. arXiv preprint arXiv:1412.1602
6. Csr-i (wsj0) complete.
Garofalo, John, Graff, David, Paul, Doug, and Pallett, David. Linguistic Data Consortium, Philadelphia
7. Hybrid speech recognition with deep bidirectional lstm.
Graves, Alan, Jaitly, Navdeep, and Mohamed, Abdel-rahman. In Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on, pp. 273–278. IEEE, 2013.
8. Towards end-to-end speech recognition with recurrent neural networks.
Graves, Alex and Jaitly, Navdeep. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1764–1772, 2014.
9. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks.
Graves, Alex, Fernández, Santiago, Gomez, Faustino, and Schmidhuber, Jürgen. In Proceedings of the 23rd international conference on Machine learning, pp. 369–376. ACM, 2006.
10. Deepspeech: Scaling up end-to-end speech recognition.
Hannun, Awni, Case, Carl, Casper, Jared, Catanzaro, Bryan, Diamos, Greg, Elsen, Erich, Prenger, Ryan, Satheesh, Sanjeev, Sengupta, Shubho, Coates, Adam, et al. arXiv preprint arXiv:1412.5567
11. Learning feature mapping using deep neural network bottleneck features for distant large vocabulary speech recognition.
Himawan, Ivan, Motlicek, Petr, Imseng, David, Potard, Blaise, Kim, Namhoon, and Lee, Jaewon. In IEEE International Conference on Acoustics, Speech, and Signal Processing, number EPFL-CONF-207946, 2015.
12. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups.
Hinton, Geoffrey, Deng, Li, Yu, Dong, Dahl, George E, Mohamed, Abdel-rahman, Jaitly, Navdeep, Senior, Andrew, Vanhoucke, Vincent, Nguyen, Patrick, Sainath, Tara N, et al. Signal Processing Magazine, IEEE
13. Long short-term memory.
Hochreiter, Sepp and Schmidhuber, Jürgen. Neural computation
14. The third ’chime’ speech separation and recognition challenge: Dataset, task and baselines.
Barker, Jon, Marxer, Ricard, Vincent, Emmanuel, and Watanabe, Shinji. Submitted to IEEE 2015 Automatic Speech Recognition and Understanding Workshop (ASRU)
15. Signal separation for robust speech recognition based on phase difference information obtained in the frequency domain.
Kim, Chanwoo, Kumar, Kshitiz, Raj, Bhiksha, and Stern, Richard M. In INTERSPEECH, pp. 2495–2498, 2009.
16. Microphone array processing for distant speech recognition: From close-talking microphones to far-field sensors.
Kumatani, Kenichi, McDonough, John, and Raj, Bhiksha. Signal Processing Magazine, IEEE
17. Using neural network front-ends on far field multiple microphones based speech recognition.
Liu, Yulan, Zhang, Pengyuan, and Hain, Thomas. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pp. 5542–5546. IEEE, 2014.
18. Adaptive segmentation and separation of determined convolutive mixtures under dynamic conditions.
Loesch, Benedikt and Yang, Bin. In Latent Variable Analysis and Signal Separation, pp. 41–48. Springer, 2010.
19. Mestre, Xavier, Lagunas, Miguel, et al. In Signal Processing and Information Technology, 2003. ISSPIT 2003. Proceedings of the 3rd IEEE International Symposium on, pp. 459–462. IEEE, 2003.
20. Acoustic modeling using deep belief networks.
Mohamed, Abdel-rahman, Dahl, George E, and Hinton, Geoffrey. Audio, Speech, and Language Processing, IEEE Transactions on
21. Connectionist speech recognition: a hybrid approach, 1994.
Morgan, N and Bourlard, H.
22. Spatial separation of speech signals using amplitude estimation based on interaural comparisons of zero-crossings.
Park, Hyung-Min and Stern, Richard M. Speech Communication
23. On the difficulty of training recurrent neural networks.
Pascanu, Razvan, Mikolov, Tomas, and Bengio, Yoshua. arXiv preprint arXiv:1211.5063
24. Distant speech separation using predicted time–frequency masks from spatial features.
Pertilä, Pasi and Nikunen, Joonas. Speech Communication
25. The kaldi speech recognition toolkit.
Povey, Daniel, Ghoshal, Arnab, Boulianne, Gilles, Burget, Lukas, Glembek, Ondrej, Goel, Nagendra, Hannemann, Mirko, Motlicek, Petr, Qian, Yanmin, Schwarz, Petr, Silovsky, Jan, Stemmer, Georg, and Vesely, Karel. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, December 2011.
26. Neural networks for distant speech recognition.
Renals, Steve and Swietojanski, Pawel. In Hands-free Speech Communication and Microphone Arrays (HSCMA), 2014 4th Joint Workshop on, pp. 172–176. IEEE, 2014.
27. Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition.
Sak, Haşim, Senior, Andrew, and Beaufays, Françoise. arXiv preprint arXiv:1402.1128
28. Conversational speech transcription using context-dependent deep neural networks.
Seide, Frank, Li, Gang, and Yu, Dong. In Interspeech, pp. 437–440, 2011.
29. Bridging the gap: Towards a unified framework for hands-free speech recognition using microphone arrays.
Seltzer, Michael L. In Hands-Free Speech Communication and Microphone Arrays, 2008. HSCMA 2008, pp. 104–107. IEEE, 2008.
30. Likelihood-maximizing beamforming for robust hands-free speech recognition.
Seltzer, Michael L, Raj, Bhiksha, and Stern, Richard M. Speech and Audio Processing, IEEE Transactions on
31. Binaural and multiple-microphone signal processing motivated by auditory perception.
Stern, Richard M, Gouvêa, Evandro, Kim, Chanwoo, Kumar, Kshitiz, and Park, Hyung-Min. In Hands-Free Speech Communication and Microphone Arrays, 2008. HSCMA 2008, pp. 98–103. IEEE, 2008.
32. Convolutional neural networks for distant speech recognition.
Swietojanski, Pawel, Ghoshal, Arnab, and Renals, Steve. Signal Processing Letters, IEEE
33. Speech recognition in noisy environments with the aid of microphone arrays.
Van Compernolle, Dirk, Ma, Weiye, Xie, Fei, and Van Diest, Marc. Speech Communication
34. A neural conversational model.
Vinyals, Oriol and Le, Quoc. arXiv preprint arXiv:1506.05869
35. Grammar as a foreign language.
Vinyals, Oriol, Kaiser, Lukasz, Koo, Terry, Petrov, Slav, Sutskever, Ilya, and Hinton, Geoffrey. arXiv preprint arXiv:1412.7449
36. Show, attend and tell: Neural image caption generation with visual attention.
Xu, Kelvin, Ba, Jimmy, Kiros, Ryan, Courville, Aaron, Salakhutdinov, Ruslan, Zemel, Richard, and Bengio, Yoshua. arXiv preprint arXiv:1502.03044
37. Far-field speech recognition using cnn-dnn-hmm with convolution in time.
Yoshioka, Takuya, Karita, Shigeki, and Nakatani, Tomohiro. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pp. 4360–4364. IEEE, 2015.