Automatic context window composition for distant speech recognition

Mirco Ravanelli, Maurizio Omologo
Fondazione Bruno Kessler, Trento, Italy
Abstract

Distant speech recognition is being revolutionized by deep learning, which has led to systems that significantly outperform previous HMM-GMM ones. A key aspect behind the rapid rise and success of DNNs is their ability to better manage large time contexts. In this regard, asymmetric context windows that embed more past than future frames have recently been used with feed-forward neural networks. This context configuration turns out to be useful not only for low-latency speech recognition, but also for boosting recognition performance under reverberant conditions.

This paper investigates the mechanisms occurring inside DNNs that lead to an effective application of asymmetric contexts. In particular, we propose a novel method for automatic context window composition based on a gradient analysis. The experiments, performed with different acoustic environments, features, DNN architectures, microphone settings, and recognition tasks, show that our simple and efficient strategy leads to a less redundant frame configuration, which makes DNN training more effective in reverberant scenarios.

Keywords: Distant Speech Recognition, Deep Learning, Context Window, Reverberation
Journal: Speech Communication

1 Introduction

Distant Speech Recognition (DSR) represents a fundamental technology for flexible human-machine interfaces. There are indeed various real-life situations where DSR is more natural, convenient and attractive than traditional close-talking speech recognition [dasr]. For instance, applications such as meeting transcription and smart TVs have been studied over the past decade in the context of the AMI/AMIDA [ami] and the DICIT [dicit_1] projects, respectively. More recently, speech-based domestic control has gained a lot of attention [vacher, isidoros]. To this end, the EU DIRHA project developed voice-enabled automated home environments based on distant-speech interaction in different languages [lrec, dirha_asru]. Concerning this application, innovative commercial products, such as Amazon Echo and Google Home, have recently been introduced in the market. Robotics, finally, represents another emerging scenario, where users can freely talk with distant mobile platforms.

Several efforts have been devoted over the last years to improving DSR technology, as witnessed by the great success of international challenges such as CHiME [chime3], REVERB [revch_short], and ASpIRE [aspire]. A major role in improving this technology is being played by deep learning [Goodfellow-et-al-2016-Book, lideng, dnn_shared], which has contributed to significantly outperforming HMM-GMM speech recognizers. Deep learning, in effect, has been rapidly evolving in recent years, progressively offering more powerful and robust techniques, including effective regularization methods [dropout, batchnorm], improved optimization algorithms [adam], as well as better architectures [cnn1, tdnn2, lstm_highway, gru1].

A key aspect behind the success of deep learning in speech recognition is the ability of modern DNNs to perform predictions based on a large time context. Architectures able to learn long- and short-term dependencies include Recurrent Neural Networks (RNNs) such as Long Short-Term Memory (LSTM) [lstm] or Gated Recurrent Units (GRUs) [gru1, ravanelli_is17]. To simultaneously manage both past and future time contexts, the most suitable solution is bidirectional RNNs [graves], which turn out to automatically learn (through their recurrent connections) how to properly exploit the contextual information. The price to be paid for automatically learning contexts from speech data is an increased computational complexity. LSTMs, for instance, rely on a rather complex cell design based on three multiplicative gates, which normally requires many more computations at each time step than a simpler feed-forward NN. Bidirectional RNNs, moreover, can generate a sequence of posterior probabilities only after processing the entire sentence. Both of these features often impair their use in real-time/low-latency applications.

To circumvent this drawback, unidirectional RNNs or feed-forward DNNs can be used. For low-latency applications, feed-forward DNNs remain the preferred choice in many practical settings, as witnessed by the numerous studies in the recent literature on real-time ASR systems operating on devices with low computational power [acw1, small2, small3, small4, small5, online2]. In line with these recent efforts, this work considers the aforementioned scenario, targeting standard feed-forward DNNs.

In the case of feed-forward DNNs, the input features are typically gathered into a symmetric context window (SCW), which comprises the current frame along with the same number of surrounding past and future ones. Nevertheless, an asymmetric context window (ACW) that integrates more past than future frames has gained popularity for real-time/low-latency recognition of close-talking speech [online2, acw1, acw_ct, tdnn2]. Interestingly, some recent papers have also evidenced the effectiveness of its application to distant speech recognition [acw_rev, ravanelli15], though a deep analysis is still missing concerning the conditions under which this approach becomes convenient with reverberated speech.

The goal of this work is to better understand these aspects, and to propose a methodology to derive an optimal context window (CW) according to the characteristics of the DSR task. The proposed algorithm, tested on different tasks, datasets, microphone configurations as well as acoustic environments, exploits a gradient analysis performed at an early stage of the DNN training. Its application significantly reduces the effort needed to find an optimal context, while improving on the ASR performance provided by standard SCWs.

The rest of the paper is organized as follows. Sec. 2 analyzes the effects of ACW on speech signals, input features, and DNN gradients. Sec. 3 describes the proposed algorithm for automatic context window composition. In Sec. 4, an overview of the adopted experimental setup is provided, while the ASR results are reported in Sec. 5. Finally, Sec. 6 draws our conclusions.

2 Asymmetric Context Window for Counteracting Reverberation

To better introduce the motivations behind the use of the ACW, it is useful to recall the effect of reverberation on a speech signal. Let us describe a distant speech signal y(t) by the following equation:

y(t) = x(t) * h(t) + n(t)   (1)

where x(t) is the close-talking signal (i.e., the speech signal before its propagation in the acoustic environment, which is assumed to be a latent variable not directly observed), h(t) is the acoustic impulse response (IR) between source and microphone, * denotes convolution, and n(t) is the additive noise introduced by the environment. The speech signal is reflected many times by walls, floor, and ceiling as well as by objects within the acoustic environment. Such a multi-path propagation, known as reverberation [kutt], is summarized by the IR h(t), which can be modeled as a causal FIR filter (i.e., h(t) = 0 for t < 0). Fig. 1(a) shows an IR measured in a living-room, whose log-energy decay (reported in Fig. 1(b)) indicates that the reverberation time T60 [kutt] is about 780 ms.
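For readers who want to reproduce this contamination process, the sketch below applies Eq. (1) with numpy/scipy; the signal, IR, and noise arrays are placeholders for the reader's own data, not part of the original experimental pipeline.

```python
import numpy as np
from scipy.signal import fftconvolve

def reverberate(x, h, n=None):
    """Simulate a distant-talking signal y(t) = x(t) * h(t) + n(t) (Eq. 1).

    x: close-talking waveform, h: measured impulse response,
    n: optional additive noise (same sampling rate assumed)."""
    y = fftconvolve(x, h)[: len(x)]   # convolution with the causal IR
    if n is not None:
        y = y + n[: len(y)]           # environmental additive noise
    return y
```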

Figure 1: An impulse response h(t) measured in a domestic environment with a reverberation time of about 780 ms. Panel (a): impulse response h(t); panel (b): log-energy decay of h(t).

In a DSR system, the distant-talking signal y(t) is processed by a feature extraction function that computes a sequence of feature frames X = {x_1, ..., x_T}, where each frame x_t is a vector consisting of D features, and T is the total number of frames. Feature extraction is normally carried out by splitting the signal into small chunks (lasting 20-25 ms with an overlap of 10 ms), and by applying a transformation to each chunk. To broaden the time context, the DSR system is fed not only with the current frame x_t, but also with some surrounding ones. The set of frames feeding the system, i.e., the context window C_t, is defined in the following way:

Figure 2: HMM-DNN pipeline used for hybrid speech recognition with feed-forward neural networks.
C_t = {x_{t-Np}, ..., x_t, ..., x_{t+Nf}}   (2)

where Np and Nf are the number of past and future frames, respectively. Standard SCWs correspond to Np = Nf, while ACWs correspond to Np ≠ Nf. To account for different balance factors between past and future frames, let us introduce the coefficient γ defined as follows:

γ = Np / (Np + Nf)   (3)

It results that γ > 0.5 for an asymmetric context embedding more past than future frames, γ = 0.5 for a symmetric context, and γ < 0.5 when embedding more future frames.

The context window C_t then feeds a DNN, as depicted in Fig. 2. The DNN processes the input features with several non-linear hidden layers and estimates a set of posterior probabilities over the context-dependent states. The cost function optimized during training (e.g., cross-entropy) is computed from the reference labels and the aforementioned predictions.
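As an illustration of Eqs. (2) and (3), the following minimal sketch builds a flattened context window from a (T, D) feature matrix; edge frames are handled by replicating the first/last frame, which is an assumption rather than the paper's stated policy.

```python
import numpy as np

def context_window(feats, t, n_past, n_future):
    """Gather frames x_{t-Np}, ..., x_t, ..., x_{t+Nf} into one input
    vector (Eq. 2). feats: (T, D) matrix of feature frames."""
    idx = np.clip(np.arange(t - n_past, t + n_future + 1), 0, len(feats) - 1)
    return feats[idx].reshape(-1)  # flattened (Np + Nf + 1) * D vector

def balance(n_past, n_future):
    """Balance coefficient of Eq. (3): 1.0 = past only, 0.5 = symmetric."""
    return n_past / (n_past + n_future)
```

For instance, balance(11, 7) ≈ 0.61, i.e., an 11-1-7 window embedding more past than future frames.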

2.1 Correlation Analysis

A function that helps study the redundancy introduced by reverberation is the cross-correlation. In particular, it is interesting to compute the cross-correlation r_xy(τ) between the close-talking signal x(t) and the corresponding distant-talking sequence y(t). Let us assume that the additive noise n(t) reported in Eq. 1 is omitted here, to focus on reverberation only. It can easily be shown that:

r_xy(τ) = Σ_{l=0}^{L-1} h(l) · r_xx(τ - l)   (4)

where L and r_xx denote the finite length of the IR h(t) and the autocorrelation of the close-talking signal, respectively.
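Eq. (4) can be checked numerically by comparing the empirical cross-correlation of a clean/reverberated pair against the autocorrelation filtered by the IR. A minimal sketch of the empirical side, assuming equally long 1-D numpy arrays:

```python
import numpy as np
from scipy.signal import correlate

def xcorr(x, y, max_lag):
    """Empirical cross-correlation r_xy(tau) for tau in [-max_lag, max_lag].
    Positive lags index future samples of the reverberated signal y."""
    x = x - x.mean()
    y = y - y.mean()
    full = correlate(y, x, mode="full")       # zero lag at index len(x) - 1
    zero = len(x) - 1
    r = full[zero - max_lag: zero + max_lag + 1]
    return r / np.abs(r).max()                # normalized for plotting
```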

Figure 3: Cross- and auto-correlation analysis for the vowel /a/. Panels: (a) close-talking signal, (b) reverberated signal, (c) autocorrelation, (d) cross-correlation.
Figure 4: Cross- and auto-correlation analysis for the fricative /f/. Panels: (a) close-talking signal, (b) reverberated signal, (c) autocorrelation, (d) cross-correlation.
Figure 5: Envelope of the cross-correlation computed between clean and reverberated speech sentences using symmetric windows of 200 ms. The envelope is averaged over all the utterances of the TIMIT dataset.

The autocorrelation r_xx varies significantly according to the particular phoneme and the signal characteristics that are considered. Fig. 3(c), for instance, shows the autocorrelation of the vowel /a/, while Fig. 4(c) illustrates r_xx for the fricative /f/. One can easily observe that different autocorrelation patterns are obtained: for the vowel sound /a/, r_xx exhibits several peaks due to pitch and formants, while for /f/ a more impulse-like pattern is observed. The spread of the autocorrelation function around its center also depends on the specific phoneme. If we consider, for instance, the time instant where the energy of r_xx decays to 99.9% of its initial value, the autocorrelation length is 104 ms for the vowel /a/, and about 25 ms for the fricative /f/. In both cases, however, the duration of the autocorrelation is significantly shorter than the IR length (see the green dashed line of Fig. 3(c) and Fig. 4(c)), except in the case of a very low reverberation time. This characteristic, together with the causality of the impulse response (h(t) = 0 for t < 0), originates an asymmetric trend in the cross-correlation r_xy(τ), which can be clearly appreciated from both Fig. 3(d) and Fig. 4(d). As shown by the latter examples, corresponding to a medium-high T60 of 780 ms, the right side of this function is influenced by the IR decay. The future samples (τ > 0) are thus, on average, more redundant than past ones (τ < 0), and this effect is amplified when reverberation increases, and in correspondence of high-energy portions of speech signals (e.g., the central part of a stressed vowel).
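One plausible way to measure the autocorrelation lengths quoted above (104 ms for /a/, about 25 ms for /f/) is to find the positive lag at which the cumulative energy of r_xx reaches 99.9% of its total; the sketch below follows this reading, which is our assumption about the exact criterion:

```python
import numpy as np
from scipy.signal import correlate

def autocorr_length_ms(x, sr, energy_frac=0.999):
    """Lag (in ms) at which the cumulative energy of the positive-lag
    autocorrelation reaches the given fraction of its total."""
    x = x - x.mean()
    r = correlate(x, x, mode="full")[len(x) - 1:]  # r_xx(tau), tau >= 0
    e = np.cumsum(r ** 2)
    lag = int(np.searchsorted(e, energy_frac * e[-1]))
    return 1000.0 * lag / sr
```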

ACWs are therefore more appropriate than traditional symmetric ones, since they lead to a frame configuration less affected by the aforementioned forward correlation effects of reverberation. In other words, with an asymmetric context we can feed the DSR system with information which is, on average, more complementary than that considered in a standard symmetric frame configuration, allowing the DNN to perform more robust predictions.

As emerges from Figs. 3 and 4, the best asymmetric context window depends on the specific phoneme. However, it is also tightly related to the degree of distortion introduced by reverberation in the related phonetic context, which can in turn depend on other factors (e.g., the ratio between the energies of the direct input speech and of the reverberation component). As a matter of fact, a simple and practical solution, as outlined in the rest of this work, consists in feeding the DNN with a fixed asymmetric context configuration that, on average, works reasonably well for any phonetic context in the input speech signal, and for different environmental conditions.

This approach is often adopted within standard DNN-HMM speech recognizers, where the speech signal is progressively processed using a fixed context window that might contain different sounds. To extend our cross-correlation analysis to a more realistic setting, Fig. 5 shows the envelope of the cross-correlation function averaged over all the sentences of the TIMIT dataset [timit]. In particular, this result is obtained considering context windows of 200 ms with a time shift of 10 ms, which resembles the typical configuration used to feed DNNs under reverberated acoustic conditions, as shown later in the paper. Consistently with what emerged from the previous experiments, Fig. 5 confirms the high redundancy introduced by reverberation on the future samples.

A similar experimental evidence can be reproduced using other methods to analyze the correlation that holds among the frames inside the CW. For instance, the Pearson correlation coefficient [pearson, pearson2] could be used to highlight the redundancy inside sequences of mel-frequency cepstral coefficient (MFCC) vectors that represent reverberated speech signals.
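As a sketch of this alternative analysis (not the exact procedure used in the paper), the Pearson coefficient between the MFCC vector at time t and the one at t + lag can be averaged over an utterance:

```python
import numpy as np

def frame_redundancy(mfcc, max_lag=20):
    """Pearson correlation between MFCC frames at distance `lag`.
    High values at positive lags indicate redundant future frames.
    mfcc: (T, D) matrix of MFCC frames, with T > max_lag."""
    out = {}
    for lag in range(-max_lag, max_lag + 1):
        a = mfcc[max(0, -lag): len(mfcc) - max(0, lag)].ravel()
        b = mfcc[max(0, lag): len(mfcc) - max(0, -lag)].ravel()
        out[lag] = np.corrcoef(a, b)[0, 1]
    return out
```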

2.2 Gradient Analysis

Figure 6: Gradient norm over the frames of the context window in a close-talking scenario, computed at training epoch 1 (a) and epoch 21 (b).
Figure 7: Gradient norm over the frames of the context window in a distant-talking reverberated scenario, computed at training epoch 1 (a) and epoch 21 (b).

As far as DNN processing is concerned, it is also of great interest to understand whether the network is able to automatically assign different importance to the different frames of the CW. Useful insights can be gained by analyzing the gradient norm over the various inputs of the CW, which can be defined in this way:

g(p) = ||∂E/∂x_{t+p}||,   -Np ≤ p ≤ Nf   (5)

where E is the cost function used for DNN training and x_{t+p} is the p-th feature frame embedded in the CW. In the case of the cross-entropy cost, the gradient norm can be written as:

g(p) = (1/(M·B)) Σ_{m=1}^{M} Σ_{b=1}^{B} || ∂/∂x_{t+p} ( -Σ_{k=1}^{K} y_k^{(m,b)} log ŷ_k^{(m,b)}(C_t) ) ||   (6)

where M is the number of training mini-batches, B is the number of training samples in each mini-batch, K is the number of phone-states, while y_k and ŷ_k are the label and the DNN output of each training example, respectively. The DNN output is written here as ŷ_k(C_t) to highlight its dependency on the context window, and thus on the p-th frame. Note also that the gradient norm is averaged over all the training mini-batches in order to provide a more reliable estimation.
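Eqs. (5)-(6) can be implemented with any autodiff framework; a minimal PyTorch sketch, assuming a model fed with flattened context windows of (Np + Nf + 1)·D features, is shown below. The loader, model, and loss function are placeholders, not the paper's Kaldi-based setup.

```python
import torch

def frame_gradient_norms(model, loss_fn, loader, n_past, n_future, dim):
    """Average L2 norm of the cost gradient w.r.t. each frame of the
    context window (Eqs. 5-6). Inputs are flattened windows of
    (Np + Nf + 1) * dim features; index 0 corresponds to frame t - Np."""
    norms = torch.zeros(n_past + n_future + 1)
    n_batches = 0
    for x, y in loader:                        # x: (B, (Np+Nf+1)*dim)
        x = x.clone().requires_grad_(True)
        loss = loss_fn(model(x), y)            # e.g., cross-entropy
        (g,) = torch.autograd.grad(loss, x)
        g = g.view(g.shape[0], -1, dim)        # (B, Np+Nf+1, dim)
        norms += g.norm(dim=2).mean(dim=0)     # per-frame norm, batch mean
        n_batches += 1
    return norms / n_batches
```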

Figs. 6 and 7 show g(p) for a close-talking and a distant-talking case, respectively, computed at the first and the last training epoch. The results are derived from sequences of MFCC feature vectors, using the DIRHA-WSJ dataset [dirha_asru] with the DNN setup that will be described in Sec. 4.3.

The two figures highlight that the network is able to automatically assign more importance to the current frame (p = 0). In both cases, the gradient norm clearly decreases when progressively moving away from the current frame. However, a symmetric behavior is observed in the close-talking case only, which means that the network has no preference for past or future information. On the other hand, for reverberated speech the network learns to place more importance on past frames (p < 0) than on future ones (p > 0). This can be readily appreciated from the asymmetric trend visible in Fig. 7, which is a further indication of the possible benefits deriving from the use of ACWs.

Interestingly, the network learns which frames are more important from the very first training epoch, as evidenced by the similar trends reported in Fig. 6(a) and Fig. 7(a). This is an important experimental observation, which suggested the development of the algorithm introduced in the next section to effectively optimize the hyperparameters of the ACW.

3 Automatic context window composition

1: Train a DNN with a large symmetric context window (Np = Nf = Nmax) for one epoch.
2: Compute the gradient norm g(p), p ∈ [-Nmax, Nmax].
3: for N in range (Nmin, Nmax) do
4:     Np ← 0, Nf ← 0
5:     for i in range (N-1) do
6:         if g(-(Np+1)) > g(Nf+1) then Np ← Np + 1
7:         else Nf ← Nf + 1
8:     Train the DNN with Np past and Nf future frames.
9:     Evaluate the WER performance on the dev-set.
10:    Store {Np, Nf, WER} for the given N.
11: Choose the context window with the best performance.
Algorithm 1 Automatic context window composition (AutoCW) using gradient analysis.
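The greedy composition loop of Alg. 1 (steps 3-7) can be expressed compactly as follows; the norm profile g is assumed to be a mapping from relative frame position (negative = past) to the averaged gradient norm of Eq. (6):

```python
def compose_window(g, n_total):
    """Greedy frame selection of Alg. 1: starting from the current frame,
    repeatedly add the past or future frame with the larger gradient norm.
    g: dict mapping relative position p (negative = past) to its norm."""
    n_past, n_future = 0, 0
    for _ in range(n_total - 1):
        if g.get(-(n_past + 1), 0.0) > g.get(n_future + 1, 0.0):
            n_past += 1
        else:
            n_future += 1
    return n_past, n_future

# Toy example: a norm profile skewed toward the past (as in Fig. 7)
g = {p: max(0.0, 1.0 - 0.06 * abs(p) - 0.04 * max(p, 0))
     for p in range(-12, 13)}
print(compose_window(g, 19))   # -> (11, 7), i.e., an 11-1-7 window
```

With a norm profile skewed toward the past, the selection naturally converges to configurations such as 11-1-7, in line with the experimental results reported below.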

The characteristics of the context window are of paramount importance for improving the ASR performance. Particular attention should thus be devoted to deriving a proper frame configuration, carefully optimizing (on the development set) the main features of the context window (i.e., Np and Nf).

A major limitation of the ACW is that it introduces two hyperparameters (i.e., Np and Nf), while only one (i.e., the total length N of the context) is needed for standard symmetric contexts. The additional hyperparameter has a dramatic impact on the number of combinations to test during the optimization step. A grid search over a single hyperparameter, in fact, has a linear complexity O(N), while the joint optimization of both Np and Nf has a quadratic complexity O(N²). For instance, if we consider a SCW with a total length that varies from 11 to 25 frames, only 15 DNN training experiments are necessary, against the 270 required by an exhaustive grid search over all (Np, Nf) splits. It is thus of great interest to develop a methodology to optimize the hyperparameters of the ACW more efficiently.
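The 15-versus-270 figure follows directly from counting: one run per total length N for the SCW, and N runs per length for the exhaustive grid, since choosing Np = 0, ..., N-1 fixes Nf = N - 1 - Np:

```python
# One run per length for SCW; N runs per length for the exhaustive grid.
lengths = range(11, 26)
print(len(lengths))      # 15 experiments for SCW
print(sum(lengths))      # 270 experiments for the full (Np, Nf) grid
```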

The approach proposed in this paper is based on the gradient norm analysis introduced in the previous section. The norm of the gradient over the various frames, in fact, quickly gives an idea of which frames are considered important by the network. Based on this observation, we propose the algorithm referred to as AutoCW (Alg. 1) to automatically compose the CW. The idea is to first train a DNN with a very large SCW (e.g., 25 frames) for a single epoch. After the first epoch, the gradient norm over the various input frames is computed. The CW is then composed by progressively embedding, at each iteration, the past or future frame that maximizes the gradient norm. The cycle is stopped when the predefined number N of context frames has been reached. A new DNN can then be trained with the CW determined by this procedure, and the corresponding ASR performance is evaluated on a development data set. This operation is repeated for all the CW lengths within a predefined range (Nmin, Nmax), after which the frame configuration providing the best ASR performance is selected.

Note that this algorithm allows one to optimize the frame configuration of the asymmetric context window with a linear computational complexity, comparable to that required for standard symmetric windows. For each context window length N, in fact, the configuration (Np, Nf) is automatically inferred from the gradient norm profile, allowing one to avoid exploring the full set of context configurations. Similarly to the SCW case, if we consider context window lengths ranging from Nmin = 11 to Nmax = 25 frames, only 15 DNN training experiments are necessary to find a proper context window.

4 Experimental Setup

The experimental framework developed in this work is based on the use of both WSJ-5k and LibriSpeech tasks. To provide an accurate analysis of the proposed approach, the experiments are performed under three different acoustic conditions of increasing complexity: close-talking (Clean), distant-talking with reverberation (Rev), and distant-talking with both noise and reverberation (Rev&Noise). The corpora used for each experimental condition are summarized in Table 1 and described in the two following sections. The adopted ASR setup will be described in Sec. 4.3.

Acoustic Condition Training Test
Close-talking (Clean) WSJ-clean DIRHA-WSJ-clean
Close-talking (Clean) LibriSpeech LibriSpeech
Distant-talking (Rev) WSJ-rev DIRHA-WSJ-rev
Distant-talking (Rev) Rev-LibriSpeech Rev-LibriSpeech
Distant-talking (Rev&Noise) WSJ-rev DIRHA-WSJ-rev&noise
Table 1: List of the experimental tasks considered in this work with the related training and test datasets.

4.1 Close-talking experiments

For close-talking experiments, we consider the standard WSJ dataset (i.e., WSJ-clean) for training, and the close-talking portion of the DIRHA English WSJ dataset (i.e., DIRHA-WSJ-clean) for testing. The latter dataset was acquired during the DIRHA project in a recording studio of FBK, using professional equipment to obtain high-quality speech material [dirha_asru]. In this work, we used a subset of the corpus, consisting of 409 WSJ sentences (with the same texts used for the CHiME challenge [chime3]) uttered by six US speakers (three males and three females).

To evaluate the proposed method on a larger-scale ASR task, additional experiments were performed with the LibriSpeech dataset [librispeech], which is based on speech material derived from read audio-books. In particular, we used a training subset consisting of 460 hours of speech uttered by 1172 speakers.

4.2 Distant-talking experiments

The reference environment for several experiments conducted in this study is the living-room of a real apartment (available under the DIRHA project) with a reverberation time of about 750 ms. The living-room was equipped with a microphone network composed of 40 microphones. An IR measurement session, exploring a large number of positions and orientations of the sound source, was conducted in the aforementioned target environment with the purpose of generating realistic simulated data. More information on the adopted IR estimation procedure can be found in [rav_is16, ravanelli].

A set of experiments is carried out to study distant-talking conditions where only reverberation acts as a source of disturbance (Rev). In this case, training is performed using a contaminated dataset (i.e., WSJ-rev), which is generated by convolving the original WSJ-clean data set with a set of three IRs chosen from the aforementioned collection. The corresponding test data set, i.e. DIRHA-WSJ-rev, is based on a contaminated version of DIRHA-WSJ-clean. In order to simulate several speaker positions and orientations, a set of 36 IRs (different from those used for training) is used for the latter dataset.

To explore more challenging conditions characterized by both noise and reverberation (Rev&Noise), real recordings have also been performed. The real recordings, referred to as DIRHA-WSJ-rev&noise, are part of the recently-released DIRHA English WSJ corpus [dirha_asru] and are composed of 409 WSJ sentences (with the same texts used to record DIRHA-WSJ-clean) uttered by six US speakers. Each subject read a set of WSJ sentences from a tablet, standing still or sitting on a chair. Every 11-12 sentences, he/she was asked to move to a new position and take another orientation. Different typologies of non-stationary domestic noise affect the signals (e.g., vacuum cleaner, microwave noise, interfering speakers talking in other rooms, kitchen tools, open-window noise, etc.), resulting in an average SNR of about 10 dB (for more details see [dirha_asru]; this dataset is distributed by the Linguistic Data Consortium, LDC).

To test our approach in different contexts, other contaminated versions of the training and test data are generated with different IRs (either measured in other real environments, or computed with the image method [image]), as discussed in Sec. 5.2 and Sec. 5.3.

Other experiments are performed with a reverberated version of the LibriSpeech dataset [librispeech]. The original close-talking sentences are convolved with 2145 IRs, measured in various positions and with different microphone configurations of the aforementioned living-room. The two test sets (here denoted as Test1 and Test2) are composed of 2620 sentences uttered by 40 speakers, and 2939 sentences uttered by 33 speakers, respectively. The test sentences are convolved with about 2000 IRs, corresponding to speaker positions and microphones different from those used for training. Note that the test data of the LibriSpeech corpus are originally clustered so that lower-WER speakers are gathered into Test1, while the others are in Test2.

4.3 DNN and ASR setup

In this work, we use a context-dependent DNN-HMM speech recognizer, where every unit is modeled by a three-state left-to-right HMM, and the tied-state observation probabilities are estimated through a DNN.

Feature extraction is based on splitting the signal into frames of 25 ms with an overlap of 10 ms. The experimental activity is conducted considering different acoustic features, i.e., 39 MFCCs (13 static coefficients plus first and second derivatives), 40 log-mel filter-bank features (FBANKS), as well as 40 fMLLR features (extracted as reported in the s5 recipe of Kaldi [kaldi]). Features of consecutive frames are gathered into both symmetric and asymmetric observation windows. As for MFCCs, it is worth mentioning that one could conduct this study without using derivatives. In that case, the experimental results would be qualitatively similar: we would obtain a trend that reflects what is reported in the following section, though with a more prominent relative performance decrease when adopting non-optimal context length settings, because the contextual information would be exploited less effectively. For this reason, here we prefer to report results obtained with the first and second order derivatives.
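For illustration, a 39-dimensional MFCC front-end (13 static coefficients plus Δ and ΔΔ) with 25 ms frames and a 10 ms shift can be sketched with librosa; this is a stand-in for convenience, not the Kaldi pipeline actually used in the experiments, and "utterance.wav" is a hypothetical file:

```python
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)           # hypothetical file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=int(0.025 * sr),        # 25 ms frames
                            hop_length=int(0.010 * sr))   # 10 ms shift
feats = np.vstack([mfcc,
                   librosa.feature.delta(mfcc),            # first derivative
                   librosa.feature.delta(mfcc, order=2)])  # second derivative
print(feats.shape)                                         # (39, T)
```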

WSJ experiments are based on DNNs composed of six sigmoid-based hidden layers of 2048 neurons, trained with the Kaldi toolkit [kaldi] (Karel's recipe). Weights are initialized with the standard Glorot initialization [xavier], while biases are initialized to zero. Training is performed with Stochastic Gradient Descent (SGD), which optimizes the cross-entropy loss function. The training evolution is monitored using a small validation set (10% of the training data) that is randomly extracted from the training corpus. The performance on the validation set is checked after each epoch to perform learning rate annealing and to decide the stopping condition. In particular, the initial learning rate is kept fixed at 0.008 as long as the increment of the frame accuracy on the validation set is higher than 0.5. For the following epochs, the learning rate is halved until the increment of frame accuracy falls below the stopping threshold of 0.1. The labels for DNN training are derived from an alignment on the tied states, which is performed with a previously-trained HMM-GMM acoustic model [kaldi]. For Convolutional Neural Network (CNN) experiments, we replace the first two fully-connected layers of the above-mentioned DNN with two convolutional layers based on 128 and 256 filters, respectively.

LibriSpeech experiments rely on the standard implementation of Kaldi, which employs a generalized maxout network (p-norm). In particular, our experiments are based on a four-hidden-layer p-norm architecture trained for 10 epochs with mini-batches of size 128. The initial learning rate is set to 0.01, while the final one is 0.001. See [pnorm] and the Kaldi recipe for more details.

5 ASR Results

In the following, we report the experimental results obtained on the addressed ASR tasks. In Sec. 5.1, a comparison between SCWs and ACWs is conducted considering different context configurations, input features as well as DNN architectures. In Sec. 5.2, we test the performance of the proposed algorithm with different recognition tasks and real acoustic environments, while in Sec. 5.3 we extend the speech recognition validation by simulating different reverberation times.

5.1 Reverberant speech recognition with asymmetric context windows

From the preliminary study on ACWs carried out in the previous sections, we found that the training of distant-talking DNNs tends to naturally attribute more importance to past than to future frames. In this section, we take a step forward by verifying whether this fact is also observed in terms of recognition performance. With this purpose, Fig. 8 shows the Word Error Rate (WER) results obtained in close-talking (Clean) and reverberant (Rev) conditions, when using fully asymmetric (i.e., single-side) context windows of different lengths. The negative x-axis refers to the progressive integration of past frames only (γ = 100%), while the positive x-axis refers to future frames only (γ = 0%). In this set of experiments, fMLLR features were used as input to the DNN, for both the DIRHA-WSJ-clean and DIRHA-WSJ-rev tasks. Note that similar trends have been obtained with both MFCC and FBANK features.

Figure 8: WER(%) obtained with DNN context windows that progressively integrate only past or future frames (using fMLLR features). Panel (a): close-talking scenario (Clean), DIRHA-WSJ-clean task; panel (b): distant-talking scenario (Rev), DIRHA-WSJ-rev task.

Results highlight that a rather symmetric behavior is attained in the close-talking case (Fig. 8(a)), reiterating that in such contexts past and future information provides a similar contribution to the system performance. Differently, the role of past information is significantly more important in the distant-talking case, since a faster decrease of the WER(%) is observed when past frames are progressively concatenated (Fig. 8(b)). This result is in line with the findings that emerged in the previous sections, and it confirms that an ACW is more suitable than a traditional symmetric one when reverberation arises.

In the previous experiment, we tested only fully asymmetric windows with γ = 0% (future frames only) or γ = 100% (past frames only). However, it is worth addressing hybrid configurations, where both past and future frames are considered. With this purpose, Fig. 9 compares this kind of asymmetric window under both close-talking and distant-talking conditions, using contexts of different durations. For each CW length, the asymmetric CW curve represents the best ASR performance among all the configurations obtained by varying the balance factor γ. Fig. 10 shows the results obtained on the LibriSpeech task, adopting the CW lengths that turned out to be optimal for the DIRHA-WSJ task (i.e., 11 in the close-talking condition and 19 in the reverberated case).

Figure 9: Comparison between SCW and ACW under close-talking (a) and distant-talking reverberated (b) conditions (using fMLLR features). Results refer to the DIRHA-WSJ-clean (a) and DIRHA-WSJ-rev (b) tasks.
Figure 10: WER(%) obtained with different context configurations for the close-talking (a) and reverberated (b) versions of LibriSpeech (Test1, fMLLR features).

From the close-talking experiments, it emerges that the standard SCW slightly outperforms the best asymmetric one, as clearly highlighted by Fig. 9(a). This trend is also confirmed in Fig. 10(a), where different context windows of length 11 have been tested on the close-talking version of LibriSpeech. In both cases, the gap between symmetric and asymmetric contexts is not large (on average less than 2% relative), suggesting the use of an ACW in close-talking scenarios only when real-time/low-latency constraints arise.

Differently, Fig. 9(b) shows that the asymmetric window consistently outperforms the standard symmetric one in the distant-talking case, for all the considered context durations. On average, about 5% relative WER decrease is obtained with, essentially, no additional computational cost. This result is also confirmed by Fig. 10(b), which reports the performance obtained on the reverberated LibriSpeech for various CW settings. This figure not only shows that an asymmetric window embedding more past than future frames is a proper choice when reverberation arises, but it also highlights that the opposite setting (i.e., embedding more future than past frames) leads to a rather significant loss of performance.

Previous experiments were based on fMLLR features. In Table 2 we extend the experimental validation to other acoustic features, such as FBANK and MFCC coefficients. We also consider CNNs as an alternative to the fully-connected DNNs used so far.

Architecture Features SCW (9-1-9) ACW (11-1-7)
DNN fMLLR 15.2 14.8
DNN MFCC 21.8 20.8
DNN FBANK 20.7 20.2
CNN FBANK 18.5 18.1
Table 2: Comparison between the WERs(%) achieved with SCWs and ACWs, when different features and DNN architectures are used.

Results confirm that the ACW outperforms the symmetric one in all the considered settings. The last row of Table 2 also highlights an interesting performance improvement achieved with CNNs. CNNs are based on local connectivity, weight sharing, and pooling operations that allow them to exhibit some invariance to small feature shifts along the frequency axis, with well-known benefits against speaker and environment variations [cnn1]. Hence, they represent a valid alternative to fully-connected DNNs, also when jointly used with ACWs under reverberant conditions.

(a) Fully asymmetric context
(b) Symmetric vs asymmetric CW.
Figure 11: Comparison between SCW and ACW under mismatched conditions. Training is performed using reverberated data (using WSJ-rev), while test material is corrupted by both noise and reverberation (DIRHA-WSJ-rev&noise). fMLLR features are used in this experiment.

To study the effectiveness of asymmetric contexts under mismatched conditions (which often arise in real applications), we now train the DSR system with reverberated data (Rev) and test it on real signals (DIRHA-WSJ-rev&noise) affected by both noise and reverberation. Fig. 11(a) shows the results obtained when fully asymmetric CWs are adopted. Fig. 11(b), instead, compares symmetric and optimal asymmetric windows of different CW lengths. Due to the more challenging conditions characterizing this test, the WER(%) is significantly worse than that reported in Fig. 8(b) and Fig. 10. However, it is worth noting that the benefits deriving from the use of ACWs are maintained even in the addressed mismatched case.

5.2 ASR experiments with automatic context window composition

As discussed in Sec. 3, the hyperparameters Np and Nf of the ACW can be derived by applying AutoCW (Alg. 1). In this section, we conduct a set of experiments to evaluate the loss of performance it introduces when compared to the ideal (and computationally expensive) conditions under which the previous experiments (e.g., in Fig. 11(b)) were performed. Let us recall that, in the latter cases, a grid optimization was performed over all the possible CW combinations.

The first row of Table 3 shows the results obtained under the aforementioned mismatched condition. The best performance, 27.2% WER, is obtained using an optimal CW of 11-1-7, i.e., an overall length of 19 frames. It is however worth noting that applying AutoCW leads to a very similar combination, i.e., 12-1-6, which corresponds to a comparable recognition performance, i.e., 27.3% WER. The last two rows, instead, report the results achieved with the reverberated version of LibriSpeech. In this case, the proposed algorithm provides a CW setting that coincides with the optimal choice. For both recognition tasks, we can also observe that applying AutoCW leads to a 2-3% relative WER reduction over the SCW.

SCW ACW (opt) AutoCW
DIRHA-WSJ-rev&noise CW 8-1-8 11-1-7 12-1-6
DIRHA-WSJ-rev&noise WER 27.9 27.2 27.3
Rev-LibriSpeech (Test1) CW 9-1-9 11-1-7 11-1-7
Rev-LibriSpeech (Test1) WER 22.1 21.4 21.4
Rev-LibriSpeech (Test2) CW 9-1-9 11-1-7 11-1-7
Rev-LibriSpeech (Test2) WER 51.3 50.1 50.1
Table 3: Comparison between the WER(%) obtained with the SCW, the optimal asymmetric window (ACW opt), and the context configuration derived by our algorithm (AutoCW). The experiments are performed with fMLLR features.

Another set of experiments concerns a different kind of mismatch that occurs when training and test are performed in different acoustic environments. As reported in Table 4, training was carried out in the DIRHA living-room, using the WSJ-rev corpus, while test is performed in three different contexts, i.e., an office, a surgery room, as well as a room of another apartment. The test data were generated following the approach described in Sec. 4.2 for DIRHA-WSJ-rev, but using real IRs that were measured in the aforementioned environments.

SCW (opt) ACW (opt) AutoCW
9-1-9 11-1-7 12-1-6
Office (T60 = … ms) 16.6 16.2 16.2
Home (T60 = … ms) 19.5 19.1 19.3
Surgery Room (T60 = … ms) 21.4 20.3 20.5
Table 4: WER(%) obtained with SCWs and ACWs in different acoustic environments under mismatched conditions. Training is performed in the DIRHA living-room (T60 = 750 ms) using WSJ-rev, while testing is performed in acoustic environments with different reverberation times.

Results show that the use of the ACW brings advantages in terms of ASR performance under all the tested conditions, even when training and test are performed in different acoustic environments. Moreover, the application of AutoCW provides a CW composition (12-1-6) very similar to the optimal one, which corresponds to a 2-3% relative WER reduction compared to the performance obtained using the SCW.

5.3 Performance analysis with different reverberation times

As pointed out above, the application of AutoCW can have a different impact according to the reverberant conditions under which training and test are performed. Concerning this, we further extended our validation by simulating acoustic environments with increasing reverberation times T60. For this study, a set of IRs simulated with the image method [image] were used to contaminate both the training (WSJ-clean) and test (DIRHA-WSJ-clean) corpora. Table 5 summarizes the results obtained with T60 ranging from 0 ms (close-talking reference) to 1000 ms.

T60 (ms) SCW ACW (opt) AutoCW
0 CW 5-1-5 6-1-4 5-1-5
0 WER 3.6 3.7 3.6
250 CW 5-1-5 6-1-4 6-1-4
250 WER 5.5 5.1 5.1
500 CW 6-1-6 7-1-5 8-1-4
500 WER 9.1 8.5 8.7
750 CW 9-1-9 12-1-6 11-1-7
750 WER 15.2 14.8 14.9
1000 CW 12-1-12 18-1-6 19-1-5
1000 WER 20.5 20.1 20.1
Table 5: Comparison between the WER(%) obtained with SCW and ACW under different reverberation conditions. The last column reports the results obtained with the proposed AutoCW algorithm.

As expected, results show that the performance progressively degrades as T60 increases. More interestingly, the asymmetric window is able to overtake the standard symmetric one in all the explored reverberant conditions. It is also worth noting that larger contexts are needed when increasing the reverberation time, as highlighted in Fig. 12(a). For instance, when T60 = 250 ms the optimal window integrates only 11 frames, while 25 frames are necessary when T60 = 1000 ms. Interestingly enough, the coefficient γ, which measures the degree of asymmetry of the CW, increases with the reverberation time (see Fig. 12(b)). This means that the reverberation effects significantly reduce the usefulness of future frames for large T60 values, which makes more asymmetric context windows convenient. It is worth noting that the proposed algorithm provides nearly optimal contexts, leading to a negligible performance reduction over the best CW for all the considered reverberation times. Under close-talking conditions (T60 = 0 ms), AutoCW correctly derives a symmetric context window of 11 frames. Similarly to the optimal case, the proposed method correctly provides longer and more asymmetric context windows when reverberation increases.

Figure 12: Main features of the optimal context window for different reverberation times. Panel (a): length of the context window vs T60; panel (b): γ vs T60.

6 Conclusions

In this paper, we extensively studied the role played by ACWs to counteract the adverse effects of reverberation in a distant speech recognizer. Under these environmental conditions, this windowing mechanism has proven to be a viable alternative to a more standard symmetric context. The asymmetric window, in fact, feeds the DNN with a more convenient frame configuration which carries, on average, information that is less redundant and less affected by the correlation effects introduced by reverberation.

To optimize the characteristics of the asymmetric context window, this work proposed a novel algorithm that analyzes the norm of the DNN gradients over the various input frames. The algorithm, tested on different tasks, datasets, and environments, turned out to derive nearly optimal windows under different acoustic conditions. Our method, which has a linear computational complexity, is significantly more efficient than the traditional grid search over all the possible frame configurations, which has a quadratic complexity.

An open issue is the flexibility of the proposed approach in tackling possible changes of the reverberant conditions. This issue can be addressed in several possible ways, for instance by combining the current solution with a pre-processing step that performs a preliminary environmental classification, aiming to select in real time the most suitable asymmetric context as well as the related neural network.

Overall, the use of the ACW and of AutoCW turns out to represent a simple and effective approach to improve DSR performance under noisy and reverberant conditions, in particular with medium-high reverberation times, and for the development of real-time low-complexity applications.

References


  • (1) M. Wölfel, J. McDonough, Distant Speech Recognition, Wiley, 2009.
  • (2) S. Renals, T. Hain, H. Bourlard, Interpretation of Multiparty Meetings the AMI and Amida Projects, in: Proc. of HSCMA, 2008, pp. 115–118.
  • (3) M. Omologo, A prototype of distant-talking interface for control of interactive TV, in: Proc. of ASILOMAR, 2010, pp. 1711–1715.
  • (4) B. Lecouteux, M. Vacher, F. Portet, Distant Speech Recognition in a Smart Home: Comparison of Several Multisource ASRs in Realistic Conditions, in: Proc. of Interspeech, 2013, pp. 2273–2276.
  • (5) I. Rodomagoulakis, P. Giannoulis, Z. I. Skordilis, P. Maragos, G. Potamianos, Experiments on far-field multichannel speech processing in smart homes, in: Proc. of DSP, 2013, pp. 1–6.
  • (6) L. Cristoforetti, M. Ravanelli, M. Omologo, A. Sosi, A. Abad, M. Hagmueller, P. Maragos, The DIRHA simulated corpus, in: Proc. of LREC, 2014, pp. 2629–2634.
  • (7) M. Ravanelli, L. Cristoforetti, R. Gretter, M. Pellin, A. Sosi, M. Omologo, The DIRHA-English corpus and related tasks for distant-speech recognition in domestic environments, in: Proc. of ASRU, 2015, pp. 275–282.
  • (8) J. Barker, R. Marxer, E. Vincent, S. Watanabe, The third CHiME Speech Separation and Recognition Challenge: Dataset, task and baselines, in: Proc. of ASRU, 2015, pp. 504–511.
  • (9) K. Kinoshita, et al., The REVERB challenge: A Common Evaluation Framework for Dereverberation and Recognition of Reverberant Speech, in: Proc. of WASPAA, 2013, pp. 1–4.
  • (10) M. Harper, The Automatic Speech recognition In Reverberant Environments (ASpIRE) challenge, in: Proc. of ASRU, 2015, pp. 547–554.
  • (11) I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, 2016, http://www.deeplearningbook.org.
  • (12) D. Yu, L. Deng, Automatic Speech Recognition - A Deep Learning Approach, Springer, 2015.
  • (13) G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, B. Kingsbury, Deep neural networks for acoustic modeling in speech recognition, Signal Processing Magazine.
  • (14) N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, Journal of Machine Learning Research 15 (2014) 1929–1958.
  • (15) S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, in: Proc. of ICML, 2015, pp. 448–456.
  • (16) D. Kingma, J. Ba, Adam: A method for stochastic optimization, in: Proc. of ICLR, 2015.
  • (17) O. Abdel-Hamid, A. R. Mohamed, H. Jiang, L. Deng, G. Penn, D. Yu, Convolutional neural networks for speech recognition, IEEE/ACM Transactions on Audio, Speech, and Language Processing 22 (10) (2014) 1533–1545.
  • (18) V. Peddinti, D. Povey, S. Khudanpur, A time delay neural network architecture for efficient modeling of long temporal contexts, in: Proc of Interspeech, 2015, pp. 3214–3218.
  • (19) Y. Zhang, G. Chen, D. Yu, K. Yao, S. Khudanpur, J. R. Glass, Highway long short-term memory RNNS for distant speech recognition, in: Proc. of ICASSP, 2016, pp. 5755–5759.
  • (20) K. Cho, B. van Merrienboer, D. Bahdanau, Y. Bengio, On the properties of neural machine translation: Encoder-decoder approaches, in: Proc. of SSST, 2014.
  • (21) S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (8) (1997) 1735–1780.
  • (22) M. Ravanelli, P. Brakel, M. Omologo, Y. Bengio, Improving speech recognition by revising gated recurrent units, in: Proc. of Interspeech, 2017.
  • (23) A. Graves, N. Jaitly, A. Mohamed., Hybrid speech recognition with Deep Bidirectional LSTM, in: Proc of ASRU, 2013.
  • (24) G. Chen, C. Parada, G. Heigold, Small-footprint keyword spotting using deep neural networks, in: Proc. of ICASSP, 2014, pp. 4087–4091.
  • (25) Y. Wang, J. Li, Y. Gong, Small-footprint high-performance deep neural network-based speech recognition using split-VQ, in: Proc. of ICASSP, 2015, pp. 4984–4988.
  • (27) T. N. Sainath, C. Parada, Convolutional neural networks for small-footprint keyword spotting, in: Proc. of Interspeech, 2015, pp. 4087–4091.
  • (28) T. Hori, S. Araki, T. Yoshioka, M. Fujimoto, S. Watanabe, T. Oba, A. Ogawa, K. Otsuka, D. Mikami, K. Kinoshita, T. Nakatani, A. Nakamura, J. Yamato, Low-latency real-time meeting recognition and understanding using distant microphones and omni-directional camera, IEEE Transactions on Audio, Speech, and Language Processing 20 (2) (2012) 499–513.
  • (29) X. Lei, A. Senior, A. Gruenstein, J. Sorensen, Accurate and compact large vocabulary speech recognition on mobile devices., in: Proc. of Interspeech, 2013, pp. 662–665.
  • (30) A. K. Dhaka, G. Salvi, Semi-supervised learning with sparse autoencoders in phone classification, CoRR, http://arxiv.org/abs/1610.00520.
  • (31) V. Peddinti, G. Chen, D. Povey, S. Khudanpur, Reverberation robust acoustic modeling using i-vectors with time delay neural networks., in: Proc. of Interspeech, 2015, pp. 2440–2444.
  • (32) M. Ravanelli, M. Omologo, Contaminated speech training methods for robust DNN-HMM distant speech recognition, in: Proc. of Interspeech, 2015, pp. 756–760.
  • (33) H. Kuttruff, Room Acoustics, 5th Edition, Spon Press, 2009.
  • (34) J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, N. L. Dahlgren, DARPA TIMIT Acoustic Phonetic Continuous Speech Corpus CDROM (1993).
  • (35) K. Pearson, Mathematical Contributions to the Theory of Evolution. III. Regression, Heredity, and Panmixia, Philosophical Transactions of the Royal Society of London 187 (1896) 253–318.
  • (36) J. Benesty, J. Chen, Y. Huang, On the Importance of the Pearson Correlation Coefficient in Noise Reduction, IEEE Transactions on Audio, Speech, and Language Processing 16 (4) (2008) 757–765.
  • (37) V. Panayotov, G. Chen, D. Povey, S. Khudanpur, Librispeech: An asr corpus based on public domain audio books, in: Proc. of ICASSP, 2015, pp. 5206–5210.
  • (38) M. Ravanelli, P. Svaizer, M. Omologo, Realistic multi-microphone data simulation for distant speech recognition, in: Proc. of Interspeech, 2016, pp. 2786–2790.
  • (39) M. Ravanelli, A. Sosi, P. Svaizer, M. Omologo, Impulse response estimation for robust speech recognition in a reverberant environment, in: Proc. of EUSIPCO, 2012, pp. 1668–1672.
  • (40) J. Allen, D. Berkley, Image method for efficiently simulating small-room acoustics, J. Acoust. Soc. Am. 65 (4) (1979) 943–950.
  • (41) D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, K. Vesely, The Kaldi Speech Recognition Toolkit, in: Proc. of ASRU, 2011.
  • (42) X. Glorot, Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, in: Proc. of AISTATS, 2010, pp. 249–256.
  • (43) X. Zhang, J. Trmal, D. Povey, S. Khudanpur, Improving deep neural network acoustic models using generalized maxout networks, in: Proc. of ICASSP, 2014, pp. 215–219.