Single-Channel Multi-talker Speech Recognition with Permutation Invariant Training
Although great progress has been made in automatic speech recognition (ASR), significant performance degradation is still observed when recognizing multi-talker mixed speech. In this paper, we propose and evaluate several architectures to address this problem under the assumption that only a single channel of mixed signal is available. Our technique extends permutation invariant training (PIT) by introducing a front-end feature separation module trained with the minimum mean square error (MSE) criterion and a back-end recognition module trained with the minimum cross entropy (CE) criterion. More specifically, during training we compute the average MSE or CE over the whole utterance for each possible utterance-level output-target assignment, pick the one with the minimum MSE or CE, and optimize for that assignment. This strategy elegantly solves the label permutation problem observed in deep learning based multi-talker mixed speech separation and recognition systems. The proposed architectures are evaluated and compared on an artificially mixed AMI dataset with both two- and three-talker mixed speech. The experimental results indicate that our proposed architectures can cut the word error rate (WER) by 45.0% and 25.0% relative to the state-of-the-art single-talker speech recognition system across all speakers when their energies are comparable, for two- and three-talker mixed speech, respectively. To our knowledge, this is the first work on multi-talker mixed speech recognition on the challenging speaker-independent spontaneous large vocabulary continuous speech task.
Thanks to the significant progress made in recent years , ASR systems have now surpassed the threshold for adoption in many real-world scenarios and enabled services such as Microsoft Cortana, Apple’s Siri and Google Now, where close-talk microphones are commonly used.
However, the current ASR systems still perform poorly when far-field microphones are used. This is because many difficulties hidden by close-talk microphones now surface under distant recognition scenarios. For example, the signal to noise ratio (SNR) between the target speaker and the interfering noises is much lower than that when close-talk microphones are used. As a result, the interfering signals, such as background noise, reverberation, and speech from other talkers, become so distinct that they can no longer be ignored.
In this paper, we aim to solve the speech recognition problem when multiple talkers speak at the same time and only a single channel of mixed speech is available. Many attempts have been made to attack this problem. Before the deep learning era, the most famous and effective model was the factorial GMM-HMM , which outperformed humans in the 2006 monaural speech separation and recognition challenge . The factorial GMM-HMM, however, requires the test speakers to be seen during training so that the interactions between them can be properly modeled. Recently, several deep learning based techniques have been proposed to solve this problem . The core issue that these techniques try to address is the label ambiguity or permutation problem (refer to Section 3 for details).
In Weng et al.  a deep learning model was developed to recognize the mixed speech directly. To solve the label ambiguity problem, Weng et al. assigned the senone labels of the talker with higher instantaneous energy to output one and those of the other talker to output two. Although this addresses the label ambiguity problem, it causes frequent speaker switches across frames. To deal with the speaker switch problem, a two-speaker joint-decoder with a speaker switching penalty was used to trace speakers. This approach has two limitations. First, energy, which is manually picked, may not be the best information for assigning labels under all conditions. Second, the frame switching problem places an extra burden on the decoder.
In Hershey et al.  the multi-talker mixed speech is first separated into multiple streams. An ASR engine is then applied to these streams independently to recognize speech. To separate the speech streams, they proposed a technique called deep clustering (DPCL). They assume that each time-frequency bin belongs to only one speaker and can be mapped into a shared embedding space. The model is optimized so that in the embedding space the time-frequency bins belonging to the same speaker are closer together and those of different speakers are farther apart. During evaluation, a clustering algorithm is first applied to the embeddings to generate a partition of the time-frequency bins, and the separated audio streams are then reconstructed based on the partition. In this approach, speech separation and recognition are usually two separate components.
Chen et al.  proposed a similar technique called deep attractor network (DANet). Following DPCL, their approach also learns a high-dimensional embedding of the acoustic signals. Different from DPCL, however, it creates cluster centers, called attractor points, in the embedding space to pull together the time-frequency bins corresponding to the same source. The main limitation of DANet is the requirement to estimate attractor points during evaluation time and to form frequency-bin clusters based on these points.
In Yu et al.  and Kolbæk et al., a simpler yet equally effective technique named permutation invariant training (PIT) was proposed for separating multiple speech streams.
Moreover, most previous work on multi-talker speech still focuses on speech separation . In contrast, multi-talker speech recognition is much harder and has received less attention. There have been some attempts, but the related tasks are relatively simple. For example, the 2006 monaural speech separation and recognition challenge  was defined on a speaker-dependent, small vocabulary, constrained language model setup, while in  a small vocabulary reading style corpus was used. We are not aware of any extensive research work on the more realistic, speaker-independent, spontaneous large vocabulary continuous speech recognition (LVCSR) on multi-talker mixed speech before our work.
In this paper, we attack the multi-talker mixed speech recognition problem with a focus on the speaker-independent setup given just a single channel of the mixed speech. Different from , here we extend and redefine PIT over log filter bank features and/or senone posteriors. In some architectures PIT is defined upon the minimum mean square error (MSE) between the true and estimated individual speaker features to separate speech at the feature level (called PIT-MSE from now on). In other architectures, PIT is defined upon the cross entropy (CE) between the true and estimated senone posterior probabilities to recognize multiple streams of speech directly (called PIT-CE from now on). Moreover, the PIT-MSE based front-end feature separation can be combined with the PIT-CE based back-end recognition in a joint optimization architecture. We evaluate our architectures on artificially generated AMI data with both two- and three-talker mixed speech. The experimental results demonstrate that our proposed architectures are very promising.
The rest of the paper is organized as follows. In Section 2 we describe the speaker independent multi-talker mixed speech recognition problem. In Section 3 we propose several PIT-based architectures to recognize multi-streams of speech. We report experimental results in Section 4 and conclude the paper in Section 5.
2 Single-Channel Multi-Talker Speech Recognition
In this paper, we assume that a linearly mixed single-microphone signal $y[n] = \sum_{s=1}^{S} x_s[n]$ is known, where $x_s[n]$, $s = 1, \cdots, S$, are $S$ streams of speech sources from different speakers. Our goal is to separate these streams and recognize every single one of them. In other words, the model needs to generate $S$ output streams, one for each source, at every time step. However, given only the mixed speech $y[n]$, the problem of recognizing all streams is under-determined because there are an infinite number of possible $x_s[n]$ (and thus recognition results) combinations that lead to the same $y[n]$. Fortunately, speech is not a random signal. It has patterns that we may learn from a training set of pairs $y$ and $\ell^s$, $s = 1, \cdots, S$, where $\ell^s$ is the senone label sequence for stream $s$.
In the single speaker case, i.e., $S = 1$, the learning problem is significantly simplified because there is only one possible recognition result, thus it can be cast as a simple supervised optimization problem. Given the input $\mathbf{Y}$ to the model, which is some feature representation of $y$, the output is simply the senone posterior probability conditioned on the input. As in most classification problems, the model can be optimized by minimizing the cross entropy between the senone label and the estimated posterior probability.
When $S$ is greater than $1$, however, it is no longer as simple and direct as in the single-talker case and the label ambiguity or permutation becomes a problem in training. In the case of two speakers, because speech sources are symmetric given the mixture (i.e., $x_1 + x_2$ equals $x_2 + x_1$ and both $x_1$ and $x_2$ have the same characteristics), there is no predetermined way to assign the correct target to the corresponding output layer. Interested readers can find additional information in  on how training progresses to nowhere when the conventional supervised approach is used for multi-talker speech separation.
3 Permutation Invariant Training for Multi-Talker Speech Recognition
To address the label ambiguity problem, we propose several architectures based on permutation invariant training (PIT)  for multi-talker mixed speech recognition. For simplicity and without loss of generality, we always assume there are two talkers in the mixed speech when describing our architectures in this section.
Note that DPCL  and DANet  are alternative solutions to the label ambiguity problem when the goal is speech source separation. However, these two techniques cannot be easily applied to direct recognition (i.e., without first separating speech) of multiple streams of speech because of the clustering step required during separation, and the assumption that each time-frequency bin belongs to only one speaker (which is false when the CE criterion is used).
3.1 Feature Separation with Direct Supervision
To recognize the multi-talker mixed speech, one straightforward approach is to estimate the features of each speech source given the mixed speech feature and recognize them one by one using a normal single-talker LVCSR system. This idea is depicted in Figure ? where we learn a model to recover the filter bank (FBANK) features from the mixed FBANK features and then feed each stream of the recovered FBANK features to a conventional LVCSR system for recognition.
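In the simplest architecture, which is denoted as Arch#1 and illustrated in Figure ?(a), feature separation can be considered as a multi-class regression problem, similar to many previous works . In this architecture, $\mathbf{Y}$, the feature of the mixed speech, is used as the input to a deep learning model, such as a deep neural network (DNN), a convolutional neural network (CNN), or a long short-term memory (LSTM) recurrent neural network (RNN), to estimate the feature representation of each individual talker. If we use the bidirectional LSTM-RNN model, the model will compute
$$\mathbf{H}_0 = \mathbf{Y} \tag{1}$$
$$\mathbf{H}_i^f = \mathrm{RNN}_i^f(\mathbf{H}_{i-1}), \quad i = 1, \cdots, N \tag{2}$$
$$\mathbf{H}_i^b = \mathrm{RNN}_i^b(\mathbf{H}_{i-1}), \quad i = 1, \cdots, N \tag{3}$$
$$\mathbf{H}_i = \mathrm{Stack}(\mathbf{H}_i^f, \mathbf{H}_i^b), \quad i = 1, \cdots, N \tag{4}$$
$$\hat{\mathbf{X}}^s = \mathrm{Linear}(\mathbf{H}_N), \quad s = 1, \cdots, S \tag{5}$$
where $\mathbf{H}_0$ is the input, $N$ is the number of hidden layers, $\mathbf{H}_i$ is the $i$-th hidden layer, $\mathrm{RNN}_i^f$ and $\mathrm{RNN}_i^b$ are the forward and backward RNNs at hidden layer $i$, respectively, and $\hat{\mathbf{X}}^s$, $s = 1, \cdots, S$, is the estimated separated feature from the output layer for speech stream $s$.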
During training, we need to provide the correct reference (or target) features for all speakers in the mixed speech to the corresponding output layers for supervision. The model parameters can be optimized to minimize the mean square error (MSE) between the estimated separated feature $\hat{\mathbf{X}}^s$ and the original reference feature $\mathbf{X}^s$,
$$J = \frac{1}{S} \sum_{s=1}^{S} \sum_{t} \left\| \hat{\mathbf{X}}_t^s - \mathbf{X}_t^s \right\|^2 \tag{6}$$
where $S$ is the number of mixed speakers. In this architecture, it is assumed that the reference features are organized in a given order and assigned to the output layer segments accordingly. Once trained, this feature separation module can be used as the front-end to process the mixed speech. The separated feature streams are then fed into a normal single-speaker LVCSR system for decoding.
3.2 Feature Separation with Permutation Invariant Training
The architecture depicted in Figure ?(a) is easy to implement but has obvious drawbacks. Since the model has multiple output layer segments (one for each stream), and they depend on the same input mixture, assigning references is actually difficult. The fixed reference order used in this architecture is not quite right since the source speech streams are symmetric and there is no clear clue on how to order them in advance. This is referred to as the label ambiguity (or label permutation) problem in . As a result, this architecture may work well on the speaker-dependent setup where the target speaker is known (and thus can be assigned to a specific output segment) during training, but cannot generalize well to the speaker-independent case.
The label ambiguity problem in multi-talker mixed speech recognition was addressed with limited success in  where Weng et al. assigned reference features depending on the energy level of each speech source. In the architecture illustrated in Figure ?(b), named Arch#2, permutation invariant training (PIT)  is utilized to estimate individual feature streams. In this architecture, the reference feature sources are given as a set instead of an ordered list. The output-reference assignment is determined dynamically based on the current model. More specifically, it first computes the MSE for each possible assignment between the reference $\mathbf{X}^{s'_s}$ and the estimated source $\hat{\mathbf{X}}^s$, and picks the one with the minimum MSE. In other words, the training criterion is
$$J = \frac{1}{S} \min_{s' \in \mathrm{permu}(S)} \sum_{s=1}^{S} \sum_{t} \left\| \hat{\mathbf{X}}_t^s - \mathbf{X}_t^{s'_s} \right\|^2 \tag{7}$$
where $s'$ is a permutation of $[1, \cdots, S]$. We note two important ingredients in this objective function. First, it automatically finds the appropriate assignment no matter how the labels are ordered. Second, the MSE is computed over the whole sequence for each assignment. This forces all the frames of the same speaker to be aligned with the same output segment, which can be regarded as performing feature-level tracing implicitly. With this new objective function, we can simultaneously perform label assignment and error evaluation at the feature level. It is expected that the feature streams separated with PIT (Figure ?(b)) have higher quality than those separated with a fixed reference order (Figure ?(a)). As a result, the recognition errors on these feature streams should also be lower. Note that the computational cost associated with permutation is negligible compared to the network forward computation during training, and no permutation (and thus no cost) is needed during evaluation.
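The utterance-level PIT-MSE criterion described above can be illustrated with a minimal sketch in plain Python. The function names `sequence_mse` and `pit_mse`, the nested-list data layout, and the toy values are assumptions made for this illustration only; they are not the implementation used in the experiments, which was built in CNTK.

```python
from itertools import permutations


def sequence_mse(estimate, reference):
    """Sum of squared errors over all frames and feature dimensions."""
    return sum(
        (e - r) ** 2
        for est_frame, ref_frame in zip(estimate, reference)
        for e, r in zip(est_frame, ref_frame)
    )


def pit_mse(estimates, references):
    """Return (loss, best_permutation) for one utterance.

    estimates, references: lists of S streams; each stream is a list of
    T frames; each frame is a list of feature values (hypothetical layout).
    best_permutation[s] is the index of the reference assigned to output s.
    """
    num_streams = len(estimates)
    best_loss, best_perm = float("inf"), None
    for perm in permutations(range(num_streams)):
        # The error is accumulated over the WHOLE utterance for this
        # assignment, so one speaker stays on one output segment.
        total = sum(
            sequence_mse(estimates[s], references[perm[s]])
            for s in range(num_streams)
        )
        loss = total / num_streams
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Toy example: two streams, three frames, two feature dimensions.
refs = [
    [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]],  # speaker A
    [[5.0, 5.0], [5.0, 5.0], [5.0, 5.0]],  # speaker B
]
# The model happens to emit speaker B on output 0 and speaker A on output 1.
ests = [
    [[5.0, 5.0], [5.0, 4.0], [5.0, 5.0]],
    [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]],
]
loss, perm = pit_mse(ests, refs)
print(loss, perm)
```

In this toy case the swapped assignment is selected, so the model is not penalized for the arbitrary order of its outputs. The loop visits $S!$ assignments (2 for two talkers, 6 for three), which is consistent with the remark that the permutation cost is small next to the network forward pass.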
3.3 Direct Multi-Talker Mixed Speech Recognition with PIT
In the previous two architectures mixed speech features are first separated explicitly and then recognized independently with a conventional single-talker LVCSR system. Since the feature separation is not perfect, there is mismatch between the separated features and the normal features used to train the conventional LVCSR system. In addition, the objective function of minimizing the MSE between the estimated and reference features is not directly related to the recognition performance. In this section, we propose an end-to-end architecture that directly recognizes mixed speech of multiple speakers.
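In this architecture, denoted as Arch#3, we apply PIT to the CE between the reference and estimated senone posterior probability distributions as shown in Figure ?(a). Given some feature representation $\mathbf{Y}$ of the mixed speech $y$, this model will compute
$$\mathbf{H}_0 = \mathbf{Y} \tag{8}$$
$$\mathbf{H}_i^f = \mathrm{RNN}_i^f(\mathbf{H}_{i-1}), \quad i = 1, \cdots, N \tag{9}$$
$$\mathbf{H}_i^b = \mathrm{RNN}_i^b(\mathbf{H}_{i-1}), \quad i = 1, \cdots, N \tag{10}$$
$$\mathbf{H}_i = \mathrm{Stack}(\mathbf{H}_i^f, \mathbf{H}_i^b), \quad i = 1, \cdots, N \tag{11}$$
$$\mathbf{H}_O^s = \mathrm{Linear}(\mathbf{H}_N), \quad s = 1, \cdots, S \tag{12}$$
$$\mathbf{O}^s = \mathrm{Softmax}(\mathbf{H}_O^s), \quad s = 1, \cdots, S \tag{13}$$
using a deep bidirectional RNN, where Equations (8)-(11) are similar to Equations (1)-(4). $\mathbf{H}_O^s$ is the excitation at the output layer for each speech stream $s$, and $\mathbf{O}^s$ is the output segment for stream $s$. Different from the architectures discussed in previous sections, in this architecture each output segment represents the estimated senone posterior probability for a speech stream. No additional feature separation, clustering or speaker tracing is needed. Although various neural network structures can be used, in this study we focus on bidirectional LSTM-RNNs.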
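In this direct multi-talker mixed speech recognition architecture, we minimize the objective function
$$J = \frac{1}{S} \min_{s' \in \mathrm{permu}(S)} \sum_{s=1}^{S} \sum_{t} \mathrm{CE}\left(\ell_t^{s'_s}, \mathbf{O}_t^s\right) \tag{14}$$
where $s'$ is a permutation of $[1, \cdots, S]$, $\ell_t^{s'_s}$ is the reference senone label at frame $t$ of the stream assigned to output $s$, and $\mathbf{O}_t^s$ is the estimated senone posterior of output segment $s$ at frame $t$.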
In other words, we minimize the minimum average CE of every possible output-label assignment. All the frames of the same speaker are forced to be aligned with the same output segment by computing the CE over the whole sequence for each assignment. This strategy allows for the direct multi-talker mixed speech recognition without explicit separation. It is a simpler and more compact architecture for multi-talker speech recognition.
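As a rough illustration of the PIT-CE criterion, the following plain-Python sketch evaluates the utterance-level cross entropy for every output-label assignment and keeps the smallest one. The names `sequence_ce` and `pit_ce`, the three-senone toy posteriors, and the list-based layout are hypothetical choices for this example; a real system would compute the same quantity on network outputs inside the training toolkit.

```python
import math
from itertools import permutations


def sequence_ce(posteriors, labels):
    """Sum over frames of -log p(label_t), i.e. CE against one-hot labels."""
    return sum(-math.log(frame[label]) for frame, label in zip(posteriors, labels))


def pit_ce(posterior_streams, label_streams):
    """Return (loss, best_permutation) for one utterance.

    posterior_streams[s][t] is the senone posterior vector of output s at
    frame t; label_streams[k][t] is the senone index of speaker k at frame t.
    best_permutation[s] is the label stream assigned to output s.
    """
    num_streams = len(posterior_streams)
    candidates = []
    for perm in permutations(range(num_streams)):
        # CE is summed over the whole utterance for this assignment.
        total = sum(
            sequence_ce(posterior_streams[s], label_streams[perm[s]])
            for s in range(num_streams)
        )
        candidates.append((total / num_streams, perm))
    return min(candidates)


# Toy example: two outputs, two frames, three senones.
posts = [
    [[0.1, 0.8, 0.1], [0.1, 0.1, 0.8]],  # output 0
    [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]],  # output 1
]
labels = [
    [0, 0],  # speaker A senone labels
    [1, 2],  # speaker B senone labels
]
loss, perm = pit_ce(posts, labels)
print(round(loss, 4), perm)
```

Only the selected assignment contributes to the gradient in training; at evaluation time the posteriors of each output segment are decoded directly and no permutation search is required.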
3.4 Joint Optimization of PIT-based Feature Separation and Recognition
As mentioned above, the main drawback of the feature separation architectures is the mismatch between the distorted separation result and the features used to train the single-talker LVCSR system. The direct multi-talker mixed speech recognition with PIT, which bypasses the feature separation step, is one solution to this problem. Here we propose another architecture, named joint optimization of PIT-based feature separation and recognition, which is denoted as Arch#4 and shown in Figure ?(b).
This architecture contains two PIT components: the front-end feature separation module with PIT-MSE and the back-end recognition module with PIT-CE. Different from the architecture in Figure ?(b), in this architecture a new LVCSR system is trained upon the output of the feature separation module with PIT-CE. The whole model is trained progressively. The front-end feature separation module is first optimized with PIT-MSE. Then the parameters in the back-end recognition module are optimized with PIT-CE while keeping the parameters in the feature separation module fixed. Finally, the parameters in both modules are jointly refined with PIT-CE using a small learning rate. Note that the reference assignment in the recognition (PIT-CE) step is the same as that in the separation (PIT-MSE) step.
During decoding, the mixed speech features are fed into this architecture, and the final posterior streams are used for decoding as normal.
To evaluate the performance of the proposed architectures, we conducted a series of experiments on an artificially generated two- and three-talker mixed speech dataset based on the AMI corpus .
There are four reasons for us to use AMI: 1) AMI is a speaker-independent spontaneous LVCSR corpus. Compared to the small vocabulary, speaker-dependent, read English datasets used in most of the previous studies , observations made and conclusions drawn from AMI are more likely to generalize to other real-world scenarios; 2) AMI is a very hard task with different kinds of noise, truly spontaneous meeting style speech, and strong accents. It reflects the true ability of LVCSR when the training set size is around 100hr. The state-of-the-art word error rate (WER) on AMI is around 25.0% for the close-talk condition  and more than 45.0% for the far-field condition with a single microphone . These WERs are much higher than those on other corpora, such as Switchboard  on which the WER is now below 10.0% ; 3) Although the close-talk data (AMI IHM) was used to generate mixed speech in this work, the existence of parallel far-field data (AMI SDM/MDM) allows us to evaluate our architectures on far-field data in the future; 4) AMI is a public corpus, so using AMI allows interested readers to reproduce our results more easily.
The AMI IHM (close-talk) dataset contains about 80hr and 8hr speech in training and evaluation sets, respectively . Using AMI IHM, we generated a two-talker (IHM-2mix) and a three-talker (IHM-3mix) mixed speech dataset.
To artificially synthesize IHM-2mix, we randomly select two speakers and then randomly select an utterance for each speaker to form a mixed-speech utterance. For easier explanation, the high energy (High E) speaker in the mixed speech is always chosen as the target speaker and the low energy (Low E) speaker is considered the interfering speaker. We synthesized mixed speech for five different SNR conditions (i.e., 0dB, 5dB, 10dB, 15dB, 20dB) based on the energy ratio of the two talkers. To eliminate easy cases we force the lengths of the selected source utterances to be comparable so that at least half of the mixed speech contains overlapping speech. When the two source utterances have different lengths, the shorter one is padded with small noise at the front and the end. The same procedure is used for preparing both the training and testing data. We generated in total 400hr of two-talker mixed speech, 80hr per SNR condition, as the training set. A subset of 80hr of speech from this 400hr training set was used for fast model training and evaluation. For evaluation, a total of 40hr of two-talker mixed speech, 8hr per SNR condition, is generated and used.
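The mixing procedure can be sketched as follows. This is a simplified illustration under stated assumptions, not the script used to build IHM-2mix: the even split of the padding between front and end, the noise level `noise_std`, the use of whole-signal energy to set the SNR, and the function names are all hypothetical.

```python
import math
import random


def energy(signal):
    """Total energy of a sample sequence."""
    return sum(x * x for x in signal)


def pad_with_noise(signal, length, rng, noise_std=1e-4):
    """Pad `signal` to `length` samples with low-level Gaussian noise,
    split evenly between the front and the end (an assumption)."""
    extra = length - len(signal)
    front = extra // 2
    back = extra - front

    def noise(n):
        return [rng.gauss(0.0, noise_std) for _ in range(n)]

    return noise(front) + list(signal) + noise(back)


def mix_at_snr(target, interferer, snr_db, rng):
    """Scale `interferer` so the target-to-interferer energy ratio equals
    `snr_db`, then add the two signals sample by sample."""
    length = max(len(target), len(interferer))
    target = pad_with_noise(target, length, rng)
    interferer = pad_with_noise(interferer, length, rng)
    # Choose gain g so that 10*log10(E_target / (g^2 * E_interferer)) == snr_db.
    gain = math.sqrt(energy(target) / (energy(interferer) * 10 ** (snr_db / 10)))
    scaled = [gain * x for x in interferer]
    mixture = [t + i for t, i in zip(target, scaled)]
    return mixture, target, scaled


rng = random.Random(0)
high_e = [math.sin(0.10 * n) for n in range(400)]  # stand-in for the High E utterance
low_e = [math.sin(0.37 * n) for n in range(300)]   # stand-in for the Low E utterance
mixture, tgt, itf = mix_at_snr(high_e, low_e, 5.0, rng)
measured_snr = 10 * math.log10(energy(tgt) / energy(itf))
print(len(mixture), round(measured_snr, 3))
```

The returned `tgt` and `itf` streams are the per-speaker sources whose features and alignments would serve as references during training.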
The IHM-3mix dataset was generated similarly. The relative energy of the three speakers in each mixed utterance varies randomly in the training set. Different from the training set, all the speakers in the same mixed utterance have equal energy in the testing set. We generated in total 400hr and 8hr three-talker mixed speech as the training and testing set, respectively.
Figure 1 compares the spectrogram of a single-talker clean utterance and the corresponding 0dB two-talker mixed utterance in the IHM-2mix dataset. It is clearly very hard to separate the spectrogram and reconstruct the source utterances by visual examination.
4.1 Single-speaker Recognition Baseline
In this work, all the neural networks were built using the latest Microsoft Cognitive Toolkit (CNTK)  and the decoding systems were built based on Kaldi . We first followed the officially released Kaldi recipe to build an LDA-MLLT-SAT GMM-HMM model. This model uses 39-dim MFCC features and has roughly 4K tied-states and 80K Gaussians. We then used this acoustic model to generate the senone alignment for neural network training. We trained the DNN and BLSTM-RNN baseline systems with the original AMI IHM data. 80-dimensional log filter bank (LFBK) features with CMVN were used to train the baselines. The DNN has 6 hidden layers, each of which contains 2048 Sigmoid neurons. The input feature for the DNN contains a window of 11 frames. The BLSTM-RNN has 3 bidirectional LSTM layers which are followed by the softmax layer. Each BLSTM layer has 512 memory cells. The input to the BLSTM-RNN is a single acoustic frame. All the models explored here are optimized with the cross-entropy criterion. The DNN is optimized using SGD with a minibatch size of 256, and the BLSTM-RNN is trained using SGD with 4 full-length utterances in each minibatch.
For decoding, we used a 50K-word dictionary and a trigram language model interpolated from the ones created using the AMI transcripts and the Fisher English corpus. The performance of these two baselines on the original single-speaker AMI corpus is presented in Table 1. These results are comparable with those reported by others  even though we did not use adapted fMLLR features. It is noted that adding more BLSTM layers did not show meaningful WER reduction in the baseline.
To test the normal single-speaker model on the two-talker mixed speech, the above baseline BLSTM-RNN model is utilized to decode the mixed speech directly. During scoring we compare the decoding output (only one output) with the reference of each source utterance to obtain the WER for the corresponding source utterance. Table 2 summarizes the recognition results. It is clear from the table that the single-speaker model performs very poorly on the multi-talker mixed speech, as indicated by the huge WER degradation of the high-energy speaker when the SNR decreases. Furthermore, in all the conditions, the WERs for the low energy speaker are all above 100.0%. These results demonstrate the great challenge in multi-talker mixed speech recognition.
| High E Spk | Low E Spk |
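For readers wondering how a WER can exceed 100%, it helps to recall the standard definition. The symbols below are introduced here for illustration only, and the exact scoring configuration is not stated in the text:

$$\mathrm{WER} = \frac{N_{\mathrm{sub}} + N_{\mathrm{del}} + N_{\mathrm{ins}}}{N_{\mathrm{ref}}} \times 100\%$$

where $N_{\mathrm{sub}}$, $N_{\mathrm{del}}$ and $N_{\mathrm{ins}}$ are the numbers of substituted, deleted and inserted words in the best alignment between hypothesis and reference, and $N_{\mathrm{ref}}$ is the number of reference words. Because insertions are counted in the numerator but not in the denominator, the ratio is not bounded by 100%. One plausible reading of the low-energy results is that the single decoding output mostly follows the high-energy speaker, so when it is scored against the low-energy reference nearly every reference word is substituted or deleted and the remaining hypothesis words are counted as insertions.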
4.2 Evaluation of Two-talker Speech Recognition Architectures
The four proposed architectures for two-talker speech recognition are evaluated here. For the first two approaches (Arch#1 and Arch#2) that contain an explicit feature separation stage (without and with PIT-MSE, respectively), a 3-layer BLSTM is used in the feature separation module. The separated feature streams are fed into a normal 3-layer BLSTM LVCSR system, trained with single-talker speech, for decoding. The whole system contains in total six BLSTM layers. For the other two approaches (Arch#3 and Arch#4), in which PIT-CE is used, 6-layer BLSTM models are used so that the number of parameters is comparable to the other two architectures. In all these architectures the input is the 40-dimensional LFBK feature and each layer contains 768 memory cells. To train the latter two architectures that exploit PIT-CE we need to prepare the alignments for the mixed speech. The senone alignments for the two talkers in each mixed speech utterance are taken from the single-speaker baseline alignment. The alignment of the shorter utterance within the mixed speech is padded with the silence state at the front and the end. All the models were trained with a minibatch of 8 utterances. The gradient was clipped to 0.0003 to guarantee training stability. To obtain the results reported in this section we used the 80hr mixed speech training subset.
The recognition results on both speakers are evaluated. For scoring, we evaluated the two hypotheses, obtained from the two output sections, against the two references and picked the assignment with the better WER to compute the final WER.
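The scoring step can be sketched in plain Python as below. The text does not spell out how "better WER" is measured across the two streams, so this sketch assumes the assignment with the fewest total word errors is chosen; the function names and the example sentences are hypothetical.

```python
from itertools import permutations


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance (substitutions + deletions + insertions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = curr
    return prev[-1]


def best_assignment_errors(hyps, refs):
    """Try every hypothesis-to-reference assignment and return
    (errors_per_hypothesis, permutation) with the fewest total errors.
    permutation[k] is the reference index scored against hypothesis k."""
    best = None
    for perm in permutations(range(len(refs))):
        errors = [edit_distance(refs[perm[k]], hyps[k]) for k in range(len(hyps))]
        if best is None or sum(errors) < sum(best[0]):
            best = (errors, perm)
    return best


refs = [
    "how are you today".split(),
    "the meeting starts at noon".split(),
]
hyps = [
    "the meeting start at noon".split(),  # from output section 0
    "how are you".split(),                # from output section 1
]
errors, perm = best_assignment_errors(hyps, refs)
wers = [100.0 * errors[k] / len(refs[perm[k]]) for k in range(len(hyps))]
print(errors, perm, wers)
```

Here output section 0 is matched to the second reference and output section 1 to the first, giving one error on each stream.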
The results on the 0dB SNR condition are shown in Table 3. Compared to the 0dB condition in Table 2, all the proposed multi-talker speech recognition architectures obtain obvious improvement on both speakers. Within the two architectures with the explicit feature separation stage, the architecture with PIT-MSE is significantly better than the baseline feature separation architecture. These results confirm that the label permutation problem can be well alleviated by PIT-MSE at the feature level. We can also observe that applying PIT-CE on the recognition module (Arch#3 & Arch#4) can further reduce WER by 10.0% absolute. This is because these two architectures can significantly reduce the mismatch between the separated feature and the feature used to train the LVCSR model. It is also because cross-entropy is more directly related to the recognition accuracy. Comparing Arch#3 and Arch#4, we can see that the architecture with joint optimization of PIT-based feature separation and recognition slightly outperforms the direct PIT-CE based model.
Since Arch#3 and Arch#4 achieve comparable results, and the model architecture and training process of Arch#3 are much simpler than those of Arch#4, our further evaluations reported in the following sections are based on Arch#3. For clarity, Arch#3 is named direct PIT-CE-ASR from now on.
| Front-end | Back-end | High E WER | Low E WER |
4.3 Evaluation of the Direct PIT-CE-ASR Model on Large Dataset
We evaluated the direct PIT-CE-ASR architecture on the full IHM-2mix corpus. All the 400hr mixed data under different SNR conditions are pooled together for training. The direct PIT-CE-ASR model is still composed of 6 BLSTM layers with 768 memory cells in each layer. All other configurations are also the same as the experiments conducted on the subset.
The results under different SNR conditions are shown in Table 4. The direct PIT-CE-ASR model achieved significant improvements on both talkers compared to the baseline results in Table 2 for all SNR conditions. Compared to the results in Table 3, achieved with the 80hr training subset, we observe that an additional 10.0% absolute WER improvement on both speakers can be obtained using the large training set. We also observe that the WER increases slowly as the SNR becomes smaller for the high energy speaker, and the WER improvement is very significant for the low energy speaker across all conditions. In the 0dB SNR scenario, the WERs on the two speakers are very close and are 45.0% less than those achieved with the single-talker ASR system for both high and low energy speakers. At 20dB SNR, the WER of the high energy speaker is still significantly better than the baseline, and approaches the single-talker recognition result reported in Table 1.
| High E WER | Low E WER |
4.4 Permutation Invariant Training with Alternative Deep Learning Models
We investigated the direct PIT-CE-ASR model with alternative deep learning models. The first model we evaluated is a 6-layer feed-forward DNN in which each layer contains 2048 Sigmoid units. The input to the DNN is a window of 11 frames each with a 40-dimensional LFBK feature.
The results of the DNN-based PIT-CE-ASR model are reported at the top of Table 5. Although it still obtains an obvious improvement over the baseline single-speaker model, the gain is much smaller than that from the BLSTM-based PIT-CE-ASR model, with a WER difference of nearly 20.0% in every condition. The difference between the DNN and BLSTM models can be attributed partially to the stronger modeling power of BLSTM models and partially to the better tracing ability of RNNs.
We also compared the BLSTM models with 4, 6, and 8 layers as shown in Table 5. It is observed that deeper BLSTM models perform better. This is different from the single speaker ASR model whose performance peaks at 4 BLSTM layers . This is because the direct PIT-CE-ASR architecture needs to conduct two tasks - separation and recognition, and thus requires additional modeling power.
| SNR Condition | High E WER | Low E WER |
4.5 Analysis on Multi-Talker Speech Recognition Results
To better understand the results on multi-talker speech recognition, we computed the WER separately for speech mixed from same and opposite genders. The results are shown in Table 6. It is observed that the same-gender mixed speech is much more difficult to recognize than the opposite-gender mixed speech, and the gap is even larger when the energy ratio of the two speakers is closer to 1. It is also observed that the mixed speech of two male speakers is harder to recognize than that of two female speakers. These results suggest that effective exploitation of gender information may help to further improve the multi-talker speech recognition system. We will explore this in our future work.
| SNR Condition | High E WER | Low E WER |
M + M
F + F
M + F
To further understand our model, we examined the recognition results with and without using the direct PIT-CE-ASR. An example of these results on a 0dB two-talker mixed speech utterance is shown in Figure 2 (using the single-speaker baseline system) and Figure 3 (with direct PIT-CE-ASR). It is clearly seen that the results are erroneous when the single-speaker baseline system is used to recognize the two-talker mixed speech. In contrast, many more words are recognized correctly with the proposed direct PIT-CE-ASR model.
4.6 Three-Talker Speech Recognition with Direct PIT-CE-ASR
In this subsection, we further extend and evaluate the proposed direct PIT-CE-ASR model on the three-talker mixed speech using the IHM-3mix dataset.
The three-talker direct PIT-CE-ASR model is also a 6-layer BLSTM model. The training and testing configurations are the same as those for two-talker speech recognition. The direct PIT-CE-ASR training progress, as measured by CE on both the two- and three-talker mixed speech training and validation sets, is illustrated in Figure 4. The direct PIT-CE-ASR model with this specific configuration converges slowly, and the CE improvement on the training and validation sets is almost the same. The training progress on three-talker mixed speech is similar to that on two-talker mixed speech, but with a clearly higher CE value. This indicates how challenging it is to recognize speech mixed from more than two talkers. Note that in this set of experiments we used the same model configuration as in two-talker mixed speech recognition. Since three-talker mixed speech recognition is much harder, using deeper and wider models may help to improve performance. Due to resource limitations, we did not search for the best configuration for the task.
The three-talker mixed speech recognition WERs are reported in Table 7. The WERs on different gender combinations are also provided. The WERs achieved with the single-speaker model are listed in the first row of Table 7. Compared to the results on IHM-2mix, the results on IHM-3mix are significantly worse using the conventional single-speaker model. Under this extremely hard setup, the proposed direct PIT-CE-ASR architecture still demonstrated a strong ability to separate, trace, and recognize the mixed speech, achieving a 25.0% relative WER reduction across all three speakers. Although there is a clear performance gap between the two-talker and three-talker cases, the result is still very promising for this speaker-independent three-talker LVCSR task. Not surprisingly, the mixed speech of different genders is relatively easier to recognize than that of the same gender.
Moreover, we conducted another interesting experiment: we used the three-talker PIT-CE-ASR model to recognize the two-talker mixed speech. The results are shown in Table 8. Surprisingly, the results are almost identical to those obtained using the 6-layer BLSTM based two-talker model (shown in Table 4). This demonstrates that the proposed direct PIT-CE-ASR model generalizes well over a variable number of mixed speakers, and suggests that a single PIT model may be able to recognize mixed speech from different numbers of speakers without knowing or estimating the number of speakers.
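For readers unfamiliar with the metric, a relative WER reduction expresses the improvement as a fraction of the baseline WER rather than as an absolute difference. The sketch below shows the arithmetic; the function name `relative_wer_reduction` and the WER values in the usage lines are hypothetical illustrations, not figures from Table 7.

```python
def relative_wer_reduction(baseline_wer, system_wer):
    """Relative WER reduction in percent.

    Both arguments are WERs in percent, e.g. a single-speaker baseline
    and a multi-talker model scored on the same test set.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - system_wer) / baseline_wer


# Hypothetical values chosen only to illustrate the arithmetic:
# a drop from 80.0% to 60.0% WER is a 20.0-point absolute reduction
# and a 25.0% relative reduction.
example = relative_wer_reduction(80.0, 60.0)
```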
|SNR Condition||High E WER||Low E WER|
In this paper, we proposed several architectures for recognizing multi-talker mixed speech given only a single channel of the mixed signal. Our technique is based on permutation invariant training, which was originally developed for separation of multiple speech streams. PIT can be performed on the front-end feature separation module to obtain better separated feature streams, or be extended to the back-end recognition module to predict the separated senone posterior probabilities directly. Moreover, PIT can be implemented on both the front-end and the back-end with a joint-optimization architecture. When using PIT to optimize a model, the criterion is computed over all frames in the whole utterance for each possible output-target assignment, and the assignment with the minimum loss is picked for parameter optimization. PIT thus addresses the label permutation problem well and conducts speaker separation and tracing in one shot. In particular, for the proposed architecture with the direct PIT-CE based recognition model, multi-talker mixed speech recognition can be conducted directly without an explicit separation stage.
The proposed architectures were evaluated and compared on an artificially mixed AMI dataset with both two- and three-talker mixed speech. The experimental results indicate that the proposed architectures are very promising. Our models obtain relative WER reductions of 45.0% and 25.0% against the state-of-the-art single-talker speech recognition system across all speakers when their energies are comparable, for two- and three-talker mixed speech, respectively. Another interesting observation is that there is no degradation even when the proposed three-talker model is used to recognize the two-talker mixed speech directly. This suggests that we can construct one model to recognize speech mixed from a variable number of speakers without knowing or estimating the number of speakers in the mixed speech. To our knowledge, this is the first work on multi-talker mixed speech recognition on the challenging speaker-independent spontaneous LVCSR task.
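The utterance-level assignment search summarized above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `pit_cross_entropy`, the array layout (streams, frames, senones), and the normalization by streams times frames are choices made here for clarity. It only evaluates the loss and the selected assignment; in training, gradients would be back-propagated through the selected assignment only.

```python
import itertools

import numpy as np


def pit_cross_entropy(log_posteriors, targets):
    """Utterance-level permutation invariant cross entropy.

    log_posteriors: float array of shape (S, T, K) holding log senone
        posteriors for S output streams, T frames and K senones.
    targets: int array of shape (S, T) holding the senone labels of the
        S reference speakers.
    Returns (loss, perm), where perm[s] is the reference speaker assigned
    to output stream s for the whole utterance.
    """
    num_streams, num_frames, _ = log_posteriors.shape
    frame_idx = np.arange(num_frames)
    best_loss, best_perm = np.inf, None
    # Enumerate all S! output-target assignments.
    for perm in itertools.permutations(range(num_streams)):
        total = 0.0
        for out_idx, ref_idx in enumerate(perm):
            # Sum the negative log posterior of the reference labels over
            # every frame of the utterance, not frame by frame.
            total -= log_posteriors[out_idx, frame_idx, targets[ref_idx]].sum()
        loss = total / (num_streams * num_frames)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm
```

Because one assignment is chosen per utterance, each output stream stays tied to the same speaker across all frames, which is what allows separation and tracing to happen together. The number of assignments grows as S!, which is manageable for the two- and three-talker cases considered here.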
This work was supported by the Shanghai Sailing Program No. 16YF1405300, the China NSFC projects (No. 61573241 and No. 61603252), the Interdisciplinary Program (14JCZ03) of Shanghai Jiao Tong University in China, and the Tencent-Shanghai Jiao Tong University joint project. Experiments have been carried out on the PI supercomputer at Shanghai Jiao Tong University.
- In , a similar permutation-free technique, which is equivalent to PIT when there are exactly two speakers, was evaluated with negative results and conclusions.
- D. Yu and L. Deng, Automatic Speech Recognition: A Deep Learning Approach, ser. Signals and Communication Technology. Springer London, 2014. [Online]. Available: https://books.google.com/books?id=rUBTBQAAQBAJ
- D. Yu, L. Deng, and G. E. Dahl, “Roles of pre-training and fine-tuning in context-dependent DBN-HMMs for real-world speech recognition,” in NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2010.
- G. E. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 20, pp. 30–42, 2012.
- F. Seide, G. Li, and D. Yu, “Conversational speech transcription using context-dependent deep neural networks,” in Annual Conference of International Speech Communication Association (INTERSPEECH), 2011, pp. 437–440.
- G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Processing Magazine (SPM), vol. 29, pp. 82–97, 2012.
- O. Abdel-Hamid, A.-r. Mohamed, H. Jiang, and G. Penn, “Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012, pp. 4277–4280.
- O. Abdel-Hamid, A.-r. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, “Convolutional neural networks for speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 22, pp. 1533–1545, 2014.
- T. N. Sainath, O. Vinyals, A. Senior, and H. Sak, “Convolutional, long short-term memory, fully connected deep neural networks,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 4580–4584.
- M. Bi, Y. Qian, and K. Yu, “Very deep convolutional neural networks for LVCSR,” in Annual Conference of International Speech Communication Association (INTERSPEECH), 2015, pp. 3259–3263.
- Y. Qian, M. Bi, T. Tan, and K. Yu, “Very deep convolutional neural networks for noise robust speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 24, no. 12, pp. 2263–2276, 2016.
- Y. Qian and P. C. Woodland, “Very deep convolutional neural networks for robust speech recognition,” in IEEE Spoken Language Technology Workshop (SLT), 2016, pp. 481–488.
- V. Mitra and H. Franco, “Time-frequency convolutional networks for robust speech recognition,” in IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2015, pp. 317–323.
- V. Peddinti, D. Povey, and S. Khudanpur, “A time delay neural network architecture for efficient modeling of long temporal contexts,” in Annual Conference of International Speech Communication Association (INTERSPEECH), 2015, pp. 3214–3218.
- T. Sercu, C. Puhrsch, B. Kingsbury, and Y. LeCun, “Very deep multilingual convolutional neural networks for LVCSR,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 4955–4959.
- D. Amodei, R. Anubhai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, J. Chen, M. Chrzanowski, A. Coates, G. Diamos et al., “Deep speech 2: End-to-end speech recognition in English and Mandarin,” in International Conference on Machine Learning (ICML), 2016.
- S. Zhang, H. Jiang, S. Xiong, S. Wei, and L. Dai, “Compact feedforward sequential memory networks for large vocabulary continuous speech recognition,” in Annual Conference of International Speech Communication Association (INTERSPEECH), 2016, pp. 3389–3393.
- D. Yu, W. Xiong, J. Droppo, A. Stolcke, G. Ye, J. Li, and G. Zweig, “Deep convolutional neural networks with layer-wise context expansion and attention,” in Annual Conference of International Speech Communication Association (INTERSPEECH), 2016, pp. 17–21.
- W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, and G. Zweig, “The Microsoft 2016 conversational speech recognition system,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 5255–5259.
- D. Yu, M. Kolbæk, Z.-H. Tan, and J. Jensen, “Permutation invariant training of deep models for speaker-independent multi-talker speech separation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 241–245.
- M. Kolbæk, D. Yu, Z.-H. Tan, and J. Jensen, “Multi-talker speech separation with utterance-level permutation invariant training of deep recurrent neural networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), accepted, 2017.
- Z. Ghahramani and M. I. Jordan, “Factorial hidden Markov models,” Machine learning (MLJ), vol. 29, no. 2-3, pp. 245–273, 1997.
- M. Cooke, J. R. Hershey, and S. J. Rennie, “Monaural speech separation and recognition challenge,” Computer Speech and Language (CSL), vol. 24, pp. 1–15, 2010.
- C. Weng, D. Yu, M. L. Seltzer, and J. Droppo, “Deep neural networks for single-channel multi-talker speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 23, no. 10, pp. 1670–1679, 2015.
- J. R. Hershey, Z. Chen, J. L. Roux, and S. Watanabe, “Deep clustering: Discriminative embeddings for segmentation and separation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 31–35.
- Y. Isik, J. L. Roux, Z. Chen, S. Watanabe, and J. R. Hershey, “Single-channel multi-speaker separation using deep clustering,” in Annual Conference of International Speech Communication Association (INTERSPEECH), 2016, pp. 545–549.
- Z. Chen, Y. Luo, and N. Mesgarani, “Deep attractor network for single-microphone speaker separation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 246–250.
- J. R. Hershey, S. J. Rennie, P. A. Olsen, and T. T. Kristjansson, “Super-human multi-talker speech recognition: A graphical modeling approach,” Computer Speech and Language (CSL), vol. 24, pp. 45–66, 2010.
- S. J. Rennie, J. R. Hershey, and P. A. Olsen, “Single-channel multitalker speech recognition,” IEEE Signal Processing Magazine (SPM), vol. 27, pp. 66–80, 2010.
- P.-S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, “Deep learning for monaural speech separation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 1562–1566.
- F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. Roux, J. R. Hershey, and B. Schuller, “Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR,” in International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA). Springer-Verlag New York, Inc., 2015, pp. 91–99.
- Y. Wang, A. Narayanan, and D. Wang, “On training targets for supervised speech separation,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 22, pp. 1849–1858, 2014.
- Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, “An experimental study on speech enhancement based on deep neural networks,” IEEE Signal Processing Letters (SPL), vol. 21, pp. 65–68, 2014.
- P. S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, “Joint optimization of masks and deep recurrent neural networks for monaural source separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 23, pp. 2136–2147, Dec 2015.
- J. Du, Y. Tu, L. R. Dai, and C. H. Lee, “A regression approach to single-channel speech separation via high-resolution deep neural networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 24, pp. 1424–1437, Aug 2016.
- T. Hain, L. Burget, J. Dines, P. N. Garner, F. Grézl, A. E. Hannani, M. Huijbregts, M. Karafiat, M. Lincoln, and V. Wan, “Transcribing meetings with the AMIDA systems,” IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 20, no. 2, pp. 486–498, 2012.
- D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur, “Purely sequence-trained neural networks for ASR based on lattice-free MMI,” in Annual Conference of International Speech Communication Association (INTERSPEECH), 2016, pp. 2751–2755.
- Y. Zhang, G. Chen, D. Yu, K. Yao, S. Khudanpur, and J. Glass, “Highway long short-term memory RNNs for distant speech recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 5755–5759.
- J. J. Godfrey and E. Holliman, “Switchboard-1 release 2,” Linguistic Data Consortium, Philadelphia, 1997.
- T. Sercu and V. Goel, “Dense prediction on sequences with time-dilated convolutions for speech recognition,” arXiv preprint arXiv:1611.09288, 2016.
- G. Saon, T. Sercu, S. Rennie, and H.-K. J. Kuo, “The IBM 2016 English conversational telephone speech recognition system,” in Annual Conference of International Speech Communication Association (INTERSPEECH), 2016, pp. 7–11.
- P. Swietojanski, A. Ghoshal, and S. Renals, “Hybrid acoustic models for distant and multichannel large vocabulary speech recognition,” in IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2013, pp. 285–290.
- D. Yu, A. Eversole, M. Seltzer, K. Yao, Z. Huang, B. Guenter, O. Kuchaiev, Y. Zhang, F. Seide, H. Wang et al., “An introduction to computational networks and the computational network toolkit,” Microsoft Technical Report MSR-TR-2014-112, 2014.
- D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., “The Kaldi speech recognition toolkit,” in IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), no. EPFL-CONF-192584, 2011.