End-to-end training of time domain audio separation and recognition


Abstract

The rising interest in single-channel multi-speaker speech separation has sparked the development of end-to-end (E2E) approaches to multi-speaker speech recognition. However, up until now, state-of-the-art neural network-based time domain source separation has not been combined with E2E speech recognition. We here demonstrate how to combine a separation module based on a Convolutional Time domain Audio Separation Network (Conv-TasNet) with an E2E speech recognizer, and how to train such a model jointly by distributing it over multiple GPUs or by approximating truncated back-propagation for the convolutional front-end. To put this work into perspective and illustrate the complexity of the design space, we provide a compact overview of single-channel multi-speaker recognition systems. Our experiments on WSJ0-2mix indicate that our jointly trained time domain model can yield substantial improvements over cascade DNN-HMM and monolithic E2E frequency domain systems proposed so far.

Thilo von Neumann, Keisuke Kinoshita, Lukas Drude, Christoph Boeddeker, Marc Delcroix, Tomohiro Nakatani, Reinhold Haeb-Umbach

NTT Communication Science Laboratories, NTT Corporation, Kyoto, Japan
Paderborn University, Department of Communications Engineering, Paderborn, Germany

Keywords: End-to-end speech recognition, speech separation, multi-speaker speech recognition, time domain, joint training

1 Introduction

ASR is a key technology for the automatic analysis of any kind of spoken speech, e.g., phone calls or meetings. For scenarios of relatively clean speech, e.g., recordings of telephone speech or audio books, ASR technologies have improved drastically over recent years [20]. More realistic scenarios like spontaneous speech or meetings with multiple participants often require the ASR system to recognize the speech of multiple speakers simultaneously. In meeting scenarios, for example, a considerable fraction of the speech is overlapped (footnote 1), and the amount of overlap can be even higher in informal get-togethers (footnote 2). Thus, there has been a growing interest in source separation systems and multi-speaker ASR. A special focus lies on the processing of single-channel recordings, as this is not only important in scenarios where only a single channel is available (e.g., telephone conference recordings), but also for multi-channel recordings where conventional multi-channel processing methods, e.g., beamforming, cannot separate the speakers well enough, for instance when they are spatially too close to each other.

The topic of single-channel source separation has been examined extensively over the last few years, trying to solve the cocktail party problem with techniques such as deep clustering (DPCL) [6], permutation invariant training (PIT) [22] and TasNet [10, 11]. In DPCL, a neural network is trained to map each time-frequency bin to an embedding vector such that embedding vectors of the same speaker form a cluster in the embedding space. These clusters can be found by a clustering algorithm and used to construct a mask for separation in the frequency domain. Concurrently, PIT was developed, which trains a simple neural network with multiple outputs to estimate a mask for each speaker with a permutation-invariant training criterion. The reconstruction loss is calculated for each possible assignment of training targets to estimations for a mixture, and the permutation that minimizes the loss is then used for training. Both DPCL and PIT show good separation performance in the time-frequency domain. The permutation-invariant training scheme was adopted for time domain source separation with the Time domain Audio Separation Network (TasNet), which replaces the commonly used short-time Fourier transform (STFT) with a learnable transformation and directly works on the raw waveform. TasNet achieves a substantial SDR gain, even outperforming oracle masking in the frequency domain.
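As an illustration of the permutation-invariant training criterion described above, the following minimal Python sketch evaluates a per-speaker reconstruction loss for every possible assignment of estimates to targets and keeps the best one. The tensor shapes and the `pair_loss` callable are assumptions made for illustration, not the exact implementation used in the cited works.

```python
# Minimal sketch of a permutation-invariant (PIT) loss.
# estimates, targets: PyTorch tensors of shape (num_speakers, ...).
# pair_loss: callable returning a scalar loss tensor for one (estimate, target) pair,
#            e.g. an MSE on masks or a negative SI-SNR on waveforms.
from itertools import permutations
import torch

def pit_loss(estimates, targets, pair_loss):
    """Return the minimum loss over all speaker permutations and the best permutation."""
    num_spk = estimates.shape[0]
    best_loss, best_perm = None, None
    for perm in permutations(range(num_spk)):
        # Assign estimate `est_idx` to target `tgt_idx` and average the pair losses.
        loss = torch.stack([
            pair_loss(estimates[est_idx], targets[tgt_idx])
            for tgt_idx, est_idx in enumerate(perm)
        ]).mean()
        if best_loss is None or loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm
```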

Based on these source separation techniques, multi-speaker ASR systems have been constructed. DPCL and PIT have been used as frequency domain source separation front-ends for a state-of-the-art single-speaker ASR system and extended to jointly trained E2E or hybrid systems [12, 15, 21, 13]. These works showed that joint (re-)training can improve the performance of such models over a simple cascade system. The effectiveness of TasNet as a time domain front-end for ASR was investigated in [1], showing an improvement over frequency domain processing for both source separation and ASR results. However, TasNet has not yet been optimized jointly with an ASR system, possibly due to the intricacies of dealing with the high memory consumption or the novelty of the TasNet method.

In this paper, we combine a state-of-the-art front-end, i.e., Conv-TasNet [11], with an E2E CTC/attention [7, 18, 3] ASR system to form an E2E multi-speaker ASR system that directly operates on raw waveform features. We try to answer the questions whether it is possible to jointly train a time domain source separation system like Conv-TasNet with an E2E ASR system and whether the performance can be improved by joint fine-tuning. Going further than the investigations in [1], we retrain pre-trained front- and back-end models jointly and show, by evaluating on the WSJ0-2mix database, that a simple combination of an independently trained Conv-TasNet and ASR system already provides competitive performance compared to other E2E approaches, while joint fine-tuning of both modules in the style of an E2E system can further improve the performance by a large margin. We enable joint training by distributing the model over multiple GPUs and show that an approximation of truncated back-propagation through time [19] for convolutional networks enables joint training even on a single GPU by significantly reducing the memory usage while still providing good performance.

We finally put this work into perspective by providing a compact overview of single-channel multi-speaker ASR systems and illustrate the complexity of the design space.

[Figure 1 (block diagram): the Conv-TasNet front-end separates the mixture; each separated stream is processed by an ASR encoder with a CTC branch and an attention decoder; a permutation-invariant SI-SNR loss provides the permutation assignment.]

Figure 1: Architecture of the joint E2E ASR model. Sources are separated by a Conv-TasNet and the separated audio streams are processed by a single-speaker ASR system. During training, the permutation problem is solved based on the signal-level loss.

2 Relation to Prior Work

Other works have already studied the effectiveness of frequency domain source separation techniques as a front-end for ASR. DPCL and PIT have been used effectively for this purpose, and it was shown that joint retraining for fine-tuning can improve performance [12, 15, 13]. E2E systems for single-channel multi-speaker ASR have been proposed that no longer consist of individual parts dedicated to source separation and speech recognition, but combine these functionalities into one large monolithic neural network. They extend the encoder of a CTC/attention-based E2E ASR system to separate the encoded speech features and let one or multiple attention decoders generate an output sequence for each speaker [14, 4]. These models show promising performance, but they are not on par with hybrid cascade systems yet. Drawbacks of these monolithic E2E models compared to cascade systems are that they cannot make use of parallel and single-speaker data and that they do not allow pre-training of individual system parts. The impact of using raw waveform features directly for the task of multi-speaker ASR has only been investigated for a cascade combination of TasNet and a single-speaker ASR system [1], which was not jointly trained.

3 Source separation and speech recognition

3.1 Time domain source separation with Conv-TasNet

Conv-TasNet [11] is a single-channel source separation front-end which can be trained to produce waveforms for a fixed number of speakers from a mixture waveform. It is a variant of TasNet [10], replacing the feature extraction by a learnable transformation and the separation network by a convolutional architecture. It outputs an estimated time domain audio stream $\hat{x}_k(t)$ for each of the $K$ speakers present in the input signal $y(t)$:

$\hat{x}_1(t), \ldots, \hat{x}_K(t) = \mathrm{ConvTasNet}\big(y(t)\big)$.   (1)

The model directly works on the raw waveform instead of STFT frequency domain features, which makes it possible both to easily model and reconstruct phase information and to propagate gradients through the feature extraction and signal reconstruction parts. Since gradients flow from the raw waveform at the output to the raw waveform at the input, it is possible to directly optimize a loss on the time domain signals, such as the scale-invariant signal-to-distortion ratio (SI-SNR) loss, which we call the front-end loss $J^{\mathrm{FE}}$ here. This loss is optimized in a permutation-invariant manner by picking the assignment of estimations to targets that minimizes the loss.
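As a sketch of the front-end loss, the following shows one common way to compute a negative SI-SNR between a time domain estimate and its target, assuming 1-D PyTorch tensors; the exact formulation used in Conv-TasNet may differ in details such as mean removal or epsilon handling.

```python
import torch

def neg_si_snr(estimate, target, eps=1e-8):
    """Negative SI-SNR of a time domain estimate w.r.t. its target (1-D tensors)."""
    # Remove the mean so the measure is invariant to constant offsets.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to obtain the scaled reference.
    scale = torch.dot(estimate, target) / (torch.dot(target, target) + eps)
    s_target = scale * target
    e_noise = estimate - s_target
    si_snr = 10 * torch.log10(
        (s_target.pow(2).sum() + eps) / (e_noise.pow(2).sum() + eps)
    )
    return -si_snr  # negated so that minimizing the loss improves separation
```

Plugged into the PIT sketch above as `pair_loss`, this yields the permutation-invariant front-end loss described in the text.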

Since Conv-TasNet is built upon a convolutional architecture, it can be heavily parallelized on GPUs compared to RNN-based models, but it has a limited receptive field. When optimized for source separation only, the limited length of the receptive field is actively exploited by performing chunk-level training on fixed-length chunks randomly cut from the training examples, which both increases the variability of the data within one minibatch and simplifies the implementation, since the length of all training examples is then fixed.

3.2 End-to-end CTC/attention speech recognition

As a speech recognizer we use a CTC/attention-based ASR system. We use an architecture similar to [18] with an implementation included in the ESPnet framework [17], but we replace the original filterbank and pitch feature extraction by log-mel features implemented such that gradients can be propagated through them. This way, gradients can flow from the ASR system to the front-end.
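A minimal sketch of such a differentiable log-mel feature extractor, assuming torchaudio; the filterbank parameters below are illustrative placeholders and not the configuration used in our experiments. The point is that every operation is differentiable, so the ASR loss can back-propagate into the separation front-end.

```python
import torch
import torchaudio

class LogMelFrontEnd(torch.nn.Module):
    """Differentiable log-mel feature extraction (illustrative parameters)."""

    def __init__(self, sample_rate=8000, n_fft=512, hop_length=128, n_mels=80):
        super().__init__()
        # MelSpectrogram is built from differentiable operations (STFT + matmul).
        self.mel = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate, n_fft=n_fft,
            hop_length=hop_length, n_mels=n_mels)

    def forward(self, waveform):
        # Log compression; the small constant avoids log(0).
        return torch.log(self.mel(waveform) + 1e-10)
```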

The multi-target loss for the ASR system is composed of a CTC and an attention loss,

$J^{\mathrm{ASR}} = \alpha J^{\mathrm{CTC}} + (1 - \alpha) J^{\mathrm{att}}$,   (2)

with a weight $\alpha$ that controls the interaction of both loss terms. During training, teacher forcing using the ground truth transcription labels is employed for the attention decoder. In teacher forcing, the input for the next step of the recurrent attention decoder is computed using the ground truth label instead of the network output.

4 Joint End-to-end multi-speaker ASR

We propose to combine a Conv-TasNet source separation front-end with a CTC/attention speech recognizer as displayed in Fig. 1. The input mixture is separated by the front-end and the separated audio streams are processed by a single-speaker ASR back-end. Although multi-speaker speech recognition can already be performed by combining independently trained front- and back-end systems, the source separator produces artifacts unknown to the ASR system which degrade its performance. According to [5], and as also shown in [15, 13], such a mismatch can be mitigated by jointly fine-tuning the whole model at once.

We here compare three different variants of joint fine-tuning: (a) fine-tuning just the ASR system on the enhanced signals, (b) fine-tuning just the front-end by propagating gradients through the ASR system but only updating the front-end parameters, and (c) jointly fine-tuning both systems. The losses for the front- and back-end are combined as

$J = \beta_{\mathrm{FE}} J^{\mathrm{FE}} + \beta_{\mathrm{ASR}} J^{\mathrm{ASR}}$,   (3)

where $\beta_{\mathrm{FE}}$ and $\beta_{\mathrm{ASR}}$ are manually chosen weights for the front-end and ASR losses, which are set differently for the variants (a), (b) and (c).

In order to choose the transcription for teacher forcing and loss computation, a permutation problem needs to be solved. Two possible options are to use the permutation that minimizes the CTC loss, as in [14], or the permutation that minimizes the signal-level loss, as in [15]. While the CTC-based assignment has the advantage of not requiring parallel data, the permutation assignment based on the signal-level loss works more reliably in our experiments, and we therefore use it for all fine-tuning experiments, even when the front-end is not optimized.
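The following sketch combines the pieces above: the permutation is chosen on the signal level, the matched transcriptions are used for the single-speaker ASR losses, and both losses are combined with the weights of Eq. (3). The `asr_model.loss` interface and the helper functions from the earlier sketches are assumptions for illustration only.

```python
def joint_loss(separated, clean_targets, transcriptions, asr_model,
               beta_fe=1.0, beta_asr=1.0):
    """Combine front-end and ASR losses as in Eq. (3) (sketch)."""
    # Solve the permutation problem on the signal level (front-end loss J^FE).
    fe_loss, perm = pit_loss(separated, clean_targets, neg_si_snr)
    # Use the matched transcription for each separated stream (back-end loss J^ASR).
    asr_loss = sum(
        asr_model.loss(separated[est_idx], transcriptions[tgt_idx])
        for tgt_idx, est_idx in enumerate(perm)
    ) / len(perm)
    return beta_fe * fe_loss + beta_asr * asr_loss
```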

4.1 Approximated truncated back-propagation through time

One-dimensional convolutional neural networks (1D-CNNs) over time, as used in Conv-TasNet, can be seen as an alternative to recurrent neural network (RNN) architectures. Similar to RNNs, back-propagation through a sufficiently long time series can lead to enormous memory consumption. For example, we here fine-tune the Conv-TasNet jointly with the E2E ASR model on single mixtures. Although we constrain ourselves to a batch size of one, this requires splitting the model onto four GPUs, three GPUs for the front-end and one GPU for the back-end, by placing individual layers on different devices.
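Splitting the model over devices can be done with simple model parallelism, as in the following sketch; the attribute names (`encoder`, `block1` to `block3`, `decoder`) are hypothetical and only illustrate placing individual parts on different GPUs and moving activations at the block boundaries. The ASR back-end would then be placed on a further device in the same way.

```python
import torch

def distribute_front_end(front_end):
    """Place parts of a Conv-TasNet-like front-end on different GPUs (hypothetical attributes)."""
    front_end.encoder.to('cuda:0')
    front_end.block1.to('cuda:0')
    front_end.block2.to('cuda:1')
    front_end.block3.to('cuda:2')
    front_end.decoder.to('cuda:2')
    return front_end

def forward_distributed(front_end, mixture):
    # Move activations between devices at the block boundaries.
    h = front_end.block1(front_end.encoder(mixture.to('cuda:0')))
    h = front_end.block2(h.to('cuda:1'))
    h = front_end.block3(h.to('cuda:2'))
    return front_end.decoder(h)
```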

This memory consumption can be addressed by generalizing truncated back-propagation through time (TBPTT) to 1D-CNN architectures. TBPTT for 1D-CNNs can in theory be realized by back-propagating the gradients for a part of the output only; while moving back towards the input, the gradients reaching over the borders of this part are ignored. In practice, however, this is difficult to implement, and we here approximate TBPTT for the Conv-TasNet front-end by ignoring the left and right context of the block the gradients are computed for. We first compute the forward step on the whole mixture without building the backward graph to obtain an output estimation for the whole signal. Note that this only requires storing the output signal and no persistent data for the backward computation. We then compute the forward step again with backward graph construction enabled, but only for a chunk randomly cut from the input signal. The approximated output for the whole utterance is formed by overwriting the corresponding part of the full forward output with the approximated chunk output. This full output is passed to the ASR back-end, and gradients reaching the front-end from the back-end are only back-propagated through the approximated chunk. This technique allows us to run the joint training on a single GPU in our case; even with larger GPU memory, it permits increasing the batch size, which in general speeds up training and produces a smoother gradient.
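A sketch of this approximated TBPTT forward pass, assuming a PyTorch-style `separator` module; only the randomly selected chunk is recomputed with graph construction enabled, so gradients from the ASR back-end reach the front-end solely through that chunk.

```python
import torch

def forward_with_chunked_grad(separator, mixture, chunk_len):
    """Full-length separator output in which only a random chunk carries gradients."""
    # 1) Forward pass over the whole mixture without building the backward graph.
    with torch.no_grad():
        full_estimate = separator(mixture)          # e.g. shape (num_spk, samples)
    # 2) Second forward pass with gradients, restricted to a random chunk.
    start = torch.randint(0, mixture.shape[-1] - chunk_len + 1, (1,)).item()
    chunk_estimate = separator(mixture[..., start:start + chunk_len])
    # 3) Splice the chunk (with graph) into the detached full-length output;
    #    gradients from the ASR back-end flow only through this chunk.
    return torch.cat([
        full_estimate[..., :start],
        chunk_estimate,
        full_estimate[..., start + chunk_len:],
    ], dim=-1)
```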

5 Experiments

We carry out experiments on the WSJ database and the commonly used WSJ0-2mix dataset first proposed in [6]. The data in WSJ0-2mix is generated by linearly mixing clean utterances from WSJ0 (si_tr_s for training and si_et_05 for testing) at randomly chosen mixing ratios. It consists of two different datasets, namely the min and max datasets. The min dataset was designed for source separation and is formed by truncating the longer of the two mixed recordings, so that it only contains fully overlapped speech. We use this dataset for pre-training the Conv-TasNet, but it is not suitable for joint training, where the audio data needs to match the full transcription. For this, we use the max dataset, which does not truncate any recordings. We use a reduced sampling frequency for both the front- and back-end to speed up the training process. We remove any labels marked as noisy, i.e., special tokens such as "lip smack" or "door slam", from the training transcriptions, since these cannot be assigned to one speaker based on speech information by the front-end, which makes their estimation ambiguous.

We evaluate our experiments in terms of the word error rate (WER) and, where applicable, by the signal reconstruction performance measured by the SDR as supplied by the BSS-EVAL toolbox [16] and by the SI-SNR [9]. For the experiments on mixed speech, the WER is computed for all possible assignments of predictions to ground truth transcriptions of one example, and the WER of the permutation with minimum WER is reported.
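A small sketch of this permutation-minimized WER evaluation, assuming plain Python strings; actual scoring could equally be done with standard ASR scoring tools.

```python
from itertools import permutations

def word_errors(reference, hypothesis):
    """Word-level Levenshtein distance between two transcriptions."""
    ref, hyp = reference.split(), hypothesis.split()
    row = list(range(len(hyp) + 1))          # distances for the empty reference prefix
    for i, ref_word in enumerate(ref, 1):
        prev, row[0] = row[0], i             # `prev` holds the diagonal cell
        for j, hyp_word in enumerate(hyp, 1):
            prev, row[j] = row[j], min(
                row[j] + 1,                          # deletion
                row[j - 1] + 1,                      # insertion
                prev + (ref_word != hyp_word),       # substitution / match
            )
    return row[len(hyp)]

def min_permutation_wer(references, hypotheses):
    """WER of the assignment of hypotheses to references with the fewest word errors."""
    total_words = sum(len(r.split()) for r in references)
    best_errors = min(
        sum(word_errors(ref, hyp) for ref, hyp in zip(references, perm))
        for perm in permutations(hypotheses)
    )
    return best_errors / total_words
```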

5.1 Conv-TasNet time domain source separation

We use the best performing architecture according to [11] and optimize it with the ADAM optimizer [8], following the hyper-parameter settings of the original paper and using global layer normalization. For distribution over multiple GPUs, we split the model between the three repeating convolutional blocks. Table 1 lists the SDR and SI-SNR performance of our Conv-TasNet model, comparing the min and max subsets of WSJ0-2mix. Our implementation of Conv-TasNet reaches a performance on the min dataset comparable to the original paper. There is a slight degradation in performance on the max dataset caused by the mismatch of training and test data: the model never saw long single-speaker regions during training and learned to always output a speech signal on both outputs, while such regions are present in the max dataset.

Dataset | SDR | SI-SNR
WSJ0-2mix min [11] | | 
WSJ0-2mix min (ours) | | 
WSJ0-2mix max (ours) | | 
Table 1: SDR and SI-SNR in dB for the min and max test (tt) datasets of the WSJ0-2mix database.
Table 2: CER and WER on the max test (tt) set of WSJ0-2mix for different variants of fine-tuning. All models are pre-trained. Columns: fine-tuned part (front-end / back-end), joint training type, additional SI-SNR loss, CER, WER, SDR and SI-SNR.

Conv-TasNet + RNN
++ fine-tune ASR | single GPU
++ fine-tune TasNet | multi GPU
+++ SI-SNR loss | multi GPU
++ fine-tune joint | multi GPU
+++ SI-SNR loss | multi GPU
++ fine-tune joint TBPTT | single GPU (TBPTT)
+++ SI-SNR loss | single GPU (TBPTT)
Model | structure | train data | eval data | reported metrics (WER and, where given, CER / SDR / SI-SNR)
DPCL | | | WSJ0-2mix | 10.4
++ DNN-HMM [12] | hybrid | WSJ0 | WSJ0-2mix | 16.5
++ CTC/attention [15] | E2E | WSJ0 | WSJ0-2mix | 23.1
+++ joint fine-tuning [15] | E2E | WSJ0-2mix | WSJ0-2mix | 13.2 / 10.7
PIT-ASR (best) [13, 4] | hybrid | WSJ0-2mix | WSJ0-2mix | 28.2
E2E ASR [14] | E2E | WSJ-2mix | WSJ-2mix | 28.2
E2E ASR [4] | E2E | WSJ-2mix | WSJ-2mix | 18.4
E2E ASR [4] | E2E | WSJ0-2mix | WSJ0-2mix | 25.4
joint TasNet (our best) | E2E | WSJ & WSJ0-2mix | WSJ0-2mix | 11.0 / 13.8 / 13.5
Table 3: Comparison of single-channel multi-speaker ASR systems. They differ heavily in their architecture, training data and technique; the original table additionally distinguishes the type of pre-training, whether joint training is used, whether signal reconstruction is possible, and whether parallel data is required.

5.2 CTC/attention ASR model

We use a configuration similar to [14], but without the speaker-dependent layers, for the speech recognizer. The encoder consists of two CNN layers followed by two BLSTMP layers, and the decoder of one LSTM layer with 300 units. We use a location-aware attention mechanism and ADADELTA [23] as the optimizer, and all decoding is performed with an additional word-level RNN language model. Our single-speaker ASR model achieves a competitive WER on the WSJ eval92 set.

5.3 Joint fine-tuning

The results of the different fine-tuning variants are listed for comparison in Table 2. It is notable that combining the independently trained models (Conv-TasNet + RNN) already gives a competitive performance compared to other methods (see Section 5.4 and Table 3). Fine-tuning just the ASR system (+ fine-tune ASR) further cuts the WER almost in half.

Joint fine-tuning without a signal-level loss (+ fine-tune joint), where the system is no longer constrained to transport meaningful speech between front- and back-end, cannot improve much over just fine-tuning the ASR system and significantly lowers the source separation performance. This indicates that the separated signals already contain enough information for reliable speech recognition (i.e., retraining of the front-end is not required), but that not all information required to reconstruct speech is needed for ASR.

Using a signal-level loss (+ fine-tune joint + SI-SNR loss) further improves the WER. In this case, the source separation performance stays comparable to the separately trained Conv-TasNet model. A signal-level loss might help the model to better separate the speech.

Just fine-tuning the front-end (+ fine-tune TasNet) cannot reach the performance of fine-tuning the back-end. This suggests that it is easier to mitigate the mismatch in the ASR back-end (i.e., to learn to ignore the artifacts produced by the front-end) than in the front-end (i.e., to learn to suppress these artifacts).

Comparing the results of the chunk-based fine-tuning (+ fine-tune joint TBPTT), as an approximation of TBPTT, with the full joint fine-tuning (+ fine-tune joint), it can be seen that, even though the TBPTT-based approach is just an approximation, its performance is comparable to the full joint model if no signal-level loss is used. It even performs slightly better, possibly because TBPTT allowed a larger batch size. The degradation in performance for the case with a signal-level loss (+ fine-tune joint TBPTT + SI-SNR loss) might be caused by the signal-level loss penalizing the approximation heavily, while the gradient propagated through the ASR system is less harmful to the front-end performance.

5.4 Comparison with related work

This section compares the performance of the different related works presented in Section 2. Their major differences and their performance in terms of WER are listed in Table 3. While these comparisons are not entirely fair, because the presented works differ heavily in their overall model structure, training methods and data, the numbers are meant to give a rough indication of how these methods compare and how complex the design space is.

Keeping in mind that hybrid DNN-HMM models still outperform E2E models in many scenarios, it is notable that the joint fine-tuning of DPCL with an E2E model outperforms the independently trained hybrid model. In the same fashion, the E2E ASR model can outperform the jointly optimized hybrid PIT-ASR on the same dataset. Although not directly comparable, the best results in this table were produced by cascade models that allow reconstruction of the enhanced separated signals (DPCL + joint fine-tuning, joint TasNet), which suggests that having dedicated parts for source separation and speech recognition is helpful, while joint fine-tuning improves the performance. Our time domain approach gives the best result in this comparison.

6 Conclusions

We propose to use a time domain source separation system, namely Conv-TasNet, as a front-end for a single-speaker E2E ASR system to form a multi-speaker E2E speech recognizer. We show that independently training the front- and back-end already gives a competitive performance and that joint fine-tuning can drastically improve the performance further. Fine-tuning can be performed jointly with the whole model distributed over multiple GPUs, but can also be run on a single GPU, and sped up considerably, by approximating TBPTT for convolutional neural networks, while keeping the performance comparable. The results suggest that retraining the ASR part can compensate the mismatch between front-end and back-end much better than a fine-tuned front-end can.

Footnotes

  1. Measured on the AMI meeting corpus [2].
  2. Measured on the CHiME-5 database.

References

  1. F. Bahmaninezhad, J. Wu, R. Gu, S. Zhang, Y. Xu, M. Yu and D. Yu (2019) A comprehensive study of speech separation: spectrogram vs waveform separation. arXiv preprint arXiv:1905.07497. Cited by: §1, §1, §2.
  2. J. Carletta, S. Ashby, S. Bourban, M. Flynn, M. Guillemot, T. Hain, J. Kadlec, V. Karaiskos, W. Kraaij, M. Kronenthal, G. Lathoud, M. Lincoln, A. Lisowska, I. McCowan, W. Post, D. Reidsma and P. Wellner (2006) The ami meeting corpus: a pre-announcement. In Machine Learning for Multimodal Interaction, S. Renals and S. Bengio (Eds.), Berlin, Heidelberg, pp. 28–39. External Links: ISBN 978-3-540-32550-5 Cited by: footnote 1.
  3. W. Chan, N. Jaitly, Q. Le and O. Vinyals (2016) Listen, attend and spell: a neural network for large vocabulary conversational speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4960–4964. Cited by: §1.
  4. X. Chang, Y. Qian, K. Yu and S. Watanabe (2019) End-to-end monaural multi-speaker asr system without pretraining. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6256–6260. External Links: Document Cited by: §2, Table 3.
  5. J. Heymann, L. Drude, C. Boeddeker, P. Hanebrink and R. Haeb-Umbach (2017) Beamnet: end-to-end training of a beamformer-supported multi-channel asr system. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5325–5329. Cited by: §4.
  6. Y. Isik, J. L. Roux, Z. Chen, S. Watanabe and J. R. Hershey (2016) Single-channel multi-speaker separation using deep clustering. arXiv preprint arXiv:1607.02173. External Links: Document Cited by: §1, §5.
  7. S. Kim, T. Hori and S. Watanabe (2017) Joint CTC-attention based end-to-end speech recognition using multi-task learning. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4835–4839. External Links: Document Cited by: §1.
  8. D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. CoRR abs/1412.6980. Cited by: §5.1.
  9. J. Le Roux, S. Wisdom, H. Erdogan and J. R. Hershey (2019) SDR–half-baked or well done?. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 626–630. External Links: Document Cited by: §5.
  10. Y. Luo and N. Mesgarani (2018) Tasnet: time-domain audio separation network for real-time, single-channel speech separation. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 696–700. Cited by: §1.
  11. Y. Luo and N. Mesgarani (2019) Conv-tasnet: surpassing ideal time–frequency magnitude masking for speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27 (8), pp. 1256–1266. Cited by: §1, §3.1, §5.1, Table 1.
  12. T. Menne, I. Sklyar, R. Schlüter and H. Ney (2019) Analysis of deep clustering as preprocessing for automatic speech recognition of sparsely overlapping speech. arXiv preprint arXiv:1905.03500. External Links: Document Cited by: §1, §2, Table 3.
  13. Y. Qian, X. Chang and D. Yu (2018) Single-channel multi-talker speech recognition with permutation invariant training. Speech Communication 104, pp. 1–11. External Links: Document Cited by: §1, §2, §4, Table 3.
  14. H. Seki, T. Hori, S. Watanabe, J. L. Roux and J. R. Hershey (2018) A purely end-to-end system for multi-speaker speech recognition. arXiv preprint arXiv:1805.05826. External Links: Document Cited by: §2, §4, §5.2, Table 3.
  15. S. Settle, J. Le Roux, T. Hori, S. Watanabe and J. R. Hershey (2018) End-to-end multi-speaker speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4819–4823. Cited by: §1, §2, §4, §4, Table 3.
  16. E. Vincent (2005) BSS_EVAL: a toolbox for performance measurement in (blind) source separation. Online: http://bassdb.gforge.inria.fr/bss_eval/. Cited by: §5.
  17. S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. E. Y. Soplin, J. Heymann, M. Wiesner and N. Chen (2018) Espnet: end-to-end speech processing toolkit. arXiv preprint arXiv:1804.00015. Cited by: §3.2.
  18. S. Watanabe, T. Hori, S. Kim, J. R. Hershey and T. Hayashi (2017) Hybrid CTC/attention architecture for end-to-end speech recognition. IEEE Journal of Selected Topics in Signal Processing 11 (8), pp. 1240–1253. External Links: Document Cited by: §1, §3.2.
  19. P. J. Werbos (1990) Backpropagation through time: what it does and how to do it. Proceedings of the IEEE 78 (10), pp. 1550–1560. Cited by: §1.
  20. W. Xiong, L. Wu, F. Alleva, J. Droppo, X. Huang and A. Stolcke (2018-04) The microsoft 2017 conversational speech recognition system. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. , pp. 5934–5938. External Links: Document, ISSN Cited by: §1.
  21. D. Yu, X. Chang and Y. Qian (2017) Recognizing multi-talker speech with permutation invariant training. arXiv preprint arXiv:1704.01985. Cited by: §1.
  22. D. Yu, M. Kolbæk, Z. Tan and J. Jensen (2017) Permutation invariant training of deep models for speaker-independent multi-talker speech separation. In The 42nd IEEE International Conference on Acoustics, Speech and Signal ProcessingIEEE International Conference on Acoustics, Speech and Signal Processing, pp. 241–245. Cited by: §1, §3.1.
  23. M. D. Zeiler (2012) ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701. Cited by: §5.2.