Multi-Modal Data Augmentation for End-to-end ASR


We present a new end-to-end architecture for automatic speech recognition (ASR) that can be trained using symbolic input in addition to the traditional acoustic input. This architecture utilizes two separate encoders: one for acoustic input and another for symbolic input, both sharing the attention and decoder parameters. We call this architecture a multi-modal data augmentation network (MMDA), as it can support multi-modal (acoustic and symbolic) input. The MMDA architecture attempts to eliminate the need for an external LM, by enabling seamless mixing of large text datasets with significantly smaller transcribed speech corpora during training. We study different ways of transforming large text corpora into a symbolic form suitable for training our MMDA network. Our best MMDA setup obtains small improvements on CER and achieves 8-10% relative WER improvement on the WSJ data set.

Adithya Renduchintala, Shuoyang Ding, Matthew Wiesner, Shinji Watanabe

Center for Language and Speech Processing

Johns Hopkins University, Baltimore, MD 21218, USA


1 Introduction

The simplicity of “end-to-end” models and their recent success in neural machine translation (NMT) have prompted considerable research into replacing conventional ASR architectures with a single “end-to-end” model, which trains the acoustic and language models jointly rather than separately. Recently, [1] achieved state-of-the-art results using an attention-based encoder-decoder model trained on over 12K hours of speech data. While this result is promising, we should note that even the largest publicly available speech corpus, “Librispeech”, is still an order of magnitude smaller than the training data used above [2]. Our goal is to leverage much larger text corpora alongside limited amounts of speech data to improve the performance of end-to-end ASR systems.

Various methods of leveraging these text corpora have shown improvement in the context of end-to-end ASR. [3], for instance, compose RNN-output lattices with a lexicon and word-level language model, while [4] simply re-score beams with an external language model. [5, 6] incorporate a character level language model during beam search, possibly disallowing character sequences absent from a dictionary, while [7] include a full word level language model in decoding by simultaneously keeping track of word histories and word prefixes. As our approach does not change any aspect of the traditional decoding process in end-to-end ASR, the methods mentioned above can still be used in conjunction with our MMDA network.

An alternative method, proposed for NMT, augments the source (input) with “synthetic” data obtained via back-translation from monolingual target-side data [8]. We draw inspiration from this approach and attempt to augment the ASR input with text-based synthesized input generated from large text corpora.

Figure 3: Overview of our Multi-modal Data Augmentation (MMDA) model. Figure 3(a) highlights the network engaged when acoustic features are given as input to an acoustic encoder (shaded blue). Alternatively, when synthetic input is supplied, the network (Figure 3(b)) uses an augmenting encoder (green). In both cases a shared attention mechanism and decoder are used to predict the output sequence. For simplicity we show 2 layers without down-sampling in the acoustic encoder and omit the input embedding layer in the augmenting encoder.

2 Approach

While text-based augmenting input data is a natural fit for NMT, it cannot be directly used in end-to-end ASR systems which expect acoustic input. To utilize text-based input, we use two separate encoders in our ASR architecture: one for acoustic input and another for synthetic text-based augmenting input. Figure 3 gives an overview of our proposed architecture.

2.1 MMDA Architecture

Figure 3(a) shows a sequence of acoustic frames fed into an acoustic encoder, shown with blue cross hatching. The attention mechanism takes the output of the encoder and generates a context vector (gray cross hatching), which is utilized by the decoder (red cross hatching) to generate each token in the output sequence. In Figure 3(b), the network is instead given a sequence of “synthetic” input tokens, each drawn from the vocabulary of the synthetic input. The size and items of this vocabulary depend on the type of synthetic input scheme used (see Table 1 for examples and Section 5.2 for more details). As the synthetic inputs are categorical, we use an input embedding layer which learns a vector representation of each symbol in the vocabulary. The vector representation is then fed into an augmenting encoder (shown in green cross hatching). Following this, the same attention mechanism and decoder are used to generate an output sequence. Note that some details, such as the exact number of layers, down-sampling in the acoustic encoder, and the embedding layer in the augmenting encoder, are omitted in Figure 3 for the sake of clarity.

2.2 Synthetic Inputs

A desirable synthetic input should be easy to construct from plain text corpora, and should be as similar as possible to acoustic input. We propose three types of synthetic inputs that can be easily generated from text corpora and that vary in their similarity to acoustic inputs (see Table 1).

  1. Charstream: The output character sequence is supplied as synthetic input without word boundaries.

  2. Phonestream: We make use of a pronunciation lexicon to expand words into phonemes where unknown pronunciations are recovered via grapheme-to-phoneme transduction (G2P).

  3. Rep-Phonestream: We explicitly model phoneme duration by repeating each phoneme such that the relative durations of phonemes to each other mimic what is observed in data (e.g. vowels last much longer than stop consonants).

Synthetic Input    Example Sequence
Charstream         J O H N B L A R E A N D C O M P A N Y
Table 1: Examples of sequences under different synthetic input generation schemes. The original text for these examples is the phrase JOHN BLARE AND COMPANY.
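
As a minimal sketch of the first two schemes, the following shows how Charstream and Phonestream inputs could be derived from a sentence. The lexicon `TOY_LEXICON` and the function names are hypothetical and for illustration only. The sketch also simplifies the Phonestream scheme by mapping any out-of-lexicon word directly to a single unk symbol instead of running G2P first.

```python
from typing import Dict, List

# Toy pronunciation lexicon (illustrative entries only, not the lexicon used in the paper).
TOY_LEXICON: Dict[str, List[str]] = {
    "JOHN": ["JH", "AA", "N"],
    "AND": ["AE", "N", "D"],
}
UNK = "unk"


def charstream(text: str) -> List[str]:
    """Return the characters of the text with word boundaries (whitespace) removed."""
    return [ch for ch in text if not ch.isspace()]


def phonestream(text: str, lexicon: Dict[str, List[str]]) -> List[str]:
    """Expand each word into phonemes; words missing from the lexicon become a single unk symbol."""
    phones: List[str] = []
    for word in text.split():
        phones.extend(lexicon.get(word, [UNK]))
    return phones


if __name__ == "__main__":
    print(" ".join(charstream("JOHN BLARE AND COMPANY")))
    print(" ".join(phonestream("JOHN AND", TOY_LEXICON)))
```

In a fuller pipeline, the `lexicon.get(word, [UNK])` fallback would be replaced by a call to a trained G2P model, with unk reserved for words whose pronunciation cannot be inferred.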

2.3 Multi-task Training

Let the ASR dataset consist of pairs of an acoustic input and its character-sequence output. Using a text corpus of sentences, we can generate a synthetic input for each sentence by applying one of the synthetic input creation schemes. Under the assumption that the ASR transcripts and the text-corpus sentences are sequences over the same character vocabulary and from the same language, our augmenting dataset comprises training pairs of a synthetic input and the sentence it was generated from. Typically the text corpus is much larger than the original ASR training set. During training, we alternate between batches from the ASR training data and batches from the augmenting data. While our network accepts synthetic input in addition to acoustic input, it should be noted that each training instance contains only one of the two inputs. We evaluate our model on a held-out ASR dataset. Note that using phoneme-based augmenting inputs corresponds to a secondary task of phoneme-to-grapheme conversion, in addition to the primary task of ASR.
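
The alternation between the two data sources can be sketched as a simple batch scheduler. This is an illustration under stated assumptions, not the authors' implementation: the function name `alternate_batches` and the tags `"acoustic"` and `"augment"` are hypothetical, and ending an epoch once the ASR batches are exhausted is one possible choice that the text does not specify.

```python
from typing import Any, Iterable, Iterator, Tuple


def alternate_batches(
    asr_batches: Iterable[Any], aug_batches: Iterable[Any]
) -> Iterator[Tuple[str, Any]]:
    """Yield batches alternately from the ASR data and the augmenting data.

    Each batch is tagged with the encoder that should consume it. The epoch
    ends when the ASR batches run out (an assumption of this sketch).
    """
    aug_iter = iter(aug_batches)
    for asr_batch in asr_batches:
        yield ("acoustic", asr_batch)
        aug_batch = next(aug_iter, None)
        if aug_batch is not None:
            yield ("augment", aug_batch)


if __name__ == "__main__":
    for tag, batch in alternate_batches(["a1", "a2"], ["t1", "t2", "t3"]):
        print(tag, batch)
```

Because every batch carries exactly one tag, each training instance reaches only one of the two encoders, while the attention and decoder parameters receive updates from both streams.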

In the remainder of the paper we place our work in context of other multi-modal, multi-task, and data-augmentation schemes for ASR. We propose a novel architecture to seamlessly train on both text (with synthetic inputs) and speech corpora. We analyze the merit of these approaches on WSJ, and finally report the performance of our best performing architecture on WSJ [9] and CHIME4 [10].

3 Related Work

Augmenting the ASR source with synthetically generated data is already a widely used technique. Generally, label-preserving perturbations are applied to the ASR source to ensure that the system is robust to variations in source-side data not seen in training. Such perturbations include Vocal Tract Length Perturbation (VTLP), as in [11], which exposes the ASR system to a variety of synthetic speaker variations, as well as speed, tempo and volume perturbations [12]. Speech is also commonly corrupted with synthetic noise or reverberation [13, 10].

Importantly, these perturbations are added to help learn more robust acoustic representations, but not to expose the ASR system to new output utterances. They do not explicitly help the decoder, nor do they alter the network architecture. By contrast, our proposed method for data augmentation from external text exposes the ASR system to new output utterances, rather than to new acoustic inputs.

Another line of work involves data augmentation for NMT. In [14], improvements in low-resource settings were obtained by simply copying the source-side (input) monolingual data to the target side (output). Our approach is loosely based on [8], which improves NMT performance by creating pseudo-parallel data using an auxiliary translation model in the reverse direction on target-side text.

Previous work has also tried to incorporate other modalities during both training and testing, but has focused primarily on learning better feature representations via correlative objective functions or on fusing representations across modalities [15, 16]. The fusion methods require both modalities to be present at test time, while the multiview methods require both views to be conditionally independent given a common source. Our method has no such requirements and only makes use of the alternate modality during training.

Lastly, we note that considerable work has applied multi-task training to “end-to-end” ASR. In [17], the CTC objective is used as an auxiliary task to force the attention to learn monotonic alignments between input and output. In [18], a multi-task framework is used to jointly perform language-id and speech-to-text in a multilingual ASR setting. In this work our use of phoneme-based augmenting data is effectively using G2P (P2G) as an auxiliary task in end-to-end ASR, though only implicitly.

4 Method

Our MMDA architecture is a straightforward extension of the attention-based encoder-decoder network [19]. As the name suggests, such a network consists of an encoder that learns a meaningful representation of the input, a decoder that transforms the representations learned by the encoder into the desired output sequences, and an attention mechanism, which is trained to help the decoder focus on relevant portions of the encoded input. In addition to the traditional acoustic encoder, we also use an augmenting data encoder.

4.1 Acoustic Encoder

For a single utterance, the acoustic frames form a matrix whose two dimensions are the length of the utterance in frames and the number of acoustic features per frame. The acoustic frames are encoded by a multi-layer bi-directional LSTM.


After each layer’s encoding, the bi-directional hidden units are concatenated and passed through a linear projection layer, with a separate set of parameters for each layer, sized according to the hidden state dimension. The speech sequence is generally down-sampled to capture a coarser-grained resolution. Note that Eqs. 1-5 do not reflect any sub-sampling, but in the actual implementation we follow the pyramidal encoder [20].

4.2 Augmenting Encoder

The synthetic input is encoded by an augmenting encoder, which is a shallower version of the acoustic encoder that just has the bi-directional LSTM layers described in Equation 1 and Equation 2. The major difference between the synthetic input and the acoustic input is that the synthetic input is in the form of tokens (e.g. phonemes, characters), and hence is categorical rather than continuous. In our implementation, this categorical input is represented with a one-hot encoding and is transformed into a matrix by multiplying with a weight matrix, known as the embedding matrix. The dimensions of the resulting matrix are the length of the augmenting input sequence and the embedding size. To ensure that the acoustic and augmenting encoders work smoothly with the attention mechanism, we enforce the hidden dimension of the augmenting encoder to be the same as that of the acoustic encoder.
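
The equivalence between multiplying a one-hot encoding by the embedding matrix and looking up rows of that matrix can be checked with a short sketch. The helper names and the toy embedding values below are hypothetical, and plain Python lists stand in for tensors.

```python
from typing import List

Matrix = List[List[float]]


def one_hot_rows(token_ids: List[int], vocab_size: int) -> Matrix:
    """Encode each token id as a one-hot row of length vocab_size."""
    return [[1.0 if j == t else 0.0 for j in range(vocab_size)] for t in token_ids]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply an (n x k) matrix by a (k x m) matrix."""
    return [
        [sum(a_ik * b[k][j] for k, a_ik in enumerate(row)) for j in range(len(b[0]))]
        for row in a
    ]


def embed_lookup(token_ids: List[int], embedding: Matrix) -> Matrix:
    """Select one embedding row per token id."""
    return [list(embedding[t]) for t in token_ids]


if __name__ == "__main__":
    # Toy embedding matrix: vocabulary of 3 symbols, embedding size 2.
    embedding = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
    ids = [2, 0, 2]
    print(matmul(one_hot_rows(ids, 3), embedding))
    print(embed_lookup(ids, embedding))
```

Both calls produce a matrix with one row per input token and one column per embedding dimension, which is why practical implementations typically use the lookup form.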

4.3 Decoder

We follow [21, 19] and use a variant of an LSTM decoder. At each output step, this variant takes as input the embedding of the last output token, the LSTM hidden state from the previous time step, and the attention-based context vector, which is discussed in the following section. As in the encoder, more standard LSTM layers can be stacked after this variant LSTM layer to learn more complicated transformations. We omit all layer index notation for simplicity.

The hidden state of the final LSTM layer is passed through another linear transformation followed by a softmax layer generating a probability distribution over the set of output tokens.


4.4 Attention Mechanism

We follow [22] and use a location-aware attention mechanism.

This mechanism extends the content-based attention mechanism [19] by using the attention weights from the previous output time step when computing the attention weights for the current output. The previous attention weights are “smoothed” by a convolution operation and fed into the attention weight computation (Equation 8). After the attention weights are computed, the context vector is computed as a weighted sum over the encoder hidden states.
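
The following sketch illustrates three pieces of this computation in isolation: smoothing the previous attention weights with a 1-D convolution, normalizing scores into attention weights with a softmax, and forming the context vector as a weighted sum of encoder states. It is deliberately reduced. It uses a single fixed convolution filter and takes the scores as given, whereas the mechanism in [22] learns several filters and a scoring network that combines the decoder state, the encoder states, and the convolved features. All function names are hypothetical.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Normalize scores into weights that sum to one."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def smooth_previous_weights(prev_weights: List[float], kernel: List[float]) -> List[float]:
    """Zero-padded, same-length 1-D filtering of the previous attention weights.

    Assumes an odd-length kernel centered on each position.
    """
    half = len(kernel) // 2
    smoothed = []
    for i in range(len(prev_weights)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(prev_weights):
                acc += w * prev_weights[j]
        smoothed.append(acc)
    return smoothed


def context_vector(weights: List[float], encoder_states: List[List[float]]) -> List[float]:
    """Weighted sum over encoder hidden states, one weight per state."""
    dim = len(encoder_states[0])
    return [sum(w * h[d] for w, h in zip(weights, encoder_states)) for d in range(dim)]


if __name__ == "__main__":
    print(smooth_previous_weights([0.0, 1.0, 0.0], [0.25, 0.5, 0.25]))
    print(context_vector(softmax([0.0, 0.0]), [[1.0, 2.0], [3.0, 4.0]]))
```

Feeding the smoothed previous weights into the scoring step is what lets the attention prefer positions near where it last attended, which suits the largely monotonic alignment between speech and text.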


5 Experiments

5.1 Data

For the main result, we train our network on the Wall Street Journal corpus (LDC93S6B and LDC94S13B) using the standard SI-284 set containing 37K utterances or 80 hours of speech. The “dev93” set is used as the development set and as the selection criterion for the best model, which is then evaluated on the “eval92” dataset. Each frame of audio is represented by a vector of 83 dimensions (80 Mel-filter bank coefficients + 3 pitch features).

For CHIME4 experiments, we use the same audio features. We use the “dt05-multi-isolated-1ch-track” section as dev and report performance on “et05-real-isolated-1ch” (eval1) and “dt05-real-isolated-1ch” (eval2).

5.2 Generating Synthetic Input

We use the same augmenting data for both WSJ and CHIME4 experiments. The data we use is section (13-32.1 87,88,89) of WSJ that is typically used for training language models applied at decoding stage. We make different synthetic inputs for this section of WSJ. For Charstream synthetic input the target-side character sequence is copied to the input while omitting word boundaries, which we remove to account for the lack of explicit word boundary information in the acoustic signal. The Phonestream synthetic input is a more natural choice for synthetic input. Phonemes are well-suited as augmenting data since they define the semantically meaningful phonetic distinctions of a language and indicate to the decoder which phonetic invariance to learn. We use the Phonetisaurus toolkit [23] to train a G2P on CMUDICT [24], to which 46k words from the WSJ corpus were added. For certain words consisting only of rare graphemes, we are unable to infer pronunciations and simply assign to these words a single unk phoneme. Finally, we filter out sentences with more than unk phoneme symbol, and those above characters in length. The resulting augmenting dataset contains sentences.

In the Rep-Phonestream scheme, we modify the augmenting input phonemes to further emulate the ASR input by modeling the variable durations of phonemes. We assume that a phoneme’s duration in frames is normally distributed and we estimate these distributions for each phoneme from frame-level phoneme transcripts in the TIMIT dataset. Thus, for a phoneme sequence like JH AA N (for the word “John”), we sample a frame duration for each phoneme and repeat each phoneme a number of times given by its sampled duration divided by the encoder's down-sampling factor. This division accounts for the down-sampling performed by the pyramidal scheme in the acoustic encoder.
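
A minimal sketch of this sampling procedure follows. Several details are assumptions made for illustration: the duration statistics passed in are placeholders rather than TIMIT estimates, the default down-sampling factor of 4 is inferred from two encoder layers that each down-sample by 2 (Section 5.3), and rounding to the nearest integer with a floor of one repetition is one plausible choice that the text does not specify.

```python
import random
from typing import Dict, List, Optional, Tuple


def rep_phonestream(
    phones: List[str],
    duration_stats: Dict[str, Tuple[float, float]],
    downsample: int = 4,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Repeat each phoneme according to a sampled, down-sampled duration.

    duration_stats maps a phoneme to the (mean, std) of its duration in frames.
    """
    rng = rng or random.Random(0)
    repeated: List[str] = []
    for phone in phones:
        mean, std = duration_stats[phone]
        frames = rng.gauss(mean, std)  # sampled duration in frames
        # Assumption: round to nearest and keep at least one copy of each phoneme.
        repeats = max(1, round(frames / downsample))
        repeated.extend([phone] * repeats)
    return repeated


if __name__ == "__main__":
    # Placeholder statistics (mean, std) in frames; not estimated from TIMIT.
    stats = {"JH": (8.0, 2.0), "AA": (16.0, 4.0), "N": (6.0, 2.0)}
    print(" ".join(rep_phonestream(["JH", "AA", "N"], stats)))
```

Since the durations are sampled, the same sentence can yield a different Rep-Phonestream sequence each time it is generated.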

5.3 Training

Our implementation of the MMDA model is based on ESPNET with a PyTorch backend [17, 25]. The acoustic encoder comprises 4 BLSTM layers. We use a “pyramidal” encoder scheme where the first two layers down-sample the input by a factor of 2 [4]. 320 forward + 320 backward LSTM units are used in each layer, and the resulting 640 output units are projected down to 320 before being passed on to higher-level layers. For the augmenting encoder we use a single BLSTM layer with the same number of units and projection scheme as the acoustic encoder. No down-sampling is done on the augmenting input. We employ location-aware attention in all our experiments [22]. For WSJ experiments the decoder is a 2-layer LSTM with 300 hidden units, while a single layer is used for CHIME4. We use Adadelta to optimize all our models for 15 epochs [26]. During training we compute training/validation accuracies at the end of every epoch and save a checkpoint model, and use the best scoring model for evaluation.

For decoding, we use beam search with a beam size of 10 for WSJ and 20 for CHIME4. In both cases we restrict the output using a minimum-length and a maximum-length threshold, both of which are set relative to the length of the down-sampled input. For RNNLM integration, we trained a 2-layer LSTM language model with hidden units. The RNNLM was trained on the same sentences that were used for augmentation.

5.4 Results

                            CER (eval, dev)    WER (eval, dev)
No-Augmentation             7.0, 9.9           19.5, 24.8
Charstream                  7.5, 10.5          20.3, 25.7
Phonestream                 7.4, 10.1          20.4, 25.3
Rep-Phonestream             7.1, 9.8           17.5, 22.7
No-Augmentation + RNNLM     7.0, 9.8           17.2, 22.2
Rep-Phonestream + RNNLM     6.7, 9.4           16.0, 20.8
Table 2: Experiments on WSJ corpus using different synthetic input types.

Table 2 shows the ASR results on WSJ. Although neither Charstream nor Phonestream augmentation beats the baseline, the Rep-Phonestream augmentation improves over the baseline WER by a margin of roughly 2% absolute. This supports our intuition that data augmentation works best when the augmenting data is most similar to the real training data. While the Rep-Phonestream scheme beats the baseline in WER, it performs very similarly in terms of CER. Furthermore, we continue to observe gains in WER when an RNNLM is incorporated in the decoding process [6] (see the RNNLM rows of Table 2).
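
The relative improvement corresponding to these absolute gains can be worked out directly from the Table 2 WER values for the No-Augmentation and Rep-Phonestream systems. The helper name `relative_improvement` is hypothetical.

```python
def relative_improvement(baseline: float, system: float) -> float:
    """Relative error-rate reduction of a system over a baseline, in percent."""
    return 100.0 * (baseline - system) / baseline


if __name__ == "__main__":
    # WER values from Table 2: No-Augmentation vs. Rep-Phonestream (no RNNLM).
    print(round(relative_improvement(19.5, 17.5), 1))  # eval
    print(round(relative_improvement(24.8, 22.7), 1))  # dev
```

These two values, about 10.3% on eval and 8.5% on dev, are consistent with the 8-10% relative WER improvement stated in the abstract.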

As for the CHIME4 experiments, note that this model uses a relatively shallow encoder compared to state-of-the-art end-to-end models, with almost no parameter tuning. While the CER and WER are considerably higher in CHIME4 than in WSJ, we still observe a similar trend in Rep-Phonestream performance compared to the baseline (Table 3). The same augmenting data from WSJ was used in this experiment as well.

                    CER (eval1, eval2, dev)    WER (eval1, eval2, dev)
No-Augmentation     40.9, 29.3, 29.2           66.0, 51.4, 50.8
Rep-Phonestream     40.0, 28.9, 28.5           65.0, 50.6, 49.8
Table 3: Experiments on CHIME4 corpus using Rep-Phonestream synthetic input.

We find that the Rep-Phonestream MMDA system tends to replace entire words when incorrect, while the baseline system tends to incorrectly change a few characters in a word, even if the resulting word does not exist in English. Consequently, the baseline system tends to create nonsense words while the Rep-Phonestream MMDA system generates valid (“legal”) words, which often causes the MMDA system to be penalized more in terms of CER. For example, the baseline model on one occasion substitutes QUOTA with COLOTA, while the Rep-Phonestream MMDA system predicts COLORS. We verify this hypothesis by computing the ratio of errors resulting in nonsense words to the total number of word errors for both systems on the WSJ development set (see Table 4).

                    Nonsense error %    Legal error %
No-Augmentation     31.48               68.51
Rep-Phonestream     24.13               75.86
Table 4: Error type differences between the Rep-Phonestream MMDA trained system and the baseline system. “Nonsense errors” are substitutions or insertions that result in predicted words that are not legal English words, e.g. CASINO being substituted with ACCINO. “Legal errors” are substitutions or insertions that result in incorrect but legal English words, e.g. BOEING substituted with BOLDING.
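
The error breakdown in Table 4 can be sketched as a membership test of each erroneously predicted word against a list of legal English words. The function name and the small word list below are hypothetical. The text does not say which word list was used to decide legality, and extracting the erroneous words from an alignment of hypothesis and reference is assumed to have been done already.

```python
from typing import Iterable, Set, Tuple


def error_type_percentages(
    predicted_error_words: Iterable[str], legal_words: Set[str]
) -> Tuple[float, float]:
    """Return (nonsense %, legal %) over wrongly predicted words."""
    words = list(predicted_error_words)
    if not words:
        return (0.0, 0.0)
    nonsense = sum(1 for w in words if w not in legal_words)
    nonsense_pct = 100.0 * nonsense / len(words)
    return (nonsense_pct, 100.0 - nonsense_pct)


if __name__ == "__main__":
    # Erroneous predictions quoted in the text, checked against a toy word list.
    toy_legal_words = {"COLORS", "BOLDING", "QUOTA", "CASINO", "BOEING"}
    errors = ["COLOTA", "COLORS", "ACCINO", "BOLDING"]
    print(error_type_percentages(errors, toy_legal_words))
```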

6 Conclusion & Future Work

We proposed the MMDA framework, which exposes our end-to-end ASR system to a much wider range of training data. To the best of our knowledge, this is the first attempt at truly end-to-end multi-modal data augmentation for ASR. Our framework is easily expandable to other end-to-end sequence transduction applications. Preliminary experiments show promising results for our MMDA architecture under several settings.

In the future, we would like to experiment with using alternate and deeper acoustic and augmenting encoders as they may yield more significant gains. We also recognize that our phoneme duration model for the Rep-Phonestream experiments assumes that durations are independent of phoneme context. We would like to break this assumption by learning a phoneme duration model that sees a wider phonemic context in training.

Two future applications of our MMDA framework would be ASR adaptation to a new domain and speech translation. To adapt ASR to a new domain, or even a new language, we can train on additional augmenting data derived from the new domain or language. We believe the MMDA framework may be well suited to speech translation due to its similarity to back-translation.


  • [1] C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, K. Gonina et al., “State-of-the-art speech recognition with sequence-to-sequence models,” arXiv preprint arXiv:1712.01769, 2017.
  • [2] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an ASR corpus based on public domain audio books,” in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP).    IEEE, 2015.
  • [3] Y. Miao, M. Gowayyed, and F. Metze, “Eesen: End-to-end speech recognition using deep rnn models and wfst-based decoding,” in Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on.    IEEE, 2015, pp. 167–174.
  • [4] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015.
  • [5] A. Maas, Z. Xie, D. Jurafsky, and A. Ng, “Lexicon-free conversational speech recognition with neural networks,” in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015, pp. 345–354.
  • [6] T. Hori, S. Watanabe, Y. Zhang, and W. Chan, “Advances in joint CTC-attention based end-to-end speech recognition with a deep CNN encoder and RNN-LM,” in Interspeech, 2017, pp. 949–953.
  • [7] A. Graves and N. Jaitly, “Towards end-to-end speech recognition with recurrent neural networks,” in International Conference on Machine Learning (ICML), 2014, pp. 1764–1772.
  • [8] R. Sennrich, B. Haddow, and A. Birch, “Improving Neural Machine Translation Models with Monolingual Data,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).    Berlin, Germany: Association for Computational Linguistics, August 2016, pp. 86–96.
  • [9] D. B. Paul and J. M. Baker, “The design for the Wall Street Journal-based CSR corpus,” in Proceedings of the workshop on Speech and Natural Language.    Association for Computational Linguistics, 1992, pp. 357–362.
  • [10] E. Vincent, S. Watanabe, A. A. Nugraha, J. Barker, and R. Marxer, “An analysis of environment, microphone and data simulation mismatches in robust speech recognition,” Computer Speech & Language, vol. 46, pp. 535–557, 2017.
  • [11] A. Ragni, K. M. Knill, S. P. Rath, and M. J. Gales, “Data augmentation for low resource languages,” in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
  • [12] T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, “Audio augmentation for speech recognition,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
  • [13] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates et al., “Deep speech: Scaling up end-to-end speech recognition,” arXiv preprint arXiv:1412.5567, 2014.
  • [14] A. Currey, A. V. M. Barone, and K. Heafield, “Copied monolingual data improves low-resource neural machine translation,” in Proceedings of the Second Conference on Machine Translation, 2017, pp. 148–156.
  • [15] R. Arora and K. Livescu, “Multi-view cca-based acoustic features for phonetic recognition across speakers and domains,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on.    IEEE, 2013, pp. 7135–7139.
  • [16] Y. Mroueh, E. Marcheret, and V. Goel, “Deep multimodal learning for audio-visual speech recognition,” in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on.    IEEE, 2015, pp. 2130–2134.
  • [17] S. Watanabe, T. Hori, S. Kim, J. R. Hershey, and T. Hayashi, “Hybrid CTC/attention architecture for end-to-end speech recognition,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1240–1253, 2017.
  • [18] S. Toshniwal, T. N. Sainath, R. J. Weiss, B. Li, P. Moreno, E. Weinstein, and K. Rao, “Multilingual speech recognition with a single end-to-end model,” arXiv preprint arXiv:1711.01694, 2017.
  • [19] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
  • [20] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on.    IEEE, 2016, pp. 4960–4964.
  • [21] J. Chorowski, D. Bahdanau, K. Cho, and Y. Bengio, “End-to-end continuous speech recognition using attention-based recurrent nn: First results,” arXiv preprint arXiv:1412.1602, 2014.
  • [22] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” in Advances in Neural Information Processing Systems (NIPS), 2015, pp. 577–585.
  • [23] J. R. Novak, N. Minematsu, and K. Hirose, “WFST-based grapheme-to-phoneme conversion: Open source tools for alignment, model-building and decoding.” in FSMNLP, 2012, pp. 45–49.
  • [24] J. Kominek and A. W. Black, “The cmu arctic speech databases,” in Fifth ISCA Workshop on Speech Synthesis, 2004.
  • [25] S. Kim, T. Hori, and S. Watanabe, “Joint ctc-attention based end-to-end speech recognition using multi-task learning,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on.    IEEE, 2017, pp. 4835–4839.
  • [26] M. D. Zeiler, “Adadelta: an adaptive learning rate method,” arXiv preprint arXiv:1212.5701, 2012.