Non-Autoregressive Transformer Automatic Speech Recognition


Very deep transformers have recently started to outperform traditional bi-directional long short-term memory networks by a large margin. However, inference computation cost and latency remain serious concerns for production use. In this paper, we study a novel non-autoregressive transformer structure for speech recognition, which was originally introduced in machine translation. During training, input tokens fed to the decoder are randomly replaced by a special mask token. The network is required to predict the tokens at those masked positions by taking both the context and the input speech into consideration. During inference, we start from all mask tokens and the network gradually predicts all tokens based on partial results. We show that this framework can support different decoding strategies, including traditional left-to-right decoding. A new decoding strategy is proposed as an example, which proceeds from the easiest predictions to the most difficult ones. Preliminary results on the Aishell and CSJ benchmarks show that it is possible to train such a non-autoregressive network for ASR. On Aishell in particular, the proposed method outperforms the Kaldi nnet3 and chain model setups and comes quite close to the performance of the state-of-the-art end-to-end model.


Nanxin Chen, Shinji Watanabe, Jesús Villalba, Najim Dehak
Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD, USA

Keywords: automatic speech recognition, transformer, non-autoregressive, end-to-end

1 Introduction

In recent studies, very deep end-to-end automatic speech recognition (ASR) systems have started to match and outperform conventional ASR systems [1, 2, 3]. They mainly use encoder-decoder structures based on long short-term memory recurrent neural networks [4] and transformer networks [1, 3]. These systems share a common characteristic: they rely on a probabilistic chain-rule factorization combined with left-to-right training and decoding. During training, the ground truth history tokens are fed to the decoder to predict the next token. During inference, the ground truth history tokens are replaced by the decoder's previous predictions. While this combination allows tractable log-likelihood computation, maximum likelihood training, and beam-search based approximation, it makes parallel computation in decoding more difficult. The left-to-right beam search algorithm needs to run the decoder a number of times that depends on the output sequence length and the beam size, which makes the total computation extremely large.

Non-autoregressive end-to-end models have recently started to attract more attention in neural machine translation [5, 6, 7, 8, 9, 10]. The idea is that the system predicts the whole sequence within a constant number of iterations, which does not depend on the output sequence length. In [5] the authors introduced hidden "fertility" variables, which are integers corresponding to the number of words in the target sentence that can be aligned to each source word. The fertility predictor is trained on predictions from an external aligner. [6] used multiple iterations of refinement starting from some "corrupted" predictions. Instead of predicting a fertility for each word in the source sequence, they only need to predict the total length of the target sequence. Another direction explored in previous studies is to allow the output sequence to grow dynamically [7, 8, 9]. All of these works insert words into the output sequence iteratively, based on an insertion order or an explicit tree structure. This allows an arbitrary output sequence length without having to decide it before decoding. However, since any sub-sequence can be a partial result during decoding, training requires some sampling or approximation. Across all of these directions, a common procedure for neural machine translation is to perform knowledge distillation [5]. In machine translation, multiple correct translations can exist for a given input sentence, so a pre-trained autoregressive model is used to provide a unique target sequence for the training data.

Our work is based on mask-predict, recently proposed for neural machine translation [10]. Mask-predict is a conditional language model similar to BERT [11]. During training, some random words are replaced by a special mask token and the network is trained to predict the original tokens. The difference between our approach and BERT is that our system makes predictions conditioned on the input speech. During inference, the decoder can condition on any subset of tokens to predict the rest, given the input speech. In practice, we start from an empty set (all mask tokens) and gradually complete the whole sequence. The subset we choose can be quite flexible, which makes any decoding order possible. In ASR there is no need for knowledge distillation, since in most cases a unique transcript exists.

This paper is organized as follows. Section 2 introduces the autoregressive end-to-end model and how to adapt it to the non-autoregressive setting. Different decoding strategies are also discussed. Section 3 introduces the experimental setup and presents results on different corpora. It also includes further analysis of the difference between autoregressive and non-autoregressive ASR. Section 4 summarizes this paper and provides several directions for future research in this area.

2 Non-autoregressive End-to-end ASR

Figure 1: Comparison between the normal transformer network and the mask-predict transformer network. The transformer uses ground truth history tokens during training, while during inference previous predictions are used, as shown by the dashed line. For mask-predict transformer training, random tokens in the decoder input are replaced by a special <mask> token and the network is required to predict the tokens at those positions. Both networks condition on the encoder outputs of the whole sequence.

To study non-autoregressive end-to-end ASR, it is important to understand how current autoregressive speech recognition systems work. As shown in the top part of Figure 1, a general sequence-to-sequence model consists of an encoder and a decoder. The encoder takes speech features $x_t$, such as filter banks, as input and produces hidden representations $h_t$. The decoder predicts the next token $y_l$ based on the previous history $y_{<l}$ and all hidden representations $h = (h_1, h_2, \dots)$:

$$P(y_l \mid y_{<l}, x) = P_{\mathrm{dec}}(y_l \mid y_{<l}, f_l(h)) \qquad (1)$$

where $f_l$ is a function on all hidden representations $h$. A common choice of $f_l$ is an attention mechanism, which can be considered a weighted combination of all representations:

$$f_l^{\mathrm{att}}(h) = \sum_{t} w_{l,t} \, h_t \qquad (2)$$

The weights $w_{l,t}$ are usually determined by similarities between the current decoder state and all representations.
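As an illustration of the weighted combination described above, the following minimal sketch computes attention weights with NumPy. The function name `attend`, the dot-product similarity, and the toy dimensions are assumptions made for this example; the transformer used in the paper relies on multi-head attention rather than this single-head form.

```python
import numpy as np

def attend(query, h):
    """Return attention weights and the weighted sum of hidden representations.

    query: (d,) current decoder state; h: (T, d) encoder outputs.
    Dot-product similarity followed by a softmax is an assumed choice here.
    """
    scores = h @ query                      # (T,) similarity to each frame
    scores = scores - scores.max()          # stabilise the softmax
    weights = np.exp(scores) / np.exp(scores).sum()
    context = weights @ h                   # (d,) weighted combination
    return weights, context

rng = np.random.default_rng(0)
h = rng.normal(size=(6, 4))                 # 6 frames, 4-dim representations
weights, context = attend(h[2], h)
```

The weights are non-negative and sum to one, so `context` is a convex combination of the rows of `h`.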

During training, ground truth history tokens are usually used as input to the decoder for two reasons. First, it is faster, since the computation of all $P(y_l \mid y_{<l}, x)$ can be done in parallel. Second, training can be very difficult and slow if predictions are used instead, especially for very long sequences [12, 13]. The expanded computation graph becomes very deep, similar to recurrent neural networks without truncation.

During inference, since no ground truth is given, predictions need to be used instead. This means equation 1 needs to be computed sequentially for every token in the output, and each prediction requires one pass of decoder computation. Depending on the output sequence length and the unit used, this procedure can be very slow in certain cases, such as character-based Transformer models.
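The sequential dependence can be made concrete with a small sketch of greedy left-to-right decoding. The `step` callable is a hypothetical stand-in for one decoder pass, and the toy token ids are assumptions. The sketch uses greedy search, so it does not show the additional beam-size factor.

```python
def greedy_decode(step, sos, eos, max_len):
    """Left-to-right greedy decoding; `step` maps a token history to the next token."""
    history = [sos]
    calls = 0
    while len(history) - 1 < max_len:
        nxt = step(history)        # one decoder pass per output token
        calls += 1
        history.append(nxt)
        if nxt == eos:
            break
    return history[1:], calls

# Toy stand-in for the decoder: emits 10, 11, 12 and then <eos> (id 1).
script = [10, 11, 12, 1]
tokens, calls = greedy_decode(lambda hist: script[len(hist) - 1], sos=0, eos=1, max_len=20)
```

Here the number of decoder passes equals the number of output tokens, which is the cost that a non-autoregressive model tries to avoid.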

The key insight to make the model non-autoregressive is to replace the history $y_{<l}$ with some other input. One cannot simply drop it, because the resulting conditional independence assumption is too strong for ASR. This work is mainly inspired by [10]. The idea is to replace $y_{<l}$ with the partial decoding results $\hat{y}$ obtained from previous computations. A new token <mask> is introduced for training and decoding, similar to the idea of BERT [11]. The formula is given in equation 3.

$$P(y_l \mid \hat{y}, x) = P_{\mathrm{dec}}(y_l \mid \hat{y}, f_l(h)) \qquad (3)$$

As shown in the bottom part of Figure 1, during training some random tokens are replaced by the special token <mask>. The network is asked to predict the original tokens at the masked positions based on the input speech and the context. We sample the number of mask tokens from a uniform distribution over the whole utterance length and randomly replace that many ground truth tokens with <mask>. In theory, if we mask more tokens the model relies more on the input speech, and if we mask fewer tokens the context is utilized in a way similar to a language model. This combines the advantages of both speech recognition and language modeling. Also, the predictions can be made simultaneously since we assume they are conditionally independent.
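The masking step can be sketched as follows. This is a minimal illustration, not the ESPnet implementation: `MASK_ID`, the function name `mask_tokens`, and the choice to draw the number of masked positions uniformly from 1 to the sequence length are assumptions.

```python
import numpy as np

MASK_ID = 0  # hypothetical id reserved for the <mask> token

def mask_tokens(tokens, rng):
    """Randomly mask a ground-truth token sequence for training.

    Returns (decoder_input, is_masked). The number of masked positions is
    drawn uniformly from 1..len(tokens); this range is an assumption.
    """
    tokens = np.asarray(tokens)
    length = len(tokens)
    num_masked = rng.integers(1, length + 1)          # uniform over 1..length
    positions = rng.choice(length, size=num_masked, replace=False)
    is_masked = np.zeros(length, dtype=bool)
    is_masked[positions] = True
    decoder_input = np.where(is_masked, MASK_ID, tokens)
    return decoder_input, is_masked

rng = np.random.default_rng(0)
target = [5, 9, 3, 7, 2]                              # toy token ids, none equal to MASK_ID
decoder_input, is_masked = mask_tokens(target, rng)
# A training loss would be computed only at positions where is_masked is True.
```

Unmasked positions keep their ground truth tokens, so the decoder sees a mixture of context and <mask> placeholders.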

2.1 Decoding Strategies

During inference, a multi-iteration process is considered. Other than traditional left-to-right, two different strategies are studied: easy first and mask-predict.

2.1.1 Easy first

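The idea of this strategy is to predict the most obvious tokens first. In the first iteration, the decoder is fed with all <mask> tokens since we do not have any partial results. After getting the decoding results, we keep the most confident ones and update them in $\hat{y}$:

$$\hat{y}_l^{k} = \begin{cases} \arg\max_{V} P(y_l = V \mid \hat{y}^{k-1}, x) & \text{if position } l \text{ is among the } C \text{ most confident masked positions} \\ \hat{y}_l^{k-1} & \text{otherwise} \end{cases} \qquad (4)$$

where $C = \lceil L/K \rceil$ is the largest number of predictions we keep in each iteration, $L$ is the sequence length and $K$ is the total number of iterations. Conditioned on this new $\hat{y}$, the network is required to make new predictions if the sentence is not finished.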
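A minimal sketch of easy-first decoding is given below. The `predict` callable is a hypothetical stand-in for the decoder that returns per-position posteriors, `MASK` is an assumed placeholder id, and the number of predictions kept per iteration is taken to be ceil(L / K). Restricting the selection to positions that are still masked is also an assumption of this sketch.

```python
import math
import numpy as np

MASK = -1  # hypothetical id standing in for <mask>

def easy_first(predict, length, num_iters):
    """Fill `length` positions in at most `num_iters` passes, most confident first."""
    keep = math.ceil(length / num_iters)       # C = ceil(L / K)
    y_hat = np.full(length, MASK)
    for _ in range(num_iters):
        masked = np.flatnonzero(y_hat == MASK)
        if masked.size == 0:
            break
        probs = predict(y_hat)                 # (length, vocab) posteriors
        conf = probs.max(axis=1)
        best = probs.argmax(axis=1)
        # Keep the C most confident predictions among still-masked positions.
        chosen = masked[np.argsort(-conf[masked])[:keep]]
        y_hat[chosen] = best[chosen]
    return y_hat

# Toy stand-in for the decoder: fixed posteriors that ignore the partial input.
toy_probs = np.array([[0.9, 0.1], [0.4, 0.6], [0.2, 0.8], [0.55, 0.45], [0.3, 0.7]])
result = easy_first(lambda y: toy_probs, length=5, num_iters=3)
```

With L = 5 and K = 3, two tokens are fixed in each of the first two passes and the last one in the third pass. Once a token is fixed it is never revisited.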

2.1.2 Mask-predict

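This strategy is studied in [10]. We still start with $\hat{y}$ set to all <mask> tokens. In each iteration, we check the confidence score of each output token and replace the least confident ones with <mask> tokens. The number of masked tokens in the $k$-th iteration is:

$$n_k = \lfloor L \cdot (1 - k/K) \rfloor \qquad (5)$$

where $k = 0, 1, \dots, K-1$ and $k = 0$ corresponds to the initial all-<mask> input. For instance, if $K = 10$, we mask 90% of the tokens in the first iteration after that, 80% in the next and so on. After getting the prediction results, we update all tokens previously masked in $\hat{y}$:

$$\hat{y}_l^{k} = \begin{cases} \arg\max_{V} P(y_l = V \mid \hat{y}^{k-1}, x) & \text{if position } l \text{ was masked} \\ \hat{y}_l^{k-1} & \text{otherwise} \end{cases} \qquad (6)$$

The difference between mask-predict and easy first is that mask-predict accepts all decisions but reverts earlier ones if it is less confident about them. Easy first is more conservative and gradually adopts the decisions with the highest confidence. For both strategies, predictions become more and more accurate since the model can utilize context information from both directions. This is achieved by replacing the input history $y_{<l}$ with the full $\hat{y}$, since left-to-right decoding is no longer necessary.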
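The mask-predict loop can be sketched in the same style. As before, `predict` and `MASK` are hypothetical, and the re-masking schedule floor(L * (1 - k/K)) follows the linear decay of [10]. The rounding is an assumption of this sketch.

```python
import numpy as np

MASK = -1  # hypothetical id standing in for <mask>

def mask_predict(predict, length, num_iters):
    """Iteratively re-predict the least confident tokens (after [10])."""
    y_hat = np.full(length, MASK)
    conf = np.zeros(length)
    masked = np.arange(length)                      # iteration 0: everything is masked
    for k in range(num_iters):
        probs = predict(y_hat)                      # (length, vocab) posteriors
        y_hat[masked] = probs.argmax(axis=1)[masked]
        conf[masked] = probs.max(axis=1)[masked]
        # Tokens to re-mask before the next pass: floor(L * (1 - (k + 1) / K)).
        num_masked = length * (num_iters - k - 1) // num_iters
        if num_masked == 0:
            break
        masked = np.argsort(conf)[:num_masked]      # least confident positions
        y_hat[masked] = MASK
    return y_hat, conf

# Toy stand-in for the decoder: posteriors change once some context is available.
first_pass = np.array([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.55, 0.45], [0.3, 0.7]])
later_pass = np.array([[0.9, 0.1], [0.1, 0.9], [0.2, 0.8], [0.35, 0.65], [0.3, 0.7]])

def toy_predict(y_hat):
    return first_pass if (y_hat == MASK).all() else later_pass

y_hat, conf = mask_predict(toy_predict, length=5, num_iters=3)
```

In this toy run the tokens at positions 1 and 3 are first predicted as 0 and later revised to 1 once context is available. Easy first would not revisit a token after it has been fixed.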

2.2 Example

One example is given in Figure 2. The left part shows mask-predict and the right part demonstrates easy first. In this example the sequence length is 4, but after adding the <eos> token to the end of the sequence we have $L = 5$ and $K = 3$. In the first iteration, the network is fed with all <mask> tokens. The top $\lceil L/K \rceil = 2$ tokens are kept in each iteration and, based on the partial results, the network predicts again for all remaining <mask> tokens. We demonstrate the two decoding strategies.

Easy first always ranks the confidence scores from the last iteration and then keeps the top-2 most confident predictions. Based on the partial results it completes the rest.

Mask-predict maintains confidence scores from multiple iterations. It chooses the least confident tokens across all scores to mask. In the last iteration it chooses to change its previous prediction of "so" because its confidence is lower than that of the other predictions from the second iteration.

Figure 2: Illustration of the inference procedure. To predict the whole sequence with K=3 passes, the network is initially fed with all <mask> tokens. Shading represents the certainty of the network outputs. The left part shows the mask-predict process. In the last iteration, it goes back to the word "so" because the model was less confident about it in the first iteration than about the other predictions in other iterations. The right part shows the easy first process. Since "so" is confident enough in the first iteration to be decided, it never changes afterwards.

The normal inference procedure can be considered a special case where $K = L$ and, instead of taking the most confident prediction, the prediction of the next token is always adopted. In general, this approach is flexible enough to support different decoding strategies: left-to-right, right-to-left, easy first, mask-predict and other unexplored strategies.

2.3 Output sequence length prediction

In [10] the authors introduced a special token <length> in the input to predict the output sequence length. This is reasonable for word sequences, but for end-to-end speech recognition it can be quite difficult since the length of a character or BPE sequence varies a lot. In this paper a simpler approach is proposed: we ask the network to predict the end-of-sequence token <eos> at the end of the sequence, as shown in Figure 2.

In inference, we still need to specify the initial length. We manually specify it to some constant value for the first iteration. After that, we change it to the predicted length in the first iteration to speed things up.
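A small sketch of this length handling is shown below. The `EOS` id, the initial length of 8, and the toy predictions are assumptions for illustration; cutting at the first predicted <eos> is one plausible reading of "the predicted length".

```python
EOS = 1  # hypothetical id of the <eos> token

def predicted_length(tokens, eos=EOS):
    """Length up to and including the first <eos>, or the full length if none is found."""
    for i, tok in enumerate(tokens):
        if tok == eos:
            return i + 1
    return len(tokens)

initial_length = 8                               # manually chosen constant (assumed value)
first_pass = [7, 4, 9, 1, 3, 1, 1, 5]            # toy first-iteration predictions
length = predicted_length(first_pass)            # later iterations use this shorter length
hypothesis = first_pass[:length - 1]             # tokens before <eos>
```

Later iterations then only need to process `length` positions instead of `initial_length`.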

In Section 3 we discuss potential problems for this approach under certain conditions.

3 Experiments

For experiments, we mainly use Aishell [14] and the Corpus of Spontaneous Japanese (CSJ) [15]. ESPnet [4] is used for all experiments.

For the autoregressive baseline, we use state-of-the-art transformer end-to-end systems [1]. In the Aishell experiments, the encoder consists of 12 transformer blocks with convolutional layers at the beginning for downsampling. The decoder consists of 6 transformer blocks. For all transformer blocks, 4 heads are used for attention. The network is trained for 50 epochs and warmup [16] is used for the early iterations.

The results on Aishell are given in Table 1. All decoding methods reach performance very close to the state-of-the-art autoregressive model. The non-autoregressive models outperform the two hybrid Kaldi systems by 17% and 5% relative, respectively. Moreover, the real-time factor is reduced from 7.4 to 0.7 with mask-predict, which is an 11x speedup. The reason is that our non-autoregressive systems only perform decoder computation a constant number of times, whereas for the autoregressive model the number of decoder passes depends on the length of the output sequence.

System                    Dev CER (%)   Test CER (%)   Real-time factor
Baseline (Transformer)    6.0           6.7            7.4
Baseline (Kaldi nnet3)    -             8.6            -
Baseline (Kaldi chain)    -             7.5            -
Left-to-right             6.5           7.2            5.6
Easy first (K=1)          6.8           7.6            0.7
Easy first (K=3)          6.4           7.1            0.9
Easy first (K=10)         6.4           7.2            0.9
Mask-predict (K=1)        6.8           7.6            0.7
Mask-predict (K=3)        6.4           7.2            0.7
Mask-predict (K=10)       6.4           7.2            0.8
Table 1: Results comparison on Aishell
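The relative improvements and the speedup quoted for Aishell can be checked with a few lines of arithmetic. The sketch assumes that the comparison with Kaldi uses the best non-autoregressive test CER in Table 1 (7.1, easy first with K=3); the helper name `relative_reduction` is hypothetical.

```python
def relative_reduction(baseline, system):
    """Relative reduction of `system` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - system) / baseline

best_test_cer = 7.1                      # easy first, K=3 (Table 1)
vs_nnet3 = relative_reduction(8.6, best_test_cer)
vs_chain = relative_reduction(7.5, best_test_cer)
speedup = 7.4 / 0.7                      # autoregressive RTF / mask-predict RTF
print(f"{vs_nnet3:.1f}% {vs_chain:.1f}% {speedup:.1f}x")
```

Under that assumption the values round to the 17%, 5% and 11x figures reported in the text.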

We further investigated the connection between the character error rate (CER) and the output sequence length. It is natural to expect longer sequences to be more challenging, since it is more difficult to align predictions with all positions simultaneously. Here we compare the autoregressive and non-autoregressive models and show the character error rate for different output sequence lengths. Decoding results are sorted by their ground truth length, and the character error rate for different target sequence lengths is reported as curves in Figure 3. The performance of the autoregressive and non-autoregressive models is very close in most cases, but for very long output sequences the deletion error increases considerably for the non-autoregressive system. The problem is similar to the word skipping problem in text-to-speech [17].

Figure 3: Error analysis of autoregressive and non-autoregressive models for different output sequence lengths. Dotted lines indicate the different error types: substitution (S), deletion (D), insertion (I).
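The error breakdown used in this kind of analysis can be sketched with a standard edit-distance alignment. The function below is an illustrative implementation rather than the scoring tool used in the experiments; when several minimal alignments exist, it picks one according to an arbitrary but deterministic tie-break.

```python
def error_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) of a minimal edit alignment."""
    # cost[i][j] = (total_edits, S, D, I) for ref[:i] versus hyp[:j]
    cost = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        cost[i][0] = (i, 0, i, 0)                # all deletions
    for j in range(1, len(hyp) + 1):
        cost[0][j] = (j, 0, 0, j)                # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            t, s, d, n = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                match = (t, s, d, n)
            else:
                match = (t + 1, s + 1, d, n)
            t, s, d, n = cost[i - 1][j]
            delete = (t + 1, s, d + 1, n)
            t, s, d, n = cost[i][j - 1]
            insert = (t + 1, s, d, n + 1)
            cost[i][j] = min(match, delete, insert)
    return cost[len(ref)][len(hyp)][1:]

subs, dels, ins = error_counts("abcdef", "abdxf")
cer = 100.0 * (subs + dels + ins) / len("abcdef")
```

Grouping utterances by reference length and accumulating these counts per group gives per-length S, D and I curves.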

CSJ results are given in Table 2, obtained with the default setup in ESPnet [4]. Here we observe a larger difference between the non-autoregressive and autoregressive models. Running multiple iterations of either decoding strategy does not improve the results. We emphasize again that this is a problem of long output sequences, which leads to an increase in deletion errors.

System                    Eval1 CER (%)   Eval2 CER (%)   Eval3 CER (%)   Real-time factor
Baseline (Transformer)    5.7             4.1             4.5
Baseline (Kaldi)          7.5             6.3             6.9             -
Left-to-right             8.8             6.4             7.3
Easy first (K=1)          9.1             6.9             7.7             3.7
Easy first (K=3)          9.8             7.4             8.2             3.5
Easy first (K=10)         10.6            8.0             9.1             5.2
Mask-predict (K=1)        9.1             6.9             7.7             4.1
Mask-predict (K=3)        11.3            9.0             9.6             3.9
Mask-predict (K=10)       13.7            11.5            11.3            4.6
Table 2: Results comparison on CSJ

To understand why multiple iterations do not help, we show one test example of mask-predict in Figure 4. The model makes perfect predictions at the beginning of the sequence, but it misses one character. To insert that character, the model would have to mask all following characters and re-estimate them. Even though the model is not confident about that position, since there are actually two characters there, it cannot mask all the following tokens because some of them are predicted with high confidence. In summary, predicting absolute positions simultaneously is very difficult, which suggests the possibility of applying insertion-based models [7, 8, 9].

Figure 4: Example of easy first on a very long sequence. Underscores indicate the tokens the model chooses to mask based on confidence.

4 Conclusion

In this paper, we study a novel non-autoregressive framework for transformer-based automatic speech recognition (ASR). Under this framework different decoding strategies become possible, and two of them are discussed: mask-predict and easy first. Compared to the classical left-to-right order, these two show a large speedup with reasonable performance. On Aishell in particular, the speedup is up to 11 times while performance is quite close to the autoregressive model. We further analyze the problem of the non-autoregressive model for ASR on long output sequences. This suggests several possibilities for future research.


  • [1] Shigeki Karita, Nanxin Chen, Tomoki Hayashi, Takaaki Hori, Hirofumi Inaguma, Ziyan Jiang, Masao Someki, Nelson Enrique Yalta Soplin, Ryuichi Yamamoto, Xiaofei Wang, et al., “A comparative study on Transformer vs RNN in speech applications,” arXiv preprint arXiv:1909.06317, 2019.
  • [2] Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le, “SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition,” in Proc. Interspeech 2019, 2019, pp. 2613–2617.
  • [3] Christoph Lüscher, Eugen Beck, Kazuki Irie, Markus Kitza, Wilfried Michel, Albert Zeyer, Ralf Schlüter, and Hermann Ney, “RWTH ASR systems for LibriSpeech: Hybrid vs attention,” Interspeech, Graz, Austria, pp. 231–235, 2019.
  • [4] Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson-Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, et al., “ESPnet: End-to-end speech processing toolkit,” Proc. Interspeech 2018, pp. 2207–2211, 2018.
  • [5] Jiatao Gu, James Bradbury, Caiming Xiong, Victor OK Li, and Richard Socher, “Non-autoregressive neural machine translation,” arXiv preprint arXiv:1711.02281, 2017.
  • [6] Jason Lee, Elman Mansimov, and Kyunghyun Cho, “Deterministic non-autoregressive neural sequence modeling by iterative refinement,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 1173–1182.
  • [7] Jiatao Gu, Qi Liu, and Kyunghyun Cho, “Insertion-based decoding with automatically inferred generation order,” arXiv preprint arXiv:1902.01370, 2019.
  • [8] Mitchell Stern, William Chan, Jamie Kiros, and Jakob Uszkoreit, “Insertion transformer: Flexible sequence generation via insertion operations,” in International Conference on Machine Learning, 2019, pp. 5976–5985.
  • [9] Sean Welleck, Kianté Brantley, Hal Daumé III, and Kyunghyun Cho, “Non-monotonic sequential text generation,” in International Conference on Machine Learning, 2019, pp. 6716–6726.
  • [10] Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer, “Mask-predict: Parallel decoding of conditional masked language models,” arXiv preprint arXiv:1904.09324, 2019.
  • [11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171–4186.
  • [12] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer, “Scheduled sampling for sequence prediction with recurrent neural networks,” in Advances in Neural Information Processing Systems, 2015, pp. 1171–1179.
  • [13] Alex M Lamb, Anirudh Goyal Alias Parth Goyal, Ying Zhang, Saizheng Zhang, Aaron C Courville, and Yoshua Bengio, “Professor forcing: A new algorithm for training recurrent networks,” in Advances In Neural Information Processing Systems, 2016, pp. 4601–4609.
  • [14] Hui Bu, Jiayu Du, Xingyu Na, Bengu Wu, and Hao Zheng, “Aishell-1: An open-source Mandarin speech corpus and a speech recognition baseline,” in 2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA). IEEE, 2017, pp. 1–5.
  • [15] Kikuo Maekawa, “Corpus of Spontaneous Japanese: Its design and evaluation,” in ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition, 2003.
  • [16] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.
  • [17] Jing-Xuan Zhang, Zhen-Hua Ling, and Li-Rong Dai, “Forward attention in sequence-to-sequence acoustic modeling for speech synthesis,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4789–4793.