AN IMPROVED HYBRID CTC-ATTENTION MODEL FOR SPEECH RECOGNITION

Zhe Yuan, Zhuoran Lyu, Jiwei Li and Xi Zhou
{yuanzhe, lvzhuoran, lijiwei, zhouxi}@cloudwalk.cn
Cloudwalk Technology Inc., Shanghai, China

Abstract

Recently, end-to-end speech recognition with a hybrid model consisting of connectionist temporal classification (CTC) and an attention-based encoder-decoder has achieved state-of-the-art results. In this paper, we propose a novel CTC decoder structure based on the experiments we conducted and explore the relation between decoding performance and the depth of the encoder. We also apply an attention smoothing mechanism to acquire more context information for subword-based decoding. Taken together, these strategies allow us to achieve a word error rate (WER) of 4.43% without a language model (LM) and 3.34% with an RNN-LM on the test-clean subset of the LibriSpeech corpus, which are, to the best of our knowledge, the best reported WERs for end-to-end ASR systems on this dataset.



Index Terms—  Automatic speech recognition, attention, CTC, RNN-LM

1 Introduction and Background

Automatic speech recognition (ASR), the technology that enables computers to recognize and transcribe spoken language into text, has been widely used in many applications. In the past few decades, ASR relied on complicated traditional techniques such as Hidden Markov Models (HMMs) and Gaussian Mixture Models (GMMs) [1]. These traditional models also require hand-crafted pronunciation dictionaries and predefined alignments between audio and phonemes. Although they achieve state-of-the-art accuracy on most audio corpora, it is quite a challenge to develop such ASR models without sufficient acoustic knowledge. Benefiting from the rapid development of deep learning, a number of end-to-end ASR models have therefore been proposed in recent years.

Connectionist temporal classification (CTC) and sequence-to-sequence (seq2seq) models with attention are the two major approaches in end-to-end ASR systems. Both methods handle variable-length input audio and output text. Deep Speech 2, proposed by Baidu Silicon Valley AI Lab in 2016 [2], made full use of CTC and RNNs and achieved state-of-the-art recognition accuracy. On the seq2seq side, Chorowski et al. applied a seq2seq model with an attention mechanism to speech recognition [3]. However, the accuracy of that model is unsatisfactory, since alignment estimation in the attention mechanism is easily corrupted by noise, especially in real-world tasks.

To overcome this misalignment problem, a combination of CTC and an attention-based seq2seq model was proposed by Watanabe et al. in 2017 [4]. The key to this joint CTC-attention model is training a shared encoder with both the CTC and attention-decoder objective functions simultaneously. This approach improves both training speed and recognition accuracy.

This paper is partly inspired by the above method. Our main contributions include exploring different encoder and decoder network architectures and adopting several optimization methods such as attention smoothing and L2 regularization. We demonstrate that our system outperforms other published end-to-end ASR models in WER on the LibriSpeech dataset.

The paper is organized as follows. Section 2 briefly introduces related work, focusing mainly on the hybrid CTC/attention method. Section 3 details our model architecture, and Section 4 presents our training methods and experimental results. Finally, Section 5 concludes this work.

2 Related Work

In this section, we review the Joint CTC-Attention architecture in Section 2.1 and unit selection methods in Section 2.2.

2.1 Hybrid CTC-attention architecture

The idea of this architecture is to use CTC as an auxiliary objective function when training the attention-based seq2seq network. Fig. 1 illustrates the architecture, in which the encoder consists of several convolutional neural network (CNN) layers followed by bidirectional long short-term memory (BiLSTM) layers, while the decoder includes a CTC network and an attention-based decoder network. According to [5], using CTC together with the attention decoder makes the network more robust, since CTC helps acquire appropriate alignments in noisy conditions. Moreover, CTC also speeds up the convergence of training.

CTC, introduced in [6], provides a method to train RNNs without any prior alignment between inputs and outputs. Suppose the input sequence x has length T; then the probability of a CTC path \pi = (\pi_1, \ldots, \pi_T) can be computed as follows:

p(\pi | x) = \prod_{t=1}^{T} q_t(\pi_t)    (1)

where q_t(\pi_t) denotes the softmax probability of outputting label \pi_t at frame t. Hence the likelihood of the label sequence y can be computed as follows:

p(y | x) = \sum_{\pi \in \Phi(y)} p(\pi | x)    (2)

where \Phi(y) is the set of all possible CTC paths that can be mapped to y. Therefore, the CTC loss is

L_{CTC} = -\ln p(y | x)    (3)
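To make the mapping \Phi(y) in (2) concrete, the sketch below collapses a CTC path to its label sequence by merging consecutive repeats and removing blanks. It is a minimal illustrative example, not the authors' implementation, and the blank index of 0 is an assumption.

```python
# Minimal sketch of the CTC path-to-label mapping behind Eq. (2):
# merge consecutive repeats, then drop blanks. Assumes blank id 0.
def collapse_ctc_path(path, blank=0):
    labels = []
    prev = None
    for p in path:
        if p != prev and p != blank:
            labels.append(p)
        prev = p
    return labels

# Example: both paths below map to the same label sequence [1, 2, 2],
# so both contribute to the sum over \Phi(y) in Eq. (2).
assert collapse_ctc_path([1, 1, 0, 2, 0, 2]) == [1, 2, 2]
assert collapse_ctc_path([0, 1, 2, 2, 0, 2, 2]) == [1, 2, 2]
```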

As for the attention decoder, the probability of the label y_u at each step u depends on the input features x and the previous labels y_{1:u-1}. The probability of the entire sequence can be obtained as follows:

p_{att}(y | x) = \prod_{u} p(y_u | x, y_{1:u-1})    (4)

where

h = Encoder(x)    (5)
c_u = \sum_{t} a_{u,t} h_t    (6)
p(y_u | x, y_{1:u-1}) = Decoder(s_{u-1}, c_u, y_{u-1})    (7)

In the above equations, s_u denotes the decoder hidden state, h_t the encoder output at frame t, and c_u the context vector computed from the input features and the attention weights a_{u,t}. With the attention loss L_{att} = -\ln p_{att}(y | x), the overall multi-task loss is defined as:

L_{MTL} = \alpha L_{CTC} + (1 - \alpha) L_{att}    (8)

where \alpha denotes the weight balancing the two losses, 0 \le \alpha \le 1.

Fig. 1: Architecture of the hybrid CTC-Attention model
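A minimal sketch of the multi-task objective in (8) is shown below, assuming PyTorch; the tensor shapes, the cross-entropy formulation of the attention loss, the padding index, and the default weight are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

# Hybrid loss of Eq. (8): alpha * L_CTC + (1 - alpha) * L_att.
# log_probs: (T, N, C) log-softmax CTC outputs; dec_logits: (N, U, C).
ctc_criterion = nn.CTCLoss(blank=0, zero_infinity=True)
att_criterion = nn.CrossEntropyLoss(ignore_index=-1)  # -1 marks padded targets

def hybrid_loss(log_probs, input_lengths, targets, target_lengths,
                dec_logits, dec_targets, alpha=0.1):
    loss_ctc = ctc_criterion(log_probs, targets, input_lengths, target_lengths)
    # CrossEntropyLoss expects (N, C, U) logits against (N, U) targets.
    loss_att = att_criterion(dec_logits.transpose(1, 2), dec_targets)
    return alpha * loss_ctc + (1.0 - alpha) * loss_att
```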

2.2 Unit selection

Methods based on a large lexicon, such as phoneme-based or word-based ASR systems, are unable to resolve out-of-vocabulary (OOV) problems. Starting from LAS [7], seq2seq models therefore introduced character-based methods. By tying frame-level acoustic information in audio clips to the corresponding characters, the OOV problem is resolved to some extent. However, many characters in English words are silent, and the same character may be pronounced differently in different words (e.g., "a" in "apple" and "and"), so character-level decoding relies heavily on the sequence relationships modeled by the RNN rather than on the acoustic information in the audio frames, which makes character-level decoding uncertain. Considering these issues, a subword-based structure can resolve the OOV problem during decoding on one hand, and can also learn the relationship between acoustic and character information on the other hand. An effective and fast method for generating subwords is byte-pair encoding (BPE) [8], a compression algorithm that iteratively replaces the most frequent pair of units (or bytes) with an unused unit, eventually generating as many new units as there are merge iterations.
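For illustration, below is a minimal sketch of the BPE merge loop described above, adapted from the algorithm in [8]; the toy corpus and the number of merges are assumptions.

```python
import re
from collections import Counter

def get_stats(vocab):
    """Count how often each adjacent symbol pair occurs in the vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Replace the most frequent pair with a single merged symbol."""
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    merged = ''.join(pair)
    return {pattern.sub(merged, word): freq for word, freq in vocab.items()}

# Toy corpus: words split into characters with an end-of-word marker.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6}
for _ in range(10):  # number of merges == number of new subword units
    pairs = get_stats(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
```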

3 Methodology

In this section, we detail our optimization and improvements based on the previous hybrid CTC-attention architecture. We show our improvements to encoder-decoder architecture and attention mechanism in section 3.1 and section 3.2.

3.1 Encoder-Decoder architecture

The authors of ESPnet [12] stacked several BiLSTM layers on top of a few convolutional layers. The outputs of the last BiLSTM layer serve as inputs to both the CTC and attention decoders, as shown in Fig. 1. Our major improvement is inserting a BiLSTM layer, used solely by the CTC branch, between the top shared encoder layer and the fully connected (FC) layer feeding the CTC output. The entire hybrid architecture is shown in Fig. 2.

According to our experiments in Section 4, setting \alpha in (8) to a smaller value makes the network perform better. However, a low weight raises a new problem: since a lower \alpha produces smaller gradients from the CTC loss during back-propagation, the shared encoder focuses more on the attention module than on the CTC module, which limits the performance of the CTC decoder. Considering this limitation, we introduce a dedicated BiLSTM layer feeding the CTC decoder, which compensates for the problem mentioned above.
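A minimal PyTorch sketch of this change is shown below, assuming the shared encoder already produces a sequence of hidden vectors; the layer sizes, class name, and vocabulary size are illustrative assumptions rather than the exact configuration.

```python
import torch.nn as nn

class CTCBranch(nn.Module):
    """CTC branch with its own BiLSTM on top of the shared encoder output."""
    def __init__(self, enc_dim=2048, hidden=1024, vocab_size=5000):
        super().__init__()
        # Dedicated BiLSTM seen only by the CTC branch (Section 3.1).
        self.blstm = nn.LSTM(enc_dim, hidden, batch_first=True,
                             bidirectional=True)
        self.fc = nn.Linear(2 * hidden, vocab_size + 1)  # +1 for the blank label

    def forward(self, enc_out):
        # enc_out: (batch, frames, enc_dim) from the shared encoder.
        h, _ = self.blstm(enc_out)
        return self.fc(h).log_softmax(dim=-1)  # fed to the CTC loss
```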

3.2 Attention smoothing

Inspired by [3], we use a location-based attention mechanism in our implementation. Specifically, the location-based attention energies are computed as:

e_{u,t} = w^T \tanh(W s_{u-1} + V h_t + U f_{u,t} + b)    (9)

where

f_u = F * a_{u-1}    (10)

and a_{u,t} = \exp(e_{u,t}) / \sum_{t'} \exp(e_{u,t'}). Here w, W, V, U and b are learnable parameters, and * denotes convolution with a learnable filter F.

In our speech recognition system, subwords are chosen as the modeling units, which require more sequence context information than character-based units. However, the attention score distribution is usually very sharp when computed with the above equations. Hence, we apply an attention smoothing mechanism instead, in which the attention weights are computed as

a_{u,t} = \sigma(e_{u,t}) / \sum_{t'} \sigma(e_{u,t'})    (11)

where \sigma is the logistic sigmoid function. This successfully smooths the attention score distribution and thus keeps more context information for subword-based decoding.
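A minimal sketch of the sigmoid-based normalization in (11), contrasted with the standard softmax in the preceding equations; the tensor shape is an assumption.

```python
import torch

def smoothed_attention(energies):
    """Sigmoid-based normalization of attention energies, as in Eq. (11).

    energies: (batch, frames) location-aware attention energies e_{u,t}.
    Returns weights that are flatter than a softmax distribution, keeping
    more frames in the context vector for subword decoding.
    """
    sig = torch.sigmoid(energies)
    return sig / sig.sum(dim=-1, keepdim=True)
```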

4 Experiments

4.1 Experimental Setup

We train and test our implementation on the LibriSpeech dataset [9]. Specifically, we use train-clean-100, train-clean-360 and train-other-500 as our training set and dev-clean as our validation set. For evaluation, we report word error rates (WERs) on the test-clean, test-other, dev-clean and dev-other subsets. We also adopt 3-fold speed perturbation (0.9x/1.0x/1.1x) for data augmentation. 80-dimensional Mel-filterbank features are generated using a sliding window of length 25 ms with a 10 ms stride, and feature extraction is performed with the Kaldi toolkit [10]. Subword units are extracted from all transcripts of the training data using the BPE algorithm, and the number of subword units is set to 5000.
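Feature extraction in our setup is done with Kaldi; a roughly equivalent sketch using torchaudio's Kaldi-compatible front end is shown below, where the file path is a placeholder and the exact Kaldi options are assumptions.

```python
import torchaudio
import torchaudio.compliance.kaldi as kaldi

# 80-dimensional Mel-filterbank features, 25 ms window, 10 ms stride,
# roughly matching the Kaldi configuration described above.
waveform, sample_rate = torchaudio.load("utt.wav")  # placeholder path
fbank = kaldi.fbank(waveform,
                    num_mel_bins=80,
                    frame_length=25.0,
                    frame_shift=10.0,
                    sample_frequency=sample_rate)
# fbank: (num_frames, 80)
```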

Fig. 2: Encoder and decoder architecture of our model

The encoder is a 2-layer CNN followed by a 7-layer BiLSTM, where each BiLSTM layer has 1024 cells per direction. In the CNN part, the input features are downsampled to 1/4 of the original frame rate through two max-pooling layers. The decoder consists of two branches: one is a one-layer BiLSTM followed by the CTC decoder, and the other is the attention decoder, a 2-layer LSTM with 1024 cells per layer.
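A minimal sketch of the convolutional front end described above, assuming PyTorch; the channel counts and kernel sizes are assumptions, with the two max-pooling layers providing the 1/4 downsampling in time.

```python
import torch.nn as nn

class ConvFrontEnd(nn.Module):
    """Two conv blocks; each max-pool halves the time axis (1/4 overall)."""
    def __init__(self, out_channels=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, out_channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )

    def forward(self, feats):
        # feats: (batch, frames, 80) -> add a channel axis for Conv2d.
        x = self.net(feats.unsqueeze(1))
        # (batch, channels, frames/4, 20) -> (batch, frames/4, channels * 20),
        # which is then fed to the BiLSTM stack.
        b, c, t, f = x.shape
        return x.permute(0, 2, 1, 3).reshape(b, t, c * f)
```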

The AdaDelta algorithm [11] with initial hyper-parameter eps = 1e-8 is used for optimization, and L2 regularization and gradient clipping are applied. We measure the accuracy on the validation set every 1000 iterations and decay eps by a factor of 0.1 whenever the validation accuracy drops. All experiments are performed on 4 Tesla P40 GPUs, each processing its own mini-batch.
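A rough sketch of this optimization setup, assuming PyTorch; the placeholder model, the weight-decay value used for L2 regularization, and the clipping norm are assumptions.

```python
import torch
import torch.nn as nn

model = nn.Linear(80, 5001)  # placeholder; the real encoder-decoder goes here

# AdaDelta with eps = 1e-8, L2 regularization via weight_decay, and
# gradient clipping; eps is scaled by 0.1 when validation accuracy drops.
optimizer = torch.optim.Adadelta(model.parameters(), eps=1e-8, weight_decay=1e-6)
best_val_acc = 0.0

def clip_and_step(loss, max_norm=5.0):
    optimizer.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()

def maybe_decay_eps(val_acc):
    """Called every 1000 iterations with the current validation accuracy."""
    global best_val_acc
    if val_acc < best_val_acc:
        for group in optimizer.param_groups:
            group['eps'] *= 0.1
    else:
        best_val_acc = val_acc
```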

Our language model is a two-layer LSTM with 1536 units per layer, trained on the text of 14,500 public-domain books that is commonly used as LM training material for LibriSpeech. The SGD algorithm is used for optimization, with an initial learning rate of 1.0 decayed by 0.9 every 2 epochs. For decoding, we use beam search with a beam size of 20.
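Below is a simplified sketch of beam-search decoding with RNN-LM shallow fusion, assuming per-step log-probability callables for the attention decoder and the LM; the LM weight, the scoring interface, and the omission of CTC rescoring are assumptions made for brevity.

```python
import math

def beam_search(step_fn, lm_fn, sos, eos, beam_size=20, lm_weight=0.3, max_len=200):
    """step_fn(prefix) -> {token: log p_att(token | prefix)}
    lm_fn(prefix)   -> {token: log p_lm(token | prefix)}"""
    beams = [([sos], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            am_scores = step_fn(prefix)
            lm_scores = lm_fn(prefix)
            for token, am_lp in am_scores.items():
                # Shallow fusion: acoustic score plus weighted LM score.
                new_score = score + am_lp + lm_weight * lm_scores.get(token, -math.inf)
                candidates.append((prefix + [token], new_score))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            (finished if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])
```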

4.2 Results

Fig. 3 shows the accuracy curve during training, from which we can see that the model converges after 35000 iterations. The perplexity of our trained RNN-LM is 50.4 on the training set and 46.9 on the test-clean subset. We conduct experiments with different numbers of BiLSTM layers in the encoder, with and without the additional BiLSTM layer on the CTC branch. The results are shown in Table 1, from which we make the following observations: both increasing the number of BiLSTM layers in the encoder and adding the BiLSTM layer on the CTC branch lead to better WERs. Moreover, the WER can be reduced by about 25% using our trained RNN-LM.

Fig. 3: Accuracy curves with the number of iterations on both the train set and the validation set during training

Fig. 4: WER performance as a function of alpha on test-clean subset
Encoder layers     | WER (no LM) | WER (RNN-LM)
5                  | 5.01        | 3.73
5 + CTC-BiLSTM     | 4.73        | 3.59
6                  | 4.82        | 3.64
6 + CTC-BiLSTM     | 4.57        | 3.43
7                  | 4.64        | 3.51
7 + CTC-BiLSTM     | 4.43        | 3.34
Table 1: Comparison of WERs on the test-clean subset under different structures
Model                  | test-clean | test-other | dev-clean | dev-other
Baidu DS2 [2] + LM     | 5.15       | 12.73      | -         | -
ESPnet [12] + LM       | 4.6        | 13.7       | 4.5       | 13.0
I-Attention [13] + LM  | 3.82       | 12.76      | 3.54      | 11.52
Ours, no LM            | 4.43       | 13.5       | 4.37      | 13.1
Ours + LM              | 3.34       | 10.54      | 3.15      | 9.98
Table 2: Performance of different networks on the LibriSpeech dataset

After that, we compare different weights \alpha between the CTC loss and the attention loss, varied in steps of 0.1. The results are shown in Fig. 4. Both a pure attention-based system and a pure CTC-based system produce inferior performance. The curve also shows that decreasing \alpha leads to better WERs in the hybrid system, which is consistent with the original motivation for using the CTC decoder: the CTC module mainly assists monotonic alignment and increases the convergence speed of training, while decoding in the hybrid system relies mainly on the attention decoder. As Fig. 4 shows, the best tuned \alpha is 0.1.

Finally, we compare our results with other state-of-the-art end-to-end systems reported on the LibriSpeech dataset in Table 2. The results show that our system achieves better WERs than other published end-to-end ASR models.

5 Conclusions

In summary, we explore a variety of structural improvements and optimization methods for the hybrid CTC-attention ASR system. By applying the CTC-branch BiLSTM, attention smoothing and several other techniques, our system achieves a word error rate (WER) of 4.43% without LM and 3.34% with an RNN-LM on the test-clean subset of the LibriSpeech corpus.

Future work will concentrate on optimizing both the decoder structure and the training method, such as fine-tuning the CTC decoder branch after training the shared encoder. Another direction is to apply this technique to other languages such as Mandarin, which contains many polyphonic words that must be handled during decoding.

References

  • [1] Lawrence R Rabiner and Biing-Hwang Juang, Fundamentals of speech recognition, vol. 14, PTR Prentice Hall Englewood Cliffs, 1993.
  • [2] Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, et al., “Deep speech: Scaling up end-to-end speech recognition,” arXiv preprint arXiv:1412.5567, 2014.
  • [3] Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio, “Attention-based models for speech recognition,” in Advances in neural information processing systems, 2015, pp. 577–585.
  • [4] Shinji Watanabe, Takaaki Hori, Suyoun Kim, John R Hershey, and Tomoki Hayashi, “Hybrid ctc/attention architecture for end-to-end speech recognition,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1240–1253, 2017.
  • [5] Suyoun Kim, Takaaki Hori, and Shinji Watanabe, “Joint ctc-attention based end-to-end speech recognition using multi-task learning,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 4835–4839.
  • [6] Alex Graves and Faustino Gomez, “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,” in International Conference on Machine Learning, 2006, pp. 369–376.
  • [7] William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2016, pp. 4960–4964.
  • [8] Rico Sennrich, Barry Haddow, and Alexandra Birch, “Neural machine translation of rare words with subword units,” arXiv preprint arXiv:1508.07909, 2015.
  • [9] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, “Librispeech: an asr corpus based on public domain audio books,” in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 5206–5210.
  • [10] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al., “The kaldi speech recognition toolkit,” in IEEE 2011 workshop on automatic speech recognition and understanding. IEEE Signal Processing Society, 2011, number EPFL-CONF-192584.
  • [11] Matthew D Zeiler, “Adadelta: an adaptive learning rate method,” arXiv preprint arXiv:1212.5701, 2012.
  • [12] Tomoki Hayashi, Shinji Watanabe, Suyoun Kim, Takaaki Hori, and John R. Hershey, “Espnet: end-to-end speech processing toolkit,” https://github.com/espnet/espnet/pull/407/commits/, 2018.
  • [13] Albert Zeyer, Kazuki Irie, Ralf Schlüter, and Hermann Ney, “Improved training of end-to-end attention models for speech recognition,” arXiv preprint arXiv:1805.03294, 2018.