Improving Gated Recurrent Unit Based Acoustic Modeling with Batch Normalization and Enlarged Context

Abstract

The use of future contextual information is typically shown to be helpful for acoustic modeling. Recently, we proposed an RNN model called minimal gated recurrent unit with input projection (mGRUIP), in which a context module based on temporal convolution is specifically designed to model the future context. This model, mGRUIP with context module (mGRUIP-Ctx), has been shown to utilize the future context effectively while keeping model latency and computation cost quite low.

In this paper, we continue to improve mGRUIP-Ctx with two revisions: applying batch normalization and enlarging the model context. Experimental results on two Mandarin ASR tasks (8400 hours and 60K hours) show that the revised mGRUIP-Ctx outperforms LSTM by a large margin (11% to 38% relative). It even performs slightly better than a strong BLSTM baseline on the 8400-hour task, with 33M fewer parameters and only 290 ms model latency.

Jie Li, Yahui Shan, Xiaorui Wang, Yan Li

Kwai, Beijing, P.R. China

School of Information and Electronics, Beijing Institute of Technology, Beijing, P.R. China

{lijie03, wangxiaorui, liyan}@kuaishou.com, 2120160735@bit.edu.cn

Index Terms: speech recognition, acoustic modeling, gated recurrent unit, batch normalization

1 Introduction

Making full use of future contextual information is typically shown to be beneficial for acoustic modeling. In the literature, there are a variety of methods to realize this idea for different model architectures. For feed-forward neural networks (FFNN), this context is usually provided by splicing a fixed set of future frames into the input representation [1]. The authors in [2, 3, 4] proposed the feedforward sequential memory network (FSMN), a standard FFNN equipped with learnable memory blocks in the hidden layers that encode long context information into a fixed-size representation. The time delay neural network (TDNN) [5, 6] is another FFNN architecture that has been shown to model long-range temporal dependencies effectively through temporal convolution over context.

As for unidirectional recurrent neural networks (RNN), future context is usually exploited through a delayed prediction of the output labels [7]. However, this method provides only quite limited modeling power over the future context [8]. Bidirectional RNNs instead process the data in the backward direction with a separate RNN layer [9, 10, 11]. Although the bidirectional versions have been shown to outperform the unidirectional ones by a large margin [12, 13], their latency is significantly larger, making them unsuitable for online speech recognition. To overcome this limitation, chunk-based training and decoding schemes such as context-sensitive-chunk (CSC) [14, 15] and latency-controlled (LC) BLSTM [12, 16] have been investigated. However, the model latency is still quite high; for example, the decoding latency of LC-BLSTM in [16] is about 600 ms. To overcome the shortcomings of the chunk-based methods, Peddinti et al. [8] proposed the use of temporal convolution, in the form of TDNN layers, for modeling the future temporal context while affording inference with frame-level increments. The proposed model, called TDNN-LSTM, is designed by interleaving temporal convolution (TDNN layers) with unidirectional long short-term memory (LSTM) [17, 18, 19, 20] layers. It was shown to outperform bidirectional LSTM on two automatic speech recognition (ASR) tasks while enabling online decoding with a maximum latency of 200 ms [8]. However, TDNN-LSTM’s ability to model the future context comes entirely from the TDNN part, whereas the LSTM itself is incapable of utilizing the future information effectively.

Recently, in [21], we proposed an RNN model called minimal gated recurrent unit with input projection (mGRUIP), in which the inserted input projection forms a bottleneck, and a context module based on temporal convolution is specifically designed on top of it to model the future context. This model, mGRUIP with context module (mGRUIP-Ctx for short in this paper), is capable of utilizing the future context effectively and directly, while keeping model latency and computation cost quite low.

In this work, we continue to improve the proposed mGRUIP-Ctx model. The revision is two-fold. First, we investigate how best to use batch normalization (BN) [22] in this model. The starting point is our observation that mGRUIP-Ctx does not perform very well when the training data contains a lot of noisy speech, for example when the original clean data is perturbed with noise and reverberation. This prompts us to use batch normalization to improve the convergence of the optimization process. In our previous work [21], batch normalization was only applied to the cell ReLU activation, to deal with numerical instabilities originating from the unbounded ReLU function. In this work we find it beneficial to apply BN to the update gate as well. In the literature, batch normalization has been applied to RNNs in different ways. In [23], the authors suggest applying it to input-to-hidden (ItoH) connections only, whereas in [24] the normalization step is extended to hidden-to-hidden (HtoH) transitions. Our finding differs slightly from both [23] and [24]: the best way to apply BN in mGRUIP-Ctx is a hybrid one, namely applying BN to both ItoH and HtoH connections for the cell ReLU activation, while applying it to ItoH only for the update gate. Experimental results on several ASR tasks clearly demonstrate that doing so speeds up optimization and improves generalization, especially when the training data is augmented with perturbation.

The second revision is to enlarge the model context. In our previous work [21], the context module in mGRUIP-Ctx was restricted to modeling only the future context, leaving the history to be modeled by the recurrent structure. In this work, we relax this restriction and allow the context module to model the future and history information simultaneously, which is empirically shown to be beneficial. Besides that, we also find that enlarging the order of future context can further improve the performance. It should be noted that this context extension adds only a small number of parameters, thanks to the small dimensionality of the input projection.

With these two revisions, the performance of mGRUIP-Ctx is improved significantly. On an 8400-hour Mandarin ASR task (1400 hours of original data with 6-fold augmentation), the revised mGRUIP-Ctx provides 6% to 10% relative CER reduction over the previous version in [21], demonstrating the effectiveness of the revisions. Compared to LSTM, the relative improvement is 18% to 37%, and the gain over TDNN-LSTM is about 5% to 12%. The model even outperforms a very strong BLSTM baseline, with far fewer parameters and only 290 ms decoding latency. Its superiority is further verified on a much larger Mandarin ASR task containing 60K hours of speech in total (10K hours of original data with 6-fold augmentation). On this task, the gain of mGRUIP-Ctx over LSTM and TDNN-LSTM is 11% to 20% and 3% to 11%, respectively. Compared with a much stronger baseline, TDNN-BLSTM, mGRUIP-Ctx shows only a 2% to 6% relative loss, while having far fewer parameters and much lower latency.

2 Prerequisites

2.1 mGRU

mGRU, short for minimal gated recurrent unit, is a revised version of the GRU model. It was proposed in [25, 26] and contains two modifications: removing the reset gate and replacing the hyperbolic tangent with a ReLU activation. The model is defined by the following equations (the layer index is omitted for simplicity):

z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)    (1)
\tilde{h}_t = \mathrm{ReLU}(\mathrm{BN}(W_h x_t) + \mathrm{BN}(U_h h_{t-1}) + b_h)    (2)
h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t    (3)

In particular, z_t is a vector corresponding to the update gate, whose activation \sigma(\cdot) is an element-wise logistic sigmoid function. h_t represents the output state vector for the current frame t, and \tilde{h}_t is the candidate state obtained with a ReLU function; BN(\cdot) denotes batch normalization and \odot element-wise multiplication. The parameters of the model are W_z, W_h (the ItoH connections), U_z, U_h (the HtoH weights), and the bias vectors b_z, b_h.
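
To make the recurrence concrete, the following is a minimal NumPy sketch of one mGRU step as given by equations (1)-(3). The row-vector weight shapes, the simplified batch-norm helper (no learned scale/shift or running statistics) and the parameter names are illustrative assumptions rather than the exact implementation used in the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def batch_norm(x, eps=1e-5):
    # Simplified batch normalization over the batch dimension
    # (no learned scale/shift, no running statistics).
    return (x - x.mean(axis=0, keepdims=True)) / np.sqrt(x.var(axis=0, keepdims=True) + eps)

def mgru_step(x_t, h_prev, p):
    """One mGRU step. x_t: (batch, in_dim), h_prev: (batch, cell_dim).
    p holds Wz, Wh (ItoH), Uz, Uh (HtoH) and the biases bz, bh."""
    # Eq. (1): update gate, no BN in this baseline formulation.
    z_t = sigmoid(x_t @ p["Wz"] + h_prev @ p["Uz"] + p["bz"])
    # Eq. (2): ReLU candidate state, BN on both the ItoH and HtoH parts.
    h_cand = np.maximum(0.0, batch_norm(x_t @ p["Wh"]) + batch_norm(h_prev @ p["Uh"]) + p["bh"])
    # Eq. (3): interpolate previous state and candidate with the gate.
    return z_t * h_prev + (1.0 - z_t) * h_cand
```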

2.2 mGRUIP

mGRUIP is obtained by inserting a linear input projection layer into mGRU, leading to the following update equations:

v_t = v^x_t + v^h_t = W_v x_t + U_v h_{t-1}    (4)
z_t = \sigma(W_z v_t + b_z)    (5)
\tilde{h}_t = \mathrm{ReLU}(W_h(\mathrm{BN}(v^x_t) + \mathrm{BN}(v^h_t)) + b_h)    (6)
h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t    (7)

where the current input vector x_t and the previous output state vector h_{t-1} are compressed into a lower-dimensional space by the weight matrices W_v and U_v respectively, giving v^x_t = W_v x_t and v^h_t = U_v h_{t-1}, which are added together to form the projected vector v_t. The update gate activation z_t and the candidate state vector \tilde{h}_t are then calculated from this projection.
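
As a rough illustration of the bottleneck, here is a similar sketch of one mGRUIP step following equations (4)-(7). The projection is kept as separate ItoH and HtoH parts so that BN can later be applied to them individually (Section 3.1); names and shapes are again illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bn(x, eps=1e-5):
    # Simplified batch normalization (no learned scale/shift).
    return (x - x.mean(axis=0, keepdims=True)) / np.sqrt(x.var(axis=0, keepdims=True) + eps)

def mgruip_step(x_t, h_prev, p):
    """One mGRUIP step. Wv/Uv project into the low-dimensional bottleneck;
    Wz/Wh map the bottleneck back to the cell dimension."""
    v_x = x_t @ p["Wv"]        # ItoH part of the projection
    v_h = h_prev @ p["Uv"]     # HtoH part of the projection
    v_t = v_x + v_h                                                    # eq. (4)
    z_t = sigmoid(v_t @ p["Wz"] + p["bz"])                             # eq. (5)
    h_cand = np.maximum(0.0, (bn(v_x) + bn(v_h)) @ p["Wh"] + p["bh"])  # eq. (6)
    return z_t * h_prev + (1.0 - z_t) * h_cand                         # eq. (7)
```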

2.3 mGRUIP-Ctx

The input projection forms a bottleneck in mGRUIP, on which we design a context module, temporal convolution, to effectively model the future context [21]. For the l-th layer (l > 1), equation (4) now becomes:

v^l_t = W^l_v s^l_t + U^l_v h^l_{t-1}    (8)

where s^l_t is the concatenation of the current input vector x^l_t and the output state vectors of the preceding layer from several future frames:

s^l_t = [x^l_t, h^{l-1}_{t+s}, h^{l-1}_{t+2s}, \ldots, h^{l-1}_{t+Ks}]    (9)

In particular, x^l_t is the input vector of layer l (x^l_t is actually h^{l-1}_t, since the input of layer l is the output of layer l-1 at the current frame), and h^{l-1}_{t+ns} is the output state vector of layer l-1 at frame t+ns. Here s is the step stride, K is the order of the future context, and n = 1, ..., K is the loop index.
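
The splicing in equation (9) is simply an indexed gather over future frames of the lower layer's output. The following NumPy sketch illustrates the indexing; the array shapes, the clamping of indices at the utterance end and the function name are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np

def splice_future(x_l, h_lower, t, stride, order):
    """Build s_t for layer l at frame t (eq. (9)): concatenate the current
    input x_l[t] with `order` future frames of the lower layer's output,
    taken every `stride` frames.
    x_l:     (T, d_in)  inputs to layer l (in practice the lower layer's outputs)
    h_lower: (T, d_out) output states of layer l-1
    """
    T = h_lower.shape[0]
    pieces = [x_l[t]]
    for n in range(1, order + 1):
        idx = min(t + n * stride, T - 1)   # clamp at the utterance end
        pieces.append(h_lower[idx])
    return np.concatenate(pieces)

# Example: order K = 3 and stride s = 3 gather frames t+3, t+6 and t+9.
```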

3 Proposed Revisions

3.1 Applying Batch Normalization

Our first refinement is to determine the best way to apply BN in the proposed models. There are two possible locations for BN in the structure: the update gate and the cell activation.

3.1.1 BN for the update gate

Three different BN methods for the update gate are evaluated (all three variants are sketched in code after this list):

  • No BN
    Just as defined by equations (1) and (5) in Section 2 for mGRU and mGRUIP(-Ctx), respectively.

  • BN on ItoH only
    For mGRU, equation (1) now becomes:

    z_t = \sigma(\mathrm{BN}(W_z x_t) + U_z h_{t-1} + b_z)    (10)

    For mGRUIP(-Ctx), equation (5) now becomes:

    z_t = \sigma(W_z(\mathrm{BN}(v^x_t) + v^h_t) + b_z)    (11)

    where v^x_t and v^h_t are the projected vectors obtained from x_t (or from s^l_t for mGRUIP-Ctx) and h_{t-1}, respectively.

  • BN on ItoH and HtoH
    For mGRU, equation (1) now becomes:

    z_t = \sigma(\mathrm{BN}(W_z x_t) + \mathrm{BN}(U_z h_{t-1}) + b_z)    (12)

    For mGRUIP(-Ctx), equation (5) now becomes:

    z_t = \sigma(W_z(\mathrm{BN}(v^x_t) + \mathrm{BN}(v^h_t)) + b_z)    (13)
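
To make the three placements concrete, here is a minimal NumPy sketch of the update-gate computation under each of them for the projected (mGRUIP-style) case of equations (5), (11) and (13); the simplified batch-norm helper, the weight shapes and the mode names are illustrative assumptions, not the exact implementation used in the experiments.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bn(x, eps=1e-5):
    # Simplified batch normalization (no learned scale/shift).
    return (x - x.mean(axis=0, keepdims=True)) / np.sqrt(x.var(axis=0, keepdims=True) + eps)

def update_gate(v_x, v_h, Wz, bz, mode="itoh_only"):
    """Update-gate activation under the three BN placements compared above.
    v_x, v_h: projected ItoH / HtoH contributions, shape (batch, proj_dim)."""
    if mode == "no_bn":            # eq. (5): no normalization
        pre = v_x + v_h
    elif mode == "itoh_only":      # eq. (11): normalize the ItoH part only
        pre = bn(v_x) + v_h
    elif mode == "itoh_htoh":      # eq. (13): normalize both parts
        pre = bn(v_x) + bn(v_h)
    else:
        raise ValueError(mode)
    return sigmoid(pre @ Wz + bz)
```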

3.1.2 BN for the cell activation

For the cell activation, using no BN is likely to cause numerical instabilities with the unbounded ReLU, so only two methods are tried (sketched in code after this list):

  • BN on ItoH only
    We first try this method on the mGRU model. Equation (2) now becomes:

    \tilde{h}_t = \mathrm{ReLU}(\mathrm{BN}(W_h x_t) + U_h h_{t-1} + b_h)    (14)

    This degrades the performance significantly (as shown in Table 1), so this variant is not tried for mGRUIP(-Ctx).

  • BN on ItoH and HtoH
    Just as defined by equations (2) and (6) in Section 2 for mGRU and mGRUIP(-Ctx), respectively.
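
For completeness, the two cell-activation variants can be sketched in the same style, again on the projected ItoH/HtoH contributions; the helper, shapes and mode flag are illustrative assumptions.

```python
import numpy as np

def bn(x, eps=1e-5):
    # Simplified batch normalization (no learned scale/shift).
    return (x - x.mean(axis=0, keepdims=True)) / np.sqrt(x.var(axis=0, keepdims=True) + eps)

def candidate_state(v_x, v_h, Wh, bh, mode="itoh_htoh"):
    """ReLU candidate state under the two BN placements compared above."""
    if mode == "itoh_only":        # ItoH only (mGRUIP analogue of eq. (14))
        pre = bn(v_x) + v_h
    else:                          # ItoH and HtoH, as in eq. (6)
        pre = bn(v_x) + bn(v_h)
    return np.maximum(0.0, pre @ Wh + bh)
```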

3.2 Enlarging Model Context

Our second revision is to enlarge the model context: the context module is used to model not only the future context but also the history information. Equation (9) now becomes:

s^l_t = [x^l_t, h^{l-1}_{t-K_1 s_1}, \ldots, h^{l-1}_{t-s_1}, h^{l-1}_{t+s_2}, \ldots, h^{l-1}_{t+K_2 s_2}]    (15)

where s_1 and s_2 are the step strides for the history and future context, respectively, and K_1 and K_2 are the orders of the history and future context.
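
A sketch of the enlarged splicing, extending the earlier future-only version with history offsets; the shapes, the edge handling by clamping and the ordering of the concatenation are illustrative assumptions.

```python
import numpy as np

def splice_context(x_l, h_lower, t, s1, K1, s2, K2):
    """Build s_t with both history and future context: K1 past frames with
    stride s1 and K2 future frames with stride s2, plus the current input.
    x_l: (T, d_in) inputs to layer l; h_lower: (T, d_out) outputs of layer l-1."""
    T = h_lower.shape[0]
    past = [h_lower[max(t - n * s1, 0)] for n in range(K1, 0, -1)]        # clamp at start
    future = [h_lower[min(t + n * s2, T - 1)] for n in range(1, K2 + 1)]  # clamp at end
    return np.concatenate([x_l[t]] + past + future)

# Setting K1 = 0 recovers the future-only splicing of eq. (9).
```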

Model     BN on gate (ItoH / HtoH)   BN on cell (ItoH / HtoH)   SWB    CHM    Total
mGRU      N / N                      Y / Y                      10.2   20.6   15.5
mGRU      Y / Y                      Y / Y                      diverged
mGRU      Y / N                      Y / Y                      10.1   19.9   15.1
mGRU      Y / N                      Y / N                      11.1   22.1   16.7
mGRUIP    N / N                      Y / Y                      9.8    19.0   14.5
mGRUIP    Y / Y                      Y / Y                      19.8   27.1   23.5
mGRUIP    Y / N                      Y / Y                      9.5    18.6   14.2

Table 1: Comparison of BN methods with mGRU and mGRUIP on the Switchboard task (WER, %).

4 Experimental Settings

This section introduces the three ASR tasks used in this work. All models in this paper are trained with the LF-MMI objective function computed on 33 Hz frame-rate outputs [27].

4.1 Switchboard ASR Task

The training data is the 309-hour Switchboard-I corpus. Evaluation is performed in terms of WER on the full Switchboard Hub5'00 test set, which consists of two subsets: Switchboard (SWB) and CallHome (CHM). The experimental setup follows [27]; WER results are reported after 4-gram LM rescoring of lattices generated with a trigram LM. Please refer to [27] for more details.

4.2 Medium-Scale Mandarin ASR Task

The second task is an internal medium-scale Mandarin ASR task, which actually comprises two sub-tasks. The first uses 1400 hours of original mobile recording data. The second adds 6-fold data augmentation using speed perturbation (3x) [28] and noise/reverberation perturbation (2x) [29], resulting in 8400 hours of training data in total.

4.3 Large-Scale Mandarin ASR Task

The large-scale Mandarin ASR task contains 10K hours of original data, augmented 6-fold to give 60K hours of training data in total. Performance on the Mandarin ASR tasks is evaluated on five publicly available test sets, including three clean ones and two noisy ones. The clean sets are:

  • AiShell_dev and AiShell_test: the development and test sets of the released AiShell-1 corpus [30], containing 14326 and 7176 utterances, respectively.

  • THCHS-30_Clean: the clean test set of the THCHS-30 database [31], containing 2496 utterances.

The two noisy test sets are THCHS-30_Car and THCHS-30_Cafe, which are versions of THCHS-30_Clean corrupted by car and cafeteria noise, respectively, at a noise level of 0 dB.

5 Experimental Results

5.1 Applying Batch Normalization

5.1.1 Switchboard ASR Task

On this task, we compare different BN methods for two models, mGRU and mGRUIP; the results are shown in Table 1. Both models contain 5 layers, and each layer consists of 1024 cells. The input projection layer of mGRUIP has 512 units.

Several observations can be made from Table 1. Firstly, comparing the first two lines of each model's results, we can see that applying BN to both ItoH and HtoH for the update gate is quite harmful; the training of the mGRU model even diverges. We believe the reason is as follows. The update gate is controlled by a sigmoid function and is expected to learn to open or close at the right time. BN on ItoH and HtoH keeps the input away from the saturated regime of this nonlinearity. This may be helpful when the sigmoid serves as a hidden-node activation; however, when it is used to control a gate, it is harmful, since it leaves the gate half-closed or half-opened, which is contrary to the purpose of gating. To verify this, we investigate the evolution of the update-gate activation during recognition. A speech segment is chosen from the test set (utterance id en_4170-B_064608-064704, with the text content "That is quite a difference"), and the features are fed to the three mGRUIP models trained with the different BN methods. The activation of the update gate averaged over the cells of the 5th layer is shown for each frame in Figure 1. It is very clear that applying BN to both ItoH and HtoH for the update gate causes the gate activation to fluctuate around 0.5, meaning the gate is unable to effectively control the trade-off between the history h_{t-1} and the candidate state \tilde{h}_t. Another interesting finding is that the average gate activations of the No BN and BN-on-ItoH models are almost always greater than 0.5. We attribute this to the fact that the speech signal is a sequence that evolves rather slowly, so the past history is virtually always helpful.
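
The gate statistic examined above is just the per-frame mean of the update-gate activations over the cells of one layer. A small sketch of this diagnostic is given below; it assumes the activations have already been collected into a (T, num_cells) array during a forward pass, and the 0.1 band around 0.5 is an arbitrary illustrative threshold.

```python
import numpy as np

def gate_activation_curve(z):
    """z: (T, num_cells) update-gate activations of one layer for one utterance.
    Returns the per-frame mean activation, i.e. the quantity plotted in Figure 1."""
    return z.mean(axis=1)

def summarize(curve):
    # A curve hovering near 0.5 indicates a gate that never opens or closes
    # decisively; values above 0.5 indicate a bias towards keeping the history.
    return {
        "mean": float(curve.mean()),
        "frac_above_0.5": float((curve > 0.5).mean()),
        "frac_near_0.5": float((np.abs(curve - 0.5) < 0.1).mean()),
    }
```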

Model          Context setting (layers 2-5)   Latency (ms)   dev    test   Clean   Car     Cafe
mGRUIP-Ctx-A   0;   0;   0;   0;              170            4.61   5.56   10.34   10.67   41.89
mGRUIP-Ctx-B   ;    ;    ;    ;               200            4.49   5.46   10.11   10.46   39.79
mGRUIP-Ctx-C   ;    ;    ;    ;               200            4.49   5.50   10.08   10.45   38.41
mGRUIP-Ctx-D   ;    ;    ;    ;               290            4.47   5.47   9.95    10.36   38.76

Table 2: Performance (CER, %) of mGRUIP-Ctx with various settings of the context module on the 1400h Mandarin ASR task; "dev" and "test" are the AiShell sets, "Clean", "Car" and "Cafe" the THCHS-30 sets.

The second observation from Table 1 is that we should apply BN to both ItoH and HtoH for the cell ReLU activation (comparing the last two lines of mGRU's results), possibly because applying BN this way provides the best numerical stability for the ReLU. Finally, we can conclude from Table 1 that the best way to apply BN in mGRU-related models is a hybrid one: apply BN to ItoH only for the update gate, and to both ItoH and HtoH for the cell ReLU activation.

Model        Gate BN   dev    test   Clean   Car     Cafe
1400 Hours Original Data
TDNN-LSTM    -         4.81   5.98   10.97   11.38   44.20
mGRUIP-Ctx   No [21]   4.66   5.71   10.38   10.77   40.26
mGRUIP-Ctx   ItoH      4.61   5.56   10.34   10.67   41.89
8400 Hours Augmented Data
TDNN-LSTM    -         4.50   5.42   10.11   10.30   23.86
mGRUIP-Ctx   No        4.45   5.42   10.24   10.34   23.43
mGRUIP-Ctx   ItoH      4.33   5.26   10.00   10.02   22.35

Table 3: Performance (CER, %) of different models on the medium-scale Mandarin ASR task; "dev" and "test" are the AiShell sets, "Clean", "Car" and "Cafe" the THCHS-30 sets.

5.1.2 Medium-Scale Mandarin ASR Task

Following the findings in Section 5.1.1, we apply the optimal BN method to mGRUIP-Ctx on the medium-scale Mandarin ASR task; the results are shown in Table 3. The baseline is the TDNN-LSTM model [8], which interleaves 7 TDNN layers with 3 LSTM layers. The mGRUIP-Ctx model contains 5 hidden layers, each with 2560 cells and a 256-dimensional input projection layer. The settings of context order and step stride for each layer are the same as in our previous work [21].

According to Table 3, with the 1400 hours of original training data, mGRUIP-Ctx with no BN on the update gate gives a 3% to 5% relative gain over TDNN-LSTM. However, this gain disappears after the training data is augmented with perturbation, which we attribute to the different difficulties of the two sub-tasks: model learning with perturbed data is harder than with the original data. Once BN is applied correctly, mGRUIP-Ctx performs better than TDNN-LSTM on the harder sub-task as well, demonstrating the effectiveness of the BN method. Moreover, the training curves reveal that the optimization process becomes much more stable (not shown in this paper).

5.2 Enlarging Model Context

5.2.1 Medium-Scale Mandarin ASR Task

Our second revision is to enlarge the model context. We compare the performance of mGRUIP-Ctx with different settings of the context orders K_1, K_2 and step strides s_1, s_2 (with BN applied in the optimal way described above). For fast experimentation, all models are trained on the 1400 hours of original data, and the results are reported in Table 2. The context setting of each layer is given in the format "history; future".

From Table 2, comparing mGRUIP-Ctx-A and mGRUIP-Ctx-B, it can be seen that allowing the context module to model the history information as well is quite beneficial (mGRUIP-Ctx-A is exactly the model used in Section 5.1.2). In addition, increasing the order of the future context further improves the performance, at the cost of additional model latency (mGRUIP-Ctx-D vs. mGRUIP-Ctx-B).

Next, the best model, mGRUIP-Ctx-D, is compared with several baselines, including BLSTM, TDNN-LSTM and LSTM, on the 8400-hour data-augmented task; the results are presented in Table 4. BLSTM and LSTM both contain 5 hidden layers. Each layer of the LSTM and each directional sub-layer of the BLSTM has 1024 cells and 512 linear projection units. TDNN-LSTM is the model used in Section 5.1.2.

Model          Latency (ms)   dev    test   Clean   Car     Cafe
BLSTM          2020           4.24   5.05   9.79    9.87    22.82
TDNN-LSTM      210            4.50   5.42   10.11   10.30   23.86
LSTM           70             5.07   6.23   12.06   11.82   33.67
mGRUIP-Ctx-D   290            4.13   5.03   9.56    9.71    20.94
CERR (%)       -              18.5   19.3   20.7    17.9    37.8

Table 4: Performance (CER, %) of different models on the 8400h Mandarin ASR task; CERR is the relative CER reduction of mGRUIP-Ctx-D over LSTM.

The CERR row in Table 4 gives the relative gain of mGRUIP-Ctx-D over LSTM, ranging from 18% to 38% across the test sets, indicating that the proposed model performs significantly better than LSTM. Compared with TDNN-LSTM, mGRUIP-Ctx-D gives a 5% to 12% relative CER reduction with 12.4M fewer parameters (22.4M vs. 34.8M) and 80 ms additional latency (290 ms vs. 210 ms). mGRUIP-Ctx-D even outperforms BLSTM, with much lower model latency (290 ms vs. 2020 ms) and far fewer parameters (22.4M vs. 55.3M). Compared to the performance of mGRUIP-Ctx before revision (the next-to-last line of Table 3), mGRUIP-Ctx-D shows a 6% to 10% relative gain, which demonstrates the effectiveness of the two proposed revisions.

5.2.2 Large-Scale Mandarin ASR Task

Finally, the superiority of mGRUIP-Ctx-D is verified on a much larger Mandarin ASR task containing 60K hours of speech data. We train a TDNN-BLSTM model as an additional baseline, which is stronger than BLSTM and contains 3 TDNN layers and 5 BLSTM layers. All other models have the same structure as in Section 5.2.1. The results are shown in Table 5.

Model          dev    test   Clean   Car     Cafe
TDNN-BLSTM     3.55   4.21   8.72    8.85    18.73
TDNN-LSTM      3.90   4.68   9.55    9.65    21.53
LSTM           4.32   5.26   10.30   10.33   23.84
mGRUIP-Ctx-D   3.71   4.52   9.09    9.16    19.06
CERR (%)       14.1   14.1   11.8    11.3    20.1

Table 5: Performance (CER, %) of different models on the 60K-hour Mandarin ASR task; CERR is the relative CER reduction of mGRUIP-Ctx-D over LSTM.

According to Table 5, the relative improvement of mGRUIP-Ctx-D over LSTM is 11% to 20%, and the gain over TDNN-LSTM is 3% to 11%. mGRUIP-Ctx-D performs slightly worse (2% to 6% relative) than the strong TDNN-BLSTM baseline, but has two advantages: far fewer parameters and much lower latency.

6 Conclusions

In this paper, we improve our previously proposed model mGRUIP-Ctx with two revisions: applying batch normalization in a suitable way and enlarging the model context. With these revisions, mGRUIP-Ctx outperforms LSTM by a large margin. It even performs slightly better than a strong BLSTM on one task, with far fewer parameters and much lower latency.

References

  • [1] G. E. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition,” IEEE Transactions on Audio Speech & Language Processing, vol. 20, no. 1, pp. 30–42, 2012.
  • [2] S. Zhang, C. Liu, H. Jiang, S. Wei, L. Dai, and Y. Hu, “Feedforward sequential memory networks: A new structure to learn long-term dependency,” Computer Science, 2015.
  • [3] S. Zhang, H. Jiang, S. Xiong, S. Wei, and L. R. Dai, “Compact feedforward sequential memory networks for large vocabulary continuous speech recognition,” in INTERSPEECH, 2016, pp. 3389–3393.
  • [4] S. Zhang, M. Lei, Z. Yan, and L. Dai, “Deep-fsmn for large vocabulary continuous speech recognition,” arXiv preprint arXiv:1803.05030, 2018.
  • [5] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang, “Phoneme recognition using time-delay neural networks,” Readings in Speech Recognition, vol. 1, no. 2, pp. 393–404, 1990.
  • [6] V. Peddinti, D. Povey, and S. Khudanpur, “A time delay neural network architecture for efficient modeling of long temporal contexts,” in INTERSPEECH, 2015.
  • [7] H. Sak, A. Senior, and F. Beaufays, “Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition,” Computer Science, pp. 338–342, 2014.
  • [8] V. Peddinti, Y. Wang, D. Povey, and S. Khudanpur, “Low latency acoustic modeling using temporal convolution and lstms,” IEEE Signal Processing Letters, vol. PP, no. 99, pp. 1–1, 2017.
  • [9] M. Schuster and K. K. Paliwal, Bidirectional recurrent neural networks.    IEEE Press, 1997.
  • [10] A. Graves, S. Fernández, and J. Schmidhuber, Bidirectional LSTM Networks for Improved Phoneme Classification and Recognition.    Springer Berlin Heidelberg, 2005.
  • [11] A. Graves, N. Jaitly, and A. R. Mohamed, “Hybrid speech recognition with deep bidirectional lstm,” in Automatic Speech Recognition and Understanding, 2014, pp. 273–278.
  • [12] Y. Zhang, G. Chen, D. Yu, K. Yao, S. Khudanpur, and J. Glass, “Highway long short-term memory rnns for distant speech recognition,” Computer Science, pp. 5755–5759, 2015.
  • [13] A. Zeyer, R. Schlüter, and H. Ney, “Towards online-recognition with deep bidirectional lstm acoustic models,” in INTERSPEECH, 2016, pp. 3424–3428.
  • [14] K. Chen and Q. Huo, Training deep bidirectional LSTM acoustic model for LVCSR by a context-sensitive-chunk BPTT approach.    IEEE Press, 2016.
  • [15] K. Chen, Z. J. Yan, and Q. Huo, “A context-sensitive-chunk bptt approach to training deep lstm/blstm recurrent neural networks for offline handwriting recognition,” in International Conference on Document Analysis and Recognition, 2016, pp. 411–415.
  • [16] S. Xue and Z. Yan, “Improving latency-controlled blstm acoustic models for online speech recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2017, pp. 5340–5344.
  • [17] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [18] F. A. Gers, J. Schmidhuber, and F. Cummins, Learning to Forget: Continual Prediction with LSTM.    Istituto Dalle Molle di Studi sull’Intelligenza Artificiale, 1999.
  • [19] F. A. Gers and J. Schmidhuber, “Recurrent nets that time and count,” in Ieee-Inns-Enns International Joint Conference on Neural Networks, 2000, pp. 189–194 vol.3.
  • [20] A. Graves and J. Schmidhuber, “Framewise phoneme classification with bidirectional lstm and other neural network architectures,” Neural Netw, vol. 18, no. 5-6, p. 602, 2005.
  • [21] J. Li, X. Wang, Y. Zhao, and Y. Li, “Gated recurrent unit based acoustic modeling with future context,” arXiv preprint arXiv:1805.07024, 2018.
  • [22] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
  • [23] C. Laurent, G. Pereyra, P. Brakel, Y. Zhang, and Y. Bengio, “Batch normalized recurrent neural networks,” in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on.    IEEE, 2016, pp. 2657–2661.
  • [24] T. Cooijmans, N. Ballas, C. Laurent, Ç. Gülçehre, and A. Courville, “Recurrent batch normalization,” arXiv preprint arXiv:1603.09025, 2016.
  • [25] M. Ravanelli, P. Brakel, M. Omologo, and Y. Bengio, “Improving speech recognition by revising gated recurrent units,” INTERSPEECH, pp. 1308–1312, 2017.
  • [26] ——, “Light gated recurrent units for speech recognition,” IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 2, no. 2, pp. 92–102, 2018.
  • [27] D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur, “Purely sequence-trained neural networks for asr based on lattice-free mmi,” in INTERSPEECH, 2016, pp. 2751–2755.
  • [28] T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, “Audio augmentation for speech recognition,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
  • [29] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur, “A study on data augmentation of reverberant speech for robust speech recognition,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on.    IEEE, 2017, pp. 5220–5224.
  • [30] H. Bu, J. Du, X. Na, B. Wu, and H. Zheng, “Aishell-1: An open-source mandarin speech corpus and a speech recognition baseline,” 2017.
  • [31] D. Wang, X. Zhang, and Z. Zhang, “THCHS-30: A free Chinese speech corpus,” 2015. [Online]. Available: http://arxiv.org/abs/1512.01882