Improving Gated Recurrent Unit Based Acoustic Modeling
with Batch Normalization and Enlarged Context
Abstract
The use of future contextual information is typically shown to be helpful for acoustic modeling. Recently, we proposed an RNN model called minimal gated recurrent unit with input projection (mGRUIP), in which a context module, namely temporal convolution, is specifically designed to model the future context. This model, mGRUIP with context module (mGRUIP-Ctx), has been shown to be able to utilize the future context effectively, meanwhile with quite low model latency and computation cost.
In this paper, we continue to improve mGRUIP-Ctx with two revisions: applying BN methods and enlarging the model context. Experimental results on two Mandarin ASR tasks (8400 hours and 60K hours) show that the revised mGRUIP-Ctx outperforms LSTM by a large margin (11% to 38% relative). It even performs slightly better than a strong BLSTM on the 8400-hour task, with 33M fewer parameters and just 290 ms model latency.
Jie Li, Yahui Shan, Xiaorui Wang, Yan Li
Kwai, Beijing, P.R. China
School of Information and Electronics, Beijing Institute of Technology, Beijing, P.R. China
{lijie03, wangxiaorui, liyan}@kuaishou.com, 2120160735@bit.edu.cn
Index Terms: speech recognition, acoustic modeling, gated recurrent unit, batch normalization
1 Introduction
Making full use of future contextual information is typically shown to be beneficial for acoustic modeling. In the literature, there are a variety of methods to realize this idea for different model architectures. For feedforward neural networks (FFNN), this context is usually provided by splicing a fixed set of future frames into the input representation [1]. The authors in [2, 3, 4] proposed a model called feedforward sequential memory networks (FSMN), which is a standard FFNN equipped with learnable memory blocks in the hidden layers to encode long context information into a fixed-size representation. The time delay neural network (TDNN) [5, 6] is another FFNN architecture which has been shown to be effective in modeling long-range dependencies through temporal convolution over context.
As for unidirectional recurrent neural networks (RNN), this is usually accomplished using a delayed prediction of the output labels [7]. However, this method provides only quite limited modeling power for future context [8]. For bidirectional RNNs, it is accomplished by processing the data in the backward direction using a separate RNN layer [9, 10, 11]. Although the bidirectional versions have been shown to outperform the unidirectional ones by a large margin [12, 13], the latency of bidirectional models is significantly larger, making them unsuitable for online speech recognition. To overcome this limitation, chunk-based training and decoding schemes such as context-sensitive-chunk (CSC) [14, 15] and latency-controlled (LC) BLSTM [12, 16] have been investigated. However, the model latency is still quite high; for example, the decoding latency of LC-BLSTM in [16] is about 600 ms. To overcome the shortcomings of the chunk-based methods, Peddinti et al. [8] proposed the use of temporal convolution, in the form of TDNN layers, for modeling the future temporal context while affording inference with frame-level increments. The proposed model, called TDNN-LSTM, is designed by interleaving temporal convolution (TDNN layers) with unidirectional long short-term memory (LSTM) [17, 18, 19, 20] layers. This model was shown to outperform bidirectional LSTM on two automatic speech recognition (ASR) tasks, while enabling online decoding with a maximum latency of 200 ms [8]. However, TDNN-LSTM's ability to model the future context comes from the TDNN part, whereas the LSTM itself is incapable of utilizing the future information effectively.
Recently, in [21], we proposed an RNN model called minimal gated recurrent unit with input projection (mGRUIP), in which the inserted input projection forms a bottleneck, and a context module performing temporal convolution is designed on it to model the future context. This model, mGRUIP with context module (mGRUIP-Ctx for short in this paper), is able to utilize the future context effectively and directly, meanwhile with quite low model latency and computation cost.
In this work, we continue to improve the proposed model mGRUIP-Ctx. The revision is twofold. First, we investigate how to use batch normalization (BN) [22] in this model. The starting point is that we found the proposed mGRUIP-Ctx model does not perform well if the training data contains a lot of noisy speech, for example, when the original clean data is perturbed with noise and reverberation. This prompts us to use batch normalization to improve the convergence of the optimization process. In our previous work [21], batch normalization was only used on the cell ReLU activation to deal with numerical instabilities originating from the unbounded ReLU function. In this work we find it is also beneficial to apply BN to the update gate. In the literature, batch normalization has been applied to RNNs in different ways. In [23], the authors suggest applying it to input-to-hidden (ItoH) connections only, whereas in [24] the normalization step is extended to hidden-to-hidden (HtoH) transitions. Our finding in this work differs slightly from [23] and [24]. It is shown that the best way to apply BN in mGRUIP-Ctx is a hybrid one, that is, applying BN to both ItoH and HtoH connections for the cell ReLU activation, while for the update gate, applying BN to ItoH only. Experimental results on several ASR tasks clearly demonstrate that doing so can speed up optimization and improve generalization, especially when the training data is augmented with perturbation.
The second revision is to enlarge the model context. In our previous work [21], the context module in mGRUIP-Ctx was restricted to modeling only the future context, leaving the history to be modeled by the recurrent structure. In this work, we remove this restriction and allow the context module to model the future and history information simultaneously. It is empirically shown that this is beneficial to performance. Besides that, we also find that enlarging the order of future context can further improve performance. It should be noted that this context extension adds very few parameters, thanks to the small dimensionality of the input projection.
With these two revisions, mGRUIP-Ctx's performance is improved significantly. On an 8400-hour Mandarin ASR task (1400 hours of original data with 6-fold augmentation), the revised mGRUIP-Ctx provides 6% to 10% relative CER reduction over the previous version in [21], demonstrating the effectiveness of the revisions. Compared to LSTM, the relative improvement is 18% to 37%, and the gain over TDNN-LSTM is about 5% to 12%. This model even outperforms a very strong baseline, BLSTM, with far fewer parameters and only 290 ms decoding latency. The proposed model's superiority is further verified on a much larger Mandarin ASR task, which contains 60K hours of speech in total (10K hours of original data with 6-fold augmentation). On this task, the gain of mGRUIP-Ctx over LSTM and TDNN-LSTM is 11% to 20% and 3% to 11%, respectively. Compared with a much stronger baseline, TDNN-BLSTM, mGRUIP-Ctx shows only 2% to 6% relative loss, with the advantages of far fewer parameters and much lower latency.
2 Prerequisites
2.1 mGRU
mGRU, short for minimal gated recurrent unit, is a revised version of the GRU model. It was proposed in [25, 26] and contains two modifications: removing the reset gate and replacing the hyperbolic tangent with the ReLU activation. The model is defined by the following equations (the layer index has been omitted for simplicity):

z_t = σ(W_z x_t + U_z h_{t−1} + b_z)    (1)
h̃_t = ReLU(BN(W_h x_t) + BN(U_h h_{t−1}))    (2)
h_t = z_t ⊙ h_{t−1} + (1 − z_t) ⊙ h̃_t    (3)

In particular, z_t is a vector corresponding to the update gate, of which the activation is an element-wise logistic sigmoid function σ. h_t represents the output state vector for the current frame t, and h̃_t is the candidate state obtained with a ReLU function. BN(·) denotes batch normalization [22]. The parameters of the model are W_z, W_h (the ItoH connections), U_z, U_h (the HtoH weights), and the bias vector b_z.
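As a concrete illustration, one mGRU frame update can be sketched in NumPy. The sizes and the random parameters are illustrative only, and BN is omitted for clarity:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_cell = 4, 8  # illustrative sizes

# ItoH (W_*) and HtoH (U_*) parameters, matching Eqs. (1)-(3)
W_z = rng.standard_normal((n_cell, n_in)); U_z = rng.standard_normal((n_cell, n_cell))
W_h = rng.standard_normal((n_cell, n_in)); U_h = rng.standard_normal((n_cell, n_cell))
b_z = np.zeros(n_cell)

def mgru_step(x_t, h_prev):
    """One mGRU frame: sigmoid update gate, ReLU candidate, interpolation."""
    z_t = 1.0 / (1.0 + np.exp(-(W_z @ x_t + U_z @ h_prev + b_z)))  # Eq. (1)
    h_cand = np.maximum(0.0, W_h @ x_t + U_h @ h_prev)             # Eq. (2), BN omitted
    return z_t * h_prev + (1.0 - z_t) * h_cand                     # Eq. (3)

h = np.zeros(n_cell)
for t in range(5):
    h = mgru_step(rng.standard_normal(n_in), h)
```

Note how the update gate z_t interpolates between the previous state and the new candidate, which is the behavior examined in Section 5.1.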
2.2 mGRUIP
mGRUIP is obtained by inserting a linear input projection layer into mGRU, leading to the following update equations:

v_t = v_t^x + v_t^h = P_x x_t + P_h h_{t−1}    (4)
z_t = σ(W_z v_t + b_z)    (5)
h̃_t = ReLU(W_h (BN(v_t^x) + BN(v_t^h)))    (6)
h_t = z_t ⊙ h_{t−1} + (1 − z_t) ⊙ h̃_t    (7)

where the current input vector x_t and the previous output state vector h_{t−1} are compressed into a lower-dimensional space by the weight matrices P_x and P_h respectively, and added together to get a projected vector v_t. The update gate activation z_t and the candidate state vector h̃_t are then calculated from it.
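Under the same illustrative conventions (projection matrices named P_x and P_h here, BN omitted, sizes ours), the bottleneck version is a small change to the previous sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_cell, n_proj = 8, 8, 4  # n_proj is the bottleneck; sizes are illustrative

P_x = rng.standard_normal((n_proj, n_in))    # compresses x_t
P_h = rng.standard_normal((n_proj, n_cell))  # compresses h_{t-1}
W_z = rng.standard_normal((n_cell, n_proj))
W_h = rng.standard_normal((n_cell, n_proj))
b_z = np.zeros(n_cell)

def mgruip_step(x_t, h_prev):
    """One mGRUIP frame: gate and candidate are both computed from v_t."""
    v_t = P_x @ x_t + P_h @ h_prev                   # Eq. (4)
    z_t = 1.0 / (1.0 + np.exp(-(W_z @ v_t + b_z)))   # Eq. (5)
    h_cand = np.maximum(0.0, W_h @ v_t)              # Eq. (6), BN omitted
    return z_t * h_prev + (1.0 - z_t) * h_cand       # Eq. (7)
```

Because every transform now goes through the small projection, the large square matrices of mGRU are replaced by thin factors, which is what keeps the context extension of Section 3.2 cheap.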
2.3 mGRUIP-Ctx
The input projection forms a bottleneck in mGRUIP, on which we design a context module, temporal convolution, to effectively model the future context [21]. For the l-th layer (l > 1), equation (4) now becomes:

v_t^l = P_s^l s_t^l + P_h^l h_{t−1}^l    (8)

where s_t^l is the concatenation of the current input vector with the output state vectors of the preceding layer at several future frames:

s_t^l = [x_t^l, h_{t+1·s}^{l−1}, h_{t+2·s}^{l−1}, …, h_{t+K·s}^{l−1}]    (9)

In particular, x_t^l is the input vector of layer l (x_t^l is actually h_t^{l−1}, since the input of a layer is the output of the layer below), h_{t+n·s}^{l−1} is the output state vector of layer l−1 at frame t+n·s, s is the step stride, K is the order of future context, and n ∈ {1, …, K} is the loop index.
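The splicing of Eq. (9) amounts to gathering a few strided future frames from the layer below and concatenating them. A sketch follows; the end-of-utterance padding convention (repeating the last frame) is our assumption, as the paper does not specify it:

```python
def splice_future(h_prev_layer, t, K, s):
    """Build the spliced vector of Eq. (9): the current frame of the layer
    below plus K future frames taken at stride s.

    h_prev_layer: list of per-frame vectors from layer l-1. Frames past the
    end of the utterance are padded by repeating the last frame (assumed).
    """
    T = len(h_prev_layer)
    frames = [h_prev_layer[t]]                                    # x_t^l == h_t^{l-1}
    frames += [h_prev_layer[min(t + n * s, T - 1)] for n in range(1, K + 1)]
    return [v for f in frames for v in f]                         # concatenation
```

With K = 2 and s = 3, frame t is spliced with frames t+3 and t+6, so each context-equipped layer looks K·s frames ahead.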
3 Proposed Revisions
3.1 Applying Batch Normalization
Our first refinement is to figure out the best way to apply BN in the proposed models. There are two possible locations to apply BN in the structure: the update gate and the cell activation.
3.1.1 BN for the update gate
Three different BN methods for the update gate are evaluated:

No BN
Just as defined by equations (1) and (5) in Section 2 for mGRU and mGRUIP(-Ctx), respectively.
BN on ItoH only
For mGRU, equation (1) now becomes:
z_t = σ(BN(W_z x_t) + U_z h_{t−1} + b_z)    (10)
For mGRUIP(-Ctx), equation (5) now becomes:
z_t = σ(W_z (BN(v_t^x) + v_t^h) + b_z)    (11)
where v_t^x and v_t^h are the vectors projected from x_t (or s_t for mGRUIP-Ctx) and h_{t−1}, respectively.

BN on ItoH and HtoH
For mGRU, equation (1) now becomes:
z_t = σ(BN(W_z x_t) + BN(U_z h_{t−1}) + b_z)    (12)
For mGRUIP(-Ctx), equation (5) now becomes:
z_t = σ(W_z (BN(v_t^x) + BN(v_t^h)) + b_z)    (13)
3.1.2 BN for the cell activation
For the cell activation, omitting BN entirely can cause numerical instabilities for the ReLU, so only two methods are tried:

BN on ItoH only
We first try this method for the mGRU model. Equation (2) now becomes:
h̃_t = ReLU(BN(W_h x_t) + U_h h_{t−1})    (14)
It degrades the performance significantly (shown in Table 1), so no further trial is made for mGRUIP(-Ctx).

BN on ItoH and HtoH
Just as defined by equations (2) and (6) in Section 2 for mGRU and mGRUIP(-Ctx), respectively.
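The hybrid placement that Section 5.1 identifies as best can be sketched for the mGRU form with a simplified frame-level BN (per-cell statistics over a batch of frames; learned scale/offset and running statistics omitted, and all shapes and names are illustrative):

```python
import numpy as np

def batch_norm(a, eps=1e-5):
    """Simplified frame-level BN: one mean/variance per cell over the batch
    axis; learned gamma/beta and running statistics are omitted."""
    return (a - a.mean(axis=0)) / np.sqrt(a.var(axis=0) + eps)

rng = np.random.default_rng(2)
N, n_in, n_cell = 32, 4, 8               # illustrative batch of frames
X = rng.standard_normal((N, n_in))       # inputs x_t
H = rng.standard_normal((N, n_cell))     # recurrent states h_{t-1}
W_z = rng.standard_normal((n_in, n_cell)); U_z = rng.standard_normal((n_cell, n_cell))
W_h = rng.standard_normal((n_in, n_cell)); U_h = rng.standard_normal((n_cell, n_cell))

# Hybrid placement: normalize only the ItoH branch of the gate pre-activation,
# but both branches of the cell pre-activation.
gate_pre = batch_norm(X @ W_z) + H @ U_z                # gate: BN on ItoH only
cell_pre = batch_norm(X @ W_h) + batch_norm(H @ U_h)    # cell: BN on ItoH and HtoH
```

The un-normalized recurrent term in the gate pre-activation is what lets the sigmoid saturate, i.e., lets the gate actually open or close.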
3.2 Enlarging Model Context
Our second revision is to enlarge the model context, so that the context module models not only the future context but also the history information. Equation (9) now becomes:

s_t^l = [h_{t−K_1·s_1}^{l−1}, …, h_{t−s_1}^{l−1}, x_t^l, h_{t+s_2}^{l−1}, …, h_{t+K_2·s_2}^{l−1}]    (15)

where s_1 and s_2 are the step strides for the history and future context, respectively, K_1 is the order of history and K_2 is the order of future context.
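The enlarged splice gathers K_1 strided history frames and K_2 strided future frames around the current one; the index pattern, and the look-ahead latency implied by stacking such layers, can be sketched as follows. The per-layer settings below are hypothetical, not the paper's exact configuration:

```python
def context_indices(t, K1, s1, K2, s2):
    """Frame indices gathered by the enlarged context module: K1 history
    frames at stride s1, the current frame, K2 future frames at stride s2."""
    past = [t - n * s1 for n in range(K1, 0, -1)]
    future = [t + n * s2 for n in range(1, K2 + 1)]
    return past + [t] + future

def model_latency_ms(layer_ctx, frame_shift_ms=10):
    """Output latency implied by stacked look-ahead: each layer's future
    context (K2 * s2 frames) accumulates across layers."""
    return sum(K2 * s2 for (_K1, _s1, K2, s2) in layer_ctx) * frame_shift_ms

# Illustrative: four context-equipped layers, each looking 2 frames ahead
# with stride 3 -> 24 frames of total look-ahead = 240 ms at a 10 ms shift.
layers = [(1, 3, 2, 3)] * 4
print(context_indices(100, 1, 3, 2, 3))  # [97, 100, 103, 106]
print(model_latency_ms(layers))          # 240
```

Only the future orders and strides contribute to latency; the history side of Eq. (15) is free in that respect, which is why enlarging it costs nothing but a few projection parameters.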
Table 1: WER (%) on Hub5'00 for different BN methods applied to the update gate and the cell activation.

Gate BN       Cell BN       WER (%)
ItoH  HtoH    ItoH  HtoH    SWB    CHE    Total
mGRU
N     N       Y     Y       10.2   20.6   15.5
Y     Y       Y     Y       diverge
Y     N       Y     Y       10.1   19.9   15.1
Y     N       Y     N       11.1   22.1   16.7
mGRUIP
N     N       Y     Y       9.8    19.0   14.5
Y     Y       Y     Y       19.8   27.1   23.5
Y     N       Y     Y       9.5    18.6   14.2
4 Experimental Settings
This section introduces the three ASR tasks used in this work. All the models in this paper are trained with the LF-MMI objective function computed on 33 Hz outputs [27].
4.1 Switchboard ASR Task
The training data is the 309-hour Switchboard-I corpus. Evaluation is performed in terms of WER on the full Switchboard Hub5'00 test set, consisting of two subsets: Switchboard (SWB) and CallHome (CHE). The experimental setup follows [27]. WER results are reported after 4-gram LM rescoring of lattices generated using a trigram LM. Please refer to [27] for more details.
4.2 Medium-Scale Mandarin ASR Task
The second task is an internal medium-scale Mandarin ASR task, which actually has two subtasks. The first one contains 1400 hours of original mobile recording data. The second one applies 6-fold data augmentation using speed perturbation (×3) [28] and noise/reverberation perturbation (×2) [29], resulting in 8400 hours of training data in total.
4.3 Large-Scale Mandarin ASR Task
The large-scale Mandarin ASR task contains 10K hours of original data and is augmented 6-fold, resulting in 60K hours of training data in total. The performance of the Mandarin ASR tasks is evaluated on five publicly available test sets, including three clean and two noisy ones. The clean sets are the development and test sets of AiShell-1 [30] and the clean test set of THCHS-30 (THCHS30_Clean) [31].
The two noisy test sets are THCHS30_Car and THCHS30_Cafe, which are versions of THCHS30_Clean corrupted by car and cafeteria noise, respectively. The noise level is 0 dB.
5 Experimental Results
5.1 Applying Batch Normalization
5.1.1 Switchboard ASR Task
On this task, we compare different BN methods for two models, mGRU and mGRUIP, and the results are shown in Table 1. Both of them contain 5 layers, and each layer consists of 1024 cells. The input projection layer of mGRUIP has 512 units.
Several observations can be made from Table 1. Firstly, comparing the first two lines of each model's results, we can see that applying BN to both ItoH and HtoH for the update gate is quite harmful; the training of the mGRU model even diverges. We think the reason is as follows. The update gate is controlled by a sigmoid function and is expected to learn to open or close at the right time. BN on ItoH and HtoH keeps the input away from the saturated regime of this nonlinearity. This may be helpful when the sigmoid serves as a hidden-node activation. However, when it is used to control a gate, this is harmful, since it keeps the gate half-closed or half-opened, which is opposite to the responsibility of gate control. To verify this, we investigate the evolution of the update-gate activation during recognition. A speech segment is chosen from the test set (the utterance id is en_4170B_064608064704, with the text content "That is quite a difference"), and the features are fed to the three mGRUIP models trained with different BN methods. The average activation of the update gate over the cells of the 5th layer for each frame is shown in Figure 1. It is very clear that applying BN to both ItoH and HtoH for the update gate causes the gate activation to fluctuate around 0.5, meaning the gate is unable to effectively control the dependency between the history h_{t−1} and the candidate state h̃_t. Another interesting finding is that the average gate activations of the No BN and BN-on-ItoH models are almost always greater than 0.5. We think this is because the speech signal is a sequence that evolves rather slowly, in which the past history can virtually always be helpful.
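The diagnostic plotted in Figure 1 is simply the per-frame average of the gate vector; for a T × n_cell matrix of sigmoid gate outputs Z it can be computed as below (the matrix here is synthetic, standing in for the activations of a real model):

```python
import numpy as np

def mean_gate_activation(Z):
    """Average update-gate activation over cells for each frame, the quantity
    plotted in Figure 1; Z is a T x n_cell matrix of sigmoid outputs."""
    return Z.mean(axis=1)

# A gate stuck near 0.5 blends history and candidate indiscriminately,
# which is the failure mode BN on both ItoH and HtoH produces.
Z_stuck = np.full((4, 8), 0.5)
print(mean_gate_activation(Z_stuck))  # four values of 0.5
```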
Table 2: CER (%) of mGRUIP-Ctx with different layer-wise context settings (format K_1,s_1;K_2,s_2), trained on the 1400-hour data.

Model           Layer-wise context setting    Latency  AiShell       THCHS30
                2    3    4    5              (ms)     dev   test    Clean  Car    Cafe
mGRUIP-Ctx-A    0;   0;   0;   0;             170      4.61  5.56    10.34  10.67  41.89
mGRUIP-Ctx-B    ;    ;    ;    ;              200      4.49  5.46    10.11  10.46  39.79
mGRUIP-Ctx-C    ;    ;    ;    ;              200      4.49  5.50    10.08  10.45  38.41
mGRUIP-Ctx-D    ;    ;    ;    ;              290      4.47  5.47    9.95   10.36  38.76
The second observation from Table 1 is that we should apply BN to both ItoH and HtoH for the cell ReLU activation (comparing the last two lines of mGRU's results). This is possibly because applying BN this way provides the best numerical stability for the ReLU. Finally, we can conclude from Table 1 that the best way to apply BN in mGRU-related models is a hybrid one: applying BN to ItoH only for the update gate, and to both ItoH and HtoH for the cell ReLU activation.
Table 3: CER (%) on the medium-scale Mandarin task with different BN settings on the update gate.

Model         Gate BN   AiShell       THCHS30
                        dev   test    Clean  Car    Cafe
1400 Hours Original Data
TDNN-LSTM     -         4.81  5.98    10.97  11.38  44.20
mGRUIP-Ctx    No [21]   4.66  5.71    10.38  10.77  40.26
mGRUIP-Ctx    ItoH      4.61  5.56    10.34  10.67  41.89
8400 Hours Augmented Data
TDNN-LSTM     -         4.50  5.42    10.11  10.30  23.86
mGRUIP-Ctx    No        4.45  5.42    10.24  10.34  23.43
mGRUIP-Ctx    ItoH      4.33  5.26    10.00  10.02  22.35
5.1.2 Medium-Scale Mandarin ASR Task
According to the findings in Section 5.1.1, we apply the optimal BN methods to the mGRUIP-Ctx model on the medium-scale Mandarin ASR task, and the results are shown in Table 3. The baseline model is TDNN-LSTM [8], which interleaves 7 TDNN layers with 3 LSTM layers. The mGRUIP-Ctx model contains 5 hidden layers, each with 2560 cells and a 256-dimensional input projection layer. The settings of context order and step stride for each layer are the same as in our previous work [21].
According to Table 3, with 1400 hours of original training data, mGRUIP-Ctx with no BN on the update gate gives a 3% to 5% relative gain over TDNN-LSTM. However, this gain disappears after the training data is augmented with perturbation. This can be attributed to the different difficulties of the two subtasks: model learning with perturbed data is harder than with original data. After BN is correctly applied, mGRUIP-Ctx performs better than TDNN-LSTM on the harder subtask, demonstrating the effectiveness of the BN method. Moreover, the training curve reveals that the optimization process becomes much more stable (not shown in this paper).
5.2 Enlarging Model Context
5.2.1 Medium-Scale Mandarin ASR Task
Our second revision is to enlarge the model context. We compare the performance of mGRUIP-Ctx with different settings of the context orders K_1, K_2 and step strides s_1, s_2 (BN methods are correctly applied). For fast experimentation, all the models are trained on the 1400 hours of original data, and the results are reported in Table 2. The setting of context order and stride for each layer is represented in the format K_1,s_1;K_2,s_2.
From Table 2, comparing mGRUIP-Ctx-A and mGRUIP-Ctx-B, it can be seen that allowing the context module to model the history information as well is quite beneficial to performance (mGRUIP-Ctx-A is the model used in Section 5.1.2). In addition, increasing the order of future context can further improve performance, at the cost of additional model latency (mGRUIP-Ctx-D vs. mGRUIP-Ctx-B).
Next, the best model, mGRUIP-Ctx-D, is compared with several baselines, including BLSTM, TDNN-LSTM and LSTM, on the 8400-hour data-augmented task, and the results are presented in Table 4. BLSTM and LSTM both contain 5 hidden layers. Each layer of LSTM, and each directional sub-layer of BLSTM, has 1024 cells and 512 linear projection units. TDNN-LSTM is the model used in Section 5.1.2.
Table 4: CER (%) on the 8400-hour task. The CERR line is the relative gain of mGRUIP-Ctx-D over LSTM.

Model          Latency  AiShell       THCHS30
               (ms)     dev   test    Clean  Car    Cafe
BLSTM          2020     4.24  5.05    9.79   9.87   22.82
TDNN-LSTM      210      4.50  5.42    10.11  10.30  23.86
LSTM           70       5.07  6.23    12.06  11.82  33.67
mGRUIP-Ctx-D   290      4.13  5.03    9.56   9.71   20.94
CERR (%)       -        18.5  19.3    20.7   17.9   37.8
The CERR line in Table 4 is the relative gain of mGRUIP-Ctx-D over LSTM, ranging from 18% to 38% across the test sets, indicating that the proposed model performs significantly better than LSTM. Compared with TDNN-LSTM, mGRUIP-Ctx-D gives 5% to 12% relative CER reduction with 12.4M fewer parameters (22.4M vs. 34.8M) and 80 ms additional latency (290 ms vs. 210 ms). mGRUIP-Ctx-D even outperforms BLSTM, with much lower model latency (290 ms vs. 2020 ms) and far fewer parameters (22.4M vs. 55.3M). Compared to the performance of mGRUIP-Ctx before revision (the next-to-last line of Table 3), mGRUIP-Ctx-D shows a 6% to 10% relative gain, which demonstrates the effectiveness of the two proposed revisions.
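The CERR rows are plain relative reductions, and the Table 4 line can be reproduced directly from the LSTM and mGRUIP-Ctx-D rows:

```python
def cerr(base, new):
    """Relative CER reduction (%) of `new` over `base`, as in the CERR rows."""
    return round(100.0 * (base - new) / base, 1)

# CERs from Table 4 in test-set order: AiShell dev, test; THCHS30 Clean, Car, Cafe
lstm  = [5.07, 6.23, 12.06, 11.82, 33.67]
ctx_d = [4.13, 5.03, 9.56, 9.71, 20.94]
print([cerr(b, n) for b, n in zip(lstm, ctx_d)])  # [18.5, 19.3, 20.7, 17.9, 37.8]
```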
5.2.2 Large-Scale Mandarin ASR Task
Finally, mGRUIP-Ctx-D's superiority is verified on a much larger Mandarin ASR task, which contains 60K hours of speech data. We train a TDNN-BLSTM model as one baseline, which is stronger than BLSTM and contains 3 TDNN layers and 5 BLSTM layers. All of the other models have the same structure as in Section 5.2.1. Results are shown in Table 5.
Table 5: CER (%) on the 60K-hour task. The CERR line is the relative gain of mGRUIP-Ctx-D over LSTM.

Model          AiShell       THCHS30
               dev   test    Clean  Car    Cafe
TDNN-BLSTM     3.55  4.21    8.72   8.85   18.73
TDNN-LSTM      3.90  4.68    9.55   9.65   21.53
LSTM           4.32  5.26    10.30  10.33  23.84
mGRUIP-Ctx-D   3.71  4.52    9.09   9.16   19.06
CERR (%)       14.1  14.1    11.8   11.3   20.1
According to Table 5, the relative improvement of mGRUIP-Ctx-D over LSTM is 11% to 20%, and the gain over TDNN-LSTM is 3% to 11%. mGRUIP-Ctx-D performs slightly worse (2% to 6% relative) than the strong baseline, TDNN-BLSTM, but has two advantages: far fewer parameters and much lower latency.
6 Conclusions
In this paper, we improve our previously proposed model mGRUIP-Ctx with two revisions: applying BN methods and enlarging the model context. After the revisions, mGRUIP-Ctx outperforms LSTM by a large margin. It even performs slightly better than a strong BLSTM on one task, with far fewer parameters and much lower latency.
References
 [1] G. E. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition,” IEEE Transactions on Audio, Speech & Language Processing, vol. 20, no. 1, pp. 30–42, 2012.
 [2] S. Zhang, C. Liu, H. Jiang, S. Wei, L. Dai, and Y. Hu, “Feedforward sequential memory networks: A new structure to learn long-term dependency,” Computer Science, 2015.
 [3] S. Zhang, H. Jiang, S. Xiong, S. Wei, and L. R. Dai, “Compact feedforward sequential memory networks for large vocabulary continuous speech recognition,” in INTERSPEECH, 2016, pp. 3389–3393.
 [4] S. Zhang, M. Lei, Z. Yan, and L. Dai, “Deep-FSMN for large vocabulary continuous speech recognition,” arXiv preprint arXiv:1803.05030, 2018.
 [5] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang, “Phoneme recognition using time-delay neural networks,” Readings in Speech Recognition, vol. 1, no. 2, pp. 393–404, 1990.
 [6] V. Peddinti, D. Povey, and S. Khudanpur, “A time delay neural network architecture for efficient modeling of long temporal contexts,” in INTERSPEECH, 2015.
 [7] H. Sak, A. Senior, and F. Beaufays, “Long shortterm memory based recurrent neural network architectures for large vocabulary speech recognition,” Computer Science, pp. 338–342, 2014.
 [8] V. Peddinti, Y. Wang, D. Povey, and S. Khudanpur, “Low latency acoustic modeling using temporal convolution and lstms,” IEEE Signal Processing Letters, vol. PP, no. 99, pp. 1–1, 2017.
 [9] M. Schuster and K. K. Paliwal, Bidirectional recurrent neural networks. IEEE Press, 1997.
 [10] A. Graves, S. Fernández, and J. Schmidhuber, Bidirectional LSTM Networks for Improved Phoneme Classification and Recognition. Springer Berlin Heidelberg, 2005.
 [11] A. Graves, N. Jaitly, and A. R. Mohamed, “Hybrid speech recognition with deep bidirectional lstm,” in Automatic Speech Recognition and Understanding, 2014, pp. 273–278.
 [12] Y. Zhang, G. Chen, D. Yu, K. Yao, S. Khudanpur, and J. Glass, “Highway long shortterm memory rnns for distant speech recognition,” Computer Science, pp. 5755–5759, 2015.
 [13] A. Zeyer, R. Schlüter, and H. Ney, “Towards online-recognition with deep bidirectional lstm acoustic models,” in INTERSPEECH, 2016, pp. 3424–3428.
 [14] K. Chen and Q. Huo, Training deep bidirectional LSTM acoustic model for LVCSR by a context-sensitive-chunk BPTT approach. IEEE Press, 2016.
 [15] K. Chen, Z. J. Yan, and Q. Huo, “A context-sensitive-chunk bptt approach to training deep lstm/blstm recurrent neural networks for offline handwriting recognition,” in International Conference on Document Analysis and Recognition, 2016, pp. 411–415.
 [16] S. Xue and Z. Yan, “Improving latency-controlled blstm acoustic models for online speech recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2017, pp. 5340–5344.
 [17] S. Hochreiter and J. Schmidhuber, Long shortterm memory. Springer Berlin Heidelberg, 1997.
 [18] F. A. Gers, J. Schmidhuber, and F. Cummins, Learning to Forget: Continual Prediction with LSTM. Istituto Dalle Molle di Studi sull’Intelligenza Artificiale, 1999.
 [19] F. A. Gers and J. Schmidhuber, “Recurrent nets that time and count,” in IEEE-INNS-ENNS International Joint Conference on Neural Networks, 2000, pp. 189–194, vol. 3.
 [20] A. Graves and J. Schmidhuber, “Framewise phoneme classification with bidirectional lstm and other neural network architectures,” Neural Networks, vol. 18, no. 5–6, pp. 602–610, 2005.
 [21] J. Li, X. Wang, Y. Zhao, and Y. Li, “Gated recurrent unit based acoustic modeling with future context,” arXiv preprint arXiv:1805.07024, 2018.
 [22] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
 [23] C. Laurent, G. Pereyra, P. Brakel, Y. Zhang, and Y. Bengio, “Batch normalized recurrent neural networks,” in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 2657–2661.
 [24] T. Cooijmans, N. Ballas, C. Laurent, Ç. Gülçehre, and A. Courville, “Recurrent batch normalization,” arXiv preprint arXiv:1603.09025, 2016.
 [25] M. Ravanelli, P. Brakel, M. Omologo, and Y. Bengio, “Improving speech recognition by revising gated recurrent units,” INTERSPEECH, pp. 1308–1312, 2017.
 [26] ——, “Light gated recurrent units for speech recognition,” IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 2, no. 2, pp. 92–102, 2018.
 [27] D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur, “Purely sequence-trained neural networks for asr based on lattice-free mmi,” in INTERSPEECH, 2016, pp. 2751–2755.
 [28] T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, “Audio augmentation for speech recognition,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
 [29] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur, “A study on data augmentation of reverberant speech for robust speech recognition,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 5220–5224.
 [30] H. Bu, J. Du, X. Na, B. Wu, and H. Zheng, “AIShell-1: An open-source Mandarin speech corpus and a speech recognition baseline,” 2017.
 [31] D. Wang, X. Zhang, and Z. Zhang, “THCHS-30: A free Chinese speech corpus,” 2015. [Online]. Available: http://arxiv.org/abs/1512.01882