# Deep Triphone Embedding Improves Phoneme Recognition

## Abstract

In this paper, we present a novel Deep Triphone Embedding (DTE) representation, derived from a Deep Neural Network (DNN), that encapsulates the discriminative information present in the adjoining speech frames. DTEs are generated at the first stage by a four-hidden-layer DNN trained with tied-triphone classification accuracy as the optimization criterion. We then retain the activation vector of the last hidden layer for each speech MFCC frame and perform dimensionality reduction to obtain a lower-dimensional representation, which we term the DTE. DTEs, along with MFCC features, are fed into a second-stage four-hidden-layer DNN, which is subsequently trained for the same task of tied-triphone classification. Both DNNs are trained using triphone labels generated from a tied-state triphone HMM-GMM system by performing a forced alignment between the transcriptions and the MFCC feature frames. We conduct experiments on the publicly available TED-LIUM speech corpus. The results show that the proposed DTE method provides an absolute improvement in phoneme recognition when compared with a competitive hybrid tied-state triphone HMM-DNN system.

## 1 Introduction

With the advent of deep learning, various difficult machine learning tasks such as speech recognition, image classification, and Natural Language Processing (NLP) have seen notable advances in their respective performances [1]. As is well known, speech is produced by modulating a small number of parameters of a dynamical system, which implies that speech features may reside in a low-dimensional manifold, a non-linear subspace of the original high-dimensional feature space. DNNs with multiple hidden layers, trained with Stochastic Gradient Descent (SGD) over large amounts of labeled data, are able to learn these manifolds much better than purely generative models such as the Context Dependent tied-triphone Hidden Markov Model - Gaussian Mixture Model (CD-HMM-GMM) [2].

Speech recognition is not a static pattern classification problem; it entails recognizing a time series of speech frames in terms of linguistic symbols such as phonemes or words. It is well known that the probability of the current phoneme, triphone, or word is highly dependent on the previous ones, due to the pronunciation and grammatical constraints of a natural language. Traditionally, this property has been exploited very successfully, though only at a later stage of the speech recognition pipeline (Viterbi decoding [4]), in the form of probabilistic language models.

Unfortunately, a DNN acoustic model [1] does not harness this useful dependency information, since it lacks a memory state and does not remember the triphone or phoneme classification identities of the previous few frames. On the other hand, Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) recurrent neural networks [6], through their explicit memory cell state, are able to capture long-term dependencies in the input time-series data. In particular, LSTMs have been more successful because their memory cell consists of several gates (forget gate, input gate, output gate) that regulate the ability to add or remove information from the previous frames and the current frame [9].

Despite the widespread usage of recurrent neural networks, effective training and regularization of these networks is still a non-trivial task and continues to pose numerous challenges for research [10]. The vanishing gradient problem is among the first known issues with the training of recurrent neural networks, and was only partly mitigated by the invention of LSTM networks [9]. Other challenges that arise with LSTM recurrent neural networks are a critical dependence on weight initialization, difficulty in optimization and regularization, difficulty in parallel computation due to the sequential dependence in training, and the high memory requirement of the Back-Propagation Through Time (BPTT) algorithm used to train recurrent networks [10]. On the other hand, feed-forward DNNs, owing to their simpler architecture, are not constrained by most of these challenges and are relatively easy to train with Stochastic Gradient Descent and random initialization of the weights [13]. Therefore, it is desirable to develop a new DNN-like architecture that borrows the memorization ability of the LSTM to capture useful information from the previous and next frames for sequence recognition tasks such as phoneme recognition.

One common solution is to provide a large number of adjoining speech feature frames (MFCC) as input when training a DNN. As reported in [5], the hybrid HMM-DNN phoneme recognition accuracy on the Wall Street Journal speech corpus indeed improved as the number of input MFCC frames was increased. However, this approach has a major shortcoming: the information about the adjoining speech frames is provided as raw MFCC feature frames, which are not treated any differently from the centered raw speech frames. In contrast, the presence of a memory cell and recurrent connections in the LSTM allows it to learn an adaptive and succinct representation that encapsulates the information in the adjoining speech frames likely to be relevant for phoneme/triphone classification [6].

Our solution to these challenges is moderately inspired by the success of the word2vec approach [14], developed in the NLP research community. The word2vec algorithm learns a distributed and compositional representation of text words by optimizing a likelihood function that maximizes the probability of co-occurring words and their respective context words in a text corpus. With similar motivations, we first learn DTEs for triphones on a speech corpus, and then use the DTEs as additional features, along with MFCC features, to train a second-stage DNN that outputs the posterior probability for the current speech frame. DTE embeddings are learnt in a discriminative manner and can also be interpreted as an alternative to memory (a fixed number of embedding frames is held and utilized later), thereby improving the second-stage DNN's classification accuracy. Similar to the workings of language models, which assign high probability to those words that are likely to co-occur with the context words, the DTEs, when used as additional input features along with MFCC, increase the DNN output probability of the triphone class that is most likely to be true.

The main contribution of our paper is to encapsulate relevant information from the previous and next speech frames in the form of the DTE. We train a first-stage four-hidden-layer DNN with the tied-triphone classification criterion [1] and use the representation learned by its last hidden layer, after dimensionality reduction, to generate the DTE. We then input adjoining DTE frames along with MFCC features to a second-stage DNN, which performs the task of tied-triphone classification. Training both DNNs on the same objective function (tied-state triphone classification) renders the generated DTE representation aware of the downstream task and thus improves the learning of the final second-stage DNN. Furthermore, we conduct phoneme recognition experiments to verify that the improved triphone/phoneme classification accuracy also translates into better phoneme recognition.

The remainder of this paper is organized as follows: In Section 2, we describe the proposed method to generate DTE representations. An introduction to the hybrid HMM-DNN system for phoneme recognition is presented in Section 3. Experiments and results are provided in Section 4, and related work is discussed in Section 5. We summarize our conclusions and suggest future directions of research in Section 6.

## 2 The Proposed Method

The proposed method involves two stages of learning: i) training a first-stage DNN and generating a DTE representation for the speech frames arising from both the left and right context around a few centered speech frames, and ii) using the adjoining frames' DTEs, along with the MFCC feature vectors of a few center frames, as input to a second-stage DNN. This approach is illustrated schematically in Figure 1. More formally, the task is to learn a mapping function between tied-triphone labels and the MFCC feature vectors arising from the left-context, center, and right-context frames of speech, as shown in Equation 1:

where the two terms represent the first-stage DNN and the learnt function of the second-stage DNN, respectively. A functional operator extracts the last hidden layer's activation vector (one dimension per node in the last hidden layer) of the first-stage DNN. We then reduce its dimension through Principal Component Analysis (PCA) to obtain the final low-dimensional vector, which is the succinct representation we name the DTE.
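As a toy illustration of this two-stage construction, the following numpy sketch shows how a first-stage activation can be reduced and concatenated with MFCC features as second-stage input. All dimensions, weights, and the projection basis here are made-up placeholders; the paper's actual layer sizes and trained parameters are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def forward_hidden(x, weights, biases):
    """Run x through the hidden layers; return the LAST hidden activation."""
    h = x
    for W, b in zip(weights, biases):
        h = relu(h @ W + b)
    return h

# Toy dimensions (illustrative only).
mfcc_dim, hidden_dim, embed_dim = 39, 64, 8

# First-stage DNN: 4 hidden layers with random weights for illustration.
Ws = [rng.standard_normal((mfcc_dim if i == 0 else hidden_dim, hidden_dim)) * 0.1
      for i in range(4)]
bs = [np.zeros(hidden_dim) for _ in range(4)]

frame = rng.standard_normal(mfcc_dim)        # one MFCC frame
activation = forward_hidden(frame, Ws, bs)   # last hidden layer activation

# Dimensionality reduction; a random matrix stands in for a fitted PCA basis.
pca_basis = rng.standard_normal((hidden_dim, embed_dim))
dte = activation @ pca_basis                 # the Deep Triphone Embedding

# Second-stage input: center MFCC frame concatenated with adjoining DTEs
# (here just a left and a right DTE, both copies of the same toy vector).
second_stage_input = np.concatenate([frame, dte, dte])
```

The key point is that the second stage never sees the raw context frames; it sees them only through the compact, discriminatively trained embedding.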

We next describe the reasons why we chose the DNN's last hidden layer's activation vector as the DTE. Consider a DNN with multiple hidden layers (four in this paper) and an output vector that models the class (tied-state triphone) posterior probability given the input MFCC features. This is achieved with a softmax non-linearity at the output layer, whereas the non-linearity at the hidden-layer nodes could be tanh or Rectified Linear Units (ReLU). The posterior probability that the input feature belongs to a given class is:

where the weight vector in question connects each output unit to all the hidden units in the last hidden layer. As explained in [15], the workings of a DNN can also be interpreted as follows:

• A DNN with multiple hidden layers maps the input feature to a high-dimensional and sparse non-linear feature space, namely the activation vector at the last hidden layer. This new feature space makes the classification problem easier to solve than in the original raw feature space.

• The softmax non-linearity at the output layer can then be interpreted as a set of Logistic Regression classifiers acting on the last hidden layer's activation vector, exploiting the mapping learnt in the hidden layers to generate new and effective features.
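This view of the softmax output layer as a bank of logistic-regression classifiers over the last hidden activation can be made concrete with a small numpy sketch (the weights and the sparse ReLU activation are arbitrary toy values, not trained parameters):

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# h: a sparse last-hidden-layer activation (toy); rows of W: one weight
# vector per output class, exactly as in multi-class logistic regression.
h = np.array([0.0, 1.2, 0.0, 0.4])
W = np.array([[0.5, -0.1, 0.3, 0.2],
              [0.1,  0.9, 0.0, 0.1],
              [-0.2, 0.0, 0.1, 0.5]])
b = np.zeros(3)

posterior = softmax(W @ h + b)   # P(class k | input), one entry per class
```

Each output unit computes a linear score on h, and the softmax normalizes the scores into a proper posterior distribution.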

Based on this reasoning, we retain the activation vector at the last hidden layer of the first-stage DNN, followed by dimensionality reduction, as the Deep Triphone Embedding (DTE): a representation of the previous and next frames that provides relevant information for the task of phoneme classification. The activation vectors at the hidden layers are known to be highly sparse, and hence we can further compress them to a low-dimensional representation using either a Principal Component Analysis (PCA) or Linear Discriminant Analysis (LDA) transform. Dimensionality reduction also helps in reducing the number of model parameters of the second-stage DNN. Both the first-stage and second-stage DNNs are trained using the cross-entropy loss, defined with respect to triphone labels obtained by forced alignment using a well-trained tied-state triphone HMM-GMM acoustic model.
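A minimal sketch of the PCA compression step, assuming the collected activations form the rows of a matrix (the dimensions here are toy values, not the paper's):

```python
import numpy as np

def pca_fit(X, k):
    """Return the mean and top-k principal axes of row-data X via SVD."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k].T  # columns are the leading principal directions

rng = np.random.default_rng(1)
# Sparse-ish stand-in for ReLU activations: 500 frames, 64 hidden units.
activations = np.maximum(0.0, rng.standard_normal((500, 64)))

mean, axes = pca_fit(activations, k=8)
dtes = (activations - mean) @ axes   # 64-dim activations -> 8-dim DTEs
```

An LDA transform would be fitted the same way, except the projection would be chosen to maximize class separation using the triphone labels rather than variance alone.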

## 3 Hybrid HMM-DNN System

Speech recognition performance has improved tremendously when the DNN is trained with tied-state triphone states as the output labels rather than monophone labels [1]. This is due to the improved co-articulation modelling by the triphone states. However, the triphone states are much larger in number than the monophone states, and training a DNN with triphones as labels is a harder learning problem than training with monophone labels.

We have trained a strong baseline tied-state triphone HMM-GMM acoustic model on a subset of the TED-LIUM corpus [16]. The TED-LIUM corpus consists of TED talks given in English by different speakers covering a wide variety of speaking styles and demographic and linguistic backgrounds (US English, British English, Continental European English, Indian English, Chinese English, and Australian English speakers), making it a challenging multi-accented conversational speech corpus for speech recognition. We have developed a C++ library to train the CD-HMM-GMM, which uses the standard Expectation-Maximization (EM) algorithm for parameter estimation and decision-tree-based tying to obtain the tied-triphone states used in our experiments. This system is also used to produce tied-state triphone labels for the entire train, dev, and test sets via forced alignment, in order to train our DNNs. We use the traditional Mel filter bands, followed by DCT coefficients of the log Mel filterbank energies and their delta and delta-delta coefficients, resulting in the MFCC feature vector for each speech frame. The DNN is provided a context of adjoining MFCC frames around the center frame, resulting in the spliced input feature vector. The network processes the input through a sequence of ReLU non-linearities [13]. In particular, at layer l the network computes h_l = ReLU(W_l h_{l-1} + b_l), where W_l is a matrix of trainable weights, b_l is a vector of trainable biases, h_l is the l-th hidden layer's activation, and h_{l-1} is the previous hidden layer's activation (or the input x if l = 1).
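The frame-splicing that builds the DNN input from a center MFCC frame and its context can be sketched as follows (a window of ±5 frames and 39-dimensional MFCCs are assumed purely for illustration; the paper's actual context width is not reproduced here):

```python
import numpy as np

def splice(frames, context):
    """Stack each frame with `context` frames on either side (edge-padded),
    producing one flat input vector per center frame."""
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    return np.stack([padded[i:i + 2 * context + 1].ravel()
                     for i in range(len(frames))])

rng = np.random.default_rng(2)
mfcc = rng.standard_normal((100, 39))   # 100 frames, 39-dim MFCC (toy)
spliced = splice(mfcc, context=5)       # center frame +/- 5 frames
```

Edge padding replicates the first and last frames so that every center frame, including those at utterance boundaries, gets a full context window.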

The Theano library [17] is used to train the DNNs using Stochastic Gradient Descent (SGD) with a progressively decaying learning rate, validated against a small dev-set to decide when to reduce the learning rate. All parameters in the weight and bias matrices are initialized randomly.
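A minimal sketch of SGD with a dev-set-validated decaying learning rate, on a toy quadratic loss standing in for the cross-entropy objective (the decay factor, schedule, and loss are illustrative assumptions, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(3)
w = rng.standard_normal(2)   # toy "parameters"
lr = 0.1

# Minimize f(w) = ||w||^2; halve the LR whenever the dev criterion
# fails to improve, mimicking a validation-driven decay schedule.
best_dev = float("inf")
for epoch in range(50):
    grad = 2 * w                 # gradient of the toy loss
    w -= lr * grad               # SGD update
    dev_loss = float(w @ w)      # stand-in for the dev-set criterion
    if dev_loss >= best_dev:
        lr *= 0.5                # decay on no improvement
    best_dev = min(best_dev, dev_loss)
```

In real training the dev criterion would be frame-level cross-entropy or classification accuracy on held-out speech, evaluated once per epoch.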

## 4 Experiments and Results

### 4.1 Dataset Description

Experiments were performed on a subset of the TED-LIUM corpus, which originally consists of TED talks [16]. We took the first talks from this data-set to form our train-set and dev-set (durations measured after removing the excessive silence frames that dominate any speech corpus). We use the official TED-LIUM test-set and the official pronunciation dictionary, which covers the full phoneme/monophone set. The baseline tied-state triphone HMM-GMM system models each triphone with a three-state left-to-right HMM with a GMM-based emission distribution. We used our customized C++ library to estimate these parameters through the EM algorithm and decision-tree-based state tying.

### 4.2 Phoneme Classification & Recognition

We have used optimal Viterbi decoding with all the acoustic models (HMM-GMM, HMM-DNN variants, and the proposed methods); in the case of the HMM-DNN acoustic models, scaled likelihoods derived from the posterior probabilities of tied-triphone states at the DNN output layer are used as the triphone state likelihoods. A bigram phoneme language model is learned on the train-set phoneme transcripts. The LM factor and phone insertion penalty are empirically tuned on the dev-set. It is worth emphasizing that we measure performance only on non-silence frames for both the phoneme classification and recognition experiments, as silence frames dominate a typical speech corpus and can inflate the results. All of our baselines and the proposed DTE-based methods are trained on the tied-triphone labels. To obtain the phoneme classification and recognition accuracies, we map the center phone of the triphone label to the predicted phoneme.
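The Viterbi recursion used in decoding can be sketched as follows, with toy probabilities in place of the DNN scaled likelihoods and the bigram transition scores (the LM factor and insertion penalty are omitted for brevity):

```python
import numpy as np

def viterbi(log_obs, log_trans, log_prior):
    """Best state sequence given frame log-likelihoods (T x N),
    bigram transition log-probs (N x N), and initial log-probs (N)."""
    T, N = log_obs.shape
    delta = log_prior + log_obs[0]        # best score ending in each state
    back = np.zeros((T, N), dtype=int)    # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_trans  # score of each predecessor
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_obs[t]
    path = [int(delta.argmax())]          # backtrack from the best end state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Toy 2-state example: state 0 is likely early, state 1 likely later.
log_obs = np.log(np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8]]))
log_trans = np.log(np.array([[0.7, 0.3], [0.1, 0.9]]))
log_prior = np.log(np.array([0.5, 0.5]))
best_path = viterbi(log_obs, log_trans, log_prior)  # → [0, 0, 1]
```

In a full decoder, the bigram phoneme LM scores (scaled by the tuned LM factor, plus the insertion penalty) enter exactly where `log_trans` does here.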

For comparison, we have prepared four baselines, namely HMM+GMM, HMM+DNN, HMM+DNN-W, and HMM+DNN-W+D. The first is the traditional generative CD-HMM-GMM system, and the remaining three are variants of the HMM+DNN type of system: HMM+DNN, HMM+DNN-W, and HMM+DNN-W+D use P=10 with 4 hidden layers, P=24 with 4 hidden layers, and P=24 with 8 hidden layers, respectively, where P is the number of previous or next frames used. The last two baselines are as wide as our method, to ensure that the improvement seen in the proposed method is not merely due to the higher number of previous or next frames consumed. The last baseline, HMM+DNN-W+D, has 8 hidden layers and ensures that the improvement is not achieved merely due to the increase in effective depth contributed by the DTE in the proposed methods.

Table 1 shows the results for tri-phoneme and phoneme classification with the various methods used to train an acoustic model. The performance improvement of the HMM+DNN variants over the HMM+GMM method validates the discriminative strength of DNN-based models. On top of that, both variants of our proposed DTE-based method, namely HMM+DTE-LDA+DNN and HMM+DTE-PCA+DNN, achieve absolute improvements in tri-phoneme and phoneme classification accuracy. Relative improvements in both metrics are achieved by HMM+DTE-PCA+DNN when compared against all variants of the HMM+DNN baselines.

The results on phoneme recognition are presented in Table 2 for the various training methods used to learn the acoustic models. We report phoneme recognition accuracy taking into account all substitution, deletion, and insertion errors. The performance of the proposed variants, HMM+DTE-LDA+DNN and HMM+DTE-PCA+DNN, indicates that the proposed method delivers superior performance and thus offers strong potential for building efficient and robust speech recognition systems, with both absolute and relative improvements over a very strong baseline. It is important to note that further optimization of the DTE dimension, which was fixed to 300 in our experiments, might yield additional improvement.

To investigate the effect of the proposed DTE representation more closely, we compute DTE representations corresponding to three tied-state triphone labels [1045, 205, 443] generated by the decision tree. The DTE representation was reduced to two dimensions using the t-SNE method for visualization [18], as shown in Figure 2. The visualization makes it clearly apparent that with the addition of the DTE representation, two of the more confounded triphones ([205, 443]) are much better disentangled than with the raw MFCC features alone.

## 5 Related Work

Deep neural networks have played an important role in the recent resurgence of speech recognition systems [19]. One popular direction of research is to train neural networks in an end-to-end fashion, generating transcriptions directly with an RNN [7]. These methods enjoy the advantage of training directly on the target linguistic sequence conditioned on the input acoustic sequence, and do not require an explicit forced-alignment step to segment the acoustic data. However, they struggle to integrate with large-vocabulary speech recognition systems, due to their inability to combine easily with word-level models [6]. In contrast, hybrid HMM-based methods can embed word-level information seamlessly, which is vital for developing real-world systems. Hence, the focus of this paper is limited to hybrid systems.

Difficulty in training deep models was one of the most critical factors impeding their widespread usage until very recently. A few of the most critical challenges include, but are not limited to, initializing the network weights, choosing the optimization algorithm and hyper-parameters, and regularizing to achieve higher generalization performance [20]. In particular, training RNNs is a hard problem compared with DNNs, for multiple reasons [12]. Motivated by these issues, this paper seeks a method based only on DNNs, though RNNs have been studied quite extensively for speech recognition in the past [23]. Research on RNNs has revealed that bidirectional LSTM recurrent neural networks can be utilized as acoustic models in a manner similar to hybrid HMM-DNN systems. However, the improvement in recognition accuracy over DNNs is reported to be modest [6]. Interestingly, the bidirectional nature of these networks captures both left and right contextual frames of speech, similar to this paper.

Recently, highway networks have also been proposed, motivated by the fact that network depth is crucial to learning better models [24]. While highway networks look conceptually similar, our work does not rely on information flow across direct connections between the layers of the network. Nevertheless, our method also enjoys the advantage of selectively higher depth, as data from the previous and next frames effectively pass through 8 hidden layers.

## 6 Conclusion and Future Work

RNNs, and more specifically LSTMs [6], owing to their explicit memory cell, have shown great promise in sequence recognition tasks. A traditional DNN does not have such a memory cell to persist useful learned information from the adjoining speech frames, and hence is at a disadvantage [1]. However, training a DNN is relatively easy compared to training and tuning an LSTM, whose complex interactions between the various input, output, and memory gates pose significant challenges to SGD-based parameter learning [10].

This paper has presented a novel DTE representation that augments the classical DNN and imparts partial LSTM-like capability by enabling it to retain information from context, without compromising the simplicity of the DNN training procedure. The proposed method yields absolute improvements in tri-phoneme classification, phoneme classification, and phoneme recognition when compared with the baseline classical hybrid HMM-DNN system.

Finally, we note that the results reported in this work are achieved on the challenging TED-LIUM corpus [16], which consists of multi-accented English TED talks given by speakers from very diverse demographic and linguistic backgrounds (US English, British English, Continental European English, Indian English, Chinese English, and Australian English speakers). In future, we will extend our analysis to the full set of available data and to Large Vocabulary Speech Recognition experiments.

### Footnotes

1. Tied-state triphones are used as output classes.

### References

1. “Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition,”
George E Dahl, Dong Yu, Li Deng, and Alex Acero, Audio, Speech, and Language Processing, IEEE Transactions on, vol. 20, no. 1, pp. 30–42, 2012.
2. “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,”
Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al., Signal Processing Magazine, IEEE, vol. 29, no. 6, pp. 82–97, 2012.
3. “Recurrent neural network with backpropagation through time for speech recognition,”
Abdul Manan Ahmad, Saliza Ismail, and DF Samaon, in IEEE International Symposium on Communications and Information Technology, 2004, vol. 1, pp. 98–102.
4. “Dynamic programming search for continuous speech recognition,”
Hermann J Ney and Stefan Ortmanns, Signal Processing Magazine, IEEE, vol. 16, no. 5, pp. 64–83, 1999.
5. “Hybrid context dependent cd-dnn-hmm keyword spotting (kws) in speech conversations,”
Vivek Tyagi, in To Appear In IEEE International Workshop on Machine Learning for Signal Processing, Sept. 13–16, 2016, Salerno, Italy, 2016.
6. “Hybrid speech recognition with deep bidirectional lstm,”
Alex Graves, Navdeep Jaitly, and Abdel-rahman Mohamed, in ASRU, 2013 IEEE Workshop on. IEEE, 2013, pp. 273–278.
7. “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,”
Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber, in Proceedings of the 23rd ICML. 2006, ICML ’06, pp. 369–376, ACM.
8. “Framewise phoneme classification with bidirectional lstm and other neural network architectures,”
Alex Graves and Jürgen Schmidhuber, Neural Networks, pp. 5–6, 2005.
9. “The vanishing gradient problem during learning recurrent neural nets and problem solutions,”
Sepp Hochreiter, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 6, no. 02, pp. 107–116, 1998.
10. “On the difficulty of training recurrent neural networks.,”
Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio, ICML, vol. 28, pp. 1310–1318, 2013.
11. “Learning long-term dependencies with gradient descent is difficult,”
Yoshua Bengio, Patrice Simard, and Paolo Frasconi, IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157–166, 1994.
12. “Advances in optimizing recurrent networks,”
Yoshua Bengio, Nicolas Boulanger-Lewandowski, and Razvan Pascanu, in ICASSP, 2013, pp. 8624–8628.
13. “Deep sparse rectifier neural networks,”
Xavier Glorot, Antoine Bordes, and Yoshua Bengio, in Proc. of International Conference on Artificial Intelligence and Statistics, 2011, pp. 315–323.
14. “Distributed representations of words and phrases and their compositionality,”
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean, in Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, Eds., pp. 3111–3119. 2013.
15. “Deep learning,”
Ian Goodfellow, Yoshua Bengio, and Aaron Courville, book in preparation for MIT Press, 2016.
16. “Ted-lium: an automatic speech recognition dedicated corpus,”
Anthony Rousseau, Paul Deléglise, and Yannick Estève, in Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), 2012, pp. 125–129.
17. “Theano: new features and speed improvements,”
Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian Goodfellow, Arnaud Bergeron, Nicolas Bouchard, David Warde-Farley, and Yoshua Bengio, arXiv preprint arXiv:1211.5590, 2012.
18. “Visualizing high-dimensional data using t-sne,”
L.J.P van der Maaten and G.E. Hinton, Journal of Machine Learning Research, vol. 9: 2579–2605, Nov 2008.
19. “Deep speech 2: End-to-end speech recognition in english and mandarin,”
Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Jingdong Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, Erich Elsen, Jesse Engel, Linxi Fan, Christopher Fougner, Tony Han, Awni Y. Hannun, Billy Jun, Patrick LeGresley, Libby Lin, Sharan Narang, Andrew Y. Ng, Sherjil Ozair, Ryan Prenger, Jonathan Raiman, Sanjeev Satheesh, David Seetapun, Shubho Sengupta, Yi Wang, Zhiqian Wang, Chong Wang, Bo Xiao, Dani Yogatama, Jun Zhan, and Zhenyao Zhu, arXiv preprint arXiv:1512.02595, 2015.
20. “On the importance of initialization and momentum in deep learning,”
Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton, in Proceedings of the 30th ICML. May 2013, vol. 28, pp. 1139–1147, JMLR Workshop and Conference Proceedings.
21. “Recurrent batch normalization,”
Tim Cooijmans, Nicolas Ballas, César Laurent, and Aaron C. Courville, arXiv preprint arXiv:1603.09025, 2016.
22. “Rnndrop: A novel dropout for rnns in asr,”
Taesup Moon, Heeyoul Choi, Hoshik Lee, and Inchul Song, in 2015 IEEE Workshop on ASRU. IEEE, 2015, pp. 65–70.
23. “An application of recurrent nets to phone probability estimation,”
A. J. Robinson, IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 298–305, Mar 1994.
24. “Training very deep networks,”
Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber, arXiv preprint arXiv:1507.06228, 2015.