Acoustic data-driven lexicon learning based on a greedy pronunciation selection framework
Speech recognition systems for irregularly-spelled languages like English normally require hand-written pronunciations. In this paper, we describe a system for automatically obtaining pronunciations of words for which pronunciations are not available, but for which transcribed data exists. Our method integrates information from the letter sequence and from the acoustic evidence. The novel aspect of our work is how we prune entries from such a lexicon (since, empirically, lexicons with too many entries do not tend to be good for ASR performance). Experiments on various ASR tasks show that, with the proposed framework, starting with an initial lexicon of several thousand words, we are able to learn a lexicon which performs close to a full expert lexicon in terms of WER on test data, and is better than lexicons built using G2P alone or with a pruning criterion based on pronunciation probability.
Xiaohui Zhang, Vimal Manohar, Daniel Povey, Sanjeev Khudanpur
Acknowledgements: This work was partially supported by DARPA LORELEI Grant No HR0011-15-2-0024, NSF Grant No CRI-1513128 and IARPA Contract No 2012-12050800010. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA, IARPA, DoD/ARL or the U.S. Government. We thank Dr. Jack Godfrey, Dr. Alan McCree, Dr. Mengyang Gu and Chloe Haviland for useful discussions.
Center for Language and Speech Processing
Human Language Technology Center of Excellence
The Johns Hopkins University, Baltimore, MD 21218, USA
Index Terms: speech recognition, pronunciation lexicon learning
1 Introduction
In the past few years, there has been a growing interest in acoustic data-driven lexicon learning for continuous speech recognition, i.e. automatically obtaining pronunciations of words for which pronunciations are not available, but for which transcribed acoustic data exists. In order to develop ASR systems under limited lexicon resources, one solution is to adopt a graphemic lexicon [1, 2] or acoustic unit discovery methods [3, 4], which entirely eliminate the expert effort of developing a phonetic pronunciation lexicon. In real applications, however, a more common scenario is that we already have a phonetic inventory and a small expert lexicon for a specific language. Our work focuses on this case, i.e. given a small expert lexicon, we want to derive pronunciations for Out-of-Vocabulary (OOV) words, for which we know the text form and have acoustic examples.
Given a small expert lexicon, the most straightforward way to generate pronunciation candidates for OOV words is to train a Grapheme-to-Phoneme (G2P) [5] model using the seed lexicon and apply it to these OOV words [6, 7, 8]. But for languages like English, and for proper names and abbreviations, G2P does not always give high-quality pronunciations. Pronunciations from phonetic decoding can help to fill this gap. Previous work has combined these with G2P-generated pronunciations [9, 10], or added them to the G2P training examples [7, 11, 12]. In the work we describe here, we use candidates from both G2P and phonetic decoding.
The aspect of the problem that we focus on is candidate pruning. That is, given a set of pronunciation candidates from G2P and phonetic decoding (and maybe some from a manually created lexicon), which subset should we keep? Keeping all the pronunciations is impractical because it would make decoding slow, and also because too many pronunciations tend to hurt ASR performance, even when pronunciation probabilities are used [13].
Previous work on candidate pruning has relied on estimated pronunciation probabilities to determine which candidates should be cut [11, 6, 8, 7, 12]. The main drawback of this approach is that, for words with multiple pronunciations, it tends to give us too many minor pronunciation variants (e.g. reflecting co-articulation effects), which is undesirable for ASR. If we rely on pronunciation probabilities alone, it is hard to discard those types of variants while keeping variants that come from different meanings of the word.
The core idea of this paper is a likelihood-based criterion for pronunciation-candidate pruning that naturally keeps candidates that are “far apart”.
This paper is organized as follows. We discuss how we generate pronunciation candidates in Section 2; we explain how we collect acoustic evidence from training data in Section 3. We explain our likelihood-based pronunciation selection strategy in Section 4. Experimental results on various ASR tasks are provided in Section 5, and we conclude in Section 6.
2 Collecting pronunciation candidates from multiple sources
In our framework, as in previous work, we first extend the seed lexicon to include OOV words in the training data, using a G2P model trained on the seed lexicon, and then train an acoustic model (AM) using the G2P-extended lexicon. We then generate alignments for all training data and use them to train a bigram phone language model (LM). Using this phone LM and the AM, we construct a phonetic decoder and use it to generate phonetic transcriptions of the training data. For each individual word token in the transcript, we can align it with a phone sequence using timing information from the alignments and phonetic transcriptions. Then, for each word $w$, we compute the relative frequency of each phone sequence aligned to it, by normalizing each phone sequence’s count by the count of the most frequent phone sequence. We filter out phone sequences whose relative frequency is too low (e.g. smaller than 0.1) and keep the remaining ones as the alternative pronunciations generated from phonetic decoding. We then combine these alternative pronunciation candidates with the G2P-extended lexicon into a large lexicon (called the combined lexicon).
For each word $w$ from the combined lexicon, let $\mathcal{B}_w$ denote the set of pronunciation candidates collected from multiple sources, and let $b \in \mathcal{B}_w$ denote one pronunciation (baseform) candidate. The source of $b$ (denoted as $s(b)$) is either G2P or phonetic decoding (or, where one is available, a manually created lexicon). In the next section we specify how we collect acoustic evidence for all pronunciation candidates in $\mathcal{B}_w$.
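The relative-frequency filtering of phonetic-decoding candidates described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `filter_pd_candidates`, the input format (one phone-sequence string per aligned word token) and the toy counts are assumptions made for this example; only the max-normalization and the example threshold of 0.1 come from the text.

```python
from collections import Counter


def filter_pd_candidates(aligned_prons, min_rel_freq=0.1):
    """Keep phone sequences whose max-normalized frequency is high enough.

    aligned_prons: list of phone-sequence strings, one per token of a single
                   word, as aligned from the phonetic transcription
                   (hypothetical input format).
    min_rel_freq:  threshold on the relative frequency (0.1 in the text's example).
    """
    counts = Counter(aligned_prons)
    if not counts:
        return {}
    # Normalize by the count of the most frequent phone sequence.
    max_count = max(counts.values())
    return {
        pron: count / max_count
        for pron, count in counts.items()
        if count / max_count >= min_rel_freq
    }


# Toy data (made up): 50 tokens of one word with three decoded variants.
tokens = ["M AH SH IY N"] * 40 + ["M IH SH IY N"] * 8 + ["M AH SH IH N"] * 2
kept = filter_pd_candidates(tokens)
```

In this toy case the third variant has relative frequency 2/40 and is dropped, while the other two are kept as candidates to be merged with the G2P-extended lexicon.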
3 Acoustic evidence collection
First we introduce some notation. Let $\mathcal{O} = \{O_u\}$ denote the acoustic sequences (one per utterance $u$), and let $N_w$ denote the number of utterances in $\mathcal{O}$ which contain the word $w$. (Footnote 1: We assume that each word appears in each utterance’s transcript at most once. In practice, if a word appears multiple times in an utterance, we divide the utterance into sub-utterances, each containing only one token of the word.) We further define $\theta_{wb} = p(b \mid w)$ as the pronunciation probability of a pronunciation $b$ for a word $w$ (with $\sum_{b \in \mathcal{B}_w} \theta_{wb} = 1$, where $\mathcal{B}_w$ is the candidate set of $w$), and $\theta_w = \{\theta_{wb}\}_{b \in \mathcal{B}_w}$ as the pronunciation model for word $w$. We define $\tau_{uwb} = p(O_u \mid w, b)$ as the conditional data likelihood of utterance $u$ given that the pronunciation of $w$ is $b$, which is determined by the acoustic model. This is the “acoustic evidence” we want to derive from lattice statistics, and which is needed by our pronunciation selection algorithm.
With the combined lexicon and an existing AM (the one we used for phonetic decoding in the candidate collection phase), we generate lattices for each training utterance. This lattice generation treats distinct pronunciations of words as distinct symbols for the purposes of lattice determinization, unlike our standard procedure described in [14]. This is achieved by putting both phone symbols and word symbols as the input sequence on the FST prior to lattice determinization. From the lattices, we can obtain per-utterance lattice pronunciation-posterior statistics $\gamma_{uwb}$.
When generating the lattices, we assign uniform priors over all pronunciation candidates of each word in the combined lexicon. By Bayes’ rule, we can then directly use the posterior statistics $\gamma_{uwb}$ as the likelihoods $\tau_{uwb} = p(O_u \mid w, b)$. (Footnote 2: Strictly speaking, Bayes’ rule only gives us $\tau_{uwb} \propto \gamma_{uwb}$, i.e. $\gamma_{uwb}$ can only be treated as $\tau_{uwb}$ up to a constant, but the constant doesn’t affect the objective (1) we want to optimize.) Because lattices are pruned, a posterior $\gamma_{uwb}$ could be zero even if $w$ actually appears in an utterance $u$. So we always floor $\tau_{uwb}$ to a small positive scalar $\delta$ (in practice it’s set between and ), so that we have $\tau_{uwb} \ge \delta$.
Based on $\gamma_{uwb}$, we can obtain another useful statistic, the average pronunciation posterior $\bar{\gamma}_{wb} = \frac{1}{N_w} \sum_{u} \gamma_{uwb}$, where $N_w$ is the number of utterances containing $w$ and the summation is only taken over those utterances where the word $w$ actually appears.
After the lattices are generated, for each word, we prune away the pronunciations whose average posterior is too low (e.g. only keeping the top 10), construct a new combined lexicon, and then re-generate the lattices and re-collect acoustic evidence in the same way. We found that this pruning is always helpful, as it improves the accuracy of the posteriors.
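As a rough illustration of the evidence-collection step, the sketch below floors per-utterance lattice posteriors, computes the average posterior per candidate, and keeps the top-k candidates. The function names, the dictionary-based input format and the default floor `delta=1e-6` are assumptions for this example (the floor range used in practice is not recoverable from the text); the average is taken here over the unfloored posteriors.

```python
def collect_evidence(lattice_post, candidates, delta=1e-6):
    """Return (tau, avg_post) for one word.

    lattice_post: {utt_id: {pron: lattice posterior}}, restricted to the
                  utterances that contain the word (hypothetical format).
    candidates:   all pronunciation candidates of the word.
    delta:        floor applied to the posteriors (hypothetical default).
    """
    if not lattice_post:
        raise ValueError("the word must appear in at least one utterance")
    # Floored posteriors, used as likelihoods; pruned-away entries become delta.
    tau = {
        utt: {b: max(post.get(b, 0.0), delta) for b in candidates}
        for utt, post in lattice_post.items()
    }
    # Average posterior over the utterances in which the word appears.
    n_utts = len(lattice_post)
    avg_post = {
        b: sum(post.get(b, 0.0) for post in lattice_post.values()) / n_utts
        for b in candidates
    }
    return tau, avg_post


def keep_top_k(avg_post, k=10):
    """Keep the k candidates with the highest average posterior."""
    return sorted(avg_post, key=avg_post.get, reverse=True)[:k]


# Toy posteriors (made up) for the word 'us' in three utterances.
posts = {
    "utt1": {"AH S": 1.0},
    "utt2": {"AH S": 0.7, "Y UW EH S": 0.3},
    "utt3": {"Y UW EH S": 1.0},
}
tau, avg_post = collect_evidence(posts, ["AH S", "Y UW EH S", "AH Z"])
survivors = keep_top_k(avg_post, k=2)
```

After such a pruning pass, the lattices would be regenerated with the reduced lexicon and the statistics collected again, as described above.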
4 Data-likelihood-reduction based greedy pronunciation selection
We formulate the pronunciation selection process as a greedy model selection procedure, with data-likelihood reduction as the selection criterion. In this section, we first specify how to compute the optimal data likelihood given a set of pronunciation candidates using EM and propose a pronunciation selection criterion based on likelihood reduction, and then use an illustrative example to compare the proposed selection criterion against other criteria. Finally, we discuss some practical issues in our algorithm and summarize the whole iterative framework of pronunciation selection.
4.1 A pronunciation selection criterion based on per-utterance likelihood reduction
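Given a set of pronunciation candidates $\mathcal{B}_w$ for a specific word $w$, and the conditional likelihood (acoustic evidence) $\tau_{uwb}$ for each utterance $u$, we want to maximize the total data log-likelihood over the pronunciation model $\theta_w = \{\theta_{wb}\}$ (Footnote 3: When we optimize the pronunciation probabilities for a specific word, we consider the pronunciation probabilities for other words as fixed.):
$$\mathcal{L}(\theta_w) = \sum_{u} \log \sum_{b \in \mathcal{B}_w} \theta_{wb}\, \tau_{uwb} \tag{1}$$
where the summation over $u$ is only taken over utterances where the word $w$ actually appears. Since maximizing this objective doesn’t have a closed-form solution, as in previous work, we use EM, which maximizes the following auxiliary function instead ($t$ stands for the iteration index, and $q^{(t)}_{uwb}$ is the pronunciation posterior computed at the $t$-th iteration):
$$Q(\theta_w; \theta_w^{(t)}) = \sum_{u} \sum_{b \in \mathcal{B}_w} q^{(t)}_{uwb} \log\left(\theta_{wb}\, \tau_{uwb}\right) \tag{2}$$
Maximizing the above function with the constraint $\sum_{b \in \mathcal{B}_w} \theta_{wb} = 1$ gives the M-step:
$$\theta^{(t+1)}_{wb} = \frac{1}{N_w} \sum_{u} q^{(t)}_{uwb} \tag{3}$$
According to Bayes’ rule, we compute the updated posteriors as follows:
$$q^{(t+1)}_{uwb} = \frac{\theta^{(t+1)}_{wb}\, \tau_{uwb}}{\sum_{b' \in \mathcal{B}_w} \theta^{(t+1)}_{wb'}\, \tau_{uwb'}} \tag{4}$$
which is the E-step. By running (3) and (4) iteratively until convergence, we can find an optimal pronunciation model $\theta_w^{*}$, and evaluate the optimal log-likelihood (1) (denoted as $\mathcal{L}^{*}_{w}$ for simplicity). In order to evaluate the importance of a specific pronunciation, say, $b$, we remove $b$ from the pronunciation candidate set $\mathcal{B}_w$, re-initialize the pronunciation model on top of $\mathcal{B}_w \setminus \{b\}$, and run EM to optimize (1) with the reduced model. Writing the log-likelihood at convergence after removing $b$ as $\mathcal{L}^{*}_{w \setminus b}$, we can compute the per-utterance likelihood reduction associated with the pronunciation $b$ as:
$$\Delta\mathcal{L}_{wb} = \frac{\mathcal{L}^{*}_{w} - \mathcal{L}^{*}_{w \setminus b}}{N_w} \tag{5}$$
This metric reflects the contribution of each pronunciation to the total data likelihood. With this metric, we can iteratively remove the least important pronunciations in a greedy fashion, which is efficient. The complete iterative framework is given in Section 4.4.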
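The EM updates and the per-utterance likelihood reduction can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than the authors' code: the function names, the uniform initialization, the convergence tolerance and the list-of-dicts input format are choices made for this example.

```python
import math


def em_pron_probs(tau, prons, n_iter=500, tol=1e-10):
    """Fit pronunciation probabilities for one word with EM.

    tau:   list with one dict per utterance, mapping pron -> floored
           likelihood (hypothetical format; all values must be > 0).
    prons: candidate pronunciations to include in the model.
    Returns (theta, log-likelihood at the returned theta).
    """
    if not prons:
        raise ValueError("need at least one pronunciation")
    theta = {b: 1.0 / len(prons) for b in prons}  # uniform initialization
    prev = -math.inf
    for _ in range(n_iter):
        counts = {b: 0.0 for b in prons}
        loglike = 0.0
        for t in tau:
            denom = sum(theta[b] * t[b] for b in prons)
            loglike += math.log(denom)
            for b in prons:
                counts[b] += theta[b] * t[b] / denom  # E-step posterior
        theta = {b: counts[b] / len(tau) for b in prons}  # M-step
        if loglike - prev < tol:
            break
        prev = loglike
    final = sum(math.log(sum(theta[b] * t[b] for b in prons)) for t in tau)
    return theta, final


def likelihood_reduction(tau, prons, b):
    """Per-utterance log-likelihood reduction caused by removing pron b."""
    rest = [p for p in prons if p != b]
    _, full = em_pron_probs(tau, prons)
    _, reduced = em_pron_probs(tau, rest)
    return (full - reduced) / len(tau)


# Extreme toy case: 'a' has likelihood 1 everywhere, 'b' sits at the floor.
delta = 1e-6
toy = [{"a": 1.0, "b": delta}] * 5
theta, _ = em_pron_probs(toy, ["a", "b"])
drop_a = likelihood_reduction(toy, ["a", "b"], "a")
drop_b = likelihood_reduction(toy, ["a", "b"], "b")
```

In this toy case removing the dominating candidate costs about `-log(delta)` per utterance, while removing the floored candidate costs essentially nothing.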
4.2 An illustrative example
Here we give an example to illustrate the advantage of pronunciation selection based on the per-utterance log-likelihood reduction $\Delta\mathcal{L}$ over selection based on the learned pronunciation probabilities $\theta$, in terms of dealing with the confusability of pronunciation variants.
In Table 1, we list the pronunciation candidates, average pronunciation posteriors, learned pronunciation probabilities, and the per-utterance log-likelihood reduction of two English words, ‘machine’ and ‘us’, taken from the TED-LIUM [15] training corpus. Note that the two pronunciations of ‘machine’ only differ in one vowel, while the two pronunciations of ‘us’ represent two distinct meanings.
We want a selection criterion under which it’s possible to set a threshold that rules out the reduced variant ‘M IH SH IY N’ (generated from phonetic decoding) in the ‘machine’ case, while keeping the acronym ‘Y UW EH S’ in the ‘us’ case. The learned pronunciation probability is lower for ‘Y UW EH S’ than for ‘M IH SH IY N’, and therefore cannot serve as the criterion we need. However, the per-utterance log-likelihood reduction of ‘Y UW EH S’ is much larger than that of ‘M IH SH IY N’ (0.034 vs. 0.004). Thus it’s possible to set a proper threshold on $\Delta\mathcal{L}$ to keep ‘Y UW EH S’ and remove ‘M IH SH IY N’.
The underlying reason is that the confusability between pronunciations is reflected in the sharpness of the per-utterance pronunciation posteriors. In the ‘us’ case, the two pronunciation variants cannot easily model each other, and therefore the posteriors are very sharp for most examples. Removing the minor pronunciation ‘Y UW EH S’ would therefore result in a greater reduction in the data likelihood. Thus, beyond reflecting the relative frequency, the proposed criterion is capable of modeling the confusability between pronunciation candidates, which is preferable from the Maximum Likelihood point of view and therefore could help us to select an informative set of pronunciations.
|‘machine’||‘us’|
|[‘M AH SH IY N’, ‘M IH SH IY N’]||[‘AH S’, ‘Y UW EH S’]|
|[0.987, 0.013]||[0.992, 0.008]|
|[3.575, 0.004]||[15.576, 0.034]|
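The effect of posterior sharpness can be reproduced on synthetic data. The sketch below is a toy illustration, not a reproduction of Table 1: the two data sets, the floor `DELTA = 1e-6` and the function names are assumptions for this example. Both sets give the minor pronunciation an average posterior of about 0.1, but only the sharp set yields a large per-utterance likelihood reduction when the minor pronunciation is removed.

```python
import math

DELTA = 1e-6  # hypothetical floor on the posteriors


def max_loglike(tau, prons, n_iter=500, tol=1e-10):
    """Run EM on the pronunciation probabilities and return the final log-likelihood."""
    theta = {b: 1.0 / len(prons) for b in prons}
    prev = -math.inf
    loglike = prev
    for _ in range(n_iter):
        counts = dict.fromkeys(prons, 0.0)
        loglike = 0.0
        for t in tau:
            denom = sum(theta[b] * t[b] for b in prons)
            loglike += math.log(denom)
            for b in prons:
                counts[b] += theta[b] * t[b] / denom
        theta = {b: counts[b] / len(tau) for b in prons}
        if loglike - prev < tol:
            break
        prev = loglike
    return loglike


def reduction(tau, removed):
    """Per-utterance log-likelihood reduction from removing one pronunciation."""
    prons = list(tau[0])
    rest = [b for b in prons if b != removed]
    return (max_loglike(tau, prons) - max_loglike(tau, rest)) / len(tau)


# Confusable variants: every utterance slightly prefers 'main' (soft posteriors).
confusable = [{"main": 0.9, "minor": 0.1}] * 100
# Distinct variants: 10 of 100 utterances clearly use 'minor' (sharp posteriors).
sharp = [{"main": 1.0, "minor": DELTA}] * 90 + [{"main": DELTA, "minor": 1.0}] * 10

dl_confusable = reduction(confusable, "minor")
dl_sharp = reduction(sharp, "minor")
```

Here `dl_confusable` is essentially zero because ‘main’ alone explains the data equally well, while `dl_sharp` is roughly 1, so a single threshold on the likelihood reduction separates the two cases even though the average posteriors match.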
4.3 Refining the pronunciation selection criterion
One difficulty of directly using $\Delta\mathcal{L}_{wb}$ in an iterative pronunciation selection framework is that we need an interpretable threshold in order to decide when to stop removing pronunciations. However, we notice that the upper bound of $\Delta\mathcal{L}_{wb}$ is achieved in an extreme case, where we remove an absolutely dominating pronunciation $b^{*}$ (meaning that the observed conditional likelihoods satisfy $\tau_{uwb^{*}} = 1$ and $\tau_{uwb} = \delta$ for all $b \ne b^{*}$ and all $u$, where $\delta$ is the floor applied to the likelihoods). Before removing $b^{*}$, it’s obvious from (1) that the maximum $\mathcal{L}^{*}_{w} = 0$ can be reached with $\theta_w$ being a one-hot vector s.t. $\theta_{wb^{*}} = 1$. After removing $b^{*}$, with the constraint $\sum_{b \ne b^{*}} \theta_{wb} = 1$, the log-likelihood is a constant: $\mathcal{L}^{*}_{w \setminus b^{*}} = N_w \log \delta$. Then we have: $\Delta\mathcal{L}_{wb^{*}} = -\log \delta$. According to this, we scale this upper bound by a scalar $\alpha$ between $0$ and $1$ to get an interpretable threshold: $-\alpha \log \delta$, where $\alpha = 1$ corresponds to the above extreme case, which means, for a pronunciation to be not removed, it would have to be present with probability 1 in 100 instances of the word, and $\alpha = 0$ means we will never remove any pronunciation candidates. In practice, it’s set between and . We also make $\alpha$ dependent on the source of the pronunciation, which enables us to use a more conservative threshold for selecting pronunciations from a source where the candidates’ quality is lower in general, like phonetic decoding (pd), e.g. by setting $\alpha_{\mathrm{pd}} > \alpha_{\mathrm{g2p}}$. So, we define the “score” of a pronunciation candidate as how far its $\Delta\mathcal{L}_{wb}$ lies above the corresponding threshold, so that a candidate whose likelihood reduction exceeds its threshold has a positive score.
In our framework we iteratively prune the pronunciation with the lowest score and terminate pruning when all pronunciations have positive scores. Note that, when computing the score, the count $N_w$ is smoothed with a source-dependent scalar $\beta$ (5-15 in practice). The purpose is to keep the score from being too high when $N_w$ is small, so that in general we select fewer pronunciations if we only have a few acoustic examples of a word.
4.4 Summary: an iterative framework
The proposed pronunciation selection algorithm, which iteratively prunes pronunciations from the initial candidate set $\mathcal{B}_w$, is summarized as Algorithm 1 ($\mathcal{B}_w^{(i)}$ stands for the selected subset of pronunciation candidates at iteration $i$).
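The greedy loop described in this section can be sketched as below. This is not the paper's Algorithm 1: in particular, the exact score is an assumption made for this example, taken here as the log-likelihood drop divided by the smoothed count `N + beta` minus the threshold `alpha * (-log(delta))`, with `alpha` and `beta` looked up by candidate source. The function names and all numeric settings (`alpha`, `beta`, `delta`, the toy likelihoods) are hypothetical as well.

```python
import math


def em_loglike(tau, prons, n_iter=500, tol=1e-10):
    """Maximize the data log-likelihood over pronunciation probabilities with EM."""
    theta = {b: 1.0 / len(prons) for b in prons}
    prev = -math.inf
    loglike = prev
    for _ in range(n_iter):
        counts = dict.fromkeys(prons, 0.0)
        loglike = 0.0
        for t in tau:
            denom = sum(theta[b] * t[b] for b in prons)
            loglike += math.log(denom)
            for b in prons:
                counts[b] += theta[b] * t[b] / denom
        theta = {b: counts[b] / len(tau) for b in prons}
        if loglike - prev < tol:
            break
        prev = loglike
    return loglike


def greedy_select(tau, source, alpha, beta, delta):
    """Greedily prune the lowest-scoring candidate until all scores are positive.

    tau:    list of per-utterance dicts, pron -> floored likelihood.
    source: dict pron -> source label (e.g. "g2p" or "pd").
    alpha:  dict source -> threshold scale; beta: dict source -> count smoothing.
    """
    selected = list(source)
    n = len(tau)
    while len(selected) > 1:  # always keep at least one pronunciation
        full = em_loglike(tau, selected)
        scores = {}
        for b in selected:
            rest = [p for p in selected if p != b]
            # Assumed score: smoothed per-utterance reduction minus the threshold.
            drop = (full - em_loglike(tau, rest)) / (n + beta[source[b]])
            scores[b] = drop - alpha[source[b]] * (-math.log(delta))
        worst = min(scores, key=scores.get)
        if scores[worst] > 0:
            break
        selected.remove(worst)
    return selected


delta = 1e-6
# Toy word: 'b' is a near-duplicate of 'a'; 'c' is a distinct variant used in 20 utterances.
tau = [{"a": 0.8, "b": 0.2, "c": delta}] * 80 + [{"a": delta, "b": delta, "c": 1.0}] * 20
source = {"a": "g2p", "b": "pd", "c": "pd"}
selected = greedy_select(tau, source,
                         alpha={"g2p": 0.02, "pd": 0.05},
                         beta={"g2p": 5, "pd": 10},
                         delta=delta)
```

With these toy settings the near-duplicate ‘b’ is pruned in the first iteration, after which both remaining candidates have positive scores and pruning stops.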
5 Experiments
In order to evaluate the performance of the proposed lexicon learning framework, a small seed lexicon is built by randomly sampling a small portion () of words from the vocabulary of the expert lexicon of each task. With the seed lexicon, we train a G2P model using Sequitur [5] and apply it to all OOV (w.r.t. the seed lexicon) words in the vocabulary of the expert lexicon, to get the “G2P-extended” lexicons. (Footnote 4: In this paper we focus on lexicon learning for alphabetic languages, so a G2P model trained with a small seed lexicon is able to generate pronunciations for most words in the expert lexicon.) A baseline system called G2P-ext is built using a G2P-extended lexicon with the optimal number of variants per word tuned on dev data, and another baseline system called G2P-1best is built using a G2P-extended lexicon where we only take the top G2P pronunciation for each word. With this G2P model and the acoustic training data for each task, we can build a learned lexicon using the proposed framework, and then train an ASR system called “Lex-learn”. In addition, we train an ASR system using the full expert lexicon as the “Oracle” system. Note that the training recipes of the four ASR systems (G2P-ext, G2P-1best, Oracle, and Lex-learn) for each task only differ in the lexicons (with the same vocabulary). All experiments were done with Kaldi [16].
|System||Oracle||G2P-ext||G2P-1best||Lex-learn|
|SAT||11.32 %||13.11 %||14.57 %||11.53 %|
|LF-MMI||6.44 %||6.76 %||7.15 %||6.64 %|
We conduct experiments on the Librispeech-460 task [17]. For each lexicon condition, we use the 460h training data subset to build speaker-adaptive trained GMM (SAT) models (the same AM training recipe as the “SAT 460” from [17]), on top of which we then train sub-sampled time-delay neural networks (TDNNs) [18] with the lattice-free MMI (LF-MMI) [19] criterion. The WERs are shown in Table 2. It can be seen that the learned lexicon performs better than the G2P-extended lexicons, and is close to the oracle lexicon. The LF-MMI systems are also much more robust to the lexicon quality than the SAT systems, i.e. the G2P-extended and learned lexicons perform closer to the expert lexicon. The learned lexicon closes (SAT)/ (LF-MMI) of the WER gap between the G2P-ext system and the oracle system. Also, looking at the average number of pronunciations per word, the learned lexicon () is much more compact than the G2P-extended lexicon (), and is very close to the G2P-1best lexicon (), though it performs much better than the G2P-1best lexicon: (SAT) / (LF-MMI) relatively in WER.
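The two summary quantities discussed above (the fraction of the WER gap closed, and the relative improvement over G2P-1best) can be recomputed from the Table 2 rows as in the sketch below. The assignment of the four WER columns to systems (Oracle, G2P-ext, G2P-1best, Lex-learn) is an assumption inferred from the surrounding discussion, and the function names are made up for this example.

```python
# WERs (%) from Table 2; the column-to-system mapping is an assumption.
wer = {
    "SAT": {"Oracle": 11.32, "G2P-ext": 13.11, "G2P-1best": 14.57, "Lex-learn": 11.53},
    "LF-MMI": {"Oracle": 6.44, "G2P-ext": 6.76, "G2P-1best": 7.15, "Lex-learn": 6.64},
}


def gap_closed(row):
    """Fraction of the G2P-ext vs. Oracle WER gap recovered by Lex-learn."""
    return (row["G2P-ext"] - row["Lex-learn"]) / (row["G2P-ext"] - row["Oracle"])


def rel_gain_over_1best(row):
    """Relative WER reduction of Lex-learn with respect to G2P-1best."""
    return (row["G2P-1best"] - row["Lex-learn"]) / row["G2P-1best"]


summary = {
    name: (gap_closed(row), rel_gain_over_1best(row)) for name, row in wer.items()
}
```

Under that column assumption, the gap closed comes out at about 88% for SAT and 37.5% for LF-MMI, and the relative gain over G2P-1best at about 21% and 7% respectively.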
In Table 3, we compare the proposed framework with more baseline lexicon expansion approaches, on the Librispeech-460 task (WER of SAT systems), with a smaller seed lexicon containing only randomly sampled words from the same expert lexicon, in order to make the performance gap between different systems more noticeable. “G2P-ext”, as described before, is a baseline built with a G2P-extended lexicon (with a tuned size). “-based selection on G2P candidates” means, we first align acoustic training data with a large G2P-extended lexicon containing all G2P generated candidates (up to candidates per word), and then use max-normalized pronunciation probabilities  to prune those candidates for each OOV word, with a tuned threshold (). The pronunciation candidate pool here is the same as the G2P-ext system (i.e. G2P candidates only). “-based selection on G2P+PD candidates” uses the same lexicon expansion approach as the former one but we also add candidates from phonetic decoding (PD) before selection. Therefore this baseline has the same candidate pool as the proposed framework. The last system “likelihood-reduction-based selection on G2P+PD candidates” is the proposed framework (i.e. the “Lex-learn” systems listed before). For fair comparison, under different lexicon conditions, the acoustic models were re-trained on top of the same acoustic model (the one used in the shown G2P-ext system). It can be seen that adding PD candidates to the candidate pool is crucial to the lexicon quality ( WER improvement), and the proposed pronunciation selection method solely brings WER gain and lowers the number of pronunciations per word from to .
|Lexicon condition (avg. #pronunciations per word)||WER|
|G2P-ext (6.57)||13.72 %|
6 Conclusion and future work
In this paper, we propose an acoustic-data driven lexicon learning framework using a likelihood-reduction based criterion for selecting pronunciation candidates from multiple sources, i.e. G2P and phonetic decoding. With the proposed criterion, the pronunciation candidates are pruned iteratively in a greedy way, based on the acoustic data likelihood reduction caused by removing each candidate. This approach enables us to construct a compact yet informative lexicon. Experiments on different ASR tasks show that, with the proposed framework, starting with a small expert lexicon (containing to words), we are able to learn a lexicon which performs closer to a full expert lexicon in terms of WER performance on test data, than lexicons built using G2P alone or with a pruning criterion based on pronunciation probabilities. As future work, we’d like to investigate how the amount of training data affects the lexicon learning performance.
References
[1] M. J. Gales, K. M. Knill, and A. Ragni, “Unicode-based graphemic systems for limited resource languages,” in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 5186–5190.
[2] D. F. Harwath and J. R. Glass, “Speech recognition without a lexicon-bridging the gap between graphemic and phonetic systems,” in INTERSPEECH, 2014, pp. 2655–2659.
[3] C.-y. Lee, T. J. O’Donnell, and J. Glass, “Unsupervised lexicon discovery from acoustic input,” Transactions of the Association for Computational Linguistics, vol. 3, pp. 389–403, 2015.
[4] C.-y. Lee, Y. Zhang, and J. R. Glass, “Joint learning of phonetic units and word pronunciations for ASR,” in EMNLP, 2013, pp. 182–192.
[5] M. Bisani and H. Ney, “Joint-sequence models for grapheme-to-phoneme conversion,” Speech Communication, vol. 50, no. 5, pp. 434–451, 2008.
[6] L. Lu, A. Ghoshal, and S. Renals, “Acoustic data-driven pronunciation lexicon for large vocabulary speech recognition,” in Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on. IEEE, 2013, pp. 374–379.
[7] N. Goel, S. Thomas, M. Agarwal, P. Akyazi, L. Burget, K. Feng, A. Ghoshal, O. Glembek, M. Karafiát, D. Povey et al., “Approaches to automatic lexicon learning with limited training examples,” in Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on. IEEE, 2010, pp. 5094–5097.
[8] I. McGraw, I. Badr, and J. R. Glass, “Learning lexicons from speech using a pronunciation mixture model,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 2, pp. 357–366, 2013.
[9] R. Rasipuram et al., “Combining acoustic data driven G2P and letter-to-sound rules for under resource lexicon generation,” in Proceedings of INTERSPEECH, no. EPFL-CONF-192596, 2012.
[10] A. Laurent, S. Meignier, T. Merlin, P. Deléglise, and F. Spécinov-Trélazé, “Acoustics-based phonetic transcription method for proper nouns,” in INTERSPEECH, 2010, pp. 2286–2289.
[11] G. Chen, D. Povey, and S. Khudanpur, “Acoustic data-driven pronunciation lexicon generation for logographic languages,” in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 5350–5354.
[12] S. Tsujioka, S. Sakti, K. Yoshino, G. Neubig, and S. Nakamura, “Unsupervised joint estimation of grapheme-to-phoneme conversion systems and acoustic model adaptation for non-native speech recognition,” in INTERSPEECH, 2016, pp. 3091–3095.
[13] T. Hain, “Implicit pronunciation modelling in ASR,” in ISCA Tutorial and Research Workshop (ITRW) on Pronunciation Modeling and Lexicon Adaptation for Spoken Language Technology, 2002.
[14] D. Povey, M. Hannemann, G. Boulianne, L. Burget, A. Ghoshal, M. Janda, M. Karafiát, S. Kombrink, P. Motlíček, Y. Qian et al., “Generating exact lattices in the WFST framework,” in Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on. IEEE, 2012, pp. 4213–4216.
[15] A. Rousseau, P. Deléglise, and Y. Estève, “TED-LIUM: an automatic speech recognition dedicated corpus,” in LREC, 2012, pp. 125–129.
[16] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlíček, Y. Qian, P. Schwarz et al., “The Kaldi speech recognition toolkit,” in Proc. ASRU, 2011.
[17] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an ASR corpus based on public domain audio books,” in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 5206–5210.
[18] V. Peddinti, D. Povey, and S. Khudanpur, “A time delay neural network architecture for efficient modeling of long temporal contexts,” in INTERSPEECH, 2015, pp. 3214–3218.
[19] D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur, “Purely sequence-trained neural networks for ASR based on lattice-free MMI,” in INTERSPEECH, 2016.