Human language is a combination of elemental languages/domains/styles that change across and sometimes within discourses. Language models, which play a crucial role in speech recognizers and machine translation systems, are particularly sensitive to such changes, unless some form of adaptation takes place. One approach to speech language model adaptation is self-training, in which a language model’s parameters are tuned based on automatically transcribed audio. However, transcription errors can misguide self-training, particularly in challenging settings such as conversational speech. In this work, we propose a model that considers the confusions (errors) of the ASR channel. By modeling the likely confusions in the ASR output instead of using just the 1-best, we improve self-training efficacy by obtaining a more reliable reference transcription estimate. We demonstrate improved topic-based language modeling adaptation results over both 1-best and lattice self-training using our ASR channel confusion estimates on telephone conversations.
The Johns Hopkins University
Estimating Confusions in the ASR Channel for Improved Topic-based Language Model Adaptation
Damianos Karakos, Mark Dredze, Sanjeev Khudanpur
Technical Report 8
July 4, 2019
Acknowledgements This work is supported, in part, by the Human Language Technology Center of Excellence. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the sponsor.
810 Wyman Park Drive
Baltimore, Maryland 21211
Damianos Karakos Raytheon BBN Technologies Cambridge, MA email@example.com Mark Dredze and Sanjeev Khudanpur Human Language Technology Center of Excellence Center for Language and Speech Processing Johns Hopkins University Baltimore, MD, 21211 mdredze,firstname.lastname@example.org
Modern statistical automatic speech recognition (ASR) systems rely on language models for ranking hypotheses generated from acoustic data. Language models are trained on millions of words (or more) taken from text that matches the spoken language, domain and style of interest. Reliance on (static) training data makes language models brittle [Bellegarda, 2001] to changes in domain. However, for many problems of interest, there are numerous hours of spoken audio but little to no written text for language model training. In these settings, we must rely on language model adaptation using the spoken audio to improve performance on data of interest. One common approach to language model adaptation is self-training [Novotney et al., 2009], in which the language model is retrained on the output from the ASR system run on the new audio. Unfortunately, self-training learns from both correct and error-ridden transcriptions, and the latter mislead the system, a particular problem for high word error rate (WER) domains, such as conversational speech. Even efforts to consider the entire ASR lattice in self-training cannot account for the ASR error bias. Worse still, the problem is more severe for rare content words than for common function words; the fact that content words are more important for understandability exacerbates the problem.
Confusions in the ASR output pose problems for other applications, such as speech topic classification and spoken document retrieval. In high-WER scenarios, their performance degrades, sometimes considerably [Hazen and Richardson, 2008].
In this work, we consider the problem of topic adaptation of a speech recognizer [Seymore and Rosenfeld, 1997], in which we adapt the language model to the topic of the new speech. Our novelty lies in correcting for the biases present in the ASR output by estimating ASR confusions. Topic proportions are estimated via a probabilistic graphical model which accounts for confusions in the transcript and provides a more accurate portrayal of the spoken audio. To demonstrate the utility of our model, we show improved results over traditional self-training as well as lattice-based self-training for the challenging setting of conversational speech transcription. In particular, we show statistically significant improvements for content words.
Note that Bacchiani et al. [2004] also consider the problem of language model adaptation as an error correction problem, but with supervised methods. They train an error-correcting perceptron model on reference transcriptions from the new domain. In contrast, our approach does not assume the existence of transcribed data for training a confusion model; rather, the model is trained in an unsupervised manner based only on the ASR output.
The paper proceeds as follows: Section 2 describes our setting of language model adaptation and our topic based language model. Section 3 presents a language model adaptation process based on maximum-likelihood and maximum-aposteriori self-training, while Section 4 introduces adaptation that utilizes ASR channel estimates. Section 5 describes experiments on conversational speech.
2 Language Model Adaptation
We are given a trained speech recognizer, a topic-based language model and a large collection of audio utterances (many conversations) for a new domain, i.e. a change of topics, but no manually transcribed text of the kind needed for language model training. Our goal is to adapt the language model by learning new topic distributions using the available audio. We consider self-training that adapts the topic distributions based on automatically transcribed audio.
A conversation is composed of speech utterances, which are represented by lattices (confusion networks), annotated with words and posterior probabilities, produced by the speech recognizer. (We assume conversations, but our methods can be applied to non-conversational genres.) Each confusion network consists of a sequence of bins, where each bin is a set of words hypothesized by the recognizer at a particular time. The $i$-th bin is denoted by $b_i$ and contains words $w_{i,j} \in V$, where $V$ is the vocabulary. (Under specific contexts, not all vocabulary items are likely to be truly spoken in a bin. Although we use summations over all $w \in V$, we practically use only words which are acoustically confusable and consistent with the lexical context of the bin.) When obvious from context, we omit the subscript $i$. The most likely word in bin $b_i$ is denoted by $\hat{w}_i$. Let $M$ be the total number of bins in the confusion networks in a single conversation.
We use unigram topic-based language models (multinomial distributions over $V$), which capture word frequencies under each topic. Such models have been used in a variety of ways, such as in PLSA [Hofmann, 2001] and LDA [Blei et al., 2003], and under different training scenarios. Topic-based models provide a global context beyond the local word history in modeling a word's probability, and have been found especially useful in language model adaptation [Tam and Schultz, 2005, Wang and Stolcke, 2007, Hsu and Glass, 2006]. Each topic $t$ has a multinomial distribution denoted by $p(w \mid t)$. These topic-word distributions are learned from conversations labeled with topics, such as those in the Fisher speech corpus. (Alternative approaches estimate topics from unlabeled data, but we use labeled data for evaluation purposes.)
Adapting the topic-based language model means learning a set of conversation-specific mixture weights $\lambda^{c} = (\lambda^{c}_1, \dots, \lambda^{c}_T)$, where $T$ is the number of topics and $\lambda^{c}_t$ indicates the likelihood of seeing topic $t$ in conversation $c$. (Unless required, we drop the superscript $c$.) While topic compositions remain fixed, the topics selected change with each conversation. (This assumption can be relaxed to learn $p(w \mid t)$ as well.) These mixture weights form the true distribution of a word:
$$p_\lambda(w) = \sum_{t=1}^{T} \lambda_t \, p(w \mid t). \qquad (1)$$
3 Learning from ASR Output
We begin by following previous approaches to self-training, in which model parameters are re-estimated based on ASR output. We consider self-training based on 1-best and lattice maximum-likelihood estimation (MLE) as well as maximum-aposteriori (MAP) training. In the next section, we modify these approaches to incorporate our confusion estimates.
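For estimating the topic mixtures $\lambda$ using self-training on the 1-best ASR output, i.e. the 1-best path $\hat{w}_1, \dots, \hat{w}_M$ in the confusion network, we write the log-likelihood of the observations:
$$\mathcal{L}(\lambda) = \sum_{i=1}^{M} \log p_\lambda(\hat{w}_i), \qquad (2)$$
where $p_\lambda$ is a mixture of topic models (1), for mixture weights $\lambda$. We expect topic-based distributions will better estimate the true word distribution than the empirical estimate, as the latter is biased due to ASR errors. Maximum-likelihood estimation of $\lambda$ in (2) is given by:
$$\lambda^{*} = \arg\max_{\lambda} \sum_{i=1}^{M} \log \sum_{t=1}^{T} \lambda_t \, p(\hat{w}_i \mid t). \qquad (3)$$
Using the EM algorithm, $Q(\lambda^{(k)}, \lambda)$ is the expected log-likelihood of the “complete” data:
$$Q(\lambda^{(k)}, \lambda) = \sum_{i=1}^{M} \sum_{t=1}^{T} r^{(k)}(t \mid \hat{w}_i) \, \log\left(\lambda_t \, p(\hat{w}_i \mid t)\right), \qquad (4)$$
where $r^{(k)}(t \mid \hat{w}_i)$ is the posterior distribution of the topic $t$ in the $i$-th bin, given the 1-best word $\hat{w}_i$; this is computed in the E-step of the $k$-th iteration:
$$r^{(k)}(t \mid \hat{w}_i) = \frac{\lambda^{(k)}_t \, p(\hat{w}_i \mid t)}{\sum_{t'=1}^{T} \lambda^{(k)}_{t'} \, p(\hat{w}_i \mid t')}. \qquad (5)$$
In the M-step of the iteration, the new estimate $\lambda^{(k+1)}$ of the prior is computed by maximizing (4), i.e.,
$$\lambda^{(k+1)}_t = \frac{1}{M} \sum_{i=1}^{M} r^{(k)}(t \mid \hat{w}_i). \qquad (6)$$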
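The E-step and M-step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name, the array layout (one row per topic), the uniform initialization and the fixed iteration count are all assumptions, and the sketch further assumes that every observed word has non-zero probability under at least one topic (e.g., because the topic unigrams are smoothed).

```python
import numpy as np

def em_mixture_weights(topic_word, word_ids, n_iter=50):
    """Estimate topic mixture weights from 1-best word ids (illustrative sketch).

    topic_word: array of shape (T, V); row t is the unigram distribution of topic t.
    word_ids:   sequence with the 1-best word id of each confusion-network bin.
    """
    topic_word = np.asarray(topic_word, dtype=float)
    n_topics = topic_word.shape[0]
    # Uniform initialization of the mixture weights (an assumption of this sketch).
    lam = np.full(n_topics, 1.0 / n_topics)
    # probs[t, i] = p(w_i | t) for the 1-best word of bin i.
    probs = topic_word[:, np.asarray(word_ids)]
    for _ in range(n_iter):
        # E-step: posterior of each topic given the 1-best word of each bin.
        joint = lam[:, None] * probs
        post = joint / joint.sum(axis=0, keepdims=True)
        # M-step: average the topic posteriors over all bins.
        lam = post.mean(axis=1)
    return lam
```

For example, with two topics over a two-word vocabulary, `em_mixture_weights([[0.9, 0.1], [0.1, 0.9]], [0, 0, 0, 1])` puts most of the weight on the first topic, since three of the four observed words are far more likely under it.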
3.1 Learning from Expected Counts
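Following ?) we next consider using the entire bin in self-training by maximizing the expected log-likelihood of the ASR output:
$$\mathcal{L}'(\lambda) = \mathbb{E}\left[ \sum_{i=1}^{M} \log p_\lambda(W_i) \right], \qquad (7)$$
where $W_i$ is a random variable which takes the value $w \in V$ with probability $s_i(w)$ equal to the confidence (posterior probability) of the recognizer. The maximum-likelihood estimation problem becomes:
$$\lambda^{*} = \arg\max_{\lambda} \sum_{i=1}^{M} \sum_{w \in V} s_i(w) \log p_\lambda(w) = \arg\max_{\lambda} \sum_{w \in V} \mathrm{tf}(w) \log p_\lambda(w), \qquad (8)$$
where $\mathrm{tf}(w) = \sum_{i=1}^{M} s_i(w)$ denotes the expected count of word $w$ in the conversation, given by the sum of the posteriors of $w$ in all the confusion network bins of the conversation [Karakos et al., 2011] (note that for text documents, it is equivalent to term-frequency). We again use the EM algorithm, with objective function:
$$Q(\lambda^{(k)}, \lambda) = \sum_{w \in V} \mathrm{tf}(w) \sum_{t=1}^{T} r^{(k)}(t \mid w) \, \log\left(\lambda_t \, p(w \mid t)\right), \qquad (9)$$
where $r^{(k)}(t \mid w)$ is the posterior distribution of the topic $t$ given word $w$, computed using (5) (but with $w$ instead of $\hat{w}_i$). In the M-step of the iteration, the new estimate $\lambda^{(k+1)}$ of the prior is computed by maximizing (9), i.e.,
$$\lambda^{(k+1)}_t = \frac{\sum_{w \in V} \mathrm{tf}(w) \, r^{(k)}(t \mid w)}{\sum_{w \in V} \mathrm{tf}(w)}. \qquad (10)$$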
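As a rough illustration of training from expected counts, the sketch below first accumulates word posteriors over confusion-network bins and then runs EM with those soft counts in place of 1-best counts. The data layout (each bin as a dictionary from word to posterior), the `vocab_index` mapping and both function names are hypothetical choices made for this example, and the topic unigrams are again assumed to be smoothed so that no word has zero probability under every topic.

```python
import numpy as np

def expected_counts(bins, vocab_index):
    """Sum the posterior of each word over all bins of a conversation."""
    tf = np.zeros(len(vocab_index))
    for bin_ in bins:                      # bin_: dict mapping word -> posterior
        for word, posterior in bin_.items():
            tf[vocab_index[word]] += posterior
    return tf

def em_mixture_weights_soft(topic_word, tf, n_iter=50):
    """EM for topic mixture weights using expected (soft) word counts."""
    topic_word = np.asarray(topic_word, dtype=float)   # shape (T, V)
    tf = np.asarray(tf, dtype=float)                   # shape (V,)
    n_topics = topic_word.shape[0]
    lam = np.full(n_topics, 1.0 / n_topics)            # uniform start (assumption)
    for _ in range(n_iter):
        # E-step: topic posterior for every vocabulary word.
        joint = lam[:, None] * topic_word
        post = joint / joint.sum(axis=0, keepdims=True)
        # M-step: count-weighted average of the topic posteriors.
        lam = (post * tf).sum(axis=1) / tf.sum()
    return lam
```

When every bin contains a single word with posterior 1, the expected counts reduce to ordinary term frequencies and the procedure coincides with 1-best training.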
3.2 Maximum-Aposteriori Estimation of $\lambda$
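In addition to a maximum likelihood estimate, we consider a maximum-aposteriori (MAP) estimate by placing a Dirichlet prior over $\lambda$ [Bacchiani and Roark, 2003]:
$$\lambda^{*} = \arg\max_{\lambda} \left\{ \mathcal{L}(\lambda) + \log \mathrm{Dir}(\lambda; \alpha) \right\}, \qquad (11)$$
where $\mathcal{L}(\lambda)$ is the log-likelihood of the ASR output and $\mathrm{Dir}(\lambda; \alpha)$ is the pdf of the Dirichlet distribution with parameter $\alpha$:
$$\mathrm{Dir}(\lambda; \alpha) = \frac{\Gamma(T\alpha)}{\Gamma(\alpha)^{T}} \prod_{t=1}^{T} \lambda_t^{\alpha - 1}.$$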
This introduces an extra component in the optimization. It is easy to prove that the update equation for becomes:
for the case where only the ASR 1-best is used, and
for the case where expected counts are used. Note that the notation stands for .
4 Confusion Estimation in ASR Output
Self-training on ASR output can mislead the language model through confusions in the ASR channel. By modeling these confusions we can guide self-training and recover from recognition errors.
The ASR channel confusion model is represented by a conditional probability distribution $p_a(\hat{w} \mid w)$, which denotes the probability that the most likely word in the output of the recognizer (i.e., the “1-best” word) is $\hat{w}$, given that the true (reference) word spoken is $w$. Of course, this conditional distribution is just an approximation, as many other phenomena (coarticulation, non-stationarity of speech, channel variations, lexical choice in the context, etc.) cause this probability to vary. We assume that $p_a(\hat{w} \mid w)$ is an “average” of the conditional probabilities under various conditions.
We use the following simple procedure for estimating the ASR channel, similar to that of [Xu et al., 2009] for computing cohort sets:
- Create confusion networks [Mangu et al., 1999] with the available audio.
- Count $c(\hat{w}, w)$, the number of times words $\hat{w}$ and $w$ appear in the same bin.
- The conditional ASR probability is computed as $p_a(\hat{w} \mid w) = c(\hat{w}, w) / \sum_{w'} c(w', w)$.
- Prune words whose posterior probabilities are lower than 5% of the max probability in a bin.
- Keep only the top 10 words in each bin.
The last two items above were implemented as a way of reducing the search space and the complexity of the task. We did not observe significant changes in the results when we relaxed these two constraints.
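The procedure above can be sketched as follows. This is only an illustration under stated assumptions: bins are represented as dictionaries from word to posterior, pruning is applied before counting, a word is counted as co-occurring with itself, and the counts are normalized per reference word. None of these details, nor the function names, are specified in the text.

```python
from collections import defaultdict

def prune_bin(bin_, rel_threshold=0.05, top_k=10):
    """Drop words below rel_threshold * max posterior, then keep the top_k words."""
    max_post = max(bin_.values())
    kept = {w: p for w, p in bin_.items() if p >= rel_threshold * max_post}
    top = sorted(kept.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    return dict(top)

def estimate_channel(bins, rel_threshold=0.05, top_k=10):
    """Return channel[w][u]: estimated probability that the 1-best word is u
    given that the reference word is w, from bin co-occurrence counts."""
    counts = defaultdict(lambda: defaultdict(float))
    for bin_ in bins:
        words = list(prune_bin(bin_, rel_threshold, top_k))
        for w in words:
            for u in words:            # includes u == w (assumption of this sketch)
                counts[w][u] += 1.0
    channel = {}
    for w, row in counts.items():
        total = sum(row.values())
        channel[w] = {u: c / total for u, c in row.items()}
    return channel
```

With the bins `[{"cat": 0.6, "cap": 0.39, "the": 0.01}, {"cat": 0.9, "cab": 0.1}]`, the word "the" is pruned from the first bin, so it never enters the confusion table, while "cat" is confusable with both "cap" and "cab".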
4.1 Learning with ASR Confusion Estimates
We now derive a maximum-likelihood estimate of $\lambda$ that is based on the 1-best ASR output but relies on the ASR channel confusion model $p_a$. The log-likelihood of the observations (most likely path in the confusion network) is:
$$\mathcal{L}(\lambda) = \sum_{i=1}^{M} \log p(\hat{w}_i), \qquad (14)$$
where $p(\hat{w}_i)$ is the induced distribution on the observations under the confusion model $p_a(\hat{w}_i \mid w)$ and the estimated distribution $p_\lambda(w)$ of (1):
$$p(\hat{w}_i) = \sum_{w \in V} p_a(\hat{w}_i \mid w) \, p_\lambda(w). \qquad (15)$$
Recall that while we sum over $V$, in practice the summation is limited to only likely words. One could argue that the ASR channel confusion distribution $p_a(\hat{w}_i \mid w)$ should discount unlikely confusions. However, since $p_\lambda(w)$ is not bin specific, unlikely words in a specific bin could receive non-negligible probability from $p_\lambda(w)$ if they are likely in general. This makes the truncated summation over $V$ problematic.
One solution would be to reformulate $p_\lambda(w)$ so that it becomes a conditional distribution given the left (or even right) lexical context of the bin. But this formulation adds complexity and suffers from the usual sparsity issues. The solution we follow here imposes the constraint that only the words already existing in $b_i$ are allowed to be candidate words giving rise to $\hat{w}_i$. This uses the “pre-filtered” set of words in $b_i$ to condition on context (acoustic and lexical), without having to model such context explicitly. We anticipate that this conditioning on the words of $b_i$ leads to more accurate inference. The likelihood objective then becomes:
$$\mathcal{L}(\lambda) = \sum_{i=1}^{M} \log p(\hat{w}_i \mid b_i), \qquad (16)$$
with $p(\hat{w}_i \mid b_i)$ defined as:
$$p(\hat{w}_i \mid b_i) = \frac{\sum_{w \in b_i} p_a(\hat{w}_i \mid w) \, p_\lambda(w)}{\sum_{w \in b_i} p_\lambda(w)}, \qquad (17)$$
i.e., the induced probability conditioned on bin $b_i$. Note that although we estimate a conversation-level distribution $p_\lambda(w)$, it has to be normalized in each bin by dividing by $\sum_{w \in b_i} p_\lambda(w)$, in order to condition only on the words in the bin. The maximum-likelihood estimation for $\lambda$ becomes:
$$\lambda^{*} = \arg\max_{\lambda} \sum_{i=1}^{M} \log \frac{\sum_{w \in b_i} p_a(\hat{w}_i \mid w) \, p_\lambda(w)}{\sum_{w \in b_i} p_\lambda(w)}. \qquad (18)$$
Note that the appearance of $\lambda$ in the denominator makes the maximization harder.
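To make the bin-conditioned objective concrete, the sketch below evaluates the induced probability of a 1-best word when the candidate reference words are restricted to those in its bin, and sums the resulting log-probabilities over bins. The dictionary-based representation of the channel model and of the mixture distribution, as well as the function names, are assumptions made for illustration; the sketch also assumes the mixture distribution covers every word in every bin and that each 1-best word has non-zero induced probability.

```python
import math

def bin_conditioned_prob(one_best, bin_words, channel, p_lambda):
    """Induced probability of the 1-best word, conditioned on the words of its bin.

    channel[w][u]: probability that the 1-best word is u given reference word w.
    p_lambda[w]:   conversation-level mixture probability of word w.
    """
    # Numerator: channel-weighted prior mass of the candidate reference words.
    num = sum(channel.get(w, {}).get(one_best, 0.0) * p_lambda[w] for w in bin_words)
    # Denominator: prior mass of the bin, so that we condition only on its words.
    den = sum(p_lambda[w] for w in bin_words)
    return num / den

def bin_log_likelihood(bins, channel, p_lambda):
    """Sum of log bin-conditioned probabilities of the 1-best words."""
    total = 0.0
    for bin_ in bins:                       # bin_: dict mapping word -> posterior
        one_best = max(bin_, key=bin_.get)  # most likely word in the bin
        total += math.log(bin_conditioned_prob(one_best, bin_, channel, p_lambda))
    return total
```

Because the denominator depends on the mixture weights through `p_lambda`, changing the weights rescales every bin differently, which is why the weights cannot be solved for by simple differentiation.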
As before, we rely on an EM-procedure to maximize this objective. Let us assume that at the -th iteration of EM we have an estimate of the prior distribution , denoted by . This induces an observation probability in bin based on equation (17), as well as a posterior probability that a word is the reference word, given . The goal is to come up with a new estimate of the prior distribution that increases the value of the log-likelihood. If we define:
then we have:
Thus, , as a function of , can be written as:
The expression above cannot be easily maximized with respect to $\lambda$ by simple differentiation, because the elements of $\lambda$ appear in the denominator, making them coupled. Instead, we will derive and maximize a lower bound for the $Q$-difference (20).
For the rest of the derivation we assume that , where . We can thus express as a function of as follows:
Interestingly, the fact that the sum appears in both numerator and denominator in the above expression allows us to discard it.
At iteration , the goal is to come up with an update , such that results in a higher value for ; i.e., we require:
where is the weight vector that resulted after optimizing in the -th iteration. Obviously,
Let us consider the difference:
We use the well-known inequality and obtain:
Next, we apply Jensen’s inequality to obtain:
It now suffices to find a that maximizes the lower bound of (30), as this will guarantee that the -difference will be greater than zero. Note that the lower bound in (30) is a concave function of , and it thus has a global maximum that we will find using differentiation. Let us set equal to the right-hand-side of (30). Then,
By setting the partial derivative above equal to 0 and solving, we obtain the update for the reparameterized weights (or, equivalently, for the mixture weights $\lambda$):
4.2 Learning from Expected Counts
As before, we now consider maximizing the expected log-likelihood of the ASR output:
where takes the value with probability equal to the confidence (posterior probability) of the recognizer. The modified maximum-likelihood estimation problem now becomes:
By following a procedure similar to the one described earlier in this section, we come up with the lower bound on the -difference:
whose optimization results in the following update equation for :
4.3 Maximum-Aposteriori Estimation of $\lambda$
Finally, we consider a MAP estimation of with the confusion model. The optimization contains the term and the -difference (20) is:
where are the corresponding normalizing constants. In order to obtain a lower bound on the difference in the second line of the expression above, we consider the following chain of equalities:
We distinguish two cases:
(i) Case 1: .
We apply Jensen’s inequality to obtain the following lower bound on (38),
The lower bound (39) now becomes part of the lower bound in (30) (for the case of the 1-best) or the lower bound (35) (for the case of expected counts), and after differentiating with respect to and setting the result equal to zero we obtain the update equation:
where in (40) is equal to
in the case of using just the 1-best, and
in the case of using the lattice.
(ii) Case 2: .
We apply the well-known inequality and obtain the following lower bound on (38),
As before, the lower bound (43) now becomes part of the lower bound in (30) (for the case of the 1-best) or the lower bound (35) (for the case of expected counts), and after differentiating with respect to and setting the result equal to zero we obtain the update equation:
We summarize our methods as graphical models in Figure 1. We estimate the topic distributions for a conversation by either self-training directly on the ASR output, or including our ASR channel confusion model. For training on the ASR output, we rely on MLE 1-best training ((5),(6), left figure), or expected counts from the lattices (10). For both settings we also consider MAP estimation: 1-best (12) and expected counts (13). When using the ASR channel confusion model, we derived parallel cases for MLE 1-best ((32), middle figure) and expected counts (36), as well as the MAP (right figure) training of each ((41),(42)).
5 Experimental Results
Speech data is taken from the Fisher telephone conversation speech corpus, which has been split into 4 parts. The first set is one hour of speech used for training the ASR system (acoustic modeling). We chose a small training set to simulate a high WER condition (approx. 56%) since we are primarily interested in low resource settings. While conversational speech is a challenging task even with many hours of training data, we are interested in settings with a tiny amount of training data, such as for new domains, languages or noise conditions. The second set, a superset of the first, contains 5.5 million words of manual transcripts, used for training the topic-based distributions $p(w \mid t)$. Conversations in the Fisher corpus are labeled with 40 topics, so we create 40 topic-based unigram distributions. These are smoothed based on the vocabulary of the recognizer using the Witten-Bell algorithm [Chen and Goodman, 1996]. Set B, which consists of 50 hours and is disjoint from the other sets, is used as a development corpus for tuning the MAP parameter $\alpha$. Finally, set C (44 hours) is used as a blind test set. The ASR channel and the topic proportions are learned in an unsupervised manner on both sets B and C. The results are reported on approximately 5-hour subsets of sets B and C, consisting of 35 conversations each.
BBN’s ASR system, Byblos, was used in all ASR experiments. It is a multi-pass LVCSR system that uses state-clustered Gaussian tied-mixture models at the triphone and quinphone levels [Prasad et al., 2005]. The audio features are transformed using cepstral normalization, HLDA and VTLN. Only ML estimation was used. Decoding performs three passes: a forward and backward pass with a triphone acoustic model (AM) and a 3-gram language model (LM), and rescoring using quinphone AM and a 4-gram LM. These three steps are repeated after speaker adaptation using CMLLR. The vocabulary of the recognizer is 75k words. References of the dev and test sets have vocabularies of 13k and 11k respectively.
Our focus on estimating confusions suggests that improvements would be manifest for content words, as opposed to frequently occurring function words. This would be a highly desirable improvement as more accurate content words lead to improved readability or performance on downstream tasks, such as information retrieval and spoken term detection. As a result, we care more about reducing perplexity on these content words than reducing overall scores, which give too much emphasis to function words, the most frequent tokens in the reference transcriptions.
To measure improvements on content words (low-frequency words) we use the method of Wu and Khudanpur [2000], who compute a constrained version of perplexity focused on content words. We restrict the computation to only those words whose counts in the reference transcripts are at most equal to a threshold thr. Perplexity [Jelinek, 1997] is measured on the manual transcripts of both dev and test data based on the formula $2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 p(w_i)}$, where $p$ represents the estimated language model, and $N$ is the size of the (dev or test) corpus. We emphasize that constrained perplexity is not an evaluation metric, and directly optimizing it would foolishly hurt overall perplexity. However, if overall perplexity remains unchanged, then improvements in content word perplexity reflect a shift of the probability mass, emphasizing corrections in content words over the accuracy of function words, a sensible choice for improving output quality.
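A minimal sketch of the constrained perplexity computation is given below. It assumes a unigram language model stored as a dictionary from word to probability, and it normalizes by the number of retained tokens; the text does not state which normalizer is used in the constrained case, so that choice, like the function name and signature, is an assumption of this example.

```python
import math
from collections import Counter

def constrained_perplexity(reference_tokens, lm_prob, thr=None):
    """Perplexity restricted to tokens whose reference count is at most thr.

    reference_tokens: list of words from the manual transcripts.
    lm_prob:          dict mapping word -> probability under the language model.
    thr:              count threshold; None means no restriction (overall perplexity).
    """
    counts = Counter(reference_tokens)
    kept = [w for w in reference_tokens if thr is None or counts[w] <= thr]
    if not kept:
        raise ValueError("no tokens satisfy the count threshold")
    # Average base-2 log-probability over the retained tokens only.
    log2_sum = sum(math.log2(lm_prob[w]) for w in kept)
    return 2.0 ** (-log2_sum / len(kept))
```

For instance, for the token sequence `["the", "the", "the", "cat"]` with a threshold of 1, only "cat" contributes, so the result depends solely on the probability the model assigns to that low-frequency word.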
First, we observe that overall perplexity (far right of Figure 2) remains unchanged; none of the differences between models with and without confusion estimates are statistically significant. However, as expected, the confusion estimates significantly improve the performance on content words (left and center of Figure 2). The confusion model gives modest (4-6% relative) but statistically significant gains in all conditions for content (low-frequency) words. Additionally, the MAP variant (which was tuned based on low-frequency words) gives gains over the MLE version in all conditions, for both the self-supervised and the confusion model cases. This indicates that modeling confusions focuses improvements on content words, which improve readability and downstream applications.
Finally, we consider how our adapted language models can improve WER of the ASR system. We used each of the language models with the recognizer to produce transcripts of the dev and test sets. Overall, the best language model (including the confusion model) yielded no change in the overall WER (as we observed with perplexity). However, in a rescoring experiment, the adapted model with confusion estimates resulted in a 0.3% improvement in content WER (errors restricted to words that appear at most 3 times) over the unadapted model, and a 0.1% improvement over the regular adapted model. This confirms that our improved language models yield better recognition output, focused on improvements in content words.
We have presented a new model that captures the confusions (errors) of the ASR channel. When incorporated with adaptation of a topic-based language model, we observe improvements in modeling of content words that improve readability and downstream applications. Our improvements are consistent across a number of settings, including 1-best and lattice self-training on conversational speech. Beyond improvements to language modeling, we believe that our confusion model can aid other speech tasks, such as topic classification. We plan to investigate other tasks, as well as better confusion models, in future work.
- [Bacchiani and Roark, 2003] M. Bacchiani and B. Roark. 2003. Unsupervised language model adaptation. In Proceedings of ICASSP-2003, pages 224–227.
- [Bacchiani et al., 2004] M. Bacchiani, B. Roark, and M. Saraclar. 2004. Language model adaptation with MAP estimation and the perceptron algorithm. In Proceedings of HLT-2004, pages 21–24.
- [Bellegarda, 2001] J. R. Bellegarda. 2001. An overview of statistical language model adaptation. In Proceedings of ISCA Tutorial and Research Workshop (ITRW), Adaptation Methods for Speech Recognition.
- [Blei et al., 2003] D. Blei, A. Ng, and M. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3, January.
- [Chen and Goodman, 1996] S. F. Chen and J. Goodman. 1996. An empirical study of smoothing techniques for language modeling. In Proceedings of the 34th Annual Meeting of the ACL, pages 310–318.
- [Cover and Thomas, 1996] T. M. Cover and J. A. Thomas. 1996. Elements of Information Theory. John Wiley and Sons, Inc.
- [Hazen and Richardson, 2008] T. J. Hazen and F. Richardson. 2008. A hybrid SVM/MCE training approach for vector space topic identification of spoken audio recordings. In Proceedings of Interspeech-2008, pages 2542–2545.
- [Hofmann, 2001] T. Hofmann. 2001. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning, 42(1/2):177–196.
- [Hsu and Glass, 2006] B-J. Hsu and J. Glass. 2006. Style & topic language model adaptation using HMM-LDA. In Proceedings of EMNLP-2006.
- [Jelinek, 1997] F. Jelinek. 1997. Statistical Methods for Speech Recognition. MIT Press.
- [Karakos et al., 2011] D. Karakos, M. Dredze, K. Church, A. Jansen, and S. Khudanpur. 2011. Estimating document frequencies in a speech corpus. In Proceedings of ASRU-2011.
- [Mangu et al., 1999] L. Mangu, E. Brill, and A. Stolcke. 1999. Finding consensus among words: Lattice-based word error minimization. In Proceedings of Eurospeech-1999.
- [Novotney et al., 2009] S. Novotney, R. Schwartz, and J. Ma. 2009. Unsupervised acoustic and language model training with small amounts of labelled data. In Proceedings of ICASSP-2009.
- [Prasad et al., 2005] R. Prasad, S. Matsoukas, C.-L. Kao, J. Z. Ma, D.-X. Xu, T. Colthurst, O. Kimball, R. Schwartz, J-L. Gauvain, L. Lamel, H. Schwenk, G. Adda, and F. Lefevre. 2005. The 2004 BBN/LIMSI 20xRT English conversational telephone speech recognition system. In Proceedings of Interspeech-2005, pages 1645–1648.
- [Seymore and Rosenfeld, 1997] K. Seymore and R. Rosenfeld. 1997. Using story topics for language model adaptation. In Proceedings of Eurospeech-1997.
- [Tam and Schultz, 2005] Y-C. Tam and T. Schultz. 2005. Dynamic language model adaptation using variational Bayes inference. In Proceedings of Eurospeech-2005.
- [Wang and Stolcke, 2007] W. Wang and A. Stolcke. 2007. Integrating MAP, marginals, and unsupervised language model adaptation. In Proceedings of Interspeech-2007, pages 618–621.
- [Wu and Khudanpur, 2000] J. Wu and S. Khudanpur. 2000. Maximum entropy techniques for exploiting syntactic, semantic and collocational dependencies in language modeling. Computer Speech and Language, 14:355–372.
- [Xu et al., 2009] P. Xu, D. Karakos, and S. Khudanpur. 2009. Self-supervised discriminative training of statistical language models. In Proceedings of ASRU-2009.