Abstract
Human language is a combination of elemental languages/domains/styles that change across and sometimes within discourses. Language models, which play a crucial role in speech recognizers and machine translation systems, are particularly sensitive to such changes, unless some form of adaptation takes place. One approach to speech language model adaptation is self-training, in which a language model’s parameters are tuned based on automatically transcribed audio. However, transcription errors can misguide self-training, particularly in challenging settings such as conversational speech. In this work, we propose a model that considers the confusions (errors) of the ASR channel. By modeling the likely confusions in the ASR output instead of using just the 1-best, we improve self-training efficacy by obtaining a more reliable reference transcription estimate. We demonstrate improved topic-based language modeling adaptation results over both 1-best and lattice self-training using our ASR channel confusion estimates on telephone conversations.
The Johns Hopkins University
Estimating Confusions in the ASR Channel for
Improved Topic-based Language Model Adaptation
Damianos Karakos, Mark Dredze, Sanjeev Khudanpur
Technical Report 8
July 4, 2019
©HLTCOE, 2013
Acknowledgements. This work is supported, in part, by the Human Language Technology Center of Excellence. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the sponsor.
HLTCOE
810 Wyman Park Drive
Baltimore, Maryland
21211
http://hltcoe.jhu.edu
Estimating Confusions in the ASR Channel for
Improved Topic-based Language Model Adaptation
Damianos Karakos, Raytheon BBN Technologies, Cambridge, MA (dkarakos@bbn.com)
Mark Dredze and Sanjeev Khudanpur, Human Language Technology Center of Excellence, Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD 21211 ({mdredze,khudanpur}@jhu.edu)
1 Introduction
Modern statistical automatic speech recognition (ASR) systems rely on language models for ranking hypotheses generated from acoustic data. Language models are trained on millions of words (or more) taken from text that matches the spoken language, domain and style of interest. Reliance on (static) training data makes language models brittle [Bellegarda, 2001] to changes in domain. However, for many problems of interest, there are numerous hours of spoken audio but little to no written text for language model training. In these settings, we must rely on language model adaptation using the spoken audio to improve performance on data of interest. One common approach to language model adaptation is self-training [Novotney et al., 2009], in which the language model is retrained on the output from the ASR system run on the new audio. Unfortunately, self-training learns from both correct and error-ridden transcriptions that mislead the system, a particular problem for high word error rate (WER) domains, such as conversational speech. Even efforts to consider the entire ASR lattice in self-training cannot account for the ASR error bias. Worse still, this is particularly a problem for rare content words as compared to common function words; the fact that content words are more important for understandability exacerbates the problem.
Confusions in the ASR output pose problems for other applications as well, such as speech topic classification and spoken document retrieval. In high-WER scenarios, their performance degrades, sometimes considerably [Hazen and Richardson, 2008].
In this work, we consider the problem of topic adaptation of a speech recognizer [Seymore and Rosenfeld, 1997], in which we adapt the language model to the topic of the new speech. Our novelty lies in the fact that we correct for the biases present in the output by estimating ASR confusions. Topic proportions are estimated via a probabilistic graphical model which accounts for confusions in the transcript and provides a more accurate portrayal of the spoken audio. To demonstrate the utility of our model, we show improved results over traditional self-training as well as lattice-based self-training for the challenging setting of conversational speech transcription. In particular, we show statistically significant improvements for content words.
Note that Bacchiani et al. (2004) also consider the problem of language model adaptation as an error-correction problem, but with supervised methods. They train an error-correcting perceptron model on reference transcriptions from the new domain. In contrast, our approach does not assume the existence of transcribed data for training a confusion model; rather, the model is trained in an unsupervised manner based only on the ASR output.
The paper proceeds as follows: Section 2 describes our setting of language model adaptation and our topic-based language model. Section 3 presents a language model adaptation process based on maximum-likelihood and maximum-aposteriori self-training, while Section 4 introduces adaptation that utilizes ASR channel estimates. Section 5 describes experiments on conversational speech.
2 Language Model Adaptation
We are given a trained speech recognizer, a topic-based language model, and a large collection of audio utterances (many conversations) for a new domain, i.e. a change of topics, but no manually transcribed text needed for language model training. Our goal is to adapt the language model by learning new topic distributions using the available audio. We consider self-training that adapts the topic distributions based on automatically transcribed audio.
A conversation is composed of speech utterances, each represented by a lattice (confusion network) – annotated with words and posterior probabilities – produced by the speech recognizer.^1 Each confusion network consists of a sequence of bins, where each bin is a set of words hypothesized by the recognizer at a particular time. The $i$-th bin is denoted $b_i$ and contains words $w \in \mathcal{V}$, where $\mathcal{V}$ is the vocabulary.^2 When obvious from context, we omit the subscript $i$. The most likely word in bin $b_i$ is denoted $\hat w_i$. Let $M$ be the total number of bins in the confusion networks in a single conversation.
^1 We assume conversations, but our methods can be applied to non-conversational genres.
^2 Under specific contexts, not all vocabulary items are likely to be truly spoken in a bin. Although we write summations over all $w \in \mathcal{V}$, in practice we use only words which are acoustically confusable and consistent with the lexical context of the bin.
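For concreteness, the confusion-network representation described above can be sketched as follows. This is an illustrative sketch with hypothetical data structures and toy values, not the representation used by any particular recognizer:

```python
from typing import Dict, List

Bin = Dict[str, float]          # word -> posterior probability in this bin
ConfusionNetwork = List[Bin]    # one utterance: a sequence of bins

def one_best(network: ConfusionNetwork) -> List[str]:
    """Return the 1-best path: the highest-posterior word in each bin."""
    return [max(b, key=b.get) for b in network]

# Toy conversation fragment with two bins (hypothetical words and posteriors).
conversation: ConfusionNetwork = [
    {"the": 0.7, "a": 0.3},
    {"cat": 0.5, "cap": 0.3, "cut": 0.2},
]
print(one_best(conversation))  # ['the', 'cat']
```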
We use unigram topic-based language models (multinomial distributions over $\mathcal{V}$), which capture word frequencies under each topic. Such models have been used in a variety of ways, such as in PLSA [Hofmann, 2001] and LDA [Blei et al., 2003], and under different training scenarios. Topic-based models provide a global context beyond the local word history in modeling a word’s probability, and have been found especially useful in language model adaptation [Tam and Schultz, 2005, Wang and Stolcke, 2007, Hsu and Glass, 2006]. Each topic $t$ has a multinomial distribution denoted by $\theta_t(\cdot)$. These topic-word distributions are learned from conversations labeled with topics, such as those in the Fisher speech corpus.^3
^3 Alternative approaches estimate topics from unlabeled data, but we use labeled data for evaluation purposes.
Adapting the topic-based language model means learning a set of conversation-specific mixture weights $\lambda^{(c)} = (\lambda_1^{(c)}, \ldots, \lambda_T^{(c)})$, where $\lambda_t^{(c)}$ indicates the likelihood of seeing topic $t$ in conversation $c$.^4 While the topic-word distributions remain fixed, the topics selected change with each conversation.^5 These mixture weights form the true distribution of a word:
^4 Unless required, we drop the superscript $c$.
^5 This assumption can be relaxed to learn $\theta_t$ as well.
$P(w) = \sum_{t=1}^{T} \lambda_t \theta_t(w) \qquad (1)$
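The mixture in (1) is a few lines of code. The sketch below is purely illustrative; the toy topic distributions and weights are hypothetical:

```python
def word_prob(w, lam, thetas):
    """Mixture probability of a word under topic weights lam and
    per-topic unigram distributions thetas (equation 1)."""
    return sum(l * th.get(w, 0.0) for l, th in zip(lam, thetas))

# Two hypothetical topics and conversation-level weights.
thetas = [{"cat": 0.6, "dog": 0.4}, {"cat": 0.1, "tax": 0.9}]
lam = [0.75, 0.25]
print(word_prob("cat", lam, thetas))  # 0.75*0.6 + 0.25*0.1 = 0.475
```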
3 Learning from ASR Output
We begin by following previous approaches to self-training, in which model parameters are re-estimated based on ASR output. We consider self-training based on 1-best and lattice maximum-likelihood estimation (MLE), as well as maximum-aposteriori (MAP) training. In the next section, we modify these approaches to incorporate our confusion estimates.
For estimating the topic mixtures $\lambda$ using self-training on the 1-best ASR output, i.e. the 1-best path $\hat w_1, \ldots, \hat w_M$ through the confusion networks, we write the log-likelihood of the observations:
$L(\lambda) = \sum_{i=1}^{M} \log P(\hat w_i) \qquad (2)$
where $P(\cdot)$ is a mixture of topic models (1), for mixture weights $\lambda$. We expect topic-based distributions will better estimate the true word distribution than the empirical estimate, as the latter is biased due to ASR errors. Maximum-likelihood estimation of $\lambda$ in (2) is given by:
$\hat\lambda = \arg\max_{\lambda} \sum_{i=1}^{M} \log \sum_{t=1}^{T} \lambda_t \theta_t(\hat w_i) \qquad (3)$
Using the EM algorithm, the auxiliary function $Q(\lambda; \lambda^{(j)})$ is the expected log-likelihood of the “complete” data:
$Q(\lambda; \lambda^{(j)}) = \sum_{i=1}^{M} \sum_{t=1}^{T} P(t \mid \hat w_i; \lambda^{(j)}) \log\left(\lambda_t \theta_t(\hat w_i)\right) \qquad (4)$
where $P(t \mid \hat w_i; \lambda^{(j)})$ is the posterior distribution of topic $t$ in the $i$-th bin, given the 1-best word $\hat w_i$; this is computed in the E-step of the $j$-th iteration:
$P(t \mid \hat w_i; \lambda^{(j)}) = \frac{\lambda_t^{(j)} \theta_t(\hat w_i)}{\sum_{t'=1}^{T} \lambda_{t'}^{(j)} \theta_{t'}(\hat w_i)} \qquad (5)$
In the M-step of the $(j+1)$-th iteration, the new estimate of the prior is computed by maximizing (4), i.e.,
$\lambda_t^{(j+1)} = \frac{1}{M} \sum_{i=1}^{M} P(t \mid \hat w_i; \lambda^{(j)}) \qquad (6)$
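The E-step (5) and M-step (6) can be sketched in a few lines. This is an illustrative Python sketch under our notational assumptions, not the implementation used in the experiments; the toy topic distributions are hypothetical:

```python
def em_1best(one_best_words, thetas, iters=50):
    """EM for topic mixture weights from a 1-best transcript.
    E-step: topic posterior per word (eq. 5); M-step: average posteriors (eq. 6)."""
    T = len(thetas)
    lam = [1.0 / T] * T          # uniform initialization
    M = len(one_best_words)
    for _ in range(iters):
        new = [0.0] * T
        for w in one_best_words:
            joint = [lam[t] * thetas[t].get(w, 0.0) for t in range(T)]
            z = sum(joint) or 1.0
            for t in range(T):
                new[t] += joint[t] / z   # E-step posterior P(t | w)
        lam = [n / M for n in new]       # M-step
    return lam
```

On a toy transcript dominated by topic-0 words, the weights concentrate on topic 0 after a few iterations.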
3.1 Learning from Expected Counts
Following Karakos et al. (2011), we next consider using the entire bin in self-training by maximizing the expected log-likelihood of the ASR output:
$\mathbb{E}\left[\sum_{i=1}^{M} \log P(W_i)\right] = \sum_{i=1}^{M} \sum_{w \in b_i} s_i(w) \log P(w) \qquad (7)$
where $W_i$ is a random variable which takes the value $w \in b_i$ with probability $s_i(w)$, equal to the confidence (posterior probability) of the recognizer. The maximum-likelihood estimation problem becomes:

$\hat\lambda = \arg\max_{\lambda} \sum_{w \in \mathcal{V}} \hat c(w) \log \sum_{t=1}^{T} \lambda_t \theta_t(w) \qquad (8)$

where $\hat c(w) = \sum_{i=1}^{M} s_i(w)$ denotes the expected count of word $w$ in the conversation, given by the sum of the posteriors of $w$ in all the confusion network bins of the conversation [Karakos et al., 2011] (note that for text documents, it is equivalent to term frequency). We again use the EM algorithm, with objective function:
$Q(\lambda; \lambda^{(j)}) = \sum_{w \in \mathcal{V}} \hat c(w) \sum_{t=1}^{T} P(t \mid w; \lambda^{(j)}) \log\left(\lambda_t \theta_t(w)\right) \qquad (9)$
where $P(t \mid w; \lambda^{(j)})$ is the posterior distribution of topic $t$ given word $w$, computed using (5) (but with $w$ instead of $\hat w_i$). In the M-step of the $(j+1)$-th iteration, the new estimate of the prior is computed by maximizing (9), i.e.,
$\lambda_t^{(j+1)} = \frac{\sum_{w \in \mathcal{V}} \hat c(w) P(t \mid w; \lambda^{(j)})}{\sum_{w \in \mathcal{V}} \hat c(w)} \qquad (10)$
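The expected-count variant differs from the 1-best sketch only in weighting each word by its expected count. A minimal illustrative sketch (hypothetical inputs, not the experimental implementation):

```python
def em_expected_counts(expected_counts, thetas, iters=50):
    """EM for topic weights from expected word counts (eqs. 9-10).
    expected_counts: word -> sum of bin posteriors over the conversation."""
    T = len(thetas)
    lam = [1.0 / T] * T
    total = sum(expected_counts.values())
    for _ in range(iters):
        new = [0.0] * T
        for w, c in expected_counts.items():
            joint = [lam[t] * thetas[t].get(w, 0.0) for t in range(T)]
            z = sum(joint) or 1.0
            for t in range(T):
                new[t] += c * joint[t] / z   # count-weighted topic posterior
        lam = [n / total for n in new]       # M-step (eq. 10)
    return lam
```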
3.2 Maximum-Aposteriori Estimation of $\lambda$
In addition to a maximum-likelihood estimate, we consider a maximum-aposteriori (MAP) estimate by placing a symmetric Dirichlet prior over $\lambda$ [Bacchiani and Roark, 2003]:
$\hat\lambda_{\mathrm{MAP}} = \arg\max_{\lambda} \left\{ \log D(\lambda; \alpha) + \sum_{i=1}^{M} \log \sum_{t=1}^{T} \lambda_t \theta_t(\hat w_i) \right\} \qquad (11)$
where $D(\lambda; \alpha)$ is the pdf of the symmetric Dirichlet distribution with parameter $\alpha$:

$D(\lambda; \alpha) = \frac{\Gamma(T\alpha)}{\Gamma(\alpha)^{T}} \prod_{t=1}^{T} \lambda_t^{\alpha - 1}$
This introduces an extra additive term in the optimization. It is easy to prove that the update equation for $\lambda_t^{(j+1)}$ becomes:
$\lambda_t^{(j+1)} = \frac{(\alpha - 1) + \sum_{i=1}^{M} P(t \mid \hat w_i; \lambda^{(j)})}{T(\alpha - 1) + M} \qquad (12)$
for the case where only the ASR 1-best is used, and
$\lambda_t^{(j+1)} = \frac{(\alpha - 1) + \sum_{w \in \mathcal{V}} \hat c(w) P(t \mid w; \lambda^{(j)})}{T(\alpha - 1) + \sum_{w \in \mathcal{V}} \hat c(w)} \qquad (13)$
for the case where expected counts are used. Note that the notation $\sum_{w}$ stands for $\sum_{w \in \mathcal{V}}$.
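The MAP M-step is a one-line smoothing of the MLE update. A sketch of the 1-best case (12), with the posterior sums assumed precomputed by an E-step as in the earlier sections:

```python
def map_update(posterior_sums, M, alpha, T):
    """MAP M-step with a symmetric Dirichlet(alpha) prior (eq. 12):
    lam_t = ((alpha - 1) + sum_i P(t | w_i)) / (T*(alpha - 1) + M).
    posterior_sums[t] is the summed topic posterior over the M bins."""
    return [((alpha - 1.0) + s) / (T * (alpha - 1.0) + M) for s in posterior_sums]
```

With `alpha = 1` the prior is uninformative and the update reduces exactly to the MLE M-step (6); larger `alpha` pulls the weights toward uniform.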
4 Confusion Estimation in ASR Output
Self-training on ASR output can mislead the language model through confusions in the ASR channel. By modeling these confusions, we can guide self-training and recover from recognition errors.
The ASR channel confusion model is represented by a conditional probability distribution $p_c(w' \mid w)$, which denotes the probability that the most likely word in the output of the recognizer (i.e., the “1-best” word) is $w'$, given that the true (reference) word spoken is $w$. Of course, this conditional distribution is just an approximation, as many other phenomena – coarticulation, non-stationarity of speech, channel variations, lexical choice in the context, etc. – cause this probability to vary. We assume that $p_c(w' \mid w)$ is an “average” of the conditional probabilities under various conditions.
We use the following simple procedure for estimating the ASR channel, similar to that of [Xu et al., 2009] for computing cohort sets:

1. Create confusion networks [Mangu et al., 1999] with the available audio.
2. Count $c(w, w')$, the number of times words $w, w'$ appear in the same bin.
3. Compute the conditional ASR probability as $p_c(w' \mid w) = c(w, w') / \sum_{v} c(w, v)$.
4. Prune words whose posterior probabilities are lower than 5% of the max probability in a bin.
5. Keep only the top 10 words in each bin.
The last two steps above were implemented to reduce the search space and the complexity of the task. We did not observe significant changes in the results when we relaxed these two constraints.
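Steps 2 and 3 of the procedure above can be sketched as follows. This is an illustrative sketch (pruning omitted; bins represented as word-to-posterior dicts, a hypothetical choice):

```python
from collections import defaultdict

def estimate_channel(networks):
    """Estimate p_c(w' | w) from co-occurrence counts of words that share
    a confusion-network bin (steps 2-3 of the procedure above).
    Returns a dict keyed by (reference_word, observed_word)."""
    cooc = defaultdict(float)   # (w, w') -> c(w, w')
    marg = defaultdict(float)   # w -> sum over w' of c(w, w')
    for net in networks:
        for b in net:
            for w in b:
                for w2 in b:
                    cooc[(w, w2)] += 1.0
                    marg[w] += 1.0
    return {pair: c / marg[pair[0]] for pair, c in cooc.items()}
```

Note that a word co-occurs with itself in its own bin, so $p_c(w \mid w)$ captures the probability that the word is recognized correctly.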
4.1 Learning with ASR Confusion Estimates
We now derive a maximum-likelihood estimate of $\lambda$ based on the 1-best ASR output that relies on the ASR channel confusion model $p_c$. The log-likelihood of the observations $\hat w_1, \ldots, \hat w_M$ (the most likely path in the confusion networks) is:
$L(\lambda) = \sum_{i=1}^{M} \log q(\hat w_i) \qquad (14)$
where $q(\cdot)$ is the induced distribution on the observations under the confusion model and the estimated distribution $P$ of (1):
$q(w') = \sum_{w \in \mathcal{V}} p_c(w' \mid w) \sum_{t=1}^{T} \lambda_t \theta_t(w) \qquad (15)$
Recall that while we sum over $\mathcal{V}$, in practice the summation is limited to only likely words. One could argue that the ASR channel confusion distribution should discount unlikely confusions. However, since $p_c(w' \mid w)$ is not bin-specific, unlikely words in a specific bin could receive non-negligible probability from $p_c$ if they are likely in general. This makes the truncated summation over $\mathcal{V}$ problematic.
One solution would be to reformulate $p_c$ so that it becomes a conditional distribution given the left (or even right) lexical context of the bin. But this formulation adds complexity and suffers from the usual sparsity issues. The solution we follow here imposes the constraint that only the words already existing in bin $b_i$ are allowed to be candidate words giving rise to $\hat w_i$. This uses the “pre-filtered” set of words in $b_i$ to condition on context (acoustic and lexical), without having to model such context explicitly. We anticipate that this conditioning on the words of $b_i$ leads to more accurate inference. The likelihood objective then becomes:
$L(\lambda) = \sum_{i=1}^{M} \log q_{b_i}(\hat w_i) \qquad (16)$
with $q_{b_i}$ defined as:
$q_{b_i}(\hat w_i) = \frac{\sum_{w \in b_i} p_c(\hat w_i \mid w) P(w)}{\sum_{w \in b_i} P(w)} \qquad (17)$
i.e., the induced probability conditioned on bin $b_i$. Note that although we estimate a conversation-level distribution $P(w) = \sum_t \lambda_t \theta_t(w)$, it has to be normalized in each bin by dividing by $\sum_{w \in b_i} P(w)$, in order to condition only on the words in the bin. The maximum-likelihood estimation for $\lambda$ becomes:

$\hat\lambda = \arg\max_{\lambda} \sum_{i=1}^{M} \log \frac{\sum_{w \in b_i} p_c(\hat w_i \mid w) P(w)}{\sum_{w \in b_i} P(w)} \qquad (18)$

Note that the appearance of $\lambda$ in the denominator makes the maximization harder.
As before, we rely on an EM procedure to maximize this objective. Assume that at the $j$-th iteration of EM we have an estimate of the prior distribution $\lambda$, denoted by $\lambda^{(j)}$, with $P^{(j)}(w) = \sum_t \lambda_t^{(j)} \theta_t(w)$. This induces an observation probability $q^{(j)}_{b_i}(\hat w_i)$ in bin $b_i$ based on equation (17), as well as a posterior probability $P(w \mid \hat w_i; \lambda^{(j)})$ that a word $w \in b_i$ is the reference word, given the observation $\hat w_i$. The goal is to come up with a new estimate of the prior distribution that increases the value of the log-likelihood. If we define:

$Q(\lambda; \lambda^{(j)}) = \sum_{i=1}^{M} \sum_{w \in b_i} P(w \mid \hat w_i; \lambda^{(j)}) \log\left(p_c(\hat w_i \mid w) P(w)\right) \qquad (19)$

then we have:

$L(\lambda) - L(\lambda^{(j)}) \geq Q(\lambda; \lambda^{(j)}) - Q(\lambda^{(j)}; \lambda^{(j)}) - \sum_{i=1}^{M} \log \frac{\sum_{w \in b_i} P(w)}{\sum_{w \in b_i} P^{(j)}(w)} \qquad (20)$

where the first part of (20) holds because the Kullback-Leibler divergence is non-negative [Cover and Thomas, 1996]. Thus, we just need to find a value of $\lambda$ that makes the right-hand side of (20) positive, as this will guarantee that $L(\lambda) > L(\lambda^{(j)})$. The posterior distribution can be written:

$P(w \mid \hat w_i; \lambda^{(j)}) = \frac{p_c(\hat w_i \mid w) P^{(j)}(w)}{\sum_{w' \in b_i} p_c(\hat w_i \mid w') P^{(j)}(w')} \qquad (21)$

Thus, $Q(\lambda; \lambda^{(j)})$, as a function of $\lambda$, can be written (up to an additive constant) as:

$Q(\lambda; \lambda^{(j)}) = \sum_{i=1}^{M} \sum_{w \in b_i} P(w \mid \hat w_i; \lambda^{(j)}) \log \sum_{t=1}^{T} \lambda_t \theta_t(w) + \mathrm{const.} \qquad (22)$

Equation (22) cannot be easily maximized with respect to $\lambda$ by simple differentiation, because the elements of $\lambda$ also appear in the denominator of (17), making them coupled. Instead, we will derive and maximize a lower bound for the difference (20).
For the rest of the derivation we assume that $\lambda_t = u_t / \sum_{t'=1}^{T} u_{t'}$, where $u_t \geq 0$ are unnormalized topic weights, and we write $P_u(w) = \sum_{t=1}^{T} u_t \theta_t(w)$. We can thus express $q_{b_i}(\hat w_i)$ as a function of $u$ as follows:

$q_{b_i}(\hat w_i) = \frac{\sum_{w \in b_i} p_c(\hat w_i \mid w) P_u(w)}{\sum_{w \in b_i} P_u(w)} \qquad (23)$

Interestingly, the fact that the sum $\sum_{t'} u_{t'}$ appears in both numerator and denominator of the above expression allows us to discard it. At iteration $j+1$, the goal is to come up with an update $u^{(j+1)}$, such that the resulting $\lambda^{(j+1)}$ yields a higher value for the log-likelihood; i.e., we require:

$L(u^{(j+1)}) > L(u^{(j)}) \qquad (24)$

where $u^{(j)}$ is the weight vector that resulted after optimizing in the $j$-th iteration. Obviously,

$\lambda_t^{(j)} = \frac{u_t^{(j)}}{\sum_{t'=1}^{T} u_{t'}^{(j)}} \qquad (25)$
Let us consider the right-hand side of (20), expressed in terms of the unnormalized weights $u$:

$Q(u; u^{(j)}) - Q(u^{(j)}; u^{(j)}) - \sum_{i=1}^{M} \log \frac{\sum_{w \in b_i} P_u(w)}{\sum_{w \in b_i} P_{u^{(j)}}(w)} \qquad (27)$

We use the well-known inequality $\log x \leq x - 1$ and obtain:

$-\sum_{i=1}^{M} \log \frac{\sum_{w \in b_i} P_u(w)}{\sum_{w \in b_i} P_{u^{(j)}}(w)} \geq \sum_{i=1}^{M} \left(1 - \frac{\sum_{w \in b_i} P_u(w)}{\sum_{w \in b_i} P_{u^{(j)}}(w)}\right) \qquad (28)$

Next, we apply Jensen’s inequality to the first two terms of (27):

$Q(u; u^{(j)}) - Q(u^{(j)}; u^{(j)}) = \sum_{i=1}^{M} \sum_{w \in b_i} P(w \mid \hat w_i; u^{(j)}) \log \frac{P_u(w)}{P_{u^{(j)}}(w)} \geq \sum_{i=1}^{M} \sum_{w \in b_i} P(w \mid \hat w_i; u^{(j)}) \sum_{t=1}^{T} \frac{u_t^{(j)} \theta_t(w)}{P_{u^{(j)}}(w)} \log \frac{u_t}{u_t^{(j)}} \qquad (29)$

By combining (27), (28) and (29), we obtain a lower bound on the difference (20):

$\sum_{i=1}^{M} \sum_{w \in b_i} P(w \mid \hat w_i; u^{(j)}) \sum_{t=1}^{T} \frac{u_t^{(j)} \theta_t(w)}{P_{u^{(j)}}(w)} \log \frac{u_t}{u_t^{(j)}} + \sum_{i=1}^{M} \left(1 - \frac{\sum_{w \in b_i} P_u(w)}{\sum_{w \in b_i} P_{u^{(j)}}(w)}\right) \qquad (30)$

It now suffices to find a $u$ that maximizes the lower bound (30), as this will guarantee that the difference will be greater than zero. Note that the lower bound in (30) is a concave function of $u$, and it thus has a global maximum that we can find by differentiation. Let $B(u)$ equal the right-hand side of (30). Then,

$\frac{\partial B}{\partial u_t} = \frac{1}{u_t} \sum_{i=1}^{M} \sum_{w \in b_i} P(w \mid \hat w_i; u^{(j)}) \frac{u_t^{(j)} \theta_t(w)}{P_{u^{(j)}}(w)} - \sum_{i=1}^{M} \frac{\sum_{w \in b_i} \theta_t(w)}{\sum_{w \in b_i} P_{u^{(j)}}(w)} \qquad (31)$

By setting (31) equal to 0 and solving for $u_t$, we obtain the update for $u_t$ (or, equivalently, $\lambda_t$):

$u_t^{(j+1)} = u_t^{(j)} \cdot \frac{\sum_{i=1}^{M} \sum_{w \in b_i} P(w \mid \hat w_i; u^{(j)}) \, \theta_t(w) / P_{u^{(j)}}(w)}{\sum_{i=1}^{M} \left(\sum_{w \in b_i} \theta_t(w)\right) / \left(\sum_{w \in b_i} P_{u^{(j)}}(w)\right)} \qquad (32)$
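One iteration of the multiplicative weight update (32) can be sketched as follows. This is an illustrative sketch under our notational assumptions (channel stored as a dict keyed by (reference, observed) pairs, a hypothetical representation), not the implementation used in the experiments:

```python
def confusion_em_step(networks, channel, thetas, u):
    """One multiplicative update of the unnormalized topic weights u_t
    for the confusion-model likelihood (sketch of update (32)).
    channel[(w, w_hat)] approximates p_c(w_hat | w)."""
    T = len(thetas)
    num = [0.0] * T
    den = [0.0] * T
    for net in networks:
        for b in net:
            w_hat = max(b, key=b.get)   # 1-best word of the bin
            # P_u(w) restricted to the words of the bin
            P = {w: sum(u[t] * thetas[t].get(w, 0.0) for t in range(T)) for w in b}
            bin_mass = sum(P.values()) or 1.0
            # posterior that w is the reference word given w_hat (eq. 21)
            post = {w: channel.get((w, w_hat), 0.0) * P[w] for w in b}
            z = sum(post.values()) or 1.0
            for w in b:
                if P[w] > 0:
                    for t in range(T):
                        num[t] += (post[w] / z) * thetas[t].get(w, 0.0) / P[w]
            for t in range(T):
                den[t] += sum(thetas[t].get(w, 0.0) for w in b) / bin_mass
    return [u[t] * num[t] / den[t] if den[t] > 0 else u[t] for t in range(T)]
```

On a toy bin whose 1-best word belongs only to topic 0, the update moves all mass to topic 0 in a single step.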
4.2 Learning from Expected Counts
As before, we now consider maximizing the expected log-likelihood of the ASR output:
$\mathbb{E}\left[\sum_{i=1}^{M} \log q_{b_i}(W_i)\right] = \sum_{i=1}^{M} \sum_{w' \in b_i} s_i(w') \log q_{b_i}(w') \qquad (33)$
where $W_i$ takes the value $w' \in b_i$ with probability $s_i(w')$, equal to the confidence (posterior probability) of the recognizer. The modified maximum-likelihood estimation problem now becomes:

$\hat\lambda = \arg\max_{\lambda} \sum_{i=1}^{M} \sum_{w' \in b_i} s_i(w') \log \frac{\sum_{w \in b_i} p_c(w' \mid w) P(w)}{\sum_{w \in b_i} P(w)} \qquad (34)$
By following a procedure similar to the one described earlier in this section, we arrive at the lower bound on the difference:
$\sum_{i=1}^{M} \sum_{w' \in b_i} s_i(w') \sum_{w \in b_i} P(w \mid w'; u^{(j)}) \sum_{t=1}^{T} \frac{u_t^{(j)} \theta_t(w)}{P_{u^{(j)}}(w)} \log \frac{u_t}{u_t^{(j)}} + \sum_{i=1}^{M} \left(1 - \frac{\sum_{w \in b_i} P_u(w)}{\sum_{w \in b_i} P_{u^{(j)}}(w)}\right) \qquad (35)$
whose optimization results in the following update equation for $u_t$:
$u_t^{(j+1)} = u_t^{(j)} \cdot \frac{\sum_{i=1}^{M} \sum_{w' \in b_i} s_i(w') \sum_{w \in b_i} P(w \mid w'; u^{(j)}) \, \theta_t(w) / P_{u^{(j)}}(w)}{\sum_{i=1}^{M} \left(\sum_{w \in b_i} \theta_t(w)\right) / \left(\sum_{w \in b_i} P_{u^{(j)}}(w)\right)} \qquad (36)$
4.3 Maximum-Aposteriori Estimation of $\lambda$
Finally, we consider a MAP estimation of $\lambda$ with the confusion model. The optimization contains the additional term $(\alpha - 1) \sum_{t=1}^{T} \log \lambda_t$, and, in terms of the unnormalized weights $u$, the difference (20) acquires the extra component:

$(\alpha - 1) \sum_{t=1}^{T} \log \frac{\lambda_t}{\lambda_t^{(j)}} = (\alpha - 1) \sum_{t=1}^{T} \log \frac{u_t}{u_t^{(j)}} - T(\alpha - 1) \log \frac{Z}{Z^{(j)}} \qquad (37)$

where $Z = \sum_{t'} u_{t'}$ and $Z^{(j)} = \sum_{t'} u_{t'}^{(j)}$ are the corresponding normalizing constants. In order to obtain a lower bound on the difference in the second term of (37), we consider the following chain of equalities:

$-T(\alpha - 1) \log \frac{Z}{Z^{(j)}} = -T(\alpha - 1) \log \sum_{t=1}^{T} \frac{u_t^{(j)}}{Z^{(j)}} \cdot \frac{u_t}{u_t^{(j)}} = -T(\alpha - 1) \log \sum_{t=1}^{T} \lambda_t^{(j)} \frac{u_t}{u_t^{(j)}} \qquad (38)$
We distinguish two cases:
(i) Case 1: $\alpha < 1$. The coefficient $-T(\alpha - 1)$ is positive, so we apply Jensen’s inequality to obtain the following lower bound on (38):

$-T(\alpha - 1) \log \sum_{t=1}^{T} \lambda_t^{(j)} \frac{u_t}{u_t^{(j)}} \geq -T(\alpha - 1) \sum_{t=1}^{T} \lambda_t^{(j)} \log \frac{u_t}{u_t^{(j)}} \qquad (39)$
The lower bound (39) now becomes part of the lower bound in (30) (for the case of the 1-best) or the lower bound (35) (for the case of expected counts), and after differentiating with respect to $u_t$ and setting the result equal to zero we obtain the update equation:
$u_t^{(j+1)} = \frac{A_t + (\alpha - 1)\left(1 - T \lambda_t^{(j)}\right)}{\sum_{i=1}^{M} \left(\sum_{w \in b_i} \theta_t(w)\right) / \left(\sum_{w \in b_i} P_{u^{(j)}}(w)\right)} \qquad (40)$
where $A_t$ in (40) is equal to
$A_t = \sum_{i=1}^{M} \sum_{w \in b_i} P(w \mid \hat w_i; u^{(j)}) \frac{u_t^{(j)} \theta_t(w)}{P_{u^{(j)}}(w)} \qquad (41)$
in the case of using just the 1-best, and
$A_t = \sum_{i=1}^{M} \sum_{w' \in b_i} s_i(w') \sum_{w \in b_i} P(w \mid w'; u^{(j)}) \frac{u_t^{(j)} \theta_t(w)}{P_{u^{(j)}}(w)} \qquad (42)$
in the case of using the lattice.
(ii) Case 2: $\alpha \geq 1$. The coefficient $-T(\alpha - 1)$ is non-positive, so we apply the well-known inequality $\log x \leq x - 1$ and obtain the following lower bound on (38):

$-T(\alpha - 1) \log \sum_{t=1}^{T} \lambda_t^{(j)} \frac{u_t}{u_t^{(j)}} \geq T(\alpha - 1) \left(1 - \sum_{t=1}^{T} \lambda_t^{(j)} \frac{u_t}{u_t^{(j)}}\right) \qquad (43)$
As before, the lower bound (43) now becomes part of the lower bound in (30) (for the case of the 1-best) or the lower bound (35) (for the case of expected counts), and after differentiating with respect to $u_t$ and setting the result equal to zero we obtain the update equation:
$u_t^{(j+1)} = \frac{A_t + (\alpha - 1)}{\sum_{i=1}^{M} \left(\sum_{w \in b_i} \theta_t(w)\right) / \left(\sum_{w \in b_i} P_{u^{(j)}}(w)\right) + T(\alpha - 1) \lambda_t^{(j)} / u_t^{(j)}} \qquad (44)$
where $A_t$ in (44) is equal to (41) or (42), depending on whether the log-likelihood of the 1-best or the expected log-likelihood is maximized.
Summary:
We summarize our methods as graphical models in Figure 1. We estimate the topic distributions for a conversation by either self-training directly on the ASR output, or by including our ASR channel confusion model. For training on the ASR output, we rely on MLE 1-best training ((5), (6); left figure), or expected counts from the lattices (10). For both settings we also consider MAP estimation: 1-best (12) and expected counts (13). When using the ASR channel confusion model, we derived parallel cases for MLE 1-best ((32); middle figure) and expected counts (36), as well as MAP training (right figure) of each ((41), (42)).
5 Experimental Results
We compare self-training, as described in Section 3, with our confusion approach, as described in Section 4, on topic-based language model adaptation.
Setup
Speech data is taken from the Fisher telephone conversation speech corpus, which has been split into four parts. The first set is one hour of speech used for training the ASR system (acoustic modeling). We chose a small training set to simulate a high-WER condition (approx. 56%), since we are primarily interested in low-resource settings. While conversational speech is a challenging task even with many hours of training data, we are interested in settings with a tiny amount of training data, such as for new domains, languages or noise conditions. The second set, a superset of the first, contains 5.5 million words of manual transcripts, used for training the topic-based distributions $\theta_t$. Conversations in the Fisher corpus are labeled with 40 topics, so we create 40 topic-based unigram distributions. These are smoothed based on the vocabulary of the recognizer using the Witten-Bell algorithm [Chen and Goodman, 1996]. Set B, which consists of 50 hours and is disjoint from the other sets, is used as a development corpus for tuning the MAP parameter $\alpha$. Finally, set C (44 hours) is used as a blind test set. The ASR channel $p_c$ and the topic proportions $\lambda$ are learned in an unsupervised manner on both sets B and C. The results are reported on approximately 5-hour subsets of sets B and C, consisting of 35 conversations each.
BBN’s ASR system, Byblos, was used in all ASR experiments. It is a multipass LVCSR system that uses stateclustered Gaussian tiedmixture models at the triphone and quinphone levels [Prasad et al., 2005]. The audio features are transformed using cepstral normalization, HLDA and VTLN. Only ML estimation was used. Decoding performs three passes: a forward and backward pass with a triphone acoustic model (AM) and a 3gram language model (LM), and rescoring using quinphone AM and a 4gram LM. These three steps are repeated after speaker adaptation using CMLLR. The vocabulary of the recognizer is 75k words. References of the dev and test sets have vocabularies of 13k and 11k respectively.
Content Words
Our focus on estimating confusions suggests that improvements would be manifest for content words, as opposed to frequently occurring function words. This would be a highly desirable improvement as more accurate content words lead to improved readability or performance on downstream tasks, such as information retrieval and spoken term detection. As a result, we care more about reducing perplexity on these content words than reducing overall scores, which give too much emphasis to function words, the most frequent tokens in the reference transcriptions.
To measure content-word (low-frequency word) improvements we use the method of Wu and Khudanpur (2000), who compute a constrained version of perplexity focused on content words. We restrict the computation to only those words whose counts in the reference transcripts are at most equal to a threshold thr. Perplexity [Jelinek, 1997] is measured on the manual transcripts of both dev and test data based on the formula $\mathrm{PPL} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i)\right)$, where $P$ represents the estimated language model and $N$ is the size of the (dev or test) corpus. We emphasize that constrained perplexity is not an evaluation metric, and directly optimizing it would foolishly hurt overall perplexity. However, if overall perplexity remains unchanged, then improvements in content-word perplexity reflect a shift of the probability mass, emphasizing corrections in content words over the accuracy of function words, a sensible choice for improving output quality.
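The constrained perplexity computation described above can be sketched as follows (an illustrative sketch; the word-probability lookup is a hypothetical stand-in for a real language model):

```python
import math
from collections import Counter

def constrained_perplexity(ref_words, lm, thr):
    """Perplexity restricted to words whose count in the reference
    transcript is at most thr (the content-word metric described above).
    lm: word -> probability under the estimated language model."""
    counts = Counter(ref_words)
    selected = [w for w in ref_words if counts[w] <= thr]
    n = len(selected)
    if n == 0:
        return float("inf")   # no content words below the threshold
    return math.exp(-sum(math.log(lm[w]) for w in selected) / n)
```

With `thr` set above the maximum reference count, this reduces to ordinary corpus perplexity.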
Results
First we observe that overall perplexity (far right of Figure 2) remains unchanged; none of the differences between models with and without confusion estimates are statistically significant. However, as expected, the confusion estimates significantly improve the performance on content words (left and center of Figure 2). The confusion model gives modest (4-6% relative) but statistically significant gains in all conditions for content (low-frequency) words. Additionally, the MAP variant (which was tuned based on low-frequency words) gives gains over the MLE version in all conditions, for both the self-supervised and the confusion-model cases. This indicates that modeling confusions focuses improvements on content words, which improve readability and downstream applications.
ASR Improvements
Finally, we consider how our adapted language models can improve WER of the ASR system. We used each of the language models with the recognizer to produce transcripts of the dev and test sets. Overall, the best language model (including the confusion model) yielded no change in the overall WER (as we observed with perplexity). However, in a rescoring experiment, the adapted model with confusion estimates resulted in a 0.3% improvement in content WER (errors restricted to words that appear at most 3 times) over the unadapted model, and a 0.1% improvement over the regular adapted model. This confirms that our improved language models yield better recognition output, focused on improvements in content words.
6 Conclusion
We have presented a new model that captures the confusions (errors) of the ASR channel. When incorporated into adaptation of a topic-based language model, we observe improvements in the modeling of content words that improve readability and downstream applications. Our improvements are consistent across a number of settings, including 1-best and lattice self-training on conversational speech. Beyond improvements to language modeling, we believe that our confusion model can aid other speech tasks, such as topic classification. We plan to investigate other tasks, as well as better confusion models, in future work.
References
 [Bacchiani and Roark, 2003] M. Bacchiani and B. Roark. 2003. Unsupervised language model adaptation. In Proceedings of ICASSP2003, pages 224–227.
 [Bacchiani et al., 2004] M. Bacchiani, B. Roark, and M. Saraclar. 2004. Language model adaptation with MAP estimation and the perceptron algorithm. In Proceedings of HLT2004, pages 21–24.
 [Bellegarda, 2001] J. R. Bellegarda. 2001. An overview of statistical language model adaptation. In Proceedings of ISCA Tutorial and Research Workshop (ITRW), Adaptation Methods for Speech Recognition.
 [Blei et al., 2003] D. Blei, A. Ng, and M. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3, January.
 [Chen and Goodman, 1996] S. F. Chen and J. Goodman. 1996. An empirical study of smoothing techniques for language modeling. In Proceedings of the 34th Annual Meeting of the ACL, pages 310–318.
 [Cover and Thomas, 1996] T. M. Cover and J. A. Thomas. 1996. Elements of Information Theory. John Wiley and Sons, Inc.
 [Hazen and Richardson, 2008] T. J. Hazen and F. Richardson. 2008. A hybrid SVM/MCE training approach for vector space topic identification of spoken audio recordings. In Proceedings of Interspeech2008, pages 2542–2545.
 [Hofmann, 2001] T. Hofmann. 2001. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning, 42(1/2):177–196.
 [Hsu and Glass, 2006] BJ. Hsu and J. Glass. 2006. Style & topic language model adaptation using HMMLDA. In Proceedings of EMNLP2006.
 [Jelinek, 1997] F. Jelinek. 1997. Statistical Methods for Speech Recognition. MIT Press.
 [Karakos et al., 2011] D. Karakos, M. Dredze, K. Church, A. Jansen, and S. Khudanpur. 2011. Estimating document frequencies in a speech corpus. In Proceedings of ASRU2011.
 [Mangu et al., 1999] L. Mangu, E. Brill, and A. Stolcke. 1999. Finding consensus among words: Latticebased word error minimization. In Proceedings of Eurospeech1999.
 [Novotney et al., 2009] S. Novotney, R. Schwartz, and J. Ma. 2009. Unsupervised acoustic and language model training with small amounts of labelled data. In Proceedings of ICASSP2009.
 [Prasad et al., 2005] R. Prasad, S. Matsoukas, C.L. Kao, J. Z. Ma, D.X. Xu, T. Colthurst, O. Kimball, R. Schwartz, JL. Gauvain, L. Lamel, H. Schwenk, G. Adda, and F. Lefevre. 2005. The 2004 BBN/LIMSI 20xRT English conversational telephone speech recognition system. In Proceedings of Interspeech2005, pages 1645–1648.
 [Seymore and Rosenfeld, 1997] K. Seymore and R. Rosenfeld. 1997. Using story topics for language model adaptation. In Proceedings of Eurospeech1997.
 [Tam and Schultz, 2005] YC. Tam and T. Schultz. 2005. Dynamic language model adaptation using variational Bayes inference. In Proceedings of Eurospeech2005.
 [Wang and Stolcke, 2007] W. Wang and A. Stolcke. 2007. Integrating MAP, marginals, and unsupervised language model adaptation. In Proceedings of Interspeech2007, pages 618–621.
 [Wu and Khudanpur, 2000] J. Wu and S. Khudanpur. 2000. Maximum entropy techniques for exploiting syntactic, semantic and collocational dependencies in language modeling. Computer Speech and Language, 14:355–372.
 [Xu et al., 2009] P. Xu, D. Karakos, and S. Khudanpur. 2009. Selfsupervised discriminative training of statistical language models. In Proceedings of ASRU2009.