On Modeling ASR Word Confidence

Abstract

We present a new method for computing ASR word confidences that effectively mitigates the effect of ASR errors for diverse downstream applications, improves the word error rate of the 1-best result, and allows better comparison of scores across different models. We propose 1) a new method for modeling word confidence using a Heterogeneous Word Confusion Network (HWCN) that addresses some key flaws in conventional Word Confusion Networks, and 2) a new score calibration method for facilitating direct comparison of scores from different models. Using a bidirectional lattice recurrent neural network to compute the confidence scores of each word in the HWCN, we show that the word sequence with the best overall confidence is more accurate than the default 1-best result of the recognizer, and that the calibration method substantially improves the reliability of recognizer combination.

Woojay Jeon, Maxwell Jordan, and Mahesh Krishnamoorthy

Apple Inc., One Apple Park Way, Cupertino, California
{woojay,maxwell_jordan,maheshk}@apple.com

Index Terms: confidence, word confusion network, combination, lattice, RNN

1 Introduction

Automatic speech recognition (ASR) systems often output a confidence score [1] for each word in the recognized results. The confidence score is an estimate of how likely the word is to be correct, and can have diverse applications in tasks that consume ASR output.

We are particularly interested in “data-driven” confidence models [2] that are trained on recognition examples to learn systematic “mistakes” made by the speech recognizer and actively “correct” them. A major limitation of such confidence modeling methods in the literature is that they report only Equal Error Rate (EER) [1] or Normalized Cross Entropy (NCE) [3] results and do not investigate their impact on speech recognizer accuracy in terms of Word Error Rate (WER). Some past studies [4][5] have tried to improve the WER using confidence scores derived from the word lattice by purely mathematical methods, but to our knowledge no recent work in the literature on statistical confidence models has reported WER.

If a confidence score is true to its conceptual definition (the probability that a given word is correct), then it is natural to expect that the word sequence with the highest combined confidence should also be at least as accurate as the default 1-best result (obtained via the MAP decision rule). One reason that this may not easily hold true is that word confidence models, by design, often try to force all recognition hypotheses into a fixed segmentation in the form of a Word Confusion Network (WCN) [6][7][2]. While the original motivation of WCNs was to obtain a recognition result that is more consistent with the WER criterion, we argue that it often unnaturally decouples words from their linguistic or acoustic context and makes accurate model training difficult by introducing erroneous paths that are difficult to resolve.

Instead, we propose the use of a “Heterogeneous” Word Confusion Network (HWCN) for confidence modeling that can be interpreted as a representation of multiple WCNs for a given utterance. Although the HWCN’s structure itself is known, our interpretation of HWCNs and our application of them to data-driven confidence modeling is novel. We train a bidirectional lattice recurrent neural network [8][2] to obtain confidence values for every arc in the HWCN. Obtaining the sequence with the best confidence from this network results in better WER than the default 1-best. In addition, recognizing the need to directly compare confidence scores between different confidence models, we propose a non-parametric score calibration method that maps the scores to empirical probabilities, and show that it gives better accuracy when combining recognizers that have different confidence models.

2 Confidence modeling using Heterogeneous Word Confusion Networks

2.1 Defining confidence and addressing key flaws in WCNs

Word Confusion Networks (WCNs) [6] were originally proposed with the motivation of transforming speech recognition hypotheses into a simplified form where the Levenshtein distance can be approximated by a linear comparison of posterior probabilities, thereby optimizing the results for WER instead of Sentence Error Rate (SER).

Subsequent works on word confidence modeling [2] have based their models on WCNs because of their linear nature, which allows easy identification and comparison of competing words. However, WCNs are fundamentally flawed in that they force all hypothesized word sequences to share the same time segmentation, even in cases where the segmentation is clearly different.

Consider the (contrived) word hypothesis lattice in Fig.1 with 1-best sequence “I will sit there.” A corresponding WCN is in Fig.2, obtained by aligning all possible sequences with the 1-best sequence.

Let us define the confidence for a word $w$ in a time segment $\tau_i$ given acoustic features $\mathbf{x}$ as

$c(w, \tau_i) \triangleq P(w \text{ is correct in } \tau_i \mid \mathbf{x}) \qquad (1)$

where $\tau_i = (b_i, e_i)$ is a tuple of start and end times of the $i$'th slot, drawn from a finite set of time segments, and the set $T = \{\tau_1, \ldots, \tau_N\}$ is a full description of the time segmentation of the WCN in Fig. 2.

For a sequence of words $w_1, \ldots, w_N$ in the time segments $\tau_1, \ldots, \tau_N$, consider a random variable $C_i \in \{0, 1\}$ denoting the number of correct words in each slot $i$ with time segment $\tau_i$ containing $w_i$. The expectation of $C_i$ is directly the confidence probability:

$E[C_i] = 1 \cdot c(w_i, \tau_i) + 0 \cdot \{1 - c(w_i, \tau_i)\} = c(w_i, \tau_i) \qquad (2)$

Let the random variable $C = \sum_{i=1}^{N} C_i$ denote the total number of correct words in the sequence. In a manner similar to [6], we approximate the WER as the ratio between the expected number of incorrect words and the total number of words. Since $E[C] = \sum_{i=1}^{N} E[C_i]$, we have

$\mathrm{WER} \approx \frac{N - E[C]}{N} = 1 - \frac{1}{N} \sum_{i=1}^{N} c(w_i, \tau_i) \qquad (3)$

Hence, the WER can be minimized by finding, for each slot $i$, the word $w$ with the highest $c(w, \tau_i)$.
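To make the minimization of Eq. (3) concrete, here is a minimal Python sketch of consensus decoding over a single WCN; the slot data structure and the confidence values are hypothetical and for illustration only.

```python
# A minimal sketch of consensus decoding over one WCN (Eq. 3).
# `wcn` is an assumed structure: a list of slots, where each slot maps
# candidate words to their confidences c(w, tau_i).

def consensus_decode(wcn):
    """Pick the highest-confidence word in each slot, which minimizes
    the approximate WER of Eq. (3)."""
    return [max(slot, key=slot.get) for slot in wcn]

# Example: the WCN of Fig. 2 (scores are made up for illustration).
wcn = [
    {"I": 0.5, "it": 0.2, "I'll": 0.2, "aisle": 0.1},
    {"will": 0.7, "eps": 0.3},
    {"sit": 0.4, "seat": 0.3, "simmer": 0.3},
    {"there": 0.6, "here": 0.4},
]
print(consensus_decode(wcn))  # ['I', 'will', 'sit', 'there']
```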

Figure 1: Word hypothesis lattice where each arc is labeled with a word and a number indicating the point in time (in frames) where the word ends. The best path “I will sit there” is marked in blue.
Figure 2: Word confusion network (WCN) corresponding to Fig. 1, where the time segments (in frame ranges) are assigned based on the best path.
Figure 3: Four WCNs extracted from Fig. 1 that represent all possible word sequences without requiring forced slotting or eps insertion. Each WCN $k$ has a time segmentation $T_k$, where each time segment is represented by $\tau_{k,i} = (b_{k,i}, e_{k,i})$, with $b_{k,i}$ and $e_{k,i}$ denoting the start and end frames of the $i$'th slot.

However, in Fig. 2, all sequences are required to follow the same time segmentation as the 1-best result, so “I’ll” and “aisle” have been unnaturally forced into the same slot as “it” and “I”, even though they actually occupy a much greater length of time extending into the second slot. Also, in order to encode the hypotheses “I’ll sit there” and “aisle seat here”, an epsilon “skip” label “eps” had to be added to the second slot. Such heuristics add unnecessary ambiguity to the data that makes it difficult to model. Furthermore, confidences like $c(\text{“aisle”}, \tau_1)$ are actually 0, since “aisle” does not even fit into $\tau_1$, so no meaningful score can be assigned to “aisle” even though it is a legitimate hypothesis.

2.2 Mitigating flaws in a WCN by using multiple WCNs

To address the aforementioned problems, let us consider deriving multiple WCNs from the lattice. Fig.3 shows four different WCNs that represent all possible sequences in Fig.1, but without any unnaturally-forced slotting or epsilon insertion.

Each WCN $k \in \{1, 2, 3, 4\}$ has a unique segmentation $T_k$ with length $N_k$, but note that many of the time segments are shared across segmentations (several of the $\tau_{k,i}$ in Fig. 3 coincide). For every $k$, the WER for a sequence of words $w_1, \ldots, w_{N_k}$ is

$\mathrm{WER}_k \approx 1 - \frac{1}{N_k} \sum_{i=1}^{N_k} c(w_i, \tau_{k,i}) \qquad (4)$

To find the best word sequence, we simply look for the sequence of words across all four segmentations with the lowest WER:

$\hat{w}_1, \ldots, \hat{w}_{N_k} = \operatorname*{arg\,min}_{k,\, w_1, \ldots, w_{N_k}} \mathrm{WER}_k \qquad (5)$

The WER in (4) and (5) is more sensible than the WER in (3) because it no longer contains invalid probabilities like $c(\text{“aisle”}, \tau_1)$ nor extraneous probabilities like $c(\text{“eps”}, \tau_2)$. At the same time, it still retains the basic motivation behind WCNs of approximating Levenshtein distances with linear comparisons.

2.3 The Heterogeneous Word Confusion Network

We now propose using a “Heterogeneous” Word Confusion Network (HWCN). The HWCN is derived from the word hypothesis lattice by 1) merging all nodes with similar times (e.g. within a tolerance of 100ms), and then 2) merging all competing arcs (arcs sharing the same start and end node) with the same word identity.

The HWCN corresponding to the lattice in Fig.1 is shown in Fig.4. Since the destination nodes of “it” and “I” have the same time (12), the two nodes are merged into one. The destination nodes of the two “will” arcs are also merged, making the two arcs competing arcs. Since they have the same word, the two arcs are merged.
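As a rough illustration of this construction, the sketch below builds an HWCN from a list of lattice arcs. The data layout (`arcs` as tuples, `node_time` mapping node ids to frame times) and the greedy time clustering are our own assumptions, not the exact production algorithm; score merging (Eqs. 6–8) is deferred to Sec. 2.3.

```python
# A sketch of HWCN construction from a word lattice (assumed data layout).
from collections import defaultdict

def build_hwcn(arcs, node_time, tol_frames=10):
    # 1) Cluster nodes whose times fall within the tolerance (the paper
    #    suggests ~100 ms) and map each node to a cluster id.
    clusters = []   # list of (cluster start time, [member nodes])
    rep = {}        # node id -> cluster id
    for node in sorted(node_time, key=node_time.get):
        t = node_time[node]
        if clusters and t - clusters[-1][0] <= tol_frames:
            clusters[-1][1].append(node)
        else:
            clusters.append((t, [node]))
        rep[node] = len(clusters) - 1

    # 2) Merge competing arcs (same merged endpoints) that share a word.
    merged = defaultdict(list)
    for start, end, word in arcs:
        merged[(rep[start], rep[end], word)].append((start, end, word))
    return merged  # each key is one HWCN arc; values are the lattice arcs it absorbs
```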

This sort of partially-merged network has been used in previous systems in different contexts [4], but not with data-driven confidence models. Most importantly, the HWCN is in fact a representation of the four separate WCNs in Fig. 3. Fig. 5 shows how the first and second WCNs of Fig. 3, with segmentations $T_1$ and $T_2$, respectively, are encoded inside the HWCN. The third and fourth WCNs can be easily identified in a similar manner. The shared time segments of Fig. 3 are also fully represented in the HWCN.

Note that there are some extraneous paths in the HWCN that are not present in the lattice. For example, word sequences like “I will simmer” or “aisle sit there” can occur. On the other hand, if what the speaker actually said was “I will sit here”, which is not a possible path in the lattice, the HWCN has a chance to correct egregious errors in the recognizer’s language model and provide the correct transcription.

Now, if we train a word confidence model to compute the scores in Eq. (1) for every arc in the HWCN, the WER in Eq. (4) can be minimized by finding the sequence in the HWCN with the highest mean word confidence, per Eq. (5), via dynamic programming.
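Because the mean confidence is not additive over arcs, a plain shortest-path search does not apply directly. One workable formulation, sketched below under our own assumptions about the data layout, is dynamic programming over (node, path length) pairs: track the best confidence sum for each path length and divide by length at the final node.

```python
# A sketch of the best-mean-confidence search over an HWCN (Eqs. 4-5).
def best_mean_confidence_path(arcs, source, sink):
    # arcs: list of (start, end, word, confidence) over a DAG whose node
    # ids are assumed to already be in topological order.
    best = {source: {0: (0.0, None)}}  # best[node][k] = (max sum, backpointer)
    for s, e, word, conf in sorted(arcs, key=lambda a: a[0]):
        for k, (total, _) in list(best.get(s, {}).items()):
            cand = total + conf
            slot = best.setdefault(e, {})
            if k + 1 not in slot or cand > slot[k + 1][0]:
                slot[k + 1] = (cand, (s, k, word))
    # Pick the path length whose confidence sum gives the highest mean.
    k_best = max(best[sink], key=lambda k: best[sink][k][0] / k)
    words, node, k = [], sink, k_best
    while k > 0:  # backtrace to recover the word sequence
        _, (prev, k_prev, word) = best[node][k]
        words.append(word)
        node, k = prev, k_prev
    return best[sink][k_best][0] / k_best, words[::-1]
```

Since the HWCN is a small DAG with bounded depth, the extra path-length dimension keeps the search exact at modest cost.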

When merging arcs, such as the two “will” arcs in Fig. 1, we must define how their scores will be merged, as these scores will be used as features for the confidence model. Assume $M$ arcs $a_1, \ldots, a_M$ to be merged, all with the same word, start time, and end time, and consuming the same acoustic features $\mathbf{x}_a$. Each $a_m$ starts at node $s_m$ and ends at $e_m$, and has an arc posterior probability $P(a_m \mid \mathbf{x})$, acoustic likelihood $p(\mathbf{x}_a \mid a_m)$, and transitional (language & pronunciation model) probability $p_t(a_m)$. We want to merge the start nodes into one node $s$, the end nodes into one node $e$, and the arcs into one arc $a$. The problem is to compute $P(a \mid \mathbf{x})$, $p(\mathbf{x}_a \mid a)$, and $p_t(a)$.

Figure 4: The Heterogeneous Word Confusion Network corresponding to Fig. 1. Nodes with the same times have been merged, and competing arcs with the same word (“will”) have been merged.
Figure 5: Illustration of how the HWCN is actually an encoding of the individual WCNs in Fig. 3. (Top) Shown in red, the first WCN (with segmentation $T_1$) of Fig. 3 is encoded in the upper part of the HWCN. (Bottom) Similarly, the second WCN (with segmentation $T_2$) is encoded.

If $a$ conceptually represents the union of the arcs $a_1, \ldots, a_M$, the posterior of the merged arc is the sum of the posteriors of the individual arcs. This is because we only merge competing arcs (after node merging), and there is no way to traverse two or more competing arcs simultaneously (e.g. the two “will” arcs in Fig. 1), so the traversals of such arcs are always disjoint events. The acoustic scores of the original arcs should be very similar since their words are the same (but may have different pronunciations) and occur at the same time, so we can approximate the acoustic score of the merged arc as the mean of the individual acoustic scores. Hence,

$P(a \mid \mathbf{x}) = \sum_{m=1}^{M} P(a_m \mid \mathbf{x}), \qquad p(\mathbf{x}_a \mid a) \approx \frac{1}{M} \sum_{m=1}^{M} p(\mathbf{x}_a \mid a_m) \qquad (6)$

The transitional score of the merged arc can be written as

$p_t(a) = P(a \mid s) = \sum_{m=1}^{M} P(a_m \mid s_m) P(s_m \mid s) \qquad (7)$

It is easy to see that the first term in the summation, $P(a_m \mid s_m)$, is $p_t(a_m)$. As for the second term, if $s$ represents the union of the nodes $s_1, \ldots, s_M$, we have

$P(s_m \mid s) = \frac{P(s_m)}{\sum_{m'=1}^{M} P(s_{m'})} \qquad (8)$

where $P(s_m)$ is the prior transitional probability of node $s_m$ that can be obtained by a lattice-forward algorithm on the transitional scores in the lattice.
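A minimal sketch of this score merging, assuming linear-domain probabilities (a real decoder would typically work in the log domain), with the start-node priors supplied by the lattice-forward pass:

```python
# A sketch of the arc-score merging of Eqs. (6)-(8); data layout assumed.
def merge_arcs(posteriors, acoustics, transitionals, start_node_priors):
    post = sum(posteriors)                    # Eq. (6): disjoint events
    acoust = sum(acoustics) / len(acoustics)  # Eq. (6): mean approximation
    z = sum(start_node_priors)
    trans = sum(                              # Eqs. (7)-(8)
        p_t * (prior / z)
        for p_t, prior in zip(transitionals, start_node_priors)
    )
    return post, acoust, trans
```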

When labeling the arcs of the HWCN as correct (1) or incorrect (0) for model training, we align the 1-best sequence with the reference sequence. Then, for any arc in the 1-best that has one or more competing arcs, we label each competing arc as 1 if its word matches the corresponding reference word or 0 if it does not. All other arcs in the HWCN are labeled 0. If there is no match between the 1-best and the reference sequence, all arcs are labeled 0.
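In code, the labeling rule might look as follows. The `levenshtein_align` helper and the data layout are hypothetical stand-ins, not the paper's implementation.

```python
# A sketch of the arc-labeling rule for training (assumed data layout).
def label_hwcn_arcs(hwcn, one_best, reference, levenshtein_align):
    # hwcn: {arc_id: (word, start_node, end_node)}
    # one_best: arc_ids along the 1-best path, in order
    # levenshtein_align(hyp_words, ref_words) -> aligned reference word per
    # hypothesis position (None where the hypothesis word is an insertion)
    hyp_words = [hwcn[a][0] for a in one_best]
    aligned = levenshtein_align(hyp_words, reference)
    labels = {arc_id: 0 for arc_id in hwcn}  # default: incorrect
    for arc_id, ref_word in zip(one_best, aligned):
        if ref_word is None:
            continue
        _, s, e = hwcn[arc_id]
        for other_id, (word, s2, e2) in hwcn.items():
            if (s2, e2) == (s, e):  # competing arc of this 1-best arc
                labels[other_id] = 1 if word == ref_word else 0
    return labels
```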

3 Confidence score calibration

While our abstract definition of confidence is Eq. (1), the confidence score $x$ from model $\lambda$ is actually an estimate of the confidence according to that model, i.e., $x = P(w \text{ is correct in } \tau \mid \mathbf{x}, \lambda)$, where $\lambda$ denotes the parameters of the speech recognizer and the RNN confidence model [2] trained on HWCNs. In general, probability estimates computed from two different statistical models cannot be directly compared with each other. This poses a problem when we want to combine the results from multiple recognizers for personalized ASR [9] or when using the same downstream natural language processor with different ASR systems.

It is easy to prove that when the confidence scores are the true probabilities of the words being correct, a system that uses multiple classifiers and chooses the result with the highest score will always be at least as accurate overall as the best individual classifier. Consider $K$ classifiers, where classifier $k$ outputs results with confidence $x_k$ and hence has expected accuracy $E[x_k]$ (since $x_k$ itself is the probability of being correct). The combined classifier has confidence $\max_k x_k$, with expected accuracy $E[\max_k x_k] \ge E[x_j]$ for all $j$, so it is at least as good as the best classifier. When the confidence scores are not the true probabilities, however, this guarantee no longer holds, and system combination may actually degrade results.

In this work, we propose a data-driven calibration method that is based on distributions of the training data but does not require heuristic choices of histogram bin boundaries [10] nor assume monotonicity of the transformation [11].

Given a confidence score $x$, the problem is to compute $P(H_1 \mid x)$, where $H_1$ is the event that the result is correct. We also define $H_0$ as the event that the result is wrong, and write

$P(H_1 \mid x) = \frac{p(x \mid H_1) P(H_1)}{p(x \mid H_1) P(H_1) + p(x \mid H_0) P(H_0)} \qquad (9)$

The priors $P(H_1)$ and $P(H_0)$ can be estimated using the counts $n_1$ and $n_0$ of correctly- and incorrectly-recognized words, respectively, over the training data, i.e., $P(H_1) \approx n_1 / (n_1 + n_0)$ and $P(H_0) \approx n_0 / (n_1 + n_0)$.

We use the fact that the probability distribution function is the derivative of the cumulative distribution function (CDF):

$p(x \mid H_1) = \frac{d}{dx} P(X \le x \mid H_1) \qquad (10)$

One immediately recognizes that the CDF is in fact the miss probability of the detector at threshold $x$:

$P(X \le x \mid H_1) = P_{\mathrm{miss}}(x) \qquad (11)$

$P_{\mathrm{miss}}(x)$ can be empirically estimated by counting the number of positive samples in the training data that have scores less than $x$:

$P_{\mathrm{miss}}(x) \approx \frac{1}{|S_1|} \sum_{i \in S_1} u(x - x_i) \qquad (12)$

where $S_1$ is the set of (indices of) positive training samples, $x_i$ is the confidence score of the $i$'th training sample, and $u(\cdot)$ is a step function with value 1 when its argument is nonnegative and 0 otherwise. In order to be able to take the derivative, we approximate $u(x - x_i)$ by a sigmoid function $\sigma(\cdot)$ controlled by a scale factor $\alpha$:

$u(x - x_i) \approx \sigma(\alpha(x - x_i)) = \frac{1}{1 + e^{-\alpha(x - x_i)}} \qquad (13)$

This lets us solve Eq. (10) to obtain

$p(x \mid H_1) \approx \frac{\alpha}{|S_1|} \sum_{i \in S_1} \sigma(\alpha(x - x_i)) \left\{ 1 - \sigma(\alpha(x - x_i)) \right\} \qquad (14)$

Likewise, we can see that $p(x \mid H_0)$ is the negative derivative of the false alarm probability of the detector at threshold $x$:

$p(x \mid H_0) = -\frac{d}{dx} P_{\mathrm{fa}}(x), \qquad P_{\mathrm{fa}}(x) = P(X > x \mid H_0) \qquad (15)$

which leads to

$p(x \mid H_0) \approx \frac{\alpha}{|S_0|} \sum_{i \in S_0} \sigma(\alpha(x - x_i)) \left\{ 1 - \sigma(\alpha(x - x_i)) \right\} \qquad (16)$

where $S_0$ is the set of (indices of) negative training samples. Applying Eqs. (14) and (16) to Eq. (9) now gives us a closed-form solution for transforming the confidence score $x$ into a calibrated probability $P(H_1 \mid x)$, where the only manually-tuned parameter is $\alpha$. In our experiments, we chose the $\alpha$ that worked best on the development data.
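Putting Eqs. (9), (14), and (16) together, the calibration has a direct implementation. The sketch below assumes NumPy arrays of training scores and is not the exact production code; the small epsilon guarding the denominator is our addition.

```python
# A sketch of the proposed non-parametric calibration (Eqs. 9, 14, 16).
import numpy as np

def calibrate(x, pos_scores, neg_scores, alpha):
    # pos_scores / neg_scores: 1-D arrays of training confidences for
    # correctly / incorrectly recognized words (the sets S_1 and S_0).
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def density(scores):
        # Eqs. (14)/(16): derivative of the sigmoid-smoothed empirical CDF.
        s = sigmoid(alpha * (x - scores))
        return alpha * np.mean(s * (1.0 - s))

    n1, n0 = len(pos_scores), len(neg_scores)
    p_h1, p_h0 = n1 / (n1 + n0), n0 / (n1 + n0)  # priors from counts
    num = density(pos_scores) * p_h1
    den = num + density(neg_scores) * p_h0
    return num / (den + 1e-12)                   # Eq. (9)
```

Calibrated scores from different models then live on a common empirical-probability scale, which is what makes the max-confidence combination rule in Sec. 4 meaningful.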

4 Experiment

We took 9 different U.S.-English speech recognizers used at different times in the past for the Apple personal assistant Siri, each with its own vocabulary, acoustic model, and language model, and trained lattice-RNN confidence models [2] on the labeled HWCNs of 83,860 training utterances. For every speech model set, we trained a range of confidence models – each with a single hidden layer with 20 to 40 nodes and arc state vectors with 80 to 200 dimensions – and took the model with the best EER on the development data of 38,127 utterances. The optimization criterion was the mean cross entropy error over the training data. For each arc, the features included 25 GloVe Twitter [12] word embedding features, a binary silence indicator, the number of phones in the word, the transitional score, the acoustic score, the arc posterior, the number of frames consumed by the arc, and a binary feature indicating whether the arc is in the 1-best path or not. The evaluation data was 38,110 utterances.

To evaluate detection accuracy, we compare the confidence scores with the arc posteriors obtained from lattice forward-backward computation and merged via Eq.(6). Tab.1 shows the EER and NCE values, computed using the posteriors and confidences on the labeled HWCNs on the evaluation data.

Next, we measure the WER when using the path with the maximum mean word confidence (as noted in Eq. (5) and Sec. 2.3, such a path minimizes the estimated WER). Tab. 2 shows that the WER decreases for every recognizer in this case, compared to using the default 1-best result. The WER decrease is marginal in some cases, but the fact that all 9 recognizers improve suggests the effect is statistically significant.

Finally, we assess the impact of the score calibration in Sec. 3. There are $2^9 - 9 - 1 = 502$ possible combinations of two or more recognizers from the nine shown in Tab. 2. Combination of recognizers is done by obtaining each recognizer's result (via best-mean-confidence search on its HWCN) and choosing the result with the highest mean word confidence.

Method EER (%) NCE
Arc Posterior from Forward-Backward 4.23 0.868
Proposed Confidence 3.42 0.621
Table 1: EER and NCE for the arc posterior from lattice-forward-backward, and the proposed confidence measure on evaluation data.
1 2 3 4 5 6 7 8 9
B 12.26 6.93 4.95 4.88 4.80 4.88 10.92 6.81 6.46
P 12.00 6.84 4.90 4.85 4.76 4.84 10.86 6.75 6.43
Table 2: WER (%) of nine recognizers on the evaluation data. The Baseline (B) uses the default 1-best result obtained from the MAP decision rule, while the Proposed (P) uses the word sequence with the maximum mean word confidence.
Raw Calibrated
No. of times better 110 (21.9%) 502 (100%)
No. of times worse 392 (78.1%) 0 (0%)
Table 3: Impact of score calibration on system combination. Out of a total 502 experiments, we counted the number of times the combined system did better or worse than the best individual system when using the raw confidence and the calibrated confidence.
Recognizers Combined | Best Indiv. WER (%) | Raw Conf. WER (%) | Calib. Conf. WER (%)
2, 7 6.84 (no.2) 7.24 6.80
3, 4, 5, 6, 9 4.76 (no.5) 4.39 4.63
1, 5, 6, 8, 9 4.76 (no.5) 5.33 4.61
Table 4: Sample recognizer combination results from the experiment in Tab.3, showing the best individual WER among the recognizers combined, the WER when combining using raw confidences, and the WER when combining using calibrated confidences.

We performed all combinations, and counted the number of times the WER of the combined system was better than the best individual WER of the recognizers used in each combination.

Tab. 3 shows that when using the “raw” confidence scores from the RNN model, in most cases (78.1%) the combined recognizer had higher WER than the best individual recognizer. When the calibrated scores proposed in Sec. 3 are used, however, the combined system beat the best recognizer in all trials. Tab. 4 shows some example combination results. Anecdotally, we found the WER from raw confidences tends to have higher variance than the WER from calibrated confidences. The raw scores sometimes give very accurate results, but Tab. 3 shows the calibrated scores give improvements much more consistently.

5 Conclusion and future work

We have proposed a method for modeling word confidence using Heterogeneous Word Confusion Networks and showed that it achieves better detection accuracy than lattice arc posteriors while also improving the WER compared to the 1-best result from the MAP decision rule. We have also proposed a method for calibrating the confidence scores so that scores from different models can be better compared, and demonstrated the efficacy of the method using system combination experiments.

Future work could address one shortcoming of the proposed model: there is no normalization of the confidence scores to ensure that $\sum_{w} c(w, \tau) \le 1$ for any given time segment $\tau$.

The authors thank Rogier van Dalen, Steve Young, and Melvyn Hunt for helpful comments.

References

  • [1] F. Wessel, R. Schluter, K. Macherey, and H. Ney, “Confidence measures for large vocabulary continuous speech recognition,” IEEE Transactions on Speech and Audio Processing, vol. 9, no. 3, pp. 288–298, March 2001.
  • [2] Q. Li, P. M. Ness, A. Ragni, and M. J. F. Gales, “Bi-directional lattice recurrent neural networks for confidence estimation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019, pp. 6755–6759.
  • [3] M.-H. Siu, H. Gish, and F. Richardson, “Improved estimation, evaluation and applications of confidence measures for speech recognition,” in Proc. EUROSPEECH, Jan. 1997.
  • [4] F. Wessel, R. Schluter, and H. Ney, “Using posterior word probabilities for improved speech recognition,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), June 2000, vol. 3, pp. 1587–1590.
  • [5] G. Evermann and P. C. Woodland, “Large vocabulary decoding and confidence estimation using word posterior probabilities,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), June 2000, vol. 3, pp. 1655–1658.
  • [6] L. Mangu, E. Brill, and A. Stolcke, “Finding consensus among words: lattice-based word error minimisation,” Computer Speech and Language, pp. 373–400, 2000.
  • [7] D. Hakkani-Tur and G. Riccardi, “A general algorithm for word graph matrix decomposition,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), April 2003, vol. 1, pp. I–I.
  • [8] F. Ladhak, A. Gandhe, M. Dreyer, L. Mathias, A. Rastrow, and B. Hoffmeister, “LatticeRNN: Recurrent neural networks over lattices,” in Proc. INTERSPEECH, 2016, pp. 695–699.
  • [9] M. Paulik, H. Mason, and M. Seigel, “Privacy preserving distributed evaluation framework for embedded personalized systems,” United States Patent 9,972,304 B2, 2018.
  • [10] B. Zadrozny and C. Elkan, “Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers,” in Proc. International Conference on Machine Learning, 2001, pp. 609–616.
  • [11] B. Zadrozny and C. Elkan, “Transforming classifier scores into accurate multiclass probability estimates,” in Proc. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002, pp. 694–699.
  • [12] J. Pennington, R. Socher, and C. D. Manning, “GloVe: Global vectors for word representation,” in Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.