Exploring Methods for the Automatic Detection of Errors in Manual Transcription



Data quality plays an important role in most deep learning tasks. In the speech community, transcription of speech recordings is indispensable. Since transcriptions are usually produced manually, automatically finding errors in them not only saves time and labor but also benefits tasks that train on the data. Inspired by the success of hybrid automatic speech recognition, which uses both a language model and an acoustic model, this work explores two approaches to automatic error detection in transcriptions. We review a previous approach based on a biased language model, which relies on a strong transcription-dependent language model. We then propose a novel acoustic-model-based approach that focuses on the phonetic sequence of the speech. Both methods are evaluated on a real dataset that was originally transcribed with errors and afterwards strictly corrected by hand.


Xiaofei Wang, Jinyi Yang, Ruizhi Li, Samik Sadhu, Hynek Hermansky

Center for Language and Speech Processing, The Johns Hopkins University, USA

Human Language Technology Center of Excellence, The Johns Hopkins University, USA

xiaofeiwang@jhu.edu, hynek@jhu.edu

Index Terms: Transcription error detection, phoneme posteriors, forced alignment, biased language model.

1 Introduction

Current automatic speech recognition (ASR) systems are trained on large amounts of data consisting of speech recordings and their corresponding transcriptions. Human transcribers are usually employed to write down the text while listening to pre-recorded audio with the help of dedicated software or platforms. There is a trade-off between transcription efficiency and accuracy [1]. On the one hand, skilled transcribers are in great demand, so the cost of employing them is relatively high. On the other hand, even a highly skilled transcriber cannot guarantee 100% transcription accuracy. It is therefore necessary to have automatic post-processing techniques that detect errors in manual transcriptions so that they can be corrected and the dataset made accurate enough for the corresponding task. Since errors are rare relative to the whole dataset, reliable automatic detection techniques can help locate them quickly, reducing the cost in both labor and time [2].

The most straightforward way to detect the errors is to use a well-trained ASR system and look for mismatches between its decoding results and the transcription [3][4]. Since the recognizer is usually imperfect, this judgment is not always reliable. The recognizer should therefore be exploited more thoroughly.

In conventional hybrid ASR, the acoustic model (AM) represents the relationship between an audio signal and the phonemes that make up speech, and the language model (LM) provides the context needed to distinguish between words and phrases that sound similar [5]. Both contribute to finding the best hypotheses for an audio signal, and both can help to detect errors. On the language model side, a biased LM, built for each individual utterance and augmented with the most frequent words in the training set, strongly biases the decoding results towards the transcription unless the transcription contains errors [6][7]. On the acoustic model side, a Viterbi alignment can be derived from the given transcription [8], and the mean and variance of the best-path log-likelihood have been used as confidence measures to decide whether the transcription has errors [9]. That approach considers only what the transcription asserts, whereas the classifier output, which may reflect the ground truth, is also informative [10][11].

In this work, given a well-trained hybrid ASR system, speech recordings, and the corresponding transcriptions, we investigate both the linguistic and the phonetic contributions to detecting errors in the transcription. The novel element is the use of the mismatch between the alignment and the posteriors from the classifier. The remainder of this paper is organized as follows: Section 2 describes the methods for detecting erroneous transcriptions, Section 3 presents the experiments and results, and Section 4 concludes.

2 Methods

Figure 1: Posteriorgrams of forced alignment (top) and uniform phoneme graph (middle) as well as their computed KL-divergences (bottom). (a) Insertion transcription error; (b) Substitution transcription error; (c) Correct transcription. Note: the transcription errors are highlighted in red color.

In this section, we describe two methods for detecting erroneous transcriptions, based on the language model and the acoustic model, respectively. The former emphasizes the word-level mismatch between the transcription and the full decoding results, while the latter pays more attention to phoneme-level information, which preserves the acoustic details of the speech.

2.1 Using Language Model

2.1.1 Biased Language Model

A biased language model trained on imperfect transcriptions has been used in lightly supervised training of ASR systems. The term “biased” refers to building a separate uni-gram LM for each utterance, in which most words come only from the utterance's own transcription and the rest are the most frequent words from the training transcriptions. Intuitively, the prediction from the biased LM strongly agrees with the ASR decoding result when the manual transcription has no errors.
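
As a rough illustration of the idea, the sketch below builds a per-utterance uni-gram distribution and interpolates it with a uni-gram over the most frequent training words. The function name `biased_unigram`, the interpolation weight, and the toy data are assumptions made for this example; they are not taken from the systems in [6][7].

```python
from collections import Counter


def biased_unigram(transcript, train_counts, top_k=100, weight=0.9):
    """Interpolate an utterance-specific uni-gram with a frequent-word uni-gram.

    `weight` (mass given to the utterance's own words) and `top_k` are
    hypothetical settings chosen only for illustration.
    """
    utt_counts = Counter(transcript.split())
    if not utt_counts:
        raise ValueError("transcript must contain at least one word")
    utt_total = sum(utt_counts.values())

    # Keep only the top_k most frequent words of the training transcriptions.
    frequent = dict(train_counts.most_common(top_k))
    freq_total = sum(frequent.values())

    vocab = set(utt_counts) | set(frequent)
    return {
        w: weight * utt_counts.get(w, 0) / utt_total
        + (1.0 - weight) * frequent.get(w, 0) / freq_total
        for w in vocab
    }


# Toy training counts and one utterance transcription.
train_counts = Counter(
    "turn on the light turn off the light play the music".split()
)
lm = biased_unigram("turn on the radio", train_counts, top_k=3)
```

Words that occur only in the utterance's transcription (here "radio") receive far more probability than frequent training words that the transcription does not contain, which is what pulls decoding towards the transcription.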

2.1.2 Detection using biased LM

As in our previous study, we decode each utterance using its biased LM together with a well-trained acoustic model, generating a lattice rather than only the best path [6]. Given the lattice, we compute the lattice oracle word error rate, which is based on the minimum Levenshtein edit distance between any path in the lattice and the transcription [7]. It is expected to be low if the words of the transcription appear in the lattice.
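
The oracle error rate can be sketched as follows. A real lattice encodes its paths compactly and is searched with dedicated tools; here, as a simplifying assumption, the lattice is replaced by an explicit list of candidate word sequences, and the helper names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def oracle_wer(transcript, paths):
    """Smallest edit distance over all candidate paths, normalized by the
    number of words in the transcription."""
    ref = transcript.split()
    best = min(edit_distance(ref, p.split()) for p in paths)
    return best / max(len(ref), 1)


# The transcription itself is among the candidates, so the oracle WER is zero.
clean = oracle_wer("open the front door",
                   ["open the door", "open the front door"])
# No candidate matches exactly; the closest one differs by a single word.
noisy = oracle_wer("a b c d", ["a b x d", "a d"])
```

An utterance whose oracle error rate stays above a chosen threshold is then a candidate for containing a transcription error.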

2.2 Using Acoustic Model

2.2.1 Phoneme based forced alignment

Since we already have the transcription and a well-trained Hidden Markov Model (HMM) based hybrid ASR system, it is straightforward to obtain the alignment of each utterance, which represents the sequence of HMM states. The alignment is derived by aligning the reference transcript using the Viterbi (best-path) approach. To focus on phonemes, we merge the context-dependent tri-phone state posteriors into monophone probabilities and perform forced alignment at the monophone level.

The phoneme-based forced alignment reflects what the transcription asserts, regardless of whether the transcription is correct. It establishes a frame-wise relationship between time and the phonemes of the transcription, given as follows:

$$\mathbf{a}_t = [a_t^1, a_t^2, \ldots, a_t^K]^{\top},$$

where $K$ is the number of phonemes and only one element of $\mathbf{a}_t$ equals 1, owing to the best-path property of the forced alignment, which is obtained in a supervised manner. $\top$ denotes the transpose operation.
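
The frame-wise alignment can be pictured as a matrix with one row per frame and one column per phoneme, each row containing a single 1. The sketch below builds such a matrix from a list of per-frame phoneme indices; the function name and the toy indices are assumptions for illustration.

```python
import numpy as np


def alignment_to_one_hot(phone_ids, num_phonemes):
    """Turn per-frame phoneme indices into a (frames x phonemes) 0/1 matrix."""
    phone_ids = np.asarray(phone_ids)
    one_hot = np.zeros((len(phone_ids), num_phonemes))
    # Row t gets a 1 in the column of the phoneme aligned to frame t.
    one_hot[np.arange(len(phone_ids)), phone_ids] = 1.0
    return one_hot


# Five frames aligned to phonemes 0, 0, 2, 2, 1 out of a 3-phoneme inventory.
align = alignment_to_one_hot([0, 0, 2, 2, 1], num_phonemes=3)
```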

2.2.2 Phoneme based posterior probabilities

The posterior probabilities from the classifier always yield hypotheses based on the acoustic features frame-by-frame, regardless of what the transcription is. Even though errors might exist in the alignment, the hypotheses can reflect the ground truth if the acoustic model is good.

For each frame, the posterior probabilities of the HMM tri-phone states can be obtained from the output layer of the DNN-based classifier, which uses a soft-max non-linearity. As with the forced alignment, the posteriors of the HMM states can be merged into phoneme posteriors according to their mappings. The frame-based distribution of posteriors is defined as follows:

$$\mathbf{p}_t = [p_t^1, p_t^2, \ldots, p_t^K]^{\top},$$

where $p_t^k = P(k \mid \mathbf{x}_t)$ is the posterior probability of phoneme $k$ given the feature sequence $\mathbf{x}_t$ extracted from the signal at time $t$, and $K$ is the number of phonemes.
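
Merging state posteriors into phoneme posteriors amounts to summing, for every frame, the posteriors of all states that map to the same phoneme. A minimal sketch, assuming a hypothetical `state_to_phone` index array and toy posteriors:

```python
import numpy as np


def merge_state_posteriors(state_post, state_to_phone, num_phonemes):
    """Sum state posteriors (frames x states) into phoneme posteriors."""
    state_post = np.asarray(state_post, dtype=float)
    state_to_phone = np.asarray(state_to_phone)
    num_states = state_post.shape[1]

    # 0/1 matrix with mapping[s, k] = 1 when state s belongs to phoneme k.
    mapping = np.zeros((num_states, num_phonemes))
    mapping[np.arange(num_states), state_to_phone] = 1.0
    return state_post @ mapping


# Two frames, four states; states 0 and 1 both belong to phoneme 0.
state_post = [[0.1, 0.2, 0.3, 0.4],
              [0.5, 0.25, 0.25, 0.0]]
phone_post = merge_state_posteriors(state_post, [0, 0, 1, 2], num_phonemes=3)
```

Because every state maps to exactly one phoneme, each row of the result still sums to one.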

In general, posteriors will sum to one but have very small values for most of the phonemes. Thus, to focus on the most significant phoneme sequences that the classifier predicts, a uni-gram phonotactic graph is built instead to generate the phone lattices, yielding a phoneme recognizer, with each phoneme having equal prior probability. Essentially, the phoneme recognizer plays a role in emphasizing the posteriors while pruning the less important sequences.

2.2.3 Comparison between the forced alignment and classifier posteriors

In the AM-based approach, the frame-based alignments represent what the transcription asserts (which may contain errors), while the posteriors serve as hypotheses. Ideally, for each utterance, the transcription is correct when the two reveal consistent phoneme sequences.

To demonstrate the diversity of phonemes, we test the idea on Mandarin speech, in which most phonemes carry different tones. Fig.1(a) and Fig.1(b) are examples in which the transcription has an insertion error and a substitution error, respectively. The upper two sub-figures show probability distributions over the phone set sampled with a 10 ms frame length, indicating how likely it is that a certain phone was uttered at a given time.

In Fig.1(a), the forced alignment has an extra syllable "d e1" (the red character, between 2.0s and 2.5s) caused by the transcription error, while the posteriors show the correct answer. In Fig.1(b), the last character of the speech recording (2.5s to 3.0s) is "q i2 aa2 uu2", but it was manually transcribed as "j i1 aa1 uu1", which is detected as well. With a well-trained AM, it is therefore feasible to detect such errors in the transcriptions by comparing the alignment with the posteriors.

2.2.4 Kullback-Leibler (KL) divergence-based erroneous transcription detection

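The KL-divergence measures how one probability distribution differs from a reference probability distribution. Given the forced alignment and the phoneme posteriors, we calculate the frame-based symmetric KL-divergence between them as follows:

$$D_t = \mathrm{KL}(\mathbf{a}_t \,\|\, \mathbf{p}_t) + \mathrm{KL}(\mathbf{p}_t \,\|\, \mathbf{a}_t),$$

where $\mathrm{KL}(\mathbf{x} \,\|\, \mathbf{y}) = \sum_{k=1}^{K} x^k \log (x^k / y^k)$ is taken over the $K$ phonemes, and $\mathbf{a}_t$ and $\mathbf{p}_t$ denote the forced-alignment and phoneme-posterior vectors at frame $t$.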
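
A frame-wise symmetric KL-divergence can be sketched as below, taking the symmetric form to be the sum of the two directed divergences. Because the alignment assigns zero probability to all but one phoneme, some flooring is needed to keep the logarithms finite; the floor value and the renormalization used here are assumptions of this sketch, not details given in the text.

```python
import numpy as np


def symmetric_kl(align, post, floor=1e-6):
    """Per-frame KL(a || p) + KL(p || a) for (frames x phonemes) arrays.

    `floor` is a hypothetical smoothing constant that avoids log(0).
    """
    a = np.clip(np.asarray(align, dtype=float), floor, None)
    p = np.clip(np.asarray(post, dtype=float), floor, None)
    # Renormalize so each row is again a probability distribution.
    a /= a.sum(axis=1, keepdims=True)
    p /= p.sum(axis=1, keepdims=True)
    # sum a*log(a/p) + sum p*log(p/a) = sum (a - p) * log(a/p)
    return np.sum((a - p) * (np.log(a) - np.log(p)), axis=1)


align = [[1, 0, 0], [1, 0, 0]]
post = [[0.9, 0.05, 0.05],    # classifier agrees with the alignment
        [0.05, 0.9, 0.05]]    # classifier prefers a different phoneme
div = symmetric_kl(align, post)
```

The second frame, where the classifier disagrees with the alignment, yields a much larger divergence than the first.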

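In general, the boundaries of each estimated phoneme are blurred and not well aligned. To solve this problem, we follow the strategy in [12], which uses a soft alignment. At each frame $t$, a context of $2N+1$ frames centered on $t$ is used, and the value at the center frame is replaced by the output of a median filter over that context, yielding a smoother output:

$$\hat{D}_t = \mathrm{median}(D_{t-N}, \ldots, D_t, \ldots, D_{t+N}),$$

where $D_t$ is the frame-based KL-divergence at frame $t$.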

The bottom sub-figures in Fig.1 show the frame-by-frame unsmoothed and smoothed KL-divergence. The smoothing strategy overcomes the blurred-boundary problem. After smoothing, peaks can be observed where the transcription has errors, while a correct transcription yields a small range of KL-divergence values, as shown in Fig.1(c). In this work, we use the standard deviation (std) of the KL-divergence over the whole utterance as a confidence measure to determine whether the utterance has transcription errors. A larger std indicates a higher likelihood that the transcription contains errors.
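
The smoothing and the utterance-level score can be sketched as follows. The 15-frame window matches the setting reported in Section 3.2; the edge padding at the utterance boundaries and the function names are assumptions of this sketch.

```python
import numpy as np


def median_smooth(values, width=15):
    """Centered median filter; `width` must be odd."""
    values = np.asarray(values, dtype=float)
    half = width // 2
    # Repeat the first and last values so every frame has a full window
    # (boundary handling is an assumption, not specified in the text).
    padded = np.pad(values, half, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, width)
    return np.median(windows, axis=1)


def utterance_score(kl_per_frame, width=15):
    """Standard deviation of the smoothed KL-divergence over an utterance."""
    return float(np.std(median_smooth(kl_per_frame, width)))


flat = np.full(50, 0.2)        # consistently small divergence
spike = flat.copy()
spike[10] = 9.0                # isolated one-frame boundary glitch
peaked = flat.copy()
peaked[20:35] = 5.0            # sustained mismatch over 15 frames

scores = [utterance_score(x) for x in (flat, spike, peaked)]
```

The isolated one-frame glitch is removed by the median filter and leaves the score near zero, whereas the sustained mismatch survives smoothing and raises the standard deviation.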

3 Experiment setup and Results

3.1 Experiment setup

The evaluation set consists of 11,903 sentences (14.8 hours) of Mandarin recordings with transcriptions manually edited by skilled transcribers. Of these, 1,810 sentences had transcription errors, including insertions, deletions, and substitutions, whose distribution is shown in Fig.2. After a manual correction process, all the errors were strictly fixed to match the audio. We thus obtained ground truths that can be used to evaluate automatic detection schemes. The speech recordings, sampled at 16 kHz, are close-talking smart-device commands and daily conversation sentences, acquired using high-fidelity microphones. In this paper, 708.2 hours of training data covering the topics of the evaluation set were used to train the hybrid ASR system used for detecting erroneous transcriptions.

Figure 2: Distribution of the transcription errors. Compared to the correct transcriptions, total sentence error rate is and word error rate is .

The experiments were performed using the Kaldi speech recognition toolkit [13]. A triphone HMM-GMM system with speaker-adaptive training was used to generate the alignments and to train the neural network acoustic model. The HMM was trained on 122 position-dependent phonemes, comprising 3 silence phonemes and 119 non-silence phonemes. A time delay neural network (TDNN) based AM was trained on 40-dimensional MFCCs and 100-dimensional i-vectors, and a general tri-gram word LM was trained using the lexicon from the training data. We tested both evaluation sets using the well-trained ASR system. Table 1 shows the word error rate (WER) on the evaluation sets with and without transcription errors. The transcription errors lead to a slight performance degradation because they supply wrong references for scoring the ASR output.

Model / WER (%)    w/ trans. errors    w/o trans. errors
HMM-GMM            15.94               15.18
HMM-TDNN           11.59               10.67
Table 1: Word Error Rate (WER%) on hybrid ASR system consisting of AM and general tri-gram word LM.

3.2 Forced alignment and phoneme posteriors

The phoneme-based forced alignment of each utterance was generated by the HMM-GMM AM and converted to posteriors in which each frame has a single phoneme with unit posterior. We used a uniform phonetic LM and the HMM-TDNN AM to obtain the phoneme posteriors. For the KL-divergence calculation, we used a 15-frame median filter to compute the smoothed score between the alignment and the phoneme posteriors.

3.3 Biased ASR system

We followed the procedures used in [6][7] to build the biased ASR system utterance by utterance. First, a four-gram unmodified Kneser-Ney interpolated language model was built from the transcript of each utterance. This language model was then interpolated with a uni-gram language model estimated from counts of the 100 most frequent words in the whole training set. Note that the uni-gram language model allows the decoding process to predict word sequences that differ from the transcription, so the decoded lattice is more likely to include paths that are close to the actual speech.

3.4 Results

Figure 3: The DET curves of different approaches in detecting the transcription errors.

Method     General ASR    Biased LM    KL-div std
EER (%)    38.62          31.95        34.11
Table 2: Equal Error Rate (EER%) using different approaches.

The Detection Error Tradeoff (DET) curve is used to derive the Equal Error Rate (EER), which is the point where the false positive rate equals the false negative rate. A lower EER indicates better performance. The DET curves of the three compared approaches in Fig.3 are derived using the following setups.

General ASR: As a baseline, we decode each utterance in the evaluation set using the HMM-TDNN AM and the tri-gram word LM. The WER of each utterance is compared with a threshold ranging from 0 to 1.

Biased ASR: Each utterance is decoded using the HMM-TDNN AM and the biased LM. The WER of each utterance is compared with a threshold ranging from 0 to 1.

KL-div std: The standard deviation of the smoothed KL-divergence is calculated for each utterance. The std score is compared with a threshold ranging from 0 to 20.
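
Sweeping a threshold over per-utterance scores and reading off the EER can be sketched as follows. It assumes that a higher score (per-utterance WER or KL-divergence std) means "more likely erroneous", that both classes are present, and that the EER is approximated by the threshold at which the two error rates are closest; the function name and toy data are hypothetical.

```python
import numpy as np


def equal_error_rate(scores, labels):
    """Approximate EER; labels are True for utterances with real errors."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)

    best_gap, eer = np.inf, None
    # Candidate thresholds: below every score, then each distinct score.
    for th in np.concatenate(([-np.inf], np.unique(scores))):
        flagged = scores > th
        fpr = np.mean(flagged[~labels])   # correct transcriptions flagged
        fnr = np.mean(~flagged[labels])   # erroneous transcriptions missed
        gap = abs(fpr - fnr)
        if gap < best_gap:
            best_gap, eer = gap, (fpr + fnr) / 2
    return float(eer)


# Scores that separate the two classes perfectly give an EER of zero.
separable = equal_error_rate([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
# One correct and one erroneous utterance swap places in the ranking.
overlapping = equal_error_rate([0.1, 0.6, 0.4, 0.9], [0, 0, 1, 1])
```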

The Biased ASR and KL-div std approaches show comparable performance in the DET curves of Fig.3. Their EERs are 31.95% and 34.11%, respectively, both better than the baseline general ASR approach at 38.62% (Table 2). One advantage of the proposed KL-div std method is that it is much more efficient, since it does not require the full-pass decoding used by the other two approaches, which is helpful for very large amounts of data.

The shapes of the DET curves in Fig.3 also indicate how to select a detection scheme for a particular task. If the goal is to find more erroneous transcriptions so that the overall transcription accuracy of the dataset reaches a target level (usually over 95%-98% for commercial use), KL-div std would be the best choice, whereas the biased LM approach is preferable if higher recall accuracy is desired.

One difficulty with transcriptions of Mandarin speech is that the language has many homonyms. To the best of our knowledge, neither method can handle this kind of problem, which makes the EER higher than it would be for other languages.

4 Conclusions

This work presented two alternative strategies for automatically detecting errors in manual speech transcriptions. From a linguistic view, we used a biased language model to find differences between the hypotheses and the transcription. In addition, we proposed using the mismatch between the forced alignment and the posteriors of the acoustic classifier to detect phonetic errors, which is more intuitive. On a real Mandarin dataset with a realistic distribution of manual transcription errors, these two utterance-based measures showed comparable ability to detect the transcription errors. Since the two measures are complementary, approaching the problem from the linguistic and the phonetic view of speech respectively, finding a way to combine them would be interesting future work.

5 Acknowledgements

This work is supported by a gift to Hynek Hermansky from Beijing Magic Data Technology Co., Ltd., a Google faculty award to Hynek Hermansky, and the National Science Foundation under Grant No. 1704170. We would like to thank Beijing Magic Data Technology Co., Ltd. for providing the real dataset for this work.


  • [1] T. J. Hazen, “Automatic alignment and error correction of human generated transcripts for long speech recordings,” in Ninth International Conference on Spoken Language Processing, 2006.
  • [2] C. Barras, E. Geoffrois, Z. Wu, and M. Liberman, “Transcriber: development and use of a tool for assisting speech corpora production,” Speech Communication, vol. 33, no. 1-2, pp. 5–22, 2001.
  • [3] Y. Tian, Y. Gong, and F. K. Soong, “Speech model refinement with transcription error detection,” Dec. 28 2010, US Patent 7,860,716.
  • [4] J. Li, Y. Gong, C. Liu, and K. Yao, “Model training for automatic speech recognition from imperfect transcription data,” Mar. 8 2016, US Patent 9,280,969.
  • [5] L. Deng, J. Li, J.-T. Huang, K. Yao, D. Yu, F. Seide, M. Seltzer, G. Zweig, X. He, J. Williams et al., “Recent advances in deep learning for speech research at Microsoft,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2013, pp. 8604–8608.
  • [6] V. Peddinti, V. Manohar, Y. Wang, D. Povey, and S. Khudanpur, “Far-field ASR without parallel data,” in Interspeech, 2016, pp. 1996–2000.
  • [7] J. Yang, L. Ondel, V. Manohar, and H. Hermansky, “Towards automatic methods to detect errors in transcriptions of speech recordings,” in 2019 IEEE International Conference on Acoustics, Speech and Signal Processing, 2019.
  • [8] M. Gubian, B. Schuppler, J. van Doremalen, E. Sanders, and L. Boves, “Novelty detection as a tool for detection of orthographic transcription errors,” 2009.
  • [9] S. Tanamala, J. J. Prakash, and H. A. Murthy, “A semi-automatic method for transcription error correction for Indian language TTS systems,” in 2017 Twenty-third National Conference on Communications (NCC). IEEE, 2017, pp. 1–6.
  • [10] X. Wang, R. Li, and H. Hermansky, “Stream attention for distributed multi-microphone speech recognition,” in Interspeech, 2018, pp. 3033–3037.
  • [11] X. Wang, R. Li, S. H. Mallidi, T. Hori, S. Watanabe, and H. Hermansky, “Stream attention-based multi-array end-to-end speech recognition,” arXiv preprint arXiv:1811.04903, 2018, to appear in ICASSP 2019.
  • [12] L. Burget, P. Schwarz, P. Matejka, M. Hannemann, A. Rastrow, C. White, S. Khudanpur, H. Hermansky, and J. Cernocky, “Combination of strongly and weakly constrained recognizers for reliable detection of OOVs,” in 2008 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2008, pp. 4081–4084.
  • [13] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., “The Kaldi speech recognition toolkit,” IEEE Signal Processing Society, Tech. Rep., 2011.