The 2015 Sheffield System for Transcription of Multi–Genre Broadcast Media
We describe the University of Sheffield system for participation in the 2015 Multi–Genre Broadcast (MGB) challenge task of transcribing multi–genre broadcast shows. Transcription was one of four tasks proposed in the MGB challenge, with the aim of advancing the state of the art of automatic speech recognition, speaker diarisation and automatic alignment of subtitles for broadcast media. Four topics are investigated in this work: Data selection techniques for training with unreliable data, automatic speech segmentation of broadcast media shows, acoustic modelling and adaptation in highly variable environments, and language modelling of multi–genre shows. The final system operates in multiple passes, using an initial unadapted decoding stage to refine segmentation, followed by three adapted passes: a hybrid DNN pass with input features normalised by speaker–based cepstral normalisation, another hybrid stage with input features normalised by speaker feature–MLLR transformations, and finally a bottleneck–based tandem stage with noise and speaker factorisation. The combination of these three system outputs provides a final error rate of 27.5% on the official development set, consisting of 47 multi–genre shows.
Oscar Saz, Mortaza Doulaty, Salil Deena, Rosanna Milner,
Raymond W.M. Ng, Madina Hasan, Yulan Liu, Thomas Hain
1 Introduction
Audio-visual media is an area of high interest for research in a variety of topics related to computer vision, speech processing and natural language processing. The ability to search vast media archives, browse through thousands of hours of recordings or structure the complete resources of a media company would significantly increase the efficiency of these organisations and the services provided to the end users.
From the point of view of Automatic Speech Recognition (ASR), work on transcription of broadcast news has achieved significant reductions in error rates since the early work in the 1990s [Woodland97, Gauvain02], with word error rates falling below 10% for traditional broadcast news programmes [Gales06]. However, other types of broadcast media shows have not been so widely explored. The transcription of multi-genre data is a complex task due to the large amounts of variability arising from multiple, diverse speakers, the variety of acoustic and recording conditions and the lexical and linguistic diversity of the topics covered [Lanchantin13].
Evaluations of technology covering different aspects of research in audio-visual media have been a major driver behind some of the most recently achieved results in audio-visual media processing. The MediaEval evaluation campaign [MediaEval] has brought together researchers from many areas to work in automatic classification and retrieval of broadcast data. Evaluation series such as the NIST-organised Hub4 tasks [Hub4] helped start the earlier efforts in broadcast news transcriptions in English, while the Topic Detection and Tracking (TDT) campaign [TDT] expanded this work to other tasks related to broadcast news. More recently, the Ester campaigns [Ester] have created increased interest in the transcription of French broadcast news and the Albayzin campaigns [Albayzin] have pushed the efforts in audio processing of Spanish broadcast news.
Following these efforts, the Multi-Genre Broadcast (MGB) challenge [MGB] aimed to take on several tasks of increasing complexity in broadcast media. This work addresses that aim with advances in several areas of ASR and their application in a fully functional system for Task 1 of the MGB challenge: Speech-to-text transcription of broadcast television.
The rest of the paper is organised as follows: Section 2 describes the experimental setup. Section 3 explains data selection techniques used for acoustic model training. Section 4 introduces new procedures for improved automatic segmentation for ASR. Sections 5 and 6 describe different approaches for acoustic model adaptation and language modelling adaptation for multi-genre shows. Section 7 outlines the final system. Overall results are presented in Section 8. Finally, Section 9 discusses outcomes and concludes the paper.
2 MGB Challenge - Task 1
The MGB challenge 2015 consisted of four different tasks, covering the topics of multi-genre broadcast show transcription, lightly supervised alignment, longitudinal broadcast transcription and longitudinal speaker diarisation. The focus of this work was on Task 1: Speech-to-text transcription of broadcast television, although aspects of the system presented here were used in submissions to other challenge tasks. A full description of this and the other tasks in the challenge can be found in [MGB], but a brief description of the task is given here.
Participation in this task required the automatic transcription of a set of shows broadcast by the British Broadcasting Corporation (BBC). These shows were chosen to cover the multiple genres in broadcast TV, categorised in terms of 8 genres: advice, children’s, comedy, competition, documentary, drama, events and news. Acoustic Model (AM) training data was fixed and limited to more than 2,000 shows, broadcast by the BBC during 6 weeks in April and May of 2008. The development data for the task consisted of 47 shows that were broadcast by the BBC during a week in mid-May 2008. The numbers of shows and the associated broadcast time for training and development data are shown in Table 1.
Additional data was available for Language Model (LM) training in the form of subtitles from shows broadcast from 1979 to March 2008, with a total of 650 million words, and referred to as . The subtitles from the 2,000+ shows for acoustic modelling could also be used for LM training, referred to as . Statistics for these 2 sets can be seen in Table 2.
2.1 Common system description
Throughout this work, two different types of systems were used. This section describes the fundamental features of both; specific variations are described later in the paper where individual experiments require them.
The first type of system was a DNN-HMM system, built using the Kaldi toolkit [Kaldi]. It was based on a Deep Neural Network (DNN) whose input was 5 contiguous spliced frames of Perceptual Linear Prediction (PLP) features of dimensions. Features were obtained by using a linear discriminant analysis transformation of 117 spliced PLP features (from 13 dimensions with a context of 4 frames to the left and right of the middle frame), followed by a global CMLLR transform. Features were transformed using a boosted Maximum Mutual Information (bMMI) discriminative transformation [PoveyKKRSV08], unless otherwise stated. DNNs consisted of 6 hidden layers of 2,048 neurons, and an output layer of 6,478 triphone state targets. State-level Minimum Bayes Risk (sMBR) [KingsburySS12, gibson_is06] was used as the target function, unless otherwise mentioned, and Stochastic Gradient Descent (SGD) was used as the optimisation method. Decoding with systems was performed in two stages; in the first stage, lattices were generated using a highly pruned 3-gram, and afterwards the lattices were rescored using a complete 4-gram and the 1-best obtained and scored using the official MGB scoring package.
The second type of system was a so-called DNN-GMM-HMM system, built using the TNet toolkit [TNet] for DNN training and the HTK toolkit [HTK] for Gaussian Mixture Model (GMM) and Hidden Markov Model (HMM) training and decoding. systems used a DNN as a front-end for extracting a set of 26 bottleneck features. Such DNNs took as input 15 contiguous log-filterbank frames and consisted of 4 hidden layers of 1,745 neurons plus the 26-neuron bottleneck layer, and an output layer of 8,000 triphone state targets. sMBR was used for training, unless otherwise stated. Feature vectors for training the GMM-HMM systems were 65-dimensional, including the 26-dimensional bottleneck features, as well as 13-dimensional PLP features together with their first and second derivatives. GMM-HMM models were trained using 16 Gaussian components per state, and around 8k distinct triphone states. Decoding with systems was also performed in two stages; in the first stage, lattices were generated using a 2-gram, and afterwards these lattices were rescored using a 4-gram and the 1-best obtained and scored with the official MGB scoring package.
All decoding experiments were performed using a 50,000-word vocabulary, constructed from the most frequent words in the subtitles as provided for language model training. Pronunciations were obtained using the Combilex pronunciation dictionary [Combilex], which was provided to the challenge participants. When a word was not contained in the lexicon, automatically generated pronunciations were obtained using the Phonetisaurus toolkit [Phonetisaurus]. These pronunciations were expanded to incorporate pronunciation probabilities, learnt from the alignment of the AM training data [Hain05]. Unless otherwise stated, language models used were obtained by interpolation of several language models trained with the and language model data from Table 2. LM training was performed with the SRILM toolkit [SRILM].
3 Data Selection and Training
One of the main difficulties for transcription in the MGB challenge was the efficient use of the acoustic training data provided, as the use of prior models or other data was not allowed. The transcription of the training data was not created for ASR training purposes. Only the subtitle text broadcast with each show could be used, which is of varying quality for a variety of reasons. An aligned version of the subtitles was provided where the time stamps of the subtitles had been corrected in a lightly supervised manner [MGB, Long13]. After this process, 1,196.73 hours of speech were left available for training.
The provided transcripts for the training shows were unreliable in two ways: First, the subtitle text might not always match the actual spoken words; and second, the time boundaries given might have errors arising from the lightly supervised alignment process. This work did not aim to improve on the second aspect; instead, it studied how to perform data selection in order to train with those segments with the most accurate transcripts. An initial selection strategy was based on selecting segments for training according to their Word Matching Error Rate (WMER), a by-product of the lightly supervised alignment process that measures how closely the text in the subtitle matched the output of a lightly supervised ASR system for that segment [MGB, Long13].
A more complex selection strategy was designed using confidence scores for each segment. The scores were obtained from the posterior probabilities given by a 4-layer DNN trained on the initial selection of data whose targets were 144 monophone states [Zhang14]. The inputs to this DNN were 15 contiguous log-Mel-filter-bank frames, and each hidden layer had 1,745 neurons. For each segment in the training set, the monophone state sequence was obtained using forced alignment, and the segment-based confidence measure was calculated as the average of the logarithmic posteriors of each frame for its corresponding monophone state, excluding silence areas.
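The segment-level confidence measure described above can be illustrated with a minimal sketch. It assumes that the DNN posteriors are available as one list of per-state probabilities per frame and that forced alignment has already produced one state index per frame; the function name, the data layout and the flooring constant are illustrative assumptions, not part of the original system.

```python
import math


def segment_confidence(frame_posteriors, aligned_states, silence_states):
    """Average log posterior of the force-aligned state, skipping silence frames.

    frame_posteriors: list of per-frame lists, indexed by monophone state id.
    aligned_states:   list of state ids from forced alignment, one per frame.
    silence_states:   set of state ids treated as silence (hypothetical ids).
    """
    total = 0.0
    count = 0
    for posteriors, state in zip(frame_posteriors, aligned_states):
        if state in silence_states:
            continue  # silence areas are excluded from the average
        # Floor the posterior to avoid log(0); the floor value is an assumption.
        total += math.log(max(posteriors[state], 1e-10))
        count += 1
    # A segment with no non-silence frames gets the lowest possible score.
    return total / count if count else float("-inf")


# Toy example with 3 states, where state 0 plays the role of silence.
posteriors = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.2, 0.3, 0.5]]
alignment = [0, 1, 2]
confidence = segment_confidence(posteriors, alignment, silence_states={0})
```

Segments would then be kept for training when this score exceeds a chosen threshold.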
Two different training data setups arose from these two strategies: , which contained 512.6 hours of speech segments with WMER of 40% or less; and , which contained 698.9 hours of speech segments with confidence score above . The amount of data per genre in each data training definition can be found in Table 3.
Both training strategies were evaluated on the and systems, as defined in Section 2.1, in this case using Cross-Entropy (CE) training [Hinton12]. Recognition experiments were performed on the manual segmentation available for the development data, with the Word Error Rate (WER) results shown in Table 4. The results indicate that there is a 1% absolute improvement from using instead of , although the gain might have been due mainly to the extra 180 hours of data included in . The gain was independent of the system setup, and was achieved in both and systems.
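For reference, the WER figures quoted throughout can be understood through the following minimal sketch of the standard metric: the word-level edit distance between reference and hypothesis, divided by the number of reference words. This is an illustration only; the official MGB scoring package may apply text normalisation and other rules that are not reproduced here.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```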
4 Automatic Segmentation
Automatic speech segmentation is a very important aspect in automatic processing of broadcast media, where the presence of music, applause, laughter and other background sounds can significantly degrade the ability to detect sections containing speech. Errors in segmentation can then propagate as ASR errors in regions of undetected speech or those where speech was incorrectly detected. In this work, a multi-stage automatic segmentation procedure is introduced: an initial segmentation based on DNN posteriors is subsequently improved using the output of an ASR system.
Neural networks (NNs) have been used extensively for speech segmentation of meetings [dines_is06, Hain2012] and naturally DNNs are equally useful for this task [Ryant13]. The neural networks are trained to classify each frame into one of two classes, one corresponding to speech being present and the other representing speech not being present. One of the challenges in this work’s setup was, as seen in the previous section, the unreliability of the data and the requirement to have efficient data selection strategies. Two strategies were tested to cope with the issue. In the first one, , all acoustic training data available were used for training the DNN, and the originally defined segments were force-aligned to determine which areas were speech and which areas were non-speech. All audio that was not assigned to a speech segment in the original segments was labelled as non-speech. The second strategy, , took the 512.5 hours from the data selection strategy, as defined in Section 3, and used forced alignment to label areas as speech and non-speech, without adding any extra non-speech areas. The amount of training data can be seen in Table 5.
The segmentation DNN provided, for any given audio input, the estimated values of the posterior probabilities of speech or non-speech for each frame. A two-state HMM was used to smooth this sequence of posteriors into a sequence of valid speech segments, with an extra 0.25 seconds added at the beginning and the end of each speech segment. This, with either of the strategies or , gave the initial segmentation used for recognition in the first pass.
With the output of decoding based on the original segmentations, a refinement stage was performed as follows. Confidence measures based on the posteriors of a 144-monophone-target DNN were obtained for each word in the hypothesis, as seen for acoustic data selection in Section 3. Then, the raw confidence scores were mapped using a decision tree trained on the development data, using decision targets that were either if the word was in an area of speech as defined in the reference segmentation, or if the word was in an area of non-speech. The features to the decision tree were the raw confidence score of each word, the confidence score of the segment, the length of the word (in seconds), the length of the word (in phonemes) and the length of the segment (in seconds). Once the confidences were calculated, words with a confidence score below a threshold were removed from the transcript. New segments were then defined around the remaining words.
The results of these systems are presented in Table 6, in terms of segmentation error: i.e. missed speech and false alarms, and WER for sMBR and systems trained on the data. Both DNN segmenters produced a significant degradation compared to the use of manually defined segments. However, was found to achieve a much larger false alarm rate than , possibly due to the unbalanced amount of data used for training . This made more suitable for the refinement stage, where areas of false speech detection could be pruned by the use of confidence measures in the ASR output. Table 6 shows how this refinement stage using ASR gave more than 1% absolute improvement over and , despite its segmentation error rate of 9.4%, similar to at 9.2%.
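The smoothing step can be sketched as Viterbi decoding over a two-state HMM followed by padding. The sketch below is an assumption-laden illustration rather than the system's implementation: the self-transition probability, the 10 ms frame shift, the direct use of posteriors as emission scores and the merging of overlapping padded segments are all choices made here for the example; only the two-state topology and the 0.25 second padding come from the text.

```python
import math


def smooth_to_segments(speech_post, stay_prob=0.99, frame_shift=0.01, pad=0.25):
    """Viterbi-smooth per-frame speech posteriors into padded speech segments.

    speech_post: list of P(speech) per frame. State 0 = non-speech, 1 = speech.
    Returns a list of (start_seconds, end_seconds) tuples.
    """
    if not speech_post:
        return []
    eps = 1e-10
    log_stay = math.log(stay_prob)
    log_switch = math.log(1.0 - stay_prob)

    def emit(p, state):
        # Posterior used directly as emission score (a simplifying assumption).
        return math.log(max(p if state == 1 else 1.0 - p, eps))

    score = [emit(speech_post[0], 0), emit(speech_post[0], 1)]
    backpointers = []
    for p in speech_post[1:]:
        new_score, pointers = [], []
        for state in (0, 1):
            stay = score[state] + log_stay
            switch = score[1 - state] + log_switch
            if stay >= switch:
                best, prev_state = stay, state
            else:
                best, prev_state = switch, 1 - state
            new_score.append(best + emit(p, state))
            pointers.append(prev_state)
        score = new_score
        backpointers.append(pointers)

    # Trace back the best state sequence.
    state = 1 if score[1] > score[0] else 0
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()

    # Convert runs of speech frames into padded segments, merging overlaps.
    total = len(path) * frame_shift
    segments, start = [], None
    for i, state in enumerate(path + [0]):  # sentinel closes a trailing run
        if state == 1 and start is None:
            start = i
        elif state == 0 and start is not None:
            seg_start = max(0.0, start * frame_shift - pad)
            seg_end = min(total, i * frame_shift + pad)
            if segments and seg_start <= segments[-1][1]:
                segments[-1] = (segments[-1][0], seg_end)
            else:
                segments.append((seg_start, seg_end))
            start = None
    return segments


# 0.5 s non-speech, 1 s speech, 0.5 s non-speech, with two isolated outliers.
posts = [0.1] * 50 + [0.9] * 100 + [0.1] * 50
posts[20] = 0.9   # spurious speech frame, smoothed away
posts[100] = 0.1  # dropout inside speech, smoothed away
segments = smooth_to_segments(posts)
```

With a high self-transition probability, isolated frames do not justify the cost of two state changes, so short bursts and dropouts are absorbed into the surrounding region.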
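The final step of the refinement, removing low-confidence words and rebuilding segments around those that remain, can be sketched as follows. The text does not specify the threshold or how neighbouring words are grouped, so the `threshold` and `max_gap` parameters below are hypothetical; the confidences are assumed to be the scores after mapping by the decision tree.

```python
def resegment(words, threshold=0.5, max_gap=0.5):
    """Rebuild speech segments from confidently recognised words.

    words:     list of (start_seconds, end_seconds, confidence) tuples.
    threshold: words below this confidence are discarded (assumed value).
    max_gap:   words closer than this are grouped together (assumed value).
    """
    kept = sorted((w for w in words if w[2] >= threshold), key=lambda w: w[0])
    segments = []
    for start, end, _ in kept:
        if segments and start - segments[-1][1] <= max_gap:
            # Close enough to the previous word: extend the current segment.
            segments[-1][1] = max(segments[-1][1], end)
        else:
            segments.append([start, end])
    return [tuple(segment) for segment in segments]


hypothesis = [
    (0.0, 0.4, 0.9),
    (0.5, 0.9, 0.8),
    (1.0, 1.3, 0.1),  # low confidence, e.g. a word hypothesised over music
    (3.0, 3.5, 0.7),
]
new_segments = resegment(hypothesis)
```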
5 Acoustic Background Modelling
Tackling acoustic variability is one of the main issues arising for multi-genre broadcast transcription. The presence of a large variety of possible recording conditions and acoustic background environments presents a real challenge for ASR systems. In this work, two approaches to compensating for such variability were studied. The first aimed to normalise the background variability in the input to DNNs for hybrid systems, while the second one aimed to use asynchronous Constrained Maximum Likelihood Linear Regression (aCMLLR) transformations [Saz13] for the compensation of dynamic background noises in bottleneck systems.
5.1 Domain adaptation of hybrid systems
Adaptation of DNN-based ASR systems is currently one of the most extensively researched areas of speech recognition technology. While several approaches have been evaluated in the past, the normalisation of the input features is most commonly employed. For example, for speaker adaptation, this has been done by directly transforming the input features via feature MLLR (fMLLR) transformations [Gales1998] or by using additional input features representing some characteristic of the speaker, like i-Vectors [Karanasou15, Liu15].
Latent Dirichlet Allocation (LDA) models have been recently used to model hidden acoustic categories in audio data. In [Doulaty15], it was shown that LDA is a suitable model for structuring acoustic data of unknown origin into unsupervised categories that could be used to provide domain adaptation in ASR. In this work, 64 hidden acoustic domains were found in the acoustic model training data using the LDA model following the procedure in [Doulaty15]; these domains were found in an unsupervised manner and internally structured the different acoustic conditions of the data. Afterwards, each segment in the training and development sets was assigned to one of these domains. In DNN training, 64 extra features were appended in the input layer, where the domain corresponding to the input frame was codified as a 1–of–N vector. Decoding was performed as usual, with the hidden domain corresponding to the input segment also being appended in the input layer.
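The augmentation of the DNN input with the domain code can be illustrated with a short sketch. It assumes that a domain index has already been assigned to the segment by the LDA model; the function name and the list-based feature layout are illustrative only.

```python
def append_domain_code(frames, domain_id, num_domains=64):
    """Append a 1-of-N domain vector to every frame of a segment.

    frames:    list of per-frame feature vectors (lists of floats).
    domain_id: index of the LDA domain assigned to the whole segment.
    """
    if not 0 <= domain_id < num_domains:
        raise ValueError("domain_id out of range")
    one_hot = [0.0] * num_domains
    one_hot[domain_id] = 1.0
    # Every frame of the segment receives the same domain code.
    return [list(frame) + one_hot for frame in frames]


# Toy example with 2-dimensional features and 4 domains instead of 64.
augmented = append_domain_code([[0.1, 0.2], [0.3, 0.4]], domain_id=3, num_domains=4)
```

Because the code is constant within a segment, the network can learn domain-dependent biases in its first layer without any change to the decoder.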
5.2 Dynamic noise adaptation of bottleneck systems
One of the advantages of tandem (DNN-GMM-HMM) systems is that techniques for adaptation such as Maximum A Posteriori (MAP) or MLLR [Gales96] can be employed. In our previous work, a new HMM topology for asynchronous adaptation of GMM-HMM systems was proposed and shown to produce ASR improvement in the presence of dynamic background conditions [Saz13].
This setup was applied to this task and expanded through the use of asynchronous Noise Adaptive Training (aNAT) [Kalinli10, Saz13]. First, a global aCMLLR transformation with 8 parallel paths was trained on the whole training data in order to characterise the most common background conditions in this data. Then, the initial sMBR-trained model was retrained in an adaptive training fashion using this aCMLLR transformation. Finally, the global aCMLLR transformation was retrained into show-based aCMLLR transformations using an initial decoding stage in order to more finely characterise the types of noise and background existing in each show, and these transformations were used with the aNAT model to run the final noise-adapted system.
The results, including baseline results, for systems with domain adaptation and systems with noise adaptation are shown in Table 7 using the manually defined segmentation and for systems trained on data. The baseline and adapted systems were cross-entropy (CE) trained in this case, because sequence training for domain-adapted hybrid DNNs did not complete in time. The domain adapted DNN in the setup provided a significant improvement of 1.8% (5.9% relative), which showed the strength of the hidden domain found through the LDA model. For systems, the improvement over the baseline was 1% absolute (3.2% relative) in WER, with balanced improvement across the 8 genres. The experiments in Table 7 were carried out after the challenge and thus were not a part of the final submission.
6 Multi–genre Language Modelling
Acoustic variation is not the only source of variability that can be found in multi-genre broadcasts. Lexical and linguistic variability is also present in this data, due to the large variety of topics that are covered in these shows. In order to tackle this linguistic variability, several experiments were designed to improve language modelling in this task.
One of the aspects explored in this work is the use of genre-specific LMs. While the subtitles in the language model training data were already categorised by genre, this information was not available in the much larger language model training data. In order to derive genre labels for that dataset, genres were automatically inferred using an LDA-based approach. First, hidden LDA topics were inferred from the data where genre labels are present. Given those, a Support Vector Machine (SVM) classifier could be trained that would allow classifying a show into one of the 8 genres using the distribution of LDA hidden topic posteriors as input. These SVMs were used to produce labels for separate chunks of the training data. The statistics of words assigned to each genre can be seen in Table 9.
Once all the data had been classified into genres, genre-based LMs were trained in two different configurations: The first one was based on a Recurrent Neural Network (RNN) LM [Mikolov10], initially trained on the full and training data. This initial RNNLM was then converted into 8 genre–dependent RNNLMs by fine–tuning each one of them to the genre-dependent data. These RNNLMs were used to rescore the lattices obtained by the systems using the baseline 4-gram language model. The second one was based on genre-based 4-grams as the interpolation of the genre-independent 4-gram with each genre-dependent 4-gram and was used to rescore lattices in systems. Both systems used manual segmentation and were trained on .
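The genre-based 4-gram configuration relies on interpolating a genre-independent model with a genre-dependent one. A minimal sketch of linear interpolation and of the perplexity measure used to compare models is given below. The interpolation weight is a free parameter that would normally be tuned on held-out data; the value and the toy probabilities here are illustrative assumptions.

```python
import math


def interpolate(p_background, p_genre, weight):
    """Linear interpolation: weight * P_genre + (1 - weight) * P_background."""
    return weight * p_genre + (1.0 - weight) * p_background


def perplexity(word_probs):
    """Perplexity of a word sequence given its per-word probabilities."""
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))


# Per-word probabilities of a two-word test sequence under each model.
background_probs = [0.01, 0.02]
genre_probs = [0.03, 0.02]
mixed_probs = [
    interpolate(b, g, weight=0.5) for b, g in zip(background_probs, genre_probs)
]
mixed_ppl = perplexity(mixed_probs)
background_ppl = perplexity(background_probs)
```

In this toy case the interpolated model assigns a higher probability to the first word, so its perplexity is lower than that of the background model alone.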
The perplexity and recognition results obtained with the genre-specific LMs are shown in Table 8, along with the results using the baseline LMs. The results show a very significant drop in perplexity when using RNNLMs but only a modest improvement in word error rate of 0.7%. This is consistent with the experiments reported on the same BBC data in [ChenTLLWGW15]. The main difference, however, is that in [ChenTLLWGW15], instead of as background language model, another corpus of 1 billion words was used for language modelling, and different topic models, including LDA, were used to classify the text into a set of different genres. As noted above, the LM training data is noisy, both in word accuracy and genre labelling.
Using genre-specific n-gram language models yields an improvement of only 0.2% and the perplexity reductions are not as significant. This could be explained by the need to use longer contexts than 4-grams, in order to obtain improvements, which RNNLMs are able to achieve through the use of unrestrained context. It is also interesting to note that genre-specific RNNLMs perform worse than corresponding n-grams on some genres (e.g., comedy and drama). This seems to be related to data sparsity with these two genres having fewer words than the rest as shown in Table 9 and thus the RNNLM fine-tuning does not work very well. The experiments in Table 8 were carried out after the challenge and thus were not a part of the final submission.
7 System description
The final system as submitted for the MGB challenge followed the diagram pictured in Figure 1. Each node in the diagram was implemented as a composition of separate modules, each performing specific computation on the speech data.
The input audio was split into speech segments using a DNN segmenter based on the strategy, as defined in Section 4. These segments were then decoded by an initial, unadapted ASR system: ASR-P1, trained on . The segmentation was afterwards refined using confidence measures in the ASR output as described in section 4. After resegmentation, speaker clustering based on Bayesian Information Criterion (BIC) [Chen98] was performed to assign each speech segment to a given speaker.
From here onwards, three different decoding passes were deployed: ASR-P2-1, ASR-P2-2 and ASR-P2-3, which were based on complementary forms of dealing with speaker and noise variability. ASR-P2-1 was a system where the features were normalised using speaker-based Cepstral Mean and Variance Normalisation (CMVN) without requiring any previous transcript. ASR-P2-2 was also a system, but in this case speaker variability was compensated through the use of fMLLR input features based on the transcript from ASR-P1. Finally, ASR-P2-3 was a system where asynchronous noise transformations were used as described in Section 5, and speaker-based MLLR transformations were trained on top of this for further speaker and noise factorisation. All three systems were trained following the sMBR criterion using the training data definition.
The output of these three passes was finally combined via a Recognition Output Voting Error Reduction (ROVER) [Fiscus97] procedure.
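The voting stage of system combination can be illustrated with a deliberately simplified sketch. Full ROVER first aligns the hypotheses into a word transition network and can weight votes by word confidence; the code below assumes the hypotheses have already been aligned into slots of equal length (with `None` marking an empty slot) and uses plain majority voting, with ties resolved in favour of the earliest system. These simplifications are assumptions of the example, not a description of the procedure in [Fiscus97].

```python
from collections import Counter


def vote(aligned_hypotheses):
    """Majority vote over pre-aligned hypotheses; None marks an empty slot."""
    length = len(aligned_hypotheses[0])
    if any(len(hyp) != length for hyp in aligned_hypotheses):
        raise ValueError("hypotheses must be aligned to the same length")
    output = []
    for slot in zip(*aligned_hypotheses):
        counts = Counter(slot)
        best_count = max(counts.values())
        # Ties are resolved in favour of the earliest system in the list.
        winner = next(word for word in slot if counts[word] == best_count)
        if winner is not None:
            output.append(winner)
    return output


combined = vote([
    ["the", "cat", None, "sat"],
    ["the", "cap", "has", "sat"],
    ["a", "cat", None, "sat"],
])
```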
7.1 System implementation
The implementation of the system is based on the Resource Optimisation Toolkit (ROTK), which is developed by the team at the University of Sheffield and was presented initially in [Hain2012]. ROTK allows the formulation of functional modules that can be executed in asynchronous fashion using computing grid infrastructure. Systems are defined as a set of modules linked together by directed links transferring data of specific types. This is informally depicted in a graph in Figure 1; the actual modules used are more specific. The system uses metadata to organise how data is processed in an efficient parallelised way through the graph. Each module can split its own tasks into several subtasks based on data, which then can be processed in parallel. The overall dependency structure of these sub-tasks is then automatically inferred. Each module submits jobs on a grid system using the Sun Grid Engine (SGE). The ROTK system allows for simple repeatability of the experiments as the same graph can be executed on multiple datasets such as development and evaluation sets.
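The idea of inferring an execution order from module dependencies can be illustrated with the Python standard library. This is not ROTK's interface, which is not described here; the graph below is a hypothetical encoding of the passes named in this section, and modules that become ready at the same time stand for jobs that could be submitted to a grid in parallel.

```python
from graphlib import TopologicalSorter

# Each module maps to the set of modules whose output it needs (assumed links).
pipeline = {
    "segmentation": set(),
    "ASR-P1": {"segmentation"},
    "resegmentation": {"ASR-P1"},
    "speaker_clustering": {"resegmentation"},
    "ASR-P2-1": {"speaker_clustering"},
    "ASR-P2-2": {"speaker_clustering", "ASR-P1"},
    "ASR-P2-3": {"speaker_clustering"},
    "ROVER": {"ASR-P2-1", "ASR-P2-2", "ASR-P2-3"},
}


def execution_waves(graph):
    """Group modules into waves; modules within a wave can run in parallel."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(pipeline)
```

In this encoding the three adapted passes fall into the same wave, reflecting that they are independent of each other once speaker clustering has finished.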
8 Results
The results of all intermediate passes and the final output are presented in Table 10. This table shows the gains obtained by the 3 adapted systems in relation to the baseline, as well as the final gain obtained by the combination of the three outputs. Since the results that led to the development of the proposed system have already been presented and discussed throughout the paper, this section only reviews the final results achieved by the full system on the development set.
Evaluating the results per genre, the results vary significantly from News shows, with a 13.2% WER, to Comedy shows, with a 40.9% WER. This highlights the considerable impact of the acoustic variability present in broadcast shows. In terms of gain, Children’s shows achieved the largest improvement from the initial unadapted system, 36.5%, to the final output, 27.7%. This shows how the different techniques proposed for compensating variability worked in complementary ways in one of the most challenging conditions, i.e., where children and adults may appear in the same show and large amounts of music and other background sounds occur.
9 Conclusions
In this paper we presented the complete system structure, model training and implementation of the University of Sheffield system for speech–to–text transcription of broadcast media. The system was designed for participation in Task 1 of the MGB challenge. The final result, 27.5% WER, reflects the complexity of the task, especially in the most challenging genres such as comedy or drama shows. It is important to note that these results are obtained without the availability of high quality training data, which is normally available for other related evaluation campaigns. The proposed system has made use of the complementarity of DNN-HMM and DNN-GMM-HMM systems using different adaptation strategies.
Several techniques have been proposed and evaluated. In terms of data selection techniques for acoustic model training, results have shown that adding more data of higher quality can provide improvements in both and models. The refinement of automatic speech segmentation using the output of an ASR stage is a significant contribution of this system, with the results showing how this can be used to find speech segments that minimise error rates without necessarily minimising segmentation error rates. The two techniques proposed for domain and noise adaptation of acoustic models have shown how complementary techniques can be used successfully. In this work, domain–based input features have been shown to reduce domain variability in systems, while asynchronous adaptation with CMLLR transformations has a similar effect in systems. Finally, language model adaptation to multi–genre shows has been shown to produce slight improvements. In this case, the use of genre–dependent 4–grams does not achieve the gains obtained using genre information in RNNLMs, indicating that more work should be focused on adaptation of RNNs for language modelling.
10 Acknowledgements and Data
We would like to thank others in the MINI research group at Sheffield who have helped to develop this system with their advice and discussions. We would also like to thank our partners in the NST programme, at the Universities of Cambridge and Edinburgh, for the many discussions which helped us greatly in the development of systems.
The audio and subtitle data used for these experiments were distributed as part of the MGB Challenge (www.mgb-challenge.org) through a licence with the BBC. System output and results for the presented system are also available as part of the challenge results to participants.
This work was supported by the EPSRC Programme Grant EP/I031022/1 (Natural Speech Technology).