BCN2BRNO: ASR System Fusion for Albayzin 2020 Speech to Text Challenge
This paper describes the joint effort of BUT and Telefónica Research on the development of Automatic Speech Recognition systems for the Albayzin 2020 Challenge. We compare approaches based on either hybrid or end-to-end models. In hybrid modelling, we explore the impact of a SpecAugment [16, 20] layer on performance. For end-to-end modelling, we used a convolutional neural network with gated linear units (GLUs). The performance of this model is also evaluated with an additional n-gram language model to improve word error rates. We further inspect source separation methods to extract speech from noisy environments (i.e., TV shows). More precisely, we assess the effect of using a neural-based music separator named Demucs. A fusion of our best systems achieved 23.33 % WER in the official Albayzin 2020 evaluations. Aside from the techniques used in our final submitted systems, we also describe our efforts in retrieving high-quality transcripts for training.
Martin Kocour, Guillermo Cámbara, Jordi Luque, David Bonet, Mireia Farrús, Martin Karafiát, Karel Veselý and Jan “Honza” Černocký
Brno University of Technology, Speech@FIT, IT4I CoE
Universitat Pompeu Fabra
Universitat de Barcelona
Index Terms: fusion, end-to-end model, hybrid model, semi-supervised, automatic speech recognition, convolutional neural network.
The Albayzin 2020 challenge is a continuation of the Albayzin 2018 challenges and comprises evaluations for the following tasks: Speech to Text, Speaker Diarization and Identity Assignment, Multimodal Diarization and Scene Description, and Search on Speech. The target domain of the series is broadcast TV and radio content, with shows in a notable variety of Spanish accents.
This paper describes the BCN2BRNO team's Automatic Speech Recognition (ASR) system for the IberSPEECH-RTVE 2020 Speech to Text Transcription Challenge, a joint collaboration between the Speech@FIT research group, Telefónica Research (TID) and Universitat Pompeu Fabra (UPF). Our goal is to develop two distinct ASR systems, one based on a hybrid model and the other on an end-to-end approach, and to let them complement each other through a joint fusion.
We submitted one primary system and one contrastive system. The primary system, Fusion B, is a word-level ROVER fusion of hybrid ASR models and end-to-end models. It achieved 23.33 % WER on the official evaluation dataset. However, the same result was achieved by the contrastive system, Fusion A, a fusion which comprises only hybrid ASR models. In this paper we describe both ASR systems, plus a post-evaluation analysis and experiments that lead to a better performance of the primary fusion. We also discuss the effect of speech enhancement techniques such as background music removal and speech denoising.
The Albayzin 2020 challenge comes with two databases: RTVE2018 and RTVE2020. RTVE2018 is the main source of training and development data, while the RTVE2020 database is used for the final evaluation of submitted systems. The RTVE2018 database comprises 15 different TV programs broadcast between 2015 and 2018 by the Spanish public television Radiotelevisión Española (RTVE). The programs contain a great variety of speech scenarios, from read speech to spontaneous speech, live broadcast, political debates, etc. They also cover different Spanish accents, including Latin-American ones. The database is partitioned into 4 different subsets: train, dev1, dev2 and test. The database consists of hours of audio data, from which hours are provided with subtitles (train set), and hours are human-revised (dev1, dev2 and test sets). Both hybrid and end-to-end models use the dev1 and train sets for training, while the dev2 and test sets serve as validation datasets. The RTVE2020 database consists of TV shows of different genres broadcast by RTVE from 2018 to 2019. It includes more than hours of audio and it has been manually annotated in its entirety.
In addition, three Linguistic Data Consortium (LDC) corpora were used for training the language model in the hybrid ASR system: Fisher Spanish Speech, CALLHOME Spanish Speech and Spanish Gigaword Third Edition.
The Fisher Spanish Speech corpus comprises spontaneous telephone speech from native Caribbean Spanish and non-Caribbean Spanish speakers with full orthographic transcripts. The recordings consist of telephone conversations lasting up to minutes each.
CALLHOME Spanish Speech  corpus consists of telephone conversations between Spanish native speakers lasting less than 30 minutes. Spanish Gigaword Third Edition  is an extensive database of Spanish newswire text data acquired by the LDC. It includes reports, news, news briefs, etc. collected from 1994 through Dec 2010. We also downloaded the text data from Spanish Wikipedia.
The end-to-end model is trained on Fisher Spanish Speech, Mozilla’s Common Voice Spanish corpus and Telefónica’s Call Center in-house data (23 hours). Mozilla’s Common Voice Spanish corpus is an open-source dataset that consists of recordings from volunteer contributors pronouncing scripted sentences, recorded at a 48 kHz sampling rate. The sentences come from original contributor donations and public domain movie scripts. The version of the Common Voice corpus used for this work is 5.1, which has 521 hours of recorded speech. However, we have kept only the speech validated by the contributors, which amounts to 290 hours.
2.1 Transcript retrieval
The training data from the RTVE2018 database includes many hours of subtitled speech. However, the captions contain several errors. In most cases the captions are shifted by a few seconds, so a segment with a correct transcript corresponds to a different portion of audio. This phenomenon also occurs in the human-revised development and test sets. Another problem with subtitled speech is “partly-said” captions, i.e., transcriptions that contain misspelled or unspoken words.
Since the training procedure of the hybrid ASR is quite error-prone in case of misaligned labels, we decided to apply a transcript retrieval technique developed by Manohar et al. [17]: the closed captions related to the same audio, i.e., the whole TV show, are first concatenated according to the original timeline. This creates a small text corpus containing a few hundred words. The text corpus is used for training a biased n-gram language model (LM) with , so the model is biased only towards the currently processed captions. During decoding, the weight of the acoustic model (AM) is significantly smaller than the weight of the LM, because we believe that the captions should occur in the hypotheses. Then, the “winning” path is retrieved from the hypothesis lattice as the path that has the minimum edit cost w.r.t. the original transcript. Finally, the retrieved transcripts are segmented using the CTMs obtained from the oracle alignment (previous step). More details can be found in [12, 17].
The transcript retrieval technique is applied twice. First, we train an initial ASR system on out-of-domain data, e.g., Fisher and CALLHOME. This system is used in the first pass of transcript retrieval. Then, a new system is trained from scratch on the already cleaned data and the whole process of transcript retrieval is repeated. Table 1 shows how this 2-pass cleaning recovers almost all of the manually annotated development data and half of the subtitled training data.
Figure 1 depicts how many hours have been recovered in individual TV programs. It also shows how the data is distributed in the database. Most speech comes from the La-Mañana (LM) TV program. We discarded most of the data in this TV program after the 2-pass data cleaning, because this particular TV show was quite challenging for our ASR model.
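The selection of the “winning” path by minimum edit cost can be illustrated with a small sketch. It is a simplification under stated assumptions: instead of searching a hypothesis lattice, it scores a plain list of candidate hypotheses, and the function names `edit_cost` and `retrieve_transcript` are hypothetical, not part of Kaldi or of the cited recipe.

```python
from typing import List, Sequence, Tuple


def edit_cost(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance with unit costs."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                         # reference word missing in hypothesis
                curr[j - 1] + 1,                     # extra word in hypothesis
                prev[j - 1] + (ref_word != hyp_word) # match (0) or substitution (1)
            ))
        prev = curr
    return prev[-1]


def retrieve_transcript(caption: str, hypotheses: List[str]) -> Tuple[str, int]:
    """Return the candidate closest to the caption and its edit cost.

    Assumption: candidates are given as an n-best list; the paper
    searches a lattice instead.
    """
    ref = caption.split()
    scored = [(edit_cost(ref, hyp.split()), hyp) for hyp in hypotheses]
    cost, best = min(scored, key=lambda item: item[0])
    return best, cost
```

A segment whose best candidate still has a high edit cost relative to the caption length would be a natural candidate for discarding, although the exact acceptance rule is not stated in this paper.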
3 Hybrid speech recognition
3.1 Acoustic Model
In all our experiments, the acoustic model was based on a hybrid Deep Neural Network – Hidden Markov Model architecture trained in Kaldi. The NN part of the model contains 6 convolutional layers followed by 19 TDNN layers with semi-orthogonal factorization (CNN-TDNNf). The input consists of 40-dim MFCCs concatenated with speaker-dependent 100-dim i-vectors. The whole model is trained using the LF-MMI objective function with bi-phone acoustic units as the targets.
In order to make our NN model training more robust, we introduced a feature dropout layer into the architecture. This prevents the model from overfitting on the training data. In fact, it turned the overfitting problem into an underfitting problem, which leads to slower convergence during training. We compensate for this by increasing the number of epochs from 6 to 8. This technique is also known as Spectral Augmentation. It was first suggested for multi-stream hybrid NN models in [16] and fully examined in [20].
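To make the idea of feature dropout concrete, the following is a minimal SpecAugment-style sketch, not the Kaldi layer used in our system. It zeroes one random band of feature bins and one random span of frames; the function name `mask_features` and the default widths are assumptions chosen only for illustration.

```python
import random
from typing import List, Optional


def mask_features(
    feats: List[List[float]],
    max_freq_width: int = 8,    # assumed value, not from the paper
    max_time_width: int = 20,   # assumed value, not from the paper
    rng: Optional[random.Random] = None,
) -> List[List[float]]:
    """Return a copy of feats (frames x bins) with one frequency band
    and one time span set to zero."""
    if rng is None:
        rng = random.Random()
    num_frames = len(feats)
    num_bins = len(feats[0]) if feats else 0
    out = [list(row) for row in feats]

    # Draw the width and position of the frequency mask.
    f_width = rng.randint(0, min(max_freq_width, num_bins))
    f_start = rng.randint(0, num_bins - f_width)
    # Draw the width and position of the time mask.
    t_width = rng.randint(0, min(max_time_width, num_frames))
    t_start = rng.randint(0, num_frames - t_width)

    for t, row in enumerate(out):
        in_time_mask = t_start <= t < t_start + t_width
        for f in range(num_bins):
            if in_time_mask or f_start <= f < f_start + f_width:
                row[f] = 0.0
    return out
```

Because a fresh mask is drawn for every call, the network sees a differently corrupted version of each utterance in every epoch, which is consistent with the slower convergence described above.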
3.2 Language Model
We trained three different n-gram language models: Alb, Wiki and Giga. The names indicate which text corpus was used for training. The Albayzin LM was trained on the dev1 and train sets from RTVE2018. This text mixture contains thousand unique words in million sentences. Such a small training text is not sufficient to train an n-gram LM that generalizes well, so we also included larger text corpora: Wikipedia and Spanish Gigaword. These databases were further processed to get rid of unrelated text such as advertisements, emoji, URLs, etc. This resulted in more than million clean sentences in Wikipedia and million sentences in Spanish Gigaword. We experimented with the following interpolated combinations: Alb, Alb+Wiki, Alb+Giga, Alb+Wiki+Giga.
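Linear interpolation of language models can be sketched as follows. This is an illustration only: it uses unigram dictionaries instead of full n-gram models, and the names `interpolate` and `perplexity` as well as any weights passed to them are assumptions, since the interpolation weights and the toolkit are not stated here.

```python
import math
from typing import Callable, Dict, List, Sequence


def interpolate(
    models: Sequence[Dict[str, float]], weights: Sequence[float]
) -> Callable[[str], float]:
    """Build P(w) = sum_i weight_i * P_i(w) from unigram models."""
    if len(models) != len(weights):
        raise ValueError("one weight per model is required")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")

    def prob(word: str) -> float:
        # A word unseen by one model contributes zero mass from that model.
        return sum(w * m.get(word, 0.0) for m, w in zip(models, weights))

    return prob


def perplexity(prob: Callable[[str], float], words: List[str]) -> float:
    """Perplexity of a word sequence; every word must have non-zero probability."""
    log_sum = sum(math.log(prob(w)) for w in words)
    return math.exp(-log_sum / len(words))
```

In practice the weights are usually chosen to minimize perplexity on held-out in-domain text, which is why a small in-domain model such as Alb can still receive a large weight next to much bigger corpora.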
Our vocabulary consists of words from RTVE2018 database and from Santiago lexicon
3.3 Voice Activity Detection
Voice activity detection (VAD) was applied to the evaluation data in order to segment the audio into smaller chunks. The VAD is based on a feed-forward neural network with two outputs. It expects 15-dimensional filterbank features with 3 additional Kaldi pitch features as the input. Features are normalized with cepstral mean normalization. More details can be found in .
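The step from frame-level VAD decisions to audio chunks can be sketched as below. The network itself is not reproduced; the sketch assumes that per-frame speech/non-speech decisions are already available, and the function name `frames_to_segments`, the frame shift and the minimum duration are hypothetical parameters.

```python
from typing import List, Sequence, Tuple


def frames_to_segments(
    speech_flags: Sequence[bool],
    frame_shift: float = 0.01,   # seconds per frame, assumed value
    min_duration: float = 0.0,   # drop segments shorter than this (seconds)
) -> List[Tuple[float, float]]:
    """Merge consecutive speech frames into (start, end) segments in seconds."""
    runs = []
    start = None
    for i, is_speech in enumerate(speech_flags):
        if is_speech and start is None:
            start = i                      # a speech run begins
        elif not is_speech and start is not None:
            runs.append((start, i))        # the run ended at the previous frame
            start = None
    if start is not None:
        runs.append((start, len(speech_flags)))  # run reaches the end of the audio

    return [
        (s * frame_shift, e * frame_shift)
        for s, e in runs
        if (e - s) * frame_shift >= min_duration
    ]
```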
4 End-to-end speech recognition
4.1 Acoustic Model
The end-to-end acoustic model is based on a convolutional architecture proposed by Collobert et al. that uses gated linear units (GLUs). Using GLUs in convolutional approaches helps to avoid vanishing gradients by providing them with linear paths while keeping high performance. Concretely, we have used the model from wav2letter’s Wall Street Journal (WSJ) recipe. This model has approximately 17M parameters, with dropout applied after each of its 17 layers. The WSJ dataset contains around 80 hours of audio recordings, which is considerably less than our data (600 hours). The LibriSpeech recipe (1000 hours) provides a deeper ConvNet GLU-based architecture; however, we decided to use the WSJ one in order to reduce computation time and make hyper-parameter fine-tuning of the network easier.
All data samples are resampled to 16 kHz, and the system is trained with the wav2letter++ framework. Mel-frequency spectral coefficients (MFSCs) are extracted from raw audio using 80 filterbanks, and the system is trained using the Auto Segmentation criterion (ASG) with a batch size of 4. The learning rate starts at 5.6 and is decreased to 0.4 after 30 epochs, where training is stopped since no further significant WER gains are achieved. From epoch 22 to epoch 28, the training set is extended with the RTVE2018 train and dev1 samples whose background music has been cleaned by the Demucs module. The last two epochs, from epoch 28 to epoch 30, additionally incorporate samples with background noise removed by Demucs and denoised by a neural denoiser. This data augmentation with samples free of background music and noise is meant to help the network when training on samples with difficult acoustic conditions. Besides, the network is more likely to generalize to audio artifacts caused by the denoiser and music separator networks, which is useful when these are used to clean test audio.
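The gating mechanism itself is simple and can be written in a few lines. The sketch below shows a GLU applied to one vector of channel activations, where the first half of the channels is the linear path and the second half drives the gate; it is a plain-Python illustration and not the wav2letter implementation.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def glu(channels: Sequence[float]) -> List[float]:
    """Gated linear unit: split the channels in two halves (a, b)
    and return a * sigmoid(b) element-wise."""
    if len(channels) % 2 != 0:
        raise ValueError("GLU needs an even number of channels")
    half = len(channels) // 2
    linear, gate = channels[:half], channels[half:]
    return [a * sigmoid(b) for a, b in zip(linear, gate)]
```

When the gate is fully open the output equals the linear half, so the gradient passes through that half without being squashed by a non-linearity, which is the linear path mentioned above.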
4.2 Language Model
Regarding the lexicon, we extract it from the train and validation transcripts, plus the Sala lexicon. The resulting lexicon is a grapheme-based one with 271k words. We use the standard Spanish alphabet as tokens, plus the “ç” letter from Catalan and the vowels with diacritical marks, making a total of 37 tokens.
The LM is a 5-gram model trained with KenLM  using only transcripts from the training sets: RTVE2018 train and dev1, plus Common Voice, Fisher and Call Center. The resulting LM is described in this paper as Alb+Others.
Fine-tuning of the decoder hyperparameters is done via grid search on the RTVE2018 dev2 set. The best results are achieved with an LM weight of 2.25, a word score of 2.25 and a silence score of -0.35. The same configuration is then applied to the evaluation datasets from RTVE2018 and RTVE2020.
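A grid search of this kind can be sketched generically. The helper below is hypothetical and independent of the wav2letter decoder: `evaluate` stands for any user-supplied function that decodes the development set with the given hyperparameters and returns its WER, and the parameter names in the grid are whatever that function accepts.

```python
import itertools
from typing import Callable, Dict, Sequence, Tuple


def grid_search(
    evaluate: Callable[..., float],
    grid: Dict[str, Sequence[float]],
) -> Tuple[Dict[str, float], float]:
    """Try every combination in the grid and keep the one with the lowest score."""
    names = sorted(grid)
    best_params: Dict[str, float] = {}
    best_score = float("inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)   # e.g. WER on the development set
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

The cost grows with the product of the grid sizes, so each additional hyperparameter multiplies the number of decoding runs.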
5.1 Data cleaning
Data cleaning by means of 2-pass transcript retrieval improves the performance of our models the most. Table 1 shows the effect of each pass. The 2nd pass helped to improve the accuracy by almost % in terms of WER. We also ran a 3rd pass, but it did not help any further: we did not retrieve substantially more cleaned data from the original transcripts, just hours more. We could not train the models with the original subtitles, since these contained wrong timestamps.
5.2 Speech Enhancement
It is very common to find background music in TV programs, which can confuse our recognizer if it is prominent. This gave us the idea of processing the audio through a music source separator called Demucs. It separates the original audio into voice, bass, drums and others. By keeping only the voice component, we managed to eliminate a significant part of the background music while maintaining relatively good quality in the original voice.
We enhanced both validation sets in order to assess possible WER reductions. As seen in Table 4, this approach yielded a small increase in WER. We also tried applying a specialized denoiser after background music removal, but the WER for dev2 increased by 1.6% absolute compared to the original system without enhancement. Neither of these two approaches (Demucs and Demucs+Denoiser) provided WER improvements at first, so we did not apply them to the end-to-end model used in the fusion. However, the end-to-end, end-to-end + Demucs and end-to-end + Demucs + Denoiser models were submitted as separate systems by the UPF-TID team; see Table 5 for details.
Our hypothesis is that not all the samples contain background music. Speech enhancement of already clean samples is detrimental because it causes a slight degradation of the signal. Hence, we have evaluated the effects of applying music source separation only to samples within certain SNR ranges, measured with the WADA-SNR algorithm. The application of music separation on the RTVE dataset is optimal for SNR ranges between -5 and 5 or 8, as shown in Table 3. Looking at Figure 2, the largest improvements are found for TV shows with higher WER (thus harder/noisier speech), e.g., AV, where the speakers are in a car most of the time, or LM and DH, where music and speech often overlap. Other shows benefit less, since they already contain good-quality audio. The exception is the AFI show, which is reported to have poor-quality audio, so further audio degradation from Demucs might cause worse performance.
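The SNR-gated enhancement described above amounts to a simple selection rule, sketched here. Both `estimate_snr` and `enhance` are hypothetical callables standing in for a WADA-SNR estimator and a music separator such as Demucs; no real API of either tool is used, and the default range of -5 to 5 dB is taken from the text.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Sample = TypeVar("Sample")


def selectively_enhance(
    samples: Sequence[Sample],
    estimate_snr: Callable[[Sample], float],   # stand-in for WADA-SNR
    enhance: Callable[[Sample], Sample],       # stand-in for music separation
    snr_min: float = -5.0,
    snr_max: float = 5.0,
) -> Tuple[List[Sample], int]:
    """Enhance only the samples whose estimated SNR lies in [snr_min, snr_max].

    Returns the processed samples in their original order and the
    number of samples that were enhanced.
    """
    output: List[Sample] = []
    num_enhanced = 0
    for sample in samples:
        if snr_min <= estimate_snr(sample) <= snr_max:
            output.append(enhance(sample))
            num_enhanced += 1
        else:
            output.append(sample)   # clean enough (or out of range): leave untouched
    return output, num_enhanced
```

The returned count divided by the number of samples corresponds to the share of cleaned samples reported per SNR range.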
|SNR||Cleaned Samples [%]||Test WER [%]|
5.3 Spectral augmentation
Table 4 compares models with and without spectral augmentation. The technique helps quite significantly. All models with the feature dropout layer outperformed their counterparts, with a fairly constant absolute WER improvement on the RTVE2018 test set and around on the RTVE2018 dev2 set.
5.4 Model fusion
We also fuse the output of our best systems to further improve the performance. Overall results of our systems considered for the fusion are depicted in Table 4. Since the models with spectral augmentation performed significantly better, we decided to fuse only these systems. We analyzed two different approaches: a pure hybrid model fusion (Fusion A) and hybrid and end-to-end model fusion (Fusion B).
Since the end-to-end model does not provide word-level timestamps, we had to force-align its transcripts with the hybrid ASR system in order to obtain CTM output. The word-level fusion itself was done using the ROVER toolkit. Fusion B, which includes end-to-end models, performed slightly better than its counterpart Fusion A, despite the fact that the end-to-end models achieved worse results on their own. This supports the idea that the fusion can benefit from different modeling approaches.
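The voting stage of a ROVER-style fusion can be illustrated with a strongly simplified sketch. It assumes that the system outputs have already been aligned into slots of equal length, with an empty string marking a missing word, and it uses plain majority voting; the ROVER toolkit additionally builds the alignment itself and can take confidence scores into account. The function name `vote` is hypothetical.

```python
from collections import Counter
from typing import List, Sequence

NULL = ""  # marks a slot where a system produced no word


def vote(aligned_hyps: Sequence[Sequence[str]]) -> List[str]:
    """Pick the most frequent word in every aligned slot.

    Ties are resolved in favour of the system listed first.
    """
    if len({len(hyp) for hyp in aligned_hyps}) != 1:
        raise ValueError("hypotheses must be aligned to the same length")
    result: List[str] = []
    for slot in zip(*aligned_hyps):
        word, _count = Counter(slot).most_common(1)[0]
        if word != NULL:            # a winning NULL means no word is emitted
            result.append(word)
    return result
```

This also shows why a weaker system can still help: whenever the stronger systems disagree in a slot, its vote can decide the outcome.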
6 Final systems
Table 5 shows the results on the RTVE2020 test set. For the end-to-end ConvNet GLU model, the performance degrades by around 15% WER compared with the previous results on the development sets. Since the TV shows in those sets are also present in the training dataset, our hypothesis is that the model slightly overfits to them. Therefore, when facing the different acoustic conditions, voices, background noises and music present in the RTVE2020 test set, the WER increases noticeably. Enhancing the test samples with Demucs or with Demucs+Denoiser yields a worse WER score, probably due to an inherent degradation of the signal. A deeper analysis of more efficient ways to apply such enhancements is given in Section 5.2.
Note also that the submitted systems had a leak of dev2 STM transcripts into the LM, which caused hyperparameter overfitting during LM tuning. This degraded the performance of all end-to-end systems, yielding WERs of 41.4%, 42.3% and 58.6%. Table 5 also displays the results of the same systems with the leakage and the LM tuning corrected in the post-evaluation analysis.
|2||Alb + Wiki||13.6||14.9|
|3||Alb + Giga||13.6||15.1|
|4||Alb + Wiki + Giga||13.5||15.0|
|10||Alb + Others||20.8||20.7|
|12||Alb + Others||21.1||20.8|
|13||Fusion A (row 5-8)||12.9||13.7|
|14||Fusion B (row 5-8 and 10)||12.8||13.3|
|+ Demucs + Denoiser||58.6
In this paper we described two different ASR model architectures and their fusion. We focused on improving the original subtitled data in order to train our models on high-quality target labels. We also improved the n-gram language model by incorporating publicly available text data from Wikipedia and the Spanish Gigaword corpus from LDC. We have also successfully incorporated spectral augmentation into our AM architecture. Our best system achieved % and 23.24 % WER on the RTVE2018 and RTVE2020 test sets, respectively.
The performance of our hybrid system can be further improved by using lattice fusion with Minimum Bayes Risk decoding. Further room for improvement is offered by adding RNN-LM lattice rescoring. Our end-to-end model shows relatively competitive performance on the RTVE2018 test set in comparison with its hybrid counterpart. However, its performance on RTVE2020 exposes that the model was not able to generalize very well, since this database turns out to contain slightly different acoustic conditions. Despite this, the model still managed to improve the results in the final fusion with the hybrid systems. An exploration of background music removal shows that it yields the best results for lower SNR ranges, thus having a different impact depending on the acoustic conditions of each TV show.
- Primary system of UPF-TID team.
- First contrastive system of UPF-TID team.
- Second contrastive system of UPF-TID team.
- (2019) Common voice: a massively-multilingual speech corpus.
- (1996) CALLHOME Spanish Speech. LDC96S35. Web Download. Philadelphia: Linguistic Data Consortium.
- (2016) Wav2Letter: an end-to-end convnet-based speech recognition system. CoRR abs/1609.03193.
- (2020) Real time speech enhancement in the waveform domain.
- (2019) Music source separation in the waveform domain. arXiv preprint arXiv:1911.13254.
- (2012) eSpeak text to speech.
- (1997) A post-processing system to yield reduced word error rates: Recognizer Output Voting Error Reduction (ROVER). In 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, pp. 347–354.
- (2014-05) A pitch extraction algorithm tuned for automatic speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, Florence, Italy.
- (2010) Fisher Spanish Speech. LDC2010S01. DVD. Philadelphia: Linguistic Data Consortium.
- (2011) KenLM: faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, WMT ’11, USA, pp. 187–197.
- (2008) Robust signal-to-noise ratio estimation based on waveform amplitude distribution analysis. In Ninth Annual Conference of the International Speech Communication Association.
- (2019) Automatic Speech Recognition System Continually Improving Based on Subtitled Speech Data. Diploma thesis, Brno University of Technology, Faculty of Information Technology, Brno. Note: technical supervisor Dr. Ing. Jordi Luque Serrano, supervisor Doc. Dr. Ing. Jan Černocký.
- (2018) RTVE2018 Database Description.
- (2019) Albayzin 2018 evaluation: the iberspeech-rtve challenge on speech technologies for spanish broadcast media. Applied Sciences 9 (24), pp. 5412.
- (2020) RTVE2020 Database Description.
- (2016) A Framework for Practical Multistream ASR. In Interspeech 2016, 17th Annual Conference of the International Speech Communication Association, San Francisco, CA, USA, September 8-12, 2016, N. Morgan (Ed.), pp. 3474–3478.
- (2017) JHU Kaldi system for Arabic MGB-3 ASR challenge using diarization, audio-transcript alignment and transfer learning. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Vol. 2018-, pp. 346–352.
- (2011) Spanish Gigaword Third Edition. LDC2011T12. Web Download. Philadelphia: Linguistic Data Consortium.
- (2002) SpeechDat across all america: sala ii.
- (2019) SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. In Interspeech 2019, 20th Annual Conference of the International Speech Communication Association, Graz, Austria, 15-19 September 2019, pp. 2613–2617.
- (2018) Analysis of but-pt submission for nist lre 2017. In Proceedings of Odyssey 2018 The Speaker and Language Recognition Workshop, pp. 47–53.
- (2018-09) Semi-Orthogonal Low-Rank Matrix Factorization for Deep Neural Networks. In Proceedings of Interspeech, pp. 3743–3747.
- (2011-12) The Kaldi Speech Recognition Toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, Note: IEEE Catalog No.: CFP11SRW-USB.
- (2013) Revisiting hybrid and GMM-HMM system combination techniques. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2013, Vancouver, BC, Canada, May 26-31, 2013.