IMS-Speech: A Speech to Text Tool

We present IMS-Speech, a web-based tool for German and English speech transcription that aims to facilitate research in various disciplines requiring access to lexical information in spoken language materials. The tool is based on a modern open-source software stack, advanced speech recognition methods and public data resources, and is freely available to academic researchers. The utilized models are built to be generic in order to provide transcriptions of competitive accuracy on a diverse set of tasks and conditions.


Institute for Natural Language Processing (IMS), University of Stuttgart (pavel.denisov|)

1 Introduction

There is a considerable amount of spoken language material in the form of audio recordings which researchers in, e.g., the humanities and social sciences could incorporate into their studies. However, to be able to access their content, one needs to transcribe these recordings automatically. While all resources needed for building an automatic speech recognition (ASR) system are typically available for academic use, their utilization requires specialized knowledge and technical experience [1], [2]. Therefore, in order to give people easy access to the information in spoken language materials, a speech to text tool with a user interface should be helpful.

This paper presents IMS-Speech, a web-based tool for German and English speech transcription aiming to facilitate research in various disciplines. We aim to provide a speech transcription service with an intuitive web interface, accessible from a wide range of computing devices and to people with various backgrounds. The service is based on a modern open-source software stack, advanced speech recognition methods and public data resources, and is freely available to academic researchers. The utilized models are built to be generic in order to provide transcriptions of competitive accuracy on a diverse set of tasks and conditions. In addition, they can serve as a strong basis for customized task-specific applications.

2 System description

In order to produce a meaningful transcription for most of the recordings that might be uploaded by users, two tasks must be performed sequentially for every recording. First, the recording must be split into segments that do not exceed some short duration and correspond to speech intervals. Second, actual ASR must be performed on each speech segment to find the most probable sequence of words said in the segment and thus to construct the final transcription.

2.1 Speech Segmentation

Speech segmentation is performed with a speech activity detection (SAD) system based on a Time-Delay Neural Network (TDNN) [3] with statistics pooling [4] for long-context information. The TDNN is trained to estimate the probabilities of three classes (Silence, Speech and Garbage) for each frame. Training targets are assigned based on lattices produced by Gaussian Mixture Model (GMM) based acoustic models and predefined lists of phones for each class. The GMM is used for forced alignment as well as for unconstrained decoding; training targets are obtained from both procedures separately and then merged by weighted summation, while samples with high disagreement between the two methods are discarded. During decoding, the three estimated probabilities are transformed into pseudo-likelihoods of two states, Silence and Speech, using the priors of the three classes and manually chosen proportions of the two states in the three classes. Decoding is performed with the Viterbi algorithm [5].
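The transformation from the three class posteriors to two-state pseudo-likelihoods can be sketched as follows; the priors and state proportions here are illustrative toy values, not the ones used by the deployed system:

```python
import numpy as np

def pseudo_likelihoods(posteriors, priors, speech_prop):
    """Convert per-frame posteriors over (Silence, Speech, Garbage)
    into pseudo-likelihoods of the two states Silence and Speech."""
    # dividing by the class priors turns posteriors into scaled likelihoods
    likelihoods = posteriors / priors
    # mix the three class likelihoods into the two states according to
    # the chosen proportion of the Speech state in each class
    speech = likelihoods @ speech_prop
    silence = likelihoods @ (1.0 - speech_prop)
    return np.stack([silence, speech], axis=1)

# toy example: 3 frames; priors and proportions are made up
post = np.array([[0.8, 0.1, 0.1],
                 [0.1, 0.8, 0.1],
                 [0.3, 0.4, 0.3]])
priors = np.array([0.5, 0.4, 0.1])
speech_prop = np.array([0.0, 1.0, 0.5])  # Garbage counted half as Speech
pl = pseudo_likelihoods(post, priors, speech_prop)
```

A Viterbi pass over these per-frame scores with transition penalties then yields the final silence/speech segmentation.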

2.2 End-to-end ASR

The end-to-end approach implements an ASR system as a single neural network based model that takes a $T$-length sequence of $D$-dimensional feature vectors $X = (x_t \in \mathbb{R}^D \mid t = 1, \dots, T)$ as input and produces an $L$-length sequence of output labels $Y = (y_l \in \mathcal{U} \mid l = 1, \dots, L)$, where $\mathcal{U}$ is a set of distinct output labels and usually $L < T$. A common architecture for such models is an attention-based encoder-decoder network trained to minimize the cross-entropy loss:

$$h_t = \mathrm{Encoder}(X), \qquad a_{lt} = \mathrm{Attention}(q_{l-1}, h_t), \qquad r_l = \textstyle\sum_t a_{lt} h_t,$$
$$p_{\mathrm{att}}(y_l \mid y_{1:l-1}, X) = \mathrm{Decoder}(r_l, q_{l-1}, y_{l-1}), \qquad \mathcal{L}_{\mathrm{att}} = -\log p_{\mathrm{att}}(Y \mid X)$$
Here, $\mathrm{Encoder}$ and $\mathrm{Decoder}$ are recurrent neural networks, $\mathrm{Attention}$ is an attention mechanism, and $h_t$, $q_l$ and $r_l$ are the hidden vectors. The attention mechanism was developed in the context of machine translation [6] and provides a means to model the correspondence of all elements of the encoder's hidden representation sequence to all elements of the output sequence in the decoder. The attention mechanism can learn a non-sequential mapping between its inputs and outputs, meaning that the order of the output elements is not always the same as the order of the input elements corresponding to them. This can be an advantage in machine translation, as word order sometimes differs between languages. However, this property makes training for speech recognition suboptimal, because it is known in advance that word order is the same in the audio and in the transcription. The Connectionist Temporal Classification (CTC) sequence-level loss function [7] has been adopted as a secondary learning objective for end-to-end ASR models in order to suppress this drawback:

$$\mathcal{L} = \alpha \mathcal{L}_{\mathrm{ctc}} + (1 - \alpha) \mathcal{L}_{\mathrm{att}},$$
where $\alpha \in [0, 1]$ is the CTC weight. The encoder output followed by a single linear layer serves as the estimated output label sequence in the CTC loss calculation, while the target is the set of all possible $T$-length sequences $Z = (z_t \in \mathcal{U} \cup \{\langle\mathrm{blank}\rangle\} \mid t = 1, \dots, T)$ over an extended output label set that correspond to the original output label sequence $Y$:

$$\mathcal{L}_{\mathrm{ctc}} = -\log p_{\mathrm{ctc}}(Y \mid X), \qquad p_{\mathrm{ctc}}(Y \mid X) = \sum_{Z \in \mathcal{B}^{-1}(Y)} \prod_{t} p(z_t \mid X),$$
where $\mathcal{B}^{-1}(Y)$ is the set of all extended-label sequences $Z$ that collapse to $Y$ after merging repeated labels and removing blanks.
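The CTC construction described above, a sum over all $T$-length extended-label sequences that collapse to the target, can be illustrated by brute-force enumeration on a toy example (all frame probabilities below are made up):

```python
import itertools
import numpy as np

BLANK = "-"

def collapse(path):
    """The collapsing function B: merge repeated labels, then remove blanks."""
    out = []
    for sym in path:
        if not out or sym != out[-1]:
            out.append(sym)
    return tuple(s for s in out if s != BLANK)

def ctc_prob(frame_probs, labels, target):
    """p_ctc(Y|X) by brute force: sum the probabilities of all T-length
    paths over the extended label set that collapse to the target."""
    T = len(frame_probs)
    total = 0.0
    for path in itertools.product(range(len(labels)), repeat=T):
        if collapse(tuple(labels[i] for i in path)) == target:
            total += float(np.prod([frame_probs[t][i] for t, i in enumerate(path)]))
    return total

# toy example: T = 3 frames, extended label set {blank, a, b}
labels = [BLANK, "a", "b"]
probs = [[0.6, 0.3, 0.1],
         [0.2, 0.7, 0.1],
         [0.5, 0.2, 0.3]]
p_a = ctc_prob(probs, labels, ("a",))
```

Real CTC implementations compute this sum with a forward-backward recursion instead of enumeration, but the quantity is the same.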
It has been found that the CTC output can also improve decoding results when combined with the main attention-based probabilities during the search:

$$\hat{Y} = \operatorname*{arg\,max}_{Y} \left\{ \lambda \log p_{\mathrm{ctc}}(Y \mid X) + (1 - \lambda) \log p_{\mathrm{att}}(Y \mid X) \right\}$$
An external language model (LM) is a commonly used technique to improve ASR results. LMs are trained on text corpora, which usually contain an order of magnitude more examples of written language than acoustic corpora, and therefore provide a reliable source of information for selecting well-formed transcriptions from the hypotheses. In end-to-end ASR, this information is used during decoding by adding the LM probability of a hypothetical output label sequence, with a scaling factor $\beta$, to the probabilities obtained from the main model:

$$\hat{Y} = \operatorname*{arg\,max}_{Y} \left\{ \lambda \log p_{\mathrm{ctc}}(Y \mid X) + (1 - \lambda) \log p_{\mathrm{att}}(Y \mid X) + \beta \log p_{\mathrm{lm}}(Y) \right\}$$
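The combined decoding score can be sketched as a plain function of the three log-probabilities; the hypotheses and values below are illustrative, not real model outputs:

```python
def joint_score(log_p_ctc, log_p_att, log_p_lm, lam, beta):
    """Hypothesis score used during beam search:
    lambda * log p_ctc + (1 - lambda) * log p_att + beta * log p_lm."""
    return lam * log_p_ctc + (1.0 - lam) * log_p_att + beta * log_p_lm

# two hypothetical hypotheses with made-up log-probabilities
hyps = {
    "a cat sat": (-4.0, -3.0, -2.0),
    "a cat sad": (-6.0, -2.8, -5.0),
}
best = max(hyps, key=lambda y: joint_score(*hyps[y], lam=0.5, beta=0.3))
```

Note how the LM term rescues the well-formed hypothesis even though its attention score is slightly worse.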
The encoder-decoder architecture allows the output sequence (transcription) to have any length that does not exceed the length of the input sequence (audio recording). Consequently, it is possible to employ different kinds of output units, for example words or characters. In the case of words, transcription hypotheses are limited to words present in the vocabulary, which causes out-of-vocabulary problems and requires a large dimensionality of the final layers. In the case of characters, output sequences become very long for alphabetic languages, which leads to a high number of hypothetical transcriptions and slows down the decoding. Sub-word units were first suggested as a trade-off solution in machine translation [8] and have recently been adopted in speech recognition [9]. Sub-word units include single characters and can therefore encode any word; they also include combinations of several characters, which encode words into shorter sequences than single characters do. The unigram language model algorithm [10] segments a string by searching for the most probable sequence of sub-word units composing it:

$$\hat{x} = \operatorname*{arg\,max}_{x \in \mathcal{S}(s)} P(x),$$
where $\mathcal{S}(s)$ is the set of segmentation candidates for the string $s$, and the probability of a sequence of sub-word units $x = (x_1, \dots, x_M)$ is defined as the product of the occurrence probabilities of the sub-word units:

$$P(x) = \prod_{i=1}^{M} p(x_i).$$
The sub-word unit vocabulary is derived during the training of the segmentation model: starting from a large set of substrings that are frequent in the training data, a certain percentage of the substrings with the lowest impact on the total likelihood of all possible sub-word sequences for all sentences is iteratively eliminated until a predefined vocabulary size is reached.
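Given fixed unit probabilities, the most probable segmentation can be found with dynamic programming over substring scores; the vocabulary and probabilities below are illustrative, not a trained model:

```python
import math

def best_segmentation(text, vocab_logp):
    """Most probable segmentation of `text` into sub-word units,
    assuming unit probabilities are independent (unigram model)."""
    n = len(text)
    best = [0.0] + [-math.inf] * n      # best log-prob of text[:i]
    back = [0] * (n + 1)                # split point that achieves best[i]
    for i in range(1, n + 1):
        for j in range(i):
            piece = text[j:i]
            if piece in vocab_logp and best[j] + vocab_logp[piece] > best[i]:
                best[i] = best[j] + vocab_logp[piece]
                back[i] = j
    # walk the backpointers to recover the segmentation
    pieces, i = [], n
    while i > 0:
        pieces.append(text[back[i]:i])
        i = back[i]
    return pieces[::-1]

# toy vocabulary with made-up log-probabilities
vocab = {"un": math.log(0.1), "related": math.log(0.05),
         "u": math.log(0.01), "n": math.log(0.01),
         "rel": math.log(0.02), "ated": math.log(0.02)}
seg = best_segmentation("unrelated", vocab)
```

Here the two-piece segmentation wins over splitting into smaller, less probable units.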

3 Implementation

The frontend is implemented as a Node.js/React application and uses the WebSocket protocol to communicate with the backend. Users can sign in and upload their recordings for transcription. We plan to add support for user feedback, with the main focus on customization and fine-tuning.

Speech segmentation is performed with the Kaldi toolkit [1]. We use a publicly available pretrained SAD model. The model is trained on the Fisher English corpus [11] augmented with room impulse responses and additive noise from the Room Impulse Response and Noise Database [12]. The input features of the SAD model are 40-dimensional Mel Frequency Cepstral Coefficients (MFCC) without cepstral truncation, with a frame length of 25 ms and a shift of 10 ms. We use the segmentation parameters suggested in the aspire Kaldi recipe, but extend the maximum speech segment duration from 10 to 30 seconds and enable merging of consecutive speech segments when the duration of the merged segment does not exceed 10 seconds.
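The consecutive-segment merging can be sketched as follows; this is a simplification of the actual Kaldi post-processing, keeping only the 10-second cap mentioned above:

```python
def merge_segments(segments, max_merged=10.0):
    """Greedily merge consecutive speech segments (start, end) in seconds
    while the merged span stays within `max_merged` seconds."""
    merged = []
    for start, end in segments:
        if merged and end - merged[-1][0] <= max_merged:
            merged[-1] = (merged[-1][0], end)   # extend the previous segment
        else:
            merged.append((start, end))
    return merged

# toy SAD output: four speech segments
segs = [(0.0, 3.0), (3.5, 6.0), (6.2, 12.0), (12.5, 14.0)]
out = merge_segments(segs)
```

The real recipe also takes the silence gaps between segments into account; this sketch only illustrates the duration constraint.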

Speech recognition is implemented with the ESPnet end-to-end speech recognition toolkit [2] with the PyTorch backend. We follow the LibriSpeech ESPnet recipe and use 80-dimensional log Mel filterbank coefficients concatenated with 3-dimensional pitch features, with a frame length of 25 ms and a shift of 10 ms, as acoustic features and sub-word units as output labels. The Kaldi toolkit is used to extract and normalize the input features. Normalization to zero mean and unit variance is done with global statistics from the training data. The SentencePiece unsupervised text tokenizer is used to generate the list of sub-word units based on the language model training data and to segment all kinds of text data. We evaluated several sub-word vocabulary sizes between 50 and 5000 and found that 100 gave the best results for both the English and the German system. The ASR model is an encoder-decoder neural network. The encoder network consists of 2 VGG [13] blocks followed by 5 Bidirectional Long Short-Term Memory (BLSTM) layers [14] with 1024 units in each layer and direction. The decoder network consists of 2 Long Short-Term Memory (LSTM) [15] layers with 1024 units and a location-based attention mechanism with 1024 dimensions, 10 convolution channels and 100 convolution filters. The CTC weight is set to the same value for both training and decoding. Training is performed with the AdaDelta optimizer [16] and gradient clipping on 4 Graphics Processing Units (GPUs) in parallel with a batch size of 24 for 10 epochs. The optimizer's $\epsilon$ parameter is halved after an epoch if the performance of the model did not improve on the validation set. The model with the highest accuracy on the validation set is used for decoding with a beam size of 20.
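The global mean-variance normalization step can be sketched as follows; the toy feature matrix stands in for statistics that are really computed over the full training set:

```python
import numpy as np

def cmvn(features, mean, std):
    """Normalize feature vectors to zero mean and unit variance
    using global statistics computed on the training data."""
    return (features - mean) / std

# toy "training data": 3 frames of 2-dimensional features
train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
mean, std = train.mean(axis=0), train.std(axis=0)
normed = cmvn(train, mean, std)
```

At decoding time the same training-set statistics are applied to the test features, so train and test inputs share one scale.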

The external LM for the English system contains 2 layers of 650 LSTM units and is trained with the stochastic gradient descent optimizer with a batch size of 256 for 60 epochs. The external LM for the German system contains 2 layers of 3000 LSTM units and is trained with the Adam optimizer [17] with a batch size of 128 for 10 epochs. A separate LM scaling factor is used during decoding for each system.

4 Resources

Both the English and the German system are trained on multiple speech databases, which are summarized in Table 1. We use data preparation scripts from the multi_en Kaldi recipe and the German ASR recipe [18]. The German system is additionally improved by data augmentation, applied to 3 datasets (marked with (*) in the table) with the Acoustic Simulator package. This procedure yields an augmented dataset that is 10 times larger than the original one.

The external LM for the English system is trained on transcriptions from the training speech databases except for Common Voice. The external LM for the German system is trained on all transcriptions from the training speech databases and an additional text corpus containing 8 million preprocessed read sentences from the German Wikipedia, the European Parliament Proceedings Parallel Corpus and a crawled corpus of direct speech.

Language  Corpus                      Style        Hours
English   LibriSpeech [19]            Read         960
          Switchboard [20]            Spontaneous  317
          TED-LIUM 3 [21]             Spontaneous  450
          AMI [22]                    Spontaneous  229
          WSJ [23]                    Read         81
          Common Voice                Read         240
          Total                                    2277
German    Tuda-De [24]                Read         109
          SWC [25]                    Read         245
          M-AILABS (*)                Read         2336
          Verbmobil 1 and 2 [26] (*)  Mixed        417
          VoxForge (*)                Read         571
          RVG 1 [27]                  Mixed        100
          PhonDat 1 [28]              Mixed        19
          Total                                    3797
Table 1: English and German training data covering data sets with different styles

5 ASR Performance

5.1 Results

Table 2 compares the results of IMS-Speech on several test datasets with the best results for the corresponding datasets that we could find in various sources. In summary, these results suggest that our generic systems can compete with task-specific systems and in some cases even outperform them, possibly due to better generalization from a larger amount of training data.

Language  Dataset                 IMS-Speech  State of the art
English   WSJ eval'92             3.8         3.5 [29]
          LibriSpeech test-clean  4.4         3.2 [30]
          LibriSpeech test-other  12.7        7.6 [30]
          TED-LIUM 3 test         12.8        6.7 [21]
          AMI IHM eval            17.4        19.2
          AMI SDM eval            38.5        36.7
          AMI MDM eval            34.1        34.2
German    Tuda-De dev             11.1        13.1 [18]
          Tuda-De test            12.0        14.4 [18]
          Verbmobil 1 dev         6.7         18.2 [18]
          Verbmobil 1 test        7.3         12.7 [31]
Table 2: ASR performance comparison with state of the art results (WER, %)

We evaluate the recognition speed with different beam widths, and batched recognition with inference on a CPU and a GPU. The results in Table 3 show that batched recognition can significantly increase the speed of recognition without any impact on WER.

Beam width  1 CPU core, batch size 1  1 GPU, batch size 23
            WER, %   RT factor        WER, %   RT factor
20          12.0     14.2             12.0     0.7
15          12.2     11.3             12.2     0.5
10          12.6     8.8              12.6     0.4
5           13.7     7.0              13.7     0.3
Table 3: Beam width effect on recognition performance and speed on the Tuda-De test set
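The real-time (RT) factor reported in Table 3 is the ratio of processing time to audio duration; values below 1.0 mean faster than real time. A trivial sketch with illustrative numbers:

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RT factor: processing time divided by audio duration.
    A value below 1.0 means the system runs faster than real time."""
    return processing_seconds / audio_seconds

# e.g. 42 s of computation for a 60 s recording (made-up numbers)
rtf = real_time_factor(42.0, 60.0)
```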

5.2 Comparisons with Google API

We use the ASR benchmark framework [32] to compare the performance of IMS-Speech and the Google API. The Google API results were retrieved on January 8, 2019. As the framework uses a custom WER computation method instead of the NIST sclite utility used in the ESPnet recipes, we had to score the IMS-Speech output with the framework as well. We excluded all utterances for which the Google API transcriptions contained digits, because their WER would be high even if the transcriptions were correct (a couple of examples are given in Table 4), and also utterances for which the Google API transcriptions were empty. The results are shown in Table 5. The numbers suggest that the Google API models may be optimized for certain speech domains and recording conditions that differ significantly from the ones we tested.

Utterance                                 System      Transcription
LibriSpeech test-other, 2609-157645-0010  IMS-Speech  then let them sing to the hundred and nineteenth replied the curate
                                          Google API  then let them sing the 119th repository
Verbmobil 1 test, w007dxx0_001_BFG        IMS-Speech  Ich würde Ihnen den einundzwanzigsten August bis zum vier fünfundzwanzigsten vorschlagen
                                          Google API  ich würde Ihnen den 21. August bis den 425 vorschlagen
Table 4: Examples of some perfect IMS-Speech transcriptions and the corresponding Google API transcriptions
Language  Dataset                  IMS-Speech  Google API  Scored utterances
English   LibriSpeech test-clean   4.3         15.9        2444 of 2620 (93%)
          LibriSpeech test-other   12.5        28.0        2708 of 2939 (92%)
          Common Voice valid-test  4.5         19.2        3772 of 3995 (94%)
German    Tuda-De test             10.0        12.4        3481 of 4100 (85%)
          Verbmobil 1 test         8.7         19.5        334 of 631 (53%)
Table 5: ASR performance comparison with Google API in terms of WER (%)
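WER itself is the word-level edit distance between reference and hypothesis, normalized by the reference length. A minimal sketch, which also shows how digit formatting inflates the score for the first example in Table 4 even though the content is right:

```python
def wer(reference, hypothesis):
    """Word error rate: Levenshtein distance over words,
    divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

# digits vs. spelled-out numbers: correct content, high WER
w = wer("then let them sing to the hundred and nineteenth",
        "then let them sing the 119th")
```

Scoring tools differ mainly in the text normalization they apply before this distance computation, which is why we rescored both systems with the same framework.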

6 Conclusion

We presented IMS-Speech, a web-based speech transcription tool for English and German that can be used by non-technical researchers to utilize the information in audio recordings for their studies. The comparison of the IMS-Speech results with those of specialized systems in terms of WER showed that the described service performs decently on a diverse set of tasks and conditions. In the future, we plan to allow users to customize the system for their needs, as well as to continuously improve our ASR system.




  1. Povey, D., A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al.: The Kaldi speech recognition toolkit. In Proc. of ASRU. 2011.
  2. Watanabe, S., T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. E. Y. Soplin, J. Heymann, M. Wiesner, N. Chen et al.: ESPnet: End-to-End Speech Processing Toolkit. arXiv preprint arXiv:1804.00015, 2018.
  3. Waibel, A., T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang: Phoneme recognition using time-delay neural networks. In Readings in speech recognition. 1990.
  4. Ghahremani, P., V. Manohar, D. Povey, and S. Khudanpur: Acoustic Modelling from the Signal Domain Using CNNs. In Proc. of Interspeech. 2016.
  5. Viterbi, A.: Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 1967.
  6. Bahdanau, D., K. Cho, and Y. Bengio: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
  7. Graves, A., S. Fernández, F. Gomez, and J. Schmidhuber: Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proc. of ICML. 2006.
  8. Sennrich, R., B. Haddow, and A. Birch: Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.
  9. Zeyer, A., K. Irie, R. Schlüter, and H. Ney: Improved training of end-to-end attention models for speech recognition. arXiv preprint arXiv:1805.03294, 2018.
  10. Kudo, T.: Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates. arXiv preprint arXiv:1804.10959, 2018.
  11. Cieri, C., D. Miller, and K. Walker: The Fisher Corpus: a Resource for the Next Generations of Speech-to-Text. In Proc. of LREC. 2004.
  12. Ko, T., V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur: A study on data augmentation of reverberant speech for robust speech recognition. In Proc. of IEEE ICASSP. 2017.
  13. Simonyan, K. and A. Zisserman: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  14. Graves, A., S. Fernández, and J. Schmidhuber: Bidirectional LSTM networks for improved phoneme classification and recognition. In International Conference on Artificial Neural Networks, pp. 799–804. Springer, 2005.
  15. Hochreiter, S. and J. Schmidhuber: Long short-term memory. Neural computation, 9(8), pp. 1735–1780, 1997.
  16. Zeiler, M. D.: ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
  17. Kingma, D. P. and J. Ba: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  18. Milde, B. and A. Köhn: Open Source Automatic Speech Recognition for German. In Proc. of ITG. 2018.
  19. Panayotov, V., G. Chen, D. Povey, and S. Khudanpur: Librispeech: an ASR corpus based on public domain audio books. In Proc. of IEEE ICASSP. 2015.
  20. Godfrey, J. J., E. C. Holliman, and J. McDaniel: SWITCHBOARD: Telephone speech corpus for research and development. In Proc. of IEEE ICASSP. 1992.
  21. Hernandez, F., V. Nguyen, S. Ghannay, N. Tomashenko, and Y. Esteve: TED-LIUM 3: twice as much data and corpus repartition for experiments on speaker adaptation. arXiv preprint arXiv:1805.04699, 2018.
  22. Carletta, J.: Unleashing the killer corpus: experiences in creating the multi-everything AMI Meeting Corpus. Language Resources and Evaluation, 2007.
  23. Paul, D. B. and J. M. Baker: The design for the Wall Street Journal-based CSR corpus. In Proc. of the workshop on Speech and Natural Language. 1992.
  24. Radeck-Arneth, S., B. Milde, A. Lange, E. Gouvêa, S. Radomski, M. Mühlhäuser, and C. Biemann: Open source German distant speech recognition: Corpus and acoustic model. In Text, Speech, and Dialogue. 2015.
  25. Köhn, A., F. Stegen, and T. Baumann: Mining the Spoken Wikipedia for Speech Data and Beyond. In Proc. of LREC. 2016.
  26. Wahlster, W.: Verbmobil: foundations of speech-to-speech translation. Springer Science & Business Media, 2013.
  27. Burger, S. and F. Schiel: RVG 1 – A Database for Regional Variants of Contemporary German. In Proc. of LREC. 1998.
  28. Hess, W. J., K. J. Kohler, and H.-G. Tillmann: The Phondat-Verbmobil speech corpus. In Fourth European Conference on Speech Communication and Technology. 1995.
  29. Chan, W. and I. Lane: Deep recurrent neural networks for acoustic modelling. arXiv preprint arXiv:1504.01482, 2015.
  30. Han, K. J., A. Chandrashekaran, J. Kim, and I. Lane: The CAPIO 2017 conversational speech recognition system. arXiv preprint arXiv:1801.00059, 2017.
  31. Gaida, C., P. Lange, R. Petrick, P. Proba, A. Malatawy, and D. Suendermann-Oeft: Comparing open-source speech recognition toolkits. Tech. Rep., DHBW Stuttgart, 2014.
  32. Dernoncourt, F., T. Bui, and W. Chang: A Framework for Speech Recognition Benchmarking. In Proc. of Interspeech, 2018.