End-to-end named entity extraction from speech



Named entity recognition (NER) is a spoken language understanding (SLU) task that extracts semantic information, usually from textual documents. Until now, NER from speech has been performed through a pipeline that first applies automatic speech recognition (ASR) to the audio and then applies NER to the ASR outputs. Such an approach has several disadvantages (error propagation, an ASR tuning metric that is sub-optimal with regard to the final task, a reduced search space at the ASR output level, …), and it is known that more integrated approaches outperform sequential ones when they can be applied. In this paper, we present a first study of an end-to-end approach that directly extracts named entities from speech through a single neural architecture. In this way, ASR and NER can be optimized jointly. Experiments are carried out on easily accessible French data, distributed in several evaluation campaigns. Experimental results show that this end-to-end approach provides better results (F-measure=0.69 on test data) than a classical pipeline approach to detect named entity categories (F-measure=0.65).


Sahar Ghannay, Antoine Caubrière, Yannick Estève, Antoine Laurent, Emmanuel Morin

LIUM - University of Le Mans, France

LS2N - University of Nantes, France

firstname.lastname@univ-lemans.fr, emmanuel.morin@univ-nantes.fr

Index Terms: End-to-end approach, Named entity recognition, Automatic speech recognition, Deep learning.

1 Introduction

Named entities are sequences of words that carry basic predefined semantic information. They usually refer to locations, persons, organizations, … that can be denoted by proper nouns or that are unique in the real world, and they usually include numeric and temporal values. Named entities often constitute the first semantic building blocks to extract in order to construct a structured semantic representation of a document's content.

Named entity recognition (NER) is a spoken language understanding (SLU) task that extracts semantic information, usually from textual documents. Until now, NER from speech has been performed through a pipeline that first applies automatic speech recognition (ASR) to the audio and then applies NER to the ASR outputs. Such an approach has several disadvantages.

For instance, ASR errors have a negative impact on NER performance by introducing noise into the text to be processed [1]. Rule-based NER systems are usually built to process written language and are not robust to ASR errors. Machine learning based systems do not perform well when they are trained on perfect transcriptions and deployed on ASR outputs, even if this can be partially compensated by simulating ASR errors in textual training data [2]. Additionally, ASR systems are generally tuned to get the lowest word error rate on a validation corpus, but this metric is not optimal for the NER task. For instance, it does not distinguish between errors on verbs and errors on proper nouns, while such errors do not have the same impact on NER. To compensate for this problem, some dedicated metrics to tune ASR systems for better NER performance have been proposed, such as in [3]. Another drawback is that usually no information about named entities is used in the ASR process, while such information could help to better choose which partial recognition hypotheses are dropped during decoding. As a consequence, even when confusion networks or word lattices are used to go beyond the 1-best ASR hypothesis for better robustness to ASR errors [4], such search spaces have been pruned without taking into account knowledge about named entities.

In the past, an integrated approach built on a tight coupling of ASR and NER modules has been proposed [5], based on the finite-state machine (FSM) paradigm (i.e. transducer composition), showing that such integration can offer significant improvements in terms of NER quality. The main limitation of this approach concerns the FSM paradigm itself, which cannot natively model long-distance constraints without combinatorial explosion and which, by nature, can only express dependencies through a regular grammar. Another proposal to inject information about named entities into the ASR consists in directly adding named entity expressions to the ASR vocabulary [6], and in estimating a language model for speech recognition that takes these named entity expressions into account. The main drawback of such an approach is that it cannot detect named entities that were not injected into the ASR vocabulary.

All of these issues motivate our research on a neural end-to-end approach to extract named entities from speech. In this way, ASR and NER can be optimized jointly from a NER task perspective, the architecture is more compact than the ones used in a usual pipeline, and we expect to benefit from the capacity of deep neural architectures to capture long-distance constraints at the sentence level. Very recently, a similar approach has been proposed by Facebook in a paper posted on the arXiv.org website [7]. That end-to-end approach is dedicated to domain and intent classification tasks, and experiments were carried out on internal data close in spirit to the ATIS corpus, as expressed by the authors.

In this paper, we present a first study of an end-to-end approach to extract named entities. Our neural architecture is very similar to the Deep Speech 2 neural ASR system proposed by Baidu in [8]. To use it for named entity recognition, we apply multi-task training and modify the sequence of characters to be recognized from speech. Experiments were carried out on easily accessible French data, which makes them reproducible: the data were distributed in the framework of evaluation campaigns and are still available. This paper is structured as follows. Section 2 describes the neural ASR architecture we used. Section 3 explains how we propose to exploit such a neural architecture for named entity extraction from speech. Section 4 presents some proposals to optimize the system and to compensate for the lack of manually annotated audio data. Section 5 presents our experimental results, before the conclusion.

2 Model architecture

The RNN architecture used in this study is similar to the Deep Speech 2 neural ASR system proposed by Baidu in [8]. This architecture is composed of convolution layers (CNN), followed by unidirectional or bidirectional recurrent layers, a lookahead convolution layer [9], and one fully connected layer just before the softmax layer, as shown in Figure 1.

Figure 1: Deep RNN architecture used to extract named entities from French speech.

The system is trained end-to-end using the CTC loss function [10], in order to predict a sequence of characters from the input audio. In our experiments we used two CNN layers and six bidirectional recurrent layers with batch normalization as mentioned in [8].

Given an utterance $x^{(i)}$ and a label $y^{(i)}$ sampled from a training set $X = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots\}$, the RNN is trained to convert an input sequence $x^{(i)}$ into a final transcription $y^{(i)}$. For notational convenience, we drop the superscripts and use $x$ to denote a chosen utterance and $y$ the corresponding label.

The RNN takes as input an utterance represented by a sequence of log-spectrograms of power normalized audio clips, calculated on 20ms windows. As output, all the characters of a language alphabet may be emitted, in addition to the space character used to segment character sequences into word sequences (space denotes word boundaries).

The RNN makes a prediction $p(\ell_t \mid x)$ at each output time step $t$.
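To make the per-frame prediction concrete, the sketch below shows greedy (best-path) CTC decoding: the most likely label of each frame is taken, consecutive repeats are merged, and blanks are dropped. This is only an illustration. The blank symbol `_` and the toy frame sequence are assumptions, and the system described in this paper uses a beam search decoder with a language model rather than this greedy rule.

```python
from itertools import groupby

BLANK = "_"  # hypothetical symbol standing for the CTC blank label


def ctc_collapse(frame_labels):
    """Collapse per-frame labels: merge consecutive repeats, then drop blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != BLANK)


# Toy sequence of per-frame argmax labels (made up for illustration).
frames = list("__pp_aa_r_ii__ss_")
print(ctc_collapse(frames))
```

A blank between two identical characters keeps them distinct, which is how CTC can emit doubled letters.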

At test time, the CTC model is coupled with a language model trained on a large textual corpus. A specialized beam search CTC decoder [11] is used to find the transcription $y$ that maximizes $Q(y)$:

$$Q(y) = \log p_{\mathrm{ctc}}(y \mid x) + \alpha \log p_{\mathrm{lm}}(y) + \beta \, \mathrm{wc}(y)$$

where $\mathrm{wc}(y)$ is the number of words in the transcription $y$. The weight $\alpha$ controls the relative contributions of the language model and the CTC network. The weight $\beta$ controls the number of words in the transcription.
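The role of the two weights can be illustrated with a small rescoring sketch. It assumes that the decoding score has the usual Deep Speech 2 form [8], that is, the CTC log-probability plus a weighted language model log-probability plus a weighted word count. The hypotheses, probabilities and weight values below are invented for the example.

```python
import math


def rescoring_score(log_p_ctc, log_p_lm, transcription, alpha, beta):
    """Assumed score: log p_ctc(y|x) + alpha * log p_lm(y) + beta * wc(y)."""
    word_count = len(transcription.split())
    return log_p_ctc + alpha * log_p_lm + beta * word_count


# Toy hypotheses: (text, CTC probability, LM probability); all values are made up.
hypotheses = [
    ("il est mort hier à paris", 0.020, 1e-6),
    ("il est mort hier a pari", 0.025, 1e-9),
]

best = max(
    hypotheses,
    key=lambda h: rescoring_score(
        math.log(h[1]), math.log(h[2]), h[0], alpha=0.5, beta=1.0
    ),
)
print(best[0])
```

Here the second hypothesis has the higher acoustic score, but the language model term makes the first one win. A larger `beta` would favour hypotheses with more words.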

3 Named entity extraction process

In the literature, many studies focus on named entity recognition from text. State-of-the-art systems are based on neural network architectures. Some of them rely heavily on hand-crafted features and domain-specific knowledge [12, 13]. Recent approaches [14, 15] benefit from word-level and/or character-level embeddings learned automatically, by combining bidirectional LSTM, CNN and CRF modules. However, named entity recognition from automatic transcriptions is less studied. This task is performed through a pipeline that first applies automatic speech recognition (ASR) to the audio and then applies NER to the ASR outputs [16]. Usually, the named entity recognition task consists in assigning a named entity tag to every word in a sentence. A single named entity may span several words within a sentence. For this reason, the word-level begin-inside-outside (BIO) encoding [17] is very often adopted.
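For readers unfamiliar with BIO encoding, here is a minimal sketch of how entity spans are turned into one label per word. The label strings (`B-amount`, `I-amount`, `O`) follow the common convention. The token spans and the example sentence are chosen for illustration and are not taken from the corpora used in this paper.

```python
def to_bio(tokens, entities):
    """Convert non-overlapping (start, end_exclusive, category) token spans to BIO labels."""
    labels = ["O"] * len(tokens)
    for start, end, category in entities:
        labels[start] = f"B-{category}"          # first word of the entity
        for i in range(start + 1, end):
            labels[i] = f"I-{category}"          # following words of the same entity
    return labels


tokens = ["mort", "à", "soixante", "dix", "sept", "ans"]
entities = [(2, 6, "amount")]  # "soixante dix sept ans" is a single entity
for token, label in zip(tokens, to_bio(tokens, entities)):
    print(token, label)
```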

In this preliminary study, we focus on named entity extraction from speech using the network described above, without changing the neural architecture. We would like to evaluate whether this neural architecture is able to capture the high-level semantic information that allows it to recognize named entities. For that purpose, we propose to modify the character sequence that the neural network has to produce: information about named entities is added to the initial character sequence. Instead of applying a BIO approach, we propose to add some tag characters to this sequence to delimit named entity boundaries and to indicate their category. We are interested in eight NE categories: person, function, organization, location, production, amount, time and event.

In our experiments, the system emits a begin tag or an “end” tag only before and after named entities; the other words are not concerned. To distinguish named entity categories, we use one begin tag per NE category. A single “end” tag is used for all NE categories: since named entities do not overlap in such a representation, this information is sufficient to delimit the end of a named entity.

Given the eight named entity categories targeted by the task, nine NE tags have to be added to the character list emitted by the neural network: one begin tag for each category (person, function, organization, location, production, amount, time and event) and the “end” tag. As such a neural model predicts a character at each time step, we propose to map each of these nine NE tags to a single special character, respectively: “[”, “(”, “{”, “$”, “&”, “%”, “#”, “)” and “]”, as illustrated in Figure 2.

In this way, the NE tags are included in the prediction process and are taken into account by the CTC loss function during training.

Figure 2: Example of mapping the real NE tags to character sequence. This sentence means, in English and case sensitive: ”the sculptor Caesar died yesterday in Paris at the age of seventy-seven years”
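The tag-to-character mapping can be sketched in a few lines. The code below encodes an annotated sentence into the target character sequence and recovers (category, value) pairs from an emitted sequence. The category keys reuse the abbreviations of Table 1, and the character assignment follows the order of the categories listed in this section. The French sentence is our own reconstruction of the Figure 2 example and should be read as hypothetical.

```python
# One begin character per category, plus a single shared end character.
BEGIN_CHAR = {"pers": "[", "func": "(", "org": "{", "loc": "$",
              "prod": "&", "amount": "%", "time": "#", "event": ")"}
END_CHAR = "]"
CATEGORY_OF = {char: category for category, char in BEGIN_CHAR.items()}


def encode(segments):
    """Turn (text, category or None) segments into the target character sequence."""
    parts = []
    for text, category in segments:
        if category is None:
            parts.append(text)
        else:
            parts.append(f"{BEGIN_CHAR[category]} {text} {END_CHAR}")
    return " ".join(parts)


def extract_entities(sequence):
    """Recover (category, value) pairs from a space-separated emitted sequence."""
    entities, category, words = [], None, []
    for token in sequence.split():
        if token in CATEGORY_OF:            # a begin character opens an entity
            category, words = CATEGORY_OF[token], []
        elif token == END_CHAR:             # the end character closes it
            if category is not None:
                entities.append((category, " ".join(words)))
            category = None
        elif category is not None:
            words.append(token)
    return entities


# Hypothetical reconstruction of part of the Figure 2 sentence.
segments = [("le sculpteur", None), ("césar", "pers"), ("est mort", None),
            ("hier", "time"), ("à", None), ("paris", "loc")]
target = encode(segments)
print(target)
print(extract_entities(target))
```

Because a single end character closes every category, the decoder only needs to remember which begin character was seen last, which works as long as entities do not overlap.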

4 Multi-task training, data augmentation, and starred mode

Audio recordings with both manual transcriptions and manual annotations of named entities are relatively rare, while neural end-to-end approaches are known to need large amounts of data to become competitive.

To compensate for this lack of data, we first propose to apply a multi-task learning approach to train the neural network. This consists in first training it for the ASR task only, without emitting the characters used to represent named entities, on all the audio recordings available with their manual transcriptions. Then, the softmax layer is reinitialized to take into consideration the named entity tag markers, and a new training process is carried out on the named entity recognition task, using only the training data with manual annotations of named entities.
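The transfer step between the two training phases can be sketched as follows. This is a toy illustration in plain Python, not the implementation used in this study: parameters are stored in a dictionary with hypothetical layer names, the output layer is the only one that is re-created, and its size grows by the nine NE tag characters.

```python
import random

N_NE_TAGS = 9  # eight begin tags plus one shared end tag


def init_output_layer(hidden_size, n_outputs, rng):
    """Create a fresh, randomly initialised output (softmax) weight matrix."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(hidden_size)]
            for _ in range(n_outputs)]


def prepare_ner_params(asr_params, rng):
    """Keep every pretrained layer except the output layer, which is re-created
    with nine extra rows for the NE tag characters."""
    old_output = asr_params["output"]
    hidden_size = len(old_output[0])
    ner_params = {name: w for name, w in asr_params.items() if name != "output"}
    ner_params["output"] = init_output_layer(
        hidden_size, len(old_output) + N_NE_TAGS, rng
    )
    return ner_params


rng = random.Random(0)
asr_params = {
    # Stands in for the pretrained CNN and recurrent layers (toy values).
    "recurrent": [[0.5, -0.2], [0.1, 0.3]],
    # Toy output layer for a 5-symbol alphabet.
    "output": init_output_layer(hidden_size=2, n_outputs=5, rng=rng),
}
ner_params = prepare_ner_params(asr_params, rng)
print(len(asr_params["output"]), "->", len(ner_params["output"]))
```

All lower layers keep the weights learned on the larger ASR data, so only the output layer starts from scratch in the second phase.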

A second proposal consists in artificially increasing the training data for the named entity recognition task. For this purpose, we propose to apply a named entity recognition system dedicated to text data in order to tag the manual transcriptions used to train the ASR neural network. These manual transcriptions, automatically annotated with named entities, can then be injected into the data used to train the neural network to extract named entities from speech.

In addition, since we want the system to focus on named entities, and since the CTC loss gives the same importance to each character, we propose to modify the character sequence that the neural network must emit in order to give more importance to named entities. This proposal is interesting to better understand how the CTC loss behaves in this case, and consists in replacing with a star “*” every character subsequence that does not contain a named entity. For instance, the character sequence presented in Figure 2 becomes: * [ césar ] * # hier ] * $ paris ] * % soixante dix sept ans ]. We call this approach the starred mode, and we expect that it can make the neural model more sensitive to named entities.
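The conversion to starred mode can be sketched as a single pass over the tagged sequence: tokens inside an entity are kept, and each run of tokens outside an entity is replaced by one star. The input sentence below is our own reconstruction of the Figure 2 example and is therefore hypothetical; the starred output is the one given in the text above.

```python
BEGIN_CHARS = set("[({$&%#)")  # the eight begin characters
END_CHAR = "]"


def to_starred(sequence):
    """Replace every run of tokens outside named entities with a single '*'."""
    out, inside = [], False
    for token in sequence.split():
        if not inside and token in BEGIN_CHARS:
            inside = True                 # an entity starts: keep its begin character
            out.append(token)
        elif inside:
            out.append(token)             # keep entity words and the end character
            if token == END_CHAR:
                inside = False
        elif not out or out[-1] != "*":
            out.append("*")               # first token of a non-entity run
    return " ".join(out)


# Hypothetical reconstruction of the tagged Figure 2 sentence.
tagged = ("le sculpteur [ césar ] est mort # hier ] à $ paris ] "
          "à l'âge de % soixante dix sept ans ]")
print(to_starred(tagged))
```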

5 Experiments

5.1 Experimental setups

Experiments have been carried out on four French corpora: ESTER 1, ESTER 2, ETAPE and QUAERO. These corpora are composed of data recorded from francophone radio and TV stations, and are annotated with named entities.

The ESTER corpora are divided into three parts: training, development and evaluation. The ESTER 1 [18] training (73 hours) and development (17 hours) corpora are composed of data recorded from four French radio stations. The ESTER 1 test corpus is composed of 10 hours coming from the same four radio stations plus two other stations, all recorded 15 months after the development data.

The ESTER 2 [19] training corpus was not annotated with named entities and was not used in this study. The development (17 hours) and test (10 hours) sets are composed of manual transcriptions of speech recorded from six radio stations (two of those radio stations were already used in ESTER 1).

The ETAPE [20] data consists of manual transcriptions and annotations of TV and radio shows. It contains 36 hours of speech, recorded between 2010 and 2011, divided into three parts: training (22 hours), development (7 hours) and test (7 hours).

QUAERO (ELRA-S0349) data is composed of 12 hours of manual transcriptions of TV and radio shows coming from 6 different sources recorded in 2010.

Our corpus, called DeepSUN, is the combination of those four corpora. The training corpus is composed of the training sets of ESTER 1, ETAPE and QUAERO, while the development and test sets are composed respectively of the development and test sets of ESTER 1&2 and ETAPE. It contains almost 160 hours of speech (training 107 hours, test 24 hours, development 30 hours). The distribution of named entities by category in the corpus is summarized in Table 1.

category     dev    test   train
pers        6719    4766   22115
func        1830    1425    6628
org         5133    3506   15804
loc         5195    3915   18159
prod         652     606    2317
time        3763    2769   12020
amount      1591    1450    5959
event         79       0     321
Sum        24962   18437   83323
Table 1: Distribution of named entities by categories in the DeepSUN corpus

The performance of our approach is evaluated in terms of precision (P), recall (R) and F-measure for named entity detection, for named entity/value detection, and in terms of the accuracy of value detection when the named entity tags are correctly detected. These evaluations are made with the help of the sclite tool (http://www.icsi.berkeley.edu/Speech/docs/sctk-1.2/sclite.htm).
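As a reminder of how these metrics relate, the sketch below computes precision, recall and F-measure with their standard definitions. It does not reproduce sclite's alignment procedure. The counts in the first call are invented, while the second call uses the precision and recall reported for E2E on the test set in Table 2.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f(n_correct, n_hypothesis, n_reference):
    """Metrics from counts of correct, hypothesised and reference entities."""
    precision = n_correct / n_hypothesis if n_hypothesis else 0.0
    recall = n_correct / n_reference if n_reference else 0.0
    return precision, recall, f_measure(precision, recall)


# Invented counts: 50 correct entities out of 100 hypothesised and 200 expected.
print(precision_recall_f(50, 100, 200))

# Precision and recall of E2E on the test set (Table 2).
print(round(f_measure(0.83, 0.52), 2))
```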

5.2 Multi-task training

For multi-task training, we first train the E2E architecture for the ASR task only, without emitting the characters used to represent named entities. The system is trained on all the audio recordings available with their manual transcriptions, around 297.7 hours of training data, including the data described above.

It is composed of two convolution layers and six BLSTM layers with batch normalization, the number of epochs was set to . This system achieves a 20.70% word error rate (WER) and an 8.01% character error rate (CER) on the dev corpus (30.2 hours), and a 19.95% WER and a 7.68% CER on the test set (40.8 hours). These results were obtained by applying CTC beam search decoding coupled with a trigram language model. Once this system is trained, the softmax layer is reinitialized to take into consideration the named entity tag markers, and a new training process is carried out on the named entity recognition task, using only the training data with manual annotations of named entities described in Table 1. In addition, for the training of both the E2E and ASR systems, each training audio sample is randomly perturbed in gain and tempo at each iteration.
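WER and CER are both edit distances normalized by the reference length, computed over words and over characters respectively. The sketch below uses the standard Levenshtein distance with unit costs. It is an illustration of the metric definitions, not the scoring tool used for the figures above, and the example strings are invented.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, with unit edit costs."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            current.append(min(
                previous[j] + 1,             # deletion of a reference item
                current[j - 1] + 1,          # insertion of a hypothesis item
                previous[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        previous = current
    return previous[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over the reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


print(wer("il est mort hier", "il est mort"))
print(cer("paris", "pari"))
```

The same recognition error usually weighs less in CER than in WER, which is consistent with the CER values above being much lower than the WER values.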

5.3 Experimental results

We present in this section some experimental results. Table 2 shows the performance of the end-to-end model (E2E) in detecting NE categories (among the eight ones). This means that in this evaluation we do not take into account the values associated with the detected NEs. The starred mode is also evaluated and is called E2E* in the table: this mode provides better F-measure results on this task than the normal mode.

System   Corpus   Precision   Recall   F-measure
E2E      dev      0.85        0.57     0.68
E2E      test     0.83        0.52     0.64
E2E*     dev      0.75        0.65     0.71
E2E*     test     0.82        0.57     0.67
Table 2: Named entity category detection results for E2E and E2E* (starred mode) systems

Table 3 evaluates the quality of the category/value pairs that have been recognized. While precision and recall do not behave in the same way in normal and starred mode, both modes reach similar F-measure values.

System   Corpus   Precision   Recall   F-measure
E2E      dev      0.64        0.45     0.53
E2E      test     0.55        0.36     0.44
E2E*     dev      0.57        0.47     0.52
E2E*     test     0.47        0.38     0.42
Table 3: Named entity category+value pair detection results for E2E and E2E* systems

Last, we would like to compare these results to the ones obtained by a pipeline process, which consists in applying a text-based named entity recognition system to the automatic transcripts produced by the end-to-end ASR system trained in the first step of the multi-task learning presented above.

The text named entity recognition system used for this experiment is based on the combination of bi-directional LSTM (BLSTM), CNN and CRF modules [15], and benefits from both word-level and character-level embeddings learned automatically during the training process. For this experiment, we used the NeuroNLP2 implementation (https://github.com/XuezheMax/NeuroNLP2). A convolutional neural network encodes the character-level information of a word into its character-level embedding. Then the character-level and word-level embeddings are fed into the BLSTM to model the context information of each word. On top of the BLSTM, a sequential CRF is used to jointly decode labels for the whole sentence. In addition, this system can be enriched with syntactic information such as part-of-speech (POS) tags. In our experiment, NeuroNLP2 is used as the NER system and Deep Speech 2 as the ASR system. Both are trained on the DeepSUN corpus described in Section 5.1. Automatic transcriptions of the dev and test data have been annotated with the NER system. To measure the impact of POS, we used the MACAON system [21] to tag both the manual and the automatic transcriptions of the DeepSUN corpus.

To feed NeuroNLP2, POS information is represented by one-hot vectors. The concatenation of word embeddings, character representations and one-hot vectors feeds the BLSTM layer. As we can see in Table 4, the pipeline process is less competitive than the end-to-end model at recognizing NE categories, but is more effective at extracting NE values. Results also confirm that linguistic information such as POS is really important for the NER task. This observation will help future work that continues this study.

System    Detection   Precision   Recall   F-measure
Pip       category    0.75        0.56     0.64
Pip+POS   category    0.74        0.58     0.65
Pip       cat+value   0.58        0.43     0.49
Pip+POS   cat+value   0.57        0.45     0.50
Table 4: NER results for the pipeline approach (Pip) on the test data. When POS are used to tag ASR outputs before NER processing, the system is called Pip+POS
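The input construction used for Pip+POS can be sketched as a simple concatenation: for each word, the word embedding, the character-level representation and a one-hot POS vector are joined into a single feature vector for the BLSTM. The tag set, vector sizes and values below are hypothetical and do not correspond to the MACAON tag set or to the NeuroNLP2 configuration.

```python
POS_TAGS = ["DET", "NOUN", "PROPN", "VERB", "ADV", "ADP"]  # hypothetical tag set


def one_hot(tag, tags=POS_TAGS):
    """Vector with a single 1.0 at the index of the given tag."""
    vector = [0.0] * len(tags)
    vector[tags.index(tag)] = 1.0
    return vector


def blstm_input(word_embedding, char_representation, pos_tag):
    """Concatenate the three per-word representations into one feature vector."""
    return list(word_embedding) + list(char_representation) + one_hot(pos_tag)


# Toy 3-dimensional word embedding and 2-dimensional character representation.
features = blstm_input([0.1, -0.4, 0.7], [0.2, 0.9], "PROPN")
print(len(features), features)
```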

As described in Section 4, we applied NeuroNLP2 (the version using POS tagging) to the manual transcriptions of the ASR training data in order to increase the amount of “NER from speech” training data. In this experiment, both the normal and starred modes were used. Table 5 shows the improvement obtained by the end-to-end system when training on these imperfect augmented data using the normal (E2E+) and the starred (E2E+*) modes. As we can see, the use of the augmented data was helpful for the starred mode.

System   Detection   Precision   Recall   F-measure
E2E+     category    0.82        0.57     0.67
E2E+*    category    0.76        0.63     0.69
E2E+     cat+value   0.55        0.40     0.46
E2E+*    cat+value   0.49        0.41     0.47
Table 5: NER results on the test data for the E2E system trained with imperfect augmented data (E2E+) in comparison to the E2E system trained with imperfect augmented data and the starred mode (E2E+*)

6 Conclusion

This paper presents a first study of end-to-end named entity extraction from speech. By integrating into the character sequence emitted by a CTC end-to-end speech recognition system some special characters that delimit and categorize named entities, we showed that such extraction is feasible. To compensate for the lack of training data, we propose a multi-task learning approach (ASR + NER) in addition to an artificial augmentation of the training corpus with automatic annotation of named entities. A starred mode is also proposed to make the neural network more focused on named entities. Experimental results show that this end-to-end approach, in starred mode with training data augmentation, provides better results (F-measure=0.69 on test) than a pipeline approach to detect named entity categories (F-measure=0.64, or 0.65 with POS tagging). On the other hand, the performance of this end-to-end approach in extracting named entity values is worse than that of the pipeline process.

In conclusion, this study presents promising results for a first attempt at an end-to-end approach to extract named entities, and constitutes an interesting starting point for future work, which could start by combining starred mode with training data augmentation, but could also explore other directions, such as injecting linguistic information into the end-to-end neural architecture.

7 Acknowledgements

This work was supported by the French ANR Agency through the CHIST-ERA M2CR project, under the contract number ANR-15-CHR2-0006-01, and by the RFI Atlanstic2020 RAPACE project. The authors would like to sincerely thank Sean Naren for making his implementation of Deep Speech 2 available, as well as the contributors to the NeuroNLP2 project.


  • [1] M. Hatmi, C. Jacquin, E. Morin, and S. Meignier, “Named entity recognition in speech transcripts following an extended taxonomy,” in First Workshop on Speech, Language and Audio in Multimedia, 2013.
  • [2] E. Simonnet, S. Ghannay, N. Camelin, and Y. Estève, “Simulating ASR errors for training SLU systems,” in LREC 2018, Miyazaki, Japan, May 2018. [Online]. Available: https://hal-univ-lemans.archives-ouvertes.fr/hal-01715923
  • [3] M. A. B. Jannet, O. Galibert, M. Adda-Decker, and S. Rosset, “Investigating the effect of ASR tuning on named entity recognition,” Proc. Interspeech 2017, pp. 2486–2490, 2017.
  • [4] D. Hakkani-Tür, F. Béchet, G. Riccardi, and G. Tur, “Beyond ASR 1-best: Using word confusion networks in spoken language understanding,” Computer Speech & Language, vol. 20, no. 4, pp. 495–514, 2006.
  • [5] C. Servan, C. Raymond, F. Béchet, and P. Nocéra, “Conceptual decoding from word lattices: application to the spoken dialogue corpus MEDIA,” in The Ninth International Conference on Spoken Language Processing (Interspeech 2006-ICSLP), 2006.
  • [6] M. Hatmi, C. Jacquin, E. Morin, and S. Meignier, “Incorporating named entity recognition into the speech transcription process,” in Proceedings of the 14th Annual Conference of the International Speech Communication Association (Interspeech’13), 2013, pp. 3732–3736.
  • [7] D. Serdyuk, Y. Wang, C. Fuegen, A. Kumar, B. Liu, and Y. Bengio, “Towards end-to-end spoken language understanding,” arXiv preprint arXiv:1802.08395, 2018.
  • [8] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen et al., “Deep Speech 2: End-to-end speech recognition in English and Mandarin,” in International Conference on Machine Learning, 2016, pp. 173–182.
  • [9] C. Wang, D. Yogatama, A. Coates, T. Han, A. Hannun, and B. Xiao, “Lookahead convolution layer for unidirectional recurrent neural networks,” 2016.
  • [10] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd International Conference on Machine Learning. ACM, 2006, pp. 369–376.
  • [11] A. Y. Hannun, A. L. Maas, D. Jurafsky, and A. Y. Ng, “First-pass large vocabulary continuous speech recognition using bi-directional recurrent DNNs,” arXiv preprint arXiv:1408.2873, 2014.
  • [12] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, “Natural language processing (almost) from scratch,” Journal of Machine Learning Research, vol. 12, no. Aug, pp. 2493–2537, 2011.
  • [13] J. P. Chiu and E. Nichols, “Named entity recognition with bidirectional LSTM-CNNs,” arXiv preprint arXiv:1511.08308, 2015.
  • [14] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer, “Neural architectures for named entity recognition,” arXiv preprint arXiv:1603.01360, 2016.
  • [15] X. Ma and E. Hovy, “End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF,” arXiv preprint arXiv:1603.01354, 2016.
  • [16] C. Raymond, “Robust tree-structured named entities recognition from speech,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 8475–8479.
  • [17] L. A. Ramshaw and M. P. Marcus, “Text chunking using transformation-based learning,” CoRR, vol. cmp-lg/9505040, 1995. [Online]. Available: http://arxiv.org/abs/cmp-lg/9505040
  • [18] S. Galliano, E. Geoffrois, D. Mostefa, K. Choukri, J.-F. Bonastre, and G. Gravier, “The ESTER phase II evaluation campaign for the rich transcription of French broadcast news,” in Ninth European Conference on Speech Communication and Technology, 2005.
  • [19] S. Galliano, G. Gravier, and L. Chaubard, “The ESTER 2 evaluation campaign for the rich transcription of French radio broadcasts,” in Tenth Annual Conference of the International Speech Communication Association, 2009.
  • [20] G. Gravier, G. Adda, N. Paulson, M. Carré, A. Giraudel, and O. Galibert, “The ETAPE corpus for the evaluation of speech-based TV content processing in the French language,” in LREC, Eighth International Conference on Language Resources and Evaluation, 2012.
  • [21] A. Nasr, F. Béchet, J.-F. Rey, B. Favre, and J. Le Roux, “MACAON: An NLP tool suite for processing word lattices,” in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Systems Demonstrations. Association for Computational Linguistics, 2011, pp. 86–91.