ASR error management for improving spoken language understanding
This paper addresses the problem of automatic speech recognition (ASR) error detection and the use of detected errors for improving spoken language understanding (SLU) systems. In this study, the SLU task consists in automatically extracting, from ASR transcriptions, semantic concepts and concept/value pairs in, for example, a tourist information system. An approach is proposed that enriches the set of semantic labels with error-specific labels and uses a recently proposed neural approach based on word embeddings to compute well-calibrated ASR confidence measures. Experimental results show that it is possible to significantly decrease the Concept/Value Error Rate with a state-of-the-art system, outperforming previously published results on the same experimental data. It is also shown that, by combining an SLU approach based on conditional random fields with a neural encoder/decoder attention-based architecture, it is possible to effectively identify confidence islands and uncertain semantic output segments, which are useful for deciding appropriate error handling actions by the dialogue manager strategy.
Index Terms: spoken language understanding, speech recognition, robustness to ASR errors
In spite of impressive research efforts and recent results, systems for semantic interpretation of text and speech still make errors. Some of the problems common to text and speech are: the difficulty of localizing concept mentions, ambiguities intrinsic to localized mentions, and the failure to identify sufficient contextual constraints for solving interpretation ambiguities. Additional problems are introduced by the interaction between a spoken language understanding (SLU) system and an error-prone automatic speech recognition (ASR) system. ASR errors may affect the mention of a concept or the value of a concept instance. Furthermore, the hypothesization of concepts and values depends, among other things, on the context in which their mention is localized. Thus, context errors may also introduce errors in concept mention location and hypothesization.
The focus of this paper
SLU systems are error prone. Some of their errors are caused by certain types of ASR errors. In general, ASR errors are reduced by estimating model parameters so as to minimize the expected word error rate . The effect of word errors can be controlled by associating a single sentence hypothesis with word confidence measures. In  methods are proposed for constructing confidence features for improving the quality of a semantic confidence measure. Methods proposed for confidence calibration are based on the maximum entropy model with distribution constraints, the conventional artificial neural network, and the deep belief network (DBN). The latter two methods show slightly superior performance but higher computational complexity compared to the first one. More recently , new features and bidirectional recurrent neural networks (RNN) have been proposed for ASR error detection. Most SLU systems reviewed in  generate hypotheses of semantic frame slot tags expressed in a spoken sentence analyzed by an ASR system. The use of deep neural networks (DNN) appeared in more recent systems as described in . Bidirectional RNNs with long short-term memory (LSTM) have been used for semantic frame slot tagging . In , LSTMs have been proposed with an attention mechanism for parsing text sentences to logical forms. Following , in  a convolutional neural network (CNN) is proposed for encoding the representation of knowledge expressed in a spoken sentence. This encoding is used as an attention mechanism for constraining the hypothesization of slot tags expressed in the same sentence. Most recent papers using sophisticated SLU architectures based on RNNs take as input the best sequence of word hypotheses produced by an ASR system. In this paper, two SLU architectures are considered. The first one, based on an encoder with bidirectional gated recurrent units (GRU) used for machine translation , integrates context information with an attention-based decoder as in .
The second one integrates context information in the same architecture used in  based on conditional random fields (CRF). Both SLU systems receive word hypotheses generated by the same ASR sub-system and scored with confidence measures computed by a neural architecture with new types of embeddings and semantically relevant confidence features.
3 ASR error detection and confidence measure
Two different confidence measures are used for error detection. The first one is the word posterior probability computed with confusion networks as described in . The other one is a variant of a new approach, introduced in . The latter measure is computed with a Multi-Stream Multi-Layer Perceptron (MS-MLP) architecture, fed by heterogeneous confidence features. Among them, the most relevant for SLU are word embeddings of the targeted word and its neighbors, length of the current word, language model backoff behavior, part of speech (POS) tags, syntactic dependency labels and word governors. Other features, such as prosodic features and acoustic word embeddings described in  and  could also be used but were not considered in the experiments described in this paper. Particular attention was paid to the computation of the word embeddings, which result from a combination of different well-known word embeddings (CBOW, Skip-gram, GloVe) made through the use of a neural auto-encoder in order to improve the performance of this ASR error detection system .
The MS-MLP proposed here for ASR error detection has two output units. They compute scores for Correct and Error labels associated with an ASR generated hypothesis. This hypothesis is evaluated by the softmax value of the Correct label scored with the MS-MLP. Experiments have shown that this is a calibrated confidence measure more effective than word posterior probability when comparison is based on the Normalized Cross Entropy (NCE) , which measures the information contribution provided by confidence knowledge.
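As an illustration of the two quantities discussed above, the following minimal sketch computes a softmax confidence from two output scores and the NCE of a set of scored words. It assumes the usual definition NCE = (H_max - H_conf) / H_max with base-2 logarithms; the function names and the clipping constant `eps` are illustrative choices, not taken from the system described here.

```python
import math

def correct_confidence(score_correct, score_error):
    """Softmax probability of the Correct label from the two output scores."""
    m = max(score_correct, score_error)  # subtract the max for numerical stability
    e_c = math.exp(score_correct - m)
    e_e = math.exp(score_error - m)
    return e_c / (e_c + e_e)

def normalized_cross_entropy(confidences, is_correct, eps=1e-12):
    """NCE = (H_max - H_conf) / H_max, using base-2 logarithms."""
    n = len(confidences)
    n_correct = sum(is_correct)
    p_c = n_correct / n  # prior probability that a word is correct
    if p_c in (0.0, 1.0):
        raise ValueError("NCE is undefined when all words share the same label")
    h_max = -(n_correct * math.log2(p_c) + (n - n_correct) * math.log2(1.0 - p_c))
    h_conf = 0.0
    for c, ok in zip(confidences, is_correct):
        c = min(max(c, eps), 1.0 - eps)  # clip to avoid log(0) (eps is an assumption)
        h_conf -= math.log2(c) if ok else math.log2(1.0 - c)
    return (h_max - h_conf) / h_max
```

With this definition, a measure that always outputs the prior probability of correctness obtains an NCE of 0, and a perfectly informative measure approaches 1.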
Table ? shows the NCE values obtained by these two confidence measures on the MEDIA test data whose details can be found in Section 5.
Figure ? shows the predictive capability of the confidence measure based on the MS-MLP compared to word posterior probability on the MEDIA test data. The curve shows the predicted percentage of correct words as a function of confidence intervals. The best measure is the one for which percentages are the closest to the diagonal line.
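A curve of this kind can be obtained by grouping words into confidence intervals and measuring the fraction of correct words in each interval. The sketch below assumes equal-width intervals over [0, 1]; the number of intervals and the function name are illustrative assumptions, not details reported in this paper.

```python
def calibration_curve(confidences, is_correct, n_bins=10):
    """Return (bin_lower, bin_upper, fraction_correct, count) for non-empty bins."""
    counts = [0] * n_bins
    hits = [0] * n_bins
    for c, ok in zip(confidences, is_correct):
        b = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 falls in the last bin
        counts[b] += 1
        hits[b] += int(ok)
    curve = []
    for b in range(n_bins):
        if counts[b]:
            curve.append((b / n_bins, (b + 1) / n_bins, hits[b] / counts[b], counts[b]))
    return curve
```

For a well-calibrated measure, the fraction of correct words in each interval lies close to the confidence values of that interval, which corresponds to the diagonal.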
With these confidence measures, we expect to obtain relevant information for better handling ASR errors in a spoken language understanding framework.
4 SLU features and architectures
Two basic SLU architectures are considered for the experiments on the MEDIA corpus (described in Section 5.1). The first one is an encoder/decoder recurrent neural architecture with an attention mechanism (NN-EDA), similar to the one used for machine translation proposed in . The second one is based on conditional random fields (CRF - ). Both architectures build their training model on the same features, encoded with continuous values in the first one and discrete values in the second one.
4.1 Set of Features
Word features, including those defined for facilitating the association of a word with a semantic content, are defined as follows:
- the word itself
- its pre-defined semantic categories, which belong to:
  - MEDIA specific categories: names of streets, cities or hotels, lists of room equipment, food types, … e.g.: TOWN for Paris
  - more general categories: figures, days, months, … e.g.: FIGURE for thirty-three.
- a set of syntactic features: the MACAON tool  is applied to the whole turn in order to obtain the following tags for each word: the lemma, the POS tag, its word governor and its relation with the current word.
- a set of morphological features: the first 1-to-4 letter n-grams, the last 1-to-4 letter n-grams of the word and a binary feature that indicates whether the first letter is upper case.
- the two ASR confidence measures: the ASR posterior probability (pap) and the MS-MLP confidence measure as described in Section 3.
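The morphological features in the list above are simple enough to sketch directly. The following minimal example is one possible implementation; the feature names (`prefix1`, `suffix1`, `is_capitalized`) are hypothetical and only serve the illustration.

```python
def morphological_features(word, max_n=4):
    """Prefix/suffix letter n-grams (n = 1..max_n) and a capitalization flag."""
    feats = {}
    for n in range(1, max_n + 1):
        # Words shorter than n simply contribute the whole word.
        feats[f"prefix{n}"] = word[:n]
        feats[f"suffix{n}"] = word[-n:]
    feats["is_capitalized"] = word[:1].isupper()
    return feats
```

For the word Paris, this yields the prefixes P, Pa, Par, Pari, the suffixes s, is, ris, aris, and a positive capitalization flag.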
Both SLU architectures take all of these features as input; the two confidence measures, however, may be used selectively (either one or both), depending on which configuration performs best. The hyper-parameters of each architecture also need to be tuned in order to give the best results. The way the best configuration is chosen is described in Section 5.3.
4.2 Neural EDA system
The proposed RNN encoder-decoder architecture with an attention-based mechanism (NN-EDA) is inspired by a machine translation architecture and depicted in figure ?. The concept tagging process is considered as a translation problem from words (source language) to semantic concept tags (target language). The bidirectional RNN encoder is based on Gated Recurrent Units (GRU) and computes an annotation for each word of the input sequence. This annotation is the concatenation of the matching forward hidden layer state and the backward hidden layer state obtained respectively by the forward RNN and the backward RNN comprising the bidirectional RNN. Each annotation thus summarizes the dialogue turn contexts respectively preceding and following the considered word.
The sequence of annotations is used by the decoder to compute a context vector (represented as a circle with a cross in figure ?). A context vector is recomputed after each emission of an output label. This computation takes into account a weighted sum of all the annotations computed by the encoder. This weighting depends on the current output target, and is the core of the attention mechanism: a good estimation of these weights allows the decoder to choose the parts of the input sequence to pay attention to. This context vector is used by the decoder in conjunction with the previously emitted output label and the current state of the hidden layer of an RNN to make a decision about the current output label. A more detailed description of recurrent neural networks and attention-based ones can be found in .
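The computation of annotations and of the context vector can be sketched as follows. This is a minimal illustration in plain Python, not the implementation used in the experiments: hidden states are given as lists of floats, and the alignment score `dot_score` is a placeholder assumption (attention-based translation models typically learn this score with a small feed-forward network).

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def annotations(forward_states, backward_states):
    """Concatenate forward and backward hidden states for each word."""
    return [f + b for f, b in zip(forward_states, backward_states)]

def context_vector(annots, decoder_state, score):
    """Weighted sum of annotations; weights depend on the current decoder state."""
    weights = softmax([score(decoder_state, h) for h in annots])
    dim = len(annots[0])
    context = [sum(w * h[k] for w, h in zip(weights, annots)) for k in range(dim)]
    return context, weights

def dot_score(decoder_state, annotation):
    # Placeholder alignment score (an assumption made for this sketch).
    return sum(s * a for s, a in zip(decoder_state, annotation))
```

The weights sum to one, so the context vector is a convex combination of the annotations; it has to be recomputed with the updated decoder state after each emitted label.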
Past experiments described in  have shown that the best semantic annotation performance on manual and automatic transcriptions of the MEDIA corpus was obtained with CRF systems. More recently in , this architecture has been compared to popular bidirectional RNNs (bi-RNN). The result was that CRF systems outperform a bi-RNN architecture on the MEDIA corpus, while better results were obtained by bi-RNNs on the ATIS  corpus. This is probably explained by the fact that MEDIA contains semantic contents whose mentions are more difficult to disambiguate, and CRFs make it possible to exploit complex contexts more effectively.
For the sake of comparison with the best SLU system proposed in , the Wapiti toolkit was used . Nevertheless, the set of input features used by the system proposed in this paper is different from the one used in . Among the novelties used in our system, we consider syntactic and ASR confidence features and our configuration template is different. After many experiments performed on DEV, our final feature template includes the previous and following instances for words and POS in a unigram or a bigram to associate a semantic label with the current word. Also associated with the current word are semantic categories of the two previous and two following instances. The other features are only considered at the current position.
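The window structure of such a template can be illustrated with a small feature extraction function. This is only one possible reading of the description above: it covers the observation window (words, POS tags, semantic categories), whereas the mention of unigrams and bigrams could also refer to Wapiti's unigram and bigram label patterns. The dictionary keys `word`, `pos` and `sem` and the padding symbol are hypothetical.

```python
def crf_features(tokens, i):
    """Observation features for position i.

    Each token is a dict with hypothetical keys "word", "pos" and "sem".
    """
    def get(j, key):
        return tokens[j][key] if 0 <= j < len(tokens) else "<pad>"

    feats = {}
    for key in ("word", "pos"):
        for off in (-1, 0, 1):  # unigrams around position i
            feats[f"{key}[{off}]"] = get(i + off, key)
        # bigrams linking the current item to its neighbours
        feats[f"{key}[-1,0]"] = get(i - 1, key) + "|" + get(i, key)
        feats[f"{key}[0,1]"] = get(i, key) + "|" + get(i + 1, key)
    for off in (-2, -1, 0, 1, 2):  # semantic categories, window of two
        feats[f"sem[{off}]"] = get(i + off, "sem")
    return feats
```

The remaining features (lemma, governor, morphological features, confidence measures) would be added at offset 0 only, following the description above.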
Furthermore, the tool discretize4CRF
5 Experimental setup and results
Experiments were carried out with the MEDIA corpus as in . For the sake of comparison, the results of their best understanding system are reported in this paper as the baseline. However, as the WER of the ASR system used in this paper (23.5%) is lower than the one used in the baseline, rigorous conclusions can be drawn only from comparisons between the different SLU components introduced in this paper.
5.1 The MEDIA corpus
The MEDIA corpus was collected in the French Media/Evalda project  and deals with negotiation of tourist services. It contains three sets of telephone human/computer dialogues, namely: a training set (TRAIN) with approximately 17.7k sentences, a development set (DEV) with 1.3k sentences and an evaluation set (TEST) containing 3.5k sentences. The corpus was manually annotated with semantic concepts characterized by a label and its value. Other types of semantic annotations (such as mode or specifiers) are not considered in this paper, to be consistent with the experimental results provided in . Annotations also associate a word sequence with each concept. These sequences have to be considered as estimations of localized concept mentions. Evaluations are performed with the DEV and TEST sets and report concept error rates (CER) for concept labels only and concept-value error rates (CVER) for concept-value pairs. It is worth mentioning that the number of concepts annotated in a turn has a large variability and may exceed 30. Some concept types, such as the three different types of REFERENCE and the CONNECTOR of application domain entities, are particularly difficult: their mentions are often short confusable sequences of words.
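As a sketch of how such error rates can be computed, the example below aligns reference and hypothesized concept sequences with an edit distance and divides the number of substitutions, deletions and insertions by the number of reference concepts. This is the usual definition and is assumed here rather than taken from the evaluation tool used in the experiments; function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, deletions and insertions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def error_rate(ref_turns, hyp_turns):
    """Total edit errors over all turns divided by the number of reference items."""
    errors = sum(edit_distance(r, h) for r, h in zip(ref_turns, hyp_turns))
    total = sum(len(r) for r in ref_turns)
    return errors / total
```

Passing sequences of concept labels gives a CER-style score, while passing sequences of (label, value) pairs gives a CVER-style score, so a correct label with a wrong value only counts as an error in the second case.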
5.2 LIUM ASR system dedicated to MEDIA
For these experiments, a variant of the ASR system developed by LIUM that won the last evaluation campaign on French language has been used . This system is based on the Kaldi speech recognition toolkit . The training set used to estimate the parameters of the DNN (Deep Neural Network) acoustic models consists of 145,781 speech segments from several sources: the radio broadcast ESTER  and ESTER2  corpora, which account for about 100 hours of speech each; the TV broadcast ETAPE corpus , accounting for about 30 hours of speech; the TV broadcast REPERE train corpus, accounting for about 35 hours of speech; and other LIUM radio and TV broadcast data for about 300 hours of speech. In total, the training corpus comprises 565 hours of speech. These recordings were converted to 8kHz before training the acoustic models in order to better match the MEDIA telephone data. As inputs, the DNNs are fed (for training and decoding) with MFCCs (Mel-Frequency Cepstral Coefficients) concatenated with i-vectors, in order to adapt the acoustic models to speakers.
The vocabulary of the ASR system contains all the words present in the MEDIA training and development corpora, i.e. about 2.5K words. A first bigram language model (LM) is applied during the decoding process to generate word lattices. These lattices are then rescored by applying a 3-gram language model. In order to get an SLU training corpus close to the test corpus, SLU models are trained with ASR transcriptions. To avoid dealing with errors made by an LM over-trained on the MEDIA training corpus, a leave-one-out approach was followed: all the dialogue files in the training and the development corpora were randomly split into 4 subsets. Each subset was transcribed by using an LM trained on the manual transcriptions present in the 3 other subsets and linearly interpolated with a ’generic’ language model trained on a large set of French newspapers crawled on the web, containing 77 million words. The test data was transcribed with an LM trained on the MEDIA training corpus and the same generic language model. As shown in table ?, word error rates for the training, development, and test corpora were around 23.5%.
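The data partitioning and the interpolation step described above can be sketched as follows. This is a simplified illustration under stated assumptions: files are identified by name, the random seed and the interpolation weight `lam` are hypothetical, and the interpolation is shown for a single probability rather than for a full n-gram model.

```python
import random

def split_into_blocks(dialogue_files, n_blocks=4, seed=0):
    """Randomly partition the dialogue files into n_blocks disjoint subsets."""
    files = sorted(dialogue_files)
    random.Random(seed).shuffle(files)
    return [files[k::n_blocks] for k in range(n_blocks)]

def leave_one_block_out(blocks):
    """Yield (held_out_block, lm_training_files) pairs."""
    for k, held_out in enumerate(blocks):
        training = [f for j, b in enumerate(blocks) if j != k for f in b]
        yield held_out, training

def interpolate(p_in_domain, p_generic, lam):
    """Linear interpolation of two language model probabilities."""
    return lam * p_in_domain + (1.0 - lam) * p_generic
```

Each held-out subset is thus transcribed with an LM that has never seen its manual transcriptions, which keeps the error profile of the SLU training data close to that of the test data.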
Tests were performed for both architectures with the MEDIA DEV set. The best configuration is chosen with respect to the best results observed on the DEV set and applied for obtaining the TEST results. These results in terms of error rate, precision and recall for concepts (C) and concept value (CV) are reported for the best configuration of each architecture in Table ?.
It appears that the CRF architecture significantly outperforms NN-EDA, which showed only minor improvements with respect to the baseline.
In order to evaluate the impact of using confidence measures among the input features, we ran the experiments summarized in Table ?. As can be seen, the confidence measure provided by the MS-MLP architecture brings relevant information that reduces both the CER and the CVER.
Other versions of the two systems were considered by adding two more output tags to the usual MEDIA concept labels. During training, these tags replace the usual ones when the hypothesized word is erroneous. If the erroneous hypothesized word supports a concept, it is associated with the ERROR-C tag, and with ERROR-N otherwise. During evaluation, hypothesized ERROR-C and ERROR-N tags are replaced by null (the tag indicating that the word does not convey any MEDIA information) in order to apply the usual MEDIA evaluation protocol. Results on TEST, obtained with the best configuration observed on DEV, are reported in Table ?.
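The relabeling used for training and the mapping applied before evaluation can be sketched as follows. The sketch assumes that reference tags have already been aligned with the ASR word hypotheses and that a per-word error flag is available; the function names are illustrative.

```python
NULL_TAG = "null"

def training_tags(ref_tags, word_is_error):
    """Replace the tag of each erroneous word by ERROR-C or ERROR-N."""
    out = []
    for tag, is_error in zip(ref_tags, word_is_error):
        if not is_error:
            out.append(tag)
        elif tag != NULL_TAG:
            out.append("ERROR-C")  # erroneous word that supports a concept
        else:
            out.append("ERROR-N")  # erroneous word outside any concept
    return out

def evaluation_tags(hyp_tags):
    """Map hypothesized error tags back to null for the standard protocol."""
    return [NULL_TAG if t in ("ERROR-C", "ERROR-N") else t for t in hyp_tags]
```

Because both error tags collapse to null at evaluation time, the scores remain directly comparable with those of systems trained without error-specific labels.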
Results in Table ? are similar to those in Table ?, with some small differences. For instance, precision is now better, although the CER is reduced only for NN-EDA and not for CRF. Since these four SLU systems can be executed in parallel, it is worth checking whether improvements can be obtained by combining them, with weights estimated by optimizing performance on the DEV set. The results are reported in Table ? and compared with the ROVER  combination applied to the six SLU systems described in .
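The exact combination scheme is not detailed here; one simple possibility is a word-level weighted vote, sketched below under the assumption that all systems tag the same ASR word sequence. The weights in the usage note are hypothetical values, standing in for weights tuned on DEV.

```python
from collections import defaultdict

def weighted_vote(system_tags, weights):
    """Pick, for each word, the tag with the largest total system weight."""
    combined = []
    for tags_at_word in zip(*system_tags):
        totals = defaultdict(float)
        for tag, w in zip(tags_at_word, weights):
            totals[tag] += w
        combined.append(max(totals, key=totals.get))
    return combined
```

For example, with four systems weighted 0.4, 0.3, 0.2 and 0.1, a tag proposed by the last three systems (total 0.6) wins over a different tag proposed by the first system alone (0.4).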
The results show absolute reductions of 0.6% for both CER and CVER with respect to the best CRF architecture, and of 4.5% and 2.8% respectively with respect to the baseline. Considering that the best results on manual transcriptions are above 10% on the TEST set, one may conclude that, with the solutions presented in this paper, the contribution of ASR errors to the overall SLU errors is smaller than the errors observed on manual transcriptions. A detailed analysis of the errors observed in the automatic and manual transcriptions shows large error contributions, common to both, for concepts such as the three different types of reference, connectors between domain relevant entities, and proper names that can be values of different attributes. These concepts are expressed by confusable words whose disambiguation requires complex context relations that cannot be automatically characterized (at least with the available amount of training data) by CRFs nor by the type of attention mechanisms used in NN-EDA.
Considering the case in which all four systems provided the same output (consensus) for each word, a precision of 0.96 with a recall of 0.72 was observed on the TEST set. Lack of consensus in the DEV and the TEST sets appears to correspond in most cases to mentions of only a few types of concepts. This is a very interesting result since it suggests that further investigation of these particular cases is an important challenge for future work.
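Consensus between systems can be turned into segments of agreement with a few lines of code. The sketch below is an assumption about how such segments might be delimited (maximal runs of words on which all systems output the same tag); the paper does not specify the procedure, and the function name is illustrative.

```python
def consensus_islands(system_tags):
    """Return (start, end) index pairs of maximal runs where all systems agree."""
    agree = [len(set(tags)) == 1 for tags in zip(*system_tags)]
    islands, start = [], None
    for i, ok in enumerate(agree):
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            islands.append((start, i))  # end index is exclusive
            start = None
    if start is not None:
        islands.append((start, len(agree)))
    return islands
```

Words outside these runs are the uncertain output segments on which a dialogue manager could focus its error handling actions.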
Two variations of two SLU architectures, respectively based on CRFs and NN-EDA, have been considered. Using the MEDIA corpus, they were compared with the CRF SLU system considered as the baseline, which provided the best results among seven different approaches as reported in . The main novelties of the proposed SLU architectures are the use, among others, of semantically relevant confidence and input features. The CRF architectures outperformed the NN-EDA architectures, with significant improvement over the baseline. Nevertheless, the NN-EDA architectures appeared to be useful when combined with the CRF ones. The results show that the interaction between the ASR and SLU components is beneficial. Furthermore, all the architectures show that most of the errors are for concepts whose mentions are made of short confusable sequences of words that remain ambiguous even if they can be localized. These concept types are difficult to detect, even on manual transcriptions, indicating that the interpretation of the MEDIA corpus is particularly difficult. Thus, future work should consider new structured attention mechanisms capable of selecting features of distant contexts in a conversation history. The objective is to identify a sufficient set of context features for disambiguating local concept mentions.
- Thanks to the ANR agency for funding through the CHIST-ERA ERA-Net JOKER under the contract number ANR-13-CHR2-0003-05.
- S. Hahn, M. Dinarelli, C. Raymond, F. Lefevre, P. Lehnen, R. De Mori, A. Moschitti, H. Ney, and G. Riccardi, “Comparing stochastic approaches to spoken language understanding in multiple languages,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 6, pp. 1569–1583, 2011.
- L. Mangu, E. Brill, and A. Stolcke, “Finding consensus in speech recognition: word error minimization and other applications of confusion networks,” Computer Speech & Language, vol. 14, no. 4, pp. 373–400, 2000.
- D. Yu, J. Li, and L. Deng, “Calibration of confidence measures in speech recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 8, pp. 2461–2473, 2011.
- A. Ogawa and T. Hori, “Asr error detection and recognition rate estimation using deep bidirectional recurrent neural networks,” in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 4370–4374.
- G. Tur and R. De Mori, Spoken language understanding: Systems for extracting semantic information from speech. John Wiley & Sons, 2011.
- G. Mesnil, Y. Dauphin, K. Yao, Y. Bengio, L. Deng, D. Hakkani-Tur, X. He, L. Heck, G. Tur, D. Yu et al., “Using recurrent neural networks for slot filling in spoken language understanding,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 23, no. 3, pp. 530–539, 2015.
- D. Hakkani-Tür, G. Tur, A. Celikyilmaz, Y.-N. Chen, J. Gao, L. Deng, and Y.-Y. Wang, “Multi-domain joint semantic frame parsing using bi-directional rnn-lstm,” in Proceedings of The 17th Annual Meeting of the International Speech Communication Association, 2016.
- S. Reddy, O. Täckström, M. Collins, T. Kwiatkowski, D. Das, M. Steedman, and M. Lapata, “Transforming dependency structures to logical forms for semantic parsing,” Transactions of the Association for Computational Linguistics, vol. 4, pp. 127–140, 2016.
- M. Ma, L. Huang, B. Xiang, and B. Zhou, “Dependency-based convolutional neural networks for sentence embedding,” in The 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing. Association for Computational Linguistics, 2015.
- Y.-N. Chen, D. Hakkani-Tür, G. Tur, A. Celikyilmaz, J. Guo, and L. Deng, “Syntax or semantics? knowledge-guided joint semantic frame parsing,” in IEEE Workshop on Spoken Language Technology (SLT 2016), San Diego, USA, 2016.
- E. Simonnet, N. Camelin, P. Deleglise, and Y. Esteve, “Exploring the use of attention-based recurrent neural networks for spoken language understanding,” in NIPS, 2015.
- K. Cho, B. van Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder-decoder for statistical machine translation,” in EMNLP, 2014.
- S. Ghannay, Y. Esteve, and N. Camelin, “Word embeddings combination and neural networks for robustness in asr error detection,” in Signal Processing Conference (EUSIPCO), 2015 23rd European. Nice, France: IEEE, 2015, pp. 1671–1675.
- S. Ghannay, Y. Esteve, N. Camelin et al., “Acoustic word embeddings for asr error detection,” Interspeech 2016, pp. 1330–1334, 2016.
- S. Ghannay, Y. Estève, N. Camelin, C. Dutrey, F. Santiago, and M. Adda-Decker, “Combining continuous word representation and prosodic features for asr error prediction,” in International Conference on Statistical Language and Speech Processing. Springer, 2015, pp. 84–95.
- S. Ghannay, B. Favre, Y. Esteve, and N. Camelin, “Word embedding evaluation and combination,” in Proceedings of the Language Resources and Evaluation Conference (LREC 2016), Portoroz (Slovenia), 2016, pp. 23–28.
- J. Lafferty, A. McCallum, F. Pereira et al., “Conditional random fields: Probabilistic models for segmenting and labeling sequence data,” in Proceedings of the eighteenth international conference on machine learning, ICML, vol. 1, 2001, pp. 282–289.
- A. Nasr, F. Béchet, and J.-F. Rey, “Macaon : Une chaîne linguistique pour le traitement de graphes de mots,” in Traitement Automatique des Langues Naturelles - session de démonstrations, Montréal, 2010.
- V. Vukotic, C. Raymond, and G. Gravier, “Is it time to switch to word embedding and recurrent neural networks for spoken language understanding?” in InterSpeech, 2015.
- C. T. Hemphill, J. J. Godfrey, G. R. Doddington et al., “The atis spoken language systems pilot corpus,” in Proceedings of the DARPA speech and natural language workshop, 1990, pp. 96–101.
- T. Lavergne, O. Cappé, and F. Yvon, “Practical very large scale CRFs,” in Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL). Association for Computational Linguistics, July 2010, pp. 504–513. [Online]. Available: http://www.aclweb.org/anthology/P10-1052
- H. Bonneau-Maynard, S. Rosset, C. Ayache, A. Kuhn, and D. Mostefa, “Semantic annotation of the french media dialog corpus,” in Ninth European Conference on Speech Communication and Technology, 2005.
- A. Rousseau, G. Boulianne, P. Deléglise, Y. Estève, V. Gupta, and S. Meignier, “LIUM and CRIM ASR system combination for the REPERE evaluation campaign,” in International Conference on Text, Speech, and Dialogue. Springer, 2014, pp. 441–448.
- D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., “The kaldi speech recognition toolkit,” in IEEE 2011 workshop on automatic speech recognition and understanding, no. EPFL-CONF-192584. IEEE Signal Processing Society, 2011.
- S. Galliano, E. Geoffrois, G. Gravier, J. f. Bonastre, D. Mostefa, and K. Choukri, “Corpus description of the Ester evaluation campaign for the rich transcription of French broadcast news,” in 5th international Conference on Language Resources and Evaluation (LREC), 2006, pp. 315–320.
- S. Galliano, G. Gravier, and L. Chaubard, “The Ester 2 evaluation campaign for the rich transcription of french radio broadcasts,” in Interspeech, 2009.
- G. Gravier, G. Adda, N. Paulsson, M. Carré, A. Giraudel, and O. Galibert, “The ETAPE corpus for the evaluation of speech-based TV content processing in the French language,” in Eighth International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 2012, pp. 114–118.
- J. G. Fiscus, “A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (rover),” in Automatic Speech Recognition and Understanding, 1997. Proceedings., 1997 IEEE Workshop on. IEEE, 1997, pp. 347–354.