Adapting general-purpose speech recognition engine output for domain-specific natural language question answering

C. Anantaram, TCS Innovation Labs - Delhi, ASF Insignia, Gwal Pahari, Gurgaon, India. Email: c.anantaram@tcs.com
Sunil Kumar Kopparapu, TCS Innovation Labs - Mumbai, Yantra Park, Thane (West)

Speech-based natural language question-answering interfaces to enterprise systems are gaining a lot of attention. General-purpose speech engines can be integrated with NLP systems to provide such interfaces. Usually, general-purpose speech engines are trained on a large 'general' corpus. However, when such engines are used for specific domains, they may not recognize domain-specific words well, and may produce erroneous output. Further, the accent of the speaker and the environmental conditions in which a sentence is spoken may cause the speech engine to recognize certain words inaccurately. The subsequent natural language question-answering then does not produce the requisite results, as the question does not accurately represent what the speaker intended. Thus, the speech engine's output may need to be adapted for a domain before further natural language processing is carried out. We present two mechanisms for such an adaptation, one based on evolutionary development and the other based on machine learning, and show how we can repair the speech output to make the subsequent natural language question-answering better.

1 Introduction

Speech-enabled natural-language question-answering interfaces to enterprise application systems, such as Incident-logging systems, Customer-support systems, Marketing-opportunities systems, Sales data systems etc., are designed to allow end-users to speak out the problems/questions that they encounter and get automatic responses. The process of converting human spoken speech into text is performed by an Automatic Speech Recognition (ASR) engine. While functional examples of ASR with enterprise systems can be seen in day-to-day use, most of these work under constraints of a limited domain, and/or use additional domain-specific cues to enhance the speech-to-text conversion process. Prior speech-and-natural language interfaces for such purposes have either been restricted to Interactive Voice Response (IVR) technology, or have focused on building a very specialized speech engine with domain-specific terminology that recognizes key-words in that domain through an extensively customized language model and triggers specific tasks in the enterprise application system. This makes the interface extremely specialized, rather cumbersome and non-adaptable for other domains. Further, every time a new enterprise application requires a speech and natural language interface, one has to redevelop the entire interface.

An alternative to domain-specific speech recognition engines has been to re-purpose general-purpose speech recognition engines, such as Google Speech API and IBM Watson Speech to Text API, which can be used across domains with natural language question answering systems. Such general-purpose automatic speech engines (gp-ASR) are trained on a very large general corpus using deep neural network (DNN) techniques. The deep learnt acoustic and language models enhance the performance of an ASR. However, this comes with its own limitations. For freely spoken natural language sentences, the typical recognition accuracy achievable even for state-of-the-art speech recognition systems has been observed to be about % to % in real-world environments (Lee et al., 2010). The recognition is worse if we consider factors such as domain-specific words, environmental noise, variations in accent, poor ability to express on the part of the user, or inadequate speech and language resources from the domain to train such speech recognition systems. The subsequent natural language processing of such erroneously and partially recognized text, such as that in a question answering system, becomes rather problematic, as the domain terms may be inaccurately recognized or linguistic errors may creep into the sentence. It is, hence, important to improve the accuracy of the ASR output text.

In this paper, we focus on the issues of using a readily available gp-ASR and adapting its output for domain-specific natural language question answering (Anantaram et al., 2015a). We present two mechanisms for adaptation, namely

  1. an evolutionary development based artificial development mechanism of adaptation (Evo-Devo), where we consider the output of ASR as a biological entity that needs to adapt itself to the environment (in this case the enterprise domain) through a mechanism of repair and development of its genes and

  2. a machine learning based mechanism where we examine the closest set of matches with trained examples and the number of adaptive transformations that the ASR output needs to undergo in order to be categorized as an acceptable natural language input for question-answering.

We present the results of these two adaptation mechanisms and gauge the usefulness of each. The rest of the paper is organized as follows. In Section 2 we briefly describe the work done in this area which motivates our contribution. The main contribution of our work is captured in Section 3, and we show the performance of our approach through experiments in Section 4. We conclude in Section 5.

2 Related Work

Most work on ASR error detection and correction has focused on using confidence measures, generally called the log-likelihood score, provided by the speech recognition engine; the text with lower confidence is assumed to be incorrect and subjected to correction. Such confidence based methods are useful only when we have access to the internals of a speech recognition engine built for a specific domain. As mentioned earlier, use of a domain-specific engine requires one to rebuild the interface every time the domain is updated or a new domain is introduced. Our focus is to avoid rebuilding the interface each time the domain changes by using an existing ASR. As such, our method is specifically a post-ASR system. A post-ASR system provides greater flexibility in terms of absorbing domain variations and adapting the output of ASR in ways that are not possible while training a domain-specific ASR system (Ringger and Allen, 1996).

Note that an erroneous ASR output text will lead to an equally (or more) erroneous interpretation by the natural language question-answering system, resulting in poor performance of the overall QA system.

Machine learning classifiers have been used in the past for the purpose of combining features to calculate a confidence score for error detection. The use of non-linguistic and syntactic knowledge for detection of errors in ASR output, with a support vector machine to combine non-linguistic features, was proposed in (Shi, 2008), and a Naive Bayes classifier to combine confidence scores at a word and utterance level, along with differential scores of the alternative hypotheses, was used in (Zhou et al., 2005). Both (Shi, 2008) and (Zhou et al., 2005) rely on the availability of confidence scores output by the ASR engine. A syllable-based noisy channel model combined with higher level semantic knowledge for post recognition error correction, independent of the internal confidence measures of the ASR engine, is described in (Jeong et al., 2004). In (López-Cózar and Callejas, 2008) the authors propose a method to correct errors in spoken dialogue systems. They consider several contexts to correct the speech recognition output, including learning a threshold during training to decide when the correction must be carried out in the context of a dialogue system. They, however, use the confidence scores associated with the output text to decide whether or not to do the correction. The correction is carried out using syntactic-semantic and lexical models to decide whether a recognition result is correct.

In (Bassil and Semaan, 2012) the authors propose a method to detect and correct errors in ASR output based on the Microsoft N-Gram dataset. They use a context-sensitive error correction algorithm for selecting the best candidate for correction using the Microsoft N-Gram dataset, which contains real-world data and word sequences extracted from the web and can mimic a comprehensive dictionary of words having a large and all-inclusive vocabulary.

In (Jun and Lei, 2011) the authors assume the availability of pronunciation primitive characters as the output of the ASR engine and then use domain-specific named entities to establish the context, leading to the correction of the speech recognition output. The patent (Amento et al., 2007) proposes a manual correction of the ASR output transcripts by providing visual display suggesting the correctness of the text output by ASR. Similarly, (Harwath et al., 2014) propose a re-ranking and classification strategy based on logistic regression model to estimate the probability for choosing word alternates to display to the user in their framework of a tap-to-correct interface.

Our proposed machine learning based system is along the lines of (Jeong et al., 2004) but with differences: (a) while they use a single feature (syllable count) for training, we propose the use of multiple features for training the Naive Bayes classifier, and (b) we do not perform any manual alignment between the ASR and reference text; this is done using an edit distance based technique for sentence alignment. Except for (Jeong et al., 2004), all reported work in this area makes use of features from the internals of the ASR engine for ASR text output error detection.

We assume the use of a gp-ASR in the rest of the paper. Though we use examples of natural language sentences in the form of queries or questions, it should be noted that the description is applicable to any conversational natural language sentence.

3 Domain adaptation of ASR output

3.1 Errors in ASR output

In this paper we focus on question answering interfaces to enterprise systems, though our discussion is valid for any kind of natural language sentence, not necessarily a query. For example, suppose we have a retail-sales management system domain; then end-users would be able to query the system through spoken natural language questions ($S$) such as

$S$ = (spoken) "which industry has the peak sales in nineteen ninety seven?"

A perfect ASR would take $S$ as the input and produce the text ($T$), namely,

$T$ = "which industry has the peak sales in nineteen ninety seven?"

We consider the situation where an ASR takes such a sentence ($S$) spoken by a person as input, and outputs an inaccurately recognized text sentence ($T'$). In our experiments, when the above question was spoken by a person and processed by a popular ASR engine such as Google Speech API, the output text sentence ($T'$) was

$T'$ = "which industry has the pixel in nineteen ninety seven?"

It should be noted that an inaccurate output by the ASR engine may be the result of various factors such as background noise, the accent of the person speaking the sentence, the speed at which he or she is speaking the sentence, domain-specific words that are not part of popular vocabulary etc. The subsequent natural language question answering system cannot answer the above output sentence from its retail sales data. Thus the question we tackle here is: how do we adapt or repair the sentence ($T'$) back to the original sentence ($T$) as intended by the speaker? Namely,

$$T' \rightarrow T$$

We present two mechanisms for adaptation or repair of the ASR output, namely $T' \rightarrow T$, in this paper: (a) an evolutionary development based artificial development mechanism, and (b) a machine-learning mechanism.

3.2 Evo-Devo based Artificial Development mechanism of adaptation

Our mechanism is motivated by Evolutionary Development (Evo-Devo) processes in biology (Harding and Banzhaf, 2008; Anantaram et al., 2015b; Tufte, 2008) to help adapt/repair the overall content accuracy of an ASR output ($T'$) for a domain. We give a very brief overview of the Evo-Devo process in biological organisms and discuss how this motivates our mechanism. In a biological organism, evo-devo processes are activated when a new biological cell needs to be formed or an injured cell needs to be repaired/replaced. During such cell formation or repair, the basic genetic structure consisting of the genes of the organism is replicated into the cell; the resultant set of 'genes in the cell' is called the genotype of the cell. Once this is done, the genotype of the cell is specialized through various developmental processes to form the appropriate cell for the specific purpose that the cell is intended for, in order to factor in the traits of the organism; the result is called the phenotype of the cell. For example, if a person has blue eyes then a blue-eye cell is produced, or if a person has brown skin then a brown-skin cell is produced. During this process, environmental influence may also play a role in the cell's development, and such influences are factored into the genotype-phenotype development process. The field of Evo-Devo has influenced the field of Artificial Intelligence (AI), and a new sub-field called Artificial Development (Art-Dev) has been created that tries to apply Evo-Devo principles to find elegant solutions to adaptation and repair problems in AI.

We take inspiration from the Evo-Devo biological process and suitably tailor it to our research problem of repairing the ASR output ($T'$). In our approach we consider the erroneous ASR output text as the input for our method and treat it as an 'injured biological cell'. We repair that 'injured cell' through the development of the partial gene present in the input sentence with respect to the genes present in the domain. We assume that we have been provided with the domain ontology describing the terms and relationships of the domain. In our framework, we consider the domain ontology as the true 'genetic knowledge' of that 'biological organism'. In such a scenario, the 'genetic repair' becomes a sequence of match-and-replace of words in the sentence with appropriate domain ontology terms and relationships. Once this is done, the 'genotype-to-phenotype repair' is the repair of linguistic errors in the sentence after the 'genetic repair'. The following sub-section describes our method in detail.

3.2.1 Repair method

We assume that all the instances of the objects in the domain are stored in a database associated with the enterprise system, and can be expressed in relational form (such as [a R c]), for example ['INDUSTRY', 'has', 'PEAK SALES']. A relational database will store these as a set of tables, and we treat the data in the database as static facts of the domain. The ontology of the domain can then be generated from this database. We assume that the data schema and the actual data in the enterprise application form a part of the domain terms and their relationships in the ontology. This identifies the main concepts of the domain with a <subject-predicate-object> structure for each of the concepts. The ontology thus generated describes the relations between domain terms, for example ['SALES', 'has_code', 'NAICS_CODE'] or ['OPTICAL_GOODS', 'sales_2009', '8767_million'], and thus can be expressed using OWL schema as an <s-p-o> structure. Each <s-p-o> entry forms a gene of the domain.
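As a rough illustration of how database rows could be turned into <s-p-o> genes, the following minimal sketch assumes a hypothetical table of (industry, year, sales) rows. The function name `rows_to_triples` and the row layout are illustrative assumptions, not part of the described system.

```python
# Hypothetical relational rows: (industry, year, sales in millions).
rows = [
    ("OPTICAL_GOODS", 2009, 8767),
    ("OPTICAL_GOODS", 2011, 10056),
]

def rows_to_triples(rows):
    """Turn each (subject, year, value) row into one <s-p-o> triple."""
    return [(s, f"sales_{year}", f"{value}_million") for s, year, value in rows]

triples = rows_to_triples(rows)

# The distinct subjects, predicates and objects form the domain vocabulary
# that later matching steps can compare ASR-output words against.
domain_terms = sorted({part for triple in triples for part in triple})
```

Keeping the triples as plain tuples makes it easy to look up the full gene once one of its terms has matched a word in the sentence.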

We start by finding matches between domain ontology terms and words that appear in the input sentence. Some words of the input sentence will match domain ontology terms exactly. The corresponding domain ontology entry, consisting of a subject-predicate-object triple, is put into a candidate set. Next, other words in the input sentence that are not exact matches of domain ontology terms but have a 'closeness' match with terms in the ontology are considered. This 'closeness' match is performed through a mix of phonetic match combined with Levenshtein distance match. The terms that match help identify the corresponding domain ontology entry (with its subject-predicate-object triple), which is also added to the candidate set. This set of candidate genes is a shortlist of the 'genes' of the domain that are probably referred to in the input sentence.

Next, our mechanism evaluates the ‘fittest’ domain ontology entry from the candidate set to replace the partial gene in the sentence. A fitness function is defined and evaluated for all the candidate genes short-listed. This is done for all words / phrases that appear in the input sentence except the noise words. The fittest genes replace the injured genes of the input sentence. The set of all genes in the sentence forms the genotypes of the sentence. This is the first-stage of adaptation.

Once the genotypes are identified, we grow them into phenotypes to remove the grammatical and linguistic errors in the sentence. To do this, we find parts of the sentence that is output by the first-stage of adaptation (the gene-level repair) and that violate well-known grammatical/ linguistic rules. The parts that violate are repaired through linguistic rules. This is the second stage of adaptation/ repair. This process of artificial rejuvenation improves the accuracy of the sentence, which can then be processed by a natural language question answering system (Bhat et al., 2007). Thus, this bio-inspired novel procedure helps adapt/repair the erroneously recognized text output by a speech recognition engine, in order to make the output text suitable for deeper natural language processing. The detailed steps are described below.

The fitness function takes as input the ASR-output word, the candidate gene, the Levenshtein distance weight, the phonetic algorithm weight, and a threshold. It then measures the closeness of the match between the ASR-output word and the candidate gene. To do that, the function calculates two scores: a phonetic score, which is an aggregated score of the similarity of the candidate gene with the ASR-output word under various phonetic algorithms; and an edit score, which is the Levenshtein distance between the ASR-output word and the candidate gene. The fitness function then calculates the final fitness of the candidate gene from these two scores and their weights using formula (1). If the final fitness is greater than the given threshold, the ASR-output word is replaced by the candidate gene; otherwise the ASR-output word is kept as it is.

Figure 1: Fitness Function.
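The fitness evaluation can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the weighted sum used for the final score, the weight and threshold values, and the crude consonant-skeleton `phonetic_key` (a stand-in for Soundex, Metaphone and Double Metaphone) are all hypothetical choices, and the edit score is normalised into a similarity so that higher is better.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def similarity(a, b):
    """Edit distance normalised to [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def phonetic_key(text):
    """Crude stand-in for a phonetic code: keep the first letter, then
    drop vowels and h/w/y, and collapse repeated consonants."""
    s = "".join(ch for ch in text.lower() if ch.isalpha()).replace("x", "ks")
    key = s[:1]
    for ch in s[1:]:
        if ch in "aeiouyhw":
            continue
        if ch != key[-1]:
            key += ch
    return key

def fitness(asr_word, gene, w_phonetic=0.6, w_edit=0.4):
    """Assumed form: weighted sum of phonetic and edit similarity."""
    phonetic_score = similarity(phonetic_key(asr_word), phonetic_key(gene))
    edit_score = similarity(asr_word.lower(), gene.lower())
    return w_phonetic * phonetic_score + w_edit * edit_score

def repair_word(asr_word, candidate_genes, threshold=0.5):
    """Replace the ASR word by the fittest gene if it clears the threshold."""
    best = max(candidate_genes, key=lambda g: fitness(asr_word, g))
    return best if fitness(asr_word, best) > threshold else asr_word
```

With these assumed settings, `repair_word("pixel", ["peak sales", "car dealers"])` selects "peak sales", while a word such as "industry" that resembles neither candidate is left unchanged.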

Step 1: Genes Identification: We match the sub-parts (or sub-strings) of the ASR-output sentence with the genes of the domain. The match may be partial due to the error present in the sentence. The genes in the domain that match the closest, evaluated by a phonetic and/or syntactic match between the ontology entity and the selected sub-part, are picked up and form the candidates set for the input sentence. For example, let the actual sentence that is spoken by an end-user be "which industry has the peak sales in nineteen ninety seven?". In one of our experiments, when Google Speech API was used as the ASR engine for the above sentence spoken by a user, then the speech engine’s output sentence was "which industry has the pixel in nineteen ninety seven?". This ASR output is erroneous (probably due to background noise or the accent of the speaker) and needs repair/ adaptation for the domain.

As a first step, the ASR output sentence is parsed and the Nouns and Verbs are identified from part-of-speech (POS) tags. Syntactic parsing also helps get <subject-verb-object> relations to help identify a potential set of <s-p-o> genes from the ontology. For each of the Nouns and Verbs and other syntactic relations, the partially matching genes with respect to the domain ontology are identified; for this particular sentence the partially matching genes are "industry" and "pixel". This leads us to identify the probable set of genes in the domain ontology that are most likely a possible match: ['INDUSTRY', 'has', 'PEAK SALES']. The set of all such probable genes needs to be evaluated and developed further.

Step 2: Developing the genes to identify the genotypes: Once the basic candidate genes are identified, we evaluate the genes to find the best fit for the situation on hand with evolution and development of the genes, and then test a fitness function (see Fig. 1) and select the most probable gene that survives. This gives us the set of genotypes that will form the correct ASR sentence. For example, the basic genes "INDUSTRY" and "PIXEL" are used to match the substring "industry has the pixel" with the gene ['INDUSTRY', 'has_field', 'PEAK SALES']. This is done through a matching and fitness function that identifies the most appropriate gene of the domain. We use a phonetic match function like Soundex, Metaphone or Double Metaphone (Naumann, 2015) to match "pixel" with "PEAK SALES", or an edit-distance match function like Levenshtein distance (Naumann, 2015) to find the closeness of the match. In a large domain there may be many such probable candidates. In such a case, a fitness function is used to decide which of the matches are most suitable. The genes identified are now collated together to repair the input sentence. This is done by replacing parts of the input sentence by the genes identified in the previous step. In the above example the ASR sentence "Which industry has the pixel in nineteen ninety seven?" would be adapted/repaired to "Which industry has the peak sales in nineteen ninety seven?".

Step 3: Developing Genotypes to Phenotype of sentence: The repaired sentence may need further linguistic adaptation/repair to remove the remaining errors in the sentence. To achieve this, the repaired ASR sentence is re-parsed and the POS tags are evaluated to find any linguistic inconsistencies, and the inconsistencies are then removed. For example, we may notice that there is a WP tag in a sentence that refers to a Wh-Pronoun, but a WDT tag is missing in the sentence that should provide the Determiner for the Wh-pronoun. Using such clues we can look for phonetically matching words in the sentence that could possibly match with a Determiner and repair the sentence. Linguistic repairs such as these form the genotype to phenotype repair/adaptation of the sentence. The repaired sentence can then be processed for question-answering.

We use open source tools like LanguageTool to correct grammatical errors. As we understand, the LanguageTool has grammar rules, style rules and built-in Python rules for grammar check and correction. In addition, we have added some domain-specific grammar rules to our linguistic repair function; these rules can be extended or modified for any domain.

3.2.2 Algorithm of the Evo-Devo process

The algorithm has two main functions: ONTOLOGY_BASED_REPAIR (which encodes Steps 1 and 2 described above) and LINGUISTIC_REPAIR (which encodes Step 3 above). The input sentence is POS-tagged and the nouns and verbs are considered. A sliding window allows the algorithm to consider single words or multiple words in a domain term.

Let $W = \{w_1, w_2, \ldots, w_n\}$ be the set of words in the ASR-output (asr_out). Let $D = \{dt_1, dt_2, \ldots, dt_m\}$ be the domain-ontology-terms. These terms may be considered as candidate genes that can possibly replace the ASR output (asr_out) words that may be erroneously recognized. A sliding window of length $l$ consisting of words $\{w_i, \ldots, w_{i+l-1}\}$ is considered for matching with domain-ontology-terms. The length $l$ may vary from $1$ to $p$, where $p$ may be decided based on the environmental information. For example, if the domain under consideration has financial terms then $p$ may be five words, while for a domain pertaining to car parts, $p$ may be two words. The part_match functionality described below evaluates a cost function, say $C(\{w_i, \ldots, w_{i+l-1}\}, dt_k)$, such that minimizing $C$ would result in $dt^*$, which may be a possible candidate to replace $\{w_i, \ldots, w_{i+l-1}\}$, namely,

$$dt^* = \arg\min_{dt_k \in D} C(\{w_i, \ldots, w_{i+l-1}\}, dt_k)$$

The cost function is a weighted combination of match scores between the candidate domain term and each element in the window of ASR-output words. If the value of the cost function is greater than the pre-determined threshold, then the window of words may be replaced with the candidate domain term; otherwise the window is maintained as it is. The broad algorithm of the Evolutionary Development mechanism is shown in Algorithm 1.

Input: ASR output sentence, sentence; domain_ontology
Output: Repaired sentence, repaired_sentence

// parse the input sentence
parsed_sentence ← POS_tag(sentence)
// repair process starts - do genetic repair and find the genotypes
part_repaired_sentence ← ontology_based_repair(parsed_sentence)
// grow the genotypes into phenotypes
repaired_sentence ← linguistic_repair(part_repaired_sentence)

function ontology_based_repair(parsed_sentence)
     nouns_verbs ← find(parsed_sentence, noun_verb_POStoken)
     for each noun_verb_entry in nouns_verbs do
          // find partially matching genes: match nouns and verbs with entries in domain ontology with phonetic algorithms and Levenshtein distance match
          concepts_referred ← part_match(noun_verb_entry, domain_ontology)
          // find genes: get the subject-predicate-object for concepts
          candidate_genes ← add(spo_entry, concepts_referred)
          // simulate the development process of the genes - find the fittest gene from candidate genes
          fit_gene ← fittest(candidate_genes, POS_token)
          // add fittest gene into set of genotypes
          genotypes ← add(fit_gene)
     end for
     // replace partially identified genes in input with genotypes identified
     repaired_sentence ← substitute(parsed_sentence, nouns_verbs, genotypes)
     return repaired_sentence
end function

function linguistic_repair(part_repaired_sentence)
     other_POStags ← find(part_repaired_sentence, remaining_POStokens)
     // find POS tags without linguistic completion
     ling_err ← linguistic_check(other_POStags, part_repaired_sentence)
     // find candidate words for linguistic error
     candidate_words ← add(part_repaired_sentence, ling_err)
     // find the closest semantic match for error words
     fit_word ← fittest_word(candidate_words, ling_err)
     // add fittest word into repaired sentence
     fit_words ← add(candidate_word, fit_word)
     // create the repaired sentence
     repaired_sentence ← replace(part_repaired_sentence, fit_words, other_POStags)
     return repaired_sentence
end function
Algorithm 1 Evo-Devo Mechanism
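The sliding-window part of the ontology-based repair can be sketched in Python as below. This is only an illustration under assumptions: `difflib.SequenceMatcher.ratio` is swapped in as the closeness measure in place of the phonetic and Levenshtein match, every word (not only nouns and verbs) is considered, and the window size and threshold values are hypothetical.

```python
from difflib import SequenceMatcher

def closeness(a, b):
    """Stand-in similarity in [0, 1]; not the phonetic/Levenshtein mix."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def ontology_based_repair(words, domain_terms, max_window=2, threshold=0.7):
    """Slide a window of 1..max_window words over the sentence and replace
    a window by the closest domain term when the match clears the threshold."""
    repaired, i = [], 0
    while i < len(words):
        best = None  # (score, window_length, term)
        for length in range(max_window, 0, -1):
            if i + length > len(words):
                continue
            phrase = " ".join(words[i:i + length])
            for term in domain_terms:
                score = closeness(phrase, term)
                if score >= threshold and (best is None or score > best[0]):
                    best = (score, length, term)
        if best:
            repaired.append(best[2])   # substitute the fittest domain term
            i += best[1]               # skip the words the window covered
        else:
            repaired.append(words[i])  # keep the word as it is
            i += 1
    return " ".join(repaired)

asr_out = "which business has more sales in 2013 car dealers for optical quotes"
result = ontology_based_repair(asr_out.split(), ["car dealers", "optical goods"])
```

Here the window "optical quotes" is close enough to "optical goods" to be replaced, while "for" is left for the later linguistic repair stage.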

3.2.3 Detailed example of our method

Let us assume that we have the domain of retail sales data described in an ontology of <subject-predicate-object> structure as shown in Table 1.

Subject          Predicate     Object
CAR_DEALERS      SALES_2013    737640_million
CAR_DEALERS      SALES_2011    610747_million
CAR_DEALERS      SALES_2009    486896_million
OPTICAL_GOODS    SALES_2013    10364_million
OPTICAL_GOODS    SALES_2011    10056_million
OPTICAL_GOODS    SALES_2009    8767_million
Table 1: Ontology Structure.

Now, let us consider that a user speaks the following sentence to Google Now speech engine: "Which business has more sales in 2013: Car dealers or optical goods?". In our experiment the Google Now speech engine produced the ASR output sentence as "which business has more sales in 2013 car dealers for optical quotes". The recognized ASR sentence has errors. In order to make this ASR sentence more accurate, we input this sentence into the Evo-Devo mechanism and run the process:

  • Genes Identification (Step 1): We parse the ASR sentence and identify the parts-of-speech in it as: which/WDT, business/NN, has/VBZ, more/JJR, sales/NNS, in/IN, 2013/CD, car/NN, dealers/NNS, for/IN, optical/JJ, quotes/NNS.

    Considering the words that have POS tags of Nouns (NN/NNS etc.) in the example sentence we get the words "business", "sales", "car", "dealers", "quotes". Based on these words we extract all the partially matching subject-predicate-object instances of the domain ontology. For example, we obtain instances such as [OPTICAL_GOODS SALES_2013 10364_million], [INDUSTRY BUSINESS OPTICAL_GOODS] and [INDUSTRY BUSINESS CAR_DEALERS], etc. from the domain ontology that are partially matching with the words "business" and "sales" respectively. POS tag 2013/CD also leads to reinforcing the above <s-p-o> instance.

  • Developing the genes to identify the genotypes (Step 2): We replace the erroneous words in the sentence by using a fitness function. The fitness function is defined using string similarity metric (Levenshtein distance) and an aggregated score of phonetic algorithms such as Soundex, Metaphone and Double Metaphone as described in Fitness function in the section above. Thus we get the following adaptation: which business has more sales in 2013 car dealers for optical goods?

  • Developing Genotypes to Phenotype (Step 3): We now find the parts-of-speech of the repaired sentence after the step 2 as: which/WDT, business/NN, has/VBZ, more/JJR, sales/NNS, in/IN, 2013/CD, car/NN, dealers/NNS, for/IN, optical/JJ, goods/NNS.

    In the linguistic repair step, we find that since there is no direct ontological relationship between "car dealers" and "optical goods", we cannot have the preposition for between these domain terms. Thus we have to find a linguistic relation that is permissible between these domain terms. One of the options is to consider linguistic relations like ‘conjunction’, ‘disjunction’ between domain terms. Thus, when we evaluate linguistic relations AND or OR between these domain terms, we find that OR matches closely with for through a phonetic match rather than AND. Thus we replace for with or in the sentence. Hence the final output of the Evo-Devo mechanism is "which business has more sales in 2013 car dealers or optical goods?". This sentence can now be processed by a question-answering (QA) system. In the above example, a QA system (Bhat et al., 2007) would parse the sentence, identify the known ontological terms {business, sales, 2013, car dealers, optical goods}, find the unknown predicates {which business, more sales}, form the appropriate query over the ontology, and return the answer "CAR_DEALERS".
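The last two steps of this example can be illustrated with a small Python sketch. It assumes that plain Levenshtein distance is an acceptable stand-in for the phonetic match between "for" and the connectives, and that the QA step reduces to a comparison over the 2013 triples of Table 1; the helper names `closest_connective` and `which_has_more` are hypothetical.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def closest_connective(word, options=("and", "or")):
    """Pick the permitted linguistic relation closest to the suspect word."""
    return min(options, key=lambda c: levenshtein(word, c))

# Two rows of Table 1, with the object stored as a number (in millions).
triples = [
    ("CAR_DEALERS", "SALES_2013", 737640),
    ("OPTICAL_GOODS", "SALES_2013", 10364),
]

def which_has_more(triples, predicate, subjects):
    """Return the subject with the largest value for the given predicate."""
    values = {s: v for s, p, v in triples if p == predicate and s in subjects}
    return max(values, key=values.get)

connective = closest_connective("for")
answer = which_has_more(triples, "SALES_2013", {"CAR_DEALERS", "OPTICAL_GOODS"})
```

Under these assumptions the connective chosen for "for" is "or", and the query over the two industries returns "CAR_DEALERS", matching the outcome described above.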

3.2.4 Limitations of the method

We assume that there is a well-structured domain ontology for the domain and it is available in the form of <s-p-o> triples. We also assume that the speaker speaks mostly grammatically correct sentences using terms in the domain. While the method would work for grammatically incorrect sentences, the linguistic repair step would suffer.

We assume that the speech is processed by a gp-ASR and the ASR-output forms the input sentence that needs repair. However, it is important to note that the input sentence (i.e. the ASR output) need not necessarily contain <s-p-o> triples for our method to work. The <s-p-o> triples that are short-listed from the domain ontology aid in forming a candidate set of 'possible genes' to consider, and the fittest amongst them is considered (Step 2) in the context of the other words in the sentence. For example, an input sentence such as 'Who had pick sales' would get repaired to 'Who had peak sales', since the domain term 'peak sales' would match with 'pick sales' in our method. Further, input sentences need not necessarily be queries; these can be just statements about a domain. For example, if the above ASR-output sentence was "We hit pick sales this season", the method would repair it as "We hit peak sales this season" using the same set of steps for repair. However, as of now, our method does not repair paraphrases of sentences, such as "which industry had biggest sales" to "which industry had peak sales". Such repairs need an extension to our matching process.

The method does not impose any restriction on the sentence or its formation; it can be a fully meaningful sentence in a domain or may contain partial information. The method finds the fittest repair for the inaccuracies occurring in a sentence, post-ASR recognition. It should also be noted that the method does not know the original sentence spoken by the speaker, but tries to get back the original sentence for a particular domain.

3.3 Machine Learning mechanism of adaptation

In the machine learning based mechanism of adaptation, we assume the availability of example pairs, namely (ASR output, the actual transcription of the spoken sentence), for training. We further assume that such a machine-learnt model can help repair an unseen ASR output to its intended correct sentence. We address the following hypothesis:

Using the information from past recorded errors and the corresponding correction, can we learn how to repair (and thus adapt to a new domain) the text after ASR?

Note that this is equivalent, albeit loosely, to learning the error model of a specific ASR. Since we have a small training set, we have used the Naive Bayes classifier, a high-bias, low-variance model that is known to perform well on small datasets. We have used the NLTK (Bird et al., 2009) Naive Bayes classifier in all our experiments.

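Let $S'$ be the erroneous text (which is the ASR output), $T$ the corresponding reference text (which is the textual representation of the spoken sentence) and $F$ a feature extractor, such that

$$f = F(S') \tag{2}$$

where

$$f = (f_1, f_2, \ldots, f_n) \tag{3}$$

is a set of $n$ features extracted from $S'$. Suppose there are several pairs, say $(S'_i, T_i)$ for $i = 1, 2, \ldots, N$. Then we can derive $f^{(i)}$ for each $S'_i$ using (2). The probability that $S'_k$ belongs to the class $T_k$ can be derived through the feature set $f^{(k)}$ as follows:

$$P(T_k \mid f^{(k)}) = \frac{P(T_k)\, P(f^{(k)} \mid T_k)}{P(f^{(k)})}$$

where $P(T_k)$ is the a priori probability of the class $T_k$, $P(f^{(k)} \mid T_k)$ is the probability of occurrence of the features $f^{(k)}$ in the class $T_k$, and $P(f^{(k)})$ is the overall probability of the occurrence of the feature set $f^{(k)}$. Making the naive assumption of independence among the features, we get

$$P(T_k \mid f^{(k)}) = \frac{P(T_k) \prod_{j=1}^{n} P(f^{(k)}_j \mid T_k)}{P(f^{(k)})} \tag{4}$$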
In our experiments, the domain-specific reference text was spoken by several people and the speech was passed through a general-purpose speech recognition engine (ASR) that produced a (possibly) erroneous hypothesis. Each pair of reference and ASR output (i.e. hypothesis) was then word-aligned using edit distance, and the mismatching words were extracted as (ASR output, reference) pairs. For example, if we have the following spoken sentence:

and the corresponding true transcription

One of the corresponding ASR outputs was

In this case the pairs are (dear, beer) and (have, has). As another example consider that was spoken but was recognized by the ASR.

Clearly, in this case the pair is (than twenty, jewelry).
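The pair extraction described above can be sketched as follows. This is a minimal illustration, not the alignment code used in the experiments: difflib's matching-block alignment stands in for a true edit-distance alignment, the function name is hypothetical, and the second test sentence is invented to reproduce the (than twenty, jewelry) pair.

```python
from difflib import SequenceMatcher


def mismatched_pairs(asr_output, reference):
    """Return (ASR text, reference text) pairs for the word spans that differ."""
    hyp, ref = asr_output.split(), reference.split()
    matcher = SequenceMatcher(None, hyp, ref, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Pure insertions or deletions yield an empty string on one side.
            pairs.append((" ".join(hyp[i1:i2]), " ".join(ref[j1:j2])))
    return pairs


print(mismatched_pairs("Who had pick sales", "Who had peak sales"))
# Hypothetical sentence pair, chosen only to show a multi-word mismatch.
print(mismatched_pairs("sales of than twenty stores", "sales of jewelry stores"))
```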

Let us assume two features, so that the feature vector $f$ in (2) is of dimension $n = 2$. Let the two features be $f = (\text{number of words}, \text{number of syllables})$. Then, for the pair (than twenty, jewelry) we have

$$f = (2, 3)$$

since the number of words in than twenty is $2$ and than twenty contains $3$ syllables. $P(f_1 = 2 \mid \text{jewelry})$ in this case would be the probability that the number of words in the input is two ($f_1 = 2$) when the correction is jewelry. A third example is:

Note that in this case the pair is (peak sales, pixel).

Calculating thus the values of $P(f_j \mid T_k)$ for all reference corrections $T_k$, for all feature values, and for all the features in $f$, we are in a position to calculate the RHS of (4). When this trained classifier is given an erroneous text, features are extracted from this text and the repair works by replacing the erroneous word by the correction that maximizes (4),

$$T^{*} = \arg\max_{T_k} P(T_k \mid f)$$

namely, the $T_k$ for which $P(T_k \mid f)$ is maximum.
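The classification step can be sketched in a few lines of plain Python. The experiments used the NLTK Naive Bayes classifier; the from-scratch version below is only a stand-in that makes the product in (4) explicit. The class name, the add-one smoothing and the vowel-group count (a crude proxy for the number of syllables) are assumptions, and the three training pairs are the ones quoted in the examples above.

```python
import math
import re
from collections import Counter, defaultdict


def extract_features(error_text):
    """Hypothetical extractor F: (number of words, vowel-group count).

    The vowel-group count is only a crude proxy for the number of syllables.
    """
    n_words = len(error_text.split())
    n_vowel_groups = len(re.findall(r"[aeiouy]+", error_text.lower()))
    return (n_words, n_vowel_groups)


class NaiveBayesRepair:
    def __init__(self, pairs):
        self.class_counts = Counter()             # correction -> count
        self.value_counts = defaultdict(Counter)  # (correction, j) -> value counts
        self.seen_values = defaultdict(set)       # j -> values seen in training
        for error_text, correction in pairs:
            self.class_counts[correction] += 1
            for j, value in enumerate(extract_features(error_text)):
                self.value_counts[(correction, j)][value] += 1
                self.seen_values[j].add(value)

    def log_score(self, features, correction):
        """Log of the numerator of (4); the denominator is the same for every class."""
        n_class = self.class_counts[correction]
        score = math.log(n_class / sum(self.class_counts.values()))
        for j, value in enumerate(features):
            count = self.value_counts[(correction, j)][value]
            # Add-one smoothing so an unseen feature value does not zero the product.
            score += math.log((count + 1) / (n_class + len(self.seen_values[j])))
        return score

    def repair(self, error_text):
        """Return the correction that maximizes the score for this erroneous text."""
        features = extract_features(error_text)
        return max(self.class_counts, key=lambda c: self.log_score(features, c))


PAIRS = [("dear", "beer"), ("have", "has"), ("than twenty", "jewelry")]
model = NaiveBayesRepair(PAIRS)
print(extract_features("than twenty"), model.repair("than twenty"))
```

With only three training pairs the model simply memorizes them; the point of the sketch is the shape of the computation, not its accuracy.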

4 Experiments and results

We present the results of our experiments with both the Evo-Devo and the Machine Learning mechanisms described earlier using the U.S. Census Bureau conducted Annual Retail Trade Survey of U.S. Retail and Food Services Firms for the period of to (USCensus, 2015).

4.1 Data Preparation

We downloaded this survey data and hand crafted a total of textual questions (AwazYP, 2015) which could be answered from the survey data. A set of people (L2 English) generated queries each, with the only constraint that these queries should be answerable from the survey data. In all a set of queries were crafted, of which duplicate queries were removed to leave queries in all. Of these, we chose queries randomly and distributed them among Indian speakers, who were asked to read the queries aloud into a custom-built audio data collection application. So, in all we had access to audio queries spoken by different Indian speakers; each speaking queries.

Figure 2: Accuracy (y-axis) for each utterance (x-axis) for Ga, Ki, Ku and Ps.

Each of these audio utterances was passed through four different ASR engines, namely, Google ASR (Ga), Kaldi with US acoustic models (Ku), Kaldi with Indian acoustic models (Ki) and PocketSphinx ASR (Ps). In particular, the audio utterances were in wave format (.wav) with a sampling rate of kHz and bit. In the case of Google ASR (Ga), each utterance was first converted into .flac format using the sound exchange (sox) utility commonly available on Unix machines. The .flac audio files were sent to the cloud-based Google ASR (Ga) one by one in batch mode and the text string returned by Ga was stored. In all, 7 utterances did not get any text output (see Table 2), presumably because Ga was unable to recognize them. For all the other utterances a text output was received.

In the case of the other ASR engines, namely, Kaldi with US acoustic models (Ku), Kaldi with Indian acoustic models (Ki) and PocketSphinx ASR (Ps), we first took the queries corresponding to the utterances and built a statistical language model (SLM) and a lexicon using the scripts that are available with PocketSphinx (CMU, 2017) and Kaldi (Kaldi, 2017). This language model and lexicon were used with the acoustic models that were readily available with Kaldi and Ps. In the case of Ku we used the American English acoustic models, while in the case of Ki we used the Indian English acoustic model. In the case of Ps we used the VoxForge acoustic models (VoxForge, 2017). Each utterance was passed through the Kaldi ASR with the two different acoustic models to get the outputs corresponding to Ku and Ki. Similarly, all the audio utterances were passed through the Ps ASR to get the corresponding output for Ps. A sample utterance and the output of the four engines is shown in Figure 3.

Figure 3: Sample output of four different ASR engines for the same spoken utterance. Also shown is the accuracy of each ASR output.

Figure 2 and Table 2 capture the performance of the different speech recognition engines. The performance of the ASR engines varied, with Ki performing the best with 127 of the 250 utterances being correctly recognized, while Ps returned only 44 correctly recognized utterances (see Table 2, column named "Correct"). The accuracy of the ASR output varied widely. For instance, in the case of Ps there were as many as 97 erroneously recognized utterances with an accuracy of less than 70%.

Figure 4: All utterances that have an accuracy (y-axis) of at least 70% but less than 100%, and that are used in all our experiments.

Note that the accuracy is computed from the number of deletions, insertions and substitutions that are required to convert the ASR output to the textual reference, and is a common metric used in speech literature (Hunt, 1990).
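A minimal sketch of such an accuracy computation is given below. The exact formula is an assumption: we take accuracy as $1 - (S + D + I)/N$, where $S$, $D$ and $I$ are the word-level substitutions, deletions and insertions and $N$ is the number of words in the reference. The function name is hypothetical and the reference is assumed to be non-empty.

```python
def word_accuracy(asr_output, reference):
    """Word accuracy, assumed here to be 1 - (word edit distance / reference length)."""
    hyp, ref = asr_output.split(), reference.split()
    # prev[j] holds the number of edits needed to turn the hypothesis prefix
    # processed so far into the first j reference words.
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # drop a hypothesis word
                            curr[j - 1] + 1,      # add a missing reference word
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return 1.0 - prev[-1] / len(ref)


print(word_accuracy("Who had pick sales", "Who had peak sales"))
```

Under this assumed definition, one wrong word in a four-word reference gives an accuracy of 75%, and a heavily garbled output with many insertions can score below zero.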

For all our analysis, we used only those utterances that had an accuracy of at least 70% but less than 100%, namely, 486 instances (see Table 2, Figure 4). An example showing the same utterance being recognized by four different ASR engines is shown in Figure 3. Note that, for this utterance, we used the outputs of Ga, Ki and Ku in our analysis (accuracy of at least 70%) and not the output of Ps, whose accuracy falls below this threshold. This is based on our observation that any ASR output that is less than 70% accurate is so erroneous that it is not possible to adapt it and steer it towards the expected output.

ASR engine           Result (A+B)   No result   Correct (A)   Error (B)   >=70%   <70%
Google ASR (Ga)      243            7           55            188         143     45
Kaldi US (Ku)        250            0           103           147         123     24
Kaldi IN (Ki)        250            0           127           123         111     12
PocketSphinx (Ps)    250            0           44            206         109     97
Total                993            7           329           664         486     178
Table 2: ASR engines and the accuracy (%) of their output. The last two columns split the erroneous outputs (B) by accuracy.

The ASR outputs are then given as input to the Evo-Devo and Machine Learning mechanisms of adaptation.

4.2 Evo-Devo based experiments

We ran our Evo-Devo mechanism with the 486 ASR sentences (see Table 2) and measured the accuracy after each repair. On an average we have achieved about to % improvements in the accuracy of the sentences. Fine-tuning the repair and fitness functions, namely Equation (1), would probably yield much better accuracies. However, the experimental results confirm that the proposed Evo-Devo mechanism is able to adapt the ASR output to get closer to the reference text. We present a snapshot of the experiments with Google ASR (Ga) and calculate accuracy with respect to the user's spoken question, as shown in Table 3.

User's question, Google ASR output (Ga), after Evo-Devo (ED) | Accuracy
Ga: 80%
ED: 100%
ED: 85.7%
Ga: 85.7%
ED: 89.3%
Table 3: Evo-Devo experiments with Google ASR (Ga).

Table 3 demonstrates the promise of the evo-devo mechanism for adaptation/repair. In our experiments we observed that the sub-parts of the ASR output that most probably referred to domain terms were repaired well and easily, thus contributing to the increase in accuracy. For non-domain-specific linguistic terms the method requires one to build very good linguistic repair rules, without which the method could lead to a decrease in accuracy. One may need to fine-tune the repair, match and fitness functions for linguistic terms. However, we find the evo-devo abstraction very apt for this task.

4.3 Machine Learning experiments

In the machine learning technique of adaptation, we consider (erroneous text, reference text) pairs as the predominant entity and test the accuracy of classification of errors.

In our experiment, we used a total of misrecognition errors (for example, (dear, beer) and (have, has), or (than twenty, jewelry), derived from the example sentences given earlier) in the sentences. We performed -fold cross validation, each fold containing pairs for training and pairs for testing. Note that we assume the erroneous words in the ASR output to be marked by a human oracle, in the training as well as the testing set. Suppose the following example occurs in the training set:

The classifier is given the pair {(latest stills), cumulative sales}. And if the following example occurs in the testing set,

the trained model or the classifier is provided (wine) and successful repair would mean it correctly labels (adapts) it to remain the. The features used for classification were ($n = 6$ in Equation (3)):

Left context (word to the left of the erroneous text),
Number of errors in the entire ASR sentence,
Number of words in the erroneous text,
Right context (word to the right of the erroneous text),
Bag of vowels of the erroneous text, and
Bag of consonants of the erroneous text.

The combination of five of these features, namely (bag of consonants, bag of vowels, left context, number of words, right context), gave the best results with % improvement in accuracy in classification over -fold validation.
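The six features listed above can be computed with a short extractor such as the one below. It is a sketch under stated assumptions: the function name, the dictionary keys and the sentence-boundary markers "<s>" and "</s>" are hypothetical, and a "bag" is represented as the sorted string of letters with repeats kept. The erroneous span is assumed to have been marked beforehand, as done by the human oracle in our experiments.

```python
VOWELS = "aeiou"


def extract_features(words, start, end, n_errors):
    """Features for the erroneous span words[start:end] of an ASR sentence."""
    span = " ".join(words[start:end]).lower()
    letters = [c for c in span if c.isalpha()]
    return {
        # Hypothetical boundary markers when the span touches a sentence edge.
        "left_context": words[start - 1].lower() if start > 0 else "<s>",
        "right_context": words[end].lower() if end < len(words) else "</s>",
        "num_errors": n_errors,
        "num_words": end - start,
        "bag_of_vowels": "".join(sorted(c for c in letters if c in VOWELS)),
        "bag_of_consonants": "".join(sorted(c for c in letters if c not in VOWELS)),
    }


features = extract_features("Who had pick sales".split(), start=2, end=3, n_errors=1)
print(features)
```

Feature dictionaries of this shape, paired with the reference correction as the label, are the kind of input that NLTK's `NaiveBayesClassifier.train` accepts.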

The experimental results for both evo-devo and machine learning based approaches demonstrate that these techniques can be used to correct the erroneous output of ASR. This is what we set out to establish in this paper.

5 Conclusions

General-purpose ASR engines, when used for enterprise domains, may output erroneous text, especially when encountering domain-specific terms. One may have to adapt/repair the ASR output in order to do further natural language processing such as question-answering. We have presented two mechanisms for adaptation/repair of ASR-output with respect to a domain. The Evo-Devo mechanism provides a bio-inspired abstraction to help structure the adaptation and repair process. This is one of the main contributions of this paper. The machine learning mechanism provides a means of adaptation and repair by examining the feature-space of the ASR output. The results of the experiments show that both these mechanisms are promising and may need further development.

6 Acknowledgments

Nikhil, Chirag and Aditya contributed to conducting some of the experiments. We acknowledge their contribution.


References

  • Lee et al. [2010] Cheongjae Lee, Sangkeun Jung, Kyungduk Kim, Donghyeon Lee, and Gary Geunbae Lee. Recent approaches to dialog management for spoken dialog systems. Journal of Computing Science and Engineering, 4(1):1–22, 2010.
  • Anantaram et al. [2015a] C. Anantaram, Rishabh Gupta, Nikhil Kini, and Sunil Kumar Kopparapu. Adapting general-purpose speech recognition engine output for domain-specific natural language question answering. In Workshop on Replicability and Reproducibility in Natural Language Processing: adaptive methods, resources and software at IJCAI 2015, Buenos Aires, 2015a.
  • Ringger and Allen [1996] E.K. Ringger and J.F. Allen. Error correction via a post-processor for continuous speech recognition. In Acoustics, Speech, and Signal Processing, 1996. ICASSP-96. Conference Proceedings., 1996 IEEE International Conference on, volume 1, pages 427–430 vol. 1, May 1996. doi: 10.1109/ICASSP.1996.541124.
  • Shi [2008] Yongmei Shi. An Investigation of Linguistic Information for Speech Recognition Error Detection. PhD thesis, University of Maryland, Baltimore County, October 2008.
  • Zhou et al. [2005] Lina Zhou, Jinjuan Feng, A. Sears, and Yongmei Shi. Applying the naive bayes classifier to assist users in detecting speech recognition errors. In System Sciences, 2005. HICSS ’05. Proceedings of the 38th Annual Hawaii International Conference on, pages 183b–183b, Jan 2005. doi: 10.1109/HICSS.2005.99.
  • Jeong et al. [2004] Minwoo Jeong, Byeongchang Kim, and G Lee. Using higher-level linguistic knowledge for speech recognition error correction in a spoken q/a dialog. In HLT-NAACL special workshop on Higher-Level Linguistic Information for Speech Processing, pages 48–55, 2004.
  • López-Cózar and Callejas [2008] Ramón López-Cózar and Zoraida Callejas. ASR post-correction for spoken dialogue systems based on semantic, syntactic, lexical and contextual information. Speech Commun., 50(8-9):745–766, August 2008. ISSN 0167-6393. doi: 10.1016/j.specom.2008.03.008.
  • Bassil and Semaan [2012] Youssef Bassil and Paul Semaan. ASR context-sensitive error correction based on microsoft n-gram dataset. CoRR, abs/1203.5262, 2012.
  • Jun and Lei [2011] J. Jun and L. Lei. ASR post-processing correction based on NER and pronunciation primitive. In 2011 7th International Conference on Natural Language Processing and Knowledge Engineering, pages 126–131, Nov 2011. doi: 10.1109/NLPKE.2011.6138180.
  • Amento et al. [2007] B. Amento, P. Isenhour, and L. Stead. Error correction in automatic speech recognition transcripts, September 6 2007. US Patent App. 11/276,476.
  • Harwath et al. [2014] David Harwath, Alexander Gruenstein, and Ian McGraw. Choosing useful word alternates for automatic speech recognition correction interfaces. In INTERSPEECH-2014, pages 949–953, 2014.
  • Harding and Banzhaf [2008] Simon Harding and Wolfgang Banzhaf. Artificial development. In Organic Computing, Understanding Complex Systems, pages 201–219. Springer Berlin Heidelberg, 2008. ISBN 978-3-540-77656-7. doi: 10.1007/978-3-540-77657-4_9.
  • Anantaram et al. [2015b] C Anantaram, Nikhil Kini, Chirag Patel, and Sunil Kopparapu. Improving ASR recognized speech output for effective NLP. In The Ninth International Conference on Digital Society, pages 17–21, Lisbon, Portugal, Feb 2015b.
  • Tufte [2008] Gunnar Tufte. From Evo to EvoDevo: Mapping and Adaptation in Artificial Development. Development, 2008.
  • Bhat et al. [2007] Shefali Bhat, C. Anantaram, and Hemant Jain. Framework for text-based conversational user-interface for business applications. In Proceedings of the 2nd International Conference on Knowledge Science, Engineering and Management, KSEM’07, pages 301–312, Berlin, Heidelberg, 2007. Springer-Verlag. ISBN 3-540-76718-5, 978-3-540-76718-3.
  • Naumann [2015] Felix Naumann., 2015.
  • Bird et al. [2009] Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python. O’Reilly Media, Inc., 1st edition, 2009. ISBN 0596516495, 9780596516499.
  • USCensus [2015] USCensus., 2015. Viewed Sep 2015.
  • AwazYP [2015] AwazYP., 2015. Viewed Aug 2017.
  • CMU [2017] CMU. Building language model for pocketsphinx, 2017.
  • Kaldi [2017] Kaldi. Overview of graph creation in kaldi, 2017.
  • VoxForge [2017] VoxForge. Updated 8khz sphinx acoustic model, 2017.
  • Hunt [1990] Melvyn J. Hunt. Figures of merit for assessing connected-word recognisers. Speech Communication, 9(4):329–336, 1990. ISSN 0167-6393.