Automatic Documentation of ICD Codes with Far-Field Speech Recognition

Albert Haque
Corinna Fukushima
Ambient Intelligence Research, Sensa Inc, Palo Alto, CA, USA
Department of Computer Science, Stanford University, CA, USA
Rush University Medical Center, Chicago, IL, USA

Documentation errors increase healthcare costs and cause unnecessary patient deaths. As the standard language for diagnoses and billing, ICD codes serve as the foundation for medical documentation worldwide. Despite the prevalence of electronic medical records, hospitals still witness high levels of ICD miscoding. In this paper, we propose to automatically document ICD codes with far-field speech recognition. Far-field speech occurs when the microphone is located several meters from the source, as is common with smart homes and security systems. Our method combines acoustic signal processing with recurrent neural networks to recognize and document ICD codes in real time. To evaluate our model, we collected a far-field speech dataset of ICD-10 codes used in emergency departments and found our model to achieve 87% accuracy with a BLEU score of 85%. By sampling from an unsupervised medical language model, our method is able to outperform existing methods. This work shows the potential of automatic speech recognition to provide efficient, accurate, and cost-effective documentation.

1 Introduction

More than 250,000 people die every year in the United States due to medical errors, making medical error the third leading cause of death (James, 2013; CDC, 2016). Of those who survive an error, many become permanently disabled (Schiff et al., 2009). Many of these errors are preventable, such as inaccurate drug doses, unlisted allergies, and wrong-site amputations. Directly responsible for many of these errors is poor documentation (Hartel et al., 2011; Mulloy & Hughes, 2008; Henderson et al., 2006). Worldwide, ICD (International Classification of Diseases, World Health Organization) codes serve as the standard language for diagnoses, treatment, and billing (CMS, 2016; NHS, 2017). However, ICD miscoding rates are as high as 20%, with similar rates dating back to the 1990s (Cheng et al., 2009; Henderson et al., 2006; MacIntyre et al., 1997). Despite electronic medical records, documentation errors still occur and cost the United States up to $25 billion each year (Lang, 2007; Farkas & Szarvas, 2008).

One of the main sources of ICD miscoding is patient admission (O’Malley et al., 2005). This is especially important for emergency departments, which handle 141 million visits each year in the United States (CDC, 2014). However, emergency departments are overcrowded, leading to overworked and rushed clinical teams, which can ultimately produce more documentation errors (US GAO, 2009; Sun et al., 2013; Hooper et al., 2010). While there has been work on inferring ICD codes from text, those methods still require a written document (Larkey & Croft, 1995; Koopman et al., 2015; Subotin & Davis, 2016; Shi et al., 2017; Tutubalina & Miftahutdinov, 2017; Li et al., 2018; Huang et al., 2018). If we can automate such documentation, not only can we potentially reduce medical errors, but we can also free up time for medical teams, who spend up to 26% of their time on documentation tasks alone (Ammenwerth & Spötl, 2009; Arndt et al., 2017).

Figure 1: Far-field speech vs near-field speech visualized as audio spectrograms. (a) Far-field speech at 12 feet (3.6 meters). (b) Near-field speech at 2 feet (0.6 meters). The x-axis denotes time, the y-axis denotes frequency, and the colors denote intensity. Brighter colors (yellow-green) indicate stronger intensity. (c) Raw waveform. (d) Transcription of the spoken words.

Outside of healthcare, smart homes have enjoyed the benefits of voice-based home assistants such as Google Home and Amazon Alexa (Li et al., 2017). These devices can perform interactive search queries to accelerate ordinary household workflows such as cooking or morning routines. This “always-on” environment gives rise to ambient intelligence: a sensor-equipped environment that adapts to users as they interact with the environment (Aarts & Wichert, 2009). These smart devices often rely on internet-connected knowledge-bases to perform real-time information retrieval (Kopetz, 2011). However, such functionality is not required to improve the quality of current medical documentation. In this paper, we bring the advances of smart homes and far-field speech recognition to automatically document ICD codes in healthcare.

Far-field speech recognition serves as the foundation for voice interaction in many devices today by recognizing distant human speech, usually one to ten meters away. Such a setting presents technical challenges associated with acoustic reverberations and external background noises (see Figure 1). As a result, the accuracy of far-field speech recognition is significantly lower than that of near-field speech recognition (e.g., talking to a cell phone) (Kinoshita et al., 2013). Additionally, hospitals are more dynamic and often faster-paced than home or office environments, especially in the case of emergency departments. Finally, medical words, with their esoteric Latin roots, are some of the most difficult to pronounce, even for native English speakers (Clagett, 1941; Skinner, 1961).

1.1 Contributions

In this work, we propose a method to automatically document ICD codes from far-field speech. The input to our model is a spectrogram and the output is a text transcription.

  • First, we present a method for far-field speech recognition. We show that by sampling from an unsupervised language model during training, we can achieve a knowledge distillation effect to improve overall transcription performance.

  • Second, we perform a series of ablation studies to better understand the strengths and limitations of our method in the context of hospital settings. We analyze the effect of background noise such as ambulance sirens at various intensity levels. Given the importance of ICD codes for surgeries and amputations, we also analyze our model’s ability to discern keywords such as left and right and compare it to human performance.

2 Related Work

Medical Transcription

Today, transcription is performed by medical scribes or automatic speech recognition software (ASR) (Gellert et al., 2015; Mandel, 1992). In the healthcare literature, there has been a lot of discussion centered around (i) comparative performance analysis of different ASR systems and (ii) usability studies.

Performance evaluations of medical ASR systems typically arrive at the same conclusion: ASR accuracy is acceptable, with word error rates of 7% to 15% (Zafar et al., 2004; Mandel, 1992; Tritschler & Gopinath, 1999). Some methods employ medical dictionaries to boost performance (Zafar et al., 1999). Usability studies consist of evaluating ASR systems in practice. Zick & Olsen (2001) found that ASR achieves similar accuracy to human transcription but ASR demonstrates a faster turnaround time and is more cost-effective. Borowitz (2001) arrived at the same conclusion, with ASR completing 100% of notes within one day, while a medical scribe completed 24%. ASR has improved documentation speed and quality across many specialties such as radiology (Bhan et al., 2008), pathology (Al-Aynati & Chorneyko, 2003), and emergency medicine (Gröschel et al., 2004).

While healthcare ASR methods demonstrate accuracy rates in the 90%+ range, they still occasionally fail on common cases such as past tense words (Devine et al., 2000). Furthermore, they have one critical disadvantage which limits widespread adoption: they all operate on near-field speech. The speaker is required to be physically at a computer, wear a microphone, or speak into a hand-held recorder. In this paper, we propose a healthcare ASR system on far-field speech. Our goal is to transcribe ICD codes using microphones placed 10+ feet (3+ meters) from the speaker. This will minimize interruptions to clinical workflows, which has been cited as a potential barrier to ASR adoption (Cruz et al., 2014).

Recurrent Neural Networks

One of the seminal works using neural networks for sequential data is Connectionist Temporal Classification (CTC) by Graves et al. (2006). In their work, they propose a recurrent neural network (RNN) to output a variable number of phoneme-level tokens, depending on the sentence being spoken. This led to the development of sequence-to-sequence models (Sutskever et al., 2014; Cho et al., 2014). Commonly used in machine translation, sequence-to-sequence models comprise two RNNs: (i) an encoder RNN which processes the input into a single embedding and (ii) a decoder RNN which takes the embedding and outputs a variable-length sequence. This significantly increased the interest in and usage of RNNs across nearly all areas of machine perception (Donahue et al., 2015; Clark et al., 2017; Vinyals et al., 2015). Further propelling the field forward was the introduction of attention (Xu et al., 2015). By providing the decoder with access to the encoder’s input, output, or intermediate states (instead of a single embedding), attention improved the performance on many tasks (Luong et al., 2015; Mascharka et al., 2018). Graves continued this line of work on end-to-end speech recognition with much success (Graves, 2012; Graves et al., 2013a; b; Graves & Jaitly, 2014). Combined with large-scale datasets, this gave rise to high-performing speech models such as Deep Speech (Hannun et al., 2014; Amodei et al., 2016; Battenberg et al., 2017). While designed for speech synthesis, the Tacotron model (Wang et al., 2017; Shen et al., 2018) also demonstrated strong performance. In this work, we bridge the advances in end-to-end speech recognition with healthcare documentation.

3 Method

In this paper, our goal is to automatically document ICD codes using far-field speech recognition. The input to our model is an audio spectrogram and the output is a text transcription. Overall, our method consists of a sequence-to-sequence deep neural network as an acoustic model (see Figure 2). During training, the acoustic model samples from an unsupervised medical language model to improve transcription.

3.1 Acoustic Model

Modeling Long Medical Words. Medical words are often longer than words in conversational speech. They can contain many syllables, such as the words hyperventilation or calcaneofibular. As a consequence, a spoken ICD code is long, sometimes five to ten seconds in length. Modeling the full spectrogram would require unrolling the encoder RNN for an infeasibly large number of timesteps, typically on the order of hundreds to thousands of RNN timesteps (Sainath et al., 2015). Even with truncated backpropagation through time, this would be a challenging task (Haykin et al., 2001). We present two approaches to reduce the temporal dimensionality as part of the acoustic encoder.

Naive Temporal Dimensionality Reduction. Inspired by Sainath et al. (2015), we propose to reduce the temporal length of the input spectrogram by using a learned convolutional filter bank. As shown in Figure 1, silence appears throughout the spectrogram, denoted by dark blue regions. By treating the spectrogram as an image, convolutional filters can not only reduce the temporal dimension but also compress redundant frequency-level information, such as the absence of high-frequencies (i.e., Figure 1a, 0.5 to 1.0 seconds).

However, we found convolutional networks are unable to sufficiently reduce the temporal dimension to a manageable length. Recently, there has been work exploring the use of multi-layer neural networks such that each layer has a different temporal extent (Koutnik et al., 2014). WaveNet, a method for speech synthesis, employs dilated convolutions to control the temporal receptive field at each layer of the network (Oord et al., 2016; Dutilleux, 1990; Yu & Koltun, 2015). As a result, a single temporal representation from a high-level layer can encode hundreds, if not thousands of RNN timesteps. This is an attractive feature for our task, where ICD codes also span hundreds, if not thousands of timesteps.
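As a worked illustration of how dilation controls the temporal receptive field, the sketch below computes the receptive field of a stack of dilated one-dimensional convolutions. The kernel size and dilation schedule are illustrative assumptions (the doubling schedule follows the pattern described for WaveNet) and are not the configuration used in this paper.

```python
def receptive_field(kernel_size, dilations):
    """Number of input timesteps seen by one output unit of a stack of
    dilated 1-D convolutions (stride 1), one layer per dilation value."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Assumed schedule: kernel size 2, dilation doubling at every layer.
dilations = [2 ** i for i in range(10)]      # 1, 2, 4, ..., 512
print(receptive_field(2, dilations))         # 1024
```

With only ten layers, a single high-level unit already covers on the order of a thousand input timesteps, which is the property the paragraph above refers to.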

Exponential Temporal Dimensionality Reduction

To reduce the temporal dimension even further, we design the hierarchy as a pyramid. Formally, let $h_i^j$ denote the hidden state of a long short-term memory (LSTM) (Hochreiter & Schmidhuber, 1997) cell at the $i$-th timestep of the $j$-th layer (see Equation 1). For a pyramidal LSTM (pLSTM), the outputs from the immediately preceding layer, which contain high-resolution temporal information, are concatenated,

$$h_i^j = \mathrm{pLSTM}\left(h_{i-1}^{j}, \left[h_{2i}^{j-1}, h_{2i+1}^{j-1}\right]\right) \qquad (1)$$

where $[\cdot, \cdot]$ denotes the concatenation operator. In Equation 1, the output of a pLSTM unit is now a function of not only its previous hidden state, but also the outputs from previous timesteps from the layer below. The pLSTM provides us with an exponential reduction in the number of RNN timesteps. Not only does the pyramidal RNN provide higher-level temporal features, but it also reduces the inference complexity (Chan et al., 2016). Because we do not use a bidirectional RNN, our model can run in real time. After the pLSTM (encoder) processes the input, the states are given to the decoder.
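The temporal reduction performed by a pyramidal layer can be sketched as follows. This is a minimal illustration of the concatenation step only, assuming that each layer merges pairs of consecutive timesteps as in Chan et al. (2016); the recurrent cell itself is omitted, and the function names and array sizes are hypothetical.

```python
import numpy as np


def pyramid_concat(h):
    """Concatenate consecutive pairs of timesteps: (T, D) -> (T // 2, 2 * D).

    Row i of the result is [h[2i], h[2i + 1]], i.e. the input that a
    pyramidal layer would receive at its i-th timestep.
    """
    T, D = h.shape
    T_even = T - (T % 2)                 # drop a trailing odd frame
    return h[:T_even].reshape(T_even // 2, 2 * D)


def pyramid_lengths(T, num_layers):
    """Sequence length at each level when every layer halves the time axis."""
    lengths = [T]
    for _ in range(num_layers):
        lengths.append(lengths[-1] // 2)
    return lengths


# Hypothetical input: 1000 spectrogram frames with 40 features each.
frames = np.random.default_rng(0).normal(size=(1000, 40))
reduced = pyramid_concat(frames)
print(reduced.shape)                     # (500, 80)
print(pyramid_lengths(1000, 3))          # [1000, 500, 250, 125]
```

Stacking such layers shortens the sequence geometrically, so a decoder attending over the top layer sees far fewer encoder states than there are input frames.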

Figure 2: Overview of our method for recognizing ICD codes. The dashed arrow indicates random sampling/flow. Green squares denote the encoder, yellow is the decoder, blue denotes the language model, and gray denotes the model output.

Attention-Based Decoder

The concept of attention has proven useful in many tasks (Chorowski et al., 2015; Mascharka et al., 2018). With attention, each step of the model’s decoder has access to all of the encoder’s outputs. The goal of such an “information shortcut” is to address the challenge of learning long-range temporal dependencies (Bengio et al., 1994). The distribution for the predicted word $y_i$ is a function of the decoder state $s_i$ and attention context $c_i$. The context vector $c_i$ is produced by an attention mechanism (Chan et al., 2016). Specifically, $c_i = \sum_u \alpha_{i,u} h_u$, where attention $\alpha_{i,u}$ is defined as the alignment between the current decoder timestep $i$ and encoder timestep $u$:

$$\alpha_{i,u} = \frac{\exp(e_{i,u})}{\sum_{u'} \exp(e_{i,u'})} \qquad (2)$$

and where the score $e_{i,u}$ between the output of the encoder or the hidden states, $h_u$, and the previous state of the decoder cell, $s_{i-1}$, is computed with $e_{i,u} = \langle \phi(s_{i-1}), \psi(h_u) \rangle$, where $\phi$ and $\psi$ are sub-networks, e.g. multi-layer perceptrons, whose weights are learnable parameters.
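The attention computation described above can be sketched in a few lines. This is an illustrative version under stated assumptions: the two sub-networks are taken to be single tanh layers, the score is their inner product, and all names, shapes, and weights are hypothetical rather than taken from the paper.

```python
import numpy as np


def attention_context(s_prev, H, W_phi, W_psi):
    """Return the context vector and alignment weights for one decoder step.

    s_prev : (S,)    previous decoder state
    H      : (U, Dh) encoder outputs, one row per encoder timestep
    W_phi  : (A, S)  weights of the decoder-side sub-network (assumed 1 layer)
    W_psi  : (A, Dh) weights of the encoder-side sub-network (assumed 1 layer)
    """
    query = np.tanh(W_phi @ s_prev)          # (A,)
    keys = np.tanh(H @ W_psi.T)              # (U, A)
    scores = keys @ query                    # (U,) one score per encoder step
    scores = scores - scores.max()           # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()   # softmax over encoder steps
    context = alpha @ H                      # (Dh,) weighted sum of encoder outputs
    return context, alpha


rng = np.random.default_rng(0)
H = rng.normal(size=(125, 64))               # e.g. 125 encoder steps after the pyramid
s_prev = rng.normal(size=32)
context, alpha = attention_context(
    s_prev, H, rng.normal(size=(16, 32)), rng.normal(size=(16, 64))
)
print(context.shape, alpha.shape)            # (64,) (125,)
```

The alignment weights sum to one, so the context vector is a convex combination of encoder outputs that the decoder can recompute at every step.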

The final output from the decoder is a sequence of word embeddings, which encode the transcribed sentence (e.g., one-hot vector). This can be trained by optimizing the cross-entropy loss objective. Many previous works end at this point by performing greedy-search or beam-search decoding methods (Sutskever et al., 2014; Graves et al., 2006), but we propose to randomly sample from the language model to improve the transcription.

3.2 Language Model

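Our language model is an n-gram model. Given a sequence of words $w_1, \ldots, w_m$, the model assigns a probability $P(w_1, \ldots, w_m)$ of the sequence occurring in the training corpus:

$$P(w_1, \ldots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \ldots, w_{i-1}) \qquad (3)$$
$$\approx \prod_{i=1}^{m} P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1}) \qquad (4)$$

where $n$ is the n-gram order. To overcome the paucity of higher-order n-grams, we approximate the n-gram probability by interpolating the individual n-gram probabilities.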
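To make the interpolation idea concrete, the sketch below builds a bigram model and mixes it with unigram probabilities. It is a minimal illustration only: the interpolation weight, the probability floor, and the three-sentence corpus are assumptions, and the model in the paper uses higher orders and the full ICD corpus.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count unigrams and bigrams, padding each sentence with <go> and <end>."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<go>"] + sentence.lower().split() + ["<end>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def interpolated_prob(word, prev, unigrams, bigrams, weight=0.7):
    """weight * P(word | prev) + (1 - weight) * P(word); weight is an assumption."""
    total = sum(unigrams.values())
    p_uni = unigrams[word] / total
    p_bi = bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
    return weight * p_bi + (1.0 - weight) * p_uni


def sentence_log_prob(sentence, unigrams, bigrams, weight=0.7):
    tokens = ["<go>"] + sentence.lower().split() + ["<end>"]
    log_p = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        p = interpolated_prob(word, prev, unigrams, bigrams, weight)
        log_p += math.log(max(p, 1e-12))     # floor avoids log(0) for unseen words
    return log_p


corpus = [
    "generalized abdominal pain",
    "lower abdominal pain",
    "intracranial injury without loss of consciousness",
]
uni, bi = train_bigram(corpus)
fluent = sentence_log_prob("generalized abdominal pain", uni, bi)
scrambled = sentence_log_prob("pain abdominal generalized", uni, bi)
print(fluent > scrambled)                    # True
```

Because the unigram term is never zero for in-vocabulary words, a word order unseen in training is penalized rather than ruled out entirely.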

3.3 Combining the Acoustic and Language Models

To improve our ICD transcriptions, we impose a training constraint on the acoustic model, subject to the language model. Such a concept is not new. When a language model is used during inference but not training (Chorowski & Jaitly, 2016; Wu et al., 2016), we call this shallow fusion. This was extended by Sriram et al. (2017) in the paper Cold Fusion to include the language model during training (Kannan et al., 2017). However, in their work, the language model was fixed during training and kept as a feature extractor. In our work, we also keep the language model constant, but we additionally combine cold fusion with scheduled sampling to achieve a distillation effect.

In traditional encoder-decoder RNN models, errors in early stages of the decoder may propagate to later timesteps, making it difficult to learn elements later in the sequence (Bengio et al., 2015). To combat this, during training, the ground truth is occasionally used as input for a timestep, as opposed to the RNN’s own, and possibly incorrect, prediction from its previous timestep. In our method, we also randomly sample from the language model when training the acoustic model. The resulting effect is a type of unsupervised knowledge distillation (Hinton et al., 2015). We can combine both the language model and our acoustic model to select the optimal transcription $y^*$:

$$y^* = \arg\max_{y} \; \lambda_{AM} \log P_{AM}(y \mid x) + \lambda_{LM} \log P_{LM}(y) \qquad (5)$$

where $AM$ and $LM$ denote the acoustic and language model, respectively. The $\lambda$’s control the language model sampling probability and also serve as mixing hyperparameters. The $P$’s denote the posterior probabilities and $x$ denotes the input spectrogram. The entire process is differentiable and can be optimized with first-order methods (Kingma & Ba, 2015).
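The two ingredients of this section, mixing acoustic and language model scores and occasionally feeding the decoder a token sampled from the language model, can be sketched as below. This is an illustration under assumptions: the scores are combined log-linearly, and the mixing weights, sampling probabilities, and candidate scores are hypothetical values chosen for the example, not values from the paper.

```python
import math
import random


def fused_score(log_p_am, log_p_lm, lam_am=1.0, lam_lm=0.3):
    """Weighted sum of acoustic and language model log-probabilities (assumed form)."""
    return lam_am * log_p_am + lam_lm * log_p_lm


def best_transcription(candidates, lam_am=1.0, lam_lm=0.3):
    """candidates maps text -> (log_p_am, log_p_lm); return the highest-scoring text."""
    return max(
        candidates,
        key=lambda y: fused_score(candidates[y][0], candidates[y][1], lam_am, lam_lm),
    )


def next_decoder_input(ground_truth, model_pred, lm_sample, p_truth, p_lm, rng):
    """Training-time choice of the token fed to the next decoder step:
    ground truth with probability p_truth, a language model sample with
    probability p_lm, otherwise the decoder's own previous prediction."""
    r = rng.random()
    if r < p_truth:
        return ground_truth
    if r < p_truth + p_lm:
        return lm_sample
    return model_pred


# Hypothetical scores: the acoustic model slightly prefers "with",
# while the language model prefers "without".
candidates = {
    "intracranial injury with loss of consciousness": (math.log(0.40), math.log(0.02)),
    "intracranial injury without loss of consciousness": (math.log(0.38), math.log(0.05)),
}
print(best_transcription(candidates))
print(next_decoder_input("without", "with", "without", 0.5, 0.2, random.Random(0)))
```

Setting `lam_lm` to zero recovers a purely acoustic decision, which makes the influence of the language model easy to isolate in an ablation.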

4 Experiments


We collected a far-field speech dataset of common ICD-10 codes used in emergency departments (AAPC, 2014). First, a pre-determined list of one hundred ICD-10 codes and descriptions was selected. This resulted in 141 unique words with an average character length of 7.3, median length of 7, maximum length of 16, and a standard deviation of 3.2 characters. Second, the vocabulary list was repeated five times by multiple speakers. This was done to collect diverse pitch tracks and intonation from each speaker. Third, full ICD code descriptions were generated by procedurally concatenating each individual word. For a single ICD code, this produced 20,730 acoustic variations per speaker, on average. Details on audio sampling rate and spectrogram processing can be found in Appendix B.
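The procedural generation step can be sketched as follows, assuming each word has several recorded takes stored as one-dimensional waveforms and that a full description is formed by joining one take per word. The function name, the dictionary layout, and the random placeholder audio are hypothetical.

```python
import itertools

import numpy as np


def concatenate_variations(word_takes, description, limit=None):
    """Yield waveforms for every combination of per-word takes.

    word_takes  : dict mapping a word to a list of 1-D arrays (its recordings)
    description : ICD code description, e.g. "generalized abdominal pain"
    limit       : optional cap on the number of variations produced
    """
    per_word = [word_takes[word] for word in description.split()]
    combos = itertools.product(*per_word)
    if limit is not None:
        combos = itertools.islice(combos, limit)
    for combo in combos:
        yield np.concatenate(combo)


# Placeholder audio: 5 takes per word, 100 samples per take.
rng = np.random.default_rng(0)
takes = {
    word: [rng.normal(size=100) for _ in range(5)]
    for word in ["generalized", "abdominal", "pain"]
}
variations = list(concatenate_variations(takes, "generalized abdominal pain"))
print(len(variations), variations[0].shape)   # 125 (300,)
```

The number of variations is the product of the number of takes per word, so it grows exponentially with the length of the description, which is why a cap such as `limit` becomes useful.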


A total of six speakers participated in data collection. The speakers were 50% female and 50% non-native English speakers. Each speaker stood 12 feet (3.6 meters) from a computer monitor and microphone. Each speaker was instructed to speak one word at a time, as displayed on the monitor. Each speaker spoke the words in the same order. At the end of data collection, each speaker was asked to complete the NASA Task Load Index questionnaire (Hart & Staveland, 1988) to measure cognitive load during the study. The results are summarized in Appendix A.

Training and Test Sets

For all experiments, one speaker was excluded from the training set and used as the test set. This was done six times, such that each speaker was part of the test set once (i.e., cross-validation), and the average WER and BLEU are reported. Although our procedural dataset generation technique can allow for millions of training examples, we limited the variations per ICD code to 1,000. As a result, the training set consisted of 60,480 sentences per speaker, for a total training size of 302,400 sentences. The test set consisted of 60,480 sentences from the leave-one-out speaker. All words were converted to a one-hot word embedding and used as the decoder targets. All utterances were padded with the start token <go> and end token <end>.
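The leave-one-speaker-out protocol can be sketched as a simple generator over folds. The speaker identifiers and the dictionary layout are hypothetical placeholders for however the utterances are actually stored.

```python
def leave_one_speaker_out(utterances_by_speaker):
    """Yield (held_out_speaker, train, test) with one fold per speaker."""
    for held_out in sorted(utterances_by_speaker):
        train = [
            utterance
            for speaker in sorted(utterances_by_speaker)
            if speaker != held_out
            for utterance in utterances_by_speaker[speaker]
        ]
        test = list(utterances_by_speaker[held_out])
        yield held_out, train, test


# Six hypothetical speakers with two placeholder utterances each.
data = {f"speaker_{k}": [f"s{k}_utt0", f"s{k}_utt1"] for k in range(6)}
for held_out, train, test in leave_one_speaker_out(data):
    print(held_out, len(train), len(test))    # e.g. speaker_0 10 2
```

With six speakers and 60,480 sentences per speaker, each fold trains on 5 × 60,480 = 302,400 sentences and tests on the remaining 60,480, matching the sizes reported above.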

Language Model

The language model was trained on the entire ICD medical corpus (CMS, 2016). The corpus consisted of 94,127 sentences, 7,153 unique words, and 922,201 total words. The shortest sentence contains 1 word and the longest contains 31 words, with a median of 9 words. The text was normalized by converting all characters to lower case. Punctuation marks such as commas, dashes, and semicolons were removed. Numbers were converted to canonical word form. The model was trained on n-grams up to length 10.
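A minimal sketch of the normalization step is shown below. It covers lower-casing and punctuation removal only; replacing punctuation with a space (rather than deleting it) is an assumption, and the conversion of numbers to word form is deliberately left out because the exact convention is not specified here.

```python
import re


def normalize(text):
    """Lower-case the text and strip punctuation, collapsing extra whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)   # assumption: punctuation becomes a space
    return " ".join(text.split())


print(normalize("Pain, unspecified; Left-sided"))   # pain unspecified left sided
```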


We evaluate four speech recognition baselines:

  • Connectionist Temporal Classification (CTC) (Graves et al., 2006). The CTC consists of an RNN which produces a variable number of output tokens (see Hannun (2017)).

  • Sequence-to-Sequence (Seq2Seq) (Sutskever et al., 2014). It consists of one RNN for encoding the input into an embedding and another RNN for decoding it into a sentence.

  • Listen, Attend, and Spell (LAS) (Chan et al., 2016). Same as the Seq2Seq model but the decoder is equipped with attention, allowing it to “peek” at the encoder outputs.

  • Cold Fusion (Sriram et al., 2017). A fixed language model is used when the acoustic model is trained. This can be applied to any of CTC, Seq2Seq, or LAS.


We use word error rate (WER) and BLEU as metrics. The WER is defined as $\mathrm{WER} = (S + D + I) / N$, where $S$ denotes the number of word substitutions, $D$ denotes deletions, $I$ denotes insertions, and $N$ denotes the number of words in the ground truth sentence. If $I = 0$, then accuracy is equivalent to recall (i.e., sensitivity). Word-level accuracy is denoted by $1 - \mathrm{WER}$. Although common in machine translation and less used in speech recognition, we use the Bilingual Evaluation Understudy (BLEU) metric since it can better capture contextual and syntactic roles of a word (He et al., 2011). We provide a list of audio processing, network architecture, and optimization details in Appendix B.
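The word error rate can be computed with a word-level edit distance. The sketch below is a standard dynamic-programming implementation, not the evaluation script used for the paper, and it uses two examples from Table 2 to show how substitutions, deletions, and insertions are counted.

```python
def wer_counts(reference, hypothesis):
    """Return (substitutions, deletions, insertions) of a minimum-cost alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = (total errors, S, D, I) for aligning ref[:i] with hyp[:j]
    dp = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)               # delete every reference word
    for j in range(len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)               # insert every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                diagonal = dp[i - 1][j - 1]
            else:
                c, s, d, n = dp[i - 1][j - 1]
                diagonal = (c + 1, s + 1, d, n)
            c, s, d, n = dp[i - 1][j]
            deletion = (c + 1, s, d + 1, n)
            c, s, d, n = dp[i][j - 1]
            insertion = (c + 1, s, d, n + 1)
            dp[i][j] = min(diagonal, deletion, insertion)
    return dp[len(ref)][len(hyp)][1:]


def word_error_rate(reference, hypothesis):
    s, d, i = wer_counts(reference, hypothesis)
    n = len(reference.split())
    if n == 0:
        raise ValueError("reference must contain at least one word")
    return (s + d + i) / n


# One deletion out of three reference words.
print(wer_counts("generalized abdominal pain", "abdominal pain"))   # (0, 1, 0)
# One substitution out of six reference words.
print(wer_counts(
    "intracranial injury without loss of consciousness",
    "pain injury without loss of consciousness",
))                                                                   # (1, 0, 0)
```

When several alignments share the minimum total cost, the split between the three error types can differ, but the WER itself is the same.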

Method | WER (Native) | WER (Non-Native) | BLEU (Native) | BLEU (Non-Native)
Human (Medically Trained)
Human (Untrained)
CTC (Graves et al., 2006)
Seq2seq (Sutskever et al., 2014)
LAS (Chan et al., 2016)
Cold Fusion (Battenberg et al., 2017)
Our Method
Table 1: Comparison with existing methods. Lower WER and higher BLEU are better. Human refers to manual transcription. Native refers to native English speakers. All methods were trained and evaluated on our ICD-10 dataset. ± denotes the 95% confidence interval. For details about the human transcription baselines, see Appendix C.1.

4.1 Technical Results

Table 1 shows quantitative results for existing methods and our proposed method. For most methods, the performance on non-native English speakers is lower than on native speakers. This is to be expected due to the larger variance in non-native speech, especially for complex Latin medical words. Surprisingly, Cold Fusion has a higher WER and lower BLEU than the LAS model, despite Cold Fusion using an external language model. One explanation is that it establishes a dependence on a potentially biased language model. Our method could be viewed as the same as Cold Fusion, but with a smaller mixing parameter. However, we did not run exhaustive tuning to find the optimal hyperparameters for this dataset. As for CTC, the poor performance could partially be attributed to CTC’s design for phoneme-level recognition. In our task, we evaluate CTC at the word level, which significantly increases the branching factor (i.e., there are more unique words than phonemes).

Qualitative Comparison

Table 2 shows “qualitative” results of our model compared to the baselines. While only two ICD codes are shown in Table 2, in general, many mistakes made by Cold Fusion and our method occur on words which have a complementary or opposite pair (e.g., with/without, left/right, lower/upper). Either word from the pair is valid, according to the language model. The slightest acoustic variation such as a breath or pause may cause the word to be incorrectly substituted. The word pain appears at the beginning of sentences often because .

N-Gram Size

Our language model uses n-grams to assign probabilities to predicted label sequences. As mentioned in Equation 4, our model can operate on n-grams of varying lengths. Figure 4 shows the effect of the n-gram length (e.g., bigram, trigram, 4-gram, etc.) on WER and BLEU. We train and evaluate multiple models with different n-gram lengths in the language model. The results indicate that bigrams, and potentially trigrams, are the optimal n-gram lengths for our ICD speech recognition task.

Method Transcription Transcription
CTC Generalized pain Pain abdominal
Seq2Seq Abdominal pain Pain injury without loss of consciousness
LAS Lower abdominal pain Intracranial injury without loss of consciousness
Cold Fusion Generalized abdominal pain Intracranial injury with loss of consciousness
Our Method Generalized abdominal pain Intracranial injury without loss of consciousness
Ground Truth Generalized abdominal pain Intracranial injury without loss of consciousness
Table 2: Transcriptions for two ICD codes. Bold text indicates a substitution, insertion, or deletion error. Each column indicates a single test-set example. The ground truth transcription is shown in the bottom row.

4.2 Healthcare-Motivated Ablation Experiments

Ambulance Sirens

Ambulance sirens are commonplace in hospitals, especially in emergency departments. We conducted an experiment to analyze the effect of sirens on speech recognition performance. Figure 3 shows the spectrogram for an audio clip where a siren is played while someone is speaking. As shown in Figure 3a, the triangular shapes denote the siren. By design, most sirens fall in the range of 1 kHz to 3 kHz. This is a higher frequency than most human voices (100 Hz to 300 Hz), so we would expect sirens to have little effect on speech recognition performance.

We applied two types of sirens: A) a long-rise and long-fall siren (Figure 3a, 0-20 seconds) and B) a fast, alternating wailing siren (Figure 3a, 20-30 seconds). The human scribes were unable to recognize the majority of words and demonstrated a WER of 84% for siren A and a WER of 87% for siren B. In contrast, our method demonstrated a WER of 35% and 61% for sirens A and B, respectively. The reason for such poor performance by the human scribes (see Appendix C.1) is that human ears are most sensitive to frequencies between 2 kHz and 5 kHz (Gelfand, 2009). Loud sounds at these frequencies can “drown out” all the other frequencies. Our model, however, is not distracted by the siren, even at the input level. This is visible from the spectrogram in Figure 3a. The low frequencies of human speech are still visible, despite the strong presence of frequencies of 1 kHz or higher. Unless a human has a real-time spectrum analyzer, they will likely miss the words.
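One way to overlay a siren on speech at a controlled intensity is to scale the noise to a target signal-to-noise ratio before adding it. The sketch below illustrates that idea only; the mixing procedure, the 16 kHz sampling rate, and the synthetic 200 Hz "speech" tone and 1 kHz to 3 kHz sweep are all assumptions and do not describe how the experiment in the paper was run.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that the speech-to-noise power ratio equals snr_db."""
    noise = np.resize(noise, speech.shape)        # loop or trim noise to match length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)                 # assumed non-zero
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise


sr = 16000                                        # assumed sampling rate in Hz
t = np.arange(sr) / sr                            # one second of audio
speech = np.sin(2 * np.pi * 200 * t)              # stand-in for voiced speech
sweep_hz = 1000 + 2000 * t                        # rising sweep from 1 kHz to 3 kHz
siren = np.sin(2 * np.pi * np.cumsum(sweep_hz) / sr)

mixed = mix_at_snr(speech, siren, snr_db=0.0)     # siren as loud as the speech
print(mixed.shape)                                # (16000,)
```

Lower values of `snr_db` correspond to a louder siren, so sweeping this parameter gives a simple way to evaluate a recognizer at several noise intensities.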

Figure 3: Effect of an ambulance siren.
Figure 4: Effect of n-gram size
Figure 5: Left-right substitution rates

Left vs. Right

All too common are mistakes where surgery or treatment was performed on the wrong body part. In one case, surgeons amputated the wrong leg due to a documentation error (NYT, 1995). Given the importance of the left and right keywords, we perform an ablation study to better understand the rate at which left is mistaken with right and vice versa. We call these mistakes substitution errors. Figure 5 shows the substitution rate for left and right. For reference, the language model priors are and . Not listed in Figure 5, both the untrained human scribe and medically-trained scribe had 0% left-right substitution errors (i.e., they were perfect). While our method shows relatively low substitution rates, compared to humans, all ASR systems could use further improvement. Because left and right are interchangeable, it is unclear how much benefit a language model can provide.

5 Conclusion

While the method presented in this paper is able to automatically document ICD codes from far-field speech, it does have some limitations. Our method can only process single speaker scenarios. Without explicit speaker identity detection, the model may become confused as to which speakers spoke which words. However, there is promising work on speaker source-separation which can address this problem (Weng et al., 2015). Additionally, our analysis was done using English. A tonal language such as Mandarin Chinese may be more prone to environmental noises. Recent work on style tokens may prove beneficial (Wang et al., 2018; Haque et al., 2018).

In this work, we presented a method to automatically document ICD codes with far-field speech recognition. Our method combines acoustic signal processing techniques with deep learning-based approaches to recognize ICD codes in real-time. We evaluated our model on a newly collected dataset of ICD-10 codes used in emergency departments and showed the viability of our model, even under noisy environmental conditions. The emphasis on far-field speech recognition capabilities was done to minimize disruptions to existing hospital workflows and maximize clinical impact. Overall, this work shows the potential of modern automatic speech recognition to provide efficient, accurate, and cost-effective healthcare documentation.

References


  • AAPC (2014) AAPC. ICD-10 top 50 codes in emergency departments, 2014.
  • Aarts & Wichert (2009) Emile Aarts and Reiner Wichert. Ambient intelligence. In Technology Guide. Springer, 2009.
  • Abadi et al. (2016) Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv, 2016.
  • Al-Aynati & Chorneyko (2003) Maamoun M Al-Aynati and Katherine A Chorneyko. Comparison of voice-automated transcription and human transcription in generating pathology reports. Archives of Pathology & Laboratory Medicine, 2003.
  • Ammenwerth & Spötl (2009) E Ammenwerth and H-P Spötl. The time needed for clinical documentation versus direct patient care. Methods of Information in Medicine, 2009.
  • Amodei et al. (2016) Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang Cheng, Guoliang Chen, et al. Deep Speech 2: End-to-end speech recognition in English and Mandarin. In International Conference on Machine Learning, 2016.
  • Arndt et al. (2017) Brian G Arndt, John W Beasley, Michelle D Watkinson, Jonathan L Temte, Wen-Jan Tuan, Christine A Sinsky, and Valerie J Gilchrist. Tethered to the EHR: primary care physician workload assessment using EHR event log data and time-motion observations. The Annals of Family Medicine, 2017.
  • Battenberg et al. (2017) Eric Battenberg, Jitong Chen, Rewon Child, Adam Coates, Yashesh Gaur, Yi Li, Hairong Liu, Sanjeev Satheesh, David Seetapun, Anuroop Sriram, and Zhenyao Zhu. Exploring neural transducers for end-to-end speech recognition. Automatic Speech Recognition and Understanding Workshop, 2017.
  • Bengio et al. (2015) Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Neural Information Processing Systems, 2015.
  • Bengio et al. (1994) Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. Trans. on Neural Networks, 1994.
  • Bhan et al. (2008) Sasha N Bhan, Craig L Coblentz, Sammy H Ali, et al. Effect of voice recognition on radiologist reporting time. Canadian Association of Radiologists Journal, 2008.
  • Borowitz (2001) Stephen M Borowitz. Computer-based speech recognition as an alternative to medical transcription. Journal of the American Medical Informatics Association, 2001.
  • CDC (2014) CDC. National hospital ambulatory medical care survey: 2014 emergency department summary tables, 2014.
  • CDC (2016) CDC. Deaths and mortality, 2016.
  • Chan et al. (2016) William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In International Conference on Acoustics, Speech, and Signal Processing, 2016.
  • Cheng et al. (2009) Ping Cheng, Annette Gilchrist, Kerin M Robinson, and Lindsay Paul. The risk and consequences of clinical miscoding due to inadequate medical documentation: a case study of the impact on health services funding. Health Information Management Journal, 2009.
  • Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv, 2014.
  • Chorowski & Jaitly (2016) Jan Chorowski and Navdeep Jaitly. Towards better decoding and language model integration in sequence to sequence models. arXiv, 2016.
  • Chorowski et al. (2015) Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. Attention-based models for speech recognition. In Neural Information Processing Systems, 2015.
  • Clagett (1941) A Henry Clagett. Pronunciation of medical terms. Journal of the American Medical Association, 1941.
  • Clark et al. (2017) Ronald Clark, Sen Wang, Hongkai Wen, Andrew Markham, and Niki Trigoni. VINet: Visual-inertial odometry as a sequence-to-sequence learning problem. In Association for the Advancement of Artificial Intelligence, 2017.
  • CMS (2016) CMS. ICD-10-CM official guidelines for coding and reporting, 2016.
  • Cruz et al. (2014) Jonathan E Cruz, John C Shabosky, Matthew Albrecht, Ted R Clark, Joseph C Milbrandt, Steven J Markwell, and Jason A Kegg. Typed versus voice recognition for data entry in electronic health records: emergency physician time use and interruptions. Western Journal of Emergency Medicine, 2014.
  • Devine et al. (2000) Eric G Devine, Stephan A Gaehde, and Arthur C Curtis. Comparative evaluation of three continuous speech recognition software packages in the generation of medical reports. Journal of the American Medical Informatics Association, 2000.
  • Donahue et al. (2015) Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Computer Vision and Pattern Recognition, 2015.
  • Dutilleux (1990) Pierre Dutilleux. An implementation of the “algorithme à trous” to compute the wavelet transform. In Wavelets. Springer, 1990.
  • Farkas & Szarvas (2008) Richárd Farkas and György Szarvas. Automatic construction of rule-based ICD-9-CM coding systems. In BMC Bioinformatics, 2008.
  • Gelfand (2009) Stanley A Gelfand. Essentials of Audiology. Thieme, 2009.
  • Gellert et al. (2015) George A Gellert, Ricardo Ramirez, and S Luke Webster. The rise of the medical scribe industry: implications for the advancement of electronic health records. Journal of the American Medical Association, 2015.
  • Graves (2012) Alex Graves. Supervised sequence labelling with recurrent neural networks, 2012.
  • Graves & Jaitly (2014) Alex Graves and Navdeep Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In International Conference on Machine Learning, 2014.
  • Graves et al. (2006) Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In International Conference on Machine Learning, 2006.
  • Graves et al. (2013a) Alex Graves, Navdeep Jaitly, and Abdel-rahman Mohamed. Hybrid speech recognition with deep bidirectional LSTM. In Automatic Speech Recognition and Understanding Workshop, pp. 273–278, 2013a.
  • Graves et al. (2013b) Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In International Conference on Acoustics, Speech, and Signal Processing, 2013b.
  • Gröschel et al. (2004) J Gröschel, F Philipp, St Skonetzki, H Genzwürker, Th Wetter, and K Ellinger. Automated speech recognition for time recording in out-of-hospital emergency medicine—an experimental approach. Resuscitation, 2004.
  • Hannun (2017) Awni Hannun. Sequence modeling with CTC. Distill, 2017.
  • Hannun et al. (2014) Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, et al. Deep speech: Scaling up end-to-end speech recognition. arXiv, 2014.
  • Haque et al. (2018) Albert Haque, Michelle Guo, and Prateek Verma. Conditional end-to-end audio transforms. arXiv, 2018.
  • Hart & Staveland (1988) Sandra G Hart and Lowell E Staveland. Development of NASA-TLX (task load index): Results of empirical and theoretical research. In Advances in Psychology, volume 52, pp. 139–183. Elsevier, 1988.
  • Hartel et al. (2011) Maximilian J Hartel, Lukas P Staub, Christoph Röder, and Stefan Eggli. High incidence of medication documentation errors in a Swiss university hospital due to the handwritten prescription process. BMC Health Services Research, 2011.
  • Haykin et al. (2001) Simon S Haykin et al. Kalman Filtering and Neural Networks. Wiley Online Library, 2001.
  • He et al. (2011) Xiaodong He, Li Deng, and Alex Acero. Why word error rate is not a good metric for speech recognizer training for the speech translation task? In International Conference on Acoustics, Speech, and Signal Processing, 2011.
  • Henderson et al. (2006) Toni Henderson, Jennie Shepheard, and Vijaya Sundararajan. Quality of diagnosis and procedure coding in ICD-10 administrative data. Medical Care, 2006.
  • Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv, 2015.
  • Hochreiter & Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 1997.
  • Hooper et al. (2010) Crystal Hooper, Janet Craig, David R Janvrin, Margaret A Wetsel, and Elaine Reimels. Compassion satisfaction, burnout, and compassion fatigue among emergency nurses compared with nurses in other selected inpatient specialties. Journal of Emergency Nursing, 2010.
  • Huang et al. (2018) Jinmiao Huang, Cesar Osorio, and Luke Wicent Sy. An empirical evaluation of deep learning for ICD-9 code assignment using MIMIC-III clinical notes. arXiv, 2018.
  • Ioffe & Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv, 2015.
  • James (2013) John T James. A new, evidence-based estimate of patient harms associated with hospital care. Journal of Patient Safety, 2013.
  • Kannan et al. (2017) Anjuli Kannan, Yonghui Wu, Patrick Nguyen, Tara N Sainath, Zhifeng Chen, and Rohit Prabhavalkar. An analysis of incorporating an external language model into a sequence-to-sequence model. arXiv, 2017.
  • Kingma & Ba (2015) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2015.
  • Kinoshita et al. (2013) Keisuke Kinoshita, Marc Delcroix, Takuya Yoshioka, Tomohiro Nakatani, Armin Sehr, Walter Kellermann, and Roland Maas. The reverb challenge: A common evaluation framework for dereverberation and recognition of reverberant speech. In Applications of Signal Processing to Audio and Acoustics, 2013.
  • Koopman et al. (2015) Bevan Koopman, Guido Zuccon, Anthony Nguyen, Anton Bergheim, and Narelle Grayson. Automatic ICD-10 classification of cancers from free-text death certificates. International Journal of Medical Informatics, 2015.
  • Kopetz (2011) Hermann Kopetz. Internet of things. In Real-time Systems. Springer, 2011.
  • Koutnik et al. (2014) Jan Koutnik, Klaus Greff, Faustino Gomez, and Juergen Schmidhuber. A clockwork RNN. arXiv, 2014.
  • Lang (2007) Dee Lang. Natural language processing in the health care industry. Cincinnati Children’s Hospital Medical Center, Winter, 2007.
  • Larkey & Croft (1995) Leah S Larkey and W Bruce Croft. Automatic assignment of ICD9 codes to discharge summaries. Technical report, University of Massachusetts at Amherst, 1995.
  • Li et al. (2017) Bo Li, Tara Sainath, Arun Narayanan, Joe Caroselli, Michiel Bacchiani, Ananya Misra, Izhak Shafran, Hasim Sak, Golan Pundak, Kean Chin, et al. Acoustic modeling for Google Home. Interspeech, 2017.
  • Li et al. (2018) Min Li, Zhihui Fei, Min Zeng, Fangxiang Wu, Yaohang Li, Yi Pan, and Jianxin Wang. Automated ICD-9 coding via a deep learning approach. Transactions on Computational Biology and Bioinformatics, 2018.
  • Luong et al. (2015) Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention-based neural machine translation. Empirical Methods in Natural Language Processing, 2015.
  • MacIntyre et al. (1997) C Raina MacIntyre, Michael J Ackland, Eugene J Chandraraj, and John E Pilla. Accuracy of ICD–9–CM codes in hospital morbidity data, Victoria: implications for public health research. Australian and New Zealand Journal of Public Health, 1997.
  • Mandel (1992) Mark A Mandel. A commercial large-vocabulary discrete speech recognition system: Dragondictate. Language and Speech, 1992.
  • Mascharka et al. (2018) D. Mascharka, P. Tran, R. Soklaski, and A. Majumdar. Transparency by design: Closing the gap between performance and interpretability in visual reasoning. arXiv, 2018.
  • Mulloy & Hughes (2008) Deborah F Mulloy and Ronda G Hughes. Wrong-site surgery: a preventable medical error. Patient Safety and Quality; An Evidence-Based Handbook for Nurses, 2008.
  • NHS (2017) NHS. National clinical coding standards icd-10, 2017.
  • NYT (1995) NYT. Doctor who cut off wrong leg is defended by colleagues. The New York Times, 1995.
  • O’Malley et al. (2005) Kimberly J O’Malley, Karon F Cook, Matt D Price, Kimberly Raiford Wildes, John F Hurdle, and Carol M Ashton. Measuring diagnoses: ICD code accuracy. Health Services Research, 2005.
  • Oord et al. (2016) Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv, 2016.
  • Sainath et al. (2015) Tara N Sainath, Ron J Weiss, Andrew Senior, Kevin W Wilson, and Oriol Vinyals. Learning the speech front-end with raw waveform CLDNNs. In Interspeech, 2015.
  • Schiff et al. (2009) Gordon D Schiff, Omar Hasan, Seijeoung Kim, Richard Abrams, Karen Cosby, Bruce L Lambert, Arthur S Elstein, Scott Hasler, Martin L Kabongo, Nela Krosnjar, et al. Diagnostic error in medicine: analysis of 583 physician-reported errors. Archives of Internal Medicine, 2009.
  • Shen et al. (2018) J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis, and Y. Wu. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In Interspeech, 2018.
  • Shi et al. (2017) Haoran Shi, Pengtao Xie, Zhiting Hu, Ming Zhang, and Eric P Xing. Towards automated ICD coding using deep learning. arXiv, 2017.
  • Skinner (1961) Henry Alan Skinner. The Origin of Medical Terms. Williams & Wilkins, 1961.
  • Sriram et al. (2017) Anuroop Sriram, Heewoo Jun, Sanjeev Satheesh, and Adam Coates. Cold fusion: Training seq2seq models together with language models. arXiv, 2017.
  • Subotin & Davis (2016) Michael Subotin and Anthony R Davis. A method for modeling co-occurrence propensity of clinical codes with application to ICD-10-PCS auto-coding. Journal of the American Medical Informatics Association, 2016.
  • Sun et al. (2013) Benjamin C Sun, Renee Y Hsia, Robert E Weiss, David Zingmond, Li-Jung Liang, Weijuan Han, Heather McCreath, and Steven M Asch. Effect of emergency department crowding on outcomes of admitted patients. Annals of Emergency Medicine, 61(6):605–611, 2013.
  • Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Neural Information Processing Systems, 2014.
  • Tritschler & Gopinath (1999) Alain Tritschler and Ramesh A Gopinath. Improved speaker segmentation and segments clustering using the Bayesian information criterion. In European Conference on Speech Communication and Technology, 1999.
  • Tutubalina & Miftahutdinov (2017) Elena Tutubalina and Zulfat Miftahutdinov. An encoder-decoder model for ICD-10 coding of death certificates. arXiv, 2017.
  • US GAO (2009) US GAO. Hospital emergency departments: Crowding continues to occur, and some patients wait longer than recommended time frames. Government Accountability Office, GAO-09-347, 2009.
  • Vinyals et al. (2015) Oriol Vinyals, Łukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. Grammar as a foreign language. In Neural Information Processing Systems, 2015.
  • Wang et al. (2017) Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, et al. Tacotron: Towards end-to-end speech synthesis. arXiv, 2017.
  • Wang et al. (2018) Yuxuan Wang, Daisy Stanton, Yu Zhang, RJ Skerry-Ryan, Eric Battenberg, Joel Shor, Ying Xiao, Fei Ren, Ye Jia, and Rif A Saurous. Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis. arXiv, 2018.
  • Weng et al. (2015) Chao Weng, Dong Yu, Michael L Seltzer, and Jasha Droppo. Deep neural networks for single-channel multi-talker speech recognition. Transactions on Audio, Speech and Language Processing, 2015.
  • Wu et al. (2016) Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv, 2016.
  • Xu et al. (2015) Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, 2015.
  • Yu & Koltun (2015) Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. arXiv, 2015.
  • Zafar et al. (1999) Atif Zafar, J Marc Overhage, and Clement J McDonald. Continuous speech recognition for clinicians. Journal of the American Medical Informatics Association, 1999.
  • Zafar et al. (2004) Atif Zafar, Burke Mamlin, Susan Perkins, Anne M Belsito, J Marc Overhage, and Clement J McDonald. A simple error classification system for understanding sources of error in automatic speech recognition and human transcription. International Journal of Medical Informatics, 2004.
  • Zick & Olsen (2001) Robert G Zick and Jon Olsen. Voice recognition software versus a traditional transcription service for physician charting in the ED. The American Journal of Emergency Medicine, 2001.


Appendix A NASA Task Load Index

Data collection and annotation are known to impose a high cognitive burden, especially over long periods of time. In our data collection campaign, each participant spoke continuously for approximately 30 minutes. After finishing the task, each participant completed a NASA Task Load Index (TLX) questionnaire (Hart & Staveland, 1988). The results are summarized in Figure 6.

Figure 6: NASA Task Load Index results. Higher values denote higher perceived requirements. We asked each participant to complete the NASA-TLX questionnaire immediately after speaking all the required vocabulary words.

Overall, the non-native English speakers reported feeling more rushed and experiencing more cognitive overhead than the native English speakers. This is to be expected because we required each participant to read, comprehend, and speak each word in 2-3 seconds. Any effort to scale up the data collection pipeline should account for these cognitive demands.

Appendix B Implementation Details

B.1 Audio Hardware & Software

The microphone was a Yeti Pro condenser microphone with a pop filter, connected to a Behringer U-Phoria UMC204 HD pre-amplifier, which was then connected to a computer. The microphone was placed on a microphone stand such that it was the same height as the speaker’s head. Audio was recorded using Audacity at 192 kHz with 24-bit depth and saved as lossless .flac audio files. For all experiments, the audio was downsampled to 44.1 kHz.

B.2 Audio Processing

All experiments used audio with a sampling rate of 44.1 kHz and pre-emphasis of 0.97. Spectrograms were computed with a short-time Fourier transform (STFT) with Hann windowing, 50 ms window size, 12.5 ms hop size, and a 2205-point Fourier transform. Only the real-valued magnitude component was used. Mel-spectrograms were computed using 80 mel filters spanning 125 Hz to 7.6 kHz. Log dynamic range compression was not used.
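As a rough illustration of these settings, the sketch below applies pre-emphasis and converts the window and hop durations into sample counts. It is not the authors' code: the function names are hypothetical, the symmetric form of the Hann window is an assumption, and so is rounding the 12.5 ms hop (551.25 samples at 44.1 kHz) to the nearest integer.

```python
import math

SAMPLE_RATE = 44_100   # Hz, from the text
PRE_EMPHASIS = 0.97    # from the text
WINDOW_MS = 50.0       # from the text
HOP_MS = 12.5          # from the text


def pre_emphasize(signal, coeff=PRE_EMPHASIS):
    """Apply y[n] = x[n] - coeff * x[n-1]; the first sample passes through."""
    if not signal:
        return []
    return [signal[0]] + [
        signal[n] - coeff * signal[n - 1] for n in range(1, len(signal))
    ]


def hann_window(length):
    """Symmetric Hann window (assumption: the text does not say which variant)."""
    if length == 1:
        return [1.0]
    return [
        0.5 - 0.5 * math.cos(2.0 * math.pi * n / (length - 1))
        for n in range(length)
    ]


def frame_params(sample_rate=SAMPLE_RATE, window_ms=WINDOW_MS, hop_ms=HOP_MS):
    """Convert durations to sample counts (rounding is an assumption)."""
    window = round(sample_rate * window_ms / 1000.0)
    hop = round(sample_rate * hop_ms / 1000.0)
    return window, hop


def num_frames(num_samples, window, hop):
    """Number of full frames that fit when no padding is applied."""
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // hop
```

With these values the 50 ms window is exactly 2205 samples, which matches the 2205-point Fourier transform stated above, so each frame is transformed without zero-padding.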

B.3 Network Architecture

All network modules used the ReLU activation function. The encoder’s convolutional network was a simple 3-layer CNN. Each layer had 32 kernels of size . Neither batch normalization (Ioffe & Szegedy, 2015) nor dropout was used. The CNN weights were randomly initialized from a Gaussian distribution with mean 0 and standard deviation 0.1. The encoder RNN consisted of 3 LSTM layers. Each LSTM had 256 hidden units and was stacked in a pyramidal manner as outlined in Section 3. The encoder RNN initial states were all zeros. The decoder consisted of an RNN, also with 3 LSTM layers with 256 hidden units each. The decoder RNN initial states were the encoder’s output states. The decoder RNN used “teacher-forcing” (Bengio et al., 2015) to randomly use the ground truth sequence as input into the decoder. The ground truth was used 80% of the time.
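The randomized teacher-forcing rule can be sketched as a per-step choice between the ground-truth token and the model's own previous prediction. This is a minimal illustration only: the function name and the `rng` argument are hypothetical, and whether the choice was made per step or per sequence is an assumption not stated in the text.

```python
import random


def next_decoder_input(true_prev, predicted_prev, teacher_prob=0.8, rng=random):
    """Pick the decoder's next input token.

    With probability `teacher_prob` (0.8 in the text) the ground-truth
    previous token is fed; otherwise the model's own previous prediction is.
    Making the draw once per step is an assumption for this sketch.
    """
    return true_prev if rng.random() < teacher_prob else predicted_prev


# Usage: a seeded generator makes the draws reproducible.
rng = random.Random(0)
choices = [next_decoder_input("truth", "model", rng=rng) for _ in range(10)]
```

Passing a seeded `random.Random` instance keeps training runs repeatable, which helps when comparing different sampling probabilities.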

B.4 Optimization

The model was implemented in TensorFlow (Abadi et al., 2016) and optimized with Adam (Kingma & Ba, 2015) with and . The mean squared error was used as the loss objective. An initial learning rate of was used. All models were trained for 20 epochs with a batch size of 16. On a single Nvidia V100 GPU, each model converged in approximately 3 hours. The language model sampling probability was fixed at 0.2 throughout training. A beam width of 10 was used. The acoustic model was trained with regular backpropagation (i.e., not truncated backpropagation through time). The resulting input spectrogram from the audio processing section above was padded to the max spectrogram length in the dataset. An internal loss mask prevented inference and backpropagation to/from these padded values.
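The padding and loss mask described above can be illustrated with a small, framework-free sketch. It is not the authors' TensorFlow code: the helper names are hypothetical, and plain lists of scalars stand in for spectrogram frames. Padded positions receive a mask value of 0, so they contribute nothing to the mean squared error.

```python
def pad_sequences(sequences, pad_value=0.0):
    """Right-pad each sequence to the longest length; return (padded, mask)."""
    max_len = max(len(seq) for seq in sequences)
    padded, mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(list(seq) + [pad_value] * n_pad)
        # 1.0 marks a real value, 0.0 marks padding.
        mask.append([1.0] * len(seq) + [0.0] * n_pad)
    return padded, mask


def masked_mse(predictions, targets, mask):
    """Mean squared error computed over unmasked (real) positions only."""
    total, count = 0.0, 0.0
    for pred_row, tgt_row, mask_row in zip(predictions, targets, mask):
        for p, t, m in zip(pred_row, tgt_row, mask_row):
            total += m * (p - t) ** 2
            count += m
    return total / count if count else 0.0


# Usage: the second sequence is shorter, so its last two positions are ignored.
targets, mask = pad_sequences([[1.0, 2.0, 3.0], [4.0]])
loss = masked_mse([[1.0, 2.0, 5.0], [4.0, 9.0, 9.0]], targets, mask)
```

Because the masked terms are multiplied by zero before summation, their gradients are zero as well, which is the effect the loss mask is meant to achieve.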

Appendix C Human Baselines

The experiments in Table 1 and the ambulance siren experiment in Section 4.2 both include human transcription baselines. In this section, we provide more details about these experiments.

C.1 Medically-Trained and Untrained Human Transcription

A total of four participants not involved in data collection were invited to participate in the transcription experiment. Two were medically trained (i.e., enrolled in or having completed formal medical training/certification) and two were untrained. We refer to these people as transcriptionists.

For the experiments in Table 1, to maintain fair comparison with the audio given to the machine learning algorithms, we replayed the concatenated ICD audio to each transcriptionist using headphones. Because we evaluated the algorithms using cross-validation, we could not ask the participants to transcribe the full dataset which was used to evaluate the algorithms. As a compromise, we asked each participant to transcribe 100 full ICD codes from each speaker, for a total of 600 ICD codes. Half of the ICD codes were spoken by native English speakers and the other half were from non-native speakers.

Each ICD code was semantically and acoustically unique and was selected randomly from the test set. As the ICD codes were played, the transcriptionist was asked to type what they heard onto a laptop. If the transcriptionist asked to repeat an ICD code, their request was denied. The average typing speed for a secretary is 75 words per minute, which translates to 4,500 words per hour. In our experiments, the process took approximately 1 hour per transcriptionist. After the human transcription process was complete, WER and BLEU were computed as normal. The values are reported in Table 1.
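For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance between the reference and the transcription, divided by the number of reference words. The sketch below follows that standard definition; the exact tokenization and tooling used in the paper are not stated, so whitespace splitting, the function names, and the example phrases are all assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


# Usage with a hypothetical phrase: one of three reference words is dropped.
wer = word_error_rate("acute chest pain", "acute pain")
```

Note that WER can exceed 1.0 when the transcription contains many more words than the reference, since insertions count as errors.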

Note that this paper’s focus on far-field speech recognition refers to distance between the source and microphone. Once the audio has been recorded from far-field, the human transcriptionist should have access to the highest fidelity version of the recording, just as the algorithms do.

C.2 Ambulance Siren Experiments

In Section 4.2, we perform an experiment in which ambulance sirens are played over the far-field ICD speech. For this experiment, we used two transcriptionists from Section C.1. One siren type was played to each transcriptionist. A total of 100 ICD codes were played, all of which had sirens. The transcriptionist was asked to type what they heard and understood (if anything). Audio clips were not played twice. After the transcription process was complete, WER and BLEU were computed as normal and are reported in Section 4.2.
