Deep Context: End-to-End Contextual Speech Recognition
In automatic speech recognition (ASR) what a user says depends on the particular context she is in. Typically, this context is represented as a set of word n-grams. In this work, we present a novel, all-neural, end-to-end (E2E) ASR system that utilizes such context. Our approach, which we refer to as Contextual Listen, Attend and Spell (CLAS), jointly optimizes the ASR components along with embeddings of the context n-grams. During inference, the CLAS system can be presented with context phrases which might contain out-of-vocabulary (OOV) terms not seen during training. We compare our proposed system to a more traditional contextualization approach, which performs shallow-fusion between independently trained LAS and contextual n-gram models during beam search. Across a number of tasks, we find that the proposed CLAS system outperforms the baseline method by as much as 68% relative WER, indicating the advantage of joint optimization over individually trained components.
Golan Pundak, Tara N. Sainath, Rohit Prabhavalkar, Anjuli Kannan, Ding Zhao
Google Inc., USA
Index Terms: speech recognition, sequence-to-sequence models, listen attend and spell, LAS, attention, embedded speech recognition.
As speech technologies become increasingly pervasive, speech is emerging as one of the main input modalities on mobile devices and in intelligent personal assistants. In such applications, speech recognition performance can be improved significantly by incorporating information about the speaker’s context into the recognition process. Examples of such context include the dialog state (e.g., we might want “stop” or “cancel” to be more likely when an alarm is ringing), the speaker’s location (which might make nearby restaurants or locations more likely), as well as personalized information about the user such as her contacts or song playlists.
There has been growing interest recently in building sequence-to-sequence models for automatic speech recognition (ASR), which directly output words, word-pieces, or graphemes given an input speech utterance. Such models implicitly subsume the components of a traditional ASR system - the acoustic model (AM), the pronunciation model (PM), and the language model (LM) - into a single neural network which is jointly trained to optimize log-likelihood or task-specific objectives such as the expected word error rate (WER). Representative examples of this approach include connectionist temporal classification (CTC) with word output targets, the recurrent neural network transducer (RNN-T) [9, 10], and the “Listen, Attend, and Spell” (LAS) encoder-decoder architecture [11, 12]. In recent work, we have shown that such approaches can outperform a state-of-the-art conventional ASR system when trained on hours of transcribed speech utterances.
In the present work, we consider techniques for incorporating contextual information dynamically into the recognition process. In traditional ASR systems, one of the dominant paradigms for incorporating such information involves the use of an independently-trained on-the-fly (OTF) rescoring framework which dynamically adjusts the LM weights of a small number of n-grams relevant to the particular recognition context. Extending such techniques to sequence-to-sequence models is important for improving system performance, and is an active area of research. In this context, previous works have examined the inclusion of a separate LM component into the recognition process through either shallow fusion or cold fusion, which can bias the recognition process towards a task-specific LM. A shallow fusion approach was also used directly to contextualize LAS, where output probabilities were modified using a special weighted finite state transducer (WFST) constructed from the speaker’s context; this was shown to be effective in improving performance.
The use of an external independently-trained LM for OTF rescoring, as in previous approaches, goes against the benefits derived from the joint optimization of the components of a sequence-to-sequence model. Therefore, in this work, we propose Contextual-LAS (CLAS), a novel, all-neural mechanism which can leverage contextual information - provided as a list of contextual phrases - to improve recognition performance. Our technique consists of first embedding each phrase, represented as a sequence of graphemes, into a fixed-dimensional representation, and then employing an attention mechanism to summarize the available context at each step of the model’s output predictions. Our approach can be considered a generalization of a technique proposed in the context of streaming keyword spotting, in that it allows for a variable number of contextual phrases during inference. The proposed method does not require that the particular context information be available at training time, and crucially, unlike previous works [16, 2], the method does not require careful tuning of rescoring weights, while still being able to incorporate out-of-vocabulary (OOV) terms. In experimental evaluations, we find that CLAS - which trains the contextualization components jointly with the rest of the model - significantly outperforms online rescoring techniques when handling hundreds of context phrases, and is comparable to these techniques when handling thousands of phrases.
The organization of the rest of this paper is as follows. In Section 2.1 we describe the standard LAS model, and the standard contextualization approach in Section 2.2. We present the proposed modifications to the LAS model in order to obtain the CLAS model in Section 3. We describe our experimental setup and discuss results in Sections 4 and 5, respectively, before concluding in Section 6.
2.1 The LAS model
We now briefly describe the LAS model. For more details see [11, 13]. The LAS model outputs a probability distribution over sequences of output labels, y = (y_1, …, y_T) (graphemes, in this work), conditioned on a sequence of input audio frames, x = (x_1, …, x_K) (log-mel features, in this work): P(y | x).
The model consists of three modules: an encoder, decoder and attention network, which are trained jointly to predict a sequence of graphemes from a sequence of acoustic feature frames (Figure 1a).
The encoder is comprised of a stacked recurrent neural network (RNN) [19, 20] (unidirectional, in this work) that reads acoustic features, x, and outputs a sequence of high-level features (hidden states), h = (h_1, …, h_K) = Encoder(x). The encoder is similar to the acoustic model in an ASR system.
The decoder is a stacked unidirectional RNN that computes the probability of a sequence of output tokens (characters, in this work) as follows:

P(y | x) = ∏_t P(y_t | h, y_{<t})
The conditional dependence on the encoder state vectors, h, is modeled using a context vector c^x_t, which is computed using multi-head attention [21, 13] as a function of the current decoder hidden state, d_t, and the full encoder state sequence, h.
The hidden state of the decoder, d_t, which captures the previous character context y_{<t}, is given by:

d_t = RNN(d_{t-1}, [e_{y_{t-1}}; c^x_{t-1}])

where d_{t-1} is the previous hidden state of the decoder, and e_{y_{t-1}} is an embedding vector for y_{t-1}. The posterior distribution of the output at time step t is given by:

P(y_t | h, y_{<t}) = softmax(W_s [c^x_t; d_t] + b_s)

where W_s and b_s are learnable parameters, and [c^x_t; d_t] represents the concatenation of the two vectors. The model is trained to minimize the discriminative loss:

L_LAS = − ∑_t log P(y_t | x, y_{<t})
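As a concrete toy illustration of the output step, the following sketch applies a projection to the concatenated context and decoder-state vectors and normalizes with a softmax. The matrices W_s and b_s here are hypothetical stand-ins for the trained parameters, not values from the paper:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def output_distribution(c_t, d_t, W_s, b_s):
    """Project the concatenation [c_t; d_t] through a linear layer and
    normalize, mirroring the posterior equation above."""
    feat = c_t + d_t  # list concatenation acts as vector concatenation here
    logits = [sum(w * f for w, f in zip(row, feat)) + b
              for row, b in zip(W_s, b_s)]
    return softmax(logits)
```

With a 2-dimensional context vector, a 1-dimensional decoder state, and a vocabulary of two symbols, `output_distribution([1.0, 0.0], [0.5], [[0.1, 0.2, 0.3], [0.0, -0.1, 0.2]], [0.0, 0.1])` returns a valid probability distribution over the two symbols.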
2.2 On-the-fly Rescoring
On-the-fly rescoring (similar to previous work) serves as one of our baseline approaches. Specifically, we assume that a set of word-level biasing phrases are known ahead of time, and compile them into a weighted finite state transducer (WFST). This word-level WFST, G, is then left-composed with a “speller” FST, S, which transduces a sequence of graphemes/word-pieces into the corresponding word. Following the procedure used for a general language model, we obtain the contextual LM, C = min(det(S ∘ G)). The scores from the contextualized LM, P_C(y), can then be incorporated into the decoding criterion by augmenting the standard log-likelihood term with a scaled contribution from the contextualized LM:

s(y | x) = log P(y | x) + λ log P_C(y)
where λ is a tunable hyperparameter controlling how much the contextual LM influences the overall model score during beam search.
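The decoding criterion itself is a one-line interpolation. The sketch below, with hypothetical hypothesis scores and a hypothetical λ = 0.3, shows how a contextually plausible hypothesis can overtake one that the seq2seq model alone prefers slightly:

```python
def fused_score(log_p_seq2seq, log_p_context, lam):
    """Shallow-fusion criterion: seq2seq log-likelihood plus the
    contextual LM log-score scaled by the tunable weight lam."""
    return log_p_seq2seq + lam * log_p_context

# Hypothetical beam hypotheses as (log P(y|x), log P_C(y)) pairs.
hyps = {"call demetri": (-3.2, -0.5), "call dimitri": (-3.0, -6.0)}
lam = 0.3  # illustrative value; in practice tuned per task
best = max(hyps, key=lambda h: fused_score(*hyps[h], lam))
```

Here the contextual LM favors the contact name "demetri" strongly enough that the fused score flips the ranking, even though the seq2seq model alone scores "dimitri" higher.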
Note that in prior work, no weight pushing was applied. Consequently, the overall score in Equation 5 is only applied at word boundaries, as shown in Figure 2(a). Thus, this technique cannot improve performance if the relevant word does not first appear on the beam. Furthermore, we observe that while this approach works reasonably well when the number of contextual phrases is small (e.g., yes, no, cancel), it does not work well when the list of contextual phrases contains many proper nouns (e.g., song names or contacts). If weight pushing is used, the score will only be applied to the first subword unit of each word, as shown in Figure 2b, which might cause over-biasing problems as we might artificially boost words early on. Therefore, we explore pushing weights to each subword unit of the word, illustrated in Figure 2c. To avoid artificially giving weight to prefixes which are boosted early on but do not match the entire phrase, we also include a subtractive cost, as indicated by the negative weights in the figure. We compare all three approaches in the results section.
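To make the per-subword strategy of Figure 2c concrete, here is a sketch (not the paper's implementation) in which a phrase's biasing bonus is spread evenly over its subword units, and each partial prefix carries a subtractive failure cost that returns whatever bonus has been collected if the match is abandoned:

```python
def subword_bias_arcs(subwords, total_bonus):
    """Distribute a phrase's biasing bonus evenly over its subword
    units, pairing each partial prefix with a subtractive failure cost
    that cancels the bonus accumulated so far. Weights are illustrative
    log-domain bonuses, not values used in the paper."""
    per_unit = total_bonus / len(subwords)
    arcs = []
    for i, sw in enumerate(subwords, start=1):
        arcs.append({
            "unit": sw,
            "bonus": per_unit,              # reward for consuming this unit
            "failure_cost": -per_unit * i,  # refund if match is abandoned here
        })
    return arcs
```

For a two-unit phrase with a total bonus of 1.0, each unit earns 0.5, and abandoning the match after both units costs the full 1.0 back, so only hypotheses that complete the phrase keep the bonus.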
3 Contextual LAS (CLAS)
We will now introduce the Contextualized LAS (CLAS) model, which uses additional context through a list of provided bias phrases, z, thus effectively modeling P(y | x, z). The individual elements in z represent phrases such as personalized contact names, song lists, etc., which are relevant to the particular recognition context.
We now describe the modification made to the standard LAS model (Figure 1a) in order to obtain the CLAS model (Figure 1b). The main difference between the two models is the presence of an additional bias-encoder with a corresponding attention mechanism. These components are described below.
In order to contextualize the model, we assume that the model has access to a list of additional sequences of bias phrases, denoted as z = (z_1, …, z_N). The purpose of the bias phrases is to bias the model towards outputting particular phrases. However, not all bias phrases are necessarily relevant given the current utterance, and it is up to the model to determine which phrases (if any) might be relevant and to use these to modify the target distribution P(y | x, z).
We augment LAS with a bias-encoder which embeds the bias phrases into a set of vectors h^z = (h^z_0, h^z_1, …, h^z_N) (we use the superscript z to distinguish bias-attention variables from audio-related variables). h^z_i is an embedding of z_i if i > 0. Since the bias phrases may not be relevant for the current utterance, we include an additional learnable vector, h^z_0, that corresponds to the no-bias option of not using any of the bias phrases to produce the output. This option enables the model to back off to a “bias-less” decoding strategy when none of the bias phrases matches the audio, and allows the model to ignore the bias phrases altogether. The bias-encoder is a multilayer long short-term memory network (LSTM); the embedding, h^z_i, is obtained by feeding the bias-encoder with the sequence of embeddings of the subwords in z_i (i.e., the same grapheme or word-piece units used by the decoder) and using the last state of the LSTM as the embedding of the entire phrase z_i.
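A minimal sketch of the last-state phrase-embedding idea follows, with a plain tanh recurrence standing in for the multilayer LSTM; the weight matrices and the subword embedding table are hypothetical toy parameters:

```python
import math

def encode_phrase(subword_ids, embed, W, U):
    """Embed a bias phrase by running a recurrence over its subword
    embeddings and keeping the final hidden state. A single tanh RNN
    cell is used here as a stand-in for the multilayer LSTM."""
    dim = len(U)          # hidden dimension
    h = [0.0] * dim       # initial hidden state
    for sid in subword_ids:
        x = embed[sid]    # subword embedding lookup
        h = [math.tanh(sum(W[k][j] * x[j] for j in range(len(x))) +
                       sum(U[k][j] * h[j] for j in range(dim)))
             for k in range(dim)]
    return h              # last state = embedding of the whole phrase
```

With a 1-dimensional subword embedding and a 2-dimensional hidden state, the function maps any subword sequence to a fixed 2-dimensional vector, regardless of the phrase length, which is what lets a variable-length phrase list be summarized by attention.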
Attention is then computed over h^z, using an attention mechanism separate from the one used for the audio-encoder. A secondary context vector c^z_t is computed using the decoder state d_t. This context vector summarizes z at time step t:

u^z_{it} = v^{zT} tanh(W^z_h d_t + W^z_z h^z_i + b^z)
α^z_t = softmax(u^z_t)
c^z_t = ∑_{i=0}^{N} α^z_{ti} h^z_i
The LAS context vector, which feeds into the decoder, is then modified by setting c_t = [c^x_t; c^z_t], the concatenation of the context vectors obtained with respect to x and z. The other components of the CLAS model (i.e., decoder and audio-encoder) are identical to the corresponding components in the standard LAS model.
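The bias-attention summarization can be sketched numerically as follows. For simplicity the scores are plain dot products rather than the learned attention network, and index 0 plays the role of the no-bias embedding:

```python
import math

def _softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def bias_attention(d_t, z_embeddings):
    """Secondary attention over bias embeddings (index 0 = no-bias).
    Dot-product scoring is a simplification of the learned additive
    attention used in the paper."""
    scores = [sum(a * b for a, b in zip(d_t, z)) for z in z_embeddings]
    alpha = _softmax(scores)  # bias-attention probabilities
    dim = len(z_embeddings[0])
    c_b = [sum(alpha[i] * z_embeddings[i][k] for i in range(len(alpha)))
           for k in range(dim)]
    return alpha, c_b  # probabilities and the bias context vector
```

The returned `c_b` would then be concatenated with the audio context vector before being fed to the decoder, exactly as in the concatenation step above.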
It is worth noting that CLAS explicitly models the probability of seeing a particular bias phrase given the audio and previous outputs:

P(z_i | x, y_{<t}) = α^z_{ti}
We refer to α^z_t as the bias-attention probability; an example of it is presented in Figure 3.
The CLAS model is trained to minimize the loss:

L_CLAS = − log P(y | x, z)
where the bias list, z, is randomly generated at run time during training for each training batch. This is done to allow flexibility during inference, as the model does not make any assumption about which bias phrases will be used at that point. The training bias-phrase list is randomly created from the reference transcripts associated with the utterances in the training batch: the creation process takes the list of reference transcripts corresponding to the audio in a training batch, and randomly selects a list, z, of n-gram phrases that appear as substrings in some of the reference transcripts.
To exercise the ‘no-bias’ option, in which z does not match some of the utterances in the batch, we exclude each reference with probability 1 − P_keep from the creation process. When a reference is discarded we still keep the utterance in the batch, but do not extract any bias phrases from its transcript. If we set P_keep = 0, no bias phrases will be presented to the training batch, while P_keep = 1 means that each utterance in the batch will have at least one matching bias phrase.
Next, k word n-grams are randomly selected from each kept reference, where k is picked uniformly from [1, N_phrases] and n is picked uniformly from [1, N_order].
N_phrases and N_order are hyperparameters of the training process. For example, if we set N_phrases = 1 and N_order = 1, one unigram will be selected from each reference transcript. Other choices will be discussed in the experimental section.
Once a set z is (randomly) selected, we proceed by computing the intersection of z with each reference transcript. Every time a match is found, a special </bias> symbol is inserted after the match. For example, if the reference transcript is play a song and the matching bias phrase is play, the target sequence will be modified to play</bias> a song. The purpose of </bias> is to introduce a training error which can be corrected only by considering the correct bias phrase. In other words, to be able to predict </bias> the model has to attend to the correct bias phrase, thus ensuring that the bias-encoder will receive updates during training.
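The training-time bias-list construction described above can be sketched as follows; the helper names and exact sampling details are illustrative, not the paper's code:

```python
import random

def sample_bias_phrases(references, p_keep, n_phrases_range, n_order_range, rng):
    """Build a training bias list from a batch of reference transcripts:
    each reference is kept with probability p_keep, then k word n-grams
    are drawn from each kept reference (k and n drawn uniformly from the
    given inclusive ranges)."""
    phrases = []
    for ref in references:
        if rng.random() > p_keep:
            continue  # discarded reference: exercises the no-bias option
        words = ref.split()
        k = rng.randint(*n_phrases_range)
        for _ in range(k):
            n = min(rng.randint(*n_order_range), len(words))
            start = rng.randrange(len(words) - n + 1)
            phrases.append(" ".join(words[start:start + n]))
    return phrases

def insert_bias_markers(reference, phrases):
    """Append </bias> after every matched bias phrase in the target."""
    for p in sorted(phrases, key=len, reverse=True):
        reference = reference.replace(p, p + "</bias>")
    return reference
```

For the example in the text, `insert_bias_markers("play a song", ["play"])` produces `"play</bias> a song"`, the modified target the decoder must learn to emit.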
During inference, the user provides the system with a sequence of audio feature vectors, x, and a set of context sequences, z, possibly never seen in training. Using the bias-encoder, z is embedded into h^z. This embedding can take place before audio streaming begins. The audio frames, x, are then fed into the audio-encoder, and the decoder is run as in standard LAS to produce N-best hypotheses using beam-search decoding.
When thousands of phrases are presented to CLAS, retrieving a meaningful bias context vector c^z_t becomes challenging, since it is the weighted sum of many different bias embeddings, and might be far from any context vector seen in training. Bias-conditioning attempts to alleviate this problem. Here we assume that during inference the model is provided with both a list of bias phrases, z, as well as a list of bias prefixes, p = (p_1, …, p_N), one per bias phrase. With this technique a bias phrase z_i is “enabled” at step t only when its prefix p_i was detected in the partial hypothesis y_{<t} (the partially decoded hypothesis on the beam). In practice, we do this by updating the bias-attention probabilities in Equation 7 by setting:

u^z_{it} = −∞ if p_i ⊄ y_{<t}
where ⊂ denotes string inclusion. The list of bias prefixes can be constructed arbitrarily. For example, we might want to condition the bias phrase the cat sat on the bias prefix the cat. In this case we will compute an embedding for the cat sat but “enable” it only once the cat is detected in y_{<t}.
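Operationally, setting a softmax logit to −∞ is the same as zeroing the corresponding probability and renormalizing. A sketch of the conditioning step, with index 0 as the always-enabled no-bias option:

```python
def condition_bias(alpha, prefixes, partial_hyp):
    """Zero out and renormalize the bias-attention probabilities of
    phrases whose required prefix has not yet appeared in the partial
    hypothesis. alpha[0] is the no-bias option and is always enabled;
    prefixes[i] conditions alpha[i + 1]."""
    masked = [alpha[0]] + [
        a if pref in partial_hyp else 0.0
        for a, pref in zip(alpha[1:], prefixes)
    ]
    total = sum(masked)
    return [m / total for m in masked]
```

For instance, with probabilities `[0.1, 0.6, 0.3]`, prefixes `["the cat", "play"]`, and partial hypothesis `"the cat s"`, the second phrase is disabled (its prefix "play" has not been decoded) and the remaining mass is renormalized over the no-bias option and the first phrase.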
A good choice of prefixes will minimize the number of phrases sharing the same prefix, so the bias-attention is not “overloaded”, while at the same time not splitting each phrase into too many segments, to allow distinctive bias embeddings. This may be achieved by an algorithm which starts from empty prefixes and iteratively extends each prefix by one word as long as the same prefix is not shared by too many phrases. In Section 5.3.3 we discuss a rule-based prefix construction, and leave a full algorithmic treatment as future work.
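The greedy construction sketched above might look as follows. The cap `max_shared` on how many phrases may share one prefix is an assumed parameterization, since the paper leaves the full algorithm as future work:

```python
from collections import defaultdict

def build_prefixes(phrases, max_shared):
    """Greedy prefix construction: start every phrase at the empty
    prefix, then repeatedly extend by one word any prefix shared by
    more than max_shared phrases. Returns a phrase -> prefix map."""
    prefix = {p: "" for p in phrases}
    changed = True
    while changed:
        changed = False
        groups = defaultdict(list)
        for p, pre in prefix.items():
            groups[pre].append(p)
        for pre, members in groups.items():
            if len(members) <= max_shared:
                continue  # this prefix is not overloaded
            for p in members:
                rest = p[len(pre):].split()
                if rest:  # extend the prefix by the phrase's next word
                    prefix[p] = (pre + " " + rest[0]).strip() if pre else rest[0]
                    changed = True
    return prefix
```

With phrases `["talk to a", "talk to b", "play x"]` and `max_shared = 2`, one round of extension suffices: the two "talk …" phrases end up sharing the prefix "talk" (within the cap), and "play x" gets the prefix "play".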
4.1 Experimental Setup
Our training setup is similar to previous work, though our experiments focus on graphemes and our model architecture is smaller. Our experiments are conducted on a 25,000-hour training set consisting of 33 million English utterances. The training utterances are anonymized and hand-transcribed, and are representative of Google’s voice search traffic. This data set is augmented by artificially corrupting clean utterances using a room simulator, adding varying degrees of noise and reverberation such that the overall SNR is between 0dB and 30dB, with an average SNR of 12dB. The noise sources are from YouTube and daily-life noisy environmental recordings.
The models evaluated in this section are trained on Tensor Processing Unit (TPU) slices with a global batch size of 4,096. Each training core operates on a shard size of 32 utterances in each training step. From this shard, bias phrases are randomized, and thus each shard sees a maximum of 32 bias phrases during training.
We use 80-dimensional log-mel acoustic features computed every 10ms over a 25ms window. Following prior work, we stack 3 consecutive frames and stride the stacked frames by a factor of 3. This downsampling enables us to use a simpler encoder architecture.
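The stacking-and-striding step can be sketched directly; with stack = 3 and stride = 3, the encoder sees one 240-dimensional input for every three 80-dimensional frames:

```python
def stack_and_stride(frames, stack=3, stride=3):
    """Stack consecutive feature frames and then take every stride-th
    stacked frame, reducing the encoder's input frame rate. Each
    output is the concatenation of `stack` input frames."""
    stacked = [sum(frames[i:i + stack], [])           # concatenate lists
               for i in range(0, len(frames) - stack + 1)]
    return stacked[::stride]
```

On nine toy 1-dimensional frames this yields three stacked frames, i.e., a 3x reduction in sequence length, which is the point of the downsampling.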
The encoder’s architecture consists of 10 unidirectional LSTM layers, each with nodes. The encoder-attention is computed over dimensions, using attention heads. The bias-encoder consists of a single LSTM layer with nodes, and the bias-attention is computed over dimensions. Finally, the decoder consists of 4 LSTM layers with nodes. In total, the model has about 58 million trainable parameters. Our model is implemented using TensorFlow.
In all our experiments we set P_keep = 0.5 to promote robustness to the ‘no-bias’ case, and select a single n-gram from each kept reference (N_phrases = 1). This leads to a bias list with an expected size of 17 (half of the shard size, plus one for ‘no-bias’).
4.2 Test sets
We test our model on a number of test sets, which we describe below. A summary of the biasing setup of each of the test sets is given in Table 1.
The Voice Search test set contains voice search queries which are about 3 seconds long. The Dictation test set contains longer utterances, such as dictations of text messages. Both Voice Search and Dictation are in matched conditions to portions of the training data, and are not used to test biasing, but rather the performance of the model in a bias-free setting.
Each of the remaining test sets - Songs, Contacts, and Talk-To - contains utterances with a distinct list of contextual phrases, which vary from four phrases up to more than three thousand, and are not necessarily identical across utterances.
The Songs, Contacts and Talk-To test sets are artificially generated using a text-to-speech (TTS) engine. We use Parallel WaveNet as our TTS engine and corrupt the produced samples with noise, similarly to the way we corrupt the training data.
The Songs test set contains requests to play music (e.g., play rihanna music) with a bias set that contains popular American song and artist names. The Contacts test set contains call requests (e.g., call demetri mobile) with a bias set that contains an arbitrary list of contact names.
The Talk-To test set contains requests to talk with one of many chatbots (e.g. talk to trivia game). We note that the list of available chatbots is rather large compared to previous sets. See Table 1 for more details.
In this section, we present the performance of CLAS across a variety of test sets.
5.1 CLAS without bias phrases
First, to check whether our biasing components hurt decoding in cases where no bias phrases are present, we compare our model to a similar ‘vanilla’ LAS system in Table 2. We note that the CLAS model is trained with random bias phrases, but evaluated with an empty list of phrases during inference (i.e., only ‘no-bias’ is presented at inference time); we denote this model by CLAS-NB. Somewhat surprisingly, we observe that CLAS-NB does better than LAS, and conclude that CLAS can be used even without any biasing phrases. Therefore, in the experiments that follow, to get an accurate comparison, instead of comparing to LAS directly we use CLAS-NB as a proxy for a LAS baseline.
5.2 On-the-fly (OTF) Rescoring with LAS Baseline
Table 3 compares different OTF rescoring variants, which differ in how weights are assigned to subword units, as outlined in Section 2.2. We only report numbers for the Songs test set; similar trends were observed on the other test sets, which are omitted in the interest of brevity. The table indicates that if we bias at the end of the word, we get very little improvement over the no-bias baseline. While a small improvement comes from biasing at the beginning of each word, the best system biases each subword unit, which helps to keep the word on the beam. All subsequent experiments with OTF rescoring will thus bias each subword unit.
5.3 CLAS with bias phrases
5.3.1 Comparison of Biasing Approaches
We compare CLAS to two baseline approaches in Table 4: (1) a LAS baseline, using CLAS-NB as explained in Section 5.1; (2) LAS + OTF rescoring as described in Section 2.2, with the rescoring weight λ estimated on the same test sets. We find that on sets that have hundreds of biasing phrases with a high rate of OOVs (Songs, Contacts), CLAS performs significantly better than the traditional approaches without requiring any additional hyperparameter tuning. However, CLAS degrades on the Talk-To set, which has thousands of phrases. This scalability issue will be addressed with bias-conditioning (see Section 5.3.3).
5.3.2 CLAS with varying number of bias phrases
To better understand the CLAS failure on Talk-To, we evaluate CLAS while restricting the bias list to phrases that appear in the reference transcript plus distractors (phrases which are not present in the transcript) chosen randomly from the complete bias list. The results are presented in Figure 4. We observe a gradual degradation in WER as a function of the number of distractors.
We hypothesize that the reason for this behavior is that with a large number of bias phrases, correlations start to appear between their embeddings. For example, the embeddings of talking pal and talkative ai have a correlation (normalized inner product) of 0.6, while the average correlation is 0.2.
5.3.3 Overcoming the scaling problem: CLAS with Bias-conditioning (Cond) and OTF rescoring
Next, we try to combine CLAS with bias-conditioning (Section 3.4). Since Talk-To has the largest number of biasing phrases, we test the scalability of CLAS by applying bias-conditioning to this set only, with prefixes constructed in a rule-based manner: first, we create a prefix from talk to plus the next word (e.g., the phrase talk to pharmacy flashcards would be split into the prefix talk to pharmacy and the suffix flashcards). In addition, we found it useful to condition the first word after talk to on its first letter (e.g., pharmacy will be conditioned on talk to p). This construction restricts the number of phrases sharing the same prefix to 225 (vs. 3255), while increasing the overall number of bias phrase segments by only 10%.
In this work we presented CLAS, a novel, all-neural, end-to-end contextualized ASR model, which incorporates contextual information by embedding full context phrases. In experimental evaluations, we demonstrated that the proposed model outperforms standard shallow-fusion biasing techniques on several test sets. We investigated CLAS’s ability to handle a large set of context phrases, and suggested a conditioning method to further improve its quality. Our future work includes scaling CLAS to tens of thousands of bias phrases.
We would like to thank Ian Williams, Petar Aleksic, Assaf Hurwitz Michaely, Uri Alon, Justin Scheiner, Yanzhang (Ryan) He, Deepti Bhatia, Johan Schalkwyk and Michiel Bacchiani for their help and useful comments.
-  I. McGraw, R. Prabhavalkar, R. Alvarez, M. G. Arenas, K. Rao, D. Rybach, O. Alsharif, H. Sak, A. Gruenstein, F. Beaufays, and C. Parada, “Personalized speech recognition on mobile devices,” in Proc. of ICASSP, 2016, pp. 5955–5959.
-  Petar Aleksic, Mohammadreza Ghodsi, Assaf Michaely, Cyril Allauzen, Keith Hall, Brian Roark, David Rybach, and Pedro Moreno, “Bringing contextual information to Google speech recognition,” in Proc. of Interspeech, 2015.
-  J. Scheiner, I. Williams, and P. Aleksic, “Voice search language model adaptation using contextual information,” in Proc. SLT, 2016, pp. 253–257.
-  P. Aleksic, C. Allauzen, D. Elson, A. Kracun, D. M. Casado, and P. J. Moreno, “Improved recognition of contact names in voice commands,” in Proc. of ICASSP, 2015, pp. 5172–5175.
-  M. Schuster and K. Nakajima, “Japanese and Korean voice search,” in Proc. of ICASSP, 2012, pp. 5149–5152.
-  R. Prabhavalkar, T. N. Sainath, Y. Wu, P. Nguyen, Z. Chen, C.-C. Chiu, and A. Kannan, “Minimum word error rate training for attention-based sequence-to-sequence models,” in Proc. of ICASSP, 2018.
-  Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber, “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd International Conference on Machine Learning (ICML), 2006, pp. 369–376.
-  H. Soltau, H. Liao, and H. Sak, “Neural speech recognizer: Acoustic-to-word LSTM model for large vocabulary speech recognition,” CoRR, vol. abs/1610.09975, 2016.
-  Alex Graves, “Sequence transduction with recurrent neural networks,” in Proceedings of the 29th International Conference on Machine Learning (ICML 2012), 2012.
-  K. Rao, H. Sak, and R. Prabhavalkar, “Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer,” in Proc. of ASRU, 2017.
-  W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, “Listen, attend and spell,” CoRR, vol. abs/1508.01211, 2015.
-  D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio, “End-to-end attention-based large vocabulary speech recognition,” in Proc. of ICASSP, 2016.
-  C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, N. Jaitly, B. Li, and J. Chorowski, “State-of-the-art speech recognition with sequence-to-sequence models,” in Proc. ICASSP, 2018.
-  A. Kannan, Y. Wu, P. Nguyen, T. N. Sainath, Z. Chen, and R. Prabhavalkar, “An analysis of incorporating an external language model into a sequence-to-sequence model,” in Proc. of ICASSP, 2018.
-  A. Sriram, H. Jun, S. Sateesh, and A. Coates, “Cold fusion: Training seq2seq models together with language models,” CoRR, vol. abs/1708.06426, 2017.
-  Ian Williams, Anjuli Kannan, Petar Aleksic, David Rybach, and Tara N. Sainath, “Contextual speech recognition in end-to-end neural network systems using beam search,” in Proc. of Interspeech, 2018.
-  D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in Proc. of ICLR, 2015.
-  Y. He, R. Prabhavalkar, K. Rao, W. Li, A. Bakhtin, and I. McGraw, “Streaming Small-footprint Keyword Spotting Using Sequence-to-Sequence Models,” in Proc. ASRU, 2017.
-  Sepp Hochreiter and Jürgen Schmidhuber, “Long Short-Term Memory,” Neural Computation, vol. 9, no. 8, Nov. 1997.
-  Mike Schuster and Kuldip K. Paliwal, “Bidirectional recurrent neural networks,” IEEE Transactions on Signal Processing, 1997.
-  A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention Is All You Need,” CoRR, vol. abs/1706.03762, 2017.
-  Keith B. Hall, Eunjoon Cho, Cyril Allauzen, Francoise Beaufays, Noah Coccaro, Kaisuke Nakajima, Michael Riley, Brian Roark, David Rybach, and Linda Zhang, “Composition-based on-the-fly rescoring for salient n-gram biasing,” in Interspeech 2015, 2015.
-  Mehryar Mohri, Fernando Pereira, and Michael Riley, “Weighted finite-state transducers in speech recognition,” Computer Speech & Language, vol. 16, no. 1, pp. 69–88, 2002.
-  Ilya Sutskever, Oriol Vinyals, and Quoc V. Le, “Sequence to sequence learning with neural networks,” in NIPS, 2014.
-  Chanwoo Kim, Ananya Misra, Kean Chin, Thad Hughes, Arun Narayanan, Tara Sainath, and Michiel Bacchiani, “Generation of large-scale simulated utterances in virtual rooms to train deep neural networks for far-field speech recognition in Google Home,” in Proc. of Interspeech, 2017.
-  M. Abadi et al., “TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems,” Available online: http://download.tensorflow.org/paper/whitepaper2015.pdf, 2015.
-  Aaron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Edward Lockhart, Luis C. Cobo, Florian Stimberg, et al., “Parallel WaveNet: Fast high-fidelity speech synthesis,” arXiv preprint arXiv:1711.10433, 2017.