Predicting detection filters for small footprint open-vocabulary keyword spotting
In this paper, we propose a fully neural approach to open-vocabulary keyword spotting that allows users to add a customizable voice interface to their device and that does not require task-specific data. We present a keyword detection neural network weighing less than 250KB, in which the topmost layer performing keyword detection is predicted by an auxiliary network that may be run offline to generate a detector for any keyword. We show that the proposed model outperforms acoustic keyword spotting baselines by a large margin on two tasks of detecting keywords in utterances and on three tasks of detecting isolated speech commands. We also propose a method to fine-tune the model when specific training data is available for some keywords, which yields performance similar to that of a standard speech command neural network while preserving the ability of the model to be applied to new keywords.
Théodore Bluche, Thibault Gisselbrecht
Sonos Inc., Paris, France
Index Terms: speech recognition, keyword spotting, neural networks
1 Introduction

The recent advances in automatic speech recognition (ASR), reaching close to human recognition performance, paved the way to natural language interaction in everyday life, making voice a natural interface for communicating with objects. Such large-vocabulary ASR systems demand a lot of resources and computing power, but it has been shown that the voice interface can run on device when the tasks are known in advance, in a closed-ontology setting. For many interactions, it may even be reduced to the detection of specific keywords, allowing systems small enough to run on micro-controllers (MCUs), which are cheap and have low energy consumption.
A significant number of tiny neural networks for keyword spotting (KWS) have been proposed in the past few years [20, 6]. These models yield very good keyword classification results and can run on MCUs. However, they require specific training data containing the keywords to be detected at inference, and they lack flexibility: data must be collected and a new model trained every time a new keyword is added. On the other hand, traditional KWS approaches are either based on the output of an ASR system, looking for keywords in the transcript or in the word [14, 15] or phone [3, 5] lattice, or based on acoustic models of phones, allowing models of keywords and of “background” to be built and likelihood ratios computed between the two. End-to-end acoustic models predicting the character or phone sequence directly lead to efficient decoding and keyword spotting [9, 12, 4, 2] and can take into account the confusions of the network to improve the keyword models [13, 25]. These methods do not need keyword-specific training data, but they still require some post-processing and a non-trivial confidence score calibration to transform the frame-wise phone scores into keyword confidences.
Recently, ASR-free approaches have been proposed, which consist in computing embeddings for both the audio and the keyword pronunciation and directly predicting a keyword detection score [1, 19]. They combine the simplicity of end-to-end KWS methods with the flexibility of acoustic KWS, and do not require specific training data. In the first approach, the whole spoken utterance is embedded into a single vector with a recurrent auto-encoder. Similarly, the keyword is embedded into a vector using an auto-encoder of the phone sequence. The concatenation of both vectors is fed to a small neural network predicting whether the keyword appears in the utterance. In the second, different recurrent neural networks are trained to predict the word and phone embeddings, and the classification is based on the distance between the keyword and utterance embeddings.
In a similar vein, we propose a fully neural architecture for KWS which can be trained on generic ASR data, without specific examples of the keywords to be detected at inference. It is made of three components. An acoustic encoder, composed of a stack of recurrent layers, is pretrained as a quantized ASR acoustic model. Its intermediate features are fed to a convolutional keyword detector network trained to output keyword confidences. The weights of the latter are predicted by a keyword encoder: a recurrent neural network applied to the keyword pronunciation to predict the weights of the topmost convolution kernel of the keyword detector network. This idea is similar to other works on dynamic convolution filters in computer vision for weather prediction, visual question answering, or video and stereo prediction.
We experimented with this approach on two tasks: a continuous KWS task, where keywords are detected inside queries formulated in natural language, and a speech command task, where the goal is to predict one command among a predefined set. We compare this system to acoustic KWS approaches and show that the proposed neural approach outperforms them by a large margin. We also show how the model may be fine-tuned with specific training data to get close to the performance of an end-to-end KWS classification model, without losing the ability to detect new out-of-vocabulary keywords.
2 Keyword spotting neural network
When keywords to detect are known in advance, and when training data containing those keywords are available, a neural network can be trained in an end-to-end fashion to detect them [20, 6]. In this paper, we present a method to create such a neural network for any arbitrary keyword defined post-training, which does not require training data specific to these keywords.
2.1 Keyword spotting neural network
The neural network is made of a stack of unidirectional LSTM layers, followed by two convolutional layers. It has one sigmoid output for each keyword, and is trained on a generic speech dataset. The top convolutional layer of the neural network computes the probability of detecting each keyword at each timestep. For keyword \(w\), the output sequence \(y^{(w)}\) is computed as follows:

\[ y^{(w)} = \sigma\left( g_w * f_\theta(x) \right) \qquad (1) \]

where \(\sigma\) is the sigmoid function, \(*\) is the convolution operation, \(f_\theta(x)\) represents the lower layers of the neural network with parameters \(\theta\) applied to input \(x\), and \(g_w\) is the convolution kernel corresponding to keyword \(w\).
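As an illustration, eq. 1 can be sketched as a 1-D "valid" convolution followed by a sigmoid. This is a minimal sketch with toy shapes and hypothetical helper names, not the actual implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def keyword_scores(features, kernel):
    """Slide a keyword-specific kernel over encoder features (eq. 1).

    features: (T, D) sequence of encoder outputs f_theta(x)
    kernel:   (K, D) convolution kernel g_w predicted for keyword w
    returns:  (T - K + 1,) per-frame detection probabilities y_t^(w)
    """
    T, _ = features.shape
    K = kernel.shape[0]
    logits = np.array([np.sum(features[t:t + K] * kernel)
                       for t in range(T - K + 1)])
    return sigmoid(logits)

# Toy example: 20 frames of 8-dim features, a 5-frame kernel.
rng = np.random.default_rng(0)
feats = rng.normal(size=(20, 8))
g_w = rng.normal(size=(5, 8))
y = keyword_scores(feats, g_w)
assert y.shape == (16,) and np.all((y > 0) & (y < 1))
```

In the actual model, the kernel shapes follow the layer sizes given in Section 3.1.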
Since the keywords to be detected are not known during the training phase, and because the network is an open-vocabulary KWS model, the parameters of the top layer cannot be trained directly. They are predicted by an auxiliary neural network: a keyword encoder, as shown in Figure 1.
2.2 Keyword encoder

The keyword encoder is a neural network \(e_\psi\) applied to the phone sequence \(\pi_w\) of a keyword \(w\). It outputs the parameters used to detect the keyword with the KWS model:

\[ g_w = e_\psi(\pi_w) \qquad (2) \]

where \(\psi\) represents the parameters of the keyword encoder network. In this work, the keyword encoder is a bidirectional LSTM network, followed by an affine transform.
At inference, the network is first configured to detect a set of keywords \(\mathcal{W}\). For each keyword \(w \in \mathcal{W}\), the phone sequence \(\pi_w\) is retrieved from a pronunciation lexicon or a grapheme-to-phoneme converter, and the convolution kernel \(g_w\) is computed by the keyword encoder (eq. 2). Then, the top convolution layer can be created with the set of computed kernels \(\{ g_w \}_{w \in \mathcal{W}}\), and the KWS network is ready to be applied to the input audio (eq. 1).
2.3 Training

We want \(y_t^{(w)} = 1\) when the phone sequence \(\pi_w\) appears in the audio input and ends at \(t\), and \(y_t^{(w)} = 0\) otherwise. It is therefore possible to jointly train the keyword detector and the keyword encoder from a generic speech training set with the cross-entropy loss, without knowing what keywords will be used at inference.

For each time step \(t\) of each training sample \(x^{(i)}\), we create a set \(\mathcal{P}_t^{(i)}\) of positive keyword examples (i.e. phone sequences that are subsequences of the utterance ending at time \(t\)), and a set \(\mathcal{N}_t^{(i)}\) of negative keyword examples, and we minimize the following cross-entropy loss:

\[ \mathcal{L} = - \sum_i \sum_t \left( \sum_{w \in \mathcal{P}_t^{(i)}} \log y_t^{(w)} + \sum_{w \in \mathcal{N}_t^{(i)}} \log\left( 1 - y_t^{(w)} \right) \right) \]
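The loss above can be sketched as follows, assuming the positive and negative detection scores have already been gathered for a given time step (hypothetical helper, not the training code):

```python
import numpy as np

def kws_loss(pos_scores, neg_scores, eps=1e-9):
    """Cross-entropy over sampled positive/negative keyword scores.

    pos_scores: scores y_t^(w) for keywords that do end at time t
    neg_scores: scores for keywords that do not appear
    """
    pos = -np.sum(np.log(np.asarray(pos_scores) + eps))
    neg = -np.sum(np.log(1.0 - np.asarray(neg_scores) + eps))
    return pos + neg

# Toy scores: confident positives and low negatives give a small loss.
loss = kws_loss([0.9, 0.8], [0.1, 0.2, 0.05])
assert loss > 0.0
```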
2.4 Dataset creation
In order to build the set of positive keyword examples for dataset sample \(x^{(i)}\) at frame \(t\), we need to know the last phones appearing in the utterance at time \(t\). They may be inferred from the forced alignment of the utterance with the ground-truth phone sequence.
To obtain them, we first train the stack of LSTMs on the dataset \(\mathcal{D} = \{ (x^{(i)}, z^{(i)}) \}\) to predict the phone sequences \(z^{(i)}\) with the connectionist temporal classification (CTC) loss:

\[ \mathcal{L}_{CTC} = - \sum_i \log \sum_{\pi \in \mathcal{B}^{-1}(z^{(i)})} p(\pi \mid x^{(i)}) \]

where \(\pi\) is a sequence of phone or blank labels and \(\mathcal{B}\) is the CTC collapse function that removes label repetitions and blank labels. This network is used to align the dataset, i.e., for each \(x^{(i)}\), compute:

\[ \hat{\pi}^{(i)} = \underset{\pi \in \mathcal{B}^{-1}(z^{(i)})}{\arg\max}\ p(\pi \mid x^{(i)}) \]
Let \(r\) be the receptive field of the convolutional network. By removing blanks and label repetitions, \(\mathcal{B}(\hat{\pi}^{(i)}_{t-r+1:t})\) will actually yield the sequence of phones appearing in the utterance between time \(t-r+1\) and \(t\). Any suffix of that sequence may be considered an example of positive keyword to be detected at \(t\). The set of positive examples at time \(t\) is created by randomly sampling suffixes of that sequence. Note that the sampled “keyword” phone sequences do not necessarily correspond to actual words during training.
Since the network is trained with batches of dataset samples, the set of negative keyword examples for one sample can merely be the union of sets of positive examples for the other samples in the batch.
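The collapse function \(\mathcal{B}\) and the suffix sampling described above can be sketched as follows (toy alignment, hypothetical helper names):

```python
import random

BLANK = "<b>"

def ctc_collapse(labels):
    """CTC collapse function B: drop label repetitions, then blanks."""
    out = []
    for lab in labels:
        if out and lab == out[-1]:
            continue
        out.append(lab)
    return [lab for lab in out if lab != BLANK]

def positive_examples(alignment, t, receptive_field, n_samples=2):
    """Sample suffixes of the phones seen in the last receptive-field frames."""
    window = alignment[max(0, t - receptive_field + 1):t + 1]
    phones = ctc_collapse(window)
    suffixes = [tuple(phones[i:]) for i in range(len(phones))]
    return random.sample(suffixes, min(n_samples, len(suffixes)))

# Frame-level best path for a toy utterance ("k ae t", with blanks/repeats).
align = [BLANK, "k", "k", BLANK, "ae", "ae", "t", BLANK]
assert ctc_collapse(align) == ["k", "ae", "t"]
```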
2.5 Adaptation on specific dataset
Acoustic KWS approaches rely on phone predictions and might not gain much from fine-tuning the model on a keyword-specific dataset. In the proposed approach, the weights generated by the keyword encoder can serve as a starting point for re-training on keyword-specific data, when available. That amounts to rewriting eq. 2 as:

\[ g_w = e_\psi(\pi_w) + \delta_w \]

where \(\delta_w\) are keyword-specific parameters initialized to zero.
Given a training dataset for a set of keywords, made of positive examples of the keywords and of negative data, the data-specific parameters can be adjusted by gradient descent to optimize the cross-entropy loss of Section 2.3. Since only those parameters are adjusted, the ability of the model to detect any other arbitrary keyword is not lost.
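A minimal sketch of this adaptation: the kernel predicted by the keyword encoder is taken as the starting point, and only it is updated by gradient descent on a toy cross-entropy objective (shapes, helper names, and the simple SGD loop are illustrative assumptions):

```python
import numpy as np

def finetune_kernel(g_init, batches, lr=0.1, steps=200):
    """Adjust only the keyword-specific kernel by gradient descent.

    g_init:  kernel predicted by the keyword encoder (starting point)
    batches: iterable of (features_window, label) pairs, where the
             window has the kernel's shape and label is 1 if the
             keyword ends there, 0 otherwise
    """
    g = g_init.copy()
    for _ in range(steps):
        for feats, label in batches:
            score = 1.0 / (1.0 + np.exp(-np.sum(feats * g)))
            # Gradient of the cross-entropy w.r.t. the kernel only.
            g -= lr * (score - label) * feats
    return g

# Toy data: one positive and one negative feature window.
data = [(np.ones(4), 1), (-np.ones(4), 0)]
g = finetune_kernel(np.zeros(4), data)
score_pos = 1.0 / (1.0 + np.exp(-np.sum(np.ones(4) * g)))
assert score_pos > 0.9  # detection improves on the positive window
```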
3 Experimental results
3.1 Experimental setup
We trained quantized LSTM networks on Librispeech with CTC and data augmentation. The inputs are MFCC features and the outputs are 46 phones plus a blank class. The details of the training and quantization procedure can be found in our previous work on quantized LSTM networks for open-vocabulary KWS. The weights of the LSTM cells are frozen after CTC training and are not fine-tuned during the training of the KWS network, allowing them to be reused for ASR as well. We evaluated systems based on two such LSTM networks, with five layers of 64 and 96 units.
The top softmax layer of that model is replaced by two convolutional layers to build the KWS network (shown in purple and green in Figure 1). The first convolutional layer has a kernel of five frames and 96 output tanh channels, followed by a max-pooling of size three with stride two. The top convolutional layer has a kernel size of 12, giving the convolutional part a total receptive field of 29 frames. The keyword encoder is made of a bidirectional LSTM layer with 128 units in each direction, followed by a linear transform predicting the kernel weights for each keyword. Overall, the model based on the 5x64 (resp. 5x96) LSTM has 208.8k (resp. 440.7k) parameters, plus 1.2k parameters per keyword.
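As a sanity check, the per-keyword figure quoted above can be derived from the layer sizes (the bias term per keyword is our assumption):

```python
# Per-keyword parameters of the top convolution: a 12-frame kernel
# over the 96 channels of the first convolutional layer, plus a bias.
kernel_size, channels = 12, 96
per_keyword = kernel_size * channels + 1
assert per_keyword == 1153  # ~1.2k parameters per keyword
```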
The keyword detector and encoder are jointly trained for five epochs using minibatches of 128 audio samples, with two positive synthetic keyword examples sampled for each, using the Adam optimizer and a learning rate of 1e-4, following the procedure presented in Section 2.3. All the weights, including those predicted by the keyword encoder, are quantized to 8 bits.
3.2 Keyword spotting results
For this task, we crowd-sourced queries for two use-cases: a smart light scenario and a washing machine scenario.
The proposed KWS model is compared to five baselines. They all include the quantized LSTM acoustic model which was used as a base model for the KWS neural network. The Viterbi and Lattice baselines are LVCSR baselines, where the keywords are detected in the Viterbi decoding or in the decoding lattice of the utterance, with a 200k-word vocabulary and a trigram language model. The Filler baseline employs a standard keyword-filler model, where the filler model is a phone loop. The Greedy and Sequence baselines correspond to an end-to-end acoustic KWS approach with two post-processing methods.
We trained five models with different random initializations. The results are displayed in Figure 2 and compared to the Sequence post-processing baseline in Table 1. The LVCSR-based results are low, although using recognition lattices provides a big improvement over Viterbi decoding. The keyword-filler model is the best of the traditional methods. The acoustic KWS baselines are competitive with the filler model: with the same model size, they are almost always better, with both the greedy and sequence post-processors, and with a smaller model, the sequence post-processor still yields better results than the filler model. The proposed approach outperforms all baselines, even with the small model, which has half as many parameters.
3.3 Analysis of learned filters
To analyze what the keyword encoder has learned, we compare the predicted convolution filters for different inputs. In particular, we measure the Euclidean distance between predicted filters. Indeed, if the filters are close, the KWS network will tend to make similar predictions and potentially confuse the corresponding words.
| Keyword    | Closest vocabulary words                     |
|------------|----------------------------------------------|
| turn on    | anon, non, turnin, fernand, maranon          |
| decrease   | crease, increase, encrease, greece           |
| brightness | rightness, uprightness, triteness, greatness |
| bedroom    | bathroom, begloom, broom, broome             |
| play       | flay, clay, blaye, splay, ley, lay           |
| start      | astarte, tart, stuart, upstart, sturt        |
The filters for all the words in a vocabulary are computed and compared to the predicted filters for some keywords. In Table 2, we show the words with the closest filters to those of the keywords. We observe that they tend to be words with similar pronunciations, of about the same length or shorter, and mostly with the same or very similar suffixes.
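The comparison described here can be sketched as a nearest-neighbor search in filter space (hypothetical helper name, toy filters):

```python
import numpy as np

def closest_words(keyword_filter, vocab_filters, k=5):
    """Rank vocabulary words by Euclidean distance between filters."""
    dists = {word: np.linalg.norm(keyword_filter - f)
             for word, f in vocab_filters.items()}
    return sorted(dists, key=dists.get)[:k]

# Toy 2-d "filters": "cat" and "cap" are close, "dog" is far.
vocab = {"cat": np.array([1.0, 0.0]),
         "dog": np.array([0.0, 1.0]),
         "cap": np.array([0.9, 0.1])}
ranking = closest_words(np.array([1.0, 0.0]), vocab, k=2)
assert ranking == ["cat", "cap"]
```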
Figure 4 shows the pairwise distances between the filters for the keywords of the lights dataset. We see that increase and decrease on the one hand, and turn on and turn off on the other hand, are very close to one another, which correlates with the confusions of the KWS network: 37% of the confusions are between increase and decrease, and 29% between turn on and turn off.
3.4 Speech commands results
For the speech command task, the goal is to detect a single command among a pre-defined set. We evaluate our approach on Google's Speech Commands dataset.
We compare our results with the 5x64 model to a model trained on specific training data, similar in size (186K parameters) and performance to the res15 ResNet architecture, and to the Greedy acoustic KWS baseline using the same 5x64 base network. The results are depicted in Figure 3. As expected, the model trained on specific training data (“with data”, in black) is much better than the two open-vocabulary approaches. Nonetheless, the proposed method outperforms the acoustic KWS baseline. The results on Google Speech Commands look worse, but that dataset mainly consists of short commands of two or three phonemes, which are harder to discriminate, and the negative data in this case only contains commands too.
Moreover, the proposed model is mostly similar to the one used in the acoustic KWS baseline, since only the top layers are retrained, so both could be run jointly with low computation overhead. Finally, the models labeled “with data” are trained on keyword-specific data, whereas the proposed model can readily be applied to any keyword set without retraining.
3.5 Fine-tuning on keyword-specific training data
Since training data is available for these datasets, we fine-tuned the filters with the method presented in Section 2.5. The results are shown in red in Figure 3. After fine-tuning, the performance of the proposed model is similar to that of the model trained exclusively on specific training data. The results on Google Speech Commands are not as close, which might be due to the fact that both positive and negative data are short keywords in this dataset.
In these speech command tasks, the keywords are isolated: they are not embedded in a sentence. Before fine-tuning, the network was only trained on sentences. It is therefore possible that the gap between the network without and with fine-tuning (and “with data”) is merely due to learning to detect the silences surrounding the commands. This should be explored in future work. It is also worth noting that the fine-tuned network has not lost its ability to detect any arbitrary keyword, since only the weights of the top layer are modified. New keywords may then be added to the network without retraining it entirely. The baseline network does not offer these possibilities.
4 Conclusion

We presented an open-vocabulary keyword spotting system which does not require training data specific to the keywords to be detected at inference. In contrast to most acoustic keyword spotting models, it directly predicts a confidence score at the keyword level, alleviating the need for confidence calibration. We have shown that the proposed model outperforms acoustic KWS baselines for the detection of keywords both inside utterances and as isolated speech commands. We also proposed a method to fine-tune the model on specific training data, which makes it as accurate as a speech command detector trained on such data while retaining its ability to detect other arbitrary keywords.
References

- (2017) End-to-end ASR-free keyword search from speech. IEEE Journal of Selected Topics in Signal Processing 11(8), pp. 1351–1359.
- (2020) Small-footprint open-vocabulary keyword spotting with quantized LSTM networks. arXiv preprint arXiv:2002.10851.
- (1997) Open-vocabulary speech indexing for voice and video mail retrieval. In Proceedings of the Fourth ACM International Conference on Multimedia, pp. 307–316.
- (2018) Sequence discriminative training for deep learning based acoustic keyword spotting. Speech Communication 102, pp. 100–111.
- (2017) Confidence measures for CTC-based phone synchronous decoding. In ICASSP 2017, pp. 4850–4854.
- (2019) Efficient keyword spotting using dilated convolutions and gating. In ICASSP 2019, pp. 6351–6355.
- (2000) The TREC spoken document retrieval track: a success story. NIST Special Publication 500(246), pp. 107–130.
- (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pp. 369–376.
- (2015) Online keyword spotting with a character-level recurrent neural network. arXiv preprint arXiv:1512.08903.
- (2016) Dynamic filter networks. In Advances in Neural Information Processing Systems, pp. 667–675.
- (2015) A dynamic convolutional layer for short range weather prediction. In CVPR 2015, pp. 4840–4848.
- (2016) An end-to-end architecture for keyword spotting and voice activity detection. arXiv preprint arXiv:1611.09405.
- (2018) DONUT: CTC-based query-by-example keyword spotting. arXiv preprint arXiv:1811.10736.
- (2006) Spoken document retrieval from call-center conversations. In SIGIR 2006, pp. 51–58.
- (2007) Rapid and accurate spoken term detection. In Interspeech 2007.
- (2016) Image question answering using convolutional neural network with dynamic parameter prediction. In CVPR 2016, pp. 30–38.
- (2015) Librispeech: an ASR corpus based on public domain audio books. In ICASSP 2015, pp. 5206–5210.
- (2019) Spoken language understanding on the edge. In NeurIPS Workshop on Energy Efficient Machine Learning and Cognitive Computing.
- (2019) Open-vocabulary keyword spotting with audio and text embeddings. In Interspeech 2019.
- (2015) Convolutional neural networks for small-footprint keyword spotting. In Interspeech 2015.
- (2005) Comparison of keyword spotting approaches for informal continuous speech. In Interspeech 2005 (Eurospeech).
- (2018) Deep residual learning for small-footprint keyword spotting. In ICASSP 2018, pp. 5484–5488.
- (2018) Speech Commands: a dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209.
- (2016) Achieving human parity in conversational speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing.
- (2018) Automatic grammar augmentation for robust voice command recognition. arXiv preprint arXiv:1811.06096.
- (2017) Hello Edge: keyword spotting on microcontrollers. arXiv preprint arXiv:1711.07128.