Predicting detection filters for small footprint open-vocabulary keyword spotting

Abstract

In this paper, we propose a fully neural approach to open-vocabulary keyword spotting that allows users to add a customizable voice interface to their device and that does not require task-specific data. We present a keyword detection neural network weighing less than 250 kB, in which the topmost layer performing keyword detection is predicted by an auxiliary network that may be run offline to generate a detector for any keyword. We show that the proposed model outperforms acoustic keyword spotting baselines by a large margin on two tasks of detecting keywords in utterances and three tasks of detecting isolated speech commands. We also propose a method to fine-tune the model when specific training data is available for some keywords, which yields performance similar to that of a standard speech command neural network while preserving the model's ability to be applied to new keywords.

Théodore Bluche, Thibault Gisselbrecht
Sonos Inc., Paris, France
first.last@sonos.com

Index Terms: speech recognition, keyword spotting, neural networks

1 Introduction

The recent advances in automatic speech recognition (ASR), which now approaches human recognition performance [24], have paved the way to natural language interaction in everyday life, making voice a natural interface for communicating with objects. Such large-vocabulary ASR systems demand considerable resources and computing power, but it has been shown [18] that the voice interface can run on device when the tasks are known, in a closed-ontology setting. For many interactions it may even be reduced to the detection of specific keywords, making it possible to build systems small enough to run on micro-controllers (MCUs), which are cheap and have low energy consumption [26].

A significant number of tiny neural networks for keyword spotting (KWS) have been proposed in the past few years [20, 6]. These models yield very good keyword classification results and can run on MCUs. However, they require specific training data containing the keywords to be detected at inference, and they lack flexibility because data must be collected and a new model trained every time a keyword is added. On the other hand, traditional KWS approaches are either based on the output of an ASR system, looking for keywords in the transcript [7] or in the word [14, 15] or phone [3, 5] lattice, or based on acoustic models of phones, which allow building models for the keywords and for the "background" and computing likelihood ratios between the two [21]. End-to-end acoustic models predicting the character or phone sequence directly lead to efficient decoding and keyword spotting [9, 12, 4, 2] and can take the network's confusions into account to improve the keyword models [13, 25]. These methods do not need keyword-specific training data, but they still require some post-processing and a non-trivial confidence score calibration to turn the frame-wise phone scores into keyword confidences.

Recently, ASR-free approaches have been proposed, which compute embeddings for both the audio and the keyword pronunciation and directly predict a keyword detection score [1, 19]. They combine the simplicity of end-to-end KWS methods with the flexibility of acoustic KWS, and do not require specific training data. In [1], the whole spoken utterance is embedded into a single vector with a recurrent auto-encoder. Similarly, the keyword is embedded into a vector using an auto-encoder of the phone sequence. The concatenation of both vectors is fed to a small neural network predicting whether the keyword appears in the utterance. In [19], different recurrent neural networks are trained to predict the word and phone embeddings, and classification is based on the distance between the keyword and utterance embeddings.

In a similar vein, we propose a fully neural architecture for KWS which can be trained on generic ASR data, without specific examples of the keywords to be detected at inference. It is made of three components. An acoustic encoder, composed of a stack of recurrent layers, is pretrained as a quantized ASR acoustic model. Its intermediate features are fed to a convolutional keyword detector network trained to output keyword confidences. The weights of the topmost convolution kernel of this detector are predicted by a keyword encoder: a recurrent neural network applied to the keyword pronunciation. This idea is similar to other works on dynamic convolution filters in computer vision for weather prediction [11], visual question answering [16], or video and stereo prediction [10].

We evaluate this approach on two tasks: a continuous KWS task, where keywords are detected inside queries formulated in natural language, and a speech command task, where the goal is to predict one command among a predefined set. We compare this system to acoustic KWS approaches and show that the proposed neural approach outperforms them by a large margin. We also show how the model may be fine-tuned with specific training data to approach the performance of an end-to-end KWS classification model, without losing its ability to detect new out-of-vocabulary keywords.

The remainder of the paper is organized as follows. The proposed model is described in Section 2. We report experimental results on the two tasks in Section 3 and conclude in Section 4.

2 Keyword spotting neural network

When keywords to detect are known in advance, and when training data containing those keywords are available, a neural network can be trained in an end-to-end fashion to detect them [20, 6]. In this paper, we present a method to create such a neural network for any arbitrary keyword defined post-training, which does not require training data specific to these keywords.

2.1 Keyword spotting neural network

The neural network is made of a stack of unidirectional LSTM layers, followed by two convolutional layers. It has one sigmoid output for each keyword, and is trained on a generic speech dataset. The top convolutional layer of the neural network computes the probability of detecting each keyword at each timestep. For keyword $i$, the output sequence $y_i$ is computed as follows

$y_i = \sigma\left( w_i \ast f_\theta(x) \right)$    (1)

where $\sigma$ is the sigmoid function, $\ast$ is the convolution operation, $f_\theta(x)$ represents the lower layers of the neural network with parameters $\theta$ applied to input $x$, and $w_i$ is the convolution kernel corresponding to keyword $i$.
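For illustration, here is a minimal PyTorch sketch of eq. (1). It is not taken from the original implementation: it assumes the lower-layer features and the predicted kernels are stored as 1-D convolution tensors, and all names and toy dimensions are ours.

```python
import torch
import torch.nn.functional as F

def keyword_scores(features: torch.Tensor, kernels: torch.Tensor) -> torch.Tensor:
    """features: (B, C, T) output of the lower layers f_theta(x).
    kernels: (num_keywords, C, K) convolution kernels, one per keyword.
    Returns per-frame detection probabilities of shape (B, num_keywords, T')."""
    logits = F.conv1d(features, kernels)  # w_i * f_theta(x) for every keyword i
    return torch.sigmoid(logits)          # sigma(.) of eq. (1)

# Toy usage (96 feature channels, kernel width 12, 3 keywords):
feats = torch.randn(1, 96, 50)
w = torch.randn(3, 96, 12)
print(keyword_scores(feats, w).shape)     # torch.Size([1, 3, 39])
```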

Since the keywords to be detected are not known during the training phase, and because the network is an open-vocabulary KWS model, the parameters of the top layer cannot be trained directly. They are predicted by an auxiliary neural network: a keyword encoder, as shown in Figure 1.

Figure 1: Proposed model. A keyword encoder (right) predicts the weights of the top convolutional filter of the keyword spotting network (left), used to detect the keyword.

The keyword encoder is a neural network $g_\phi$ applied to the phone sequence $p_i$ of a keyword $i$. It outputs the parameters $w_i$ used to detect the keyword with the KWS model:

$w_i = g_\phi(p_i)$    (2)

where $\phi$ represents the parameters of the keyword encoder network. In this work, the keyword encoder is a bidirectional LSTM network, followed by an affine transform.
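A possible PyTorch sketch of such a keyword encoder is given below; it is our own illustration, not the paper's code. The BiLSTM width (128) and the kernel shape (96 channels, width 12) anticipate the sizes reported in Section 3.1, while the phone embedding layer and the use of the final hidden states are assumptions.

```python
import torch
import torch.nn as nn

class KeywordEncoder(nn.Module):
    def __init__(self, num_phones=46, emb_dim=64, hidden=128, channels=96, kernel_width=12):
        super().__init__()
        self.embed = nn.Embedding(num_phones, emb_dim)  # phone-id embedding (assumption)
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        # Affine transform from the final BiLSTM states to the flattened kernel w_i.
        self.proj = nn.Linear(2 * hidden, channels * kernel_width)
        self.channels, self.kernel_width = channels, kernel_width

    def forward(self, phone_ids: torch.Tensor) -> torch.Tensor:
        """phone_ids: (B, L) phone indices of the keyword pronunciations.
        Returns convolution kernels w_i of shape (B, channels, kernel_width)."""
        _, (h, _) = self.bilstm(self.embed(phone_ids))
        h = torch.cat([h[0], h[1]], dim=-1)             # forward + backward final states
        return self.proj(h).view(-1, self.channels, self.kernel_width)
```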

2.2 Inference

At inference, the network is first configured to detect a set of keywords $\mathcal{K}$. For each keyword $i \in \mathcal{K}$, the phone sequence $p_i$ is retrieved from a pronunciation lexicon or a grapheme-to-phoneme converter, and the convolution kernel $w_i$ is computed by the keyword encoder (eq. 2). Then, the top convolution layer can be created with the set of computed kernels $\{w_i\}_{i \in \mathcal{K}}$, and the KWS network is ready to be applied to the input audio (eq. 1).
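The inference flow could then be sketched as follows (our illustration; `lexicon` is a hypothetical word-to-phone-ids mapping, and `keyword_scores` / `KeywordEncoder` refer to the sketches above):

```python
import torch

def configure_detector(encoder: "KeywordEncoder", lexicon: dict, keywords: list) -> torch.Tensor:
    """Run the keyword encoder offline on each pronunciation to build the
    kernel set {w_i} for the chosen keywords (eq. 2)."""
    kernels = []
    with torch.no_grad():
        for kw in keywords:
            phone_ids = torch.tensor([lexicon[kw]])    # (1, L) phone indices
            kernels.append(encoder(phone_ids).squeeze(0))
    return torch.stack(kernels)                        # (num_keywords, C, K)

# At runtime the acoustic front end and LSTM stack produce f_theta(x), and the
# precomputed kernels are applied as in eq. (1):
#   probs = keyword_scores(features, configure_detector(encoder, lexicon, keywords))
```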

2.3 Training

The combination of eq. 1 and 2 gives:

$y_i = \sigma\left( g_\phi(p_i) \ast f_\theta(x) \right)$    (3)

We want $y_i^{(t)}$, the output at time $t$, to be close to 1 when the phone sequence $p_i$ appears in the audio input $x$ and ends at $t$, and close to 0 otherwise. It is therefore possible to jointly train $\theta$ and $\phi$ on a generic speech training set with the cross-entropy loss, without knowing what keywords will be used at inference.

For each time step $t$ of each training sample $x_n$, we create a set $\mathcal{P}_n^{(t)}$ of positive keyword examples (i.e. phone sequences that are subsequences of the utterance's phone sequence ending at time $t$), and a set $\mathcal{N}_n^{(t)}$ of negative keyword examples, and we minimize the following cross-entropy loss:

$\mathcal{L}(\theta, \phi) = \sum_n \sum_t \ell_n^{(t)}$    (4)

where

$\ell_n^{(t)} = -\sum_{i \in \mathcal{P}_n^{(t)}} \log y_i^{(t)} - \sum_{j \in \mathcal{N}_n^{(t)}} \log\left(1 - y_j^{(t)}\right)$    (5)
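A minimal sketch of this loss for one (sample, time step) pair, assuming the positive and negative scores have already been gathered (our code, not the paper's):

```python
import torch

def frame_loss(pos_scores: torch.Tensor, neg_scores: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Cross-entropy of eq. (5): target 1 for positive keywords, 0 for negatives.
    pos_scores / neg_scores are the sigmoid outputs y_i^(t) of the sampled keywords."""
    pos_term = -torch.log(pos_scores.clamp_min(eps)).sum()
    neg_term = -torch.log((1.0 - neg_scores).clamp_min(eps)).sum()
    return pos_term + neg_term

# The total loss of eq. (4) sums frame_loss over the sampled time steps of
# every utterance in the minibatch.
```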

2.4 Dataset creation

In order to build the set of positive keyword examples $\mathcal{P}_n^{(t)}$ for dataset sample $x_n$ at frame $t$, we need to know which phones appear last in the utterance at time $t$. They may be inferred from the forced alignment of the utterance with the ground-truth phone sequence.

To obtain them, we first train the stack of LSTMs $f_\theta$ on the training set $\mathcal{D}$ to predict the phone sequences with the connectionist temporal classification (CTC [8]) loss:

$\mathcal{L}_{CTC}(\theta) = -\sum_{(x_n, z_n) \in \mathcal{D}} \log \sum_{\pi \,:\, \mathcal{B}(\pi) = z_n} p(\pi \mid x_n)$    (6)

where $\pi$ is a sequence of phone or blank labels and $\mathcal{B}$ is the CTC collapse function that removes label repetitions and blank labels. This network is used to align the dataset, i.e., to compute, for each $x_n$:

$\hat{\pi}_n = \arg\max_{\pi \,:\, \mathcal{B}(\pi) = z_n} p(\pi \mid x_n)$    (7)

Let $R$ be the receptive field of the convolutional network. By removing blanks and label repetitions, $\mathcal{B}(\hat{\pi}_n^{(t-R)} \cdots \hat{\pi}_n^{(t)})$ will yield the sequence of phones appearing in the utterance between times $t-R$ and $t$. Any suffix of that sequence may be considered an example of positive keyword to be detected at time $t$. The set of positive examples at time $t$ is created by randomly sampling suffixes of that sequence, within predefined length bounds. Note that the sampled "keyword" phone sequences do not necessarily correspond to actual words during training.

Since the network is trained with batches of dataset samples, the set of negative keyword examples for one sample can merely be the union of sets of positive examples for the other samples in the batch.
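The sampling procedure can be sketched as follows (our illustration: `alignment` is the frame-wise label sequence of eq. (7), `blank` the blank label id, and the suffix-length bound `min_len` and the number of samples are placeholders, not values from the paper):

```python
import random

def ctc_collapse(labels, blank):
    """B(.): drop repeated labels, then blanks."""
    out, prev = [], None
    for l in labels:
        if l != prev and l != blank:
            out.append(l)
        prev = l
    return out

def sample_positives(alignment, t, R, blank, num_samples=2, min_len=2):
    """Suffixes of the phones seen in frames [t-R, t] serve as positive 'keywords' at t."""
    phones = ctc_collapse(alignment[max(0, t - R):t + 1], blank)
    suffixes = [phones[-k:] for k in range(min_len, len(phones) + 1)]
    return random.sample(suffixes, min(num_samples, len(suffixes)))

# Negatives for one utterance can simply be the positives sampled for the other
# utterances of the same minibatch.
```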

2.5 Adaptation on specific dataset

Acoustic KWS approaches such as [2] rely on phone predictions and might not gain much from fine-tuning the model on a keyword-specific dataset. In the proposed approach, the weights generated by the keyword encoder can serve as a starting point for re-training on keyword-specific data, when available. This amounts to rewriting eq. 2 as:

$w_i = g_\phi(p_i) + \delta_i$    (8)

Given a training dataset for a set of keywords, made of positive examples of the keywords and of negative data, the data-specific parameters $\delta_i$ can be adjusted by gradient descent to optimize the loss of eq. 5. By adjusting only those parameters, the model does not lose its ability to detect any other arbitrary keyword.
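Under the additive reading of eq. (8) above, the adaptation could look like the following sketch (ours): only the keyword-specific offsets are trained, while the keyword encoder and acoustic layers stay frozen.

```python
import torch

def make_adaptation_params(encoder: "KeywordEncoder", phone_ids: torch.Tensor):
    """Return the frozen starting point g_phi(p_i) and trainable offsets delta_i."""
    with torch.no_grad():
        base_kernels = encoder(phone_ids)                        # g_phi(p_i), kept fixed
    deltas = torch.zeros_like(base_kernels, requires_grad=True)  # data-specific parameters
    return base_kernels, deltas

# Fine-tuning outline: optimize only `deltas` with the loss of eq. (5) on the
# keyword-specific data, using kernels = base_kernels + deltas, e.g.
#   optimizer = torch.optim.Adam([deltas], lr=1e-4)
```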

3 Experimental results

3.1 Experimental setup

We trained quantized LSTM networks on Librispeech [17] with CTC and data augmentation. The inputs are MFCC features and the outputs are 46 phones plus a blank class. The details of the training and quantization procedure can be found in [2]. The weights of the LSTM cells are frozen after CTC training and not fine-tuned during the training of the KWS network, which allows them to also be used for ASR or with the method of [2]. We evaluated systems based on two such LSTM networks, with five layers of 64 and 96 units respectively.

The top softmax layer of that model is replaced by two convolutional layers to build the KWS network (shown in purple and green in Figure 1). The first convolutional layer has a kernel of five frames and 96 output tanh channels, followed by a max-pooling of size three and stride two. The top convolutional layer has a kernel size of 12. The total receptive field of the convolutional part covers 29 input frames. The keyword encoder is made of a bidirectional LSTM layer with 128 units in each direction, followed by a linear transform predicting the roughly 1.2k weights of the detection kernel for each keyword. Overall, the model based on the 5x64 (resp. 5x96) LSTM has 208.8k (resp. 440.7k) parameters, plus 1.2k parameters per keyword.
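As a sanity check of the figures above (our own arithmetic, not reproduced from the paper):

```python
channels, k1, pool, stride, k2 = 96, 5, 3, 2, 12   # layer sizes quoted above

rf_pooled = (k2 - 1) * stride + pool    # first-conv frames seen by the top conv through pooling
receptive_field = rf_pooled + (k1 - 1)  # 29 input frames in total
per_keyword_weights = k2 * channels     # 12 x 96 = 1152 weights, i.e. ~1.2k per keyword

print(receptive_field, per_keyword_weights)  # 29 1152
```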

The keyword detector and encoder are jointly trained for five epochs using minibatches of 128 audio samples, with two positive synthetic keyword samples in each set $\mathcal{P}_n^{(t)}$, using the Adam optimizer with a learning rate of 1e-4, following the procedure presented in Section 2.3. All the weights, including those predicted by the keyword encoder, are quantized to 8 bits.

3.2 Keyword spotting results

For this task, we crowd-sourced queries for two use-cases: a smart light scenario and a washing machine scenario¹ (cf. [2] for details). Each dataset was re-recorded in clean and in noisy, reverberated far-field conditions with an SNR of 5 dB. Each query contains between one and four keywords and is expressed in natural language (e.g. "could you [turn on] the lights in the [bedroom]"). We measure the ratio of exactly parsed queries, i.e. those for which the sequence of detected keywords exactly matches the expected one, and the F1 score as a measure of performance at the keyword level.

The proposed KWS model is compared to five baselines, presented in detail in [2]. They all rely on the same quantized LSTM acoustic model that serves as the base of the KWS neural network. The Viterbi and Lattice baselines are LVCSR baselines, in which the keywords are detected in the Viterbi decoding or in the decoding lattice of the utterance, with a 200k-word vocabulary and a trigram language model. The Filler baseline employs a standard keyword-filler model, where the filler model is a phone loop. The Greedy and Sequence baselines correspond to the approach presented in [2], similar to [9], with two post-processing methods.

Figure 2: F1 scores and exact rates of the proposed method (in red and green, showing the averaged results over 5 runs and the worst and best ones), compared to baselines on two keyword spotting tasks in clean and noisy conditions
Dataset              lights              washing
Model                clean     noisy     clean     noisy
5x64   [2]           0.754     0.522     0.857     0.694
       Proposed      0.850     0.671     0.884     0.765
       (worst)       0.835     0.654     0.878     0.751
       (best)        0.864     0.687     0.889     0.784
5x96   [2]           0.808     0.588     0.871     0.725
       Proposed      0.873     0.694     0.900     0.773
       (worst)       0.856     0.665     0.892     0.764
       (best)        0.883     0.709     0.911     0.782
Table 1: F1 score for the proposed model and the Sequence baseline from [2] on two keyword spotting tasks in clean and noisy conditions.

We trained five models with different random initializations. The results are displayed in Figure 2 and compared to the Sequence post-processing approach of [2] in Table 1. The LVCSR-based results are low, although using recognition lattices provides a large improvement over Viterbi decoding. The keyword-filler model is the best of the traditional methods. The baselines from [2] are competitive with the filler model: with the same model size, they are almost always better, with both the greedy and sequence post-processors, and even with a smaller model, the sequence post-processor yields better results than the filler model. The proposed approach outperforms all baselines, even with the small model, which has half as many parameters.

3.3 Analysis of learned filters

(a) Google speech commands
(b) Laundry commands
(c) Audio controls
Figure 3: Speech commands results before (Section 3.4, green) and after (Section 3.5, red) fine-tuning, averaged over 5 runs, compared with a classification model fully trained on specific training data (dashed black line) and to the acoustic KWS approach of [2] (blue). For Google speech commands, the standard evaluation measuring the false alarm rate on other commands is applied.

To analyze what the keyword encoder has learned, we compare the convolution filters predicted for different inputs. In particular, we measure the Euclidean distance between predicted filters: if two filters are close, the KWS network will tend to make similar predictions and may confuse the corresponding words.

Keyword Closest vocabulary words
turn on anon, non, turnin, fernand, maranon
decrease crease, increase, encrease, greece
brightness rightness, uprightness, triteness, greatness
bedroom bathroom, begloom, broom, broome
play flay, clay, blaye, splay, ley, lay
start astarte, tart, stuart, upstart, sturt
Table 2: Keywords and closest words in a vocabulary, measured as the Euclidean distance between the predicted filters.

The filters for all the words in a vocabulary are computed and compared to the filters predicted for some keywords. Table 2 shows, for each keyword, the vocabulary words with the closest filters. We observe that they tend to be words with similar pronunciations, of about the same length or shorter, and mostly with the same or very similar suffixes.
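This analysis can be reproduced with a few lines of code; the sketch below is ours and assumes the `KeywordEncoder` above and a hypothetical `lexicon` mapping words to (1, L) phone-id tensors.

```python
import torch

def nearest_words(encoder: "KeywordEncoder", lexicon: dict, keyword: str, topk: int = 5):
    """Rank vocabulary words by Euclidean distance between their predicted filters
    and the filter predicted for `keyword`."""
    with torch.no_grad():
        target = encoder(lexicon[keyword]).flatten()
        dists = {w: torch.dist(encoder(p).flatten(), target).item()
                 for w, p in lexicon.items() if w != keyword}
    return sorted(dists, key=dists.get)[:topk]
```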

Figure 4: Pairwise filter Euclidean distances between keywords of the lights dataset (darker is closer).

Figure 4 shows the pairwise distances between the filters for the keywords in the set for the lights dataset. We see that increase and decrease on the one hand, and turn on and turn off on the other hand are very close to one another, which correlates with the confusions of the KWS network: 37% of the confusions are between increase and decrease, 29% for turn on and turn off.

3.4 Speech commands results

For the speech command task, the goal is to detect a single command among a pre-defined set. We evaluate our approach on Google's speech command dataset [23]² and on two in-house crowd-sourced datasets of speech commands: a dataset of audio control commands with 10 commands ("turn on, turn off, play, pause, start, stop, next track, previous track, volume up, volume down") and a dataset of laundry commands with 5 commands ("cancel, wash whites, wash delicate, wash heavy duty, wash normal"). We measure the false rejection rate across all commands on the command datasets, and the number of false alarms per hour on a dataset of 45 hours of negative data made of music and speech. For the Google speech command dataset, we follow the usual evaluation procedure and measure the false alarm rate using the other commands as negative data.

We compare the results of our 5x64 model to a model trained on specific training data, similar in size (186k parameters) and performance to the res15 ResNet architecture of [22], and to the Greedy acoustic KWS baseline using the same 5x64 base network [2]. The results are depicted in Figure 3. As expected, the model trained on specific training data ("with data", in black) is much better than the two open-vocabulary approaches. Nonetheless, the proposed method outperforms the acoustic KWS baseline. The results on Google speech commands look worse, but that dataset mainly consists of short commands of two or three phonemes, which are harder to discriminate, and the negative data in this case only contains other commands.

Moreover, the proposed model shares most of its weights with the one used in the acoustic KWS baseline, since only the top layers are retrained, so the two could run jointly with little computation overhead. Finally, the models labeled "with data" are trained on specific training data, whereas the proposed model can readily be applied to any keyword set without retraining.

3.5 Fine-tuning on keyword-specific training data

Since training data is available for these datasets, we fine-tuned the filters with the method presented in Section 2.5. The results are also shown (in red) in Figure 3. After fine-tuning, the performance of the proposed model is similar to that of the model trained exclusively on specific training data. The results on Google speech commands are not as close, which might again be due to the fact that both the positive and negative data are short keywords in this dataset.

In these speech command tasks, the keywords are isolated: they are not inside a sentence. Before fine-tuning, the network was only trained on sentences. It is therefore possible that the gap between the network without and with fine-tuning (and “with data”) is merely due to learning to detect silences surrounding the commands. This should be explored in future work. It is also worth noting that the fine-tuned network has not lost its ability to detect any arbitrary keyword, since only the weights of the top layer are modified. New keywords may then be added to the network without having to retrain it all. The baseline network does not offer these possibilities.

4 Conclusion

We presented an open-vocabulary keyword spotting system which does not require training data specific to the keywords to be detected at inference. In contrast to most acoustic keyword spotting models, it directly predicts a confidence score at the keyword level, alleviating the need for confidence calibration. We have shown that the proposed model outperforms acoustic KWS baselines for the detection of keywords both inside utterances and as isolated speech commands. We also proposed a method to fine-tune the model on specific training data, which makes it as good as a speech command detector trained on such data while retaining its ability to detect other arbitrary keywords.

Footnotes

  1. The datasets are publicly available at https://bit.ly/39YV1te
  2. evaluating on the same 12 commands as in [20, 22]

References

  1. K. Audhkhasi, A. Rosenberg, A. Sethy, B. Ramabhadran and B. Kingsbury (2017) End-to-end asr-free keyword search from speech. IEEE Journal of Selected Topics in Signal Processing 11 (8), pp. 1351–1359. Cited by: §1.
  2. T. Bluche, M. Primet and T. Gisselbrecht (2020) Small-footprint open-vocabulary keyword spotting with quantized lstm networks. arXiv preprint arXiv:2002.10851. Cited by: §1, §2.5, Figure 3, §3.1, §3.2, §3.2, §3.2, §3.4, Table 1.
  3. M. G. Brown, J. T. Foote, G. J. Jones, K. S. Jones and S. J. Young (1997) Open-vocabulary speech indexing for voice and video mail retrieval. In Proceedings of the fourth ACM international conference on Multimedia, pp. 307–316. Cited by: §1.
  4. Z. Chen, Y. Qian and K. Yu (2018) Sequence discriminative training for deep learning based acoustic keyword spotting. Speech Communication 102, pp. 100–111. Cited by: §1.
  5. Z. Chen, Y. Zhuang and K. Yu (2017) Confidence measures for ctc-based phone synchronous decoding. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4850–4854. Cited by: §1.
  6. A. Coucke, M. Chlieh, T. Gisselbrecht, D. Leroy, M. Poumeyrol and T. Lavril (2019) Efficient keyword spotting using dilated convolutions and gating. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6351–6355. Cited by: §1, §2.
  7. J. S. Garofolo, C. G. Auzanne and E. M. Voorhees (2000) The trec spoken document retrieval track: a success story.. NIST SPECIAL PUBLICATION SP 500 (246), pp. 107–130. Cited by: §1.
  8. A. Graves, S. Fernandez, F. Gomez and J. Schmidhuber (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on Machine learning, pp. 369–376. Cited by: §2.4.
  9. K. Hwang, M. Lee and W. Sung (2015) Online keyword spotting with a character-level recurrent neural network. arXiv preprint arXiv:1512.08903. Cited by: §1, §3.2.
  10. X. Jia, B. De Brabandere, T. Tuytelaars and L. V. Gool (2016) Dynamic filter networks. In Advances in Neural Information Processing Systems, pp. 667–675. Cited by: §1.
  11. B. Klein, L. Wolf and Y. Afek (2015) A dynamic convolutional layer for short range weather prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4840–4848. Cited by: §1.
  12. C. Lengerich and A. Hannun (2016) An end-to-end architecture for keyword spotting and voice activity detection. arXiv preprint arXiv:1611.09405. Cited by: §1.
  13. L. Lugosch, S. Myer and V. S. Tomar (2018) DONUT: ctc-based query-by-example keyword spotting. arXiv preprint arXiv:1811.10736. Cited by: §1.
  14. J. Mamou, D. Carmel and R. Hoory (2006) Spoken document retrieval from call-center conversations. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 51–58. Cited by: §1.
  15. D. R. Miller, M. Kleber, C. Kao, O. Kimball, T. Colthurst, S. A. Lowe, R. M. Schwartz and H. Gish (2007) Rapid and accurate spoken term detection. In Eighth Annual Conference of the international speech communication association, Cited by: §1.
  16. H. Noh, P. Hongsuck Seo and B. Han (2016) Image question answering using convolutional neural network with dynamic parameter prediction. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 30–38. Cited by: §1.
  17. V. Panayotov, G. Chen, D. Povey and S. Khudanpur (2015) Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210. Cited by: §3.1.
  18. A. Saade, A. Coucke, A. Caulier, J. Dureau, A. Ball, T. Bluche, D. Leroy, C. Doumouro, T. Gisselbrecht and F. Caltagirone (2019) Spoken language understanding on the edge. NeurIPS Workshop on Energy Efficient Machine Learning and Cognitive Computing. Cited by: §1.
  19. N. Sacchi, A. Nanchen, M. Jaggi and M. Cernak (2019) Open-vocabulary keyword spotting with audio and text embeddings. In INTERSPEECH 2019-IEEE International Conference on Acoustics, Speech, and Signal Processing, Cited by: §1.
  20. T. Sainath and C. Parada (2015) Convolutional neural networks for small-footprint keyword spotting. In Sixteenth Annual Conference of the International Speech Communication Association, Cited by: §1, §2, footnote 2.
  21. I. Szoke, P. Schwarz, P. Matejka, L. Burget, M. Karafiát, M. Fapso and J. Cernocky (2005) Comparison of keyword spotting approaches for informal continuous speech. In Ninth European conference on speech communication and technology, Cited by: §1.
  22. R. Tang and J. Lin (2018) Deep residual learning for small-footprint keyword spotting. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5484–5488. Cited by: §3.4, footnote 2.
  23. P. Warden (2018) Speech commands: a dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209. Cited by: §3.4.
  24. W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu and G. Zweig (2016) Achieving human parity in conversational speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, pp. 99. Cited by: §1.
  25. Y. Yang, A. Lalitha, J. Lee and C. Lott (2018) Automatic grammar augmentation for robust voice command recognition. arXiv preprint arXiv:1811.06096. Cited by: §1.
  26. Y. Zhang, N. Suda, L. Lai and V. Chandra (2017) Hello edge: keyword spotting on microcontrollers. CoRR abs/1711.07128. External Links: Link, 1711.07128 Cited by: §1.