
# Speech Enhancement for Wake-Up-Word detection in Voice Assistants

## Abstract

Keyword spotting and, in particular, Wake-Up-Word (WUW) detection is a very important task for voice assistants. A common issue is that voice assistants are easily activated by background noise, such as music, TV or background speech, that accidentally triggers the device. In this paper, we propose a Speech Enhancement (SE) model adapted to the task of WUW detection that aims to increase the recognition rate and reduce false alarms in the presence of these types of noise. The SE model is a fully-convolutional denoising auto-encoder operating at waveform level, trained using log-Mel Spectrogram and waveform reconstruction losses together with the BCE loss of a simple WUW classification network. A new database has been purposely prepared for the task of recognizing the WUW in challenging conditions, containing negative samples that are phonetically very similar to the keyword. The database is extended with public databases and exhaustive data augmentation to simulate different noises and environments. The results obtained by concatenating the SE with simple and state-of-the-art WUW detectors show that the SE does not have a negative impact on the recognition rate in quiet environments, while increasing the performance in the presence of noise, especially when the SE and WUW detector are trained jointly end-to-end.

David Bonet, Guillermo Cámbara, Fernando López, Pablo Gómez, Carlos Segura, Jordi Luque

Universitat Politècnica de Catalunya (UPC), Spain
Universitat Pompeu Fabra (UPF), Spain
Telefónica, Spain
Telefónica Research, Spain

jordi.luque@telefonica.com

Index Terms: keyword spotting, speech enhancement, wake-up-word, deep learning, convolutional neural network

## 1 Introduction

Voice interaction with devices is becoming ubiquitous. Most devices avoid excessive resource usage by means of a trigger-word detector, which ensures that a Speech-To-Text tool runs only when needed, upon the start of a conversation. It is key to start this conversation only when the user is addressing the device; otherwise the user experience is noticeably degraded. Thus, the wake-up-word detection system must be robust enough to avoid wake-ups caused by TV, music, speech and sounds that do not contain the key phrase.

A common approach to reduce the impact of this type of noise on the system is the adoption of speech enhancement algorithms. Speech enhancement is the task of improving the perceptual intelligibility and quality of speech by removing background noise [15]. Its main applications are in mobile and internet communications [21] and in hearing aids [29], but SE has also been applied successfully to automatic speech recognition systems [31, 16, 27].

Traditional SE methods involve a characterization step of the noise spectrum, which is then used to try to reduce the noise in the regenerated speech signal. Examples of these approaches are spectral subtraction [29], Wiener filtering [17] and subspace algorithms [9]. One of the main drawbacks of the classical approaches is that they are not very robust against non-stationary noises or other types of noise that can mask speech, such as background speech. In recent years, Deep Learning approaches have been widely applied to SE at the waveform level [22, 20] and at the spectral level [27, 18]. In the first case, a common architecture falls within the encoder-decoder paradigm. In [19], the authors proposed a fully-convolutional generative adversarial network structured as an auto-encoder with U-Net-like skip connections. Other recent work [8] proposes a similar architecture at the waveform level that includes an LSTM between the encoder and the decoder, trained directly with a regression loss combined with a spectrogram-domain loss.

Inspired by these recent models, we propose a similar SE auto-encoder architecture in the time domain that is optimized not only by minimizing waveform and Mel-spectrogram regression losses, but also includes a task-dependent classification loss provided by a simple WUW classifier acting as a Quality-Net [11]. This last term serves as a task-dependent objective quality measure that trains the model to enhance important speech features that might be degraded otherwise.

## 2 Speech Enhancement

Speech enhancement is interesting for trigger-phrase detection since it tries to remove noise that could trigger the device, while at the same time improving speech quality and intelligibility for better detection. We tackle the most common noisy environments where voice assistants are used: TV, music, background conversations, office noise and living room noise. Some of these types of background noise, such as TV and background conversations, are the most likely to trigger the voice assistant and are also the most challenging to remove.

### 2.1 Model

Our model has a fully-convolutional denoising auto-encoder architecture with skip connections (Fig. 1), which has proven to be very effective in SE tasks [19], working end-to-end at waveform level. In training, we input a noisy audio x, comprised of a clean speech signal s and background noise n, so that x = s + n.

The encoder compresses the input signal and expands the number of channels. It is composed of six convolutional blocks (ConvBlock1D), each consisting of a convolutional layer followed by instance normalization and a rectified linear unit (ReLU). The same kernel size and stride are shared by all blocks except the first layer, which uses different values. The compressed signal goes through an intermediate stage where the shape is preserved, consisting of three residual blocks (ResBlock1D), each formed by two shape-preserving ConvBlock1D, with a skip connection added from the input of the residual block to its output. The last stage of the SE model is the decoder, which recovers the original shape of the raw audio at the output. Its architecture mirrors the encoder, with deconvolutional blocks (DeconvBlock1D) that replace the convolutional layers of the ConvBlock1D with transposed convolutional layers. Skip connections from the encoder blocks to the decoder blocks are also used to preserve low-level detail when reconstructing the waveform.
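The architecture above can be sketched as follows in PyTorch. This is a minimal illustration, not the exact model: the channel counts, the shared kernel size and stride of 4 and 2, and the Tanh output are assumptions of this sketch, since the paper's exact hyperparameters are not reproduced here.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out, k=4, s=2):
    # ConvBlock1D: convolution -> instance normalization -> ReLU.
    return nn.Sequential(
        nn.Conv1d(c_in, c_out, kernel_size=k, stride=s, padding=1),
        nn.InstanceNorm1d(c_out),
        nn.ReLU(),
    )

class ResBlock1D(nn.Module):
    # Two shape-preserving ConvBlock1D with a skip from input to output.
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(
            conv_block(c, c, k=3, s=1), conv_block(c, c, k=3, s=1)
        )

    def forward(self, x):
        return x + self.body(x)

def deconv_block(c_in, c_out, last=False):
    # DeconvBlock1D: transposed convolution in place of the convolution.
    layers = [nn.ConvTranspose1d(c_in, c_out, kernel_size=4, stride=2, padding=1)]
    # Tanh on the final layer (bounded waveform) is an assumption of this sketch.
    layers += [nn.Tanh()] if last else [nn.InstanceNorm1d(c_out), nn.ReLU()]
    return nn.Sequential(*layers)

class DenoisingAE(nn.Module):
    def __init__(self, channels=(16, 32, 32, 64, 64, 128)):
        super().__init__()
        chs = (1,) + tuple(channels)
        # Encoder: six ConvBlock1D halving the length and widening channels.
        self.enc = nn.ModuleList(conv_block(chs[i], chs[i + 1]) for i in range(6))
        # Intermediate stage: three shape-preserving residual blocks.
        self.mid = nn.Sequential(*(ResBlock1D(chs[-1]) for _ in range(3)))
        # Decoder: mirrored structure recovering the raw-audio shape.
        self.dec = nn.ModuleList(
            deconv_block(chs[i + 1], chs[i], last=(i == 0))
            for i in reversed(range(6))
        )

    def forward(self, x):  # x: (batch, 1, samples), samples divisible by 64
        skips = []
        for block in self.enc:
            x = block(x)
            skips.append(x)
        x = self.mid(x)
        for block, skip in zip(self.dec, reversed(skips)):
            x = block(x + skip)  # U-Net-like additive skip connection
        return x
```

A 1.5 s window at 16 kHz passes through unchanged in shape, which is what allows the enhanced waveform to be fed directly to any WUW classifier.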

We use a regression loss (L1) at raw waveform level together with another L1 loss over the log-Mel Spectrogram, as proposed in [28], to reconstruct a "cleaned" signal at the output. Finally, we include the classification loss (BCE) when training the SE model jointly with the classifier, or when concatenating a pretrained classifier at its output. Thus, we also optimize the SE model for the specific task of WUW classification. Our final loss function is defined as a linear combination of the three losses:

L_T = α·L_raw(y, ŷ) + β·L_spec(S(y), S(ŷ)) + γ·L_BCE    (1)

where α, β and γ are hyperparameters weighting each loss term, y and ŷ are the clean and enhanced waveforms, and S(·) denotes the log-Mel Spectrogram of the signal, computed using 512 FFT bins, a 20 ms window with a 10 ms shift and 40 filters on the Mel scale.

## 3 Methodology

### 3.1 Databases

The database used for the experiments presented here consists of WUW samples labeled as positive and other non-WUW samples labeled as negative. Since the chosen keyword is "OK Aura", which triggers Telefónica's home assistant, Aura, positive samples are drawn from the company's in-house databases. Some of the negative samples have also been recorded in those databases, but we additionally add speech and other types of acoustic events from external datasets, so the models gain robustness from further data. All data used is detailed in this section.

#### OK Aura Database

In a first round, around 4,300 WUW samples from 360 speakers were recorded, resulting in 2.8 hours of audio. Office ambient noise was recorded as well, with the aim of having samples for noise data augmentation. A second data collection round was carried out to study and improve sensitive cases where WUW modules typically underperform. This second dataset contains rich metadata about positive and negative utterances, such as room distance, speech accent, emotion, age and gender. Furthermore, the negative utterances contain words phonetically similar to "OK Aura", since these are the most ambiguous for a classifier to recognize. Detailed information about data acquisition is given in the following subsection.

#### Data acquisition

A web-based Jotform form was designed for data collection. The form is open and still receiving input data from volunteers, so readers are also invited to contribute to the dataset1. As of the date of this work, 1,096 samples from 80 speakers have been recorded, amounting to 1.2 hours of audio. Volunteers are asked to pronounce various scripted utterances both at close distance and at two meters from the device microphone. The similarity levels are the following:

1. Exact WUW, in an isolated manner: OK Aura.

2. Exact WUW, in a context: Perfecto, voy a mirar qué dan hoy. OK Aura.

3. Contains ”Aura”: Hay un aura de paz y tranquilidad.

4. Contains "OK": OK, a ver qué ponen en la tele.

5. Contains similar word units to ”Aura”: Hola Laura.

6. Contains similar word units to ”OK”: Prefiero el hockey al baloncesto.

7. Contains similar word units to "OK Aura": Porque Laura, ¿qué te pareció la película?

#### External data

General negative examples have been randomly chosen from the publicly available Spanish Common Voice corpus [3], which currently holds over 300 hours of validated audio. However, we keep a 10:1 ratio between negative and positive samples, since this ratio proves to yield good results in [12], avoiding bigger ratios that increase computational time. We therefore use a Common Voice partition consisting of 55 h for training, 7 h for development and 7 h for testing.

Background noises were selected from various public datasets according to different use-case scenarios: living room background noise (HOME-LIVINGB) from the QUT-NOISE Database [7], TV audio from the IberSpeech-RTVE Challenge [14], and music2 and conversations3 from free libraries.

#### Data processing

All audio samples are monaural signals stored in Waveform Audio File Format (WAV) with a sampling rate of 16 kHz. The collected speech data was processed with a Speech Activity Detection (SAD) module producing timestamps where speech occurs. For this purpose, the pyannote.audio tool [4] was used, trained on the AMI corpus [6]. This allowed us to use only the valid speech segments of the collected audio.

We mainly used two features to train the models: Mel-Frequency Cepstral Coefficients (MFCCs) and the log-Mel Spectrogram. The MFCCs were computed by first filtering the audio with a band-pass filter (20 Hz to 8 kHz) and then extracting the first thirteen coefficients with a window size of 100 ms and a frame shift of 50 ms. The procedure to extract the log-Mel Spectrogram S(·) is detailed in Section 2.1.

Train, development and test partitions are split ensuring that no speaker or background noise is repeated between partitions, maintaining an 80-10-10 proportion. The total data, containing internal and external datasets, consists of 50,737 non-WUW samples and 4,651 WUW samples.

### 3.2 Data augmentation

Several Room Impulse Responses (RIRs) were created with the Image Source Method (ISM) [2] for rooms of randomly sampled dimensions, with microphone and source randomly located at any point within a bounded height. All original TV and music recordings were convolved with different RIRs to simulate the signal picked up by the microphone of the device in the room.

Adding background noise to clean speech signals is the main data augmentation technique used in the training stage. We use background noises of different scenarios (TV, music, background conversations, office noise and living room noise) and a wide range of SNRs to improve the performance of the models against noisy environments. In each epoch, we create different noisy samples by randomly selecting a sample of background noise for each speech event and combining them with a randomly chosen SNR in a specified range.
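The noise-mixing step above can be sketched as follows: a noise clip is cropped (or tiled) to the speech length and scaled so that the speech-to-noise power ratio matches a target SNR drawn from the chosen range. The function name and exact scaling are assumptions of this sketch.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db, rng=None):
    """Add a random crop of `noise` to `speech` at the given SNR in dB."""
    rng = rng if rng is not None else np.random.default_rng()
    # Tile the noise if it is shorter than the speech, then random-crop it.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    start = rng.integers(0, len(noise) - len(speech) + 1)
    noise = noise[start:start + len(speech)]
    # Scale the noise so that P_speech / P_noise equals the target SNR.
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

Drawing `snr_db` uniformly from the training range and a fresh noise crop per epoch yields a different noisy version of each speech sample every time it is seen.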

### 3.3 Wake-Up Word Detection Models

With the aim of assessing the quality of the trained SE models, we use several trigger-word detection classifiers, reporting the impact of the SE module on WUW classification performance. The WUW classifiers used here are: LeNet, a well-known standard classifier that is easy to optimize [13]; Res15, Res15-narrow and Res8, based on Tang and Lin's reimplementation [26] of Sainath and Parada's Convolutional Neural Networks (CNNs) for keyword spotting [23], using residual learning techniques with dilated convolutions [25]; SGRU and SGRU2, two Recurrent Neural Network (RNN) models based on the open-source tool Mycroft Precise [24], a lightweight wake-up-word detection tool implemented in TensorFlow, of which we implemented two larger variants in PyTorch; and CNN-FAT2019, a CNN architecture adapted from a kernel [1] in Kaggle's FAT 2019 competition [10], which has shown good performance in tasks like audio tagging and the detection of gender, identity and speech events from pulse signals [5].

### 3.4 Training

Speech signals and background noises are combined randomly following the procedure explained in Section 3.2, with a given SNR range. The SE model is trained to cover a wide SNR range, whereas WUW models are trained to cover two scenarios: a classifier trained with the same SNR range as the SE model, and a classifier less aware of noise, trained with a narrower SNR range. This way, it is possible to study the impact of the SE model depending on whether the classifier has been trained with more or less noise.

Data imbalance is addressed by balancing the classes in each batch using a weighted sampler. We use a fixed window length of 1.5 seconds based on the annotated timestamps for our collected database, and random cuts for the remaining Common Voice samples.
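The per-batch class balancing can be sketched with PyTorch's weighted sampler: each sample is drawn with probability inversely proportional to its class frequency. The toy labels and feature shapes below are illustrative only.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

labels = torch.tensor([0] * 90 + [1] * 10)      # imbalanced toy labels
class_counts = torch.bincount(labels).float()
weights = 1.0 / class_counts[labels]            # per-sample weight = 1 / class size

# Sampling with replacement so minority-class samples reappear within an epoch.
sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)

dataset = TensorDataset(torch.randn(100, 16000), labels)  # dummy 1.5 s-style windows
loader = DataLoader(dataset, batch_size=50, sampler=sampler)
```

On average, each batch then contains roughly as many WUW as non-WUW windows despite the 10:1 imbalance of the underlying data.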

All models are trained with early stopping on the validation loss, with a patience of 10 epochs. We use the Adam optimizer with a learning rate of 0.001 and a batch size of 50. Loss (1) allows training the models in multiple ways, and we define different SE models and classifiers based on the loss terms used:

1. Classifier: we remove the auto-encoder from the architecture (Fig. 1) and train any of the classifiers using the noisy audio as input, so only the classification loss is active (α = β = 0).

2. SE model (SimpleSE): we remove the classifier from the architecture and optimize the auto-encoder with the reconstruction losses only (γ = 0).

3. SE model + frozen classifier (FrozenSE): operations of the classifier are dropped from the backward graph for gradient calculation, optimizing only the SE model for a given pretrained classifier (LeNet).

4. SE model + classifier (JointSE): the auto-encoder and LeNet are trained jointly using all three losses.

### 3.5 Tests

All models take as input windows of 1.5 seconds of audio, to ensure that common WUW utterances fit fully within them, since the average "OK Aura" is about 0.8 seconds long. We therefore perform an atomic test evaluating whether a single window contains the WUW or not. Both negative and positive samples are assigned a background noise sample, combined at a random SNR within certain ranges, as described in Section 3.4.

Given the output scores of the models, the threshold to decide whether a test sample is a WUW is chosen as the one yielding the biggest difference between true and false positive rates, based on Youden's J statistic [30]. Once the threshold is set, the macro F1-score is computed in order to balance the WUW/non-WUW proportions in the results. We average these scores across all WUW classifiers described in Section 3.3, for every SNR range.
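The evaluation procedure above can be sketched in NumPy: sweep the scores as candidate thresholds, pick the one maximizing Youden's J = TPR − FPR, then score the binarized predictions with a macro F1. The helper names are assumptions of this sketch.

```python
import numpy as np

def youden_threshold(scores, labels):
    """Return the score threshold maximizing Youden's J = TPR - FPR."""
    order = np.argsort(scores)[::-1]          # sort scores descending
    scores, labels = scores[order], labels[order]
    pos = labels.sum()
    neg = len(labels) - pos
    tpr = np.cumsum(labels) / pos             # TPR at threshold = scores[i]
    fpr = np.cumsum(1 - labels) / neg         # FPR at threshold = scores[i]
    return scores[np.argmax(tpr - fpr)]

def macro_f1(preds, labels):
    """Unweighted mean of the per-class F1-scores (binary case)."""
    f1s = []
    for cls in (0, 1):
        tp = np.sum((preds == cls) & (labels == cls))
        fp = np.sum((preds == cls) & (labels != cls))
        fn = np.sum((preds != cls) & (labels == cls))
        f1s.append(2 * tp / (2 * tp + fp + fn + 1e-12))
    return float(np.mean(f1s))
```

A sample is then classified as WUW when its score is at or above the selected threshold.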

## 4 Results

Figure 2 illustrates the improvement in WUW detection in noisy scenarios obtained by concatenating our FrozenSE model with all WUW classifiers described in Section 3.3, trained with low noise (a high-SNR range), as could be found in simple voice assistant systems. Applying SE in quiet scenarios maintains fairly good results, and improves them at lower SNR ranges.

If we train the classifiers with more data augmentation (a wider SNR range), the baseline classifier results improve in noisier scenarios. Results using FrozenSE do not degrade, but the improvement in severely noisy ranges is not as large as in Figure 2; see Figure 3.

In Section 3.4 we defined the parameters of loss function (1) to train a classifier (case 1), and different approaches to train the SE model, either standalone (cases 2, 3) or jointly with the classifier (case 4). Figure 4 shows that JointSE performs better than all other cases in almost every SNR range. From 40 dB to 10 dB of SNR, the results are very similar for the four models. In contrast, in the noisiest ranges the classifier without an SE model is the worst performer, followed by the SimpleSE case, where only the waveform and spectral reconstruction losses are used. The FrozenSE case, which includes the classification loss in the training stage, improves the results for the wake-up-word detection task. However, the best results are obtained with the JointSE case, where the SE model and LeNet are trained jointly using all three losses.

We compared the WUW detection results of our JointSE with other state-of-the-art SE models (SEGAN [19] and Denoiser [8]), each followed by a classifier (data-augmented LeNet), in different noise scenarios. Table 1 shows that when the models are trained together with the task loss, the results in our setup are better than with other more powerful but more general SE models, since there is no mismatch between the SE and the classifier in the end-to-end system, which is also better adapted to common home noises. JointSE improves detection over the no-SE case, especially in scenarios with background conversations, loud office noise or loud TV; see Table 2.

## 5 Conclusions

In this paper we proposed an SE model adapted to the task of WUW detection in voice assistants for the home environment. The SE model is a fully-convolutional denoising auto-encoder operating at waveform level, trained using log-Mel Spectrogram and waveform regression losses together with a task-dependent WUW classification loss. Results show that in clean and slightly noisy conditions, SE generally does not bring a substantial improvement over a classifier trained with proper data augmentation, but in very noisy conditions SE does improve performance, especially when the SE and WUW detector are trained jointly end-to-end.

### Footnotes

1. https://form.jotform.com/201694606537056
2. https://freemusicarchive.org/
3. http://www.podcastsinspanish.org/

### References

1. M. H. "mhiro2" (2019) Freesound Audio Tagging 2019: Simple 2D-CNN Classifier with PyTorch. Note: https://www.kaggle.com/mhiro2/simple-2d-cnn-classifier-with-pytorch/ Cited by: §3.3.
2. J. B. Allen and D. A. Berkley (1979) Image method for efficiently simulating small-room acoustics. The Journal of the Acoustical Society of America 65 (4), pp. 943–950. Cited by: §3.2.
3. R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers and G. Weber (2019) Common Voice: a massively-multilingual speech corpus. arXiv preprint arXiv:1912.06670. Cited by: §3.1.3.
4. H. Bredin, R. Yin, J. M. Coria, G. Gelly, P. Korshunov, M. Lavechin, D. Fustes, H. Titeux, W. Bouaziz and M. Gill (2020-05) pyannote.audio: neural building blocks for speaker diarization. In ICASSP 2020, IEEE International Conference on Acoustics, Speech, and Signal Processing, Barcelona, Spain. Cited by: §3.1.4.
5. G. Cámbara, J. Luque and M. Farrús (2020) Detection of speech events and speaker characteristics through photo-plethysmographic signal neural processing. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7564–7568. Cited by: §3.3.
6. J. Carletta (2007) Unleashing the killer corpus: experiences in creating the multi-everything AMI Meeting Corpus. Language Resources and Evaluation 41 (2), pp. 181–190. Cited by: §3.1.4.
7. D. B. Dean, S. Sridharan, R. J. Vogt and M. W. Mason (2010) The QUT-NOISE-TIMIT corpus for the evaluation of voice activity detection algorithms. Proceedings of Interspeech 2010. Cited by: §3.1.3.
8. A. Défossez, G. Synnaeve and Y. Adi (2020) Real time speech enhancement in the waveform domain. arXiv preprint arXiv:2006.12847. Cited by: §1, §4.
9. Y. Ephraim and H. L. Van Trees (1995) A signal subspace approach for speech enhancement. IEEE Transactions on speech and audio processing 3 (4), pp. 251–266. Cited by: §1.
10. E. Fonseca, M. Plakal, F. Font, D. P. Ellis and X. Serra (2019) Audio tagging with noisy labels and minimal supervision. arXiv preprint arXiv:1906.02975. Cited by: §3.3.
11. S. Fu, C. Liao and Y. Tsao (2019) Learning with learned loss function: speech enhancement with quality-net to improve perceptual evaluation of speech quality. IEEE Signal Processing Letters 27, pp. 26–30. Cited by: §1.
12. J. Hou, Y. Shi, M. Ostendorf, M. Hwang and L. Xie (2020) Mining effective negative training samples for keyword spotting. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7444–7448. Cited by: §3.1.3.
13. Y. LeCun (2015) LeNet-5, convolutional neural networks. URL: http://yann. lecun. com/exdb/lenet 20 (5), pp. 14. Cited by: §3.3.
14. E. Lleida, A. Ortega, A. Miguel, V. Bazán-Gil, C. Pérez, M. Gómez and A. de Prada (2019) Albayzin 2018 evaluation: the IberSpeech-RTVE challenge on speech technologies for Spanish broadcast media. Applied Sciences 9 (24), pp. 5412. Cited by: §3.1.3.
15. P. C. Loizou (2013) Speech enhancement: theory and practice. CRC press. Cited by: §1.
16. A. L. Maas, Q. V. Le, T. M. O’Neil, O. Vinyals, P. Nguyen and A. Y. Ng (2012) Recurrent neural networks for noise reduction in robust ASR. In Thirteenth Annual Conference of the International Speech Communication Association, Cited by: §1.
17. J. Meyer and K. U. Simmer (1997) Multi-channel speech enhancement in a car environment using wiener filtering and spectral subtraction. In 1997 IEEE international conference on acoustics, speech, and signal processing, Vol. 2, pp. 1167–1170. Cited by: §1.
18. S. R. Park and J. Lee (2016) A fully convolutional neural network for speech enhancement. arXiv preprint arXiv:1609.07132. Cited by: §1.
19. S. Pascual, A. Bonafonte and J. Serrà (2017) SEGAN: speech enhancement generative adversarial network. arXiv preprint arXiv:1703.09452. Cited by: §1, §2.1, §4.
20. H. Phan, I. V. McLoughlin, L. Pham, O. Y. Chén, P. Koch, M. De Vos and A. Mertins (2020) Improving GANs for speech enhancement. arXiv preprint arXiv:2001.05532. Cited by: §1.
21. C. K. Reddy, E. Beyrami, J. Pool, R. Cutler, S. Srinivasan and J. Gehrke (2019) A scalable noisy speech dataset and online subjective test framework. arXiv preprint arXiv:1909.08050. Cited by: §1.
22. D. Rethage, J. Pons and X. Serra (2018) A wavenet for speech denoising. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5069–5073. Cited by: §1.
23. T. N. Sainath and C. Parada (2015) Convolutional neural networks for small-footprint keyword spotting. In Sixteenth Annual Conference of the International Speech Communication Association, Cited by: §3.3.
24. M. D. Scholefield (2019) Mycroft Precise. Note: https://github.com/MycroftAI/mycroft-precise Cited by: §3.3.
25. R. Tang and J. Lin (2017) Deep residual learning for small-footprint keyword spotting. CoRR abs/1710.10361. External Links: Link, 1710.10361 Cited by: §3.3.
26. R. Tang and J. Lin (2017) Honk: A PyTorch reimplementation of convolutional neural networks for keyword spotting. CoRR abs/1710.06554. External Links: Link, 1710.06554 Cited by: §3.3.
27. F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. Le Roux, J. R. Hershey and B. Schuller (2015) Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR. In International Conference on Latent Variable Analysis and Signal Separation, pp. 91–99. Cited by: §1, §1.
28. R. Yamamoto, E. Song and J. Kim (2020) Parallel WaveGAN: a fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6199–6203. Cited by: §2.1.
29. L. Yang and Q. Fu (2005) Spectral subtraction-based speech enhancement for cochlear implant patients in background noise. The journal of the Acoustical Society of America 117 (3), pp. 1001–1004. Cited by: §1, §1.
30. W. J. Youden (1950) Index for rating diagnostic tests. Cancer 3 (1), pp. 32–35. Cited by: §3.5.
31. C. Zorilă, C. Boeddeker, R. Doddipatla and R. Haeb-Umbach (2019) An investigation into the effectiveness of enhancement in ASR training and test for chime-5 dinner party transcription. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 47–53. Cited by: §1.