Residual acoustic echo suppression based on efficient multi-task convolutional neural network

Acoustic echo degrades the user experience in voice communication systems and therefore needs to be suppressed completely. We propose a real-time residual acoustic echo suppression (RAES) method using an efficient convolutional neural network. A double-talk detector is used as an auxiliary task to improve the performance of the RAES in a multi-task learning framework. The training criterion is based on a novel loss function, which we call the suppression loss, that balances the suppression of residual echo against the distortion of the near-end signal. Experimental results show that the proposed method efficiently suppresses residual echo under a variety of conditions.


Xinquan Zhou, Yanhong Leng
Multimedia Technology Group, ByteDance Inc., China

Keywords: residual acoustic echo suppression, convolutional neural network, multi-task learning, suppression loss

1 Introduction

In voice communication systems, acoustic echo cancellation (AEC) is needed when the microphone, located in an enclosed space with the loudspeaker, captures echo signals generated by the coupling between the microphone and the loudspeaker. A traditional AEC algorithm consists of two parts: an adaptive linear filter (AF) [3] and a nonlinear echo processor (NLP) [16]. Many challenges exist in AEC, such as the nonlinearity caused by loudspeaker characteristics, which makes it difficult to find the nonlinear relationship between the AF output and the far-end signal. As a result, the NLP in an AEC system is highly likely to damage the near-end signal substantially in order to remove the residual acoustic echo completely.

In recent years, machine learning has been introduced to acoustic echo cancellation and suppression. An artificial neural network (ANN) with two hidden layers has been utilized to estimate the residual echo from the far-end signal and its nonlinear transformations [15]. Training a deep neural network (DNN) with the far-end signal and the AF output signal can predict more accurate masks [9, 10]. However, due to the lack of phase information, feeding the network with magnitude spectra and estimating magnitude-spectrum masks can hardly preserve the near-end signal while removing all of the acoustic echo [23]. Meanwhile, adding more input features, such as the phase spectrum, makes the model too complicated to deploy on most personal terminals [5, 22]. In a recent study, a phase-sensitive weight was used to revise the mask by exploiting the phase relationship between the AF output and the near-end signal [2].

In this paper, we propose a new residual acoustic echo suppression (RAES) method using an efficient multi-task convolutional neural network (CNN), with the far-end reference and AF output signals as inputs and the phase-sensitive mask (PSM) as the target. A novel suppression loss is applied to balance the trade-off between suppressing residual echo and preserving the near-end signal. An accurate double-talk detector (DTD) is essential even in a traditional AEC, and in our work the double-talk state is estimated as an auxiliary task to improve the accuracy of mask prediction. The experimental results show that the proposed method effectively suppresses residual echo and significantly reduces near-end distortion in both simulated and real acoustic environments.

The rest of this paper is organized as follows. Section 2 introduces the traditional AEC system. The proposed method is presented in Section 3 and the comparative experimental results are shown in Section 4. Finally, Section 5 concludes the paper.

2 AEC framework

In the AEC framework, as shown in Fig. 1, the signal d(n) received by the microphone is composed of the near-end signal s(n) and the acoustic echo y(n):

d(n) = s(n) + y(n).

The purpose of AEC is to remove the echo signal y(n) while preserving the near-end signal s(n).

Figure 1: Linear AEC framework.

The acoustic echo contains two parts: the linear echo, comprising the direct far-end signal plus its reflections, and the nonlinear echo caused by the loudspeaker. The AF module adaptively estimates the linear echo and subtracts it from the microphone signal to obtain the output signal e(n). Traditionally, the NLP calculates a suppression gain from e(n) and the far-end signal x(n) to further suppress the residual echo. However, such methods are highly likely to damage the near-end signal severely during double-talk segments.

3 Proposed method

3.1 Feature extraction

The AF module is applied to cancel part of the linear echo in the microphone signal. There are many ways to implement the linear AF algorithm; theoretically, the proposed RAES can work with any standard AF algorithm, and we use a subband normalized least-mean-square (NLMS) algorithm in this paper.
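For intuition, a minimal time-domain NLMS echo canceller can be sketched as follows. This is an illustrative stand-in for the subband NLMS used in the paper (the subband decomposition is omitted); all names are our own.

```python
import numpy as np

def nlms(x, d, L=256, mu=0.5, eps=1e-8):
    """Time-domain NLMS adaptive filter.
    x: far-end signal, d: microphone signal.
    Returns the error signal e (the AF output fed to the RAES)."""
    w = np.zeros(L)                       # adaptive filter taps
    e = np.zeros(len(d))
    for n in range(L - 1, len(d)):
        x_buf = x[n - L + 1:n + 1][::-1]  # most recent L far-end samples
        y_hat = w @ x_buf                 # linear echo estimate
        e[n] = d[n] - y_hat               # AF output (error signal)
        w += mu * e[n] * x_buf / (x_buf @ x_buf + eps)  # normalized update
    return e
```

With a purely linear, time-invariant echo path the error converges toward zero; the residual that remains in practice (nonlinearity, path changes) is what the RAES is designed to suppress.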

The input features are the log-spectra of the AF output signal e(n) and the far-end reference signal x(n), as mentioned above. We convert e(n) and x(n) to the frequency domain using the short-time Fourier transform (STFT) with a square-root Hanning window of size N, so the number of frequency bins is N/2 after the direct-current bin is discarded. We concatenate T consecutive frames as the input features to provide more temporal reference information. Another advantage of the concatenation is that it pushes the network to learn the delay between the echo and the far-end signal.
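The feature extraction above can be sketched as follows; this is a hedged illustration using the STFT settings given later in the paper (window 128, 50% overlap, 20 frames), and the exact tensor layout fed to the network may differ.

```python
import numpy as np

def log_spectrum(sig, win_len=128, hop=64):
    """Log-magnitude STFT with a square-root Hann window, DC bin dropped."""
    win = np.sqrt(np.hanning(win_len))
    n_frames = (len(sig) - win_len) // hop + 1
    frames = np.stack([sig[i * hop:i * hop + win_len] * win
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)[:, 1:]   # discard the DC bin
    return np.log(np.abs(spec) + 1e-7)

def make_features(e, x, T=20, win_len=128, hop=64):
    """Concatenate T frames of AF-output and far-end log-spectra
    per training example (illustrative layout, not the paper's exact one)."""
    E, X = log_spectrum(e, win_len, hop), log_spectrum(x, win_len, hop)
    feats = []
    for t in range(T - 1, E.shape[0]):
        feats.append(np.concatenate([E[t - T + 1:t + 1],
                                     X[t - T + 1:t + 1]], axis=0))
    return np.stack(feats)   # shape: (n_examples, 2*T, win_len // 2)
```

Feeding T past frames of both signals lets the convolutional front-end discover the echo-path delay as a learned time-frequency offset between the two halves of the input.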

3.2 Network architecture

The backbone of the network in this paper is inspired by MobileNetV2, in which most full convolutions are replaced with depthwise and pointwise convolutions to reduce the computational cost [14]. The overall architecture is displayed in Fig. 2, where the first three parameters of Conv() and Residual BottleNeck() are the number of output channels, the kernel size and the stride, respectively; the default stride is 1 if not specified. FC denotes a fully connected layer with its input and output dimensions. The detailed architecture of the Residual BottleNeck() is shown in Fig. 2(a), where the residual connection fuses high-dimensional and low-dimensional features.
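The cost saving from replacing a full convolution with a depthwise-separable pair is easy to quantify: a k x k convolution costs k^2 * C_in * C_out multiplies per output position, while the depthwise-plus-pointwise factorization costs k^2 * C_in + C_in * C_out. A small sketch (the feature-map and channel sizes are illustrative, not the paper's exact layer dimensions):

```python
def conv_mults(h, w, c_in, c_out, k):
    """Multiplies for a standard k x k convolution over an h x w map."""
    return h * w * c_in * c_out * k * k

def separable_mults(h, w, c_in, c_out, k):
    """Depthwise (k x k per input channel) followed by pointwise (1 x 1)."""
    return h * w * c_in * k * k + h * w * c_in * c_out

# e.g. a 3x3 conv, 64 -> 64 channels, on a hypothetical 20 x 64 map
full = conv_mults(20, 64, 64, 64, 3)
sep = separable_mults(20, 64, 64, 64, 3)
print(round(full / sep, 1))   # prints 7.9
```

This roughly 8x reduction per layer is what keeps the RAES model small enough for real-time use on personal devices.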

It is worth noting that mask prediction during double talk is a challenging task. Once the features are extracted through four Residual BottleNeck blocks, we exploit a DTD prediction task in the right branch to reduce the burden on the left mask-prediction branch via a conditional attention mechanism. The multi-task learning thus makes the network focus more on predicting double-talk masks, since the masks can simply be set to 1 or 0 whenever the DTD task detects a single-talk period.

Figure 2: Proposed network architecture: (a) Inverted Residual BottleNeck (channels, kernel, stride); (b) total network.

3.3 Training targets and loss

The ideal amplitude mask (IAM) is often used as the training target in speech enhancement and residual echo suppression, but it ignores phase information. In this paper, we use the phase-sensitive mask (PSM) [4], expressed as

M(t,f) = ( |S(t,f)| / |E(t,f)| ) cos( θ_S(t,f) − θ_E(t,f) ),

where S(t,f) and E(t,f) denote the near-end and AF output signals in the t-th frame and f-th frequency bin, and θ_S, θ_E are their phases. The PSM is truncated between 0 and 1 in the network. The frequency-domain output of the proposed RAES in each frequency bin is then calculated through

Ŝ(t,f) = M̂(t,f) E(t,f),

where M̂(t,f) is the estimated mask.
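A minimal sketch of the PSM computation on complex STFT frames, following the definition in [4] (the clipping to [0, 1] matches the truncation described above):

```python
import numpy as np

def psm(S, E):
    """Phase-sensitive mask: |S|/|E| * cos(theta_S - theta_E),
    truncated to [0, 1] as in the paper. S, E: complex STFT arrays."""
    m = np.abs(S) / (np.abs(E) + 1e-8) * np.cos(np.angle(S) - np.angle(E))
    return np.clip(m, 0.0, 1.0)
```

When the AF output equals the near-end signal the mask is 1 (pass everything); when the two are out of phase or the near-end component is small, the mask shrinks toward 0.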
The minimum square error (MSE) is commonly used as the loss function in training. Some near-end distortion is considered inevitable if the echo is to be removed completely: as long as the network's estimate is imperfect, the RAES will either distort the near-end signal or leave some residual echo, or, worse, both. On the one hand, the intrinsic purpose of AEC is to remove all of the echo from the microphone signal while preserving as much of the near-end signal as possible, so suppressing the echo is, to some extent, more demanding than retaining near-end quality. On the other hand, the MSE loss is symmetric: negative and positive deviations of the same magnitude incur exactly the same loss, so using the MSE directly cannot control the trade-off between suppressing echo and preserving the near-end signal. Our solution is to apply a parametric leaky rectified linear unit (ReLU) to the mask deviation, yielding a weighted mean square distance between the target and estimated masks in frequency bin f with a suppression ratio β(f):

L(f) = (1/T) Σ_t [ g_{β(f)}( M̂(t,f) − M(t,f) ) ]²,  with g_β(x) = x for x ≥ 0 and g_β(x) = βx for x < 0,

where M(t,f) and M̂(t,f) are the target and estimated phase-sensitive masks in frequency bin f; we call this the suppression loss. The suppression ratio β(f) is set between 0 and 1, and the smaller β is, the more severe the suppression. The suppression extent can be adjusted in each frequency bin by setting a different β value; for simplicity, we use the same value in all frequency bins.
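The asymmetry is the whole point: overestimating the mask (leaving residual echo) is penalized fully, while underestimating it (suppressing some near-end signal) is penalized by a factor β² < 1. A sketch consistent with the reconstruction above:

```python
import numpy as np

def suppression_loss(m_hat, m, beta=0.5):
    """Asymmetric (leaky-ReLU weighted) MSE between estimated and
    target masks. beta in (0, 1]; beta = 1 recovers plain MSE, and
    a smaller beta biases the network toward smaller masks,
    i.e. harder echo suppression."""
    d = m_hat - m
    g = np.where(d > 0, d, beta * d)   # leaky ReLU on the deviation
    return np.mean(g ** 2)
```

During training the network therefore prefers to err on the side of over-suppression, which matches the stated priority of removing echo over perfectly preserving the near-end signal.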

The DTD state in the t-th frame is set to 0, 1 or 2 for single near-end talk, single far-end talk and double talk, respectively, according to which signals are active in that frame. Due to the imbalance between single and double talk in the dataset, the focal loss [11] with focusing parameter γ is used as the loss function for the DTD task, and the two losses are combined as in [7], with two weights updated by the network.
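A minimal focal-loss sketch for the three-class DTD head follows. The paper's focusing-parameter value is not recoverable from the text, so γ = 2.0 below is simply the common default from [11], an assumption on our part:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, eps=1e-7):
    """Focal loss for the 3-class DTD task.
    p: (N, 3) array of softmax probabilities, y: (N,) labels in {0, 1, 2}.
    gamma = 2.0 is an assumed default, not the paper's value."""
    pt = np.clip(p[np.arange(len(y)), y], eps, 1.0)  # prob of true class
    return np.mean(-((1.0 - pt) ** gamma) * np.log(pt))
```

The (1 − pt)^γ factor down-weights frames the classifier already gets right, so the abundant single-talk frames contribute little and the rarer, harder double-talk frames dominate the gradient.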

4 Experimental results

4.1 Datasets

In the experiments, the TIMIT [8] and THCHS30 [20] datasets are used to generate the training, validation and test sets. For the training set, we randomly select 423 speakers with 4230 utterances from TIMIT and 40 speakers with 5690 utterances from THCHS30, while the validation and test sets include 160 different speakers with 1600 utterances from TIMIT and 16 different speakers with 2083 utterances from THCHS30. Speakers are randomly paired with male-female, male-male and female-female ratios of (0.3, 0.4, 0.3) in TIMIT and (0.3, 0.2, 0.5) in THCHS30, respectively. A far-end signal is created by concatenating three utterances from one speaker; one utterance from another speaker is used as the near-end signal and repeated to the same length as the far-end signal. Moreover, since loudspeakers often play other types of signals, especially music, which differs greatly from speech in its time-frequency characteristics, we intentionally mix music signals from MUSAN [17] into a random 10% of the far-end signals. In total, 5400 training mixtures are generated, of which 2400 are from TIMIT and the rest from THCHS30.

Different devices exhibit different nonlinear characteristics and different intrinsic system delays between the far-end and microphone signals. To simulate different devices, a hard clip is first applied to 70% of the far-end signals to model power-amplifier clipping:

x_hard(n) = −x_max if x(n) < −x_max;  x(n) if |x(n)| ≤ x_max;  x_max if x(n) > x_max,

where x_max is randomly chosen from 0.75 to 0.99. Loudspeaker nonlinearity is then simulated using the memoryless sigmoid function [12]:

x_NL(n) = γ ( 2 / (1 + exp(−a · b(n))) − 1 ),

where b(n) is a nonlinear transform of x_hard(n). The gain γ is set randomly from 0.15 to 0.3. The slope a is set randomly from 0.05 to 0.45 when b(n) is greater than 0, and from 0.1 to 0.4 otherwise.
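The two distortion stages can be sketched as below. Note that the exact form of b(n) is not recoverable from the text; b(n) = 1.5 x(n) − 0.3 x(n)² is the form commonly paired with this sigmoid model in the literature around [12], and is an assumption here.

```python
import numpy as np

def hard_clip(x, x_max):
    """Hard clip simulating power-amplifier saturation (x_max in [0.75, 0.99])."""
    return np.clip(x, -x_max, x_max)

def loudspeaker_sigmoid(x, gain=0.2, slope_pos=0.25, slope_neg=0.25):
    """Memoryless sigmoid loudspeaker nonlinearity after [12].
    b(n) = 1.5x - 0.3x^2 is an assumed form, not taken from the paper."""
    b = 1.5 * x - 0.3 * x ** 2
    a = np.where(b > 0, slope_pos, slope_neg)   # slope depends on sign of b
    return gain * (2.0 / (1.0 + np.exp(-a * b)) - 1.0)
```

Applied in sequence (clip, then sigmoid, then delay and RIR convolution), these stages turn a clean far-end signal into an echo signal whose nonlinear component the linear AF cannot cancel.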

Then we generate the echo signal from the distorted far-end signal. A delay ranging from 8 ms to 40 ms is added to the distorted far-end signal to simulate the internal system delay. Both simulated and real recorded room impulse responses (RIRs) are convolved with the distorted far-end signals to generate the final echo signals. The simulated RIRs are generated using the image method [1]. Two typical rooms with sizes [6.5 m, 4.1 m, 2.95 m] and [4.2 m, 3.83 m, 2.75 m] are used in this paper. The reverberation time ranges over [0.3 s, 0.4 s, 0.5 s, 0.6 s] with sample lengths [2048, 2048, 4096, 4096], respectively. We generate 4 different microphone positions in each room and 5 different loudspeaker positions around each microphone. The real recorded RIRs are selected from AIR [6], BUT [18] and MARDY [21], with microphone-loudspeaker distances within 1.2 m. Ninety percent of the far-end signals are convolved with RIRs randomly selected from the simulated and real recorded RIR datasets above. Half of the near-end signals in the training set are replaced with silence to generate single far-end talk. The SER during double talk is randomly selected from -13 dB to 0 dB. The same procedure is used to generate the validation set.

4.2 Experimental configurations

The window length of the STFT is set to 128 with 50% overlap, and 20 frames of the far-end and AF output signals are concatenated to form the input features, with the direct-current and negative frequency bins discarded; the result is reshaped into an individual input example. The batch size is set to 1024. The Adam optimizer is applied with an initial learning rate of 0.003. The suppression ratio β is set to 0.5 or 1.0.

4.3 Evaluation metrics

The proposed method is evaluated in terms of the perceptual evaluation of speech quality (PESQ) [13] and short-time objective intelligibility (STOI) [19] during double-talk periods, and the echo return loss enhancement (ERLE) during single far-end talk periods. The ERLE of the linear AEC is calculated as

ERLE = 10 log10 ( E[ d²(n) ] / E[ e²(n) ] ),

and we extend it to measure the echo suppression of the nonlinear RAES framework by replacing e(n) with the RAES output ŝ(n).
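The ERLE definition above is a one-liner in code; this sketch follows the standard power-ratio form reconstructed above:

```python
import numpy as np

def erle_db(d, res):
    """ERLE in dB: power ratio of the microphone signal d to the
    residual signal res (the AF output e, or the RAES output for
    the extended nonlinear measure)."""
    return 10.0 * np.log10(np.mean(d ** 2) / (np.mean(res ** 2) + 1e-12))
```

During single far-end talk d(n) is pure echo, so a higher ERLE directly means more echo removed; e.g. a residual at one-tenth the amplitude of the microphone signal gives 20 dB.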

4.4 Performance comparisons

In most hardware devices, the distance between the microphone and the loudspeaker is relatively small, resulting in a low SER. We generate test mixtures at 0 dB, -5 dB and -10 dB SER and compare the proposed method with AEC3 in WebRTC and the DNN method of [9]. The DNN architecture is composed of three hidden layers with 2048 nodes each, without restricted Boltzmann machine (RBM) pre-training to initialize the DNN parameters. For the DNN method, we also concatenate 20 frames as input features. The ideal amplitude mask and the MSE loss function are used for DNN training, with the same learning rate as the RAES.

We generate 50 pairs of TIMIT and 50 pairs of THCHS30 test mixtures for each case. Table 1 shows the ERLE of the different algorithms during single far-end talk. The proposed RAES yields more than 40 dB ERLE, showing better echo suppression than AEC3 and the DNN method, especially when both speech and music are present in the echo signals. Tables 2 and 3 list the PESQ and STOI scores of the different methods. The RAES outperforms AEC3 and the DNN method, indicating that it preserves better speech quality and intelligibility during double talk. The suppression ratio β can be used to adjust the suppression strength of the model: a smaller β suppresses both the echo and the near-end signal more aggressively. The F1 score of the DTD task is 93.0% in training and 90.3% in validation. These results suggest that further post-processing of the masks could be performed based on the reliable DTD.

Method         speech    speech + music
AF             16.612    17.973
AEC3           29.976    25.376
AF+DNN         27.832    31.492
AF+RAES()      40.786    43.144
AF+RAES()      35.597    36.454
Table 1: Average ERLE (dB) during single far-end talk
SER                   0 dB     -5 dB    -10 dB
PESQ  Origin          1.908    1.519    1.248
      AEC3            1.610    1.520    1.292
      AF+DNN          2.666    2.463    2.094
      AF+RAES()       2.816    2.591    2.163
      AF+RAES()       2.809    2.598    2.200
STOI  Origin          0.728    0.582    0.485
      AEC3            0.623    0.569    0.494
      AF+DNN          0.856    0.809    0.727
      AF+RAES()       0.875    0.836    0.760
      AF+RAES()       0.889    0.851    0.776
Table 2: Average PESQ and STOI during double talk (speech)
SER                   0 dB     -5 dB    -10 dB
PESQ  Origin          1.874    1.594    1.322
      AEC3            1.734    1.563    1.297
      AF+DNN          2.729    2.507    2.141
      AF+RAES()       2.864    2.620    2.223
      AF+RAES()       2.849    2.626    2.283
STOI  Origin          0.689    0.610    0.475
      AEC3            0.641    0.583    0.518
      AF+DNN          0.848    0.808    0.733
      AF+RAES()       0.872    0.838    0.771
      AF+RAES()       0.882    0.851    0.788
Table 3: Average PESQ and STOI during double talk (speech+music)

The computational complexity comparison is displayed in Table 4. SSE2 optimization is enabled in AEC3. We run the DNN and RAES models with a self-developed neural network inference library. The real-time rate (RT) of the DNN and RAES is 0.89 and 0.05, respectively, when processing 60 s of speech on a 2.5 GHz x86_64 CPU, which indicates that the RAES can easily be deployed on personal platforms.

Model    Size      MFLOPs    RT
AEC3     -         -         0.01
DNN      53.2 M    13.8      0.89
RAES     1.2 M     6.9       0.05
Table 4: Operation complexity comparison

5 Conclusions

An efficient and effective multi-task residual acoustic echo suppression method has been proposed. We evaluated the method in different simulated and real rooms under various SER conditions. The experimental results show that the proposed RAES achieves better echo suppression than traditional echo cancellation methods and is easy to deploy and run in real time on most personal devices.




  1. J. B. Allen and D. A. Berkley (1979) Image method for efficiently simulating small-room acoustics. The Journal of the Acoustical Society of America 65 (4), pp. 943–950.
  2. G. Carbajal, R. Serizel, E. Vincent and E. Humbert (2018) Multiple-input neural network-based residual echo suppression. In ICASSP, pp. 231–235.
  3. A. Deb, A. Kar and M. Chandra (2014) A technical review on adaptive algorithms for acoustic echo cancellation. In International Conference on Communication and Signal Processing, pp. 041–045.
  4. H. Erdogan, J. R. Hershey, S. Watanabe and J. Le Roux (2015) Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks. In ICASSP, pp. 708–712.
  5. A. Fazel, M. El-Khamy and J. Lee (2020) CAD-AEC: context-aware deep acoustic echo cancellation. In ICASSP, pp. 6919–6923.
  6. M. Jeub, M. Schafer and P. Vary (2009) A binaural room impulse response database for the evaluation of dereverberation algorithms. In 16th International Conference on Digital Signal Processing, pp. 1–5.
  7. A. Kendall, Y. Gal and R. Cipolla (2018) Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In CVPR, pp. 7482–7491.
  8. L. F. Lamel, R. H. Kassel and S. Seneff (1989) Speech database development: design and analysis of the acoustic-phonetic corpus. In Speech Input/Output Assessment and Speech Databases.
  9. C. M. Lee, J. W. Shin and N. S. Kim (2015) DNN-based residual echo suppression. In Sixteenth Annual Conference of the International Speech Communication Association.
  10. Q. Lei, H. Chen, J. Hou, L. Chen and L. Dai (2019) Deep neural network based regression approach for acoustic echo cancellation. In Proceedings of the 2019 4th International Conference on Multimedia Systems and Signal Processing, pp. 94–98.
  11. T. Lin, P. Goyal, R. Girshick, K. He and P. Dollár (2017) Focal loss for dense object detection. In ICCV, pp. 2980–2988.
  12. S. Malik and G. Enzner (2012) State-space frequency-domain adaptive filtering for nonlinear acoustic echo cancellation. IEEE Transactions on Audio, Speech, and Language Processing 20 (7), pp. 2065–2079.
  13. A. W. Rix, J. G. Beerends, M. P. Hollier and A. P. Hekstra (2001) Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs. In ICASSP, Vol. 2, pp. 749–752.
  14. M. Sandler, A. Howard, M. Zhu, A. Zhmoginov and L. Chen (2018) MobileNetV2: inverted residuals and linear bottlenecks. In CVPR, pp. 4510–4520.
  15. A. Schwarz, C. Hofmann and W. Kellermann (2013) Spectral feature-based nonlinear residual echo suppression. In WASPAA, pp. 1–4.
  16. K. Shi, X. Ma and G. Tong Zhou (2008) A residual echo suppression technique for systems with nonlinear acoustic echo paths. In ICASSP, pp. 257–260.
  17. D. Snyder, G. Chen and D. Povey (2015) MUSAN: a music, speech, and noise corpus. arXiv preprint arXiv:1510.08484.
  18. I. Szöke, M. Skácel, L. Mošner, J. Paliesek and J. H. Černockỳ (2019) Building and evaluation of a real room impulse response dataset. IEEE Journal of Selected Topics in Signal Processing 13 (4), pp. 863–876.
  19. C. H. Taal, R. C. Hendriks, R. Heusdens and J. Jensen (2011) An algorithm for intelligibility prediction of time-frequency weighted noisy speech. IEEE Transactions on Audio, Speech, and Language Processing 19 (7), pp. 2125–2136.
  20. D. Wang and X. Zhang (2015) THCHS-30: a free Chinese speech corpus. arXiv preprint arXiv:1512.01882.
  21. J. Y. Wen, N. D. Gaubitch, E. A. Habets, T. Myatt and P. A. Naylor (2006) Evaluation of speech dereverberation algorithms using the MARDY database. In Proc. International Workshop on Acoustic Echo and Noise Control.
  22. H. Zhang, K. Tan and D. Wang (2019) Deep learning for joint acoustic echo and noise cancellation with nonlinear distortions. In Proc. Interspeech, pp. 4255–4259.
  23. H. Zhang and D. Wang (2018) Deep learning for acoustic echo cancellation in noisy and double-talk scenarios. In Proc. Interspeech, pp. 3239–3243.