Dual-path RNN: efficient long sequence modeling for
time-domain single-channel speech separation
Recent studies in deep learning-based speech separation have shown that time-domain approaches outperform conventional time-frequency-based methods. Unlike time-frequency domain approaches, time-domain separation systems often receive input sequences consisting of a huge number of time steps, which makes modeling such extremely long sequences challenging. Conventional recurrent neural networks (RNNs) are not effective for modeling such long sequences due to optimization difficulties, while one-dimensional convolutional neural networks (1-D CNNs) cannot perform utterance-level sequence modeling when their receptive field is smaller than the sequence length. In this paper, we propose the dual-path recurrent neural network (DPRNN), a simple yet effective method for organizing RNN layers in a deep structure to model extremely long sequences. DPRNN splits the long sequential input into smaller chunks and applies intra- and inter-chunk operations iteratively, where the input length can be made proportional to the square root of the original sequence length in each operation. Experiments show that by replacing the 1-D CNN with DPRNN and applying sample-level modeling in the time-domain audio separation network (TasNet), a new state-of-the-art performance on WSJ0-2mix is achieved with a model 20 times smaller than the previous best system.
Yi Luo (work done during an internship at Microsoft Research), Zhuo Chen, Takuya Yoshioka
Department of Electrical Engineering, Columbia University, NY, USA
Microsoft, One Microsoft Way, Redmond, WA, USA
Index Terms: speech separation, deep learning, time domain, recurrent neural networks
1 Introduction
Recent progress in deep learning-based speech separation has ignited the interest of the research community in time-domain approaches [24, 29, 30, 25, 28, 3]. Compared with standard time-frequency domain methods, time-domain methods are designed to jointly model the magnitude and phase information and allow direct optimization with respect to both time- and frequency-domain differentiable criteria [7, 16, 21].
Current time-domain separation systems can be mainly categorized into adaptive front-end and direct regression approaches. The adaptive front-end approaches aim at replacing the short-time Fourier transform (STFT) with a differentiable transform to build a front-end that can be learned jointly with the separation network. Separation is applied to the front-end output, just as conventional time-frequency domain methods apply separation to spectrogram inputs [30, 25, 28]. Being independent of the traditional time-frequency analysis paradigm, these systems allow a much more flexible choice of the window size and the number of basis functions for the front-end. On the other hand, the direct regression approaches learn a regression function from an input mixture to the underlying clean signals without an explicit front-end, typically by using some form of one-dimensional convolutional neural network (1-D CNN) [29, 20, 7].
A commonality between the two categories is that they both rely on effective modeling of extremely long input sequences. The direct regression methods perform separation at the waveform sample level, where the number of samples is usually in the tens of thousands or more. The performance of the adaptive front-end methods also depends on the selection of the window size, where a smaller window improves the separation performance at the cost of a significantly longer front-end representation [25, 13]. This poses an additional challenge, as conventional sequential modeling networks, including RNNs and 1-D CNNs, have difficulty learning such long-term temporal dependencies [18]. Moreover, unlike RNNs that have dynamic receptive fields, 1-D CNNs with fixed receptive fields that are smaller than the sequence length are not able to fully utilize the sequence-level dependency [4].
In this paper, we propose a simple network architecture, which we refer to as dual-path RNN (DPRNN), that organizes any kind of RNN layer to model long sequential inputs in a very simple way. The intuition is to split the input sequence into shorter chunks and interleave two RNNs, an intra-chunk RNN and an inter-chunk RNN, for local and global modeling, respectively. In a DPRNN block, the intra-chunk RNN first processes the local chunks independently, and then the inter-chunk RNN aggregates the information from all the chunks to perform utterance-level processing. For a sequential input of length $L$, DPRNN with chunk size $K$ and chunk hop size $P$ contains $S$ chunks, where $S$ and $K$ correspond to the input lengths for the inter- and intra-chunk RNNs, respectively. When $K \approx S$, the two RNNs have a sublinear input length ($\mathcal{O}(\sqrt{L})$) as opposed to the original input length ($\mathcal{O}(L)$), which greatly decreases the optimization difficulty that arises when $L$ is extremely large.
Compared with other approaches for arranging local and global RNN layers, or more generally the hierarchical RNNs that perform sequence modeling in multiple time scales [17, 6, 26, 11, 35, 5], the stacked DPRNN blocks iteratively and alternately perform the intra- and inter-chunk operations, which can be treated as an interleaved processing between local and global inputs. Moreover, the first RNN layer in most hierarchical RNNs still receives the entire input sequence, while in stacked DPRNN each intra- or inter-chunk RNN receives the same sublinear input size across all blocks. Compared with CNN-based architectures such as temporal convolutional networks (TCNs) that only perform local modeling due to the fixed receptive fields [25, 28, 19], DPRNN is able to fully utilize global information via the inter-chunk RNNs and achieve superior performance with an even smaller model size. In Section 4 we will show that by simply replacing TCN with DPRNN in a previously proposed time-domain separation system [25], the model is able to achieve a 0.7 dB (4.6% relative) improvement in scale-invariant signal-to-noise ratio (SI-SNR) [16] on WSJ0-2mix with a 49% smaller model size. By performing the separation at the waveform sample level, i.e. with a window size of 2 samples and a hop size of 1 sample, a new state-of-the-art performance is achieved with a model 20 times smaller than the previous best system.
2 Dual-path Recurrent Neural Network
2.1 Model design
A dual-path RNN (DPRNN) consists of three stages: segmentation, block processing, and overlap-add. The segmentation stage splits a sequential input into overlapped chunks and concatenates all the chunks into a 3-D tensor. The tensor is then passed to stacked DPRNN blocks to iteratively apply local (intra-chunk) and global (inter-chunk) modeling in an alternating fashion. The output from the last block is transformed back to a sequential output with the overlap-add method. Figure 1 shows the flowchart of the model.
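2.1.1 Segmentation
For a sequential input $\mathbf{W} \in \mathbb{R}^{N \times L}$, where $N$ is the feature dimension and $L$ is the number of time steps, the segmentation stage splits $\mathbf{W}$ into chunks of length $K$ and hop size $P$. The first and last chunks are zero-padded so that every sample in $\mathbf{W}$ appears in exactly $K/P$ chunks, generating $S$ equal-size chunks $\mathbf{D}_s \in \mathbb{R}^{N \times K}$, $s = 1, \ldots, S$. All chunks are then concatenated together to form a 3-D tensor $\mathbf{T} = [\mathbf{D}_1, \ldots, \mathbf{D}_S] \in \mathbb{R}^{N \times K \times S}$.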
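The segmentation stage can be illustrated with a minimal NumPy sketch. It assumes a 50% hop (hop size equal to half the chunk length, as used later in the paper) and zero-pads the tail so the length is a multiple of the hop; the function name `segment` and the tail-padding detail are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def segment(w: np.ndarray, chunk_size: int) -> np.ndarray:
    """Split w of shape (N, L) into 50%-overlapped chunks; returns (N, K, S)."""
    if chunk_size % 2 != 0:
        raise ValueError("chunk_size must be even for a 50% hop")
    _, length = w.shape
    hop = chunk_size // 2
    # Assumption: pad the tail so the length becomes a multiple of the hop.
    tail = (-length) % hop
    # One hop of zeros on each side makes every original sample
    # fall into exactly two chunks.
    padded = np.pad(w, ((0, 0), (hop, hop + tail)))
    n_chunks = (padded.shape[1] - chunk_size) // hop + 1
    chunks = [padded[:, s * hop : s * hop + chunk_size] for s in range(n_chunks)]
    return np.stack(chunks, axis=-1)  # shape (N, K, S)

w = np.arange(20, dtype=float).reshape(2, 10)  # N = 2 features, L = 10 steps
t = segment(w, chunk_size=4)
print(t.shape)
```

The first chunk starts with one hop of zeros followed by the first samples of the input, and the second chunk already contains only original samples.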
2.1.2 Block processing
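The segmentation output $\mathbf{T}$ is then passed to the stack of $B$ DPRNN blocks. Each block transforms an input 3-D tensor into another tensor with the same shape. We denote the input tensor for block $b = 1, \ldots, B$ as $\mathbf{T}_b \in \mathbb{R}^{N \times K \times S}$ (feature dimension $N$, chunk length $K$, number of chunks $S$), where $\mathbf{T}_1 = \mathbf{T}$. Each block contains two sub-modules corresponding to intra- and inter-chunk processing, respectively. The intra-chunk RNN is always bi-directional and is applied to the second dimension of $\mathbf{T}_b$, i.e., within each of the $S$ chunks:

$$\mathbf{U}_b = [f_b(\mathbf{T}_b[:, :, i]),\ i = 1, \ldots, S]$$

where $\mathbf{U}_b \in \mathbb{R}^{H \times K \times S}$ is the output of the RNN, $f_b(\cdot)$ is the mapping function defined by the RNN, and $\mathbf{T}_b[:, :, i] \in \mathbb{R}^{N \times K}$ is the sequence defined by chunk $i$. A linear fully-connected (FC) layer is then applied to transform the feature dimension of $\mathbf{U}_b$ back to that of $\mathbf{T}_b$:

$$\hat{\mathbf{U}}_b = [\mathbf{G}\,\mathbf{U}_b[:, :, i] + \mathbf{m},\ i = 1, \ldots, S]$$

where $\hat{\mathbf{U}}_b \in \mathbb{R}^{N \times K \times S}$ is the transformed feature, $\mathbf{G} \in \mathbb{R}^{N \times H}$ and $\mathbf{m} \in \mathbb{R}^{N \times 1}$ are the weight and bias of the FC layer, respectively, and $\mathbf{U}_b[:, :, i] \in \mathbb{R}^{H \times K}$ represents chunk $i$ in $\mathbf{U}_b$. Layer normalization (LN) [2] is then applied to $\hat{\mathbf{U}}_b$, which we empirically found to be important for good generalization:

$$\mathrm{LN}(\hat{\mathbf{U}}_b) = \frac{\hat{\mathbf{U}}_b - \mu(\hat{\mathbf{U}}_b)}{\sqrt{\sigma(\hat{\mathbf{U}}_b) + \epsilon}} \odot \mathbf{z} + \mathbf{r}$$

where $\mathbf{z}, \mathbf{r} \in \mathbb{R}^{N \times 1}$ are the rescaling factors, $\epsilon$ is a small positive number for numerical stability, and $\odot$ denotes the Hadamard product. $\mu(\cdot)$ and $\sigma(\cdot)$ are the mean and variance of the 3-D tensor, defined as

$$\mu(\hat{\mathbf{U}}_b) = \frac{1}{NKS} \sum_{i=1}^{N} \sum_{j=1}^{K} \sum_{s=1}^{S} \hat{\mathbf{U}}_b[i, j, s]$$

$$\sigma(\hat{\mathbf{U}}_b) = \frac{1}{NKS} \sum_{i=1}^{N} \sum_{j=1}^{K} \sum_{s=1}^{S} \left(\hat{\mathbf{U}}_b[i, j, s] - \mu(\hat{\mathbf{U}}_b)\right)^2$$

A residual connection is then added between the output of the LN operation and the input $\mathbf{T}_b$:

$$\hat{\mathbf{T}}_b = \mathbf{T}_b + \mathrm{LN}(\hat{\mathbf{U}}_b)$$

$\hat{\mathbf{T}}_b$ then serves as the input to the inter-chunk RNN sub-module, where the RNN is applied to the last dimension, i.e. the aligned $K$ time steps in each of the $S$ chunks:

$$\mathbf{V}_b = [h_b(\hat{\mathbf{T}}_b[:, i, :]),\ i = 1, \ldots, K]$$

where $\mathbf{V}_b \in \mathbb{R}^{H \times K \times S}$ is the output of the RNN, $h_b(\cdot)$ is the mapping function defined by the RNN, and $\hat{\mathbf{T}}_b[:, i, :] \in \mathbb{R}^{N \times S}$ is the sequence defined by the $i$-th time step in all $S$ chunks. As the intra-chunk RNN is bi-directional, each time step in $\hat{\mathbf{T}}_b$ contains the entire information of the chunk it belongs to, which allows the inter-chunk RNN to perform full sequence-level modeling. As with the intra-chunk RNN, a linear FC layer and the LN operation are applied on top of $\mathbf{V}_b$. A residual connection is also added between the output and $\hat{\mathbf{T}}_b$ to form the output of DPRNN block $b$. For $b < B$, the output serves as the input $\mathbf{T}_{b+1}$ to the next block.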
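The data flow through one block can be sketched in NumPy as follows. This is only an illustration of the axis handling, the FC projection, the tensor-wide layer normalization, and the residual connections described above. The recurrent layers are replaced by a hypothetical stand-in, `toy_bidirectional` (forward and backward running means), because a trained BLSTM is outside the scope of a short example; all function and parameter names are assumptions.

```python
import numpy as np

def global_layer_norm(x, z, r, eps=1e-8):
    """LN with mean and variance taken over all entries of x (N, K, S)."""
    mu = x.mean()
    var = ((x - mu) ** 2).mean()
    return (x - mu) / np.sqrt(var + eps) * z[:, None, None] + r[:, None, None]

def sub_module(t, seq_fn, G, m, z, r, axis):
    """Scan along `axis` (1 = intra-chunk, 2 = inter-chunk), then FC, LN, residual."""
    # Put the scanned axis in the middle so each slice is an (N, length) sequence.
    x = t if axis == 1 else t.transpose(0, 2, 1)
    # seq_fn maps (N, length) -> (H, length); it stands in for an RNN.
    u = np.stack([seq_fn(x[:, :, i]) for i in range(x.shape[2])], axis=-1)
    u_hat = np.einsum("nh,hli->nli", G, u) + m[:, None, None]  # FC back to N features
    if axis == 2:
        u_hat = u_hat.transpose(0, 2, 1)  # restore the (N, K, S) layout
    return t + global_layer_norm(u_hat, z, r)

def dprnn_block(t, intra_fn, inter_fn, params_intra, params_inter):
    t_hat = sub_module(t, intra_fn, *params_intra, axis=1)    # local pass
    return sub_module(t_hat, inter_fn, *params_inter, axis=2)  # global pass

def toy_bidirectional(x):
    """Hypothetical stand-in for a BLSTM: forward/backward running means (H = 2N)."""
    steps = np.arange(1, x.shape[1] + 1)
    fwd = np.cumsum(x, axis=1) / steps
    bwd = (np.cumsum(x[:, ::-1], axis=1) / steps)[:, ::-1]
    return np.concatenate([fwd, bwd], axis=0)

rng = np.random.default_rng(0)
t = rng.standard_normal((3, 4, 5))            # N = 3, K = 4, S = 5
G = rng.standard_normal((3, 6))               # FC weight, H = 6
params = (G, np.zeros(3), np.ones(3), np.zeros(3))  # (G, m, z, r)
out = dprnn_block(t, toy_bidirectional, toy_bidirectional, params, params)
print(out.shape)
```

The output has the same shape as the input, so blocks of this form can be stacked directly.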
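2.1.3 Overlap-add
Denote the output of the last DPRNN block as $\mathbf{T}_{B+1} \in \mathbb{R}^{N \times K \times S}$. To transform it back to a sequence, the overlap-add method is applied to the $S$ chunks to form the output $\mathbf{Q} \in \mathbb{R}^{N \times L}$.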
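A minimal NumPy sketch of the overlap-add step, assuming a 50% hop and one hop of zero-padding at each end of the sequence. The function name `overlap_add` is illustrative. With this layout every sample is covered by exactly two chunks, so the plain sum of unmodified chunks returns twice the input; whether to rescale is an implementation choice that the text does not specify.

```python
import numpy as np

def overlap_add(t: np.ndarray, length: int) -> np.ndarray:
    """Sum (N, K, S) chunks with hop K/2 back into an (N, length) sequence."""
    n_feat, chunk_size, n_chunks = t.shape
    hop = chunk_size // 2
    out = np.zeros((n_feat, (n_chunks - 1) * hop + chunk_size), dtype=t.dtype)
    for s in range(n_chunks):
        out[:, s * hop : s * hop + chunk_size] += t[:, :, s]
    # Drop the leading padding hop and anything beyond the original length.
    return out[:, hop : hop + length]

# Build 50%-overlapped chunks by hand for a short input (K = 4, hop = 2).
w = np.arange(20, dtype=float).reshape(2, 10)
padded = np.pad(w, ((0, 0), (2, 2)))
t = np.stack([padded[:, s * 2 : s * 2 + 4] for s in range(6)], axis=-1)
q = overlap_add(t, length=10)
print(np.allclose(q, 2 * w))
```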
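2.2 Discussions
Consider the sum of the input sequence lengths for the intra- and inter-chunk RNNs in a single block, denoted by $K + S$, where $K$ is the chunk size, $S$ is the number of chunks, and the hop size is set to 50% of the chunk size (i.e. $P = K/2$) as in Figure 1. It is simple to see that $S = \lceil 2L/K \rceil + 1$, where $L$ is the input length and $\lceil \cdot \rceil$ is the ceiling function. To achieve the minimum total input length $K + S = K + \lceil 2L/K \rceil + 1$, $K$ should be selected such that $K \approx \sqrt{2L}$, and then $S$ also satisfies $S \approx \sqrt{2L} \approx K$. This gives a sublinear input length ($\mathcal{O}(\sqrt{L})$) rather than the original linear input length ($\mathcal{O}(L)$).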
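The relation between sequence length, chunk size, and number of chunks can be checked numerically. The sketch below assumes a 50% hop, so that the number of chunks is the ceiling of twice the length over the chunk size, plus one. The helper names are illustrative, and rounding the chunk size up to an even number (so that the hop is an integer) is an added assumption.

```python
def num_chunks(length: int, chunk_size: int) -> int:
    """Number of chunks for a 50% hop: ceil(2 * length / chunk_size) + 1."""
    return -(-2 * length // chunk_size) + 1  # integer ceiling division

def balanced_chunk_size(length: int) -> int:
    """An even chunk size near sqrt(2 * length), so that the hop is an integer."""
    k = round((2 * length) ** 0.5)
    return k + (k % 2)

for length in (1_000, 32_000, 1_000_000):
    k = balanced_chunk_size(length)
    s = num_chunks(length, k)
    print(f"L={length:>9,d}  K={k:>5d}  S={s:>5d}  K+S={k + s:>5d}")
```

For a sequence of 32,000 steps, both the chunk size and the number of chunks come out at roughly 250, so each RNN scans a sequence more than a hundred times shorter than the input.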
For tasks that require online processing, the inter-chunk RNN can be made uni-directional, scanning from the first chunk up to the current chunk. Later chunks can still utilize the information from all previous chunks, and the minimal system latency is thus defined by the chunk size $K$. This is unlike standard CNN-based models, which can only perform local processing due to the fixed receptive field, and conventional RNN-based models, which perform frame-level instead of chunk-level modeling. The performance difference between the online and offline settings, however, is beyond the scope of this paper.
3 Experimental procedures
3.1 Model configurations
Although DPRNN can be applied to any system that requires long-term sequential modeling, we investigate its application to the time-domain audio separation network (TasNet) [24, 23, 25], an adaptive front-end method that achieves high speech separation performance on a benchmark dataset. TasNet contains three parts: (1) a linear 1-D convolutional encoder that encapsulates the input mixture waveform into an adaptive 2-D front-end representation, (2) a separator that estimates masking matrices for the target sources, and (3) a linear 1-D transposed convolutional decoder that converts the masked 2-D representations back to waveforms. We use the same encoder and decoder design as in [25], with the number of filters set to 64. As for the separator, we compare the proposed deep DPRNN with the optimally configured TCN described in [25]. We use 6 DPRNN blocks with BLSTM [10] as the intra- and inter-chunk RNNs, with 128 hidden units in each direction. The chunk size $K$ for DPRNN is defined empirically according to the length $L$ of the front-end representation such that $K \approx \sqrt{2L}$, as discussed in Section 2.2.
3.2 Dataset
We evaluate our approach on two-speaker speech separation and recognition tasks. The separation-only experiment is conducted on the widely used WSJ0-2mix dataset [9]. WSJ0-2mix contains 30 hours of 8 kHz training data generated from the Wall Street Journal (WSJ0) si_tr_s set. It also has 10 hours of validation data and 5 hours of test data generated from the si_dt_05 and si_et_05 sets, respectively. Each mixture is artificially generated by randomly selecting different speakers from the corresponding set and mixing them at a random signal-to-noise ratio (SNR) between -5 and 5 dB.
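Mixing two utterances at a prescribed SNR can be sketched as below. This is one common convention (rescale the second signal so that the power ratio of the first to the second matches the target) and is not necessarily what the WSJ0-2mix generation scripts do; the function name `mix_at_snr` and the random stand-in signals are assumptions for illustration.

```python
import numpy as np

def mix_at_snr(s1, s2, snr_db):
    """Rescale s2 so that the s1-to-s2 power ratio equals snr_db, then sum."""
    p1 = np.mean(s1 ** 2)
    p2 = np.mean(s2 ** 2)
    gain = np.sqrt(p1 / (p2 * 10 ** (snr_db / 10)))
    s2_scaled = gain * s2
    return s1 + s2_scaled, s2_scaled

rng = np.random.default_rng(0)
s1 = rng.standard_normal(32000)  # stand-ins for two 4-second utterances at 8 kHz
s2 = rng.standard_normal(32000)
snr_db = rng.uniform(-5.0, 5.0)  # random SNR between -5 and 5 dB
mixture, s2_scaled = mix_at_snr(s1, s2, snr_db)
achieved = 10 * np.log10(np.mean(s1 ** 2) / np.mean(s2_scaled ** 2))
print(f"target {snr_db:.2f} dB, achieved {achieved:.2f} dB")
```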
For the speech separation and recognition experiment, we create 200 hours and 10 hours of artificially mixed noisy reverberant mixtures sampled from the Librispeech dataset [27] for training and validation, respectively. The 16 kHz signals are convolved with room impulse responses generated by the image method [1]. The length and width of the room are randomly sampled between 2 and 10 meters, and the height is randomly sampled between 2 and 5 meters. The reverberation time (T60) is randomly sampled between 0.1 and 0.5 seconds. The locations of the speakers as well as the single microphone are all randomly sampled. The two reverberated signals are rescaled to a random SNR between -5 and 5 dB, and further shifted such that the overlap ratio between the two speakers is 50% on average. The resultant mixture is further corrupted by random isotropic noise at a random SNR between 10 and 20 dB [8]. For evaluation, we generate mixtures in the same manner, sampled from Microsoft's internal gender-balanced clean speech collection consisting of 44 speakers. The target for separation is the reverberant clean speech of both speakers.
3.3 Experiment configurations
We train all models for 100 epochs on 4-second long segments. The learning rate is initialized to $1 \times 10^{-3}$ and decays by a factor of 0.98 every two epochs. Early stopping is applied if no better model is found on the validation set for 10 consecutive epochs. Adam [14] is used as the optimizer. Gradient clipping with a maximum $L_2$-norm of 5 is applied in all experiments. All models are trained with utterance-level permutation invariant training (uPIT) [15] to maximize scale-invariant SNR (SI-SNR) [16].
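The training objective can be illustrated with a small NumPy sketch of SI-SNR combined with an utterance-level search over speaker permutations. It uses the standard SI-SNR definition (project the zero-mean estimate onto the zero-mean reference and compare the projected energy with the residual energy); the function names, the `eps` constant, and the synthetic signals are assumptions. A training loss would typically be the negative of the returned score.

```python
import itertools
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB between two 1-D signals."""
    est = est - est.mean()
    ref = ref - ref.mean()
    target = (est @ ref) / (ref @ ref + eps) * ref  # projection onto the reference
    noise = est - target
    return 10 * np.log10((target @ target + eps) / (noise @ noise + eps))

def upit_si_snr(ests, refs):
    """Best mean SI-SNR over speaker permutations; ests and refs have shape (C, T)."""
    best_score, best_perm = -np.inf, None
    for perm in itertools.permutations(range(refs.shape[0])):
        # perm[c] is the index of the estimate assigned to reference c.
        score = np.mean([si_snr(ests[p], refs[c]) for c, p in enumerate(perm)])
        if score > best_score:
            best_score, best_perm = score, perm
    return best_score, best_perm

rng = np.random.default_rng(0)
refs = rng.standard_normal((2, 8000))
# Estimates in swapped speaker order, rescaled, with a little added noise.
ests = 0.5 * refs[::-1] + 0.01 * rng.standard_normal((2, 8000))
score, perm = upit_si_snr(ests, refs)
print(perm, round(float(score), 1))
```

Because the permutation is chosen once per utterance, the swapped output order in the example is resolved without penalizing the model, and the rescaling by 0.5 has no effect on the score.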
The effectiveness of the systems is assessed both in terms of signal fidelity and speech recognition accuracy. The improvement in signal fidelity is measured by signal-to-distortion ratio improvement (SDRi) [31] as well as SI-SNR improvement (SI-SNRi). The speech recognition accuracy is measured by the word error rate (WER) on both separated speakers.
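For reference, WER is conventionally computed as the word-level edit distance (substitutions, deletions, and insertions) between the reference and hypothesis transcripts, divided by the number of reference words. The sketch below follows that standard definition with whitespace tokenization; it is not the scoring tool used in the experiments, and the function name and example sentences are illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the") over six words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```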
4 Results and discussions
4.1 Results on WSJ0-2mix
We first report the results on the WSJ0-2mix dataset. Table 1 compares the TasNet-based systems with different separator networks. We can see that simply replacing TCN by DPRNN improves the separation performance by 4.6% with a 49% smaller model. This shows the superiority of the proposed local-global modeling to the previous CNN-based local-only modeling. Moreover, the performance can be consistently improved by further decreasing the filter length (and the hop size as a consequence) in the encoder and decoder. The best performance is obtained when the filter length is 2 samples with an encoder output of more than 30000 frames. This can be extremely hard or even impossible for standard RNNs or CNNs to model, while with the proposed DPRNN the use of such a short filter becomes possible and achieves the best performance.
Table 2 compares DPRNN-TasNet with previous systems on WSJ0-2mix. We can see that DPRNN-TasNet achieves a new record on SI-SNRi with a model 20 times smaller than the previous state-of-the-art system [28]. The small model size and the superior performance of DPRNN-TasNet indicate that speech separation on the WSJ0-2mix dataset can be solved without using enormous or complex models, revealing the need for more challenging and realistic datasets in future research.
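Table 1: TasNet-based systems with different separator networks on WSJ0-2mix.
| Separator network | Model size | Window (samples) | Chunk size (frames) | SI-SNRi (dB) | SDRi (dB) |

Table 2: Comparison with previous systems on WSJ0-2mix.
| Method | Model size | SI-SNRi (dB) | SDRi (dB) |
| Sign Prediction Net [33] | 55.2M | 15.3 | 15.6 |
| Deep CASA [19] | 12.8M | 17.7 | 18.0 |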
4.2 Speech separation and recognition results
We use a conventional hybrid system for speech recognition. Our recognition system is trained on large-scale single-speaker noisy reverberant speech collected from various sources [34]. Table 3 compares TCN- and DPRNN-based TasNet models with a 2-ms window. We can observe that DPRNN-TasNet significantly outperforms TCN-TasNet in SI-SNRi and WER, showing the superiority of DPRNN even under challenging noisy and reverberant conditions. This further indicates that DPRNN can replace conventional sequential modeling modules across a range of tasks and scenarios.
Table 3: TCN- and DPRNN-based TasNet on noisy reverberant separation and recognition.
| Separator network | Model size | SI-SNRi (dB) | WER (%) |
| Noise-free reverberant speech | – | – | 9.1 |
5 Conclusion
In this paper, we proposed the dual-path recurrent neural network (DPRNN), a simple yet effective way of organizing any type of RNN layer for modeling extremely long sequences. DPRNN splits the sequential input into overlapping chunks and performs intra-chunk (local) and inter-chunk (global) processing with two RNNs alternately and iteratively. This design allows the length of each RNN input to be proportional to the square root of the original input length, enabling sublinear processing and alleviating optimization challenges. We also described an application to single-channel time-domain speech separation using the time-domain audio separation network (TasNet). By replacing 1-D CNN modules with deep DPRNN and performing sample-level separation in the TasNet framework, a new state-of-the-art performance was obtained on WSJ0-2mix with a model 20 times smaller than the previously reported best system. Experimental results on noisy reverberant speech separation and recognition were also reported, demonstrating DPRNN's effectiveness in challenging acoustic conditions. These results demonstrate the superiority of the proposed approach in various scenarios and tasks.
References
[1] (1979) Image method for efficiently simulating small-room acoustics. The Journal of the Acoustical Society of America 65 (4), pp. 943–950.
[2] (2016) Layer normalization. arXiv preprint arXiv:1607.06450.
[3] (2019) A comprehensive study of speech separation: spectrogram vs waveform separation. arXiv preprint arXiv:1905.07497.
[4] (2018) An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271.
[5] (2017) Dilated recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 77–87.
[6] (2016) Hierarchical multiscale recurrent neural networks. arXiv preprint arXiv:1609.01704.
[7] (2018) End-to-end waveform utterance enhancement for direct evaluation metrics optimization by fully convolutional neural networks. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 26 (9), pp. 1570–1584.
[8] (2007) Generating sensor signals in isotropic noise fields. The Journal of the Acoustical Society of America 122 (6), pp. 3464–3470.
[9] (2016) Deep clustering: discriminative embeddings for segmentation and separation. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pp. 31–35.
[10] (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780.
[11] (2017) Chunk-based decoder for neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1901–1912.
[12] (2016) Single-channel multi-speaker separation using deep clustering. Interspeech 2016, pp. 545–549.
[13] (2019) Universal sound separation. arXiv preprint arXiv:1905.03330.
[14] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[15] (2017) Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing 25 (10), pp. 1901–1913.
[16] (2019) SDR–half-baked or well done?. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 626–630.
[17] (2015) A hierarchical neural autoencoder for paragraphs and documents. arXiv preprint arXiv:1506.01057.
[18] (2018) Independently recurrent neural network (indrnn): building a longer and deeper rnn. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5457–5466.
[19] (2019) Divide and conquer: a deep casa approach to talker-independent monaural speaker separation. arXiv preprint arXiv:1904.11148.
[20] (2018) End-to-end music source separation: is it possible in the waveform domain?. arXiv preprint arXiv:1810.12187.
[21] (2019) FaSNet: low-latency adaptive beamforming for multi-microphone audio processing. arXiv preprint arXiv:1909.13387.
[22] (2018) Speaker-independent speech separation with deep attractor network. IEEE/ACM Transactions on Audio, Speech, and Language Processing 26 (4), pp. 787–796.
[23] (2018) Real-time single-channel dereverberation and separation with time-domain audio separation network. Proc. Interspeech 2018, pp. 342–346.
[24] (2018) TasNet: time-domain audio separation network for real-time, single-channel speech separation. In Acoustics, Speech and Signal Processing (ICASSP), 2018 IEEE International Conference on.
[25] (2019) Conv-tasnet: surpassing ideal time–frequency magnitude masking for speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27 (8), pp. 1256–1266.
[26] (2016) SampleRNN: an unconditional end-to-end neural audio generation model. arXiv preprint arXiv:1612.07837.
[27] (2015) Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210.
[28] (2019) FurcaNeXt: end-to-end monaural speech separation with dynamic gated dilated temporal convolutional networks. arXiv preprint arXiv:1902.04891.
[29] (2018) Wave-u-net: a multi-scale neural network for end-to-end audio source separation. arXiv preprint arXiv:1806.03185.
[30] (2018) End-to-end source separation with adaptive front-ends. In 2018 52nd Asilomar Conference on Signals, Systems, and Computers, pp. 684–688.
[31] (2006) Performance measurement in blind audio source separation. IEEE transactions on audio, speech, and language processing 14 (4), pp. 1462–1469.
[32] (2018) End-to-end speech separation with unfolded iterative phase reconstruction. arXiv preprint arXiv:1804.10204.
[33] (2019) Deep learning based phase reconstruction for speaker separation: a trigonometric perspective. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 71–75.
[34] (2019, revised July) Meeting transcription using virtual microphone arrays. Technical Report MSR-TR-2019-11, Microsoft Research. Available at https://arxiv.org/abs/1905.02545.
[35] (2017) Chunk-based bi-scale decoder for neural machine translation. arXiv preprint arXiv:1705.01452.