Two-Step Sound Source Separation: Training on Learned Latent Targets
Abstract
In this paper, we propose a two-step training procedure for source separation via a deep neural network. In the first step we learn a transform (and its inverse) to a latent space where masking-based separation performance using oracles is optimal. For the second step, we train a separation module that operates on the previously learned space. In order to do so, we also make use of a scale-invariant signal-to-distortion ratio (SI-SDR) loss function that works in the latent space, and we prove that it lower-bounds the SI-SDR in the time domain. We run various sound separation experiments that show how this approach can obtain better performance as compared to systems that learn the transform and the separation module jointly. The proposed methodology is general enough to be applicable to a large class of neural network end-to-end separation systems.
Efthymios Tzinis, Shrikant Venkataramani, Zhepei Wang, Cem Subakan, Paris Smaragdis† (†Supported by NSF grant #1453104)
University of Illinois at Urbana-Champaign
Mila–Quebec Artificial Intelligence Institute
Adobe Research
Keywords: audio source separation, signal representation, cost function, deep learning
1 Introduction
Single-channel audio source separation is a fundamental problem in audio analysis, where one extracts the individual sources that constitute a mixture signal [1]. Popular algorithms for source separation include independent component analysis [2], non-negative matrix factorization [14] and, more recently, supervised [8, 6, 11, 19, 26] and unsupervised [24, 22, 4] deep learning approaches. In many of the recent approaches, separation is performed by applying a mask on a latent representation, which is often a Fourier-based or a learned domain. Specifically, a separation module produces an estimated masked latent representation for the input sources and a decoder translates them back to the time domain.
Many approaches have used the short-time Fourier transform (STFT) as an encoder to obtain this latent representation, and conversely the inverse STFT (iSTFT) as a decoder. Using this representation, separation networks have been trained using a loss defined over various targets, such as: raw magnitude spectrogram representations [8], ideal STFT masks [9, 7] and ideal affinity matrices [6, 10]. Other works have supplemented this by additionally reconstructing the phase of the sources [26, 27]. However, the ideal STFT masks impose an upper bound on the separation performance, and the aforementioned criteria do not necessarily translate to optimal separation. In order to address this, recent works have proposed end-to-end separation schemes where the encoder, decoder and separation modules are jointly optimized using a time-domain loss between the reconstructed source waveforms and their clean targets [19, 25, 12]. However, a joint time-domain end-to-end training approach might not always yield an optimal decomposition of the input mixtures, resulting in worse performance than fixed STFT bases [12].
Some studies have reported significant benefits when performing source separation in two stages. In [5], the sources are first separated and, in a second stage, the interference between the estimated sources is reduced. Similarly, an iterative scheme is proposed in [12], where the separation estimates from the first network are used as input to the final separation network. In [17], speaker separation is performed by first separating frame-level spectral components of speakers and later sequentially grouping them using a clustering network. Lately, state-of-the-art results in most natural language processing tasks have been achieved by pre-training the encoder transformation network [3].
In this work, we propose a general two-step approach for performing source separation which can be used in any mask-based separation architecture. First we pre-train an encoder and decoder in order to learn a suitable latent representation. In the second step, we train a separation module using as a loss the negative permutation-invariant [28] scale-invariant SDR (SI-SDR) [16] w.r.t. the learned latent representation. Moreover, we prove that, when the decoder is a transposed convolutional layer [19, 12], SI-SDR in the latent space bounds the time-domain SI-SDR from below. Our experiments show that by maximizing SI-SDR on the learned latent targets, a consistent performance improvement is achieved across multiple sound separation tasks compared to the time-domain end-to-end training approach, when using the exact same model architecture. The SI-SDR upper bound using the learned latent space is also significantly higher than that of STFT-domain masks. Finally, we observe that the pre-trained encoder representations are also more sparse and structured compared to the joint training approach.
2 Two-Step Source Separation
Assuming a mixture x that consists of N sources s_1, …, s_N with T time-domain samples each, we propose to perform source separation in two independent steps: A) We first obtain a latent representation for the source signals and for the input mixture. B) Then, we train a separation module which operates on the latent representation of the mixture and is trained to estimate the latent representation of the clean sources (or their masks in that space).
2.1 Step 1: Learning the Latent Targets
As a first step we train an encoder E in order to obtain a latent representation v_x = E(x) for the mixture x. We also feed the clean sources s_i to this encoder to obtain v_i = E(s_i) and apply a softmax function (across the dimension of the sources) in order to obtain separation masks m_i for each source. An element-wise multiplication of these masks with the latent representation of the mixture, m_i ⊙ v_x, can be used as an estimate for each source. The decoder module D is then trained to transform these latent representations back to the time domain using ŝ_i = D(m_i ⊙ v_x). In order to train the encoder and the decoder we optimize the permutation-invariant [28] SI-SDR [16] between the clean sources s_i and the estimated sources ŝ_i:
ℒ(ŝ, s) = −max_{π∈𝒫} Σᵢ 10 log₁₀( ‖αᵢ s_{π(i)}‖² / ‖αᵢ s_{π(i)} − ŝᵢ‖² ),  αᵢ = (ŝᵢᵀ s_{π(i)}) / ‖s_{π(i)}‖²   (1)
where π denotes the permutation of the sources that maximizes SI-SDR and the scalar αᵢ ensures that the loss is scale-invariant. A schematic representation of the aforementioned step for two sources is depicted in Fig. 1(a). The objective of this step is to find a latent representation transformation which facilitates source separation through masking.
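A minimal numerical sketch of the permutation-invariant SI-SDR loss of Eq. 1 (illustrative NumPy code, not the authors' implementation; the `eps` guard against division by zero is an added detail):

```python
# Sketch of the permutation-invariant SI-SDR loss (Eq. 1); `eps` is a small
# numerical guard that is not part of the formal definition.
import itertools
import numpy as np

def si_sdr(s, s_hat, eps=1e-8):
    """SI-SDR (dB) between a clean source s and an estimate s_hat (1-D arrays)."""
    alpha = np.dot(s_hat, s) / (np.dot(s, s) + eps)   # optimal scaling factor
    target = alpha * s
    return 10.0 * np.log10(np.sum(target ** 2)
                           / (np.sum((target - s_hat) ** 2) + eps))

def pit_si_sdr_loss(sources, estimates):
    """Negative total SI-SDR under the best permutation of the estimates."""
    n = len(sources)
    best = -np.inf
    for perm in itertools.permutations(range(n)):
        total = sum(si_sdr(sources[i], estimates[p]) for i, p in enumerate(perm))
        best = max(best, total)
    return -best
```

Scale invariance follows from the α term: rescaling an estimate by any nonzero gain leaves its SI-SDR essentially unchanged.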
2.2 Step 2: Training the Separation Module
Once the weights of the encoder and decoder modules are fixed using the training recipe described in Step 1, we can train a separation module S. Given the latent representation v_x of an input mixture, S is trained to produce an estimate ṽᵢ of the latent representation of each clean source, i.e. ṽ = S(v_x). During inference, we can use the pre-trained decoder to transform the latent source estimates back into the time domain: ŝᵢ = D(ṽᵢ). The block diagram describing the training of the separation module with a fixed encoder and decoder is shown in Fig. 1(b).
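The two steps above can be illustrated end to end with hypothetical linear stand-ins for the modules (the real encoder/decoder are 1-D conv layers and the separation module is a TDCN; everything below is a toy sketch):

```python
# Toy illustration of the two-step pipeline; matrices E and D are hypothetical
# stand-ins for the conv encoder/decoder, with hypothetical small dimensions.
import numpy as np

rng = np.random.default_rng(0)
n, k = 64, 32                                # time samples, latent dimension
E = rng.standard_normal((k, n)) * 0.1        # stand-in encoder
D = np.linalg.pinv(E)                        # stand-in decoder

s1, s2 = rng.standard_normal(n), rng.standard_normal(n)
x = s1 + s2                                  # input mixture

# Step 1: softmax masks across the source dimension, applied to the mixture code.
vx = np.maximum(E @ x, 0.0)                  # ReLU latent code of the mixture
z = np.stack([E @ s1, E @ s2])               # latent codes of the clean sources
m = np.exp(z) / np.exp(z).sum(axis=0)        # masks sum to one per latent bin
v_targets = m * vx                           # latent separation targets

# Step 2 (inference): a separation module would predict v_targets from vx;
# the frozen decoder then maps latent estimates back to the time domain.
s_est = [D @ v for v in v_targets]
```

Because the masks are a softmax over the source dimension, the masked latent estimates sum back to the mixture's latent code, mirroring the masking construction of Step 1.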
2.2.1 Training using SI-SDR on the Latent Separation Targets
In contrast to recent time-domain source separation approaches [18, 19], which train all modules E, S and D jointly using a variant of the loss defined in Eq. 1, we propose to use the permutation-invariant SI-SDR directly on the latent representation. For simplicity of notation we assume that each source has a vector latent representation vᵢ in a high-dimensional space. The loss for training the separation module can then be: ℒ(ṽ, v) = −max_{π∈𝒫} Σᵢ SI-SDR(v_{π(i)}, ṽᵢ). The exact same training procedure can be followed, but now we use as targets the optimal separation targets vᵢ in the latent space, as opposed to the time-domain signals. The premise is that if the separation module is trained to produce latent representations ṽᵢ which are close to the ideal ones vᵢ (assuming ideal permutation order), then the estimates of the sources after the decoding layer, ŝᵢ = D(ṽᵢ), will also approximate the clean sources sᵢ in the time domain. The latter might not hold for an arbitrary embedding process, but in the next section we prove that SI-SDR on the latent representations lower-bounds the SI-SDR in the time domain.
2.2.2 Relation to Maximization of SI-SDR in the Time Domain
We restrict ourselves to a decoder that consists of a 1-D transposed convolutional layer, which is the same decoder choice as in most current end-to-end source separation approaches [18, 19, 12, 25]. For this part we focus on the i-th target latent representation vᵢ that corresponds to a source time-domain signal sᵢ. Because the encoder-decoder modules are trained as described in Section 2.1, the separation target s̃ᵢ produced by the autoencoder will be close to the clean source sᵢ, namely:

s̃ᵢ = D(vᵢ) ≈ sᵢ.   (2)
The separation network produces an estimated latent vector ṽᵢ that corresponds to an estimated time-domain signal ŝᵢ. Because the decoder is just a convolutional layer, we can express it as a linear projection using a matrix P:

ŝᵢ = P ṽᵢ.   (3)
Assuming the Moore-Penrose pseudoinverse of P is well defined (i.e., P has full column rank), we express the inverse mapping from the time domain to the latent space as:

v = A s, where A = P⁺ = (PᵀP)⁻¹Pᵀ.   (4)
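A quick numerical illustration of this inverse mapping (hypothetical small dimensions; a random Gaussian matrix stands in for the transposed-conv decoder):

```python
# Sketch: the decoder as a linear map P (hypothetical stand-in) and the
# pseudoinverse mapping A = P^+ from the time domain back to the latent space.
import numpy as np

rng = np.random.default_rng(0)
n, k = 64, 16                        # time samples > latent dim: full column rank
P = rng.standard_normal((n, k))      # stand-in for the transposed-conv decoder

A = np.linalg.pinv(P)                # equals (P^T P)^{-1} P^T in this case

v = rng.standard_normal(k)           # a latent vector
s_hat = P @ v                        # decode to the time domain
v_back = A @ s_hat                   # map back to the latent space
```

Since P here has full column rank, A P is the identity and the round trip recovers the latent vector exactly.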
Proposition 1.

Let x, y ∈ ℝⁿ and their corresponding projections through A be defined as Ax and Ay, respectively. If ‖x‖₂ = ‖y‖₂ = 1, then the absolute value of their inner product in the projection space is bounded above by the absolute value of their inner product in ℝⁿ plus a constant, namely: |⟨Ax, Ay⟩| ≤ |⟨x, y⟩| + C, where C = ‖AᵀA − I‖₂ depends only on the values of A.
Proof.

The inner product in the projection space can be rewritten as:

⟨Ax, Ay⟩ = ⟨x, AᵀAy⟩ = ⟨x, y⟩ + ⟨x, (AᵀA − I)y⟩.   (5)

We can bound the second term of Eq. 5 by applying the Cauchy-Schwarz inequality to the inner product:

|⟨x, (AᵀA − I)y⟩| ≤ ‖x‖₂ ‖(AᵀA − I)y‖₂.   (6)

Similarly, using the definition of the operator norm and the fact that ‖x‖₂ = ‖y‖₂ = 1, we can bound the right-hand side of inequality 6 as well:

‖x‖₂ ‖(AᵀA − I)y‖₂ ≤ ‖AᵀA − I‖₂ ‖x‖₂ ‖y‖₂ = C.   (7)

Then, by applying inequalities 6 and 7 to Eq. 5 together with the triangle inequality, we get:

|⟨Ax, Ay⟩| ≤ |⟨x, y⟩| + C,   (8)

where always C ≥ 0. Finally, we conclude that |⟨Ax, Ay⟩| ≤ |⟨x, y⟩| + C. ∎
Proposition 2.
Let u, v ∈ ℝⁿ with unit norms; then maximizing SI-SDR(u, v) w.r.t. v is equivalent to maximizing the inner product ⟨u, v⟩ w.r.t. v.
Proof.
By assuming that there is an optimal solution v*, and noting that for unit-norm vectors SI-SDR(u, v) = 10 log₁₀( ⟨u, v⟩² / (1 − ⟨u, v⟩²) ), which is a monotonically increasing function of ⟨u, v⟩ for ⟨u, v⟩ ∈ (0, 1), we have:

v* = argmax_v SI-SDR(u, v) = argmax_v ⟨u, v⟩,   (9)

which means that the two optimization goals are equivalent. ∎
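For unit-norm vectors, SI-SDR reduces to a closed-form monotone function of the inner product, which is the mechanism behind this equivalence; a quick numerical check (illustrative NumPy code):

```python
# Numerical check: for unit-norm u, v, SI-SDR(u, v) collapses to
# 10*log10(c^2 / (1 - c^2)) with c = <u, v>, a monotone function of c on (0, 1).
import numpy as np

def si_sdr(s, s_hat):
    alpha = np.dot(s_hat, s) / np.dot(s, s)   # optimal scaling factor
    return 10.0 * np.log10(np.sum((alpha * s) ** 2)
                           / np.sum((alpha * s - s_hat) ** 2))

rng = np.random.default_rng(0)
u = rng.standard_normal(64); u /= np.linalg.norm(u)
v = rng.standard_normal(64); v /= np.linalg.norm(v)

c = np.dot(u, v)
closed_form = 10.0 * np.log10(c ** 2 / (1.0 - c ** 2))
```

The identity follows from the optimal scaling α = c, which gives ‖αu − v‖² = 1 − c².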
Now we focus on the relationship between the maximization of SI-SDR for the i-th source when it is performed directly on the latent space, i.e. on SI-SDR(vᵢ, ṽᵢ), and when it is performed in the time domain using the clean source as a target, i.e. on SI-SDR(sᵢ, ŝᵢ). Again, because all the SI-SDR measures are scale-invariant, we can assume that the separation targets and the estimate vectors have unit norms in both the time domain and the latent space, namely ‖s̃ᵢ‖₂ = ‖ŝᵢ‖₂ = ‖vᵢ‖₂ = ‖ṽᵢ‖₂ = 1. By using Proposition 1 we get:

|⟨vᵢ, ṽᵢ⟩| = |⟨A s̃ᵢ, A ŝᵢ⟩| ≤ |⟨s̃ᵢ, ŝᵢ⟩| + C.   (10)
Thus, by using the autoencoder property (Eq. 2) and Proposition 2, we conclude that SI-SDR(vᵢ, ṽᵢ) on the latent space lower-bounds the corresponding value SI-SDR(sᵢ, ŝᵢ) in the time domain. The same proof holds for any encoder E and for other targets on the latent space, such as the masks mᵢ. Empirically, we indeed notice that the maximization of SI-SDR on the latent space leads to the maximization of SI-SDR in the time domain.
3 Experimental Framework
To experimentally verify our approach we perform a set of source separation experiments as described in the following sections.
3.1 Audio Data
We use two audio data collections. For speech sources we use speech utterances from the Wall Street Journal (WSJ0) corpus [20]. Training, validation and test speaker mixtures are generated by randomly selecting various speakers from the sets si_tr_s, si_dt_05 and si_et_05, respectively.
For non-speech sounds we use the audio clips, equally balanced between classes, from the environmental sound classification (ESC-50) data collection [21]. ESC-50 spans various sound categories such as: non-speech human sounds, animal sounds, natural soundscapes, interior sounds and urban noises. We split the data into train, validation and test sets with a ratio of , respectively. For each set, the same prior is used across classes (i.e., each class has the same number of clips). Also, the sets do not share clips which originate from the same initial source file.
3.2 Sound Source Separation Tasks
In order to develop a system capable of performing universal sound source separation [12], we evaluate our two-step approach under three distinct sound separation tasks. For all separation tasks, each input mixture consists of two sources which are always mixed using secs of their total duration. All audio clips are downsampled to kHz for efficient processing. We discuss the audio collection(s) that we utilize and the mixture generation process in the sections below.
3.2.1 Speech Separation
We only use audio clips containing human speech from WSJ0. In accordance with other studies performing experiments on single-channel speech source separation [19, 23, 15, 27, 26], we use the publicly available WSJ0-2mix dataset [6]. In total there are , and mixtures for training, validation and testing, respectively.
3.2.2 Non-Speech Separation
We use audio clips only from ESC-50. In this case, the total number of available clean source sounds is small, and thus we propose an augmented mixture generation process which enables the creation of much more diverse mixtures. In order to generate each mixture, we randomly select a sec segment from each of two audio files belonging to two distinct audio classes. We mix these two segments with a random signal-to-noise ratio (SNR) between and dB. For each epoch, training mixtures are generated which are generally not the same as the ones generated for other epochs. For the validation and test sets we fix the random seeds in order to always evaluate on the same and generated mixtures, respectively.
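The mixing step can be sketched as follows; the segment length (8000 samples) and the SNR range below are hypothetical placeholders, since the exact values are not reproduced in this text:

```python
# Sketch of mixing two segments at a random SNR; the segment length and the
# SNR range are hypothetical placeholders, not the paper's exact values.
import numpy as np

def mix_at_snr(s1, s2, snr_db):
    """Scale s2 so that s1 sits at snr_db relative to it, then sum."""
    p1 = np.mean(s1 ** 2)
    p2 = np.mean(s2 ** 2)
    gain = np.sqrt(p1 / (p2 * 10.0 ** (snr_db / 10.0)))
    return s1 + gain * s2, gain

rng = np.random.default_rng(0)
s1 = rng.standard_normal(8000)
s2 = rng.standard_normal(8000)
snr_db = rng.uniform(-2.5, 2.5)      # random SNR for this mixture (assumed range)
mixture, gain = mix_at_snr(s1, s2, snr_db)
```

Drawing fresh segments, class pairs and SNR values every epoch is what makes the augmented mixtures differ across epochs.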
3.2.3 Mixed Separation
All four possible mixture combinations between speech and non-speech audio are considered by using both WSJ0 and ESC-50 sources. Building upon the data augmentation training idea, we also add a random variable which controls the data collection (ESC-50 or WSJ0) from which a source waveform is going to be chosen. Specifically, we set an equal probability of choosing a source file from the two collections. For WSJ0, each speaker is considered a distinct sound class; thus, no mixture consists of utterances from the same speaker. After the two source waveforms are chosen, we follow the mixture generation process described in Section 3.2.2.
3.3 Selected Network Architectures
Based on recent state-of-the-art approaches on both speech and universal sound source separation with learnable encoder and decoder modules, we consider configurations for the encoder-decoder parts as well as the separation module which are based on a similar time-dilated convolutional network (TDCN) architecture. In particular, we consider our implementations of Conv-TasNet [19], which we refer to simply as TDCN, and its improved version proposed in [12], which we refer to as residual-TDCN (RTDCN).
3.3.1 Encoder-Decoder Architecture
The encoder E consists of one 1-D convolutional layer with a ReLU activation on top in order to ensure a non-negative latent representation of each audio input. Following the assumptions stated in Section 2.2.2, we use a 1-D transposed convolutional layer for the decoder D. Both encoder and decoder have the same number of channels (or number of bases) and their 1-D kernels have a length corresponding to ms ( samples) and a hop-size equivalent to ms ( samples). For each task we select a different number of channels for the encoder and the decoder modules (, and for the speech-only, mixed and non-speech-only separation tasks, respectively).
3.3.2 Separation Module Architectures
Our implementation of TDCN consists of the same architecture and parameter configuration for the separation module as described in [19], with an additional batch normalization layer before the final mask estimation which improved its performance over the original version on all separation tasks. Inspired by the original RTDCN separation module [12], we keep the same parameter configuration as TDCN and we additionally use a feature-wise normalization between layers instead of global normalization. We also add long-term residual connections from previous layers. Moreover, before summing the residual connections, we concatenate them, normalize them and feed them through a dense layer, as the latter yields some further improvement in separation performance. (Source code is available online: github.com/etzinis/two_step_mask_learning.)
3.4 Training and Evaluation Details
In order to show the effectiveness of our proposed two-step approach, we use the same network architecture when we perform end-to-end time-domain source separation, using as a loss the negative SI-SDR between the estimated signals ŝᵢ in the time domain and the clean waveforms sᵢ. In our two-step approach, we instead train the encoder-decoder parts separately as described in Section 2.1. In the second step, we use the pre-trained encoders for each task and train the separation module using as a loss the negative SI-SDR on the latent-space targets vᵢ or their corresponding masks mᵢ (see Section 2.2). We train all models using the Adam optimizer [13]; the batch size is equal to , the initial learning rate is set to and we decrease it by a factor of at the th epoch. We train the TDCN and RTDCN separation networks for epochs and epochs, respectively. The encoder-decoder parts for each task are trained independently for epochs ( times faster than training the separation network). We evaluate the separation performance of all models using the SI-SDR improvement (SI-SDRi) in the time domain, which is the difference between the SI-SDR of the estimated signal and that of the input mixture signal [19, 12]. As the STFT oracle mask we choose the ideal ratio mask (IRM), using a Hanning window with ms length and ms hop-size [19].
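The evaluation metric can be sketched as follows (illustrative NumPy code; `eps` is a numerical guard that is an added detail):

```python
# Sketch of the evaluation metric: SI-SDR improvement (SI-SDRi) is the SI-SDR
# of the estimate minus the SI-SDR of the unprocessed mixture, per source.
import numpy as np

def si_sdr(s, s_hat, eps=1e-8):
    alpha = np.dot(s_hat, s) / (np.dot(s, s) + eps)   # optimal scaling factor
    return 10.0 * np.log10(np.sum((alpha * s) ** 2)
                           / (np.sum((alpha * s - s_hat) ** 2) + eps))

def si_sdri(source, estimate, mixture):
    """SI-SDRi in dB: how much the estimate improves over the raw mixture."""
    return si_sdr(source, estimate) - si_sdr(source, mixture)
```

An estimate identical to the mixture scores 0 dB improvement by construction, which makes SI-SDRi directly comparable across mixtures of different difficulty.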
4 Results & Discussion
4.1 Comparison with TimeDomain Separation
In Table 1, the mean separation performance of the best models is reported for each task. We notice that the proposed two-step approach of training on the latent space leads to a consistent improvement over the end-to-end approach, where we train the same architecture using the time-domain SI-SDR loss. This observation holds when different separation modules are used and when we test them under different separation tasks. The non-speech separation task seems the hardest one, since the models have access to only a limited number of training mixtures, which further underlines the importance of our proposed data-augmentation technique as described in Section 3.2.2. Our two-step approach yields an absolute SI-SDR improvement over the end-to-end baseline of up to dB, dB and dB for the speech, non-speech and mixed separation tasks, respectively. Notably, this performance improvement is achieved using the exact same architecture, but instead of training it end-to-end using a time-domain loss, we pre-train the autoencoder part and use a loss on the latent representations of the sources.



Table 1: SI-SDRi separation performance (dB) for each separation module and target domain across the three tasks.

Separation Module | Target Domain | Speech | Non-speech | Mixed
TDCN              | Time          | 15.4   | 7.7        | 11.7
TDCN              | Latent        | 16.1   | 8.2        | 12.4
RTDCN             | Time          | 15.6   | 8.3        | 12.0
RTDCN             | Latent        | 16.2   | 8.4        | 12.6
Oracle Masks      | STFT          | 13.0   | 14.8       | 14.5
Oracle Masks      | Latent        | 34.1   | 39.2       | 39.5
4.2 Separation Targets in the Latent Space
In Table 1, we see that the oracle mask obtained from the two-step approach gives a much higher upper bound of separation performance, for all tasks, compared to ideal masks in the STFT domain. This is in line with prior work that proposed to decompose signals using learned transforms [19, 25]. In Fig. 2 we can qualitatively compare the latent representations obtained from the same encoder when trained with our proposed two-step approach and with the baseline joint training of all modules. When the encoder and decoder are trained individually, fewer bases are used to encode the input, which leads to a sparser representation (the norm is roughly smaller compared to the joint training approach). Finally, the latent representations obtained from our proposed approach exhibit a spectrogram-like structure, in that speech is encoded using fewer bases than high-frequency sounds like bird chirping.
5 Conclusion
We show how pre-learning a suitable latent space can result in better source separation performance compared to a time-domain end-to-end training approach. Our experiments show that the proposed two-step approach yields a consistent performance improvement under multiple sound separation tasks. Additionally, the obtained sound latent representations remain sparse and structured, while they also enjoy a much higher upper bound of separation performance compared to STFT-domain masks. Although this approach was demonstrated on TDCN architectures, it can be easily adapted for use with any other mask-based system.
References
 [1] (1998) Blind source separation based on time-frequency signal representations. IEEE Transactions on Signal Processing 46 (11), pp. 2888–2897.
 [2] (2005) Blind source separation and independent component analysis: a review. Neural Information Processing - Letters and Reviews 6 (1), pp. 1–57.
 [3] (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186.
 [4] (2019) Unsupervised training of a deep clustering model for multichannel blind source separation. In Proc. ICASSP, pp. 695–699.
 [5] (2017) Two-stage single-channel audio source separation using deep neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP) 25 (9), pp. 1773–1783.
 [6] (2016) Deep clustering: discriminative embeddings for segmentation and separation. In Proc. ICASSP, pp. 31–35.
 [7] (2016) Neural network based spectral mask estimation for acoustic beamforming. In Proc. ICASSP, pp. 196–200.
 [8] (2014) Deep learning for monaural speech separation. In Proc. ICASSP, pp. 1562–1566.
 [9] (2015) Joint optimization of masks and deep recurrent neural networks for monaural source separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 23 (12), pp. 2136–2147.
 [10] (2016) Single-channel multi-speaker separation using deep clustering. In Proc. Interspeech.
 [11] (2017) Singing voice separation with deep U-Net convolutional networks. In Proc. ISMIR, pp. 323–332.
 [12] (2019) Universal sound separation. In Proc. WASPAA.
 [13] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
 [14] (2015) Deep NMF for speech separation. In Proc. ICASSP, pp. 66–70.
 [15] (2019) Phasebook and friends: leveraging discrete representations for source separation. IEEE Journal of Selected Topics in Signal Processing 13 (2), pp. 370–382.
 [16] (2019) SDR - half-baked or well done? In Proc. ICASSP, pp. 626–630.
 [17] (2019) Divide and conquer: a deep CASA approach to talker-independent monaural speaker separation. arXiv preprint arXiv:1904.11148.
 [18] (2018) TasNet: time-domain audio separation network for real-time, single-channel speech separation. In Proc. ICASSP, pp. 696–700.
 [19] (2019) Conv-TasNet: surpassing ideal time-frequency magnitude masking for speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27 (8), pp. 1256–1266.
 [20] (1992) The design for the Wall Street Journal-based CSR corpus. In Speech and Natural Language: Proceedings of a Workshop Held at Harriman, New York, February 23-26, 1992.
 [21] (2015) ESC: dataset for environmental sound classification. In Proc. ACM International Conference on Multimedia, pp. 1015–1018.
 [22] (2019) Bootstrapping single-channel source separation via unsupervised spatial clustering on stereo mixtures. In Proc. ICASSP, pp. 356–360.
 [23] (2019) Furcax: end-to-end monaural speech separation based on deep gated (de)convolutional neural networks with adversarial example training. In Proc. ICASSP, pp. 6985–6989.
 [24] (2019) Unsupervised deep clustering for source separation: direct learning from mixtures using spatial information. In Proc. ICASSP, pp. 81–85.
 [25] (2018) End-to-end source separation with adaptive front-ends. In Proc. Asilomar Conference on Signals, Systems, and Computers, pp. 684–688.
 [26] (2019) Deep learning based phase reconstruction for speaker separation: a trigonometric perspective. In Proc. ICASSP, pp. 71–75.
 [27] (2018) Phase reconstruction with learned time-frequency representations for single-channel speech separation. In Proc. IWAENC, pp. 396–400.
 [28] (2017) Permutation invariant training of deep models for speaker-independent multi-talker speech separation. In Proc. ICASSP, pp. 241–245.