The Cone of Silence: Speech Separation by Localization

Given a multi-microphone recording of an unknown number of speakers talking concurrently, we simultaneously localize the sources and separate the individual speakers. At the core of our method is a deep network, in the waveform domain, which isolates sources within an angular region $\theta \pm \frac{w}{2}$, given an angle of interest $\theta$ and angular window size $w$. By exponentially decreasing $w$, we can perform a binary search to localize and separate all sources in logarithmic time. Our algorithm allows for an arbitrary number of potentially moving speakers at test time, including more speakers than seen during training. Experiments demonstrate state-of-the-art performance for both source separation and source localization, particularly in high levels of background noise.

1 Introduction

The ability of humans to separate and localize sounds in noisy environments is a remarkable phenomenon known as the “cocktail party effect.” However, our natural ability only goes so far – we may still have trouble hearing a conversation partner in a noisy restaurant or during a call with other speakers in the background. One can imagine future earbuds or hearing aids that selectively cancel audio sources that you don’t want to listen to. As a step towards this goal, we introduce a deep neural network technique that can be steered to any direction at run time, cancelling all audio sources outside a specified angular window, a.k.a. the cone of silence (CoS).

But how do you know what direction to listen to? We further show that this directionally sensitive CoS network can be used as a building block to yield simple yet powerful solutions to 1) sound localization, and 2) audio source separation. Our experimental evaluation demonstrates state-of-the-art performance in both domains. Furthermore, our ability to handle an unknown number of potentially moving sound sources combined with fast performance represents additional steps forward in generality. Audio demos can be found at our project website.

We are particularly motivated by the recent increase of multi-microphone devices in everyday settings. This includes headphones, hearing aids, smart home devices, and many laptops. Indeed, most of these devices already employ directional sensitivity both in the design of the individual microphones and in the way they are combined together. In practice, however, this directional sensitivity is limited to either being hard-tuned to a fixed range of directions (e.g., cardioid), or providing only limited attenuation of audio outside that range (e.g., beam-forming). In contrast, our CoS approach enables true cancellation of audio sources outside an angular window that can be set (and instantly changed) in software.

Our approach uses a novel deep network that can separate sources in the waveform domain within any angular region $\theta \pm \frac{w}{2}$, parameterized by a direction of interest $\theta$ and angular window size $w$. For simplicity, we focus only on azimuth angles, but the method could equally be applied to elevation as well. By exponentially decreasing $w$, we perform a binary search to separate and localize all sources in logarithmic time (Figure 1). Unlike many traditional methods that perform direction based separation, we can also ignore background source types, such as music or ambient noise. Qualitative and quantitative results show state-of-the-art performance and a direct applicability to a wide variety of real world scenarios. Our key contribution is a logarithmic time algorithm for simultaneous localization and separation of speakers, particularly in high levels of noise, allowing for an arbitrary number of speakers at test time, including more speakers than seen during training. We strongly encourage the reader to view our supplementary results for a demo of our method and audio results.

Figure 1: Overview of Separation by Localization running binary search on an example scenario with 3 sources. Each panel shows the spatial layout of the scene with the microphone array located at the center. During Step 1, the algorithm performs separation on candidate regions of $90^\circ$. The quadrants with no sound get suppressed and disregarded. The algorithm continues doing separation on smaller partitions of candidate regions until reaching the final step where the angular window size is $2^\circ$.

2 Related Work

Source separation has seen tremendous progress in recent years, particularly with the increasing popularity of learning methods, which improve over traditional methods such as Cardoso (1998); Nesta et al. (2010). In particular, unsupervised source modeling methods train a model for each source type and apply the model to the mixture for separation by using methods like NMF Raj et al. (2010); Mohammadiha et al. (2013), clustering Tzinis et al. (2019); Sawada et al. (2010); Luo et al. (2017), or Bayesian methods Itakura et al. (2018); Jayaram and Thickstun (2020); Benaroya et al. (2005). Supervised source modeling methods train a model for each source from annotated isolated signals of each source type, e.g., pitch information for music Bertin et al. (2010). Separation based training methods like Halperin et al. (2018); Jansson et al. (2017); Nugraha et al. (2016) employ deep neural networks to learn source separation from mixtures given the ground truth signals as training data, also known as the mix-and-separate framework.

Recent trends include the move to operating directly on waveforms Stoller et al. (2018); Luo and Mesgarani (2018, 2019) yielding performance improvements over frequency-domain spectrogram techniques such as Hershey et al. (2016); Xu et al. (2018); Weng et al. (2015); Tzinis et al. (2019); Yoshioka et al. (2018); Chen et al. (2018); Zhang and Wang (2017). A second trend is increasing the numbers of microphones, as methods based on multi-channel microphone arrays Yoshioka et al. (2018); Chen et al. (2018); Gu et al. (2020) and binaural recordings Zhang and Wang (2017); Han et al. (2020) perform better than single-channel source separation techniques. Combined audio-visual techniques Zhao et al. (2018); Rouditchenko et al. (2019) have also shown promise.

Sound localization is a second active research area, often posed as direction of arrival (DOA) estimation Grondin and Glass (2019); Nadiri and Rafaely (2014); Pavlidi et al. (2013). Popular methods include beamforming DiBiase (2000), subspace methods Schmidt (1986); Wang and Kaveh (1985); Di Claudio and Parisi (2001); Yoon et al. (2006), and sampling-based methods Pan et al. (2017). A recent trend is the use of deep neural networks for multi-source DOA estimation, e.g., He et al. (2018); Adavanne et al. (2018).

One key challenge is that the number of speakers in real world scenarios is often unknown or non-constant. Many methods require a priori knowledge about the number of sources, e.g., Luo and Mesgarani (2018); Luo et al. (2020). Recent deep learning methods that address separation with an unknown number of speakers include Higuchi et al. (2017), Takahashi et al. (2019), and Nachmani et al. (2020). However, these methods use an additional model to help predict the number of speakers, and Nachmani et al. (2020) further uses a different separation model for different numbers of speakers.

Although DOA provides a practical approach for source separation, methods that take this approach suffer from another shortcoming: the direction of interest needs to be known in advance Adel et al. (2012). Without a known DOA for each source, these methods must perform a linear sweep of the entire angular space, which is computationally infeasible for state-of-the-art deep networks at fine-grained angular resolutions.

Some prior work has addressed joint localization and separation. For example, Traa and Smaragdis (2014); Mandel et al. (2009); Asano and Asoh (2004); Dorfan et al. (2015); Deleforge et al. (2015); Mandel et al. (2007) use expectation maximization to iteratively localize and separate sources. Traa et al. (2015) uses the idea of Directional NMF, while Johnson et al. (2018) poses separation and localization as a Bayesian inference problem based on inter-microphone phase differences. Our method improves on these approaches by combining deep learning in the waveform domain with efficient search.

3 Method

In this section we describe our Cone of Silence network for angle based separation. The target angle $\theta$ and window size $w$ are learned independently: separation at $\theta$ is handled entirely by a pre-shift step, while an additional network input is used to produce the window of size $w$. We also describe how to use the network for separation by localization via binary search.

Problem Formulation: Given a known-configuration microphone array with $M$ microphones and $M > 1$, the problem of $M$-channel source separation and localization can be formulated in terms of estimating $N$ sources $s_1, \dots, s_N \in \mathbb{R}^{M \times T}$ and their corresponding angular positions $\theta_1, \dots, \theta_N$ from an $M$-channel discrete waveform of the mixture $x \in \mathbb{R}^{M \times T}$ of length $T$, where

$$x = \sum_{i=1}^{N} s_i + bg.$$

Here $bg$ represents the background signal, which could be a point source like music or diffuse-field background noise without any specific location.

In this paper we explore circular microphone arrays, but we also describe possible modifications to support linear arrays. The center of our coordinate system is always the center of the microphone array, and the angular position of each source, $\theta_i$, is defined based on this coordinate system. In the problem formulation we assume the sources are stationary, but we describe how to handle potentially moving sources in Section 4.5. In addition, we only focus on separation and localization by azimuth angle, meaning that we assume the sources have roughly the same elevation angle. As we show in the experimental section, this assumption is valid for most real world scenarios.

3.1 Cone of Silence Network (CoS)

We propose a network that performs source separation given an angle of interest $\theta$ and an angular window size $w$. The network is tasked with separating speech only coming from azimuthal directions between $\theta - \frac{w}{2}$ and $\theta + \frac{w}{2}$ and disregarding speech coming from other directions. In the following sections we describe how to create a network with this property. Figure 2 shows our proposed network architecture. $\theta$ and $w$ are encoded in a shifted input $x'$ and a one-hot vector $h$ as described in Section 3.1.2 and Section 3.1.3 respectively.

Figure 2: (left) our network architecture, (top-right) the encoder block, (bottom-right) the decoder block. In all diagrams, $h$ refers to the global conditioning variable corresponding to an angular window size $w$.

3.1.1 Base Architecture

Our CoS network is adapted from the Demucs architecture Défossez et al. (2019), a music separation network, which is similar to the Wave U-Net architecture Jansson et al. (2017). We extend the original Demucs network to our problem formulation by modifying the number of input and output channels to match the number of microphones.

There are several reasons why this base architecture is well suited for our task. As mentioned in Section 2, networks that operate on the raw waveform have been recently shown to outperform spectrogram based methods. In addition, Demucs was specifically designed to work at sampling rates as high as 44.1 kHz, while other speech separation networks operate at rates as low as 8 kHz. Although human speech can be represented at lower sampling rates, we find that operating at higher sampling rates is beneficial for capturing small time differences of arrival between the microphones. This would also allow our method to be extended to high resolution source types, like music, where a high sampling rate is necessary.

3.1.2 Target Angle

In order to make the network output sources from a specific target angle $\theta$, we use a shifted mixture $x'$ based on $\theta$. We found that this worked better than trying to directly condition the network based on both $\theta$ and $w$. The shifted mixture $x'$ is created as follows: by calculating the time difference of arrival at each microphone, we shift each channel in the original signal $x$ such that signals coming from angle $\theta$ are temporally aligned across all channels in $x'$.

We use the fact that the time differences of arrival (TDOA) between the microphones are primarily based on the azimuthal angle for a far-field source. This assumption is valid when the sources are roughly on the same plane as the microphone array Valin et al. (2003). Let $c$ be the speed of sound, $sr$ be our sampling rate, $p_\theta$ be the position of a far-field source at angle $\theta$, and $d(\cdot, \cdot)$ be a simple Euclidean distance. The TDOA in samples for the source to reach the $i$-th microphone $M_i$, located at $p_{M_i}$, is

$$T_{\text{delay}}(p_\theta, M_i) = \left\lfloor \frac{d(p_\theta, p_{M_i})}{c} \cdot sr \right\rceil .$$

In our experiments, we chose $M_0$ as our canonical position, meaning that $x'_0 = x_0$ and all other channels are shifted to align with $x_0$:

$$x'_i = \text{shift}\big(x_i,\; T_{\text{delay}}(p_\theta, M_i) - T_{\text{delay}}(p_\theta, M_0)\big).$$

Here shift is a 1-D shift operation with one-sided zero padding. This idea is similar to the first step of a Delay and Sum Beamformer Johnson and Dudgeon (1992). We then train the network to output sources which are temporally aligned in $x'$ while ignoring all others. For $M > 2$, these shifts are unique for a specific angle $\theta$, so sources from other angles will not be temporally aligned. If $M = 2$ or the mic array is linear, then sources at angles $\theta$ and $-\theta$ have the same per-channel shift, leading to front-back confusion.
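The pre-shift step can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the speed of sound (343 m/s), the finite `source_dist` standing in for a far-field source, and the helper names `circular_array`, `tdoa_samples`, `shift`, and `pre_shift` are all hypothetical. The sign convention chosen here is that a positive shift advances a channel.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value, not taken from the paper


def circular_array(num_mics, radius):
    """Hypothetical helper: mics evenly spaced on a circle centered at the origin."""
    step = 2 * math.pi / num_mics
    return [(radius * math.cos(i * step), radius * math.sin(i * step))
            for i in range(num_mics)]


def tdoa_samples(theta_deg, mic_pos, sample_rate, source_dist=100.0):
    """Travel time in whole samples from a distant source at angle theta to one mic."""
    theta = math.radians(theta_deg)
    source = (source_dist * math.cos(theta), source_dist * math.sin(theta))
    return round(math.dist(source, mic_pos) / SPEED_OF_SOUND * sample_rate)


def shift(channel, k):
    """Advance a channel by k samples (k < 0 delays it), zero-padding one side."""
    n = len(channel)
    if k >= 0:
        return channel[k:] + [0.0] * min(k, n)
    return [0.0] * min(-k, n) + channel[:max(n + k, 0)]


def pre_shift(mixture, theta_deg, mic_positions, sample_rate):
    """Align every channel to channel 0 for sound arriving from theta."""
    delays = [tdoa_samples(theta_deg, p, sample_rate) for p in mic_positions]
    # A channel that hears the source later than mic 0 is advanced by the difference.
    return [shift(ch, d - delays[0]) for ch, d in zip(mixture, delays)]
```

After `pre_shift`, a source located at `theta_deg` appears at the same sample index in every channel, while sources at other angles remain misaligned; that misalignment is the cue the network is trained to exploit.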

3.1.3 Angular Window Size

Although the network trained on shifted inputs as described in Section 3.1.2 can produce accurate output for a given angle $\theta$, it requires prior knowledge about the target angle for each source of interest. In addition, real sources are not perfect point sources and have finite width, especially in the presence of slight movements.

To solve these problems, we introduce a second variable, an angular window size $w$, which is passed as a global conditioning parameter to the network. This angular window size facilitates the application of this network for a fast binary search approach. It also allows the localization and separation of moving sources. By using a larger angular window size and a shorter temporal input waveform, it is possible to localize and separate moving sources within that window.

Motivated by the global conditioning framework in WaveNet Oord et al. (2016), we use a one-hot encoded variable $h$ to represent different window sizes. In our experiments, we use $h$ of size 5, corresponding to five window sizes from $90^\circ$ at the coarsest level down to $2^\circ$ at the finest. By passing $h$ to the network with our shifted input $x'$, we can explicitly make the network separate sources from the region $\theta \pm \frac{w}{2}$. We pass $h$ to all encoder and decoder blocks in the network through a learned linear projection, as shown in Figure 2. Formally, the equations for the encoder block and decoder block can be written as follows:


The notation denotes a 1-D convolution between the weights for the layer of an encoding/decoding block at level and an input . The notation denotes a transposed convolution operation. Empirically we found that passing to every encoder and decoder block worked significantly better than passing it to the network only once. Evidence that the CoS network learns the desired window size is presented in Figure 4.
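As a rough illustration of WaveNet-style global conditioning, the sketch below builds the one-hot vector for a window size and adds its linear projection as a per-channel bias to a block's features. It is only a sketch: the exact placement of the projection inside the encoder and decoder blocks is not reproduced here, and the names `one_hot` and `add_condition`, as well as the plain-list representation of features and weights, are assumptions.

```python
def one_hot(window_deg, window_sizes):
    """One-hot vector h selecting window_deg from the list of supported sizes."""
    if window_deg not in window_sizes:
        raise ValueError(f"unsupported window size: {window_deg}")
    return [1.0 if w == window_deg else 0.0 for w in window_sizes]


def add_condition(features, projection, h):
    """Add the projected conditioning vector to every time step of each channel.

    features:   list of channels, each a list of samples
    projection: one row of len(h) weights per channel (learned in a real network)
    """
    bias = [sum(v * x for v, x in zip(row, h)) for row in projection]
    return [[sample + b for sample in channel]
            for channel, b in zip(features, bias)]
```

Because `h` is one-hot, the projection simply selects one learned bias vector per window size, which is then broadcast over time.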

3.1.4 Network Training

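Consider an input mixture $x$ of $N$ sources $s_1, \dots, s_N$ with the corresponding locations $\theta_1, \dots, \theta_N$, along with a target angle $\theta_t$ and window size $w$. The network is trained with the following objective function:

$$\mathcal{L}(x, \theta_t, w) = \Big\lVert \tilde{x}' - \sum_{i=1}^{N} s'_i \cdot \mathbb{I}\big(\theta_t - \tfrac{w}{2} \le \theta_i \le \theta_t + \tfrac{w}{2}\big) \Big\rVert,$$

where $x'$ and $s'_i$ are the shifted signals of the input mixture and ground truth signal as described in Section 3.1.2 based on the target angle $\theta_t$. $\tilde{x}'$ is the output of the network using the shifted signal $x'$ and the angular window $w$. $\mathbb{I}(\cdot)$ is an indicator function, indicating whether $s_i$ is present in the region $\theta_t \pm \frac{w}{2}$. If no source is present in the region $\theta_t \pm \frac{w}{2}$, the training target is a zero tensor $\mathbf{0}$.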
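The construction of the training target can be made concrete with a small sketch. The function names are hypothetical, signals are nested lists indexed as `[source][channel][sample]`, and the use of a mean absolute (L1) error is an assumption of this example rather than something stated in the text above.

```python
def angle_diff(a_deg, b_deg):
    """Smallest absolute difference between two azimuths, in degrees."""
    return abs((a_deg - b_deg + 180.0) % 360.0 - 180.0)


def training_target(shifted_sources, source_angles, target_angle, window):
    """Sum the pre-shifted sources lying inside target_angle +/- window/2.

    shifted_sources[i][c][t] is sample t of channel c of source i. With no
    source inside the window the result stays all zeros.
    """
    channels, length = len(shifted_sources[0]), len(shifted_sources[0][0])
    target = [[0.0] * length for _ in range(channels)]
    for source, angle in zip(shifted_sources, source_angles):
        if angle_diff(angle, target_angle) <= window / 2:
            target = [[t + s for t, s in zip(t_ch, s_ch)]
                      for t_ch, s_ch in zip(target, source)]
    return target


def l1_loss(output, target):
    """Mean absolute error; the choice of L1 is an assumption of this sketch."""
    diffs = [abs(o - t) for o_ch, t_ch in zip(output, target)
             for o, t in zip(o_ch, t_ch)]
    return sum(diffs) / len(diffs)
```

Note that `angle_diff` wraps around, so a source at 175 degrees counts as inside a 20 degree window centered at 180 degrees.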

3.2 Localization and Separation via Binary Search

By starting with a large window size and decreasing it exponentially, we can perform a binary search of the angular space in logarithmic time, while separating the sources simultaneously. More concretely, we start with our initial window size $w_0 = 90^\circ$, our initial target angles $\Theta_0$ (one per quadrant), and our observed $M$-channel mixture $x$. In the first pass we run the network $\mathrm{CoS}(x, \theta, w_0)$ for all $\theta \in \Theta_0$. This first step is the quadrant based separation illustrated in Step 1 of Figure 1. Because regions without sources will produce empty outputs, we can discard large angular regions early on with a simple cutoff. We then recurse on a smaller window size $w_1$ and on new candidate regions $\Theta_1$, drawn from the regions with high energy outputs at $w_0$. We continue to recurse on smaller window sizes until reaching the desired resolution. The complete algorithm is written below and shown in Figure 1.

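Inputs : $M$-channel input mixture $x$ and the microphone array position
Output : Separated signals and their locations
Initialize the window size $w_0$, the candidate angles $\Theta_0$, and the number of levels $L$.
for $l = 0, \dots, L - 1$ do
      for $\theta \in \Theta_l$ do
            Update $\Theta_{l+1}$ accordingly by adding $\theta \pm w_l / 4$ to $\Theta_{l+1}$ if $\mathrm{CoS}(x, \theta, w_l)$ isn’t empty.
      end for
end for
return Non-max suppression on sources at $\Theta_L$.
Algorithm 1 Separation by Localization via Binary Search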
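The coarse-to-fine search can be sketched as follows. This is an illustration under assumptions rather than the authors' code: `cos_network` is a stand-in callable for the trained network, the window is halved exactly at every level, children are placed at the centers of the two halves of each active region, and "not empty" is approximated by an energy cutoff.

```python
def wrap(angle_deg):
    """Wrap an angle to the interval [-180, 180)."""
    return (angle_deg + 180.0) % 360.0 - 180.0


def binary_search(mixture, cos_network, initial_angles, initial_window,
                  num_levels, energy_cutoff):
    """Coarse-to-fine search; returns [(angle, separated_signal)] at the finest level.

    cos_network(mixture, angle, window) stands in for the trained network and
    must return a list of samples; halving the window per level is an assumption.
    """
    candidates, window, found = list(initial_angles), initial_window, []
    for level in range(num_levels):
        found = []
        for angle in candidates:
            output = cos_network(mixture, angle, window)
            if sum(v * v for v in output) > energy_cutoff:  # region is not silent
                found.append((angle, output))
        if level < num_levels - 1:
            # Split each active region into two halves of half the width.
            candidates = [wrap(a + s * window / 4) for a, _ in found for s in (-1, 1)]
            window /= 2
    return found
```

Silent regions are dropped at every level, so the number of network calls grows with the number of active regions rather than with the number of angular bins.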

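To avoid duplicate outputs from adjacent regions, we employ a non-maximum suppression step before outputting the final sources and locations. For this step, we consider both the angular proximity and similarity between the sources. If two outputted sources are physically close and have similar source content, we remove the one with the lower source energy. For example, for outputs $(\hat{s}_i, \theta_i)$ and $(\hat{s}_j, \theta_j)$ with $\lVert \hat{s}_i \rVert \ge \lVert \hat{s}_j \rVert$, we remove $(\hat{s}_j, \theta_j)$ if the angular distance between $\theta_i$ and $\theta_j$ is below a threshold and the similarity between $\hat{s}_i$ and $\hat{s}_j$ is above a threshold.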
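A greedy version of this suppression step might look like the sketch below. The similarity measure (cosine similarity on single-channel waveforms), the two thresholds, and all function names are assumptions chosen for illustration; the text above does not fix them.

```python
import math


def angle_diff(a_deg, b_deg):
    """Smallest absolute difference between two azimuths, in degrees."""
    return abs((a_deg - b_deg + 180.0) % 360.0 - 180.0)


def cosine_similarity(a, b):
    """Assumed similarity measure between two waveforms of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def non_max_suppression(detections, max_gap_deg, min_similarity):
    """detections: list of (angle_deg, signal). Keeps the higher-energy duplicate."""
    def energy(signal):
        return sum(v * v for v in signal)

    kept = []
    # Visit detections from loudest to quietest so the loudest copy survives.
    for angle, signal in sorted(detections, key=lambda d: energy(d[1]), reverse=True):
        is_duplicate = any(
            angle_diff(angle, k_angle) < max_gap_deg
            and cosine_similarity(signal, k_signal) > min_similarity
            for k_angle, k_signal in kept)
        if not is_duplicate:
            kept.append((angle, signal))
    return kept
```

Requiring both conditions matters: two different speakers standing close together are angularly near but dissimilar in content, so both are kept.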

3.3 Runtime Analysis

Suppose we have $N$ speakers and the angular space is discretized into $B$ angular bins. The binary search algorithm runs for at most $O(\log B)$ steps and requires at most $O(N)$ forward passes on every step. Thus, the total number of forward passes is $O(N \log B)$, while a linear sweep always runs in $O(B)$ forward passes.

In most cases, $N \ll B$, so the binary search is clearly superior. For instance, when operating at a $2^\circ$ resolution, the average number of forward passes our algorithm takes to separate 2 voices in the presence of background is , compared to 180 for a linear sweep. A forward pass of the network on a single GPU takes for a input waveform at , meaning that the binary search algorithm in this scenario could keep up with real-time while the linear search could not.
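To make the comparison concrete, the following back-of-the-envelope count gives an idealized upper bound on forward passes. It assumes exact halving of the window at each level, four initial quadrants, point sources that each keep at most one region active per level, and no background region; these are simplifications, so the figure is not the measured average referred to above. The function names are hypothetical.

```python
import math


def max_forward_passes(num_speakers, num_bins, initial_regions=4):
    """Idealized bound: one pass per initial region, then two children per speaker per level."""
    levels_after_first = math.ceil(math.log2(num_bins / initial_regions))
    return initial_regions + 2 * num_speakers * levels_after_first


def linear_sweep_passes(num_bins):
    """A linear sweep evaluates every angular bin once."""
    return num_bins
```

With 180 bins (a 2 degree resolution) and 2 speakers, each quadrant spans 45 bins, which takes 6 halvings to resolve, so the bound is 4 + 2 * 2 * 6 = 28 passes against 180 for the sweep.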

4 Experiments

In this section, we explain our synthetic dataset and manually collected real dataset. We show numerical results for separation and localization on the synthetic dataset and describe qualitative results on the real dataset.

4.1 Synthetic Dataset

Numerical results are demonstrated on synthetically rendered data. To generate the synthetic dataset, we create multi-speaker recordings in simulated environments with reverb and background noises. All voices come from the VCTK dataset Veaux et al. (2016), and the background samples consist of recordings from either noisy restaurant environments or loud music. The train and test splits are completely independent and there are no overlapping identities or samples. We chose VCTK over other widely used datasets like LibriSpeech Panayotov et al. (2015) and WSJ0 Garofalo et al. (2007) because VCTK is available at a high sampling rate of 48 kHz compared to 16 kHz as offered by others. In the supplementary materials, we show results and comparisons with lower sampling rates.

To synthesize a single example, we create a 3-second mixture at by randomly selecting $N$ speech samples and a background segment and placing them at arbitrary locations in a virtual room of a randomly chosen size. We then simulate room impulse responses (RIRs) using the image source method Allen and Berkley (1979) implemented in the pyroomacoustics library Scheibler et al. (2018). To approximate a diffuse-field background noise, the background source is placed further away, and the RIR for the background is generated with high-order images, causing indirect reflections off room walls Vorländer (2007). All signals are convolved with the corresponding RIRs and rendered to a 6-channel circular microphone array ($M = 6$) of radius (). The volumes of the sources are chosen randomly in order to create challenging scenarios; the input SDR is between and for most of the dataset. For training our network, we use 10,000 examples with $N$ chosen uniformly at random between 1 and 4, inclusive, and for evaluating we use 1,000 examples with $N$ dependent on the evaluation task.

4.2 Source Separation

To evaluate the source separation performance of our method, we create mixtures consisting of 2 voices ($N = 2$) and 1 background, allowing comparisons with deep learning methods that require a fixed number of foreground sources. We use the popular metric scale-invariant signal-to-distortion ratio (SI-SDR) Le Roux et al. (2019). When reporting the increase from the input to output SI-SDR, we use the label SI-SDR improvement (SI-SDRi). For deep learning baselines in the waveform domain we chose TAC Luo et al. (2020), a recently proposed neural beamformer, and a multi-channel extension of Conv-TasNet Luo and Mesgarani (2019), a popular speech separation network. For this multi-channel Conv-TasNet, we changed the number of input channels to match the number of microphones in order to process the full mixture. To compare with spectrogram based methods, we use oracle baselines based on the time-frequency representation like Ideal Binary Mask (IBM), Ideal Ratio Mask (IRM), and Multi-channel Wiener Filter (MWF). For more details on oracle baselines, please refer to Stöter et al. (2018). Table 1 and Figure 3 show the comparison between our proposed system and the baseline systems.
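For reference, SI-SDR and SI-SDRi can be computed as in the sketch below, which follows the standard definition: project the estimate onto the reference, then compare the energy of that projection with the energy of the residual. The small `EPS` constant and the function names are choices made for this example, and signals are plain single-channel lists.

```python
import math
from statistics import median

EPS = 1e-12  # guards against division by zero and log of zero


def si_sdr(estimate, reference):
    """Scale-invariant signal-to-distortion ratio in dB."""
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    scale = dot / (ref_energy + EPS)          # optimal scaling of the reference
    target = [scale * r for r in reference]
    residual_energy = sum((e - t) ** 2 for e, t in zip(estimate, target))
    target_energy = sum(t * t for t in target)
    return 10.0 * math.log10((target_energy + EPS) / (residual_energy + EPS))


def si_sdr_improvement(estimate, mixture, reference):
    """SI-SDRi: gain of the separated output over the unprocessed mixture."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)


def median_si_sdri(examples):
    """Median SI-SDRi over (estimate, mixture, reference) triples."""
    return median(si_sdr_improvement(e, m, r) for e, m, r in examples)
```

Because the reference is rescaled before comparison, multiplying the estimate by any nonzero gain leaves the score unchanged, which is what makes the metric scale-invariant.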

Notice that our method strongly outperforms the best possible results obtainable with spectrogram masking, and is slightly better than recent deep-learning baselines operating on the waveform domain. Furthermore, our network can accept explicitly known source locations (given by Ours-Oracle Location), allowing the separation performance to improve further when the source positions are given.

| Method | SI-SDRi (dB) |
|---|---|
| Conv-TasNet Luo and Mesgarani (2019) | 15.526 |
| TAC Luo et al. (2020) | 15.121 |
| Ours - Binary Search | 17.059 |
| Ours - Oracle Location | 17.636 |
| Oracle IBM Stöter et al. (2018); Wang (2005) | 13.359 |
| Oracle IRM Stöter et al. (2018); Liutkus and Badeau (2015) | 4.193 |
| Oracle MWF Stöter et al. (2018); Duong et al. (2010) | 8.405 |

Table 1: Separation Performance. Larger SI-SDRi is better. The SI-SDRi is computed by finding the median of SI-SDR increases from Figure 3.
Figure 3: (left) Input SI-SDR vs Output SI-SDR for waveform based methods. Some methods are not shown to improve the visibility.
Figure 4: (right) Evidence that the network amplifies voices within $\pm \frac{w}{2}$ of the target angle and suppresses all others.

4.3 Source Localization

To evaluate the localization performance of our method, we explore two variants of the same dataset in Section 4.2. The first set contains 2 voices and 1 background, exactly as in the previous section, and the second contains 2 voices with no background, a slightly easier variation. Here, we report the CDF curve of the angular error, i.e., the fraction of the test set below a given angle error.
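The error tolerance curve is simply the empirical CDF of wrapped angular errors. A minimal sketch, with hypothetical function names, is shown below; the wrap-around ensures that a prediction of -179 degrees for a source at 179 degrees counts as a 2 degree error rather than 358 degrees.

```python
def angular_error(pred_deg, true_deg):
    """Absolute azimuth error in degrees, wrapped to the range [0, 180]."""
    return abs((pred_deg - true_deg + 180.0) % 360.0 - 180.0)


def error_cdf(errors, tolerances):
    """Fraction of errors at or below each tolerance (one value per tolerance)."""
    return [sum(e <= tol for e in errors) / len(errors) for tol in tolerances]
```

Plotting `error_cdf(errors, tolerances)` against `tolerances` gives a curve of the kind described above; a better method rises to 1.0 at smaller tolerances.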

For baselines, we choose popular methods for direction of arrival estimation, both learning-based systems He et al. (2018) and learning-free systems Schmidt (1986); DiBiase (2000); Wang and Kaveh (1985); Di Claudio and Parisi (2001); Pan et al. (2017); Yoon et al. (2006). For the scenario with 2 voice sources and 1 background source, we let the learning-free baseline algorithms localize 3 sources and choose the 2 sources closest to the ground truth voice locations. This is a less strict evaluation that does not require the algorithm to distinguish between a voice and background source. For the learning-based method He et al. (2018), we retrained the network separately for each dataset in order to predict the 2 voice locations, even in the presence of a background source. Figure 5 shows the CDF plots for both scenarios.

Our method shows state-of-the-art performance in the simple scenario with 2 voices, but some baselines show similar performance to ours. However, when background noise is introduced, the gap between our method and the baselines increases greatly. Traditional methods struggle, even when evaluated less strictly than ours, and MLP-GCC He et al. (2018) cannot effectively learn to distinguish a voice location from background noise.

| Method | Median Angular Error (°), 2 Voices | Median Angular Error (°), 2 Voices + BG |
|---|---|---|
| MUSIC Schmidt (1986) | 82.5 | 36.8 |
| SRP-PHAT DiBiase (2000) | 6.2 | 46.4 |
| CSSM Wang and Kaveh (1985) | 30.1 | 36.3 |
| WAVES Di Claudio and Parisi (2001) | 16.4 | 32.1 |
| FRIDA Pan et al. (2017) | 6.9 | 18.5 |
| TOPS Yoon et al. (2006) | 2.4 | 11.5 |
| MLP-GCC He et al. (2018) | 1.0 | 41.5 |
| Ours | 2.1 | 3.7 |

Table 2: Localization Performance
Figure 5: Localization Performance: (Left) error tolerance curve on mixtures of 2 voices, (right) error tolerance curve on mixtures with 2 voices and 1 background.

4.4 Varying Number of Speakers

To show that our method generalizes to an arbitrary number of speakers, we evaluate separation and localization on mixtures containing up to 8 speakers with no background. We train the network with mixtures of 1 background and up to 4 voices and evaluate the separation results with median SI-SDRi and the localization performance with median angular error. For a given number of speakers $N$, we take the top $N$ outputs from the network and find the closest permutation between the outputs and ground truth. We report the results in Table 3. Notice that we are reporting results on scenarios where there are more speakers than seen during training.
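One way to find the closest permutation is a brute-force search over all assignments, sketched below. Matching on total angular error is an assumption of this example (the criterion could equally be based on signal similarity), and the function names are hypothetical. With at most 8 speakers there are 8! = 40320 permutations, so exhaustive search remains cheap.

```python
from itertools import permutations


def angular_error(pred_deg, true_deg):
    """Absolute azimuth error in degrees, wrapped to the range [0, 180]."""
    return abs((pred_deg - true_deg + 180.0) % 360.0 - 180.0)


def closest_permutation(pred_angles, true_angles):
    """For each ground-truth source, the index of its matched prediction.

    Assumes len(pred_angles) == len(true_angles) and minimizes total angular error.
    """
    def total_error(perm):
        return sum(angular_error(pred_angles[i], t)
                   for i, t in zip(perm, true_angles))

    return list(min(permutations(range(len(pred_angles))), key=total_error))
```

The returned indices can then be used to pair each separated output with its reference signal before computing SI-SDRi and angular error.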

We report the SI-SDRi and median angular error, together with the precision and recall of localizing the voices within of the ground truth when the algorithm has no information about the number of speakers. We remark that as the number of speakers increases, the recall drops as expected. The precision increases are due to the fact that there are fewer false positives when there are many speakers in the scene. The results suggest that our method generalizes and works even in scenarios with more speakers than seen in training.

| Number of Speakers $N$ | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|
| SI-SDRi (dB) | 13.9 | 13.2 | 12.2 | 10.8 | 9.1 | 7.2 | 6.3 |
| Median Angular Error (°) | 2.0 | 2.3 | 2.7 | 3.5 | 4.4 | 5.2 | 6.3 |
| Precision | 0.947 | 0.936 | 0.897 | 0.912 | 0.932 | 0.936 | 0.966 |
| Recall | 0.979 | 0.972 | 0.915 | 0.898 | 0.859 | 0.825 | 0.785 |

Table 3: Generalization to arbitrarily many speakers. We report the separation and localization performance as the number of speakers varies.

4.5 Results on Real Data and Moving Sources

Dataset: To show results on real world examples, we use the ReSpeaker Mic Array v2.0, which contains $M = 4$ microphones in a circle of radius (). Although a network trained purely on synthetic data works well, we find that it is useful to fine-tune with data captured by the microphone. To do this we recorded VCTK samples played over a speaker from known locations, and also recorded a variety of background sounds played over a speaker. We then created mixtures of this real recorded data and jointly re-trained with real and fully synthetic data. Complete details of this capture process are described in the supplementary materials.

Results: In the supplementary videos we explore a variety of real world scenarios. These include multiple people talking concurrently and multiple people talking while moving. For example, we show that we can separate people on different phone calls or 2 speakers walking around a table. To separate moving sources, we stop the algorithm at a coarser window size () and use inputs corresponding to seconds of audio. With these parameters, we find that it is possible to handle substantial movement because the angular window size captures each source for the duration of the input. We then concatenate sources that are in adjacent regions from one time step to the next. Because our real captured data does not have precise ground truth positions or perfectly clean source signals, numerical results are not as reliable as in the synthetic experiments. However, we have included some numerical results on real data in the supplementary materials.

4.6 Limitations

There are several limitations of our method. One limitation is that we must reduce the angular resolution to support moving sources. This is in contrast to specific speaker tracking methods that can localize moving sources to a greater precision Traa and Smaragdis (2013); Qian et al. (2017). Another limitation is that in the minimal two-microphone case, our approach is susceptible to front-back confusion. This is an ambiguity that can be resolved with binaural methods that leverage information like HRTFs Keyrouz (2017); Ma et al. (2017). A final limitation is that we assume the microphone array is rotationally symmetric. For ad-hoc microphone arrays, our pre-shift method would still allow for separation from a known position. However, the angular window size $w$ would have to be learned dependent on $\theta$, making the binary search approach more difficult.

5 Conclusion

In this work, we introduced a novel method for joint localization and separation of audio sources in the waveform domain. Experimental results showed state-of-the-art results, and the ability to generalize to an arbitrary number of speakers, including more than seen during training. We described how to create a network that separates sources within a specific angular region, and how to use that network for a binary search approach to separation and localization. Examples on real world data also show that our proposed method is applicable to real-life scenarios. Our work also has the potential to be extended beyond speech to perform separation and localization of arbitrary sound types.


The authors thank the labmates from UW GRAIL lab. This work was supported by the UW Reality Lab, Facebook, Google, Futurewei, and Amazon.

Broader Impact Statement

We believe that our method has the potential to help people hear better in a variety of everyday scenarios. This work could be integrated with headphones, hearing aids, smart home devices, or laptops, to facilitate source separation and localization. Our localization output also provides a more privacy-friendly alternative to camera based detection for applications like robotics or optical tracking. We note that improved ability to separate speakers in noisy environments comes with potential privacy concerns. For example, this method could be used to better hear a conversation at a nearby table in a restaurant. Tracking speakers with microphone input also presents a similar range of privacy concerns as camera based tracking and recognition in everyday environments.

Supplementary Materials

Figure 6: Overview of Separation by Localization. Similar to the overview figure in the main paper. This figure is color-blind friendly, black and white printing friendly, and photocopy friendly.

Appendix A Hyperparameters and Training Details

Rendering Parameters For the simulated scenes, the origin of the scene is centered on the microphone array. The foregrounds are placed randomly between 1 and 5 meters away from the microphone array while the background is placed between 10 and 20 meters away. The walls of a virtual rectangular room are chosen between 15 and 20 meters away, expanding as necessary so the background is also within the room. The reverb absorption rate of the foreground is randomly chosen between 0.1 and 0.99, while the absorption rate of the background is chosen between 0.5 and 0.99.

Training Parameters We use a learning rate of and initialized our training from the pretrained single-channel Demucs weights. We use ADAM optimizer [31] for training the network with the following parameters: , , and . We found that training on our spatial dataset converged after roughly 20 epochs.

Data Augmentation As an additional data augmentation step we make the following perturbations to the data: Gaussian noise is added with a standard deviation of , and high-shelf and low-shelf gain of up to are randomly added using the sox library.

Appendix B Real Dataset

Data Collection In order to fine-tune the network on real data, we played samples over a speaker and recorded the signals with the real microphone array. Approximately 3 hours of VCTK samples were played from a QSC K8 loudspeaker in a quiet room with the speaker volume set approximately to the volume of a human voice. The loudspeaker was placed at carefully measured positions between 1 and 4 meters away from the microphone array. We used azimuth angles in $30^\circ$ increments for a total of 12 different positions. The elevation angle was roughly the same as the microphone array. We maintained the train and test splits of the VCTK dataset to avoid overlapping identities. Because we could not record true diffuse background noise, we played various background noises over the loudspeaker such as music or recorded restaurant sounds. With these recorded samples, we could create mixtures with access to the ground truth voice samples. We found that jointly training with real and synthetic mixtures gave the best performance.

Numerical Results. Because we did not have access to a true acoustic chamber, the ground truth samples and positions are not as reliable for evaluation as the fully synthetic data. However, we report separation results on mixtures of 2 voices and 1 background from the test set of real recorded data in Table 4. This, along with the qualitative samples, provides evidence that our method can generalize to real environments. We note that the oracle baselines outperform our method and the other waveform-based baselines because they have access to the ground-truth utterances. Among the non-oracle baselines, our method performs best.

Method              Median SI-SDRi (dB)
Ours                8.885
TAC [35]            8.427
Conv-TasNet [37]    6.497
Oracle IBM          9.220
Oracle IRM          10.327
Oracle MWF          9.925
Table 4: Separation performance on the real dataset
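
For readers unfamiliar with the metric in Table 4, the sketch below computes scale-invariant SDR following the standard definition of Le Roux et al. [32], and the improvement (SI-SDRi) as the difference between the score of the separated output and that of the unprocessed mixture. It assumes single-channel 1-D signals and zero-mean normalization; the function names, the `eps` stabilizer, and the random stand-in signals are illustrative choices and not the authors' evaluation code.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB for 1-D signals."""
    estimate = np.asarray(estimate, dtype=float)
    reference = np.asarray(reference, dtype=float)
    # Remove the mean so that a constant offset does not affect the score.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to obtain the scaled target.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    residual = estimate - target
    ratio = (np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)
    return 10.0 * np.log10(ratio)


def si_sdr_improvement(estimate, mixture, reference):
    """SI-SDRi: gain of the separated output over the unprocessed mixture."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)


# Random stand-ins: the "separated" output keeps 10% of the interference.
rng = np.random.default_rng(0)
reference = rng.standard_normal(16000)
interference = rng.standard_normal(16000)
mixture = reference + interference
estimate = reference + 0.1 * interference
gain = si_sdr_improvement(estimate, mixture, reference)
```

In this toy setup, attenuating the interference by a factor of 10 should give a gain of about 20 dB, and rescaling the estimate leaves the score unchanged.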

Appendix C Sample waveforms and spectrograms

In this section, we show sample waveforms of an input mixture and the voices separated by our method. The input mixture contains two voices and one background, and we show the separation results in two domains: waveform (Figure 7) and time-frequency spectrogram (Figure 8). Although the output closely matches the ground truth, several differences are visible. As Figure 8 illustrates, the network struggles in regions where the voice's energy is low. Additionally, the network can create artifacts in the high-frequency regions, which is why a simple denoising step or low-pass filter is often helpful.

More example audio files are provided in the zip files.

Figure 7: An example of separation on an input mixture containing 2 voices and 1 background. (top) the input mixture, (center and bottom) the separated voices.
Figure 8: An example of separation results with 2 voices and 1 background. (top) the spectrogram of the input mixture, (left) the spectrograms of the network outputs, (right) the spectrograms of the ground-truth reference voice signals.

Appendix D Sampling Rate

We show the effect of lowering the sample rate on both separation and localization in Table 5. We remark that our separation quality is worse at lower sample rates, showing that our model takes advantage of the higher sample rate.

                        Sampling Rate (kHz)
Method                  44.1      16
Separation: Median SI-SDRi (dB)
Ours - Binary Search    17.059    14.132
Ours - Oracle Position  17.636    14.468
TAC [35]                15.104    13.613
Conv-TasNet [37]        15.526    15.559
Oracle IBM              13.359    13.611
Oracle IRM              4.193     4.289
Oracle MWF              8.405     8.893
Localization: Median Angular Error
Ours - 2 Voices 1 BG
Ours - 2 Voices No BG
Table 5: Separation and localization performances on datasets with different sampling rates
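
One way to see what the lower rate discards: a signal sampled at a given rate can only represent content below half that rate. The short sketch below computes this limit for the two rates in Table 5, along with the reduced rational factor a polyphase resampler would need to convert between them. It is illustrative arithmetic only; the function names are hypothetical and it does not reproduce how the authors resampled their data.

```python
from fractions import Fraction


def nyquist_hz(sample_rate_hz):
    """Highest frequency representable at the given sample rate."""
    return sample_rate_hz / 2


def resample_ratio(src_hz, dst_hz):
    """Reduced (up, down) integer factors for converting src_hz to dst_hz."""
    ratio = Fraction(dst_hz, src_hz)
    return ratio.numerator, ratio.denominator


high_limit = nyquist_hz(44100)   # bandwidth available at 44.1 kHz
low_limit = nyquist_hz(16000)    # bandwidth available at 16 kHz
up, down = resample_ratio(44100, 16000)
```

At 16 kHz, everything above 8 kHz is lost, whereas 44.1 kHz retains content up to 22.05 kHz.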




  1. Note: \url Cited by: §1.
  2. Note: \url Cited by: §4.5.
  3. S. Adavanne, A. Politis, J. Nikunen and T. Virtanen (2018) Sound event localization and detection of overlapping sources using convolutional recurrent neural networks. IEEE Journal of Selected Topics in Signal Processing 13 (1), pp. 34–48. Cited by: §2.
  4. H. Adel, M. Souad, A. Alaqeeli and A. Hamid (2012) Beamforming techniques for multichannel audio signal separation. arXiv preprint arXiv:1212.6080. Cited by: §2.
  5. J. B. Allen and D. A. Berkley (1979) Image method for efficiently simulating small-room acoustics. The Journal of the Acoustical Society of America 65 (4), pp. 943–950. Cited by: §4.1.
  6. F. Asano and H. Asoh (2004) Sound source localization and separation based on the em algorithm. In ISCA Tutorial and Research Workshop (ITRW) on Statistical and Perceptual Audio Processing, Cited by: §2.
  7. L. Benaroya, F. Bimbot and R. Gribonval (2005) Audio source separation with a single sensor. IEEE Transactions on Audio, Speech, and Language Processing 14 (1), pp. 191–199. Cited by: §2.
  8. N. Bertin, R. Badeau and E. Vincent (2010) Enforcing harmonicity and smoothness in bayesian non-negative matrix factorization applied to polyphonic music transcription. IEEE Transactions on Audio, Speech, and Language Processing 18 (3), pp. 538–549. Cited by: §2.
  9. J. Cardoso (1998) Blind signal separation: statistical principles. Proceedings of the IEEE 86 (10), pp. 2009–2025. Cited by: §2.
  10. Z. Chen, X. Xiao, T. Yoshioka, H. Erdogan, J. Li and Y. Gong (2018) Multi-channel overlapped speech recognition with location guided speech extraction network. In 2018 IEEE Spoken Language Technology Workshop (SLT), pp. 558–565. Cited by: §2.
  11. A. Défossez, N. Usunier, L. Bottou and F. Bach (2019) Demucs: deep extractor for music sources with extra unlabeled data remixed. arXiv preprint arXiv:1909.01174. Cited by: §3.1.1.
  12. A. Deleforge, F. Forbes and R. Horaud (2015) Acoustic space learning for sound-source separation and localization on binaural manifolds. International journal of neural systems 25 (01), pp. 1440003. Cited by: §2.
  13. E. D. Di Claudio and R. Parisi (2001) WAVES: weighted average of signal subspaces for robust wideband direction finding. IEEE Transactions on Signal Processing 49 (10), pp. 2179–2191. Cited by: §2, §4.3, Table 2.
  14. J. H. DiBiase (2000) A high-accuracy, low-latency technique for talker localization in reverberant environments using microphone arrays. Brown University Providence, RI. Cited by: §2, §4.3, Table 2.
  15. Y. Dorfan, D. Cherkassky and S. Gannot (2015) Speaker localization and separation using incremental distributed expectation-maximization. In 2015 23rd European Signal Processing Conference (EUSIPCO), pp. 1256–1260. Cited by: §2.
  16. N. Q. Duong, E. Vincent and R. Gribonval (2010) Under-determined reverberant audio source separation using a full-rank spatial covariance model. IEEE Transactions on Audio, Speech, and Language Processing 18 (7), pp. 1830–1840. Cited by: Table 1.
  17. J. Garofalo, D. Graff, D. Paul and D. Pallett (2007) Csr-i (wsj0) complete. Linguistic Data Consortium, Philadelphia. Cited by: §4.1.
  18. F. Grondin and J. Glass (2019) Multiple sound source localization with svd-phat. arXiv preprint arXiv:1906.11913. Cited by: §2.
  19. R. Gu, S. Zhang, L. Chen, Y. Xu, M. Yu, D. Su, Y. Zou and D. Yu (2020) Enhancing end-to-end multi-channel speech separation via spatial feature learning. arXiv preprint arXiv:2003.03927. Cited by: §2.
  20. T. Halperin, A. Ephrat and Y. Hoshen (2018) Neural separation of observed and unobserved distributions. arXiv preprint arXiv:1811.12739. Cited by: §2.
  21. C. Han, Y. Luo and N. Mesgarani (2020) Real-time binaural speech separation with preserved spatial cues. arXiv preprint arXiv:2002.06637. Cited by: §2.
  22. W. He, P. Motlicek and J. Odobez (2018) Deep neural networks for multiple speaker detection and localization. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 74–79. Cited by: §2, §4.3, §4.3, Table 2.
  23. J. R. Hershey, Z. Chen, J. Le Roux and S. Watanabe (2016) Deep clustering: discriminative embeddings for segmentation and separation. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 31–35. Cited by: §2.
  24. T. Higuchi, K. Kinoshita, M. Delcroix, K. Zmolíková and T. Nakatani (2017) Deep clustering-based beamforming for separation with unknown number of sources. In Interspeech, pp. 1183–1187. Cited by: §2.
  25. K. Itakura, Y. Bando, E. Nakamura, K. Itoyama, K. Yoshii and T. Kawahara (2018) Bayesian multichannel audio source separation based on integrated source and spatial models. IEEE/ACM Transactions on Audio, Speech, and Language Processing 26 (4), pp. 831–846. Cited by: §2.
  26. A. Jansson, E. Humphrey, N. Montecchio, R. Bittner, A. Kumar and T. Weyde (2017) Singing voice separation with deep u-net convolutional networks. Cited by: §2, §3.1.1.
  27. V. Jayaram and J. Thickstun (2020) Source separation with deep generative priors. arXiv preprint arXiv:2002.07942. Cited by: §2.
  28. D. Johnson, D. Gorelik, R. E. Mawhorter, K. Suver, W. Gu, S. Xing, C. Gabriel and P. Sankhagowit (2018) Latent gaussian activity propagation: using smoothness and structure to separate and localize sounds in large noisy environments. In Advances in Neural Information Processing Systems, pp. 3465–3474. Cited by: §2.
  29. D. H. Johnson and D. E. Dudgeon (1992) Array signal processing: concepts and techniques. Simon & Schuster, Inc., USA. External Links: ISBN 0130485136 Cited by: §3.1.2.
  30. F. Keyrouz (2017) Robotic binaural localization and separation of multiple simultaneous sound sources. In 2017 IEEE 11th International Conference on Semantic Computing (ICSC), pp. 188–195. Cited by: §4.6.
  31. D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: Appendix A.
  32. J. Le Roux, S. Wisdom, H. Erdogan and J. R. Hershey (2019) SDR–half-baked or well done?. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 626–630. Cited by: §4.2.
  33. A. Liutkus and R. Badeau (2015) Generalized wiener filtering with fractional power spectrograms. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 266–270. Cited by: Table 1.
  34. Y. Luo, Z. Chen, J. R. Hershey, J. Le Roux and N. Mesgarani (2017) Deep clustering and conventional networks for music separation: stronger together. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 61–65. Cited by: §2.
  35. Y. Luo, Z. Chen, N. Mesgarani and T. Yoshioka (2020) End-to-end microphone permutation and number invariant multi-channel speech separation. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6394–6398. Cited by: Table 4, Table 5, §2, §4.2, Table 1.
  36. Y. Luo and N. Mesgarani (2018) Tasnet: time-domain audio separation network for real-time, single-channel speech separation. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 696–700. Cited by: §2, §2.
  37. Y. Luo and N. Mesgarani (2019) Conv-tasnet: surpassing ideal time–frequency magnitude masking for speech separation. IEEE/ACM transactions on audio, speech, and language processing 27 (8), pp. 1256–1266. Cited by: Table 4, Table 5, §2, §4.2, Table 1.
  38. N. Ma, T. May and G. J. Brown (2017) Exploiting deep neural networks and head movements for robust binaural localization of multiple sources in reverberant environments. IEEE/ACM Transactions on Audio, Speech, and Language Processing 25 (12), pp. 2444–2453. Cited by: §4.6.
  39. M. I. Mandel, D. P. Ellis and T. Jebara (2007) An em algorithm for localizing multiple sound sources in reverberant environments. In Advances in neural information processing systems, pp. 953–960. Cited by: §2.
  40. M. I. Mandel, R. J. Weiss and D. P. Ellis (2009) Model-based expectation-maximization source separation and localization. IEEE Transactions on Audio, Speech, and Language Processing 18 (2), pp. 382–394. Cited by: §2.
  41. N. Mohammadiha, P. Smaragdis and A. Leijon (2013) Supervised and unsupervised speech enhancement using nonnegative matrix factorization. IEEE Transactions on Audio, Speech, and Language Processing 21 (10), pp. 2140–2151. Cited by: §2.
  42. E. Nachmani, Y. Adi and L. Wolf (2020) Voice separation with an unknown number of multiple speakers. arXiv preprint arXiv:2003.01531. Cited by: §2.
  43. O. Nadiri and B. Rafaely (2014) Localization of multiple speakers under high reverberation using a spherical microphone array and the direct-path dominance test. IEEE/ACM Transactions on Audio, Speech, and Language Processing 22 (10), pp. 1494–1505. Cited by: §2.
  44. F. Nesta, P. Svaizer and M. Omologo (2010) Convolutive bss of short mixtures by ica recursively regularized across frequencies. IEEE transactions on audio, speech, and language processing 19 (3), pp. 624–639. Cited by: §2.
  45. A. A. Nugraha, A. Liutkus and E. Vincent (2016) Multichannel audio source separation with deep neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing 24 (9), pp. 1652–1664. Cited by: §2.
  46. A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior and K. Kavukcuoglu (2016) Wavenet: a generative model for raw audio. arXiv preprint arXiv:1609.03499. Cited by: §3.1.3.
  47. H. Pan, R. Scheibler, E. Bezzam, I. Dokmanić and M. Vetterli (2017) FRIDA: fri-based doa estimation for arbitrary array layouts. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3186–3190. Cited by: §2, §4.3, Table 2.
  48. V. Panayotov, G. Chen, D. Povey and S. Khudanpur (2015) Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210. Cited by: §4.1.
  49. D. Pavlidi, A. Griffin, M. Puigt and A. Mouchtaris (2013) Real-time multiple sound source localization and counting using a circular microphone array. IEEE Transactions on Audio, Speech, and Language Processing 21 (10), pp. 2193–2206. Cited by: §2.
  50. X. Qian, A. Brutti, M. Omologo and A. Cavallaro (2017) 3D audio-visual speaker tracking with an adaptive particle filter. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2896–2900. Cited by: §4.6.
  51. B. Raj, T. Virtanen, S. Chaudhuri and R. Singh (2010) Non-negative matrix factorization based compensation of music for automatic speech recognition. In Eleventh Annual Conference of the International Speech Communication Association, Cited by: §2.
  52. A. Rouditchenko, H. Zhao, C. Gan, J. McDermott and A. Torralba (2019) Self-supervised audio-visual co-segmentation. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2357–2361. Cited by: §2.
  53. H. Sawada, S. Araki and S. Makino (2010) Underdetermined convolutive blind source separation via frequency bin-wise clustering and permutation alignment. IEEE Transactions on Audio, Speech, and Language Processing 19 (3), pp. 516–527. Cited by: §2.
  54. R. Scheibler, E. Bezzam and I. Dokmanić (2018) Pyroomacoustics: a python package for audio room simulation and array processing algorithms. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 351–355. Cited by: §4.1.
  55. R. Schmidt (1986) Multiple emitter location and signal parameter estimation. IEEE transactions on antennas and propagation 34 (3), pp. 276–280. Cited by: §2, §4.3, Table 2.
  56. D. Stoller, S. Ewert and S. Dixon (2018) Wave-u-net: a multi-scale neural network for end-to-end audio source separation. arXiv preprint arXiv:1806.03185. Cited by: §2.
  57. F. Stöter, A. Liutkus and N. Ito (2018) The 2018 signal separation evaluation campaign. In International Conference on Latent Variable Analysis and Signal Separation, pp. 293–305. Cited by: §4.2, Table 1.
  58. N. Takahashi, S. Parthasaarathy, N. Goswami and Y. Mitsufuji (2019) Recursive speech separation for unknown number of speakers. arXiv preprint arXiv:1904.03065. Cited by: §2.
  59. J. Traa and P. Smaragdis (2013) A wrapped kalman filter for azimuthal speaker tracking. IEEE Signal Processing Letters 20 (12), pp. 1257–1260. Cited by: §4.6.
  60. J. Traa and P. Smaragdis (2014) Multichannel source separation and tracking with ransac and directional statistics. IEEE/ACM Transactions on Audio, Speech, and Language Processing 22 (12), pp. 2233–2243. Cited by: §2.
  61. J. Traa, P. Smaragdis, N. D. Stein and D. Wingate (2015) Directional nmf for joint source localization and separation. In 2015 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5. Cited by: §2.
  62. E. Tzinis, S. Venkataramani and P. Smaragdis (2019) Unsupervised deep clustering for source separation: direct learning from mixtures using spatial information. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 81–85. Cited by: §2, §2.
  63. J. Valin, F. Michaud, J. Rouat and D. Létourneau (2003) Robust sound source localization using a microphone array on a mobile robot. In Proceedings 2003 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2003)(Cat. No. 03CH37453), Vol. 2, pp. 1228–1233. Cited by: §3.1.2.
  64. C. Veaux, J. Yamagishi and K. MacDonald (2016) Superseded-cstr vctk corpus: english multi-speaker corpus for cstr voice cloning toolkit. Cited by: §4.1.
  65. M. Vorländer (2007) Auralization: fundamentals of acoustics, modelling, simulation, algorithms and acoustic virtual reality. Springer Science & Business Media. Cited by: §4.1.
  66. D. Wang (2005) On ideal binary mask as the computational goal of auditory scene analysis. In Speech separation by humans and machines, pp. 181–197. Cited by: Table 1.
  67. H. Wang and M. Kaveh (1985) Coherent signal-subspace processing for the detection and estimation of angles of arrival of multiple wide-band sources. IEEE Transactions on Acoustics, Speech, and Signal Processing 33 (4), pp. 823–831. Cited by: §2, §4.3, Table 2.
  68. C. Weng, D. Yu, M. L. Seltzer and J. Droppo (2015) Deep neural networks for single-channel multi-talker speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 23 (10), pp. 1670–1679. Cited by: §2.
  69. C. Xu, W. Rao, X. Xiao, E. S. Chng and H. Li (2018) Single channel speech separation with constrained utterance level permutation invariant training using grid lstm. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6–10. Cited by: §2.
  70. Y. Yoon, L. M. Kaplan and J. H. McClellan (2006) TOPS: new doa estimator for wideband signals. IEEE Transactions on Signal processing 54 (6), pp. 1977–1989. Cited by: §2, §4.3, Table 2.
  71. T. Yoshioka, H. Erdogan, Z. Chen and F. Alleva (2018) Multi-microphone neural speech separation for far-field multi-talker speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5739–5743. Cited by: §2.
  72. X. Zhang and D. Wang (2017) Deep learning based binaural speech separation in reverberant environments. IEEE/ACM transactions on audio, speech, and language processing 25 (5), pp. 1075–1084. Cited by: §2.
  73. H. Zhao, C. Gan, A. Rouditchenko, C. Vondrick, J. McDermott and A. Torralba (2018) The sound of pixels. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 570–586. Cited by: §2.