The Cone of Silence: Speech Separation by Localization
Given a multi-microphone recording of an unknown number of speakers talking concurrently, we simultaneously localize the sources and separate the individual speakers. At the core of our method is a deep network, in the waveform domain, which isolates sources within an angular region θ ± w/2, given an angle of interest θ and angular window size w. By exponentially decreasing w, we can perform a binary search to localize and separate all sources in logarithmic time. Our algorithm allows for an arbitrary number of potentially moving speakers at test time, including more speakers than seen during training. Experiments demonstrate state-of-the-art performance for both source separation and source localization, particularly in high levels of background noise.
The ability of humans to separate and localize sounds in noisy environments is a remarkable phenomenon known as the “cocktail party effect.” However, our natural ability only goes so far – we may still have trouble hearing a conversation partner in a noisy restaurant or during a call with other speakers in the background. One can imagine future earbuds or hearing aids that selectively cancel audio sources that you don’t want to listen to. As a step towards this goal, we introduce a deep neural network technique that can be steered to any direction at run time, cancelling all audio sources outside a specified angular window, a.k.a. a cone of silence (CoS) [1].
But how do you know what direction to listen to? We further show that this directionally sensitive CoS network can be used as a building block to yield simple yet powerful solutions to 1) sound localization, and 2) audio source separation. Our experimental evaluation demonstrates state-of-the-art performance in both domains. Furthermore, our ability to handle an unknown number of potentially moving sound sources, combined with fast performance, represents an additional step forward in generality. Audio demos can be found at our project website.
We are particularly motivated by the recent increase of multi-microphone devices in everyday settings. This includes headphones, hearing aids, smart home devices, and many laptops. Indeed, most of these devices already employ directional sensitivity, both in the design of the individual microphones and in the way they are combined together. In practice, however, this directional sensitivity is limited to either being hard tuned to a fixed range of directions (e.g., cardioid), or providing only limited attenuation of audio outside that range (e.g., beamforming). In contrast, our CoS approach enables true cancellation of audio sources outside an angular window that can be specified (and instantly changed) in software.
Our approach uses a novel deep network that can separate sources in the waveform domain within any angular region θ ± w/2, parameterized by a direction of interest θ and angular window size w. For simplicity, we focus only on azimuth angles, but the method could equally be applied to elevation as well. By exponentially decreasing w, we perform a binary search to separate and localize all sources in logarithmic time (Figure 1). Unlike many traditional methods that perform direction-based separation, we can also ignore background source types, such as music or ambient noise. Qualitative and quantitative results show state-of-the-art performance and direct applicability to a wide variety of real world scenarios. Our key contribution is a logarithmic-time algorithm for simultaneous localization and separation of speakers, particularly in high levels of noise, allowing for an arbitrary number of speakers at test time, including more speakers than seen during training. We strongly encourage the reader to view our supplementary results for a demo of our method and audio results.
2 Related Work
Source separation has seen tremendous progress in recent years, particularly with the increasing popularity of learning methods, which improve over traditional methods such as Cardoso (1998); Nesta et al. (2010). In particular, unsupervised source modeling methods train a model for each source type and apply the model to the mixture for separation by using methods like NMF Raj et al. (2010); Mohammadiha et al. (2013), clustering Tzinis et al. (2019); Sawada et al. (2010); Luo et al. (2017), or Bayesian methods Itakura et al. (2018); Jayaram and Thickstun (2020); Benaroya et al. (2005). Supervised source modeling methods train a model for each source from annotated isolated signals of each source type, e.g., pitch information for music Bertin et al. (2010). Separation based training methods like Halperin et al. (2018); Jansson et al. (2017); Nugraha et al. (2016) employ deep neural networks to learn source separation from mixtures given the ground truth signals as training data, also known as the mix-and-separate framework.
Recent trends include the move to operating directly on waveforms Stoller et al. (2018); Luo and Mesgarani (2018, 2019), yielding performance improvements over frequency-domain spectrogram techniques such as Hershey et al. (2016); Xu et al. (2018); Weng et al. (2015); Tzinis et al. (2019); Yoshioka et al. (2018); Chen et al. (2018); Zhang and Wang (2017). A second trend is increasing the number of microphones, as methods based on multi-channel microphone arrays Yoshioka et al. (2018); Chen et al. (2018); Gu et al. (2020) and binaural recordings Zhang and Wang (2017); Han et al. (2020) perform better than single-channel source separation techniques. Combined audio-visual techniques Zhao et al. (2018); Rouditchenko et al. (2019) have also shown promise.
Sound localization is a second active research area, often posed as direction of arrival (DOA) estimation Grondin and Glass (2019); Nadiri and Rafaely (2014); Pavlidi et al. (2013). Popular methods include beamforming DiBiase (2000), subspace methods Schmidt (1986); Wang and Kaveh (1985); Di Claudio and Parisi (2001); Yoon et al. (2006), and sampling-based methods Pan et al. (2017). A recent trend is the use of deep neural networks for multi-source DOA estimation, e.g., He et al. (2018); Adavanne et al. (2018).
One key challenge is that the number of speakers in real world scenarios is often unknown or non-constant. Many methods require a priori knowledge about the number of sources, e.g., Luo and Mesgarani (2018); Luo et al. (2020). Recent deep learning methods that address separation with an unknown number of speakers include Higuchi et al. (2017), Takahashi et al. (2019), and Nachmani et al. (2020). However, these methods use an additional model to help predict the number of speakers, and Nachmani et al. (2020) further uses a different separation model for different numbers of speakers.
Although DOA provides a practical approach for source separation, methods that take this approach suffer from another shortcoming: the direction of interest needs to be known in advance Adel et al. (2012). Without a known DOA for each source, these methods must perform a linear sweep of the entire angular space, which is computationally infeasible for state-of-the-art deep networks at fine-grained angular resolutions.
Some prior work has addressed joint localization and separation. For example, Traa and Smaragdis (2014); Mandel et al. (2009); Asano and Asoh (2004); Dorfan et al. (2015); Deleforge et al. (2015); Mandel et al. (2007) use expectation maximization to iteratively localize and separate sources. Traa et al. (2015) uses the idea of Directional NMF, while Johnson et al. (2018) poses separation and localization as a Bayesian inference problem based on inter-microphone phase differences. Our method improves on these approaches by combining deep learning in the waveform domain with efficient search.
3 Method

In this section we describe our Cone of Silence network for angle-based separation. The target angle θ and window size w are handled independently: separation at angle θ is handled entirely by a pre-shift step, while an additional network input is used to produce the window of size w. We also describe how to use the network for separation by localization via binary search.
Problem Formulation: Given a known-configuration microphone array with M microphones, the problem of M-channel source separation and localization can be formulated as estimating the N sources s_1, …, s_N and their corresponding angular positions θ_1, …, θ_N from an M-channel discrete waveform x of length T containing the mixture, where

x(t) = Σ_{i=1}^{N} s_i(t) + s_bg(t).

Here s_bg represents the background signal, which could be a point source like music or diffuse-field background noise without any specific location.
In this paper we explore circular microphone arrays, but we also describe possible modifications to support linear arrays. The center of our coordinate system is always the center of the microphone array, and the angular position of each source, θ_i, is defined based on this coordinate system. In the problem formulation we assume the sources are stationary, but we describe how to handle potentially moving sources in Section 4.5. In addition, we only focus on separation and localization by azimuth angle, meaning that we assume the sources have roughly the same elevation angle. As we show in the experimental section, this assumption is valid for most real world scenarios.
3.1 Cone of Silence Network (CoS)
We propose a network that performs source separation given an angle of interest θ and an angular window size w. The network is tasked with separating speech coming only from azimuthal directions between θ − w/2 and θ + w/2 and disregarding speech coming from other directions. In the following sections we describe how to create a network with this property. Figure 2 shows our proposed network architecture. θ and w are encoded in a shifted input x_θ and a one-hot vector g(w) as described in Section 3.1.2 and Section 3.1.3, respectively.
3.1.1 Network Architecture

Our CoS network is adapted from the Demucs architecture Défossez et al. (2019), a music separation network, which is similar to the Wave-U-Net architecture Stoller et al. (2018). We extend the original Demucs network to our problem formulation by modifying the number of input and output channels to match the number of microphones.
There are several reasons why this base architecture is well suited for our task. As mentioned in Section 2, networks that operate on the raw waveform have recently been shown to outperform spectrogram based methods. In addition, Demucs was specifically designed to work at sampling rates as high as 44.1 kHz, while many speech separation networks operate at rates as low as 8 kHz. Although human speech can be represented at lower sampling rates, we find that operating at higher sampling rates is beneficial for capturing small time differences of arrival between the microphones. This would also allow our method to be extended to high resolution source types, like music, where a high sampling rate is necessary.
3.1.2 Target Angle

In order to make the network output sources from a specific target angle θ, we use a shifted mixture x_θ based on θ. We found that this worked better than trying to directly condition the network on both θ and w. x_θ is created as follows: by calculating the time difference of arrival at each microphone, we shift each channel in the original signal x such that signals coming from angle θ are temporally aligned across all channels in x_θ.
We use the fact that the time differences of arrival (TDOA) between the microphones are primarily based on the azimuthal angle for a far-field source. This assumption is valid when the sources are roughly on the same plane as the microphone array Valin et al. (2003). Let c be the speed of sound, f_s be our sampling rate, p_θ be the position of a far-field source at angle θ, m_i be the position of the i-th microphone, and d(·, ·) be the Euclidean distance. The TDOA in samples for the source to reach the i-th microphone, relative to the canonical microphone m_0, is

τ_i(θ) = (f_s / c) · (d(p_θ, m_i) − d(p_θ, m_0)).
In our experiments, we chose microphone m_0 as our canonical microphone, meaning that τ_0(θ) = 0 and all other channels are shifted to align with channel 0.
The shift is a 1-D shift operation with one-sided zero padding. This idea is similar to the first step of a Delay and Sum Beamformer Johnson and Dudgeon (1992). We then train the network to output sources which are temporally aligned in x_θ while ignoring all others. For M > 2 microphones in a circular configuration, these shifts are unique for a specific angle θ, so sources from other angles will not be temporally aligned. If M = 2 or the mic array is linear, then sources at angles θ and −θ have the same per-channel shifts, leading to front-back confusion.
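The pre-shift can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: the array radius, microphone count, sampling rate, and far-field distance below are placeholder values, and shifts are rounded to whole samples.

```python
import numpy as np

C = 343.0    # speed of sound (m/s)
FS = 44100   # sampling rate (Hz); illustrative assumption

def mic_positions(n_mics=6, radius=0.045):
    """Positions of a circular array centered at the origin (assumed geometry)."""
    angles = 2 * np.pi * np.arange(n_mics) / n_mics
    return np.stack([radius * np.cos(angles), radius * np.sin(angles)], axis=1)

def tdoa_shifts(theta, mics, far=10.0):
    """Per-channel TDOA in samples, relative to canonical microphone 0,
    for a far-field source at azimuth theta."""
    src = far * np.array([np.cos(theta), np.sin(theta)])  # distant point at angle theta
    delays = np.linalg.norm(mics - src, axis=1) * FS / C  # arrival times in samples
    return np.round(delays - delays[0]).astype(int)

def shift_mixture(x, theta, mics):
    """Shift each channel (with zero padding) so a source at `theta` is
    temporally aligned across all channels."""
    shifts = tdoa_shifts(theta, mics)
    out = np.zeros_like(x)
    T = x.shape[1]
    for i, s in enumerate(shifts):
        if s > 0:
            out[i, : T - s] = x[i, s:]   # advance channels that lag mic 0
        elif s < 0:
            out[i, -s:] = x[i, : T + s]  # delay channels that lead mic 0
        else:
            out[i] = x[i]
    return out
```

After this shift, a source at the target angle lines up sample-for-sample across channels, while sources at other angles remain misaligned.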
3.1.3 Angular Window Size
Although the network trained on shifted inputs as described in Section 3.1.2 can produce accurate output for a given angle θ, it requires prior knowledge about the target angle of each source of interest. In addition, real sources are not perfect point sources and have finite width, especially in the presence of slight movements.
To solve these problems, we introduce a second variable, an angular window size w, which is passed as a global conditioning parameter to the network. This angular window size facilitates the application of the network in a fast binary search. It also allows the localization and separation of moving sources: by using a larger angular window size and a smaller temporal input waveform, it is possible to localize and separate moving sources within that window.
Motivated by the global conditioning framework in WaveNet Oord et al. (2016), we use a one-hot encoded variable g(w) to represent the different window sizes. In our experiments, we use g(w) of size 5, corresponding to the five window sizes used in the binary search. By passing g(w) to the network together with our shifted input x_θ, we can explicitly make the network separate sources from the region (θ − w/2, θ + w/2). We embed g(w) into every encoder and decoder block in the network using a learned linear projection, as shown in Figure 2: each encoder block applies a 1-D convolution and each decoder block a transposed convolution, and the projected conditioning vector is added to the block's activations. Empirically, we found that passing g(w) to every encoder and decoder block worked significantly better than passing it to the network only once. Evidence that the CoS network learns the desired window size is presented in Figure 4.
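The global-conditioning idea can be illustrated with a toy numpy block. This is a sketch of the mechanism only, not the paper's architecture: the channel count, kernel size, and random "learned" parameters are placeholders, and the projected one-hot vector is simply added as a time-invariant bias before the nonlinearity.

```python
import numpy as np

rng = np.random.default_rng(0)

WINDOW_SIZES = 5   # |g(w)|: one-hot over the window sizes used in the search
CHANNELS = 8       # channels of a toy encoder block

# "Learned" parameters (random here): conv weights and the per-block
# linear projection of the one-hot window code g(w).
conv_w = rng.standard_normal((CHANNELS, CHANNELS, 3)) * 0.1
proj = rng.standard_normal((CHANNELS, WINDOW_SIZES)) * 0.1

def encoder_block(x, g):
    """Toy encoder step: 1-D conv over time, plus a learned linear projection
    of g(w) broadcast over all time steps (global conditioning), then ReLU."""
    C, T = x.shape
    y = np.zeros((CHANNELS, T))
    pad = np.pad(x, ((0, 0), (1, 1)))            # same-length output
    for t in range(T):
        y[:, t] = np.einsum('ock,ck->o', conv_w, pad[:, t:t + 3])
    y += (proj @ g)[:, None]                     # same conditioning bias everywhere
    return np.maximum(y, 0.0)

g = np.zeros(WINDOW_SIZES)
g[2] = 1.0                                       # select the third window size
out = encoder_block(rng.standard_normal((CHANNELS, 200)), g)
```

Because the projection is added at every block and every time step, the window size influences the whole network rather than a single input layer, which matches the empirical finding above.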
Consider an input mixture x of N sources s_1, …, s_N with corresponding locations θ_1, …, θ_N, along with a target angle θ and window size w. The network f is trained with the following objective function:

L(θ, w) = ‖ f(x_θ, g(w)) − Σ_{i=1}^{N} 1[θ_i ∈ (θ − w/2, θ + w/2)] · s_i^θ ‖,

where x_θ and s_i^θ are the shifted signals of the input mixture and the i-th ground truth source, as described in Section 3.1.2, based on the target angle θ. f(x_θ, g(w)) is the output of the network given the shifted signal and the angular window w. 1[θ_i ∈ (θ − w/2, θ + w/2)] is an indicator function for whether source i is present in the region. If no source is present in the region, the training target is a zero tensor 0.
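Constructing the training target amounts to summing the pre-shifted ground-truth sources whose azimuth falls inside the window, and zeros otherwise. A minimal sketch (angles in radians; helper names are illustrative):

```python
import numpy as np

def angular_distance(a, b):
    """Smallest absolute difference between two azimuths, in radians."""
    d = (a - b) % (2 * np.pi)
    return min(d, 2 * np.pi - d)

def training_target(shifted_sources, source_angles, theta, w):
    """Target for the CoS loss: the sum of the (pre-shifted) ground-truth
    sources whose azimuth lies inside (theta - w/2, theta + w/2);
    a zero tensor when the window is empty."""
    target = np.zeros_like(shifted_sources[0])
    for s, ang in zip(shifted_sources, source_angles):
        if angular_distance(ang, theta) < w / 2:
            target += s
    return target
```

The zero target for empty windows is what lets the binary search below discard source-free regions by output energy alone.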
3.2 Localization and Separation via Binary Search
By starting with a large window size and decreasing it exponentially, we can perform a binary search over the angular space in logarithmic time while separating the sources simultaneously. More concretely, we start with an initial window size w_0 = 90°, initial target angles at the four quadrant centers, and our observed M-channel mixture x. In the first pass, we run the network once per quadrant; this is the quadrant-based separation illustrated in Step 1 of Figure 1. Because regions without sources produce near-empty outputs, we can discard large angular regions early on with a simple energy cutoff. We then recurse with a halved window size, taking as new candidate regions the two halves of each region whose output had high energy. We continue recursing with smaller window sizes until reaching the desired resolution. The complete algorithm is shown in Figure 1.
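The search loop can be sketched as follows, with a stub standing in for the CoS network. The 90° initial window matches the quadrant step above, but the energy cutoff and stopping resolution are illustrative assumptions, and `run_network` is a hypothetical callable, not the paper's model.

```python
import numpy as np

def cos_binary_search(mixture, run_network, w0=np.pi / 2,
                      w_min=np.deg2rad(2), energy_cutoff=1e-3):
    """Separation by localization. `run_network(mixture, theta, w)` must return
    the separated waveform for the window (theta - w/2, theta + w/2). Regions
    whose output energy falls below the cutoff are discarded; surviving regions
    are split in half until the window reaches the target resolution."""
    # Windows tiling the full circle (quadrants for w0 = 90 degrees).
    candidates = [w0 / 2 + k * w0 for k in range(int(round(2 * np.pi / w0)))]
    w = w0
    while True:
        survivors = []
        for theta in candidates:
            out = run_network(mixture, theta, w)
            if np.mean(out ** 2) > energy_cutoff:   # region contains a source
                survivors.append((theta, out))
        if w / 2 < w_min:
            return survivors                        # (angle, waveform) pairs
        w /= 2                                      # halve the window and
        candidates = [t + sgn * w / 2               # split each survivor in two
                      for t, _ in survivors for sgn in (-1, 1)]
```

Each child window exactly tiles half of its parent, so a stationary point source survives along a single path down the search tree.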
To avoid duplicate outputs from adjacent regions, we employ a non-maximum suppression step before outputting the final sources and locations. For this step, we consider both the angular proximity and the content similarity of the outputs: if two outputted sources are physically close and have similar source content, we remove the one with the lower energy. Concretely, for outputs (s_i, θ_i) and (s_j, θ_j) with s_i having higher energy, we remove s_j if |θ_i − θ_j| is below an angular threshold and the similarity between s_i and s_j exceeds a similarity threshold.
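A greedy version of this suppression step is easy to write down. The thresholds and the cosine-similarity measure below are illustrative choices, not the paper's values:

```python
import numpy as np

def nms(detections, angle_thresh, sim_thresh):
    """Greedy non-maximum suppression over (waveform, angle) detections:
    among pairs that are angularly close AND have highly correlated content,
    keep only the higher-energy one."""
    order = sorted(detections, key=lambda d: -np.sum(d[0] ** 2))  # energy, desc.
    kept = []
    for s, theta in order:
        def duplicates(k):
            ks, kt = k
            d = abs(theta - kt) % (2 * np.pi)
            d = min(d, 2 * np.pi - d)                  # wrap-around angle distance
            sim = np.dot(s, ks) / (np.linalg.norm(s) * np.linalg.norm(ks) + 1e-8)
            return d < angle_thresh and sim > sim_thresh
        if not any(duplicates(k) for k in kept):
            kept.append((s, theta))
    return kept
```

Requiring both conditions matters: two distinct speakers can stand at nearby angles (similar angle, dissimilar content), and reverberation can leak similar content into distant windows (similar content, dissimilar angle); neither case should be suppressed.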
3.3 Runtime Analysis
Suppose we have K speakers and the angular space is discretized into B angular bins. The binary search algorithm runs for at most O(log B) steps and requires at most O(K) forward passes on every step. Thus, the total number of forward passes is O(K log B), while a linear sweep always requires B forward passes.
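A rough upper bound on the pass count makes the comparison concrete. This sketch assumes the search starts from 4 quadrant windows and that each of at most K surviving regions spawns 2 children per halving; the bound is ours, not a figure from the paper.

```python
import math

def binary_search_passes(K, B, w0_bins=4):
    """Upper bound on CoS forward passes: one pass per initial window (the 4
    quadrants), then at most 2K passes per halving step until the w0_bins
    initial windows are refined down to single bins."""
    steps = math.ceil(math.log2(B / w0_bins))  # halvings from quadrants to 1 bin
    return w0_bins + steps * 2 * K

K, B = 2, 180  # 2 speakers, 2-degree resolution over 360 degrees
print(binary_search_passes(K, B), "forward passes vs", B, "for a linear sweep")
```

For two speakers at 2° resolution, the bound is 28 passes against 180 for the sweep, and the gap widens as the resolution grows.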
In most cases, K log B is much smaller than B, so the binary search is clearly superior. For instance, when operating at a 2° resolution (B = 180), our algorithm takes far fewer forward passes on average to separate 2 voices in the presence of background than the 180 required by a linear sweep. A forward pass of the network on a single GPU is fast enough that the binary search in this scenario could keep up with real-time while the linear search could not.
4 Experiments

In this section, we explain our synthetic dataset and manually collected real dataset. We show numerical results for separation and localization on the synthetic dataset and describe qualitative results on the real dataset.
4.1 Synthetic Dataset
Numerical results are demonstrated on synthetically rendered data. To generate the synthetic dataset, we create multi-speaker recordings in simulated environments with reverb and background noises. All voices come from the VCTK dataset Veaux et al. (2016), and the background samples consist of recordings from either noisy restaurant environments or loud music. The train and test splits are completely independent, with no overlapping identities or samples. We chose VCTK over other widely used datasets like LibriSpeech Panayotov et al. (2015) and WSJ0 Garofalo et al. (2007) because VCTK is available at a high sampling rate of 48 kHz, compared to the 16 kHz offered by the others. In the supplementary materials, we show results and comparisons with lower sampling rates.
To synthesize a single example, we create a 3-second mixture by randomly selecting N speech samples and a background segment and placing them at arbitrary locations in a virtual room of randomly chosen size. We then simulate room impulse responses (RIRs) using the image source method Allen and Berkley (1979) implemented in the pyroomacoustics library Scheibler et al. (2018). To approximate diffuse-field background noise, the background source is placed farther away, and its RIR is generated with high-order images, producing indirect reflections off the room walls Vorländer (2007). All signals are convolved with the corresponding RIRs and rendered to a 6-channel circular microphone array. The volumes of the sources are chosen randomly in order to create challenging scenarios with low input SDR. For training our network, we use 10,000 examples with N chosen uniformly at random between 1 and 4, inclusive; for evaluation we use 1,000 examples with N dependent on the evaluation task.
4.2 Source Separation
To evaluate the source separation performance of our method, we create mixtures consisting of 2 voices (N = 2) and 1 background, allowing comparisons with deep learning methods that require a fixed number of foreground sources. We use the popular metric scale-invariant signal-to-distortion ratio (SI-SDR) Le Roux et al. (2019). When reporting the increase from the input to the output SI-SDR, we use the label SI-SDR improvement (SI-SDRi). For deep learning baselines in the waveform domain, we chose TAC Luo et al. (2020), a recently proposed neural beamformer, and a multi-channel extension of Conv-TasNet Luo and Mesgarani (2019), a popular speech separation network. For this multi-channel Conv-TasNet, we changed the number of input channels to match the number of microphones in order to process the full mixture. To compare with spectrogram based methods, we use oracle baselines based on time-frequency representations, namely the Ideal Binary Mask (IBM), the Ideal Ratio Mask (IRM), and the Multi-channel Wiener Filter (MWF). For more details on oracle baselines, please refer to Stöter et al. (2018). Table 1 and Figure 3 show the comparison between our proposed system and the baseline systems.
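The SI-SDR metric used throughout this section is straightforward to compute. A minimal numpy version, following the definition in Le Roux et al. (2019) (mean-removed signals, estimate projected onto the reference):

```python
import numpy as np

def si_sdr(est, ref):
    """Scale-invariant SDR in dB: project the estimate onto the reference to
    obtain the optimally scaled target, then compare target vs. residual energy."""
    est = est - est.mean()
    ref = ref - ref.mean()
    target = (np.dot(est, ref) / np.dot(ref, ref)) * ref  # scaled reference
    noise = est - target                                  # everything else
    return 10 * np.log10(np.sum(target ** 2) / np.sum(noise ** 2))
```

Because of the projection, rescaling the estimate leaves the score unchanged, which is exactly what makes the metric robust to the arbitrary output gain of a separation network.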
Notice that our method strongly outperforms the best possible results obtainable with spectrogram masking and is slightly better than recent deep-learning baselines operating in the waveform domain. Furthermore, our network can accept explicitly known source locations (Ours - Oracle Location), which further improves separation performance.
| Method | Median SI-SDRi (dB) |
| --- | --- |
| Conv-TasNet Luo and Mesgarani (2019) | 15.526 |
| TAC Luo et al. (2020) | 15.121 |
| Ours - Binary Search | 17.059 |
| Ours - Oracle Location | 17.636 |
| Oracle IBM Stöter et al. (2018); Wang (2005) | 13.359 |
| Oracle IRM Stöter et al. (2018); Liutkus and Badeau (2015) | 4.193 |
| Oracle MWF Stöter et al. (2018); Duong et al. (2010) | 8.405 |
4.3 Source Localization
To evaluate the localization performance of our method, we explore two variants of the same dataset in Section 4.2. The first set contains 2 voices and 1 background, exactly as in the previous section, and the second contains 2 voices with no background, a slightly easier variation. Here, we report the CDF curve of the angular error, i.e., the fraction of the test set below a given angle error.
For baselines, we choose popular methods for direction of arrival estimation, both learning-based systems He et al. (2018) and learning-free systems Schmidt (1986); DiBiase (2000); Wang and Kaveh (1985); Di Claudio and Parisi (2001); Pan et al. (2017); Yoon et al. (2006). For the scenario with 2 voice sources and 1 background source, we let the learning-free baseline algorithms localize 3 sources and choose the 2 sources closest to the ground truth voice locations. This is a less strict evaluation that does not require the algorithm to distinguish between a voice and background source. For the learning-based method He et al. (2018), we retrained the network separately for each dataset in order to predict the 2 voice locations, even in the presence of a background source. Figure 5 shows the CDF plots for both scenarios.
Our method shows state-of-the-art performance in the simple scenario with 2 voices, although some baselines come close. However, when background noise is introduced, the gap between our method and the baselines increases greatly. Traditional methods struggle, even under their less strict evaluation, and MLP-GCC He et al. (2018) cannot effectively learn to distinguish a voice location from background noise.
| Method | Median Angular Error (°), 2 Voices | Median Angular Error (°), 2 Voices + BG |
| --- | --- | --- |
| MUSIC Schmidt (1986) | 82.5 | 36.8 |
| SRP-PHAT DiBiase (2000) | 6.2 | 46.4 |
| CSSM Wang and Kaveh (1985) | 30.1 | 36.3 |
| WAVES Di Claudio and Parisi (2001) | 16.4 | 32.1 |
| FRIDA Pan et al. (2017) | 6.9 | 18.5 |
| TOPS Yoon et al. (2006) | 2.4 | 11.5 |
| MLP-GCC He et al. (2018) | 1.0 | 41.5 |
4.4 Varying Number of Speakers
To show that our method generalizes to an arbitrary number of speakers, we evaluate separation and localization on mixtures containing up to 8 speakers with no background. We train the network with mixtures of 1 background and up to 4 voices and evaluate the separation results with median SI-SDRi and the localization performance with median angular error. For a given number of speakers N, we take the top N outputs from the network and find the closest permutation between the outputs and the ground truth. We report the results in Table 3. Notice that we are reporting results on scenarios with more speakers than seen during training.
We report the SI-SDRi and median angular error, together with the precision and recall of localizing the voices within a fixed angular tolerance of the ground truth when the algorithm has no information about the number of speakers. We remark that as the number of speakers increases, the recall drops as expected. Precision increases because there are fewer false positives when many speakers are in the scene. The results suggest that our method generalizes and works even in scenarios with more speakers than seen in training.
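Computing precision and recall here requires matching predicted angles to ground-truth angles one-to-one before thresholding. A brute-force sketch (the 5° tolerance is an illustrative default, not necessarily the paper's value):

```python
import itertools

def localization_pr(pred, truth, tol_deg=5.0):
    """Precision/recall of predicted azimuths (degrees) against ground truth
    under the best one-to-one assignment, counting a hit when the matched
    pair is within tol_deg (with wrap-around)."""
    if not pred or not truth:
        return 0.0, 0.0

    def err(a, b):
        d = abs(a - b) % 360.0
        return min(d, 360.0 - d)

    best = 0
    if len(pred) <= len(truth):
        # Assign every prediction to a distinct ground-truth angle.
        for perm in itertools.permutations(range(len(truth)), len(pred)):
            best = max(best, sum(err(p, truth[j]) <= tol_deg
                                 for p, j in zip(pred, perm)))
    else:
        # More predictions than truths: assign every truth a distinct prediction.
        for perm in itertools.permutations(range(len(pred)), len(truth)):
            best = max(best, sum(err(pred[j], t) <= tol_deg
                                 for j, t in zip(perm, truth)))
    return best / len(pred), best / len(truth)
```

Brute force is fine at these scales (at most 8 speakers); a production version would use a linear-assignment solver instead.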
| Number of Speakers | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Median Angular Error (°) | 2.0 | 2.3 | 2.7 | 3.5 | 4.4 | 5.2 | 6.3 |
4.5 Results on Real Data and Moving Sources
Dataset: To show results on real world examples, we use the ReSpeaker Mic Array v2.0 [2], which contains 4 microphones arranged in a circle. Although a network trained purely on synthetic data works well, we find it useful to fine-tune on data captured by the real microphone array. To do this, we recorded VCTK samples played over a speaker from known locations, and also recorded a variety of background sounds played over a speaker. We then created mixtures of this real recorded data and jointly re-trained with real and fully synthetic data. Complete details of this capture process are described in the supplementary materials.
We show qualitative results on real data, including moving sources, in the supplementary videos.
There are several limitations of our method. One limitation is that we must reduce the angular resolution to support moving sources. This is in contrast to dedicated speaker tracking methods that can localize moving sources to a greater precision Traa and Smaragdis (2013); Qian et al. (2017). Another limitation is that in the minimal two-microphone case, our approach is susceptible to front-back confusion. This ambiguity can be resolved with binaural methods that leverage information like HRTFs Keyrouz (2017); Ma et al. (2017). A final limitation is that we assume the microphone array is rotationally symmetric. For ad-hoc microphone arrays, our pre-shift method would still allow for separation from a known position; however, the angular window size would have to be learned dependent on θ, making the binary search approach more difficult.
5 Conclusion

In this work, we introduced a novel method for joint localization and separation of audio sources in the waveform domain. Experiments demonstrated state-of-the-art performance and the ability to generalize to an arbitrary number of speakers, including more than seen during training. We described how to create a network that separates sources within a specific angular region, and how to use that network in a binary search for separation and localization. Examples on real world data show that our proposed method is applicable to real-life scenarios. Our work also has the potential to be extended beyond speech to perform separation and localization of arbitrary sound types.
The authors thank their labmates from the UW GRAIL lab. This work was supported by the UW Reality Lab, Facebook, Google, Futurewei, and Amazon.
Broader Impact Statement
We believe that our method has the potential to help people hear better in a variety of everyday scenarios. This work could be integrated with headphones, hearing aids, smart home devices, or laptops, to facilitate source separation and localization. Our localization output also provides a more privacy-friendly alternative to camera based detection for applications like robotics or optical tracking. We note that improved ability to separate speakers in noisy environments comes with potential privacy concerns. For example, this method could be used to better hear a conversation at a nearby table in a restaurant. Tracking speakers with microphone input also presents a similar range of privacy concerns as camera based tracking and recognition in everyday environments.
Appendix A Hyperparameters and Training Details
Rendering Parameters For the simulated scenes, the origin of the scene is centered on the microphone array. The foregrounds are placed randomly between 1 and 5 meters away from the microphone array while the background is placed between 10 and 20 meters away. The walls of a virtual rectangular room are chosen between 15 and 20 meters away, expanding as necessary so the background is also within the room. The reverb absorption rate of the foreground is randomly chosen between 0.1 and 0.99, while the absorption rate of the background is chosen between 0.5 and 0.99.
Training Parameters We initialized our training from the pretrained single-channel Demucs weights and use the ADAM optimizer. We found that training on our spatial dataset converged after roughly 20 epochs.
Data Augmentation As an additional data augmentation step, we make the following perturbations to the data: Gaussian noise is added with a small standard deviation, and random high-shelf and low-shelf gains are applied using the sox library.
Appendix B Real Dataset
Data Collection In order to fine-tune the network on real data, we played samples over a speaker and recorded the signals with the real microphone array. Approximately 3 hours of VCTK samples were played from a QSC K8 loudspeaker in a quiet room, with the speaker volume set approximately to the volume of a human voice. The loudspeaker was placed at carefully measured positions between 1 and 4 meters away from the microphone array. We used azimuth angles in 30° increments for a total of 12 different positions. The elevation angle was roughly the same as the microphone array. We maintained the train and test splits of the VCTK dataset to avoid overlapping identities. Because we could not record true diffuse background noise, we played various background noises over the loudspeaker, such as music or recorded restaurant sounds. With these recorded samples, we could create mixtures with access to the ground truth voice samples. We found that jointly training with real and synthetic mixtures gave the best performance.
Numerical Results Because we did not have access to a true acoustic chamber, the ground truth samples and positions are not as reliable for evaluation as the fully synthetic data. However, we report separation results on mixtures of 2 voices and 1 background from the test set of real recorded data in Table 4. This, along with the qualitative samples, shows evidence that our method generalizes to real environments. We note that the oracle baselines outperform our method and the other waveform-based baselines because they have access to the ground-truth utterances; among non-oracle methods, ours performs best.
Appendix C Sample waveforms and spectrograms
In this section, we show sample waveforms of an input mixture and separated voices using our method. The input mixture contains two voices and one background, and we show an example of separation results in two different domains: waveform (Figure 7) and time-frequency spectrogram (Figure 8). Although the output closely matches the ground truth, we can see several differences. As illustrated by Figure 8, we observe that the network struggles in regions where the voice’s energy is low. Additionally, we find that the network can create artifacts in the high-frequency regions, which is why a simple denoising step or low-pass filter is often helpful.
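One crude way to implement such a post-filter is to zero out high-frequency FFT bins. This sketch is an illustration of the kind of filter meant above, not the paper's post-processing; the cutoff frequency is an assumption, and a production system would prefer a proper filter design to avoid ringing.

```python
import numpy as np

def fft_lowpass(x, fs, cutoff_hz):
    """Brick-wall low-pass: zero all frequency bins above cutoff_hz.
    Useful as a quick suppressor of high-frequency separation artifacts."""
    X = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    X[freqs > cutoff_hz] = 0.0
    return np.fft.irfft(X, n=len(x))
```

Applied per output channel, this removes the high-frequency artifact band while leaving the speech band untouched.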
More example audio files are provided in the zip files.
Appendix D Sampling Rate
We show the effect of lowering the sample rate on both separation and localization in Table 5. We remark that our separation quality is worse at lower sample rates, showing that our model takes advantage of the higher sample rate.
Separation: Median SI-SDRi (dB)

| Method | Full sample rate | Reduced sample rate |
| --- | --- | --- |
| Ours - Binary Search | 17.059 | 14.132 |
| Ours - Oracle Position | 17.636 | 14.468 |

Localization: Median Angular Error (°)

| Method | Full sample rate | Reduced sample rate |
| --- | --- | --- |
| Ours - 2 Voices 1 BG | | |
| Ours - 2 Voices No BG | | |
[1] https://en.wikipedia.org/wiki/Cone_of_Silence_(Get_Smart)
[2] https://wiki.seeedstudio.com/ReSpeaker_Mic_Array_v2.0/

References

- Adavanne et al. (2018) Sound event localization and detection of overlapping sources using convolutional recurrent neural networks. IEEE Journal of Selected Topics in Signal Processing 13 (1), pp. 34–48.
- Adel et al. (2012) Beamforming techniques for multichannel audio signal separation. arXiv preprint arXiv:1212.6080.
- Allen and Berkley (1979) Image method for efficiently simulating small-room acoustics. The Journal of the Acoustical Society of America 65 (4), pp. 943–950.
- Asano and Asoh (2004) Sound source localization and separation based on the EM algorithm. In ISCA Tutorial and Research Workshop (ITRW) on Statistical and Perceptual Audio Processing.
- Benaroya et al. (2005) Audio source separation with a single sensor. IEEE Transactions on Audio, Speech, and Language Processing 14 (1), pp. 191–199.
- Bertin et al. (2010) Enforcing harmonicity and smoothness in Bayesian non-negative matrix factorization applied to polyphonic music transcription. IEEE Transactions on Audio, Speech, and Language Processing 18 (3), pp. 538–549.
- Cardoso (1998) Blind signal separation: statistical principles. Proceedings of the IEEE 86 (10), pp. 2009–2025.
- Chen et al. (2018) Multi-channel overlapped speech recognition with location guided speech extraction network. In 2018 IEEE Spoken Language Technology Workshop (SLT), pp. 558–565.
- Défossez et al. (2019) Demucs: deep extractor for music sources with extra unlabeled data remixed. arXiv preprint arXiv:1909.01174.
- Deleforge et al. (2015) Acoustic space learning for sound-source separation and localization on binaural manifolds. International Journal of Neural Systems 25 (01), pp. 1440003.
- Di Claudio and Parisi (2001) WAVES: weighted average of signal subspaces for robust wideband direction finding. IEEE Transactions on Signal Processing 49 (10), pp. 2179–2191.
- DiBiase (2000) A high-accuracy, low-latency technique for talker localization in reverberant environments using microphone arrays. Brown University, Providence, RI.
- Dorfan et al. (2015) Speaker localization and separation using incremental distributed expectation-maximization. In 2015 23rd European Signal Processing Conference (EUSIPCO), pp. 1256–1260.
- Duong et al. (2010) Under-determined reverberant audio source separation using a full-rank spatial covariance model. IEEE Transactions on Audio, Speech, and Language Processing 18 (7), pp. 1830–1840.
- Garofalo et al. (2007) CSR-I (WSJ0) complete. Linguistic Data Consortium, Philadelphia.
- Grondin and Glass (2019) Multiple sound source localization with SVD-PHAT. arXiv preprint arXiv:1906.11913.
- Gu et al. (2020) Enhancing end-to-end multi-channel speech separation via spatial feature learning. arXiv preprint arXiv:2003.03927.
- (2018) Neural separation of observed and unobserved distributions. arXiv preprint arXiv:1811.12739. Cited by: §2.
- (2020) Real-time binaural speech separation with preserved spatial cues. arXiv preprint arXiv:2002.06637. Cited by: §2.
- (2018) Deep neural networks for multiple speaker detection and localization. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 74–79. Cited by: §2, §4.3, §4.3, Table 2.
- (2016) Deep clustering: discriminative embeddings for segmentation and separation. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 31–35. Cited by: §2.
- (2017) Deep clustering-based beamforming for separation with unknown number of sources.. In Interspeech, pp. 1183–1187. Cited by: §2.
- (2018) Bayesian multichannel audio source separation based on integrated source and spatial models. IEEE/ACM Transactions on Audio, Speech, and Language Processing 26 (4), pp. 831–846. Cited by: §2.
- (2017) Singing voice separation with deep u-net convolutional networks. Cited by: §2, §3.1.1.
- (2020) Source separation with deep generative priors. arXiv preprint arXiv:2002.07942. Cited by: §2.
- (2018) Latent gaussian activity propagation: using smoothness and structure to separate and localize sounds in large noisy environments. In Advances in Neural Information Processing Systems, pp. 3465–3474. Cited by: §2.
- (1992) Array signal processing: concepts and techniques. Simon & Schuster, Inc., USA. Cited by: §3.1.2.
- (2017) Robotic binaural localization and separation of multiple simultaneous sound sources. In 2017 IEEE 11th International Conference on Semantic Computing (ICSC), pp. 188–195. Cited by: §4.6.
- (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: Appendix A.
- (2019) SDR–half-baked or well done?. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 626–630. Cited by: §4.2.
- (2015) Generalized wiener filtering with fractional power spectrograms. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 266–270. Cited by: Table 1.
- (2017) Deep clustering and conventional networks for music separation: stronger together. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 61–65. Cited by: §2.
- (2020) End-to-end microphone permutation and number invariant multi-channel speech separation. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6394–6398. Cited by: Table 4, Table 5, §2, §4.2, Table 1.
- (2018) Tasnet: time-domain audio separation network for real-time, single-channel speech separation. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 696–700. Cited by: §2, §2.
- (2019) Conv-tasnet: surpassing ideal time–frequency magnitude masking for speech separation. IEEE/ACM transactions on audio, speech, and language processing 27 (8), pp. 1256–1266. Cited by: Table 4, Table 5, §2, §4.2, Table 1.
- (2017) Exploiting deep neural networks and head movements for robust binaural localization of multiple sources in reverberant environments. IEEE/ACM Transactions on Audio, Speech, and Language Processing 25 (12), pp. 2444–2453. Cited by: §4.6.
- (2007) An em algorithm for localizing multiple sound sources in reverberant environments. In Advances in neural information processing systems, pp. 953–960. Cited by: §2.
- (2009) Model-based expectation-maximization source separation and localization. IEEE Transactions on Audio, Speech, and Language Processing 18 (2), pp. 382–394. Cited by: §2.
- (2013) Supervised and unsupervised speech enhancement using nonnegative matrix factorization. IEEE Transactions on Audio, Speech, and Language Processing 21 (10), pp. 2140–2151. Cited by: §2.
- (2020) Voice separation with an unknown number of multiple speakers. arXiv preprint arXiv:2003.01531. Cited by: §2.
- (2014) Localization of multiple speakers under high reverberation using a spherical microphone array and the direct-path dominance test. IEEE/ACM Transactions on Audio, Speech, and Language Processing 22 (10), pp. 1494–1505. Cited by: §2.
- (2010) Convolutive bss of short mixtures by ica recursively regularized across frequencies. IEEE transactions on audio, speech, and language processing 19 (3), pp. 624–639. Cited by: §2.
- (2016) Multichannel audio source separation with deep neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing 24 (9), pp. 1652–1664. Cited by: §2.
- (2016) Wavenet: a generative model for raw audio. arXiv preprint arXiv:1609.03499. Cited by: §3.1.3.
- (2017) FRIDA: fri-based doa estimation for arbitrary array layouts. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3186–3190. Cited by: §2, §4.3, Table 2.
- (2015) Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210. Cited by: §4.1.
- (2013) Real-time multiple sound source localization and counting using a circular microphone array. IEEE Transactions on Audio, Speech, and Language Processing 21 (10), pp. 2193–2206. Cited by: §2.
- (2017) 3D audio-visual speaker tracking with an adaptive particle filter. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2896–2900. Cited by: §4.6.
- (2010) Non-negative matrix factorization based compensation of music for automatic speech recognition. In Eleventh Annual Conference of the International Speech Communication Association, Cited by: §2.
- (2019) Self-supervised audio-visual co-segmentation. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2357–2361. Cited by: §2.
- (2010) Underdetermined convolutive blind source separation via frequency bin-wise clustering and permutation alignment. IEEE Transactions on Audio, Speech, and Language Processing 19 (3), pp. 516–527. Cited by: §2.
- (2018) Pyroomacoustics: a python package for audio room simulation and array processing algorithms. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 351–355. Cited by: §4.1.
- (1986) Multiple emitter location and signal parameter estimation. IEEE transactions on antennas and propagation 34 (3), pp. 276–280. Cited by: §2, §4.3, Table 2.
- (2018) Wave-u-net: a multi-scale neural network for end-to-end audio source separation. arXiv preprint arXiv:1806.03185. Cited by: §2.
- (2018) The 2018 signal separation evaluation campaign. In International Conference on Latent Variable Analysis and Signal Separation, pp. 293–305. Cited by: §4.2, Table 1.
- (2019) Recursive speech separation for unknown number of speakers. arXiv preprint arXiv:1904.03065. Cited by: §2.
- (2013) A wrapped kalman filter for azimuthal speaker tracking. IEEE Signal Processing Letters 20 (12), pp. 1257–1260. Cited by: §4.6.
- (2014) Multichannel source separation and tracking with ransac and directional statistics. IEEE/ACM Transactions on Audio, Speech, and Language Processing 22 (12), pp. 2233–2243. Cited by: §2.
- (2015) Directional nmf for joint source localization and separation. In 2015 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5. Cited by: §2.
- (2019) Unsupervised deep clustering for source separation: direct learning from mixtures using spatial information. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 81–85. Cited by: §2, §2.
- (2003) Robust sound source localization using a microphone array on a mobile robot. In Proceedings 2003 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2003)(Cat. No. 03CH37453), Vol. 2, pp. 1228–1233. Cited by: §3.1.2.
- (2016) Superseded-cstr vctk corpus: english multi-speaker corpus for cstr voice cloning toolkit. Cited by: §4.1.
- (2007) Auralization: fundamentals of acoustics, modelling, simulation, algorithms and acoustic virtual reality. Springer Science & Business Media. Cited by: §4.1.
- (2005) On ideal binary mask as the computational goal of auditory scene analysis. In Speech separation by humans and machines, pp. 181–197. Cited by: Table 1.
- (1985) Coherent signal-subspace processing for the detection and estimation of angles of arrival of multiple wide-band sources. IEEE Transactions on Acoustics, Speech, and Signal Processing 33 (4), pp. 823–831. Cited by: §2, §4.3, Table 2.
- (2015) Deep neural networks for single-channel multi-talker speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 23 (10), pp. 1670–1679. Cited by: §2.
- (2018) Single channel speech separation with constrained utterance level permutation invariant training using grid lstm. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6–10. Cited by: §2.
- (2006) TOPS: new doa estimator for wideband signals. IEEE Transactions on Signal processing 54 (6), pp. 1977–1989. Cited by: §2, §4.3, Table 2.
- (2018) Multi-microphone neural speech separation for far-field multi-talker speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5739–5743. Cited by: §2.
- (2017) Deep learning based binaural speech separation in reverberant environments. IEEE/ACM transactions on audio, speech, and language processing 25 (5), pp. 1075–1084. Cited by: §2.
- (2018) The sound of pixels. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 570–586. Cited by: §2.