
Multi-task U-Net for Music Source Separation 1

Abstract

A fairly straightforward approach to music source separation is to train independent models, wherein each model is dedicated to estimating only a specific source. Training a single model to estimate multiple sources generally does not perform as well as the independent dedicated models. However, Conditioned U-Net (C-U-Net) uses a control mechanism to train a single model for multi-source separation and attempts to achieve a performance comparable to that of the dedicated models. We propose a multi-task U-Net (M-U-Net) trained using a weighted multi-task loss as an alternative to the C-U-Net. We investigate two weighting strategies for our multi-task loss: 1) Dynamic Weight Average (DWA), and 2) Energy Based Weighting (EBW). DWA determines the weights by tracking the rate of change of the loss of each task during training. EBW aims to neutralize the effect of the training bias arising from the difference in energy levels of the sources in a mixture. Our methods provide two-fold advantages over the C-U-Net: 1) fewer effective training iterations with no conditioning, and 2) fewer trainable network parameters (no control parameters). Our methods achieve performance comparable to that of C-U-Net and the dedicated U-Nets at a much lower training cost.

source separation, multi-task loss, supervised, deep learning, weighted loss

I Introduction

Music source separation is the automatic estimation of the individual isolated sources that make up the audio mixture. It has been one of the most popular research problems in the music information retrieval community. Since most of the music audio present in the world exists in the form of mixtures, there are several applications of a system capable of music source separation – e.g. automatic creation of karaoke, music transcription, music unmixing and remixing, music production, assistance in music education, and denoising.

Fig. 1: Typical models for music source separation: (a) dedicated models, (b) conditioned model, (c) multi-task model.

We are interested in training a system discriminatively to estimate the sources present in the audio mixture. The existing methods mostly use either the spectrogram [3, 10, 5] or directly the waveform representation [2, 11] as the input to train such a system. Among the spectrogram based methods, the U-Net [9] based methods have been popular owing to their simplicity and ease of training. Jansson et al. [3] proposed using a pair of independently trained U-Nets (type (a) system in Fig. 1) for singing voice separation. Meseguer-Brocal and Peeters [5] pointed out that the implementations of such source-specific models get computationally expensive when there is a larger number of sources to be estimated. They proposed Conditioned U-Net (C-U-Net) as a cheaper alternative achieving comparable performance to that of dedicated U-Nets despite being a single model. The C-U-Net introduces control parameters through Feature-wise Linear Modulation (FiLM) layers in the encoder part of the U-Net, which adapt the model to estimate the desired source. C-U-Net corresponds to the type (b) system in Fig. 1. Though it is an interesting development over the work of [3], we notice that training a C-U-Net could also get expensive as we scale up the number of sources to be estimated. This is so because a data sample needs to be passed to C-U-Net multiple times with different conditions during training. In this way, for $S$ sources, the effective number of C-U-Net training iterations will be at least $S$ times the number of training iterations for a single U-Net. Also, the addition of control parameters further adds to the training cost. For these reasons, we investigate the possibility of using a single multi-task U-Net (M-U-Net) (a type (c) system from Fig. 1) which neither requires passing a data sample through the network multiple times nor involves any extra trainable parameters. Another motivation for using the multi-task model comes from the fact that it could potentially perform even better than the dedicated models by learning from the extra mutual information across the tasks and sharing inductive bias, as pointed out by Caruana [1]. Surprisingly, we have not found such a multi-task system being used for music source separation before.

In this work, we train a single M-U-Net for multi-instrument source separation using a weighted multi-task loss function. We investigate the source separation task in two settings: 1) singing voice separation (two sources), and 2) multi-instrument source separation (four sources). The number of final output channels of our U-Net corresponds to the total number of sources in the chosen setting. Each loss term in our multi-task loss function corresponds to the loss on the respective source estimate. We explore Dynamic Weight Average (DWA) [4] and Energy Based Weighting (EBW) strategies to determine the weights for our multi-task loss function. We compare the performance of our U-Net trained with the multi-task loss with that of dedicated U-Nets and the C-U-Net. Then we investigate the effect of training with silent-source samples 2 on the performance. We also study the effect of the choice of loss term definition on the source separation performance.

Our main contributions are:

  • to propose M-U-Net as a computationally cheaper alternative (in terms of the number of training iterations and trainable parameters) to C-U-Net and the dedicated U-Nets for multi-instrument source separation.

  • to propose a novel weighting strategy, EBW, for the multi-task loss function, based on the energy distribution in the ground truth sources.

  • to show that training a model by discarding the data samples containing silent sources could reduce the overall number of training iterations and yet perform as well as the model trained with all the training data samples, if not better.

  • to emphasize the importance of choosing appropriate signal representation for computing the loss term.

The code necessary to reproduce our work and the pre-trained weights of our models are made publicly available.

II Proposed Method

We train a Multi-task U-Net (M-U-Net) that generates multiple outputs, one per source in the mixture. Having multiple outputs gives rise to multiple task-specific loss terms and hence the following multi-task loss function:

$$L = \sum_{i=1}^{S} w_i L_i \qquad (1)$$

where $L_i$ is the loss term corresponding to the $i$-th source, $w_i$ is its corresponding weight, $S$ is the number of sources and $L$ is the overall scalar-valued loss. The input to our M-U-Net is the log-magnitude spectrogram of an audio mixture data sample. We train the U-Net to generate soft masks as the outputs.
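A minimal sketch of how this weighted loss could be assembled in PyTorch is shown below; the helper names, tensor shapes and the per-source loss function are our own assumptions rather than the authors' implementation.

```python
import torch

def weighted_multitask_loss(per_source_losses: torch.Tensor,
                            weights: torch.Tensor) -> torch.Tensor:
    """Eq. (1): combine the per-source loss terms L_i into one scalar loss L.

    per_source_losses: shape (S,), one loss term per source
    weights:           shape (S,), one weight w_i per source
    """
    return (weights * per_source_losses).sum()

# Hypothetical training step: the network maps the log-magnitude mixture
# spectrogram to S soft masks, and each mask contributes one loss term.
# masks = model(log_mag_mix)                       # (batch, S, F, T)
# losses = torch.stack([loss_fn(masks[:, i], targets[:, i]) for i in range(S)])
# loss = weighted_multitask_loss(losses, weights)
# loss.backward()
```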

We explore two different definitions for the individual loss terms $L_i$:

Direct Loss In this case, we first determine the Ideal Amplitude Mask (IAM [13]) for the ground truth magnitude spectrogram of each source, for each time-frequency bin, as:

$$M_i(t,f) = \frac{Y_i(t,f)}{X(t,f)} \qquad (2)$$

where $X$ is the mixture magnitude spectrogram and $Y_i$ is the magnitude spectrogram of the $i$-th source. We then find the mean absolute error (MAE) directly between the original source IAM masks $M_i$ and their respective estimated masks $\hat{M}_i$:

$$L_i = \frac{1}{TF}\sum_{t,f}\left|M_i(t,f) - \hat{M}_i(t,f)\right| \qquad (3)$$

Indirect Loss In this case, we find the MAE between the original source magnitude spectrograms $Y_i$ and the estimated spectrograms $\hat{Y}_i = \hat{M}_i \odot X$, as shown in (4). It is ‘indirect’ in the sense that the U-Net outputs the masks but the loss term is defined on the spectrogram representations rather than on the masks. This kind of loss term definition has been used in source separation works like [14, 10, 5]. Michelsanti et al. [6] showed that such an indirect loss performs better than the direct loss term for the speech enhancement task.

$$L_i = \frac{1}{TF}\sum_{t,f}\left|Y_i(t,f) - \hat{M}_i(t,f)\,X(t,f)\right| \qquad (4)$$
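The two definitions can be sketched as follows; the variable names and the small epsilon guarding the division in the IAM are our assumptions.

```python
import torch

EPS = 1e-8  # guard against division by zero in the IAM (our assumption)

def direct_loss(mask_hat, source_mag, mix_mag):
    """Eq. (3): MAE between the IAM (Eq. 2) and the estimated soft mask."""
    iam = source_mag / (mix_mag + EPS)     # Ideal Amplitude Mask, Eq. (2)
    return (iam - mask_hat).abs().mean()

def indirect_loss(mask_hat, source_mag, mix_mag):
    """Eq. (4): MAE between the true and the mask-estimated spectrograms."""
    est_mag = mask_hat * mix_mag           # estimated magnitude spectrogram
    return (source_mag - est_mag).abs().mean()
```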

II-A Loss Weighting Strategies

Now, we shift the focus to determining the weights for each loss term in (1). The ranges of loss values vary from one task to another. This results in competing tasks, which could eventually make the training imbalanced. Training a multi-task model with imbalanced loss contributions might eventually bias the model in favor of the task with the highest individual loss, undermining the other tasks. Since all the tasks are of equal importance to us and the ranges of their individual loss terms differ, we cannot treat the loss terms equally. We need to assign weights to these individual loss terms indicative of their relative importance with respect to each other. Finding the right set of weights helps to counter the imbalance caused by the competing tasks during training and helps the multi-task system learn better. For determining the weights of the losses in our multi-task loss function, we mainly explore the Dynamic Weight Average (DWA) and the Energy Based Weighting (EBW) strategies in this paper:

Dynamic Weight Average (DWA)

Liu et al. [4] proposed the Dynamic Weight Average method for continuously adapting the weights of the losses in a multi-task loss function during training. In this method, the weights are distributed such that a loss term decreasing at a higher rate is assigned a lower weight than a loss which does not decrease much. In this way, the model learns to focus more on difficult tasks rather than selectively learning easier tasks. The weight for the $i$-th task is determined as:

$$w_i(t) = \frac{S \exp\left(r_i(t-1)/T\right)}{\sum_{k=1}^{S} \exp\left(r_k(t-1)/T\right)}, \qquad r_i(t-1) = \frac{L_i(t-1)}{L_i(t-2)} \qquad (5)$$

where $r_i$ indicates the relative descending rate of the loss term $L_i$, $t$ is the iteration index, and $T$ corresponds to the temperature which controls the softness of the task weighting. The larger the value of $T$, the more even the distribution of weights across all the tasks.

In our work, we use DWA with a fixed temperature $T$, the loss term $L_i(t)$ being the average loss across the iterations in an epoch for the $i$-th task. Like in [4], we also initialize $r_i(t) = 1$ for $t \in \{1, 2\}$ to avoid improper initialization.
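A minimal sketch of the DWA update, assuming the per-epoch average losses are collected in a list; the bookkeeping and variable names are our own.

```python
import torch

def dwa_weights(loss_history, num_sources, temperature):
    """Eq. (5): Dynamic Weight Average weights from per-epoch average losses.

    loss_history: list of tensors of shape (S,), one entry per finished epoch
    Returns a tensor of shape (S,) holding the weights for the next epoch.
    """
    if len(loss_history) < 2:
        # As in [4], fall back to equal weights for the first two epochs.
        return torch.ones(num_sources)
    # Relative descending rate r_i = L_i(t-1) / L_i(t-2).
    rates = loss_history[-1] / loss_history[-2]
    # Softmax over the rates, scaled so that the weights sum to S.
    return num_sources * torch.softmax(rates / temperature, dim=0)
```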

Energy Based Weighting (EBW)

We noticed that the distribution of energy content in the target task representations to be learned varies from one task to another (see Fig. 2). In our context, this translates to a non-uniform energy distribution across the sources that we are estimating. In the singing voice separation setting, the accompaniment has more energy than the vocals. In the multi-instrument source separation setting, the bass has relatively higher energy than the other sources. We hypothesize that this uneven energy distribution could be a reason why the multi-task model preferentially learns certain tasks more than others. When we trained our multi-task model with a unit weighted loss, the estimates of sources with higher energy were better than those of lower energy sources. Hence, we propose a weighting strategy based on the energy distribution in the target representations such that the model does not become biased towards the high-energy sources. To our knowledge, so far there has been no work that determines the weights of the multi-task loss based on the energy distribution in the target source representations.

Fig. 2: Energy distribution across the sources.

We explore the following energy-based weighting settings in this work:

EBW_P1 In this setting, we use the average energy content $E_i$ of the $i$-th source, computed across all the training samples (see (6)). Note that the weights are constant throughout the training for this setting.

$$w_i = \frac{\max_k E_k}{E_i} \qquad (6)$$

This way, $w_i \geq 1$ for all tasks, being $1$ for the task associated with the source of highest energy and greater than $1$ for the rest; in particular, the lower the energy of a specific source, the higher its corresponding weight $w_i$, thus keeping the balance among tasks.

EBW_InstP1 In this setting, we use the average energy content $E_i(t)$ of the $i$-th source across all the samples in the batch at iteration $t$, as shown in (7). Note that the weights change during the training for this setting.

$$w_i(t) = \frac{\max_k E_k(t)}{E_i(t)} \qquad (7)$$

EBW_P2 This setting is very similar to EBW_P1 except for the fact that there is a power of 2 while determining the weights, thus strengthening the relative importance between the task related to the highest-energy source and the tasks related to the rest of the sources:

$$w_i = \left(\frac{\max_k E_k}{E_i}\right)^{2} \qquad (8)$$

Note that these weights are constant throughout the training for this setting, as in EBW_P1.
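The three EBW variants could be computed roughly as follows; taking the mean squared magnitude of each source spectrogram as the energy estimate is our assumption, since the exact estimator is not spelled out above.

```python
import torch

def source_energies(source_mags):
    """Average energy per source from magnitude spectrograms of shape (B, S, F, T)."""
    return source_mags.pow(2).mean(dim=(0, 2, 3))        # shape (S,)

def ebw_weights(energies, power=1):
    """EBW_P1 (power=1, Eq. 6) and EBW_P2 (power=2, Eq. 8)."""
    return (energies.max() / energies) ** power

# EBW_InstP1 (Eq. 7) uses the same formula but is recomputed on every batch:
# w_t = ebw_weights(source_energies(batch_source_mags), power=1)
```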

III Experiments

In this section, we explore the effect of the weighting strategies discussed in the previous section on training a multi-task model for source separation, and compare their performance to that of a system of dedicated models and a conditioned multi-task model. We also perform ablation studies concerning the choice of loss term and the effect of silent-source samples.

III-A Dataset

We use the MUSDB18 [8] dataset for this work. It contains 150 full-length stereo (two-channel) audio tracks along with the isolated constituent sources. The ground truth sources are available in two settings: 1) 2 sources (vocals and accompaniment), and 2) 4 sources (vocals, drums, bass and rest). The dataset comes with a pre-defined split of 100 tracks for training and 50 for testing. We convert the tracks to mono (single channel), downsample the audio to 10880 Hz and split them into 6 s long chunks without any overlap. We then apply the Short-time Fourier Transform (STFT) on these chunks using a Hann window of size 1022 and a hop size of 256. This results in spectrograms of size 512×256. We resample these spectrograms to 256×256. Our preprocessing steps are similar to those of [14]. We move 5% of the spectrogram samples from the training set to form our validation set. From this new training set, we filter out the silent-source samples.
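The preprocessing can be sketched roughly as below using librosa; the interpolation used to resample the spectrograms and the function names are our assumptions.

```python
import librosa
import numpy as np
import torch
import torch.nn.functional as F

SR = 10880                      # target sample rate (Hz)
CHUNK = 6 * SR                  # 6-second chunks, no overlap
N_FFT, HOP = 1022, 256

def track_to_spectrograms(path):
    """Load a track and return a list of 256x256 log-magnitude spectrograms."""
    audio, _ = librosa.load(path, sr=SR, mono=True)
    chunks = [audio[i:i + CHUNK] for i in range(0, len(audio) - CHUNK + 1, CHUNK)]
    specs = []
    for chunk in chunks:
        mag = np.abs(librosa.stft(chunk, n_fft=N_FFT, hop_length=HOP,
                                  window="hann"))              # roughly 512 x 256
        mag = torch.from_numpy(mag)[None, None]                # (1, 1, F, T)
        mag = F.interpolate(mag, size=(256, 256), mode="bilinear",
                            align_corners=False)[0, 0]         # resample to 256 x 256
        specs.append(torch.log1p(mag))                         # log-magnitude input
    return specs
```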

III-B Network Architecture

All the models in this work are based on a basic U-Net [9] model comprising filters of sizes {32, 64, 128, 256, 512, 1024, 2048}. We have 6 down-convolution blocks, a transition block and 6 up-convolution blocks along with the skip connections in between them. Throughout the experiments, we train the models using the Stochastic Gradient Descent (SGD) optimizer with a learning rate of 0.01 (unless otherwise mentioned) and a dropout of 0.1. The input to all our models is the log-magnitude spectrogram of an audio mixture data sample, of dimensions 256×256.
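A compact PyTorch sketch of such a U-Net is given below, following the filter sizes listed above; the kernel sizes, normalization, activations and the exact placement of the skip connections are our assumptions and may differ from the authors' implementation.

```python
import torch
import torch.nn as nn

CH = [32, 64, 128, 256, 512, 1024, 2048]     # filter sizes from Sec. III-B

def down(cin, cout):
    # Strided convolution halves the spatial resolution (256 -> 128 -> ... -> 4).
    return nn.Sequential(nn.Conv2d(cin, cout, 4, stride=2, padding=1),
                         nn.BatchNorm2d(cout), nn.LeakyReLU(0.2))

def up(cin, cout, p=0.1):
    # Transposed convolution doubles the spatial resolution.
    return nn.Sequential(nn.ConvTranspose2d(cin, cout, 4, stride=2, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(), nn.Dropout(p))

class MUNet(nn.Module):
    """Multi-task U-Net: one soft mask per source (num_sources output channels)."""

    def __init__(self, num_sources):
        super().__init__()
        ins = [1] + CH[:5]                                     # 6 down blocks
        self.downs = nn.ModuleList([down(i, o) for i, o in zip(ins, CH[:6])])
        self.transition = nn.Sequential(                       # bottleneck at 4x4
            nn.Conv2d(CH[5], CH[6], 3, padding=1),
            nn.BatchNorm2d(CH[6]), nn.ReLU())
        up_in = [2048, 1536, 768, 384, 192, 96]                # after concatenation
        up_out = [1024, 512, 256, 128, 64, 32]
        self.ups = nn.ModuleList([up(i, o) for i, o in zip(up_in, up_out)])
        self.head = nn.Sequential(nn.Conv2d(32, num_sources, 1), nn.Sigmoid())

    def forward(self, x):                                      # x: (B, 1, 256, 256)
        skips = []
        for d in self.downs:
            x = d(x)
            skips.append(x)
        x = self.transition(x)
        for i, u in enumerate(self.ups):
            x = u(x)
            if i < 5:                                          # skip connections
                x = torch.cat([x, skips[4 - i]], dim=1)
        return self.head(x)                                    # (B, S, 256, 256)
```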

In the case of the dedicated U-Nets, we use a U-Net with a single output channel since each network estimates only a single source at a time. For the C-U-Net model, we adapt the implementation of C-U-Net provided by [5] to make it consistent with our U-Net architecture for a fairer comparison. In C-U-Net too, there is a single output channel as it estimates only one source at a time. In our M-U-Net, the number of output channels corresponds to the total number of sources to be estimated, $S$. The training cost for each of these models is reported in Table I. Note that our M-U-Net has the fewest trainable parameters as well as the fewest training iterations compared to the others.

Model | # params (approx.) | # training iterations
Dedicated U-Nets | 124M | N each
C-U-Net | 162M | S·N
M-U-Net | 124M | N
(N = number of training samples, S = number of sources, M = million)
TABLE I: Training Cost

III-C Evaluation Metrics

We choose to evaluate the following metrics [12]: Source-to-Distortion Ratio (SDR), Source-to-Interference Ratio (SIR) and Source-to-Artifact Ratio (SAR), which are typically used for evaluating music source separation performance. We use the mir_eval toolbox [7] to compute these metrics. Note that all our models estimate soft masks, and we obtain the magnitude spectrogram estimate of each source by multiplying these masks with the magnitude spectrogram of the mixture. We combine the phase of the mixture spectrogram with the magnitude spectrograms of the estimated sources and apply the inverse STFT to obtain the waveforms. The metrics are evaluated on the waveforms of the estimated sources with respect to the appropriately downsampled ground truth audio waveforms. Out of these three metrics, SDR is more indicative of the source separation quality as a global performance measure [12] and some works (e.g. [11]) report only this metric.
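A sketch of the reconstruction and evaluation step using librosa and mir_eval; upsampling the 256-bin mask back to the 512-bin STFT grid before masking is our assumption about how the mask is applied.

```python
import librosa
import mir_eval
import numpy as np
import torch
import torch.nn.functional as F

def reconstruct(mask_256, mix_stft, hop=256):
    """Apply an estimated soft mask to the complex mixture STFT and invert it."""
    mask = torch.from_numpy(mask_256)[None, None]
    mask = F.interpolate(mask, size=mix_stft.shape, mode="bilinear",
                         align_corners=False)[0, 0].numpy()
    est_stft = mask * mix_stft            # masked magnitude, mixture phase
    return librosa.istft(est_stft, hop_length=hop)

# SDR / SIR / SAR against the downsampled ground-truth waveforms:
# sdr, sir, sar, _ = mir_eval.separation.bss_eval_sources(
#     np.stack(reference_waveforms), np.stack(estimated_waveforms))
```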

III-D Multi-task Experiments

We aim to show that our M-U-Net can perform as well as (if not better than) the dedicated U-Nets for both singing voice separation (2 sources) and multi-instrument source separation (4 sources), with fewer trainable parameters. We conduct experiments with the M-U-Net exploring the weighting strategies {DWA, EBW_P1, EBW_InstP1, EBW_P2} discussed earlier. We compare the performance of these M-U-Nets with the dedicated U-Nets and the C-U-Net. To gauge the effectiveness of the weighting strategies, we also train an M-U-Net with unit weights (UW) and compare its performance with the models trained with our weighting strategies. Throughout the experiments, unless otherwise mentioned, we use the indirect loss definition (4). Table II and Table III, respectively, report the results for singing voice separation and multi-instrument source separation.

Model | Metric | Vocals | Accompaniment | Overall
Dedicated U-Nets (x2) | SDR | 5.09 ± 4.31 (5.61) | 12.95 ± 3.18 (12.53) | 9.02 ± 5.46 (9.64)
 | SIR | 11.68 ± 4.85 (12.02) | 17.80 ± 3.80 (17.60) | 14.74 ± 5.31 (14.93)
 | SAR | 6.83 ± 3.36 (7.20) | 15.07 ± 3.35 (14.82) | 10.95 ± 5.32 (11.16)
C-U-Net | SDR | 4.42 ± 4.98 (5.17) | 12.21 ± 2.58 (12.16) | 8.31 ± 5.56 (9.26)
 | SIR | 12.99 ± 5.67 (13.93) | 18.16 ± 4.13 (17.65) | 15.57 ± 5.58 (16.12)
 | SAR | 5.73 ± 3.64 (5.92) | 13.94 ± 2.69 (14.06) | 9.83 ± 5.21 (9.99)
UW | SDR | 5.06 ± 4.93 (5.75) | 12.98 ± 3.14 (12.48) | 9.02 ± 5.72 (9.74)
 | SIR | 11.84 ± 5.09 (12.00) | 17.60 ± 3.81 (17.24) | 14.72 ± 5.33 (15.00)
 | SAR | 6.81 ± 4.02 (7.31) | 15.20 ± 3.18 (15.01) | 11.00 ± 5.55 (11.24)
DWA | SDR | 5.20 ± 4.50 (5.67) | 12.96 ± 3.11 (12.44) | 9.08 ± 5.48 (9.61)
 | SIR | 11.97 ± 4.81 (12.30) | 17.15 ± 3.83 (16.67) | 14.56 ± 5.05 (14.67)
 | SAR | 6.86 ± 3.66 (7.44) | 15.44 ± 3.03 (15.21) | 11.15 ± 5.46 (11.54)
EBW_P1 | SDR | 5.12 ± 4.78 (5.89) | 13.06 ± 2.91 (12.88) | 9.09 ± 5.60 (9.77)
 | SIR | 11.81 ± 5.15 (12.08) | 18.23 ± 3.88 (17.84) | 15.02 ± 5.57 (15.41)
 | SAR | 6.86 ± 3.62 (7.36) | 14.99 ± 2.85 (15.00) | 10.92 ± 5.21 (11.02)
EBW_InstP1 | SDR | 5.28 ± 4.60 (5.79) | 13.04 ± 3.02 (12.69) | 9.16 ± 5.50 (9.79)
 | SIR | 12.01 ± 5.24 (12.22) | 17.36 ± 3.91 (16.85) | 14.69 ± 5.33 (14.93)
 | SAR | 7.02 ± 3.41 (7.46) | 15.45 ± 2.88 (15.20) | 11.24 ± 5.27 (11.59)
EBW_P2 | SDR | 5.07 ± 4.56 (5.63) | 12.89 ± 2.95 (12.39) | 8.98 ± 5.48 (9.66)
 | SIR | 11.30 ± 5.12 (11.47) | 17.31 ± 3.95 (16.92) | 14.30 ± 5.46 (14.56)
 | SAR | 7.00 ± 3.41 (7.31) | 15.27 ± 2.84 (15.04) | 11.13 ± 5.20 (11.42)
trained with learning rate 0.001 instead of 0.01.
TABLE II: Results of Singing Voice Separation

From these tables, we notice that the source separation performance gets worse as the number of sources increases from 2 to 4, across all methods. Despite scoring lower on the SDR and SAR metrics, C-U-Net always gives the best overall SIR. In general, we find the performance of the M-U-Nets trained using our weighting strategies comparable to that of C-U-Net and the dedicated U-Nets. Especially in the 2-source setting, the DWA and EBW models perform better than the naive unit weighting (UW) baseline, indicating the usefulness of our weighting strategies. As seen in Fig. 2, the average energy differs much more across sources in the 2-source setting than in the 4-source setting. Hence, the EBW methods, which incorporate the signal energy information to weight the loss terms, outperform the energy-agnostic DWA method in the 2-source setting but not in the 4-source setting. Our implementation of C-U-Net improves upon the results reported in [5], owing to the larger number of filters and convolutions in our convolution blocks, which form a deeper model. Overall, we notice that M-U-Net performs as well as C-U-Net (if not better) in both settings.

Model | Metric | Vocals | Drums | Bass | Rest | Overall
Dedicated U-Nets (x4) | SDR | 4.96 ± 4.63 (5.77) | 4.95 ± 3.56 (4.60) | 2.78 ± 4.41 (3.19) | 1.21 ± 3.38 (2.23) | 3.48 ± 4.30 (3.61)
 | SIR | 10.70 ± 5.05 (11.19) | 10.24 ± 4.04 (9.66) | 5.83 ± 5.37 (5.57) | 5.56 ± 3.35 (6.31) | 8.08 ± 5.09 (8.04)
 | SAR | 7.15 ± 3.38 (7.20) | 7.26 ± 3.23 (6.75) | 7.86 ± 3.03 (8.02) | 4.73 ± 2.83 (5.23) | 6.75 ± 3.32 (6.67)
C-U-Net | SDR | 4.49 ± 4.75 (5.26) | 4.54 ± 3.59 (4.30) | 2.51 ± 4.26 (2.97) | 0.97 ± 3.57 (1.69) | 3.13 ± 4.31 (3.37)
 | SIR | 11.33 ± 4.80 (11.97) | 10.80 ± 4.15 (10.77) | 6.60 ± 4.91 (6.40) | 6.12 ± 3.20 (6.56) | 8.71 ± 4.90 (8.79)
 | SAR | 6.23 ± 3.87 (6.77) | 6.49 ± 3.34 (5.88) | 6.46 ± 3.54 (6.57) | 3.98 ± 3.35 (4.43) | 5.79 ± 3.66 (5.81)
UW | SDR | 4.31 ± 4.80 (5.46) | 5.19 ± 3.51 (4.72) | 2.55 ± 4.58 (2.81) | 1.51 ± 3.32 (2.49) | 3.39 ± 4.32 (3.58)
 | SIR | 9.95 ± 5.28 (10.73) | 10.85 ± 4.27 (10.24) | 5.36 ± 5.25 (5.25) | 5.66 ± 3.49 (6.65) | 7.96 ± 5.22 (8.00)
 | SAR | 6.70 ± 3.28 (7.21) | 7.31 ± 3.23 (6.62) | 7.95 ± 3.05 (7.93) | 5.11 ± 2.65 (5.84) | 6.77 ± 3.22 (6.71)
DWA | SDR | 4.36 ± 4.64 (5.24) | 5.22 ± 3.54 (4.92) | 2.78 ± 4.54 (2.88) | 1.52 ± 3.25 (2.45) | 3.47 ± 4.25 (3.61)
 | SIR | 9.80 ± 5.05 (10.65) | 10.42 ± 4.28 (10.11) | 5.76 ± 5.24 (5.63) | 5.39 ± 3.50 (6.48) | 7.84 ± 5.08 (7.78)
 | SAR | 6.78 ± 3.35 (7.30) | 7.62 ± 3.19 (6.95) | 7.81 ± 3.04 (7.82) | 5.41 ± 2.50 (5.96) | 6.90 ± 3.16 (6.97)
EBW_P1 | SDR | 4.51 ± 4.56 (5.41) | 5.13 ± 3.50 (4.77) | 2.64 ± 4.32 (2.94) | 1.59 ± 3.17 (2.64) | 3.46 ± 4.15 (3.65)
 | SIR | 10.25 ± 4.97 (11.05) | 10.60 ± 4.34 (10.11) | 5.38 ± 4.91 (5.36) | 5.64 ± 3.31 (6.61) | 7.97 ± 5.04 (7.93)
 | SAR | 6.74 ± 3.33 (7.29) | 7.37 ± 3.18 (6.70) | 7.93 ± 2.97 (7.99) | 5.27 ± 2.63 (5.88) | 6.83 ± 3.18 (6.77)
EBW_InstP1 | SDR | 4.49 ± 4.62 (5.46) | 5.16 ± 3.55 (4.85) | 2.63 ± 4.53 (2.86) | 1.58 ± 3.18 (2.58) | 3.46 ± 4.24 (3.52)
 | SIR | 10.31 ± 5.10 (10.94) | 10.61 ± 4.24 (10.23) | 5.52 ± 5.39 (5.31) | 5.61 ± 3.41 (6.61) | 8.01 ± 5.08 (8.10)
 | SAR | 6.67 ± 3.36 (7.19) | 7.39 ± 3.28 (6.84) | 7.93 ± 3.02 (7.90) | 5.28 ± 2.57 (5.79) | 6.82 ± 3.81 (6.86)
EBW_P2 | SDR | 4.48 ± 4.82 (5.44) | 5.17 ± 3.60 (4.89) | 2.69 ± 4.44 (2.99) | 1.62 ± 3.14 (2.58) | 3.49 ± 4.26 (3.66)
 | SIR | 10.20 ± 5.27 (10.78) | 10.40 ± 4.03 (10.11) | 5.74 ± 5.45 (5.46) | 5.72 ± 3.39 (6.77) | 8.01 ± 5.12 (7.92)
 | SAR | 6.77 ± 3.35 (7.28) | 7.48 ± 3.30 (6.82) | 7.80 ± 2.99 (7.93) | 5.26 ± 2.50 (5.85) | 6.82 ± 3.19 (6.79)
TABLE III: Results of Multi-instrument Source Separation

III-E Ablation Studies

Now, we present some additional experiments to evaluate which loss definition, among direct loss and indirect loss, performs better. We also analyze the effect of including silent-source samples in the training set. For these additional experiments, we consider the setting EBW_P1 on 4 sources as a reference. Table IV reports the performance metrics pertaining to these experiments.

Model | Overall SDR | Overall SIR | Overall SAR
EBW_P1 | 3.46 ± 4.15 (3.65) | 7.97 ± 5.04 (7.93) | 6.83 ± 3.18 (6.77)
EBW_P1 with Direct Loss (3) | 3.31 ± 3.96 (3.46) | 7.98 ± 4.97 (8.18) | 6.64 ± 3.12 (6.67)
EBW_P1 without filtering | 3.44 ± 4.28 (3.59) | 8.33 ± 5.10 (8.24) | 6.61 ± 3.46 (6.72)
TABLE IV: Results of Ablation Studies

From Table IV, we notice that the overall performance drops in both ablation studies. Despite the slight improvement in the SIR metric when using the direct loss (3), based on the SDR and SAR metrics we recommend using the indirect loss definition (4), which is congruent with the findings for speech enhancement [6]. We also infer that training with silent-source samples does not contribute much to the overall performance, and we recommend discarding them.

IV Conclusion

We presented a multi-task U-Net as a cheaper alternative (in terms of the number of training iterations and the number of trainable parameters) to the Conditioned U-Net and to a system of dedicated U-Nets for music source separation. Such a multi-task approach could potentially be extended to models other than the U-Net and perhaps also to other kinds of tasks. We also presented a novel weighting strategy, EBW, for the multi-task loss function, based on the energy of the signal representations. We showed that the EBW method is effective when the average energy values across the sources to be estimated are very different. We believe there are other ways of distilling the energy information into the weighting strategy and leave this for future work. We also showed that discarding silent-source samples during training saves on the training cost without much compromise in performance. Finally, we showed that M-U-Net performs better when trained with an indirect loss term rather than with the direct loss on the masks.

Acknowledgment

We thank Daniel Michelsanti (Aalborg University) and Olga Slizovskaia (Universitat Pompeu Fabra) for the insightful discussions related to source separation methods and practices.

Footnotes

  1. This work has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 713673. V. S. K. has received financial support through “la Caixa” Foundation (ID 100010434), fellowship code: LCF/BQ/DI18/11660064. Additional funding comes from the MICINN/FEDER UE project with reference PGC2018-098625-B-I00, the H2020-MSCA-RISE-2017 project with reference 777826 NoMADS, the Spanish Ministry of Economy and Competitiveness under the María de Maeztu Units of Excellence Program (MDM-2015-0502) and the European Social Fund. We also thank Nvidia for the donation of GPUs.
  2. data samples containing at least one silent source

References

  1. R. Caruana (1997) Multitask learning. Machine Learning 28 (1), pp. 41–75.
  2. A. Défossez, N. Usunier, L. Bottou and F. Bach (2019) Music source separation in the waveform domain. arXiv:1911.13254.
  3. A. Jansson, E. J. Humphrey, N. Montecchio, R. M. Bittner, A. Kumar and T. Weyde (2017) Singing voice separation with deep U-Net convolutional networks. In Proc. 18th Int. Society for Music Information Retrieval Conf. (ISMIR), pp. 745–751.
  4. S. Liu, E. Johns and A. J. Davison (2019) End-to-end multi-task learning with attention. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 1871–1880.
  5. G. Meseguer-Brocal and G. Peeters (2019) Conditioned-U-Net: introducing a control mechanism in the U-Net for multiple source separations. In Proc. 20th Int. Society for Music Information Retrieval Conf. (ISMIR).
  6. D. Michelsanti, Z. Tan, S. Sigurdsson and J. Jensen (2019) On training targets and objective functions for deep-learning-based audio-visual speech enhancement. In IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pp. 8077–8081.
  7. C. Raffel, B. McFee, E. J. Humphrey, J. Salamon, O. Nieto, D. Liang and D. P. Ellis (2014) mir_eval: a transparent implementation of common MIR metrics. In Proc. 15th Int. Society for Music Information Retrieval Conf. (ISMIR).
  8. Z. Rafii, A. Liutkus, F. Stöter, S. I. Mimilakis and R. Bittner (2017) The MUSDB18 corpus for music separation.
  9. O. Ronneberger, P. Fischer and T. Brox (2015) U-Net: convolutional networks for biomedical image segmentation. In Int. Conf. on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 234–241.
  10. D. Stoller, S. Ewert and S. Dixon (2018) Adversarial semi-supervised audio source separation applied to singing voice extraction. In IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pp. 2391–2395.
  11. D. Stoller, S. Ewert and S. Dixon (2018) Wave-U-Net: a multi-scale neural network for end-to-end audio source separation. In Proc. 19th Int. Society for Music Information Retrieval Conf. (ISMIR), pp. 334–340.
  12. E. Vincent, R. Gribonval and C. Févotte (2006) Performance measurement in blind audio source separation. IEEE Trans. on Audio, Speech, and Language Processing 14 (4), pp. 1462–1469.
  13. Y. Wang, A. Narayanan and D. Wang (2014) On training targets for supervised speech separation. IEEE/ACM Trans. on Audio, Speech, and Language Processing 22 (12), pp. 1849–1858.
  14. H. Zhao, C. Gan, A. Rouditchenko, C. Vondrick, J. McDermott and A. Torralba (2018) The sound of pixels. In European Conf. on Computer Vision (ECCV), pp. 570–586.