# Denoising Auto-encoder with Recurrent Skip Connections and Residual Regression for Music Source Separation

###### Abstract

Convolutional neural networks with skip connections have shown good performance in music source separation. In this work, we propose a denoising Auto-encoder with Recurrent skip Connections (ARC). We use 1D convolution along the temporal axis of the time-frequency feature map in all layers of the fully-convolutional network. The use of 1D convolution makes it possible to apply recurrent layers to the intermediate outputs of the convolution layers. In addition, we propose an enhancement network and a residual regression method to further improve the separation result. The recurrent skip connections, the enhancement module, and the residual regression all improve the separation quality. The ARC model with residual regression achieves a signal-to-distortion ratio (SDR) of 5.74 in vocals on MUSDB in SiSEC 2018. We also evaluate the ARC model alone on the older dataset DSD100 (used in SiSEC 2016), where it achieves an SDR of 5.91 in vocals.

## I Introduction

Music source separation aims at separating music sources such as vocals, drums, strings, or accompaniment from the original song. It can facilitate tasks that require clean sound sources, such as music remixing and karaoke [1]. In this work, we introduce a new model that uses a denoising auto-encoder with symmetric skip connections for music source separation. Symmetric skip connections have been used for biomedical image segmentation [2] and singing voice separation [3]. Our model differs in that it uses 1D convolutions instead of 2D convolutions, which has the benefit that recurrent layers can be applied right after the convolution layers. Furthermore, an enhancement module and a residual regression method are introduced in addition to the separation module.

## II Proposed models

In this section, we introduce the separation model, the enhancement model, and residual regression.

### II-A Separation model

The separation model is a fully-convolutional network (FCN) [4, 5]. All the convolution layers use 1D convolution. We call it the ARC model because it is, in principle, a denoising auto-encoder with recurrent skip connections.

Jansson et al. [3] used a convolutional neural network (CNN) with symmetric skip connections for singing voice separation, with 2D convolutions in the convolution layers. The output tensor of a 2D convolution layer has the shape (channels, frequency bins, temporal points). A recurrent layer expects one feature vector per time step, so the extra frequency-bin dimension makes it awkward to apply recurrent layers to this tensor directly.

In our model, the convolution layers use 1D convolutions, that is, convolutions along the temporal axis [6, 7]. The output tensor of a 1D convolution layer has the shape (channels, temporal points). This allows us to apply recurrent layers directly to the convolution output tensors.

The proposed architecture is presented in Fig. 1. It contains six convolution layers and two skip connections. The two skip connections are processed by gated recurrent unit (GRU) layers [8, 9]. We use weight normalization [10] instead of batch normalization [11] in each convolution layer. Leaky rectified linear units (Leaky ReLUs) with a slope of 0.01 [12] are applied to all the convolution and transposed convolution layers.
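To make the shape argument concrete, the following is a minimal PyTorch sketch of a single encoder/decoder pair joined by a GRU skip connection. It is not the architecture of Fig. 1: the class name `RecurrentSkipBlock`, the layer sizes, the unidirectional GRU, and the choice to merge the skip path by addition are all assumptions made for illustration, and weight normalization is omitted for brevity.

```python
import torch
from torch import nn


class RecurrentSkipBlock(nn.Module):
    """One 1D-conv encoder/decoder pair with a GRU on the skip path.

    Hypothetical sizes; frequency bins are treated as input channels so
    that the convolution runs along the temporal axis only.
    """

    def __init__(self, freq_bins=1025, hidden=64, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.encode = nn.Conv1d(freq_bins, hidden, kernel_size, padding=pad)
        self.skip_gru = nn.GRU(hidden, hidden, batch_first=True)
        self.decode = nn.ConvTranspose1d(hidden, freq_bins, kernel_size, padding=pad)
        self.act = nn.LeakyReLU(0.01)

    def forward(self, spec):
        # spec: (batch, freq_bins, frames)
        h = self.act(self.encode(spec))        # (batch, hidden, frames)
        seq = h.transpose(1, 2)                # (batch, frames, hidden)
        skip, _ = self.skip_gru(seq)           # (batch, frames, hidden)
        skip = skip.transpose(1, 2)            # back to (batch, hidden, frames)
        # Assumption: the skip path is merged by addition.
        return self.act(self.decode(h + skip))
```

Because the convolution output has only a channel axis and a temporal axis, a single transpose is enough to hand it to the GRU as a sequence of feature vectors; with a 2D convolution output, the frequency axis would first have to be folded into the channels or processed separately.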

The model takes a spectrogram of a song clip as the input. An input is also referred to as a mixture because it contains the sources such as vocals, bass, drums, and other sounds. The input to the model is the mixture spectrogram $X$, and the training target is the source spectrograms, that is, the concatenation of $Y_1, \dots, Y_N$, where $N$ denotes the number of sources. This model can be seen as a denoising auto-encoder because, for one target source, the other sources can be seen as noise in the mixture signal.

In our pilot experiments, we also tried applying a softmax function to the output layer so that the network predicts masks for the different sources and enforces the condition that the predicted source spectrograms sum to the mixture spectrogram. We found that this setting substantially speeds up the training process, but the result becomes much worse. Therefore, we decided to use a leaky ReLU as the nonlinearity of the output layer to estimate the source spectrograms directly.
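The sum constraint of the softmax variant can be illustrated for a single time-frequency bin. The sketch below is only an illustration of that constraint; the function name `softmax_masks` and the numeric values are hypothetical and are not taken from the pilot experiments.

```python
import math


def softmax_masks(scores):
    """Map per-source scores for one time-frequency bin to masks that sum to 1."""
    peak = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


mixture_bin = 2.0                   # hypothetical mixture magnitude in one bin
scores = [1.2, -0.3, 0.5, 0.1]      # hypothetical network outputs for four sources
masks = softmax_masks(scores)
estimates = [m * mixture_bin for m in masks]   # masked source estimates
```

Since the masks sum to one, the masked estimates always add up to the mixture value. The leaky ReLU output used in the final model carries no such constraint, so each source is estimated freely.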

### II-B Enhancement model

The separation model is in charge of the task of music source separation. Small artifacts may be ignored during its training because the loss contributed by the other sources can be much larger than the loss contributed by these artifacts. However, human listeners are very sensitive to such artifacts, especially in vocals.

In order to reduce these small artifacts, we introduce an extra enhancement model as a post-processing module. The enhancement model is another denoising auto-encoder that takes the output of a separation model (i.e. the ARC) as its input, and estimates an enhanced version of the separation result. Each source has its own enhancement model, and the training target is that specific source spectrogram.

The architecture is shown in Fig. 2. It is similar to ARC, but the skip connections are implemented as convolution layers for simplicity. While the enhancement model is being trained, the parameters of the separation model are kept fixed.

### II-C Residual regression

Residual regression is also used to improve the separation result. Unlike the enhancement model, the model with residual regression uses the separation model itself to further improve the separation result.

The process of residual regression is depicted in Fig. 3. The separation model in Fig. 3 is similar to the one introduced in Section II-A. The difference is that the separation model takes another input feature map (the left arrow below the separation model), which is the output from the previous iteration. In iteration $i$, the separation model takes both the total output of iteration $i-1$ and the mixture feature map as the input. For iteration 1, output 0 is set to an all-zero tensor with the same shape as the mixture feature map. The total output of iteration $i$ is the output of the separation model plus the total output of iteration $i-1$. In this way, the separation model only estimates the residual of the target sources. In the training process, the total loss is the average of the losses from all the iterations.
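The iteration described above can be sketched in plain Python. This is an illustrative loop only: `separate` is a hypothetical callable standing in for the separation model, spectrograms are flattened to lists of floats, and the toy model at the bottom is invented so that the loop can be run.

```python
def residual_regression(separate, mixture, iterations=3):
    """Return the total output after each iteration.

    `separate(mixture, previous_total)` is a hypothetical stand-in for the
    separation model; it returns a residual with the same length as `mixture`.
    """
    total = [0.0] * len(mixture)               # output 0 is all zeros
    totals = []
    for _ in range(iterations):
        residual = separate(mixture, total)    # the model predicts only the residual
        total = [t + r for t, r in zip(total, residual)]
        totals.append(total)
    return totals


def average_loss(totals, target):
    """Average the per-iteration mean square error over all iterations."""
    def mse(pred):
        return sum((p - y) ** 2 for p, y in zip(pred, target)) / len(target)
    return sum(mse(t) for t in totals) / len(totals)


# Toy stand-in: the true source is half the mixture, and each call closes
# half of the remaining gap between the previous total and that source.
def toy_model(mix, prev):
    return [0.5 * (0.5 * m - p) for m, p in zip(mix, prev)]


mixture = [2.0, 4.0]
target = [1.0, 2.0]
totals = residual_regression(toy_model, mixture, iterations=3)
loss = average_loss(totals, target)
```

With this toy model the totals move from `[0.5, 1.0]` to `[0.75, 1.5]` to `[0.875, 1.75]`, which mirrors the idea that each pass only has to correct what the previous pass left over.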

## III Evaluation

The evaluation is conducted using the official dataset MUSDB (100 songs for training and 50 songs for testing) and the official packages (https://github.com/sigsep/sigsep-mus-eval and https://github.com/sigsep/sigsep-mus-2018-analysis) from SiSEC 2018 [13]. The models are implemented with PyTorch (https://pytorch.org/). We report the evaluation result in terms of signal-to-distortion ratio (SDR) [14], as it is the most widely used metric in the literature [15, 16, 13].

SiSEC ID | Skip connections | Enhancement | Residual regression | vocals | drums | bass | other | accomp. |
---|---|---|---|---|---|---|---|---|
JY1 | 1 GRU layer | No | No | 5.57 | 4.60 | 3.18 | 3.45 | 11.81 |
JY2 | 1 GRU layer | Yes | No | 5.69 | 4.76 | 3.58 | 3.70 | 11.90 |
JY3 | 1 GRU layer | No | Yes (3 iterations) | 5.74 | 4.66 | 3.67 | 3.40 | 12.08 |

### III-A Training process

The training dataset is MUSDB (https://sigsep.github.io/datasets/musdb.html#tools). It contains 100 songs, each of which has four sources: drums, bass, other, and vocals. We randomly choose 90 songs as the training set and 10 songs as the validation set. The validation set is used for early stopping. Each song is divided into 5-second sub-clips.

The short-time Fourier transform (STFT) is applied to the sub-clips for feature extraction. The native sampling rate of 44,100 Hz is used, with a window size of 2,048 samples and a hop size of 1,024 samples.
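These settings determine the size of the feature map for one 5-second sub-clip, and the arithmetic can be checked directly. The paper does not state the padding convention, so the sketch below shows the frame count both without padding and with the common centered-frame convention; the variable names are illustrative.

```python
SAMPLE_RATE = 44_100    # Hz
WINDOW = 2_048          # samples per analysis window
HOP = 1_024             # samples between successive windows
CLIP_SECONDS = 5

clip_samples = SAMPLE_RATE * CLIP_SECONDS               # 220500 samples
freq_bins = WINDOW // 2 + 1                             # 1025 one-sided frequency bins
frames_unpadded = 1 + (clip_samples - WINDOW) // HOP    # 214 frames, no padding
frames_centered = 1 + clip_samples // HOP               # 216 frames, centered frames
hop_ms = 1000 * HOP / SAMPLE_RATE                       # about 23.2 ms per frame step
```

Under either convention a sub-clip yields a little over 200 temporal points, each described by 1,025 frequency bins.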

Uhlich et al. [15] showed that data augmentation is crucial to compensate for the scarcity of training data in music source separation. We apply online data augmentation to increase the amount of training data as follows. Assume we have $M$ 5-second sub-clips. First, we randomly choose one sub-clip from the $M$ sub-clips for each source. Note that the sub-clip chosen for one source can differ from the sub-clip chosen for another source. The four sub-clips from the four sources are summed, yielding the mixture of one training instance. Then, we use the spectrogram of this mixture as the input and the concatenated spectrograms of the four source sub-clips as the training target.
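The sampling step of this augmentation can be sketched in plain Python. The data layout (a dictionary mapping each source name to a list of equal-length waveforms) and the function name `make_training_instance` are assumptions for illustration; computing the spectrograms of the mixture and the sources would follow as a separate step.

```python
import random

SOURCES = ("drums", "bass", "other", "vocals")


def make_training_instance(subclips, rng=random):
    """Build one augmented (mixture, targets) pair in the waveform domain.

    `subclips[source]` is a list of equal-length waveforms (lists of floats),
    one per 5-second sub-clip. This layout is hypothetical.
    """
    # Pick one sub-clip per source, independently of the other sources.
    chosen = {s: rng.choice(subclips[s]) for s in SOURCES}
    # Sum the four chosen waveforms sample by sample to form the mixture.
    mixture = [sum(samples) for samples in zip(*(chosen[s] for s in SOURCES))]
    return mixture, chosen
```

Because each source is drawn independently, $M$ sub-clips give up to $M^4$ distinct mixtures, most of which combine stems from different songs.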

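We use mean square error (MSE) as the loss function for updating the network. Assume that the mini-batch size is $B$, and there are $N$ sources, $T$ temporal points, and $F$ frequency bins. Then, the loss function is $L = \frac{1}{BNTF} \sum_{b=1}^{B} \sum_{n=1}^{N} \sum_{t=1}^{T} \sum_{f=1}^{F} (\hat{Y}_{b,n,t,f} - Y_{b,n,t,f})^2$, where $\hat{Y}$ is the prediction and $Y$ is the target source spectrogram.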

We use Adam [17] and a mini-batch of 10 instances to train the models. The initial learning rate is set to 0.001 for the convolution layers and to 0.0001 for the GRU layers. We found that a learning rate of 0.001 often led to gradient explosion in the GRU layers, while training was stable with 0.0001 for the GRU layers.
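In PyTorch, layer-specific learning rates of this kind are usually expressed with optimizer parameter groups. The sketch below shows one way to do this; the two-layer `model` is a hypothetical stand-in for the separation network, and the way parameters are partitioned is an assumption rather than the authors' training code.

```python
import torch
from torch import nn

# Hypothetical stand-in with one convolution layer and one GRU layer.
model = nn.ModuleDict({
    "conv": nn.Conv1d(16, 8, kernel_size=3, padding=1),
    "gru": nn.GRU(8, 8, batch_first=True),
})

# Collect the parameters that belong to GRU modules; everything else is
# treated as convolution parameters.
gru_params = [p for m in model.modules() if isinstance(m, nn.GRU) for p in m.parameters()]
gru_ids = {id(p) for p in gru_params}
conv_params = [p for p in model.parameters() if id(p) not in gru_ids]

optimizer = torch.optim.Adam([
    {"params": conv_params, "lr": 1e-3},   # convolution layers
    {"params": gru_params, "lr": 1e-4},    # GRU layers
])
```

Each group keeps its own learning rate while sharing the remaining Adam settings.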

### III-B Testing process

In the testing phase, an entire song is processed at once. Because we adopt an FCN design, our model can deal with songs of arbitrary length. A multi-channel Wiener filter is used for post-processing [14, 15]. We use the phases of the mixture to convert the estimated source spectrograms into waveforms via the inverse STFT. We use the sum of the estimates of the three non-vocal sources (drums, bass, and other) as the estimate of the accompaniment (‘accomp.’).
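The phase-reuse step can be illustrated for a handful of STFT bins. The sketch assumes that the model estimates magnitude spectrograms and leaves out the Wiener filter and the inverse STFT itself; the function name `apply_mixture_phase` is hypothetical.

```python
import cmath


def apply_mixture_phase(estimated_magnitude, mixture_stft):
    """Pair each estimated magnitude with the phase of the matching mixture bin."""
    return [
        mag * cmath.exp(1j * cmath.phase(mix))
        for mag, mix in zip(estimated_magnitude, mixture_stft)
    ]


mixture_stft = [3 + 4j, -2j]        # hypothetical complex mixture bins
estimated_magnitude = [1.0, 0.5]    # hypothetical source magnitudes for those bins
source_stft = apply_mixture_phase(estimated_magnitude, mixture_stft)
```

The resulting complex bins keep the estimated magnitudes but point in the same direction as the mixture bins, which is what allows them to be passed to an inverse STFT.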

### III-C Result

In this subsection, we show the performance of our submissions to SiSEC 2018. The result is shown in TABLE I. In the model with residual regression (JY3), we run three iterations. We can see from this table that JY2 (using the enhancement model) improves over JY1 in all sources, and JY3 (using residual regression) improves over JY1 in all sources except ‘other’.

Fig. 4 displays the SiSEC 2018 results of the models that use supervised approaches without additional training data, showing the best model of each author group. (This figure is generated with a modified version of the code provided by the organizers, https://github.com/sigsep/sigsep-mus-2018-analysis.) We specify “not using additional training data” here because some submissions did use additional training data (not through data augmentation, but by actually including more songs with clean sources for training). According to the official SiSEC 2018 report [13], the result of JY3 in vocals is not statistically significantly different from that of the other two leading models, TAK1 (https://github.com/sigsep/sigsep-mus-2018/blob/master/submissions/TAK1/description.md) [18, 19] and UHL2 (https://github.com/sigsep/sigsep-mus-2018/blob/master/submissions/UHL2/description.md) [15].

### III-D Effect of different skip connections

We compare different skip connections in this subsection. The four compared architectures are shown in Fig. 5, and the result is shown in TABLE II. We can see that the models with skip connections outperform the one without skip connections in vocals, drums, and accompaniment, and that the model with recurrent skip connections outperforms the one with convolution skip connections in vocals, other, and accompaniment. The bass result, however, is slightly better without skip connections.

Skip connections | vocals | drums | bass | other | accomp. |
---|---|---|---|---|---|
None | 4.41 | 4.48 | 3.43 | 2.91 | 10.74 |
Direct (identity) | 5.05 | 4.65 | 3.41 | 3.02 | 11.25 |
1 Convolution layer | 5.03 | 4.78 | 3.37 | 2.80 | 11.39 |
1 GRU layer (JY1) | 5.57 | 4.60 | 3.18 | 3.45 | 11.81 |

### III-E Applying recurrent layers at different locations

The recurrent layers can be applied at different locations of the separation model. We tested several possibilities, and many of them improve over the non-recurrent versions. For example, another possible way of using recurrent layers is shown in Fig. 5(b), and its performance is shown in TABLE III. Among these variants, we found that applying the recurrent layers to the skip connections is the most effective for vocals and drums, although the variant in TABLE III scores higher in bass, other, and accompaniment.

Where to use recurrent layers | vocals | drums | bass | other | accomp. |
---|---|---|---|---|---|
Skip connections (JY1) | 5.57 | 4.60 | 3.18 | 3.45 | 11.81 |
After TConv4 output | 5.36 | 4.38 | 3.53 | 3.66 | 11.91 |

### III-F Batch normalization vs. weight normalization

We have found that the separated audio subjectively sounds less noisy when weight normalization [10] is used in the convolution layers than when batch normalization [11] is used after the convolution layers. However, the objective evaluation with SDR suggests that the two results are very close in vocals, and the one with batch normalization is even better in the other sources, as shown in TABLE IV.

Normalization | vocals | drums | bass | other | accomp. |
---|---|---|---|---|---|
Weight norm (JY1) | 5.57 | 4.60 | 3.18 | 3.45 | 11.81 |
Batch norm | 5.56 | 4.92 | 3.63 | 3.57 | 11.98 |

### III-G Qualitative result

Fig. 7 shows the ground-truth spectrograms and the estimated spectrograms of two example songs from the MUSDB test set. The ground truths and the estimates have similar patterns. We can see clear activations of the fundamental frequencies and their harmonics in the estimated spectrograms. On the other hand, the estimated spectrograms are less sharp and noisier than the ground-truth spectrograms, which indicates room for improvement in future work.

We have also built a website (http://mss.ciaua.com) to demonstrate the result of the proposed model JY3 on songs not in MUSDB.

### III-H Evaluating with DSD100 dataset

We also evaluate the proposed ARC model on the DSD100 dataset, which was used in SiSEC 2016 [16]. We evaluate ARC with batch normalization, as introduced in Section III-F, using the official toolkit (https://github.com/faroit/sisec-mus-results). The enhancement model and residual regression are not used in this evaluation. We use the 50/50 train/test split specified by SiSEC 2016. The result is shown in TABLE V. In vocals, the result of our model is second only to those of the MMDenseNet [18] and MMDenseLSTM [19] models proposed by Takahashi et al. The TAK1 method shown in Fig. 4 is an extended version of these models.

Method | vocals | drums | bass | other | accomp. |
---|---|---|---|---|---|
DeepNMF [20] | 2.75 | 2.11 | 1.88 | 2.64 | 8.90 |
NUG [14] | 4.55 | 3.89 | 2.72 | 3.18 | 10.29 |
MaDTwinNet [21] | 4.57 | - | - | - | - |
BLSTM [15] | 4.86 | 4.00 | 2.89 | 3.24 | 11.26 |
SH-4stack [22] | 5.16 | 4.11 | 1.77 | 2.36 | 12.14 |
BLEND [15] | 5.23 | 4.13 | 2.98 | 3.52 | 11.70 |
MMDenseNet [18] | 6.00 | 5.37 | 3.91 | 3.81 | 12.10 |
MMDenseLSTM [19] | 6.31 | 5.46 | 3.73 | 4.33 | 12.73 |
Ours | 5.91 | 4.11 | 2.54 | 3.53 | 11.31 |

## IV Conclusions

In this paper, we have presented our models for music source separation. We proposed using 1D convolutions in the convolution layers so that recurrent layers can be applied naturally to the convolution outputs. The experiments show that the recurrent skip connections substantially improve the separation result in vocals. Moreover, the proposed enhancement model and residual regression can further improve the separation result.

For future work, we would be interested in applying the source separation models for other applications, such as singing style transfer [23], vocal melody extraction [24, 25], instrument recognition [26], and lyrics transcription [27].

## References

- [1] Z. Rafii, A. Liutkus, F. Stöter, S. I. Mimilakis, D. FitzGerald, and B. Pardo, “An overview of lead and accompaniment separation in music,” CoRR, vol. abs/1804.08300, 2018.
- [2] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in Proc. Medical Image Computing and Computer-Assisted Intervention. Springer International Publishing, 2015, pp. 234–241.
- [3] A. Jansson, E. Humphrey, N. Montecchio, R. Bittner, A. Kumar, and T. Weyde, “Singing voice separation with deep U-net convolutional networks,” in Proc. International Society for Music Information Retrieval Conference, 2017.
- [4] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, “Is object localization for free? - Weakly-supervised learning with convolutional neural networks,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 685–694.
- [5] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
- [6] J.-Y. Liu and Y.-H. Yang, “Event localization in music auto-tagging,” in Proc. ACM International Conference on Multimedia, 2016.
- [7] S.-Y. Chou, J.-S. R. Jang, and Y.-H. Yang, “Learning to recognize transient sound events using attentional supervision,” in Proc. International Joint Conference on Artificial Intelligence, 2018.
- [8] K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio, “On the properties of neural machine translation: Encoder-decoder approaches,” in Proc. Workshop on Syntax, Semantics and Structure in Statistical Translation, 2014.
- [9] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014.
- [10] T. Salimans and D. P. Kingma, “Weight normalization: A simple reparameterization to accelerate training of deep neural networks,” arXiv preprint arXiv:1602.07868, 2016.
- [11] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proc. International Conference on Machine Learning, 2015, pp. 448–456.
- [12] A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve neural network acoustic models,” in Proc. International Conference on Machine Learning, vol. 30, no. 1, 2013, p. 3.
- [13] F.-R. Stöter, A. Liutkus, and N. Ito, “The 2018 signal separation evaluation campaign,” in Proc. International Conference on Latent Variable Analysis and Signal Separation, 2018.
- [14] A. A. Nugraha, A. Liutkus, and E. Vincent, “Multichannel music separation with deep neural networks,” in Proc. European Signal Processing Conference, 2016, pp. 1748–1752.
- [15] S. Uhlich, M. Porcu, F. Giron, M. Enenkl, T. Kemp, N. Takahashi, and Y. Mitsufuji, “Improving music source separation based on deep neural networks through data augmentation and network blending,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, 2017, pp. 261–265.
- [16] A. Liutkus, F. R. Stöter, Z. Rafii, D. Kitamura, B. Rivet, N. Ito, N. Ono, and J. Fontecave, “The 2016 signal separation evaluation campaign,” in Proc. LVA/ICA, 2017, pp. 323–332.
- [17] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
- [18] N. Takahashi and Y. Mitsufuji, “Multi-scale multi-band densenets for audio source separation,” in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2017, pp. 21–25.
- [19] N. Takahashi, N. Goswami, and Y. Mitsufuji, “MMDenseLSTM: An efficient combination of convolutional and recurrent neural networks for audio source separation,” CoRR, vol. abs/1805.02410, 2018.
- [20] J. L. Roux, J. R. Hershey, and F. Weninger, “Deep NMF for speech separation,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, 2015.
- [21] K. Drossos, S. I. Mimilakis, D. Serdyuk, G. Schuller, T. Virtanen, and Y. Bengio, “MaD TwinNet: Masker-denoiser architecture with twin networks for monaural sound source separation,” CoRR, vol. abs/1802.00300, 2018.
- [22] S. Park, T. Kim, K. Lee, and N. Kwak, “Music source separation using stacked hourglass networks,” in Proc. International Society for Music Information Retrieval Conference, 2018.
- [23] C.-W. Wu, J.-Y. Liu, Y.-H. Yang, and J.-S. R. Jang, “Singing style transfer using cycle-consistent boundary equilibrium generative adversarial networks,” in Proc. Joint Workshop on Machine Learning for Music, 2018.
- [24] R. M. Bittner, B. McFee, J. Salamon, P. Li, and J. P. Bello, “Deep salience representations for F0 estimation in polyphonic music,” in Proc. International Society for Music Information Retrieval Conference, 2017, pp. 63–70.
- [25] L. Su, “Vocal melody extraction using patch-based CNN,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, 2018.
- [26] Y.-N. Hung and Y.-H. Yang, “Frame-level instrument recognition by timbre and pitch,” in Proc. International Society for Music Information Retrieval Conference, 2018.
- [27] C.-P. Tsai, Y.-L. Tuan, and L.-S. Lee, “Transcribing lyrics from commercial song audio: The first step towards singing content processing,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, 2018.