Recursive Visual Sound Separation Using Minus-Plus Net
Sounds provide rich semantics, complementary to visual data, for many tasks. However, in practice, sounds from multiple sources are often mixed together. In this paper we propose a novel framework, referred to as MinusPlus Network (MP-Net), for the task of visual sound separation. MP-Net separates sounds recursively in descending order of average energy (in this paper, the average energy of a sound refers to the average energy of its spectrogram), removing each separated sound from the mixture at the end of each prediction, until the mixture becomes empty or contains only noise. In this way, MP-Net can be applied to sound mixtures with arbitrary numbers and types of sounds. Moreover, as MP-Net keeps removing sounds with large energy from the mixture, sounds with small energy emerge and become clearer, so that their separation is more accurate. Compared to previous methods, MP-Net obtains state-of-the-art results on two large-scale datasets, across mixtures with different types and numbers of sounds.
Besides visual cues, the sound that accompanies what we see often provides complementary information, which can be used in object detection [12, 13, 17, 18] to clarify ambiguous visual cues, and in description generation [6, 7, 26, 5] to enrich semantics. On the other hand, since what we hear is in most cases a mixture of sounds from different sources, it is necessary to separate the sounds and associate them with sources in the visual scene before utilizing the sound data.
The difficulties of visual sound separation lie in several aspects. 1) First, potential sound sources appearing in a video may not actually make any sound, which causes ambiguity. 2) Second, mixtures usually vary widely in the number and types of sounds they contain. 3) More importantly, sounds in a mixture often affect each other in multiple ways. For example, sounds with large energy often dominate the mixture, making other sounds less distinguishable, or in some cases even noise-like.
Existing works [10, 28] on visual sound separation mainly separate each sound independently. They assume either fixed types or a fixed number of sounds, and these strong assumptions limit their applicability in general scenarios. Moreover, separating sounds independently can lead to inconsistency between the actual mixture and the mixture of separated sounds, \eg some content in the actual mixture may not appear in any separated sound. Finally, in such independent processes, the separation of sounds with small energy may be affected by sounds with large energy.
Facing these challenges, we propose a novel solution, referred to as MinusPlus Network (MP-Net), which identifies each sound in the mixture recursively, in descending order of average energy. It can be divided into two stages, namely a minus stage and a plus stage. At each step of the minus stage, MP-Net identifies the most salient sound in the current mixture, then removes it therefrom. This process repeats until the current mixture becomes empty or contains only noise. Because preceding separations are removed from the mixture, a component shared by multiple sounds can be assigned to only one of them. To compensate for such cases, MP-Net refines each sound in the plus stage, which computes a residual based on the sound itself and the mixture of the sounds separated before it. The final sound is obtained by mixing the outputs of both stages.
MP-Net efficiently overcomes the challenges of visual sound separation. By recursively separating sounds, it adaptively decides the number of sounds in the mixture, without knowing a priori the number and the types of sounds. Moreover, in MP-Net, sounds with large energy will be removed from the mixture after they are separated. In this way, sounds with relatively smaller energy naturally emerge and become clearer, diminishing the effect of imbalanced sound energy.
Overall, our contributions can be briefly summarized as follows: (1) We propose a novel framework, referred to as MinusPlus Network (MP-Net), to separate independent sounds from the recorded mixture based on a corresponding video. Unlike previous works which assume a fixed number of sounds in the mixture, the proposed framework could dynamically determine the number of sounds, leading to better generalization ability. (2) MP-Net utilizes a novel way to alleviate the issue of imbalanced energy of sounds in the mixture, by subtracting salient sounds from the mixture after they are separated, so that sounds with less energy could emerge. (3) On two large scale datasets, MP-Net obtains more accurate results, and generalizes better compared to the state-of-the-art method.
2 Related Work
Works connecting visual and audio data can be roughly divided into several categories.
The first category jointly embeds audio-visual data. Aytar \etal [4] transfer discriminative knowledge from visual content to audio data by minimizing the KL divergence between their representations. Arandjelovic \etal [2] associate representations of visual and audio data by learning their correspondence (\ie whether they belong to the same video), and the authors of [21, 20, 16] further extend such correspondence to temporal alignment, resulting in better representations. Different from these works, visual sound separation requires separating each independent sound from the mixture, relying on the corresponding video.
The task of sound localization, which identifies the region that generates a sound, also requires jointly processing visual and audio data. To solve this task, Hershey \etal [15] locate sound sources in video frames by measuring audio-visual synchrony. Both Tian \etal [24] and Parascandolo \etal apply sound event detection to find sound sources. Finally, Senocak \etal [23] and Arandjelovic \etal [3] find sound sources by analyzing the activations of feature maps. Although visual sound separation can also locate separated sounds in the corresponding video, it requires separating the sounds first, making it more challenging.
Visual sound separation belongs to the third category. A special case is visual speech separation, where all sounds in the mixture are human speech. For example, Afouras \etal [1] and Ephrat \etal [8] obtain speaker-independent models by leveraging large amounts of news and TV videos, and Xu \etal [27] propose an auditory selection framework that uses attention and memory to capture speech characteristics. Unlike these works, we target the general task of separating sounds of different types, which have more diverse characteristics.
The most related works are [10] and [28]. In [10], a convolutional network is used to predict the types of objects appearing in the video, and Non-negative Matrix Factorization [9] is used to extract a set of basic components. The association between each object and each basic component is estimated via a Multi-Instance Multi-Label objective, and sounds are then separated using the associations between basic components and each predicted object. [28] follows a similar framework, replacing Non-negative Matrix Factorization with a U-Net [22]. In addition, instead of predicting object-component associations, it directly predicts weights conditioned on visual semantics. While the former predicts the existence of different objects in the video, assuming fixed types of sounds, the latter assumes a fixed number of sounds. Such strong assumptions limit their generalization ability, as mixtures of sounds vary widely in sound types and numbers. More importantly, each prediction in [10] and [28] is made independently. As a result, 1) there may be an inconsistency between the mixture of all predicted sounds and the actual mixture, \eg some content in the actual mixture may not appear in any predicted sound, or some content may appear in too many predicted sounds, exceeding its presence in the actual mixture; 2) as sounds in the mixture have different average energy, sounds with large energy may degrade the prediction accuracy for sounds with less energy. Different from them, our proposed method recursively predicts each sound in the mixture, following the order of average energy, and removes each predicted sound from the mixture after its prediction. In this way, our method requires no assumptions on the types and number of sounds, and ensures predictions consistent with the input mixture.
Moreover, as sounds with large energy are successively removed from the mixture, sounds with less energy can emerge and become clearer, resulting in more accurate predictions.
3 Visual Sound Separation
In the task of visual sound separation, we are given a context video $V$ and a recorded sound mixture $S_{mix}$, which mixes a set of independent sounds $\{s_1, \dots, s_N\}$. The objective is to separate each sound $s_i$ from the mixture based on the visual context in $V$.
We propose a new framework for visual sound separation, referred to as MinusPlus Network (MP-Net), which learns to separate each independent sound from the recorded mixture, without knowing a priori the number of sounds in the mixture (\ie $N$). In addition, MP-Net can also associate each independent sound with a plausible source in the corresponding visual content, providing a way to link data in two different modalities.
In MP-Net, sound data is represented as spectrograms. The overall structure of MP-Net is shown in Figure 1: it has two stages, namely the minus stage and the plus stage.
Minus Stage. In the minus stage, MP-Net recursively separates each independent sound from the mixture $S_{mix}$, focusing at every recursive step on the most salient sound among those remaining. With $S^{(0)}_{mix} = S_{mix}$, the process can be described as
$$\hat{s}_i = \text{M-Net}\big(S^{(i-1)}_{mix}, V\big), \qquad S^{(i)}_{mix} = S^{(i-1)}_{mix} \ominus \hat{s}_i,$$
where $\hat{s}_i$ is the $i$-th predicted sound, M-Net stands for the sub-net used in the minus stage, and $\ominus$ is element-wise subtraction on spectrograms. MP-Net keeps removing $\hat{s}_i$ from the previous mixture $S^{(i-1)}_{mix}$ until the current mixture $S^{(i)}_{mix}$ is empty or contains only noise with significantly low energy.
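The minus-stage recursion can be sketched in a few lines of numpy. Here `m_net` is a trivial stand-in for the learned sub-net (it simply peels off a fixed fraction of the mixture so the loop terminates), and the energy threshold is illustrative, not the value used in the paper:

```python
import numpy as np

def m_net(mixture, video_feat):
    # Stand-in for the learned M-Net (a U-Net conditioned on visual
    # features); here it just predicts a fraction of the mixture.
    return 0.6 * mixture

def average_energy(spec):
    # Average energy of a spectrogram, used as the stopping criterion.
    return float(np.mean(spec ** 2))

def minus_stage(mixture, video_feat, threshold=1e-3, max_steps=10):
    """Recursively separate sounds, most salient first, subtracting each
    prediction from the remaining mixture (element-wise on spectrograms)."""
    remaining = mixture.copy()
    predictions = []
    for _ in range(max_steps):
        if average_energy(remaining) < threshold:  # only noise is left
            break
        pred = m_net(remaining, video_feat)
        predictions.append(pred)
        remaining = np.clip(remaining - pred, 0.0, None)  # the minus step
    return predictions, remaining
```

Because the loop stops on an energy threshold rather than a preset count, the number of separated sounds is decided adaptively, matching the description above.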
Plus Stage. While the minus stage removes preceding predictions from the mixture by subtraction, a prediction $\hat{s}_i$ may consequently miss content that is shared between it and the preceding predictions $\hat{s}_1, \dots, \hat{s}_{i-1}$. MP-Net therefore contains a plus stage, which further refines each separated sound following
$$r_i = \text{P-Net}\Big(\hat{s}_i, \sum_{j<i} \hat{s}_j\Big), \qquad \tilde{s}_i = \hat{s}_i \oplus r_i,$$
where P-Net stands for the sub-network used in the plus stage, and $\oplus$ is element-wise addition on spectrograms. MP-Net computes a residual $r_i$ for the $i$-th prediction based on $\hat{s}_i$ and the mixture of all preceding predictions, and finally refines the $i$-th prediction by mixing $\hat{s}_i$ and $r_i$.
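The plus-stage refinement can be sketched similarly. Here `p_net` is again a trivial stand-in for the learned sub-network (it recovers a small fraction of the preceding re-mixture as the residual):

```python
import numpy as np

def p_net(pred, preceding_mix):
    # Stand-in for the learned P-Net (a U-Net in the paper); here it
    # recovers a small residual from the mixture of preceding predictions.
    return 0.1 * preceding_mix

def plus_stage(predictions):
    """Refine each separated sound by adding a residual computed from the
    sound itself and the re-mixture of all preceding predictions."""
    refined = []
    preceding = None  # running mixture of preceding predictions
    for pred in predictions:
        if preceding is None:
            refined.append(pred)  # the first sound has no predecessors
        else:
            residual = p_net(pred, preceding)
            refined.append(pred + residual)  # the plus step
        preceding = pred if preceding is None else preceding + pred
    return refined
```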
The benefits of using two stages lie in several aspects. 1) The minus stage can effectively determine the number of independent sounds in the mixture, without knowing it a priori. 2) Removing preceding predictions from the mixture diminishes their disruption of the remaining sounds, so that the remaining sounds keep emerging as the recursion proceeds. 3) Removing preceding predictions from the mixture also helps M-Net focus on the distinct characteristics of the remaining sounds, enhancing the accuracy of its predictions. 4) The plus stage compensates for the loss of information shared between each prediction and its preceding predictions, smoothing the final prediction of each sound. Subsequently, we introduce the two sub-nets, namely M-Net and P-Net.
M-Net is the sub-net responsible for separating each independent sound from the mixture, following a recursive procedure. Specifically, to separate the most salient sound at the $i$-th recursive step, M-Net predicts $k$ sub-spectrograms $\{S^{(i)}_1, \dots, S^{(i)}_k\}$ from the current mixture $S^{(i-1)}_{mix}$ using a U-Net [22], which capture different patterns in $S^{(i-1)}_{mix}$. At the same time, we obtain a feature map $V_f$ with $k$ channels from the input video $V$, which estimates the association score between each sub-spectrogram and the visual content at each spatial location. With the sub-spectrograms and $V_f$, we can then identify the visual content associated with the most salient sound:
$$(x_i, y_i) = \operatorname*{arg\,max}_{(x, y)} \; E\Big(\sigma\Big(\sum_{j=1}^{k} V_f(x, y, j)\, S^{(i)}_j\Big) \odot S^{(i-1)}_{mix}\Big),$$
where the inner sigmoid term computes a location-specific mask, $E(\cdot)$ computes the average energy of a spectrogram, $\sigma$ stands for the sigmoid function, and $\odot$ is element-wise multiplication. We regard $(x_i, y_i)$ as the source location of the $i$-th sound, and the feature vector $v_i = V_f(x_i, y_i)$ at that location as its visual feature. To separate the sound, we reuse $v_i$ as attention weights on the sub-spectrograms and get the actual mask $M_i$ by
$$M_i = \sigma\Big(\sum_{j=1}^{k} v_{i,j}\, S^{(i)}_j\Big).$$
Following [28], we refer to $M_i$ as a ratio mask, and an alternative choice is to further binarize $M_i$ to get a binary mask. Finally, $\hat{s}_i$ is obtained by $\hat{s}_i = M_i \odot S^{(i-1)}_{mix}$. It is worth noting that we could also directly predict $\hat{s}_i$ without a mask; however, it is reported that an intermediate mask leads to better results [28]. At the end of the $i$-th recursive step, MP-Net removes the predicted $\hat{s}_i$ from the previous mixture via the subtraction $S^{(i)}_{mix} = S^{(i-1)}_{mix} \ominus \hat{s}_i$, so that less salient sounds can emerge in later recursive steps. When the average energy of $S^{(i)}_{mix}$ is less than a threshold $\theta$, M-Net stops the recursive process, assuming all sounds have been separated.
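The masking step can be sketched as follows. The sub-spectrograms and attention weights here are arbitrary placeholders for what the network would predict, and the 0.5 binarization cutoff is an assumption:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def separate_with_mask(sub_specs, attention, mixture, binary=False):
    """Combine k sub-spectrograms with attention weights taken from the
    selected source location into a mask, then apply it to the mixture.

    sub_specs: (k, F, T) sub-spectrograms predicted by the U-Net
    attention: (k,) weights, i.e. the feature vector at the source location
    mixture:   (F, T) current mixture spectrogram
    """
    combined = np.tensordot(attention, sub_specs, axes=1)  # (F, T)
    mask = sigmoid(combined)  # ratio mask, values in (0, 1)
    if binary:
        # Alternative: binarize the ratio mask (0.5 cutoff is an assumption)
        mask = (mask > 0.5).astype(mixture.dtype)
    return mask * mixture  # the separated sound
```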
While M-Net makes succeeding predictions more accurate by removing preceding predictions from the mixture, succeeding predictions may consequently miss content shared with preceding predictions, leading to incomplete spectrograms. To overcome this issue, MP-Net further applies a P-Net to refine the sounds separated by the M-Net. Specifically, for the $i$-th prediction $\hat{s}_i$, P-Net applies a U-Net [22] to get a residual mask, based on two inputs, namely $\hat{s}_i$ and the re-mixture of preceding predictions $\sum_{j<i} \hat{s}_j$, as the missing content of $\hat{s}_i$ can only appear in them. Applying this mask to the re-mixture recovers a residual $r_i$, and the final spectrogram for the $i$-th sound is obtained by
$$\tilde{s}_i = \hat{s}_i \oplus r_i,$$
where $\oplus$ is element-wise addition on spectrograms.
3.4 Mutual Distortion Measurement
To evaluate models for visual sound separation, previous approaches [10, 28] utilize Normalized Signal-to-Distortion Ratio (NSDR), Signal-to-Interference Ratio (SIR), and Signal-to-Artifact Ratio (SAR). While these traditional metrics reflect separation performance to some extent, they are sensitive to frequencies, so that the scores of separated sounds are affected not only by their similarities with the ground-truths, but also by the locations of the mismatches. Consequently, as shown in Figure 2, scores in terms of NSDR, SIR and SAR vary significantly when the mismatch appears at different locations. To compensate for such cases, we propose to measure the quality of visual sound separation under the criterion that two pairs of spectrograms should obtain approximately the same score if they have the same level of similarity. This metric, referred to as Average Mutual Information Distortion (AMID), computes the average similarity between a separated sound and the ground-truth of another sound, where the similarity is estimated via the Structural Similarity (SSIM) [25] over spectrograms. Specifically, for a set of separated sounds $\{\tilde{s}_1, \dots, \tilde{s}_N\}$ and its corresponding annotations $\{s_1, \dots, s_N\}$, AMID is computed as
$$\mathrm{AMID} = \frac{1}{N(N-1)} \sum_{i=1}^{N} \sum_{j \neq i} \mathrm{SSIM}(\tilde{s}_i, s_j).$$
As AMID relies on SSIM over spectrograms, it is insensitive to frequencies. Moreover, a low AMID score indicates the model can distinctly separate sounds in a mixture, which meets the evaluation requirements of visual sound separation.
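The metric can be sketched directly from its definition. The `ssim` below is a simplified single-window variant rather than the full windowed SSIM of [25], so its values are only indicative:

```python
import numpy as np

def ssim(x, y, c1=1e-4, c2=9e-4):
    # Global (single-window) SSIM over two spectrograms: a simplified
    # stand-in for the windowed SSIM of Wang et al.
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return (((2 * mx * my + c1) * (2 * cov + c2))
            / ((mx**2 + my**2 + c1) * (vx + vy + c2)))

def amid(separated, ground_truth):
    """Average similarity between each separated sound and the ground
    truths of the *other* sounds; lower means cleaner separation."""
    n = len(separated)
    total = sum(ssim(separated[i], ground_truth[j])
                for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))
```

A model that cleanly separates two distinct sounds yields a lower AMID than one that assigns similar content to both outputs, which matches the intended use of the metric.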
MUSIC mainly contains untrimmed videos of people playing instruments from 11 categories, namely accordion, acoustic guitar, cello, clarinet, erhu, flute, saxophone, trumpet, tuba, violin and xylophone, and is divided into train, validation and test sets. As the official test set of MUSIC contains only duets without ground-truths for the sounds in the mixtures, we use its validation set as the test set, and the train set for training and validation. While MUSIC focuses on instrumental sounds, VEGAS, another dataset with a larger scale, covers 10 types of natural sounds, including baby crying, chainsaw, dog, drum, fireworks, helicopter, printer, rail transport, snoring and water flowing, trimmed from AudioSet [11]. A held-out subset of VEGAS samples is used for testing, with the remaining samples used for training and validation.
4.2 Training and Testing Details
Due to the lack of ground-truth for real mixed data, \ie videos that contain multiple sounds, we construct such data from solo video clips instead, where each clip contains at most one sound. We denote the collection of solo video clips by $\{(s_i, v_i)\}$, where $s_i$ and $v_i$ respectively represent the sound and the visual content. Note that a video clip can be silent; in that case $s_i$ is an empty spectrogram. For each clip, we sample frames at even intervals and extract visual features for each frame using ResNet-18 [14], resulting in a feature tensor. In both training and testing, this feature tensor is reduced into a vector representing the visual content by performing max pooling along the first three dimensions. On top of this solo video collection, we then follow the Mix-and-Separate strategy of [28, 10] to construct mixed video/sound data, where each sample mixes $n$ videos, called a mix-$n$ sample.
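The Mix-and-Separate construction can be sketched as follows, treating solo clips as spectrograms and mixing them linearly (a common approximation for magnitude spectrograms; the fixed random seed is only for reproducibility of the sketch):

```python
import numpy as np

def mix_and_separate_sample(solo_specs, n, seed=0):
    """Build a mix-n training sample: pick n solo spectrograms, mix them,
    and keep the originals as the separation targets."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(solo_specs), size=n, replace=False)
    targets = [solo_specs[i] for i in idx]
    mixture = np.sum(targets, axis=0)  # linear mixture as approximation
    return mixture, targets
```

By construction, the mixture is exactly the sum of its targets, which provides the ground truth that real multi-source videos lack.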
Audio is preprocessed before training and testing. Specifically, we resample the audio, and use the open-source package librosa [19] to transform sound clips of around 6 seconds into STFT spectrograms, with a fixed window size and hop length. We then down-sample the spectrograms on a mel scale.
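The spectrogram computation can be sketched in plain numpy (the actual pipeline uses librosa; `n_fft=1024` and `hop=256` below are illustrative values, not necessarily those used in the paper):

```python
import numpy as np

def stft_magnitude(wave, n_fft=1024, hop=256):
    """Hann-windowed STFT magnitude spectrogram, a numpy sketch of what
    librosa.stft computes (illustrative n_fft and hop values)."""
    window = np.hanning(n_fft)
    frames = [wave[s:s + n_fft] * window
              for s in range(0, len(wave) - n_fft + 1, hop)]
    # rfft keeps the n_fft // 2 + 1 non-redundant frequency bins
    spec = np.abs(np.fft.rfft(np.stack(frames), axis=1)).T  # (freq, time)
    return spec
```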
We adopt a three-round training strategy: in the first round, we train the M-Net in isolation; in the second round, we train the P-Net while fixing the parameters of the M-Net; finally, in the third round, the M-Net and the P-Net are jointly finetuned.
During training, for each mix-$n$ sample, we first perform data augmentation by randomly scaling the energy of the spectrograms. Then, MP-Net makes predictions in descending order of the average energy of the ground-truth sounds. In particular, for the $i$-th prediction, MP-Net predicts the mask and the spectrogram of the sound with the $i$-th largest average energy, and computes the BCE loss between the predicted mask and the ground-truth mask if binary masks are used, or the $L_1$ loss if ratio masks are used. After all predictions are done, we add an extra loss between the remaining mixture and an empty spectrogram: ideally, if all predictions are precise, there should be no sound left.
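The training objective described above can be sketched as follows. Any loss weighting between the mask terms and the leftover penalty is omitted here, as it is an assumption:

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    # Binary cross-entropy between a predicted and a ground-truth mask.
    p = np.clip(pred, eps, 1 - eps)
    return float(np.mean(-(target * np.log(p) + (1 - target) * np.log(1 - p))))

def l1(pred, target):
    return float(np.mean(np.abs(pred - target)))

def mp_net_loss(pred_masks, gt_masks, leftover, binary=True):
    """Per-sound mask loss (BCE for binary masks, L1 for ratio masks) plus
    a penalty pushing the leftover mixture toward an empty spectrogram."""
    mask_fn = bce if binary else l1
    mask_loss = sum(mask_fn(p, g) for p, g in zip(pred_masks, gt_masks))
    leftover_loss = l1(leftover, np.zeros_like(leftover))
    return mask_loss + leftover_loss
```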
During evaluation, we determine the prediction order by the energy-based criterion of the minus stage. Since all baselines need to know the number of sounds in the mixture, for a fair comparison we also provide the number of sounds to MP-Net. It is, however, noteworthy that MP-Net can work without this information, relying only on the termination criterion to determine the number. On MUSIC, MP-Net predicts the correct number of sounds with high accuracy.
4.3 Experimental Results
Results on Effectiveness
To study the effectiveness of our model, we compare it to state-of-the-art methods, namely PixelPlayer [28] and MIML [10], across both datasets and multiple settings, providing a comprehensive comparison. Specifically, on both MUSIC and VEGAS, we train and evaluate all methods twice, using mix-$n$ samples with two different numbers of sounds in the mixture. For PixelPlayer and MP-Net, we further alter the form of masks to switch between ratio masks and binary masks. The results in terms of NSDR, SIR, SAR and AMID are listed in Table 1 for VEGAS and Table 2 for MUSIC. We observe that 1) our proposed MP-Net obtains the best results in most settings, outperforming PixelPlayer and MIML by large margins, which indicates the effectiveness of separating sounds in the order of average energy. 2) Using ratio masks is better in terms of NSDR and SAR, while using binary masks is better in terms of SIR and AMID. 3) Our proposed metric AMID correlates well with the other metrics, which intuitively verifies its effectiveness. 4) Scores of all methods drop sharply when only one more sound is added to the mixture; such differences in scores show the challenge of visual sound separation. 5) In general, methods obtain higher scores on MUSIC, suggesting that natural sounds are more complicated than instrumental sounds, as instrumental sounds often contain regular patterns.
Results on Ablation Study
As the proposed MP-Net contains two sub-nets, we compare MP-Net with and without the P-Net. As shown in Table 1 and Table 2, on all metrics, MP-Net with the P-Net outperforms MP-Net without it by large margins, indicating that 1) different sounds have shared patterns, and a good model needs to take this into consideration so that the mixture of separated sounds is consistent with the actual mixture; 2) the P-Net can effectively compensate for the loss of shared patterns caused by sound removal, filling in the blanks in the spectrograms.
Results on Robustness
A benefit of recursively separating sounds from the mixture is that MP-Net is robust when the number of sounds in the mixture varies, even though it is trained with a fixed number of sounds. To verify the generalization ability of MP-Net, we test all methods, each trained with a fixed number of sounds per mixture, on samples with an increasing number of sounds in the mixture. The resulting curves on MUSIC are shown in Figure 4, and Figure 5 includes the curves on VEGAS. In both figures, MP-Net trained with a fixed number of sounds in the mixture steadily outperforms the baselines as the number of sounds in the mixture increases.
In Figure 3, we show qualitative samples of sounds separated by MP-Net and PixelPlayer respectively, in the form of spectrograms. In the sample with a mixture of instrumental sounds, PixelPlayer fails to separate the sounds of the violin and the guitar, as they are overwhelmed by the sound of the accordion. In contrast, MP-Net successfully separates the sounds of the violin and the guitar, alleviating the effect of the accordion's sound. Unlike PixelPlayer, which separates sounds independently, MP-Net recursively separates the dominant sound in the current mixture and removes it from the mixture, leading to accurate separation results. A similar phenomenon can be observed in the sample with a mixture of natural sounds, where PixelPlayer predicts the same sound for rail_transport and water_flowing, and fails to separate the sound of a dog.
We propose MinusPlus Network (MP-Net), a novel framework for visual sound separation. Unlike previous methods that separate each sound independently, MP-Net jointly considers all sounds: sounds with larger energy are separated first and then removed from the mixture, so that sounds with smaller energy keep emerging. In this way, once trained, MP-Net can deal with mixtures made of an arbitrary number of sounds. On two datasets, MP-Net consistently outperforms state-of-the-art methods, and maintains steady performance as the number of sounds in the mixture increases. Besides, MP-Net can also associate separated sounds with plausible sound sources in the corresponding video, potentially linking data from the two modalities.
This work is partially supported by the Collaborative Research Grant from SenseTime (CUHK Agreement No.TS1610626 & No.TS1712093), and the General Research Fund (GRF) of Hong Kong (No.14236516 & No.14203518).
-  Triantafyllos Afouras, Joon Son Chung, and Andrew Zisserman. The conversation: Deep audio-visual speech enhancement. Proc. Interspeech 2018, pages 3244–3248, 2018.
-  Relja Arandjelovic and Andrew Zisserman. Look, listen and learn. In Proceedings of the IEEE International Conference on Computer Vision, pages 609–617, 2017.
-  Relja Arandjelovic and Andrew Zisserman. Objects that sound. In Proceedings of the European Conference on Computer Vision (ECCV), pages 435–451, 2018.
-  Yusuf Aytar, Carl Vondrick, and Antonio Torralba. Soundnet: Learning sound representations from unlabeled video. In Advances in Neural Information Processing Systems, 2016.
-  Bo Dai, Sanja Fidler, and Dahua Lin. A neural compositional paradigm for image captioning. In Advances in Neural Information Processing Systems, pages 658–668, 2018.
-  Bo Dai, Sanja Fidler, Raquel Urtasun, and Dahua Lin. Towards diverse and natural image descriptions via a conditional gan. In Proceedings of the IEEE International Conference on Computer Vision, pages 2970–2979, 2017.
-  Bo Dai, Deming Ye, and Dahua Lin. Rethinking the form of latent states in image captioning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 282–298, 2018.
-  Ariel Ephrat, Inbar Mosseri, Oran Lang, Tali Dekel, Kevin Wilson, Avinatan Hassidim, William T Freeman, and Michael Rubinstein. Looking to listen at the cocktail party: a speaker-independent audio-visual model for speech separation. ACM Transactions on Graphics (TOG), 37(4):112, 2018.
-  Cédric Févotte, Nancy Bertin, and Jean-Louis Durrieu. Nonnegative matrix factorization with the itakura-saito divergence: With application to music analysis. Neural computation, 21(3):793–830, 2009.
-  Ruohan Gao, Rogerio Feris, and Kristen Grauman. Learning to separate object sounds by watching unlabeled video. In Proceedings of the European Conference on Computer Vision (ECCV), pages 35–53, 2018.
-  Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R Channing Moore, Manoj Plakal, and Marvin Ritter. Audio set: An ontology and human-labeled dataset for audio events. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 776–780. IEEE, 2017.
-  Ross Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.
-  Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
-  John R Hershey and Javier R Movellan. Audio vision: Using audio-visual synchrony to locate sounds. In Advances in neural information processing systems, pages 813–819, 2000.
-  Bruno Korbar, Du Tran, and Lorenzo Torresani. Cooperative learning of audio and video models from self-supervised synchronization. In Advances in Neural Information Processing Systems, pages 7774–7785, 2018.
-  Hongyang Li, Bo Dai, Shaoshuai Shi, Wanli Ouyang, and Xiaogang Wang. Feature intertwiner for object detection. arXiv preprint arXiv:1903.11851, 2019.
-  Hongyang Li, Xiaoyang Guo, Bo Dai, Wanli Ouyang, and Xiaogang Wang. Neural network encapsulation. In European Conference on Computer Vision, pages 266–282. Springer, 2018.
-  Brian McFee, Colin Raffel, Dawen Liang, Daniel PW Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. librosa: Audio and music signal analysis in python. In Proceedings of the 14th python in science conference, pages 18–25, 2015.
-  Andrew Owens and Alexei A Efros. Audio-visual scene analysis with self-supervised multisensory features. European Conference on Computer Vision (ECCV), 2018.
-  Andrew Owens, Jiajun Wu, Josh H McDermott, William T Freeman, and Antonio Torralba. Ambient sound provides supervision for visual learning. In European Conference on Computer Vision (ECCV), 2016.
-  Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
-  Arda Senocak, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, and In So Kweon. Learning to localize sound source in visual scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4358–4366, 2018.
-  Yapeng Tian, Jing Shi, Bochen Li, Zhiyao Duan, and Chenliang Xu. Audio-visual event localization in unconstrained videos. In Proceedings of the European Conference on Computer Vision (ECCV), pages 247–263, 2018.
-  Zhou Wang, Alan C Bovik, Hamid R Sheikh, Eero P Simoncelli, et al. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
-  Yilei Xiong, Bo Dai, and Dahua Lin. Move forward and tell: A progressive generator of video descriptions. In Proceedings of the European Conference on Computer Vision (ECCV), pages 468–483, 2018.
-  Jiaming Xu, Jing Shi, Guangcan Liu, Xiuyi Chen, and Bo Xu. Modeling attention and memory for auditory selection in a cocktail party environment. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
-  Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh McDermott, and Antonio Torralba. The sound of pixels. In Proceedings of the European Conference on Computer Vision (ECCV), pages 570–586, 2018.
-  Yipin Zhou, Zhaowen Wang, Chen Fang, Trung Bui, and Tamara L Berg. Visual to sound: Generating natural sound for videos in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3550–3558, 2018.