Fast Multichannel Source Separation Based on Jointly Diagonalizable Spatial Covariance Matrices
Abstract
This paper describes a versatile method that accelerates multichannel source separation methods based on full-rank spatial modeling. A popular approach to multichannel source separation is to integrate a spatial model with a source model for estimating the spatial covariance matrices (SCMs) and power spectral densities (PSDs) of each sound source in the time-frequency domain. One of the most successful examples of this approach is multichannel nonnegative matrix factorization (MNMF) based on a full-rank spatial model and a low-rank source model. MNMF, however, is computationally expensive and often works poorly due to the difficulty of estimating the unconstrained full-rank SCMs. Instead of restricting the SCMs to rank-1 matrices with a severe loss of spatial modeling ability, as in independent low-rank matrix analysis (ILRMA), we restrict the SCMs of each frequency bin to jointly-diagonalizable but still full-rank matrices. For such a fast version of MNMF, we propose a computationally-efficient and convergence-guaranteed algorithm that is similar in form to that of ILRMA. Similarly, we propose a fast version of a state-of-the-art speech enhancement method based on a deep speech model and a low-rank noise model. Experimental results showed that the fast versions of MNMF and the deep speech enhancement method were several times faster and performed even better than their original versions.
I. Introduction
Multichannel source separation plays a central role in computational auditory scene analysis. To make effective use of an automatic speech recognition system in a noisy environment, for example, it is indispensable to separate speech signals from noise-contaminated signals. A standard approach to multichannel source separation is to use a non-blind method (e.g., beamforming and Wiener filtering) based on the spatial covariance matrix (SCM) of a target source (e.g., speech) and those of the other sources (e.g., noise). To use beamforming for speech enhancement, deep neural networks (DNNs) are often used for classifying each time-frequency bin into speech or noise [1, 2, 3]. The performance of such a supervised approach, however, is often considerably degraded in an unseen environment. In this paper we thus focus on general-purpose blind source separation (BSS) and its extension to environment-adaptive semi-supervised speech enhancement.
The goal of BSS is to estimate both a mixing process and sound sources from observed mixtures. To solve such an ill-posed problem, one can take a statistical approach based on a spatial model representing a sound propagation process and a source model representing the power spectral densities (PSDs) of each source. Duong et al. [4] pioneered this approach by integrating a full-rank spatial model using the frequency-wise full-rank SCMs of each source with a source model assuming the source spectra to follow complex Gaussian distributions. We refer to this method as full-rank spatial covariance analysis (FCA) in this paper, as in [5]. To alleviate the frequency permutation problem of FCA, multichannel nonnegative matrix factorization (MNMF), which uses an NMF-based source model for representing the co-occurrence and low-rankness of frequency components, has been developed [6, 7, 8]. Such a low-rank source model, however, does not fit speech spectra. In speech enhancement, a semi-supervised approach that uses as source models a DNN-based speech model (deep prior, DP) trained from clean speech data and an NMF-based noise model learned on the fly has thus recently been investigated (called MNMF-DP) [9, 10, 11].
The major drawbacks common to these methods based on the full-rank SCMs are the high computational cost due to the repeated heavy operations (e.g., inversion) of the SCMs and the difficulty of parameter optimization due to the large degree of freedom (DOF) of the spatial model. Kitamura et al. [12] thus proposed a constrained version of MNMF called independent low-rank matrix analysis (ILRMA) that restricts the SCMs to rank-1 matrices. Although ILRMA is an order of magnitude faster than MNMF and often performs better in practice, it suffers from a severe loss of spatial modeling ability. Ito et al. [5] proposed a fast version of FCA that restricts the SCMs of each frequency bin to jointly-diagonalizable matrices. For parameter estimation, an expectation-maximization (EM) algorithm with a fixed-point iteration (FPI) method was proposed, but its convergence was not guaranteed.
In this paper we propose a versatile convergence-guaranteed method for estimating the jointly-diagonalizable SCMs of the full-rank spatial model and its application to FCA, MNMF, and MNMF-DP, called FastFCA, FastMNMF, and FastMNMF-DP, respectively, where FastMNMF has an intermediate spatial modeling ability between those of MNMF and ILRMA. As shown in Fig. 1, while all channels are correlated in the original spectrograms, they are independent in the linearly-transformed spectrograms obtained by applying a diagonalizer to each frequency bin. MNMF for the original complex spectrograms is thus equivalent to computationally-efficient nonnegative tensor factorization (NTF) for the independent nonnegative PSDs of the transformed spectrograms. To estimate such a diagonalizer (linear transform), we use an iterative projection (IP) method in a way similar to independent vector analysis (IVA) [13], which estimates a demixing matrix. The resulting algorithm, based on iterations of NTF and IP, is similar in form to that of ILRMA, which is based on iterations of NMF and IP.
One of the important contributions of this paper is to improve existing decomposition methods by the joint diagonalization of covariance matrices. This idea was first discussed for an ultimate but computationally-prohibitive extension of NTF called correlated tensor factorization (CTF) [14] based on multi-way full-rank covariance matrices, resulting in a fast version of CTF called independent low-rank tensor analysis (ILRTA) [15]. While ILRTA was used for single-channel BSS based on jointly-diagonalizable frequency covariance matrices, in this paper we focus on multichannel BSS based on jointly-diagonalizable spatial covariance matrices. Since NTF and IP are used in common for parameter optimization, the proposed FastMNMF can be regarded as a special case of ILRTA.
II. Multichannel Source Separation
This section reviews existing multichannel source separation methods based on a full-rank spatial model, i.e., full-rank spatial covariance analysis (FCA) [4] based on an unconstrained source model, MNMF [8] based on an NMF-based source model, and its adaptation to speech enhancement called MNMF-DP [10] based on a DNN-based speech model and an NMF-based noise model.
II-A. Full-Rank Spatial Model
II-A.1 Model Formulation
Suppose that $N$ sources are observed by $M$ microphones. Let $\mathbf{x}_{ft} \in \mathbb{C}^M$ be the observed multichannel complex spectra, where $f = 1, \ldots, F$ and $t = 1, \ldots, T$ index frequency bins and frames, and $F$ and $T$ are the number of frequency bins and that of frames, respectively. Let $\mathbf{x}_{nft} \in \mathbb{C}^M$ be the image of source $n$, assumed to be circularly-symmetric complex Gaussian distributed as follows:

$\mathbf{x}_{nft} \sim \mathcal{N}_{\mathbb{C}}\!\left(\mathbf{0}, \lambda_{nft}\mathbf{G}_{nf}\right)$, (1)

where $\lambda_{nft} \ge 0$ is the PSD of source $n$ at frequency $f$ and time $t$, and $\mathbf{G}_{nf} \in \mathbb{C}^{M \times M}$ is the positive-definite full-rank SCM of source $n$ at frequency $f$. Using the reproductive property of the Gaussian distribution, the observed spectrum $\mathbf{x}_{ft} = \sum_{n=1}^{N} \mathbf{x}_{nft}$ is given by

$\mathbf{x}_{ft} \sim \mathcal{N}_{\mathbb{C}}\!\left(\mathbf{0}, \mathbf{Y}_{ft}\right)$, (2)

where $\mathbf{Y}_{ft} = \sum_{n=1}^{N} \lambda_{nft}\mathbf{G}_{nf}$. Given the mixture spectrum $\mathbf{x}_{ft}$ and the model parameters $\lambda_{nft}$ and $\mathbf{G}_{nf}$, the posterior expectation of the source image $\mathbf{x}_{nft}$ is obtained by multichannel Wiener filtering (MWF):

$\hat{\mathbf{x}}_{nft} = \mathbb{E}\!\left[\mathbf{x}_{nft} \mid \mathbf{x}_{ft}\right] = \lambda_{nft}\mathbf{G}_{nf}\mathbf{Y}_{ft}^{-1}\mathbf{x}_{ft}$. (3)
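The multichannel Wiener filter of Eq. (3) can be sketched directly with NumPy. The array layout and function name below are illustrative assumptions rather than the authors' implementation; the sketch exposes one useful property, namely that the per-bin filters of all sources sum to the identity, so the separated images always add up to the observed mixture.

```python
import numpy as np

def multichannel_wiener_filter(x_ft, lambda_nft, G_nf):
    """Posterior mean of each source image (cf. Eq. (3)).

    x_ft:       (F, T, M)    observed multichannel spectrogram
    lambda_nft: (N, F, T)    source PSDs
    G_nf:       (N, F, M, M) full-rank (Hermitian PD) SCMs
    Returns:    (N, F, T, M) separated source images
    """
    # Mixture covariance Y_ft = sum_n lambda_nft * G_nf for every bin
    Y_ft = np.einsum("nft,nfpq->ftpq", lambda_nft, G_nf)
    # Y_ft^{-1} x_ft, solved batch-wise instead of inverting explicitly
    Y_inv_x = np.linalg.solve(Y_ft, x_ft[..., None])[..., 0]
    # x_hat_nft = lambda_nft * G_nf @ (Y_ft^{-1} x_ft)
    return lambda_nft[..., None] * np.einsum("nfpq,ftq->nftp", G_nf, Y_inv_x)
```

Because $\sum_n \lambda_{nft}\mathbf{G}_{nf} = \mathbf{Y}_{ft}$, summing the outputs over sources reproduces the input mixture exactly, which is a quick sanity check for any implementation.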
II-A.2 Parameter Estimation
Our goal is to estimate the parameters $\boldsymbol{\lambda} = \{\lambda_{nft}\}$ and $\mathbf{G} = \{\mathbf{G}_{nf}\}$ that maximize the log-likelihood given by Eq. (2):

$\log p(\mathbf{X} \mid \boldsymbol{\lambda}, \mathbf{G}) \stackrel{c}{=} -\sum_{f,t} \left[ \mathbf{x}_{ft}^{\mathsf{H}} \mathbf{Y}_{ft}^{-1} \mathbf{x}_{ft} + \log\det\mathbf{Y}_{ft} \right]$, (4)

where $\mathbf{X} = \{\mathbf{x}_{ft}\}$ and $\stackrel{c}{=}$ denotes equality up to a constant. In this paper we use a majorization-minimization (MM) algorithm [8] that iteratively maximizes a lower bound of Eq. (4). As in [14, 15], the closed-form update rule of $\mathbf{G}_{nf}$ was recently found to be given by
$\mathbf{P}_{nf} = \sum_{t} \lambda_{nft} \mathbf{Y}_{ft}^{-1}$, (5)

$\mathbf{R}_{nf} = \sum_{t} \lambda_{nft} \mathbf{Y}_{ft}^{-1} \mathbf{x}_{ft} \mathbf{x}_{ft}^{\mathsf{H}} \mathbf{Y}_{ft}^{-1}$, (6)

$\mathbf{G}_{nf} \leftarrow \mathbf{P}_{nf}^{-\frac{1}{2}} \left( \mathbf{P}_{nf}^{\frac{1}{2}} \mathbf{G}_{nf} \mathbf{R}_{nf} \mathbf{G}_{nf} \mathbf{P}_{nf}^{\frac{1}{2}} \right)^{\frac{1}{2}} \mathbf{P}_{nf}^{-\frac{1}{2}}$. (7)
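The closed-form SCM update of Eqs. (5)-(7) hinges on solving a matrix quadratic equation of the form $\mathbf{G}\mathbf{P}\mathbf{G} = \mathbf{B}$ for a Hermitian positive definite $\mathbf{G}$, whose unique positive definite solution is a matrix geometric mean. A minimal numerical sketch follows; the names `P` and `B` stand in for the weighted covariance statistics and are our own illustrative choices.

```python
import numpy as np

def herm_sqrt(A):
    """Principal square root of a Hermitian positive semidefinite matrix."""
    w, V = np.linalg.eigh(A)
    return (V * np.sqrt(np.maximum(w, 0.0))) @ V.conj().T

def solve_gpg(P, B):
    """Solve G P G = B for Hermitian positive definite G via
    G = P^{-1/2} (P^{1/2} B P^{1/2})^{1/2} P^{-1/2}."""
    w, V = np.linalg.eigh(P)
    P_half = (V * np.sqrt(w)) @ V.conj().T       # P^{1/2}
    P_neg_half = (V / np.sqrt(w)) @ V.conj().T   # P^{-1/2}
    return P_neg_half @ herm_sqrt(P_half @ B @ P_half) @ P_neg_half
```

Substituting the result back into $\mathbf{G}\mathbf{P}\mathbf{G}$ recovers $\mathbf{B}$, which verifies the formula numerically.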
II-B. Source Models
II-B.1 Unconstrained Source Model
The unconstrained model directly uses $\lambda_{nft}$ as free parameters. Using the MM algorithm, the multiplicative update (MU) rule of $\lambda_{nft}$ is given by

$\lambda_{nft} \leftarrow \lambda_{nft} \sqrt{ \dfrac{ \mathbf{x}_{ft}^{\mathsf{H}} \mathbf{Y}_{ft}^{-1} \mathbf{G}_{nf} \mathbf{Y}_{ft}^{-1} \mathbf{x}_{ft} }{ \operatorname{tr}\!\left( \mathbf{Y}_{ft}^{-1} \mathbf{G}_{nf} \right) } }$. (8)
II-B.2 NMF-Based Source Model
If the PSDs of a source (e.g., noise and music) have low-rank structure, the PSDs can be factorized as follows [8]:

$\lambda_{nft} = \sum_{k=1}^{K} w_{nfk} h_{nkt}$, (9)

where $K$ is the number of bases, $w_{nfk} \ge 0$ is the magnitude of basis $k$ of source $n$ at frequency $f$, and $h_{nkt} \ge 0$ is the activation of basis $k$ of source $n$ at time $t$. Using the MM algorithm [16], the MU rules of $w_{nfk}$ and $h_{nkt}$ are given by

$w_{nfk} \leftarrow w_{nfk} \sqrt{ \dfrac{ \sum_{t} h_{nkt}\, \mathbf{x}_{ft}^{\mathsf{H}} \mathbf{Y}_{ft}^{-1} \mathbf{G}_{nf} \mathbf{Y}_{ft}^{-1} \mathbf{x}_{ft} }{ \sum_{t} h_{nkt}\, \operatorname{tr}\!\left( \mathbf{Y}_{ft}^{-1} \mathbf{G}_{nf} \right) } }$, (10)

$h_{nkt} \leftarrow h_{nkt} \sqrt{ \dfrac{ \sum_{f} w_{nfk}\, \mathbf{x}_{ft}^{\mathsf{H}} \mathbf{Y}_{ft}^{-1} \mathbf{G}_{nf} \mathbf{Y}_{ft}^{-1} \mathbf{x}_{ft} }{ \sum_{f} w_{nfk}\, \operatorname{tr}\!\left( \mathbf{Y}_{ft}^{-1} \mathbf{G}_{nf} \right) } }$. (11)
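In the single-channel case ($M = 1$), the multichannel MU rules above reduce to the standard multiplicative updates of IS-NMF. The sketch below shows this simplified form (our own reduction, not the multichannel rules themselves): the square-rooted updates never increase the Itakura-Saito divergence between the observed power spectrogram and the low-rank model.

```python
import numpy as np

def is_divergence(P, V):
    """Itakura-Saito divergence between nonnegative arrays P and V."""
    R = P / V
    return float(np.sum(R - np.log(R) - 1.0))

def isnmf_update(P, W, H, eps=1e-12):
    """One round of square-rooted multiplicative updates minimizing the
    IS divergence between P (F x T) and the low-rank model V = W H."""
    V = W @ H + eps
    W = W * np.sqrt(((P / V**2) @ H.T) / ((1.0 / V) @ H.T))
    V = W @ H + eps
    H = H * np.sqrt((W.T @ (P / V**2)) / (W.T @ (1.0 / V)))
    return W, H
```

Monotone convergence of these square-rooted updates is exactly what the MM derivation of [16] guarantees.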
II-B.3 DNN-Based Source Model
To represent the complicated characteristics of the PSDs of a source (e.g., speech), a deep generative model can be used as follows [9]:

$\lambda_{nft} = u_{nf}\, v_{nt} \left[ \boldsymbol{\sigma}_{\boldsymbol{\theta}}^{2}(\mathbf{z}_{nt}) \right]_f$, (12)

where $\boldsymbol{\sigma}_{\boldsymbol{\theta}}^{2}$ is a nonlinear function (DNN) with parameters $\boldsymbol{\theta}$ that maps a latent variable $\mathbf{z}_{nt}$ to a nonnegative spectrum at each time $t$, $[\cdot]_f$ indicates the $f$-th element of a vector, $u_{nf} \ge 0$ is a scaling factor at frequency $f$, and $v_{nt} \ge 0$ is an activation at time $t$.
To update the latent variables $\mathbf{z}_{nt}$, we use Metropolis sampling. A proposal $\mathbf{z}_{nt}'$ drawn from a Gaussian centered at $\mathbf{z}_{nt}$ is accepted with probability $\min(1, \alpha)$, where $\alpha$ is given by

$\alpha = \dfrac{ p(\mathbf{X} \mid \mathbf{z}_{nt}')\, p(\mathbf{z}_{nt}') }{ p(\mathbf{X} \mid \mathbf{z}_{nt})\, p(\mathbf{z}_{nt}) }$, (13)

where $p(\mathbf{z}_{nt})$ is the standard Gaussian prior of the latent variables. In practice, we update $\mathbf{z}_{nt}$ several times without updating the other parameters to reduce the computational cost of calculating $\alpha$.
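The Metropolis step described above is generic; the sketch below shows it for an arbitrary unnormalized log-density standing in for the product of likelihood and prior in Eq. (13). The function names and the Gaussian random-walk proposal are illustrative assumptions.

```python
import numpy as np

def metropolis_step(z, log_target, step, rng):
    """One Metropolis update of a latent vector z.

    log_target: callable returning the unnormalized log-density
                (here it plays the role of log p(X|z) + log p(z)).
    step:       standard deviation of the Gaussian random-walk proposal.
    """
    z_prop = z + step * rng.standard_normal(z.shape)
    # Accept with probability min(1, target(z') / target(z))
    if np.log(rng.random()) < log_target(z_prop) - log_target(z):
        return z_prop
    return z
```

Running the chain on a known target (e.g., a standard Gaussian) is a convenient way to confirm that the acceptance test is implemented correctly.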
In the same way as the NMF-based source model, the MU rules of $u_{nf}$ and $v_{nt}$ are given by

$u_{nf} \leftarrow u_{nf} \sqrt{ \dfrac{ \sum_{t} v_{nt} \left[ \boldsymbol{\sigma}_{\boldsymbol{\theta}}^{2}(\mathbf{z}_{nt}) \right]_f \mathbf{x}_{ft}^{\mathsf{H}} \mathbf{Y}_{ft}^{-1} \mathbf{G}_{nf} \mathbf{Y}_{ft}^{-1} \mathbf{x}_{ft} }{ \sum_{t} v_{nt} \left[ \boldsymbol{\sigma}_{\boldsymbol{\theta}}^{2}(\mathbf{z}_{nt}) \right]_f \operatorname{tr}\!\left( \mathbf{Y}_{ft}^{-1} \mathbf{G}_{nf} \right) } }$, (14)

$v_{nt} \leftarrow v_{nt} \sqrt{ \dfrac{ \sum_{f} u_{nf} \left[ \boldsymbol{\sigma}_{\boldsymbol{\theta}}^{2}(\mathbf{z}_{nt}) \right]_f \mathbf{x}_{ft}^{\mathsf{H}} \mathbf{Y}_{ft}^{-1} \mathbf{G}_{nf} \mathbf{Y}_{ft}^{-1} \mathbf{x}_{ft} }{ \sum_{f} u_{nf} \left[ \boldsymbol{\sigma}_{\boldsymbol{\theta}}^{2}(\mathbf{z}_{nt}) \right]_f \operatorname{tr}\!\left( \mathbf{Y}_{ft}^{-1} \mathbf{G}_{nf} \right) } }$. (15)
II-C. Integration of Spatial and Source Models
II-C.1 Full-Rank Spatial Covariance Analysis
FCA [4] is obtained by integrating the full-rank spatial model with the unconstrained source model.
II-C.2 Multichannel NMF
MNMF [8] is obtained by integrating the NMF-based source model into FCA.
II-C.3 MNMF with a Deep Prior
MNMF-DP [10], specialized for speech enhancement, is obtained by integrating the full-rank spatial model with the DNN- and NMF-based source models representing speech and noise sources, respectively. Assuming that the source indexed by $n = 1$ corresponds to the speech, $\lambda_{1ft}$ and $\lambda_{nft}$ ($n \ge 2$) are given by Eq. (12) and Eq. (9), respectively.
III. Fast Multichannel Source Separation
This section proposes the fast versions of FCA, MNMF, and MNMF-DP based on the jointly-diagonalizable SCMs.
III-A. Jointly-Diagonalizable Full-Rank Spatial Model
III-A.1 Model Formulation
To reduce the computational cost of the full-rank spatial model, we put a constraint that the SCMs can be jointly diagonalized as follows:

$\mathbf{G}_{nf} = \mathbf{Q}_f^{-1} \operatorname{Diag}(\tilde{\mathbf{g}}_{nf})\, \mathbf{Q}_f^{-\mathsf{H}}$, (16)

where $\mathbf{Q}_f \in \mathbb{C}^{M \times M}$ is a nonsingular matrix called a diagonalizer and $\tilde{\mathbf{g}}_{nf} \in \mathbb{R}_{\ge 0}^{M}$ is a nonnegative vector. The observed spectrum $\mathbf{x}_{ft}$ is projected into a new space where the elements of the projected spectrum $\tilde{\mathbf{x}}_{ft} = \mathbf{Q}_f \mathbf{x}_{ft}$ are mutually independent (Fig. 1).
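The constraint of Eq. (16) can be checked numerically: if every SCM of a frequency bin shares the same diagonalizer, then the covariance of the transformed spectrum is diagonal for any nonnegative source PSDs. A small sketch with illustrative shapes ($N$ sources, $M$ channels, one frequency bin):

```python
import numpy as np

def jd_scm(Q_inv, g_tilde):
    """SCMs constrained as G_n = Q^{-1} diag(g~_n) Q^{-H} for one bin.

    Q_inv:   (M, M) inverse of the diagonalizer
    g_tilde: (N, M) nonnegative diagonal elements per source
    """
    return np.stack([Q_inv @ np.diag(g) @ Q_inv.conj().T for g in g_tilde])
```

Applying the diagonalizer to the mixture covariance $\sum_n \lambda_n \mathbf{G}_n$ then yields a diagonal matrix whose entries are simply $\sum_n \lambda_n \tilde{g}_{nm}$, which is why the heavy matrix operations of MNMF collapse to element-wise ones.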
III-A.2 Parameter Estimation
Our goal is to jointly estimate $\mathbf{Q} = \{\mathbf{Q}_f\}$, $\boldsymbol{\lambda} = \{\lambda_{nft}\}$, and $\tilde{\mathbf{g}} = \{\tilde{\mathbf{g}}_{nf}\}$ that maximize the log-likelihood given by substituting Eq. (16) into Eq. (2) as follows:

$\log p(\mathbf{X} \mid \mathbf{Q}, \boldsymbol{\lambda}, \tilde{\mathbf{g}}) \stackrel{c}{=} \sum_{f,t} \left[ 2 \log \left| \det \mathbf{Q}_f \right| - \sum_{m=1}^{M} \left( \dfrac{ |\tilde{x}_{ftm}|^2 }{ \tilde{y}_{ftm} } + \log \tilde{y}_{ftm} \right) \right]$, (17)

where $\tilde{\mathbf{x}}_{ft} = \mathbf{Q}_f \mathbf{x}_{ft}$, $|\cdot|^2$ indicates the element-wise absolute square, and $\tilde{y}_{ftm} = \sum_{n=1}^{N} \lambda_{nft} \tilde{g}_{nfm}$.
Since Eq. (17) has the same form as the log-likelihood function of IVA [13], $\mathbf{Q}_f$ can be updated by using the convergence-guaranteed iterative projection (IP) method as follows:

$\mathbf{V}_{fm} = \dfrac{1}{T} \sum_{t} \dfrac{ \mathbf{x}_{ft} \mathbf{x}_{ft}^{\mathsf{H}} }{ \tilde{y}_{ftm} }$, (18)

$\mathbf{q}_{fm} \leftarrow \left( \mathbf{Q}_f \mathbf{V}_{fm} \right)^{-1} \mathbf{e}_m$, (19)

$\mathbf{q}_{fm} \leftarrow \mathbf{q}_{fm} \left( \mathbf{q}_{fm}^{\mathsf{H}} \mathbf{V}_{fm} \mathbf{q}_{fm} \right)^{-\frac{1}{2}}$, (20)

where $\mathbf{q}_{fm}^{\mathsf{H}}$ is the $m$-th row of $\mathbf{Q}_f$ and $\mathbf{e}_m$ is a one-hot vector whose $m$-th element is 1. In addition, the MU rule of $\tilde{g}_{nfm}$ is given by

$\tilde{g}_{nfm} \leftarrow \tilde{g}_{nfm} \sqrt{ \dfrac{ \sum_{t} \lambda_{nft}\, |\tilde{x}_{ftm}|^2 / \tilde{y}_{ftm}^2 }{ \sum_{t} \lambda_{nft} / \tilde{y}_{ftm} } }$. (21)

A diagonalizer $\mathbf{Q}_f$ is estimated so that the components (channels) of $\tilde{\mathbf{x}}_{ft}$ become independent. In IVA [13] and ILRMA [12] under a determined condition ($M = N$), a demixing matrix is estimated so that the components (sources) of the demixed spectra become independent. In any case, the characteristics of the components (e.g., low-rankness in the NMF-based source model) represented by $\tilde{y}_{ftm}$ are considered. This implies that our method could work as fast as ILRMA even in an underdetermined condition ($M < N$) while keeping the full-rank spatial modeling ability.
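A per-frequency sketch of the IP sweep of Eqs. (18)-(20) is given below, together with the objective it is guaranteed not to decrease when the model powers are held fixed. The variable names and the row-wise storage of the diagonalizer are our own conventions, not the authors' code.

```python
import numpy as np

def ip_update(Q, X, Ypow):
    """One sweep of iterative projection over the rows of the diagonalizer.

    Q:    (M, M) current diagonalizer (row m stores q_m^H)
    X:    (T, M) observed spectra at one frequency bin
    Ypow: (T, M) model power y~_tm of each transformed channel
    """
    T, M = X.shape
    for m in range(M):
        # Eq. (18): weighted covariance V_m = (1/T) sum_t x_t x_t^H / y~_tm
        V = (X.T * (1.0 / Ypow[:, m])) @ X.conj() / T
        # Eq. (19): q_m = (Q V_m)^{-1} e_m
        q = np.linalg.solve(Q @ V, np.eye(M)[:, m])
        # Eq. (20): normalize so that q_m^H V_m q_m = 1
        q = q / np.sqrt((q.conj() @ V @ q).real)
        Q[m] = q.conj()
    return Q

def ip_objective(Q, X, Ypow):
    """Per-bin log-likelihood terms (cf. Eq. (17)) with the powers fixed."""
    T = X.shape[0]
    Xt = X @ Q.T  # transformed spectra x~_t = Q x_t
    return 2 * T * np.log(abs(np.linalg.det(Q))) - np.sum(np.abs(Xt) ** 2 / Ypow)
```

Each row update exactly maximizes the objective with the other rows fixed, which is the source of the convergence guarantee inherited from [13].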
III-B. Source Models
III-B.1 Unconstrained Source Model
Using the MM algorithm for IS-NMF [16], the MU rule of $\lambda_{nft}$ is given by

$\lambda_{nft} \leftarrow \lambda_{nft} \sqrt{ \dfrac{ \sum_{m} \tilde{g}_{nfm}\, |\tilde{x}_{ftm}|^2 / \tilde{y}_{ftm}^2 }{ \sum_{m} \tilde{g}_{nfm} / \tilde{y}_{ftm} } }$. (22)
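The PSD update of Eq. (22) is an IS-divergence-style tensor update carried out entirely in the diagonalized domain, which is where the speed-up comes from: no matrix inversions remain. The sketch below (array layout is our assumption) applies it repeatedly and checks that the corresponding part of the negative log-likelihood never increases.

```python
import numpy as np

def update_lambda(lam, g, Pw, eps=1e-12):
    """MU rule for the source PSDs under the diagonalized model (cf. Eq. (22)).

    lam: (N, F, T) current PSDs
    g:   (N, F, M) diagonal SCM elements g~_nfm
    Pw:  (F, T, M) transformed power |x~_ftm|^2
    """
    y = np.einsum("nft,nfm->ftm", lam, g) + eps      # model power y~_ftm
    num = np.einsum("nfm,ftm->nft", g, Pw / y**2)
    den = np.einsum("nfm,ftm->nft", g, 1.0 / y)
    return lam * np.sqrt(num / (den + eps))

def neg_loglik(lam, g, Pw, eps=1e-12):
    """The lambda-dependent part of the negative log-likelihood of Eq. (17)."""
    y = np.einsum("nft,nfm->ftm", lam, g) + eps
    return float(np.sum(Pw / y + np.log(y)))
```

Every operation is element-wise or an `einsum` over small axes, so the cost per iteration scales linearly in $NFTM$ rather than cubically in the channel count.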
III-B.2 NMF-Based Source Model
Similarly, the MU rules of $w_{nfk}$ and $h_{nkt}$ included in Eq. (9) are given by

$w_{nfk} \leftarrow w_{nfk} \sqrt{ \dfrac{ \sum_{t,m} h_{nkt}\, \tilde{g}_{nfm}\, |\tilde{x}_{ftm}|^2 / \tilde{y}_{ftm}^2 }{ \sum_{t,m} h_{nkt}\, \tilde{g}_{nfm} / \tilde{y}_{ftm} } }$, (23)

$h_{nkt} \leftarrow h_{nkt} \sqrt{ \dfrac{ \sum_{f,m} w_{nfk}\, \tilde{g}_{nfm}\, |\tilde{x}_{ftm}|^2 / \tilde{y}_{ftm}^2 }{ \sum_{f,m} w_{nfk}\, \tilde{g}_{nfm} / \tilde{y}_{ftm} } }$. (24)
III-B.3 DNN-Based Source Model
To update the latent variables $\mathbf{z}_{nt}$ included in Eq. (12), we use Metropolis sampling. A proposal $\mathbf{z}_{nt}'$ is accepted with probability $\min(1, \alpha)$, where $\alpha$ is given by

$\alpha = \dfrac{ p(\mathbf{X} \mid \mathbf{z}_{nt}')\, p(\mathbf{z}_{nt}') }{ p(\mathbf{X} \mid \mathbf{z}_{nt})\, p(\mathbf{z}_{nt}) }$, (25)

where the likelihood is evaluated with $\tilde{y}_{ftm} = \tilde{y}_{ftm}^{\neg n} + \lambda_{nft}\tilde{g}_{nfm}$, and $\tilde{y}_{ftm}^{\neg n} = \sum_{n' \neq n} \lambda_{n'ft}\tilde{g}_{n'fm}$ is a reconstruction without the component of source $n$. As in the NMF-based source model, the MU rules of $u_{nf}$ and $v_{nt}$ are given by

$u_{nf} \leftarrow u_{nf} \sqrt{ \dfrac{ \sum_{t,m} v_{nt} \left[ \boldsymbol{\sigma}_{\boldsymbol{\theta}}^{2}(\mathbf{z}_{nt}) \right]_f \tilde{g}_{nfm}\, |\tilde{x}_{ftm}|^2 / \tilde{y}_{ftm}^2 }{ \sum_{t,m} v_{nt} \left[ \boldsymbol{\sigma}_{\boldsymbol{\theta}}^{2}(\mathbf{z}_{nt}) \right]_f \tilde{g}_{nfm} / \tilde{y}_{ftm} } }$, (26)

$v_{nt} \leftarrow v_{nt} \sqrt{ \dfrac{ \sum_{f,m} u_{nf} \left[ \boldsymbol{\sigma}_{\boldsymbol{\theta}}^{2}(\mathbf{z}_{nt}) \right]_f \tilde{g}_{nfm}\, |\tilde{x}_{ftm}|^2 / \tilde{y}_{ftm}^2 }{ \sum_{f,m} u_{nf} \left[ \boldsymbol{\sigma}_{\boldsymbol{\theta}}^{2}(\mathbf{z}_{nt}) \right]_f \tilde{g}_{nfm} / \tilde{y}_{ftm} } }$. (27)
III-C. Integration of Spatial and Source Models
III-C.1 FastFCA
The fast version of FCA is obtained by integrating the jointly-diagonalizable full-rank spatial model and the unconstrained source model. While the EM algorithm with the FPI step was originally used in [5], in this paper we use the MM algorithm with the convergence-guaranteed IP step.
III-C.2 FastMNMF
The fast version of MNMF is obtained by integrating the NMF-based source model into FastFCA.
III-C.3 FastMNMF-DP
The fast version of MNMF-DP is obtained by integrating the jointly-diagonalizable full-rank spatial model with the DNN- and NMF-based source models representing speech and noise sources, respectively.
IV. Evaluation
This section evaluates the performances and efficiencies of the proposed methods in a speech enhancement task.
IV-A. Experimental Conditions
100 simulated noisy speech signals sampled at 16 kHz were randomly selected from the evaluation dataset of CHiME-3 [17]. These data simulate recordings by six microphones attached to a tablet device. Five channels ($M = 5$), excluding the second channel behind the tablet, were used. The short-time Fourier transform with a window length of 1024 points and a shifting interval of 256 points was used. To evaluate the performance of speech enhancement, the signal-to-distortion ratio (SDR) was measured [18, 19]. To evaluate the computational efficiency, the elapsed time per iteration for processing 8-second data was measured on an Intel Xeon W-2145 CPU (3.70 GHz) or an NVIDIA GeForce GTX 1080 Ti GPU.
FastFCA (Section III-C.1), FastMNMF (Section III-C.2), and FastMNMF-DP (Section III-C.3) based on the jointly-diagonalizable SCMs were compared with FCA (Section II-C.1), MNMF [8] (Section II-C.2), and MNMF-DP [10] (Section II-C.3) based on the unconstrained SCMs, where all these methods used the MM algorithms (with the IP step for estimating the SCMs) described in this paper. The original versions of FCA [4] and FastFCA [5] based on the EM algorithms (with the FPI step for estimating the SCMs) were also tested. For comparison, ILRMA [12] based on the rank-1 SCMs was tested.
The number of sources $N$ was varied from 2 to 5, except for ILRMA, which was used only in a determined condition ($N = M = 5$). The number of iterations was 100. For the NMF-based source model, the number of bases was set to $K = 4$, $16$, or $64$. For the DNN-based source model, the latent variables $\mathbf{z}_{nt}$ were updated 30 times per iteration with a fixed proposal variance. The parameters $\boldsymbol{\theta}$ were trained in advance from about 15 hours of clean speech data included in the WSJ0 corpus [20], as described in [21]. More specifically, a DNN-based decoder that generates a speech spectrum from a latent variable and a DNN-based encoder that infers the latent variable from a speech spectrum were trained jointly in a variational autoencoding manner [22]. The SCM of speech was initialized as the average of the observed SCMs, and the SCMs of noise were initialized as the identity matrices. The diagonalizers and the diagonal elements of the SCMs were initialized with the spectral decomposition of the observed SCMs. The latent variables were initialized by feeding the observed mixture spectra to the encoder.



TABLE I(a): Elapsed time per iteration [s] on the CPU.

Method | FCA / FastFCA | FCA / FastFCA | ILRMA | MNMF / FastMNMF | MNMF-DP / FastMNMF-DP
# of bases K | - | - | 4 | 16 | 64 | 4 | 16 | 64 | 4 | 16 | 64
N = 2 | 2.1 / 0.49 | 3.3 / 0.43 | - | - | - | 4.9 / 0.70 | 5.0 / 0.79 | 5.4 / 1.3 | 11 / 1.7 | 11 / 1.7 | 11 / 1.9
N = 3 | 2.6 / 0.59 | 4.0 / 0.47 | - | - | - | 5.9 / 0.78 | 6.0 / 0.91 | 6.5 / 1.7 | 13 / 1.8 | 13 / 1.8 | 13 / 2.3
N = 4 | 3.2 / 0.70 | 4.7 / 0.56 | - | - | - | 6.8 / 0.85 | 7.0 / 1.1 | 7.7 / 2.2 | 15 / 1.9 | 15 / 2.0 | 15 / 2.8
N = 5 | 3.7 / 0.81 | 5.3 / 0.63 | 0.53 | 0.62 | 1.0 | 7.8 / 1.0 | 8.0 / 1.2 | 8.9 / 2.8 | 17 / 2.0 | 17 / 2.2 | 17 / 3.4




TABLE I(b): Elapsed time per iteration [s] on the GPU.

Method | FCA / FastFCA | FCA / FastFCA | ILRMA | MNMF / FastMNMF | MNMF-DP / FastMNMF-DP
# of bases K | - | - | 4 | 16 | 64 | 4 | 16 | 64 | 4 | 16 | 64
N = 2 | 1.6 / 0.16 | 2.0 / 0.15 | - | - | - | 3.0 / 0.38 | 3.0 / 0.55 | 3.2 / 1.2 | 7.0 / 0.90 | 7.0 / 0.98 | 7.1 / 1.3
N = 3 | 2.3 / 0.19 | 2.8 / 0.17 | - | - | - | 4.2 / 0.43 | 4.2 / 0.68 | 4.5 / 1.7 | 9.3 / 0.94 | 9.3 / 1.1 | 9.5 / 1.8
N = 4 | 3.0 / 0.22 | 3.6 / 0.17 | - | - | - | 5.3 / 0.46 | 5.4 / 0.81 | 5.7 / 2.2 | 12 / 0.99 | 12 / 1.2 | 12 / 2.3
N = 5 | 3.7 / 0.25 | 4.5 / 0.19 | 0.52 | 0.61 | 1.0 | 6.6 / 0.51 | 6.7 / 0.94 | 7.1 / 2.7 | 14 / 1.0 | 14 / 1.4 | 14 / 2.8

IV-B. Experimental Results
Tables I(a) and I(b) list the elapsed times per iteration on the CPU and GPU, respectively, and Table II lists the average SDRs. The proposed MM-based FastFCA slightly outperformed the original FastFCA [5] in all measures because the IP method calculates the matrix inversion only once for updating the diagonalizer while the FPI method calculates it twice, and the MM algorithm converges faster than the EM algorithm. FastFCA, FastMNMF, and FastMNMF-DP were an order of magnitude faster and performed as well as or even better than their original versions. In general, more than two positive definite matrices cannot be exactly jointly diagonalized. If $N \ge 3$, the fast versions are thus inferior to the original versions in terms of the DOF, but the restriction of the DOF of the spatial model proved effective for avoiding bad local optima. If $N = 2$, the DOFs of the fast versions are exactly the same as those of the original versions in theory, as described in stereo FastFCA [23], but the fast versions were less sensitive to the initialization in our experiment. One reason would be that while only the SCM of speech was initialized to a reasonable value in the original versions, the initialization based on the spectral decomposition of the observed SCMs contributed to initializing the SCMs of noise in the fast versions. In the determined condition ($N = M = 5$, the best condition for ILRMA), FastMNMF was as fast as and outperformed ILRMA.
V. Conclusion
This paper presented a full-rank spatial model based on the jointly-diagonalizable SCMs of sound sources and its application to existing methods such as FCA, MNMF, and MNMF-DP. For such fast versions, we proposed a general convergence-guaranteed MM algorithm that uses the IP method for estimating the SCMs. We experimentally showed that our approach is effective for improving both the separation performance and the computational efficiency. One important direction is to develop an online FastMNMF-DP for real-time noisy speech recognition, because the real-time factor of FastMNMF-DP could be less than 1. We also plan to simultaneously consider the jointly-diagonalizable full-rank spatial and frequency covariance matrices of sound sources, as suggested in [15].
References
 [1] A. Narayanan and D. Wang, “Ideal ratio mask estimation using deep neural networks for robust speech recognition,” in ICASSP, 2013, pp. 7092–7096.
 [2] H. Erdogan et al., “Improved MVDR beamforming using single-channel mask prediction networks,” in Interspeech, 2016, pp. 1981–1985.
 [3] J. Heymann et al., “Neural network based spectral mask estimation for acoustic beamforming,” in ICASSP, 2016, pp. 196–200.
 [4] N. Q. K. Duong et al., “Underdetermined reverberant audio source separation using a full-rank spatial covariance model,” IEEE TASLP, vol. 18, no. 7, pp. 1830–1840, 2010.
 [5] N. Ito and T. Nakatani, “FastFCA-AS: Joint diagonalization based acceleration of full-rank spatial covariance analysis for separating any number of sources,” in IWAENC, 2018, pp. 151–155.
 [6] A. Ozerov and C. Févotte, “Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation,” IEEE TASLP, vol. 18, no. 3, pp. 550–563, 2010.
 [7] S. Arberet et al., “Nonnegative matrix factorization and spatial covariance model for underdetermined reverberant audio source separation,” in ISSPA, 2010, pp. 1–4.
 [8] H. Sawada et al., “Multichannel extensions of nonnegative matrix factorization with complexvalued data,” IEEE TASLP, vol. 21, no. 5, pp. 971–982, 2013.
 [9] Y. Bando et al., “Statistical speech enhancement based on probabilistic integration of variational autoencoder and nonnegative matrix factorization,” in ICASSP, 2018, pp. 716–720.
 [10] K. Sekiguchi et al., “Bayesian multichannel speech enhancement with a deep speech prior,” in APSIPA, 2018, pp. 1233–1239.
 [11] S. Leglaive et al., “Semisupervised multichannel speech enhancement with variational autoencoders and nonnegative matrix factorization,” in ICASSP, 2019. [Online]. Available: https://arxiv.org/pdf/1811.06713.pdf
 [12] D. Kitamura et al., “Determined blind source separation unifying independent vector analysis and nonnegative matrix factorization,” IEEE TASLP, vol. 24, no. 9, pp. 1626–1641, 2016.
 [13] N. Ono, “Stable and fast update rules for independent vector analysis based on auxiliary function technique,” in WASPAA, 2011, pp. 189–192.
 [14] K. Yoshii, “Correlated tensor factorization for audio source separation,” in ICASSP, 2018, pp. 731–735.
 [15] K. Yoshii et al., “Independent low-rank tensor analysis for audio source separation,” in EUSIPCO, 2018, pp. 1671–1675.
 [16] M. Nakano et al., “Convergence-guaranteed multiplicative algorithms for nonnegative matrix factorization with β-divergence,” in MLSP, 2010, pp. 283–288.
 [17] J. Barker et al., “The third ’CHiME’ speech separation and recognition challenge: Analysis and outcomes,” Computer Speech & Language, vol. 46, pp. 605–626, 2017.
 [18] E. Vincent et al., “Performance measurement in blind audio source separation,” IEEE TASLP, vol. 14, no. 4, pp. 1462–1469, 2006.
 [19] C. Raffel et al., “mir_eval: A transparent implementation of common MIR metrics,” in ISMIR, 2014, pp. 367–372.
 [20] J. Garofalo et al., “CSR-I (WSJ0) Complete,” Linguistic Data Consortium, Philadelphia, 2007.
 [21] S. Leglaive et al., “A variance modeling framework based on variational autoencoders for speech enhancement,” in IEEE MLSP, 2018.
 [22] D. P. Kingma and M. Welling, “Autoencoding variational Bayes,” in ICLR, 2014.
 [23] N. Ito et al., “FastFCA: A joint diagonalization based fast algorithm for audio source separation using a full-rank spatial covariance model,” in EUSIPCO, 2018, pp. 1667–1671.