Fast Multichannel Source Separation Based on Jointly Diagonalizable Spatial Covariance Matrices
This paper describes a versatile method that accelerates multichannel source separation methods based on full-rank spatial modeling. A popular approach to multichannel source separation is to integrate a spatial model with a source model for estimating the spatial covariance matrices (SCMs) and power spectral densities (PSDs) of each sound source in the time-frequency domain. One of the most successful examples of this approach is multichannel nonnegative matrix factorization (MNMF) based on a full-rank spatial model and a low-rank source model. MNMF, however, is computationally expensive and often works poorly due to the difficulty of estimating the unconstrained full-rank SCMs. Instead of restricting the SCMs to rank-1 matrices with the severe loss of the spatial modeling ability as in independent low-rank matrix analysis (ILRMA), we restrict the SCMs of each frequency bin to jointly-diagonalizable but still full-rank matrices. For such a fast version of MNMF, we propose a computationally-efficient and convergence-guaranteed algorithm that is similar in form to that of ILRMA. Similarly, we propose a fast version of a state-of-the-art speech enhancement method based on a deep speech model and a low-rank noise model. Experimental results showed that the fast versions of MNMF and the deep speech enhancement method were several times faster than, and performed even better than, their original counterparts.
Multichannel source separation plays a central role in computational auditory scene analysis. To make effective use of an automatic speech recognition system in a noisy environment, for example, it is indispensable to separate speech signals from noise-contaminated signals. A standard approach to multichannel source separation is to use a non-blind method (e.g., beamforming and Wiener filtering) based on the spatial covariance matrix (SCM) of a target source (e.g., speech) and those of the other sources (e.g., noise). To use beamforming for speech enhancement, deep neural networks (DNNs) are often used for classifying each time-frequency bin into speech or noise [1, 2, 3]. The performance of such a supervised approach, however, is often considerably degraded in an unseen environment. In this paper we thus focus on general-purpose blind source separation (BSS) and its extension for environment-adaptive semi-supervised speech enhancement.
The goal of BSS is to estimate both a mixing process and sound sources from observed mixtures. To solve such an ill-posed problem, one can take a statistical approach based on a spatial model representing a sound propagation process and a source model representing the power spectral densities (PSDs) of each source. Duong et al. [4] pioneered this approach by integrating a full-rank spatial model using the frequency-wise full-rank SCMs of each source with a source model assuming the source spectra to follow complex Gaussian distributions. We call this method full-rank spatial covariance analysis (FCA) in this paper, as in [5]. To alleviate the frequency permutation problem of FCA, multichannel nonnegative matrix factorization (MNMF), which uses an NMF-based source model for representing the co-occurrence and low-rankness of frequency components, has been developed [6, 7, 8]. Such a low-rank source model, however, does not fit speech spectra. In speech enhancement, a semi-supervised approach that uses as source models a DNN-based speech model (deep prior, DP) trained from clean speech data and an NMF-based noise model learned on the fly has thus recently been investigated (called MNMF-DP) [9, 10, 11].
The major drawbacks common to these methods based on the full-rank SCMs are the high computational cost due to the repeated heavy operations (e.g., inversion) of the SCMs and the difficulty of parameter optimization due to the large degree of freedom (DOF) of the spatial model. Kitamura et al. [12] thus proposed a constrained version of MNMF called independent low-rank matrix analysis (ILRMA) that restricts the SCMs to rank-1 matrices. Although ILRMA is an order of magnitude faster than MNMF and often performs better in practice, it suffers from a severe loss of spatial modeling ability. Ito et al. [5] proposed a fast version of FCA that restricts the SCMs of each frequency bin to jointly-diagonalizable matrices. For parameter estimation, an expectation-maximization (EM) algorithm with a fixed-point iteration (FPI) method was proposed, but its convergence was not guaranteed.
In this paper we propose a versatile convergence-guaranteed method for estimating the jointly-diagonalizable SCMs of the full-rank spatial model and its application to FCA, MNMF, and MNMF-DP, yielding FastFCA, FastMNMF, and FastMNMF-DP, respectively, where FastMNMF has an intermediate spatial modeling ability between those of MNMF and ILRMA. As shown in Fig. 1, while all channels are correlated in the original spectrograms, they are independent in the linearly-transformed spectrograms obtained by applying a diagonalizer to each frequency bin. MNMF for the original complex spectrograms is thus equivalent to computationally-efficient nonnegative tensor factorization (NTF) for the independent nonnegative PSDs of the transformed spectrograms. To estimate such a diagonalizer (linear transform), we use an iterative projection (IP) method in a way similar to independent vector analysis (IVA) [13], which estimates a demixing matrix. The resulting algorithm based on iterations of NTF and IP is similar in form to that of ILRMA based on iterations of NMF and IP.
One of the important contributions of this paper is to improve existing decomposition methods by joint diagonalization of covariance matrices. This idea was first discussed for an ultimate but computationally-prohibitive extension of NTF called correlated tensor factorization (CTF) [14] based on multi-way full-rank covariance matrices, resulting in a fast version of CTF called independent low-rank tensor analysis (ILRTA) [15]. While ILRTA was used for single-channel BSS based on jointly-diagonalizable frequency covariance matrices, in this paper we focus on multichannel BSS based on jointly-diagonalizable spatial covariance matrices. Since NTF and IP are used in common for parameter optimization, the proposed FastMNMF can be regarded as a special case of ILRTA.
II Multichannel Source Separation
This section reviews existing multichannel source separation methods based on a full-rank spatial model, i.e., full-rank spatial covariance analysis (FCA) [4] based on an unconstrained source model, MNMF [6, 7, 8] based on an NMF-based source model, and its adaptation to speech enhancement called MNMF-DP [9, 10, 11] based on a DNN-based speech model and an NMF-based noise model.
II-A Full-Rank Spatial Model
II-A1 Model Formulation
Suppose that $N$ sources are observed by $M$ microphones. Let $\mathbf{X} = \{\mathbf{x}_{ft} \in \mathbb{C}^M\}_{f=1,t=1}^{F,T}$ be the observed multichannel complex spectra, where $F$ and $T$ are the number of frequency bins and that of frames, respectively. Let $\mathbf{x}_{nft} \in \mathbb{C}^M$ be the image of source $n$ assumed to be circularly-symmetric complex Gaussian distributed as follows:

$$\mathbf{x}_{nft} \sim \mathcal{N}_{\mathbb{C}}\left(\mathbf{0}, \lambda_{nft}\mathbf{G}_{nf}\right), \tag{1}$$

where $\lambda_{nft} \ge 0$ is the PSD of source $n$ at frequency $f$ and time $t$, and $\mathbf{G}_{nf} \in \mathbb{C}^{M \times M}$ is the positive definite full-rank SCM of source $n$ at frequency $f$. Using the reproductive property of the Gaussian distribution, the observed spectrum $\mathbf{x}_{ft} = \sum_{n=1}^{N} \mathbf{x}_{nft}$ is given by

$$\mathbf{x}_{ft} \sim \mathcal{N}_{\mathbb{C}}\left(\mathbf{0}, \sum_{n=1}^{N} \lambda_{nft}\mathbf{G}_{nf}\right). \tag{2}$$
Given the mixture spectrum $\mathbf{x}_{ft}$ and the model parameters $\lambda_{nft}$ and $\mathbf{G}_{nf}$, the posterior expectation of the source image $\mathbf{x}_{nft}$ is obtained by multichannel Wiener filtering (MWF):

$$\mathbb{E}\left[\mathbf{x}_{nft} \mid \mathbf{x}_{ft}\right] = \mathbf{Y}_{nft}\mathbf{Y}_{ft}^{-1}\mathbf{x}_{ft}, \tag{3}$$

where $\mathbf{Y}_{nft} = \lambda_{nft}\mathbf{G}_{nf}$ and $\mathbf{Y}_{ft} = \sum_{n=1}^{N}\mathbf{Y}_{nft}$.
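The MWF step above can be sketched numerically. The following toy numpy example (all sizes and the random data are illustrative assumptions, not the paper's setup) builds random positive definite SCMs and PSDs for one time-frequency bin and checks the basic consistency property that the per-source image estimates sum back to the mixture, since $\sum_n \mathbf{Y}_{nft}\mathbf{Y}_{ft}^{-1} = \mathbf{I}$:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 4, 2  # microphones, sources (toy sizes)

def random_spd(m):
    # Random Hermitian positive definite matrix (stand-in for an SCM G_nf).
    a = rng.standard_normal((m, m)) + 1j * rng.standard_normal((m, m))
    return a @ a.conj().T + m * np.eye(m)

G = np.stack([random_spd(M) for _ in range(N)])           # SCMs, (N, M, M)
lam = rng.uniform(0.5, 2.0, size=N)                       # PSDs lambda_nft
x = rng.standard_normal(M) + 1j * rng.standard_normal(M)  # mixture x_ft

# Multichannel Wiener filter: E[x_n | x] = (lam_n G_n) Y^{-1} x,
# where Y = sum_n lam_n G_n is the mixture covariance.
Y = np.einsum("n,nij->ij", lam, G)
images = np.stack([lam[n] * G[n] @ np.linalg.solve(Y, x) for n in range(N)])

# Sanity check: the source-image estimates sum back to the observed mixture.
print(np.allclose(images.sum(axis=0), x))
```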
II-A2 Parameter Estimation
Our goal is to estimate the parameters $\boldsymbol{\lambda} = \{\lambda_{nft}\}_{n,f,t}$ and $\mathbf{G} = \{\mathbf{G}_{nf}\}_{n,f}$ that maximize the log-likelihood given by Eq. (2):

$$\log p(\mathbf{X} \mid \boldsymbol{\lambda}, \mathbf{G}) \stackrel{c}{=} -\sum_{f,t}\left(\mathbf{x}_{ft}^{\mathsf{H}}\mathbf{Y}_{ft}^{-1}\mathbf{x}_{ft} + \log\det\mathbf{Y}_{ft}\right), \tag{4}$$

where $\stackrel{c}{=}$ denotes equality up to a constant. In this paper we use a majorization-minimization (MM) algorithm [16] that iteratively maximizes a lower bound of Eq. (4). As in [14, 15], the closed-form update rule of $\mathbf{G}_{nf}$ was recently found to be given by

$$\mathbf{G}_{nf} \leftarrow \mathbf{P}_{nf}^{-1} \,\#\, \left(\mathbf{G}_{nf}\mathbf{R}_{nf}\mathbf{G}_{nf}\right),$$

where $\mathbf{P}_{nf} = \sum_t \lambda_{nft}\mathbf{Y}_{ft}^{-1}$, $\mathbf{R}_{nf} = \sum_t \lambda_{nft}\mathbf{Y}_{ft}^{-1}\mathbf{x}_{ft}\mathbf{x}_{ft}^{\mathsf{H}}\mathbf{Y}_{ft}^{-1}$, and $\mathbf{A}\,\#\,\mathbf{B} = \mathbf{A}^{\frac{1}{2}}\left(\mathbf{A}^{-\frac{1}{2}}\mathbf{B}\mathbf{A}^{-\frac{1}{2}}\right)^{\frac{1}{2}}\mathbf{A}^{\frac{1}{2}}$ denotes the geometric mean of positive definite matrices.
II-B Source Models
II-B1 Unconstrained Source Model
The unconstrained model directly uses $\lambda_{nft}$ as free parameters. Using the MM algorithm, the multiplicative update (MU) rule of $\lambda_{nft}$ is given by

$$\lambda_{nft} \leftarrow \lambda_{nft}\sqrt{\frac{\mathbf{x}_{ft}^{\mathsf{H}}\mathbf{Y}_{ft}^{-1}\mathbf{G}_{nf}\mathbf{Y}_{ft}^{-1}\mathbf{x}_{ft}}{\mathrm{tr}\left(\mathbf{Y}_{ft}^{-1}\mathbf{G}_{nf}\right)}}.$$
II-B2 NMF-Based Source Model
If the PSDs of a source (e.g., noise and music) have low-rank structure, the PSDs can be factorized as follows [8]:

$$\lambda_{nft} = \sum_{k=1}^{K} w_{nkf} h_{nkt}, \tag{9}$$

where $K$ is the number of bases, $w_{nkf} \ge 0$ is the magnitude of basis $k$ of source $n$ at frequency $f$, and $h_{nkt} \ge 0$ is the activation of basis $k$ of source $n$ at time $t$. Using the MM algorithm, the MU rules of $w_{nkf}$ and $h_{nkt}$ are given by

$$w_{nkf} \leftarrow w_{nkf}\sqrt{\frac{\sum_{t} h_{nkt}\,\mathbf{x}_{ft}^{\mathsf{H}}\mathbf{Y}_{ft}^{-1}\mathbf{G}_{nf}\mathbf{Y}_{ft}^{-1}\mathbf{x}_{ft}}{\sum_{t} h_{nkt}\,\mathrm{tr}\left(\mathbf{Y}_{ft}^{-1}\mathbf{G}_{nf}\right)}}, \qquad
h_{nkt} \leftarrow h_{nkt}\sqrt{\frac{\sum_{f} w_{nkf}\,\mathbf{x}_{ft}^{\mathsf{H}}\mathbf{Y}_{ft}^{-1}\mathbf{G}_{nf}\mathbf{Y}_{ft}^{-1}\mathbf{x}_{ft}}{\sum_{f} w_{nkf}\,\mathrm{tr}\left(\mathbf{Y}_{ft}^{-1}\mathbf{G}_{nf}\right)}}.$$
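The square-root exponent in such MM-derived MU rules is what guarantees a monotone improvement of the objective. The effect can be illustrated on a simplified stand-in, plain single-source, single-channel IS-NMF on a random power spectrogram (a toy analogue under assumed sizes and data, not the multichannel updates themselves):

```python
import numpy as np

rng = np.random.default_rng(1)
F, T, K = 16, 20, 4  # toy sizes; one source, one channel for clarity

P = rng.gamma(1.0, 1.0, size=(F, T))      # observed power spectrogram
W = rng.uniform(0.1, 1.0, size=(F, K))    # basis magnitudes
H = rng.uniform(0.1, 1.0, size=(K, T))    # activations

def is_div(P, V):
    # Itakura-Saito divergence between observation P and model V = W H.
    R = P / V
    return np.sum(R - np.log(R) - 1.0)

V = W @ H
d0 = is_div(P, V)

# One round of MM-based multiplicative updates; the 1/2 exponent for the
# IS divergence makes the objective provably non-increasing.
W *= ((P / V**2) @ H.T / ((1.0 / V) @ H.T)) ** 0.5
V = W @ H
H *= (W.T @ (P / V**2) / (W.T @ (1.0 / V))) ** 0.5
V = W @ H

d1 = is_div(P, V)
print(d1 <= d0)
```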
II-B3 DNN-Based Source Model
To represent the complicated characteristics of the PSDs of a source (e.g., speech), a deep generative model can be used as follows:

$$\lambda_{nft} = u_{nf}\, v_{nt} \left[\boldsymbol{\sigma}^2_{\theta}\left(\mathbf{z}_{nt}\right)\right]_f, \tag{12}$$

where $\boldsymbol{\sigma}^2_{\theta}$ is a nonlinear function (DNN) with parameters $\theta$ that maps a latent variable $\mathbf{z}_{nt} \in \mathbb{R}^{D}$ to a nonnegative spectrum at each time $t$, $[\cdot]_f$ indicates the $f$-th element of a vector, $u_{nf} \ge 0$ is a scaling factor at frequency $f$, and $v_{nt} \ge 0$ is an activation at time $t$.
To update the latent variables $\mathbf{z}_{nt}$, we use Metropolis sampling. A proposal $\mathbf{z}_{nt}^{\mathrm{new}} \sim \mathcal{N}\left(\mathbf{z}_{nt}, \epsilon^2\mathbf{I}\right)$ is accepted with probability $\min(1, r)$, where $r$ is given by

$$r = \frac{p\left(\mathbf{X} \mid \mathbf{z}_{nt}^{\mathrm{new}}\right) p\left(\mathbf{z}_{nt}^{\mathrm{new}}\right)}{p\left(\mathbf{X} \mid \mathbf{z}_{nt}\right) p\left(\mathbf{z}_{nt}\right)},$$

where $p(\mathbf{z}_{nt}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$ is the prior and $\epsilon^2$ is a proposal variance. In practice, we update $\mathbf{z}_{nt}$ several times without updating the other parameters to reduce the computational cost of calculating $\mathbf{Y}_{ft}^{-1}$.
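A minimal sketch of this Metropolis step, with a simple stand-in log-likelihood in place of the DNN-based model and with illustrative names and sizes (only the accept/reject mechanics are taken from the text):

```python
import numpy as np

rng = np.random.default_rng(2)
D = 8  # latent dimensionality (illustrative)

def log_lik(z):
    # Stand-in for log p(X | z); in the paper this is evaluated via the DNN.
    return -0.5 * np.sum((z - 1.0) ** 2)

def log_prior(z):
    # Standard Gaussian prior on the latent variable.
    return -0.5 * np.sum(z ** 2)

def metropolis_step(z, step=0.1):
    # Symmetric Gaussian proposal, accepted with probability min(1, r).
    z_prop = z + step * rng.standard_normal(z.shape)
    log_r = (log_lik(z_prop) + log_prior(z_prop)) - (log_lik(z) + log_prior(z))
    if np.log(rng.uniform()) < log_r:
        return z_prop, True
    return z, False

z = np.zeros(D)
accepted = 0
for _ in range(200):
    z, ok = metropolis_step(z)
    accepted += ok

print(accepted)
```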
In the same way as the NMF-based source model, the MU rules of $u_{nf}$ and $v_{nt}$ are given by

$$u_{nf} \leftarrow u_{nf}\sqrt{\frac{\sum_{t} v_{nt}\left[\boldsymbol{\sigma}^2_{\theta}(\mathbf{z}_{nt})\right]_f\,\mathbf{x}_{ft}^{\mathsf{H}}\mathbf{Y}_{ft}^{-1}\mathbf{G}_{nf}\mathbf{Y}_{ft}^{-1}\mathbf{x}_{ft}}{\sum_{t} v_{nt}\left[\boldsymbol{\sigma}^2_{\theta}(\mathbf{z}_{nt})\right]_f\,\mathrm{tr}\left(\mathbf{Y}_{ft}^{-1}\mathbf{G}_{nf}\right)}}, \qquad
v_{nt} \leftarrow v_{nt}\sqrt{\frac{\sum_{f} u_{nf}\left[\boldsymbol{\sigma}^2_{\theta}(\mathbf{z}_{nt})\right]_f\,\mathbf{x}_{ft}^{\mathsf{H}}\mathbf{Y}_{ft}^{-1}\mathbf{G}_{nf}\mathbf{Y}_{ft}^{-1}\mathbf{x}_{ft}}{\sum_{f} u_{nf}\left[\boldsymbol{\sigma}^2_{\theta}(\mathbf{z}_{nt})\right]_f\,\mathrm{tr}\left(\mathbf{Y}_{ft}^{-1}\mathbf{G}_{nf}\right)}}.$$
II-C Integration of Spatial and Source Models
II-C1 Full-Rank Spatial Covariance Analysis
FCA [4] is obtained by integrating the full-rank spatial model and the unconstrained source model.
II-C2 Multichannel NMF
MNMF is obtained by integrating the NMF-based source model into FCA.
II-C3 MNMF with a Deep Prior
MNMF-DP, specialized for speech enhancement, is obtained by integrating the full-rank spatial model with the DNN- and NMF-based source models representing the speech and noise sources, respectively. Assuming that the source indexed by $n=1$ corresponds to the speech, $\lambda_{1ft}$ is given by Eq. (12) and $\lambda_{nft}$ for $n \ge 2$ is given by Eq. (9).
III Fast Multichannel Source Separation
This section proposes fast versions of FCA, MNMF, and MNMF-DP based on the jointly diagonalizable SCMs.
III-A Jointly Diagonalizable Full-Rank Spatial Model
III-A1 Model Formulation
To reduce the computational cost of the full-rank spatial model, we put a constraint that the SCMs can be jointly diagonalized as follows:

$$\mathbf{Q}_f \mathbf{G}_{nf} \mathbf{Q}_f^{\mathsf{H}} = \mathrm{Diag}\left(\tilde{\mathbf{g}}_{nf}\right),$$

where $\mathbf{Q}_f \in \mathbb{C}^{M \times M}$ is a non-singular matrix called a diagonalizer and $\tilde{\mathbf{g}}_{nf} \in \mathbb{R}_{\ge 0}^{M}$ is a nonnegative vector. The observed spectrum $\mathbf{x}_{ft}$ is projected into a new space $\tilde{\mathbf{x}}_{ft} = \mathbf{Q}_f\mathbf{x}_{ft}$ in which the elements of the projected spectrum are all independent (Fig. 1).
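This constraint can be sanity-checked numerically: SCMs built as $\mathbf{Q}^{-1}\mathrm{Diag}(\tilde{\mathbf{g}}_n)\mathbf{Q}^{-\mathsf{H}}$ remain full rank, yet one shared matrix $\mathbf{Q}$ diagonalizes all of them at once. A toy numpy sketch for one frequency bin (sizes and random data are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
M, N = 4, 3  # microphones, sources (toy sizes)

# Shared non-singular diagonalizer Q and per-source nonnegative vectors g_n.
Q = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
g = rng.uniform(0.1, 1.0, size=(N, M))

# G_n = Q^{-1} Diag(g_n) Q^{-H}: full rank, but jointly diagonalized by Q.
Qinv = np.linalg.inv(Q)
G = np.stack([Qinv @ np.diag(g[n]) @ Qinv.conj().T for n in range(N)])

for n in range(N):
    D = Q @ G[n] @ Q.conj().T       # should recover Diag(g_n)
    off = D - np.diag(np.diag(D))   # off-diagonal residue
    print(np.allclose(off, 0), np.allclose(np.diag(D).real, g[n]))
```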
III-A2 Parameter Estimation
Under this constraint, the log-likelihood in Eq. (4) reduces to

$$\log p\left(\mathbf{X} \mid \boldsymbol{\lambda}, \mathbf{Q}, \tilde{\mathbf{g}}\right) \stackrel{c}{=} \sum_{f,t}\left(2\log\left|\det\mathbf{Q}_f\right| - \sum_{m=1}^{M}\left(\frac{\left|\tilde{x}_{ftm}\right|^2}{\tilde{y}_{ftm}} + \log\tilde{y}_{ftm}\right)\right),$$

where $\tilde{\mathbf{x}}_{ft} = \mathbf{Q}_f\mathbf{x}_{ft} = [\tilde{x}_{ft1}, \ldots, \tilde{x}_{ftM}]^{\mathsf{T}}$, $|\cdot|^2$ indicates the element-wise absolute square, and $\tilde{y}_{ftm} = \sum_{n=1}^{N}\lambda_{nft}\tilde{g}_{nfm}$.
The diagonalizer $\mathbf{Q}_f$ can be updated with the iterative projection (IP) method [13]; each row vector $\mathbf{q}_{fm}^{\mathsf{H}}$ of $\mathbf{Q}_f$ is updated as

$$\mathbf{q}_{fm} \leftarrow \left(\mathbf{Q}_f\mathbf{V}_{fm}\right)^{-1}\mathbf{e}_m, \qquad \mathbf{q}_{fm} \leftarrow \frac{\mathbf{q}_{fm}}{\sqrt{\mathbf{q}_{fm}^{\mathsf{H}}\mathbf{V}_{fm}\mathbf{q}_{fm}}},$$

where $\mathbf{V}_{fm} = \frac{1}{T}\sum_{t}\frac{\mathbf{x}_{ft}\mathbf{x}_{ft}^{\mathsf{H}}}{\tilde{y}_{ftm}}$ and $\mathbf{e}_m$ is a one-hot vector whose $m$-th element is 1. A diagonalizer $\mathbf{Q}_f$ is estimated so that the components (channels) of $\tilde{\mathbf{x}}_{ft}$ become independent. In IVA [13] and ILRMA [12] under a determined condition ($M = N$), a demixing matrix is estimated so that the components (sources) of the demixed spectra become independent. In either case, the characteristics of the components (e.g., low-rankness in the NMF-based source model) represented by the source model are taken into account. This implies that our method could work as fast as ILRMA even in an underdetermined condition ($M < N$) while keeping the full-rank spatial modeling ability.
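One IP sweep for a single frequency bin can be sketched as follows. The model powers $\tilde{y}$ are taken as given random values, and all sizes and data are illustrative; only the two-step update (solve against $\mathbf{Q}\mathbf{V}_m$, then normalize so that $\mathbf{q}_m^{\mathsf{H}}\mathbf{V}_m\mathbf{q}_m = 1$) follows the text:

```python
import numpy as np

rng = np.random.default_rng(4)
M, T = 3, 50  # channels, frames at one frequency bin (toy sizes)

X = rng.standard_normal((T, M)) + 1j * rng.standard_normal((T, M))  # x_t rows
Q = np.eye(M, dtype=complex)            # diagonalizer, identity init
y = rng.uniform(0.5, 2.0, size=(T, M))  # assumed model powers y~_tm

norms = []
# One sweep of iterative projection: update each row q_m^H of Q in turn.
for m in range(M):
    # Weighted covariance V_m = (1/T) sum_t x_t x_t^H / y_tm.
    V = (X.T * (1.0 / y[:, m])) @ X.conj() / T
    q = np.linalg.solve(Q @ V, np.eye(M)[:, m])   # (Q V_m)^{-1} e_m
    q = q / np.sqrt((q.conj() @ V @ q).real)      # normalize: q^H V_m q = 1
    norms.append((q.conj() @ V @ q).real)
    Q[m] = q.conj()                               # store m-th row as q^H

print(np.allclose(norms, 1.0))
```

The normalization keeps the scale of each projected channel fixed, which is the same device used in AuxIVA-style IP updates.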
III-B Source Models
III-B1 Unconstrained Source Model
Using the MM algorithm for IS-NMF, the MU rule of $\lambda_{nft}$ is given by

$$\lambda_{nft} \leftarrow \lambda_{nft}\sqrt{\frac{\sum_{m=1}^{M}\tilde{g}_{nfm}\left|\tilde{x}_{ftm}\right|^2 / \tilde{y}_{ftm}^2}{\sum_{m=1}^{M}\tilde{g}_{nfm} / \tilde{y}_{ftm}}}.$$
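Because the transformed channels are independent, this update involves only element-wise operations and small sums, which is the source of the speedup over the matrix-inversion-heavy updates of Section II. A toy numpy sketch (illustrative sizes and random data standing in for $|\tilde{x}|^2$ and $\tilde{g}$) applies one such MU step and checks that the diagonal-domain negative log-likelihood does not increase:

```python
import numpy as np

rng = np.random.default_rng(5)
N, F, T, M = 2, 4, 10, 3  # toy sizes

Xt2 = rng.gamma(1.0, 1.0, size=(F, T, M))    # |x~_ftm|^2, transformed powers
g = rng.uniform(0.1, 1.0, size=(N, F, M))    # diagonal SCM entries g~_nfm
lam = rng.uniform(0.5, 2.0, size=(N, F, T))  # PSDs lambda_nft

def nll(lam):
    # Data term of the negative log-likelihood in the diagonalized domain.
    y = np.einsum("nft,nfm->ftm", lam, g)    # y~_ftm = sum_n lam_nft g~_nfm
    return np.sum(Xt2 / y + np.log(y))

d0 = nll(lam)

# One MM-based multiplicative update of lambda (square-root exponent).
y = np.einsum("nft,nfm->ftm", lam, g)
num = np.einsum("nfm,ftm->nft", g, Xt2 / y**2)
den = np.einsum("nfm,ftm->nft", g, 1.0 / y)
lam *= np.sqrt(num / den)

d1 = nll(lam)
print(d1 <= d0)
```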
III-B2 NMF-Based Source Model
Similarly, the MU rules of $w_{nkf}$ and $h_{nkt}$ included in Eq. (9) are given by

$$w_{nkf} \leftarrow w_{nkf}\sqrt{\frac{\sum_{t,m} h_{nkt}\,\tilde{g}_{nfm}\left|\tilde{x}_{ftm}\right|^2/\tilde{y}_{ftm}^2}{\sum_{t,m} h_{nkt}\,\tilde{g}_{nfm}/\tilde{y}_{ftm}}}, \qquad
h_{nkt} \leftarrow h_{nkt}\sqrt{\frac{\sum_{f,m} w_{nkf}\,\tilde{g}_{nfm}\left|\tilde{x}_{ftm}\right|^2/\tilde{y}_{ftm}^2}{\sum_{f,m} w_{nkf}\,\tilde{g}_{nfm}/\tilde{y}_{ftm}}}.$$
III-B3 DNN-Based Source Model
To update the latent variables $\mathbf{z}_{nt}$ included in Eq. (12), we use Metropolis sampling. A proposal $\mathbf{z}_{nt}^{\mathrm{new}} \sim \mathcal{N}\left(\mathbf{z}_{nt}, \epsilon^2\mathbf{I}\right)$ is accepted with probability $\min(1, r)$, where $r$ is given by

$$r = \frac{p\left(\mathbf{X} \mid \mathbf{z}_{nt}^{\mathrm{new}}\right) p\left(\mathbf{z}_{nt}^{\mathrm{new}}\right)}{p\left(\mathbf{X} \mid \mathbf{z}_{nt}\right) p\left(\mathbf{z}_{nt}\right)},$$

where the likelihood is evaluated with $\tilde{y}_{ftm} = \tilde{y}_{ftm}^{\setminus n} + \lambda_{nft}\tilde{g}_{nfm}$, and $\tilde{y}_{ftm}^{\setminus n} = \sum_{n' \neq n}\lambda_{n'ft}\tilde{g}_{n'fm}$ is a reconstruction without the component of source $n$. As in the NMF-based source model, the MU rules of $u_{nf}$ and $v_{nt}$ are given by

$$u_{nf} \leftarrow u_{nf}\sqrt{\frac{\sum_{t,m} v_{nt}\left[\boldsymbol{\sigma}^2_{\theta}(\mathbf{z}_{nt})\right]_f\tilde{g}_{nfm}\left|\tilde{x}_{ftm}\right|^2/\tilde{y}_{ftm}^2}{\sum_{t,m} v_{nt}\left[\boldsymbol{\sigma}^2_{\theta}(\mathbf{z}_{nt})\right]_f\tilde{g}_{nfm}/\tilde{y}_{ftm}}}, \qquad
v_{nt} \leftarrow v_{nt}\sqrt{\frac{\sum_{f,m} u_{nf}\left[\boldsymbol{\sigma}^2_{\theta}(\mathbf{z}_{nt})\right]_f\tilde{g}_{nfm}\left|\tilde{x}_{ftm}\right|^2/\tilde{y}_{ftm}^2}{\sum_{f,m} u_{nf}\left[\boldsymbol{\sigma}^2_{\theta}(\mathbf{z}_{nt})\right]_f\tilde{g}_{nfm}/\tilde{y}_{ftm}}}.$$
III-C Integration of Spatial and Source Models
III-C1 FastFCA
The fast version of FCA (FastFCA) is obtained by integrating the jointly diagonalizable full-rank spatial model and the unconstrained source model. While the EM algorithm with the FPI step was originally used in [5, 23], in this paper we use the MM algorithm with the convergence-guaranteed IP step.
III-C2 FastMNMF
The fast version of MNMF (FastMNMF) is obtained by integrating the NMF-based source model into FastFCA.
III-C3 FastMNMF-DP
The fast version of MNMF-DP (FastMNMF-DP) is obtained by integrating the jointly diagonalizable full-rank spatial model with the DNN- and NMF-based source models representing the speech and noise sources, respectively.
IV Evaluation
This section evaluates the performance and computational efficiency of the proposed methods in a speech enhancement task.
IV-A Experimental Conditions
100 simulated noisy speech signals sampled at 16 kHz were randomly selected from the evaluation dataset of CHiME3 [17]. These data were simulated as if recorded by six microphones attached to a tablet device. Five channels ($M = 5$), excluding the second channel on the rear of the tablet, were used. The short-time Fourier transform with a window length of 1024 points and a shifting interval of 256 points was used. To evaluate the performance of speech enhancement, the signal-to-distortion ratio (SDR) was measured [18, 19]. To evaluate the computational efficiency, the elapsed time per iteration for processing 8 sec of data was measured on an Intel Xeon W-2145 (3.70 GHz) CPU or an NVIDIA GeForce GTX 1080 Ti GPU.
FastFCA (Section III-C1), FastMNMF (Section III-C2), and FastMNMF-DP (Section III-C3) based on the jointly diagonalizable SCMs were compared with FCA (Section II-C1), MNMF (Section II-C2), and MNMF-DP (Section II-C3) based on the unconstrained SCMs, where all methods used the MM algorithms (with the IP step for estimating $\mathbf{Q}_f$) described in this paper. The original FCA [4] and FastFCA [5, 23] based on the EM algorithms (with the FPI step for estimating the SCMs) were also tested. For comparison, ILRMA [12] based on the rank-1 SCMs was tested.
The number of sources was set as $N \in \{2, 3, 4, 5\}$, except for ILRMA, which was used only in a determined condition ($N = M = 5$). The number of iterations was 100. For the NMF-based source model, the number of bases was set to $K = 4$, $16$, or $64$. For the DNN-based source model, the latent variables $\mathbf{z}_{nt}$ were updated 30 times per iteration with a fixed proposal variance. The parameters $\theta$ were trained in advance from about 15 hours of clean speech data included in the WSJ-0 corpus [20], as described in [10]. More specifically, a DNN-based decoder that generates $\boldsymbol{\sigma}^2_{\theta}(\mathbf{z}_{nt})$ from $\mathbf{z}_{nt}$ and a DNN-based encoder that infers $\mathbf{z}_{nt}$ from a speech spectrum were trained jointly in a variational autoencoding manner [22]. The SCM of speech was initialized as the average of the observed SCMs, and the SCMs of noise were initialized as the identity matrices. $\mathbf{Q}_f$ and $\tilde{\mathbf{g}}_{nf}$ were initialized with the spectral decomposition of these initial SCMs. $\mathbf{z}_{nt}$ was initialized by feeding the observed spectra to the encoder.
Table I(a): Elapsed time per iteration [s] on the CPU.

| # of sources | FCA / FastFCA | FCA / FastFCA (EM) | ILRMA (K=4) | ILRMA (K=16) | ILRMA (K=64) | MNMF / FastMNMF (K=4) | MNMF / FastMNMF (K=16) | MNMF / FastMNMF (K=64) | MNMF-DP / FastMNMF-DP (K=4) | MNMF-DP / FastMNMF-DP (K=16) | MNMF-DP / FastMNMF-DP (K=64) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 2.1 / 0.49 | 3.3 / 0.43 | - | - | - | 4.9 / 0.70 | 5.0 / 0.79 | 5.4 / 1.3 | 11 / 1.7 | 11 / 1.7 | 11 / 1.9 |
| 3 | 2.6 / 0.59 | 4.0 / 0.47 | - | - | - | 5.9 / 0.78 | 6.0 / 0.91 | 6.5 / 1.7 | 13 / 1.8 | 13 / 1.8 | 13 / 2.3 |
| 4 | 3.2 / 0.70 | 4.7 / 0.56 | - | - | - | 6.8 / 0.85 | 7.0 / 1.1 | 7.7 / 2.2 | 15 / 1.9 | 15 / 2.0 | 15 / 2.8 |
| 5 | 3.7 / 0.81 | 5.3 / 0.63 | 0.53 | 0.62 | 1.0 | 7.8 / 1.0 | 8.0 / 1.2 | 8.9 / 2.8 | 17 / 2.0 | 17 / 2.2 | 17 / 3.4 |
Table I(b): Elapsed time per iteration [s] on the GPU.

| # of sources | FCA / FastFCA | FCA / FastFCA (EM) | ILRMA (K=4) | ILRMA (K=16) | ILRMA (K=64) | MNMF / FastMNMF (K=4) | MNMF / FastMNMF (K=16) | MNMF / FastMNMF (K=64) | MNMF-DP / FastMNMF-DP (K=4) | MNMF-DP / FastMNMF-DP (K=16) | MNMF-DP / FastMNMF-DP (K=64) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 1.6 / 0.16 | 2.0 / 0.15 | - | - | - | 3.0 / 0.38 | 3.0 / 0.55 | 3.2 / 1.2 | 7.0 / 0.90 | 7.0 / 0.98 | 7.1 / 1.3 |
| 3 | 2.3 / 0.19 | 2.8 / 0.17 | - | - | - | 4.2 / 0.43 | 4.2 / 0.68 | 4.5 / 1.7 | 9.3 / 0.94 | 9.3 / 1.1 | 9.5 / 1.8 |
| 4 | 3.0 / 0.22 | 3.6 / 0.17 | - | - | - | 5.3 / 0.46 | 5.4 / 0.81 | 5.7 / 2.2 | 12 / 0.99 | 12 / 1.2 | 12 / 2.3 |
| 5 | 3.7 / 0.25 | 4.5 / 0.19 | 0.52 | 0.61 | 1.0 | 6.6 / 0.51 | 6.7 / 0.94 | 7.1 / 2.7 | 14 / 1.0 | 14 / 1.4 | 14 / 2.8 |
IV-B Experimental Results
Tables I(a) and I(b) list the elapsed times per iteration and Table II lists the average SDRs. The proposed FastFCA slightly outperformed the original FastFCA [5, 23] in all measures because the IP method and the FPI method calculate the matrix inversion only once and twice, respectively, for updating $\mathbf{Q}_f$, and the MM algorithm converges faster than the EM algorithm. FastFCA, FastMNMF, and FastMNMF-DP were an order of magnitude faster than, and performed as well as or even better than, their original versions. In general, more than two positive definite matrices cannot be exactly jointly diagonalized. If $N > 2$, the fast versions are thus inferior to the original versions in terms of the DOF, but the restriction of the DOF of the spatial model proved to be effective for avoiding bad local optima. If $N = 2$, the DOFs of the fast versions are exactly the same as those of the original versions in theory, as described in the stereo FastFCA [23], but the fast versions were less sensitive to the initialization in our experiment. One reason would be that while only the SCM of speech was initialized to a reasonable value in the original versions, the initialization of $\mathbf{Q}_f$ and $\tilde{\mathbf{g}}_{nf}$ based on the initial SCMs also contributed to initializing the SCM of noise in the fast versions. In the determined condition ($N = M = 5$), the best condition for ILRMA, FastMNMF was as fast as ILRMA and outperformed it.
V Conclusion
This paper presented a full-rank spatial model based on the jointly diagonalizable SCMs of sound sources and its application to existing methods such as FCA, MNMF, and MNMF-DP. For such fast versions, we proposed a general convergence-guaranteed MM algorithm that uses the IP method for estimating the SCMs. We experimentally showed that our approach is effective for improving both the separation performance and the computational efficiency.

One important direction is to develop an online FastMNMF-DP for real-time noisy speech recognition because the real-time factor of FastMNMF-DP could be less than 1. We also plan to simultaneously consider the jointly diagonalizable full-rank spatial and frequency covariance matrices of sound sources, as suggested in [15].
[1] A. Narayanan and D. Wang, “Ideal ratio mask estimation using deep neural networks for robust speech recognition,” in ICASSP, 2013, pp. 7092–7096.
[2] H. Erdogan et al., “Improved MVDR beamforming using single-channel mask prediction networks,” in Interspeech, 2016, pp. 1981–1985.
[3] J. Heymann et al., “Neural network based spectral mask estimation for acoustic beamforming,” in ICASSP, 2016, pp. 196–200.
[4] N. Q. K. Duong et al., “Under-determined reverberant audio source separation using a full-rank spatial covariance model,” IEEE TASLP, vol. 18, no. 7, pp. 1830–1840, 2010.
[5] N. Ito and T. Nakatani, “FastFCA-AS: Joint diagonalization based acceleration of full-rank spatial covariance analysis for separating any number of sources,” in IWAENC, 2018, pp. 151–155.
[6] A. Ozerov and C. Févotte, “Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation,” IEEE TASLP, vol. 18, no. 3, pp. 550–563, 2010.
[7] S. Arberet et al., “Nonnegative matrix factorization and spatial covariance model for under-determined reverberant audio source separation,” in ISSPA, 2010, pp. 1–4.
[8] H. Sawada et al., “Multichannel extensions of non-negative matrix factorization with complex-valued data,” IEEE TASLP, vol. 21, no. 5, pp. 971–982, 2013.
[9] Y. Bando et al., “Statistical speech enhancement based on probabilistic integration of variational autoencoder and non-negative matrix factorization,” in ICASSP, 2018, pp. 716–720.
[10] K. Sekiguchi et al., “Bayesian multichannel speech enhancement with a deep speech prior,” in APSIPA, 2018, pp. 1233–1239.
[11] S. Leglaive et al., “Semi-supervised multichannel speech enhancement with variational autoencoders and non-negative matrix factorization,” in ICASSP, 2019. [Online]. Available: https://arxiv.org/pdf/1811.06713.pdf
[12] D. Kitamura et al., “Determined blind source separation unifying independent vector analysis and nonnegative matrix factorization,” IEEE TASLP, vol. 24, no. 9, pp. 1626–1641, 2016.
[13] N. Ono, “Stable and fast update rules for independent vector analysis based on auxiliary function technique,” in WASPAA, 2011, pp. 189–192.
[14] K. Yoshii, “Correlated tensor factorization for audio source separation,” in ICASSP, 2018, pp. 731–735.
[15] K. Yoshii et al., “Independent low-rank tensor analysis for audio source separation,” in EUSIPCO, 2018, pp. 1671–1675.
[16] M. Nakano et al., “Convergence-guaranteed multiplicative algorithms for nonnegative matrix factorization with β-divergence,” in MLSP, 2010, pp. 283–288.
[17] J. Barker et al., “The third ‘CHiME’ speech separation and recognition challenge: Analysis and outcomes,” Computer Speech & Language, vol. 46, pp. 605–626, 2017.
[18] E. Vincent et al., “Performance measurement in blind audio source separation,” IEEE TASLP, vol. 14, no. 4, pp. 1462–1469, 2006.
[19] C. Raffel et al., “mir_eval: A transparent implementation of common MIR metrics,” in ISMIR, 2014, pp. 367–372.
[20] J. Garofalo et al., “CSR-I (WSJ0) complete,” Linguistic Data Consortium, Philadelphia, 2007.
[21] S. Leglaive et al., “A variance modeling framework based on variational autoencoders for speech enhancement,” in MLSP, 2018.
[22] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” in ICLR, 2014.
[23] N. Ito et al., “FastFCA: A joint diagonalization based fast algorithm for audio source separation using a full-rank spatial covariance model,” in EUSIPCO, 2018, pp. 1667–1671.