Speaker Adapted Beamforming for Multi-Channel Automatic Speech Recognition

Abstract

This paper presents, in the context of multi-channel ASR, a method to adapt a mask-based, statistically optimal beamforming approach to a speaker of interest. The beamforming vector of the statistically optimal beamformer is computed by utilizing speech and noise masks, which are estimated by a neural network. The proposed adaptation approach is based on the integration of the beamformer, which includes the mask estimation network, and the acoustic model of the ASR system. This allows the training error to be propagated from the acoustic modeling cost function through the beamforming operation and into the mask estimation network. By using the results of a first pass recognition and keeping all other parameters fixed, the mask estimation network can therefore be fine-tuned by retraining. Utterances of a speaker of interest can thus be used, in a two-pass approach, to optimize the beamforming for the speech characteristics of that specific speaker. It is shown that this approach improves the ASR performance of a state-of-the-art multi-channel ASR system on the CHiME-4 data. Furthermore, the effect of the adaptation on the estimated speech masks is discussed.

Tobias Menne, Ralf Schlüter, Hermann Ney
Human Language Technology and Pattern Recognition, Computer Science Department,
RWTH Aachen University, Aachen, Germany
{menne, schlueter, ney}@cs.rwth-aachen.de

Index Terms: robust ASR, multi-channel ASR, speaker adaptation, acoustic beamforming, CHiME-4

1 Introduction

The performance of automatic speech recognition (ASR) systems has improved significantly over the last decade. These improvements have especially been driven by the utilization of deep learning techniques [deeNeuNetForAcoModInSpeRec]. Nevertheless, the performance of systems dealing with realistic noisy and far-field scenarios is still significantly worse than the performance of close talking systems on clean recordings [anOveOfNoiRobAutSpeRec, strForDisSpeRecInRevEnv]. Multi-channel ASR systems are often used in those scenarios to improve recognition robustness. In these systems, the effect of noise, reverberation and speech overlap is mitigated by utilizing spatial information through beamforming [micArrSigProTecAndApp].

Usually, beamforming is done in a separate preprocessing step, and the ASR system is applied to the enhanced signal obtained at the output of that preprocessing [theRwtUpbForSysComForThe4thChiChaEva]. A general formulation for beamforming is the filter-and-sum approach [acoFilAndSumBeaByAdaPriComAna, beaAVerAppToSpatFil], where the single channels are summed up after applying a separate linear filter to each one. Usually those filters are derived such that an objective criterion on the signal level, such as the signal-to-noise ratio (SNR), is optimized. Popular approaches are the delay-and-sum (DAS) [micArrSigProTecAndApp], minimum variance distortionless response (MVDR) [impMvdBeaUsiSinChaMasPreNet] and generalized eigenvalue (GEV) [bliAcoBeaBasOnGenEigDec] beamforming methods. Most systems submitted to the CHiME and REVERB challenges [theThiChiSpeSepAndRecChaDatTasAndBas, anAnaOfEnvMicAndDatSimMisInRobSpeRec, theRevChaAComEvaFraForDerAndRecOfRevSpe] follow one or more of these approaches.

The objective used to optimize the preprocessing thus differs from the objective of the acoustic model training. Even before the introduction of deep neural network (DNN) hybrid systems in ASR, optimizing the preprocessing towards the goal of speech recognition was proposed, e.g. in [likMaxBeaForRobHanSpeRec]. The success of deep learning also motivated the integration of the beamforming operation into the acoustic model. In [deeBeaNetForMulChaSpeRec, deeLonShoTerMemAdaBeaNetForMulRobSpeRec], for example, the filters of the filter-and-sum beamforming are estimated by a neural network from features derived from the multi-channel input signal. Even learning the complete multi-channel preprocessing, starting from the raw time signal, has been shown to work [speLocAndMicSpaInvAcoModFroRawMulWav, facSpaAndSpeMulRawWavCld, neuNetAdaBeaForRobMulSpeRec]. The advantage of those approaches is that the preprocessing is not optimized for a proxy measure like the SNR at the output of the beamformer, but directly towards the acoustic model training criterion. Thus far, however, a very large amount of training data is necessary to obtain satisfying performance with those approaches.

Lately, the performance of statistically optimal beamformers has been improved by using neural networks to estimate speech and noise masks, which are then used to compute the beamforming vectors [blsSupGevBeaFroForThe3rdChiChal, robMvdBeaUsiTimFreMasForOnlOffAsrInNoi, impMvdBeaUsiSinChaMasPreNet]. This approach has worked well for many submissions to the CHiME challenge [theRwtUpbForSysComForThe4thChiChaEva, widResBlsNetWitDisSpeAdaForRobSpeRec, theUstIflSysForChiCha]. One problem of that approach is the need for target masks to train the mask estimator, which usually requires stereo data (the noisy signal and its respective clean signal). Since this type of data is much more difficult to collect than the noisy data alone, training of the mask estimator is usually done on simulated signals, which can lead to a mismatch between training and test data. To solve this problem, the authors of [beaEndToEndTraOfABeaSupMulChaAsrSys] proposed to integrate the mask-based, statistically optimal beamforming with the acoustic modeling of the ASR system. This enables the propagation of the training error all the way through the acoustic model and the mask estimator network in the preprocessing. Therefore, the mask estimator can be trained based on the training criterion of the acoustic model.

In this paper, the approach of integrating the mask-based, statistically optimal beamformer with the acoustic model is utilized to adapt the mask estimation to the speech characteristics of a speaker of interest in a two-pass recognition approach.

The rest of the paper is organized as follows. An overview of the integrated system is given in Section 2, together with an alternative approach to [beaEndToEndTraOfABeaSupMulChaAsrSys] for the propagation of the gradients through the eigenvalue problem of the beamformer. Section 3 describes the experimental setup of a state-of-the-art system for the CHiME-4 speech recognition task, followed by the experimental results in Section 4.

2 System overview

Figure 1: Overview of the integrated system. The grey blocks indicate modules with trainable parameters.

The system used in this work integrates the acoustic beamformer, usually called the front-end, with the acoustic model of the ASR system, usually called the back-end, very similarly to the integration described in [beaEndToEndTraOfABeaSupMulChaAsrSys]. Figure 1 gives an overview of the integrated system. $\mathbf{X}_{t,f}$ is the input in the short-time Fourier transform (STFT) domain, recorded from an array of $M$ microphones. It consists of a speech component $\mathbf{S}_{t,f}$ and a noise component $\mathbf{N}_{t,f}$:

$$\mathbf{X}_{t,f} = \mathbf{S}_{t,f} + \mathbf{N}_{t,f} \qquad (1)$$

where $\mathbf{X}_{t,f}, \mathbf{S}_{t,f}, \mathbf{N}_{t,f} \in \mathbb{C}^{M}$, $t$ is the time frame index and $f$ is the frequency bin index.

The main difference to the system introduced in [beaEndToEndTraOfABeaSupMulChaAsrSys] is described in Section 2.3, whereas the acoustic beamformer and the acoustic model are described in Sections 2.1 and 2.2, respectively. During a first pass decoding, a hidden Markov model (HMM) state sequence is obtained for the input signal $\mathbf{X} \in \mathbb{C}^{T \times F \times M}$, where $T$ and $F$ are the number of time frames and frequency bins of the signal. Section 2.4 describes the utilization of this state sequence to adapt the acoustic beamformer to a certain speaker.

2.1 GEV beamformer

The main purpose of the front-end is to denoise the input signal. Here this is achieved by acoustic beamforming [acoFilAndSumBeaByAdaPriComAna, beaAVerAppToSpatFil]:

$$\hat{S}_{t,f} = \mathbf{w}_f^{\mathsf{H}} \mathbf{X}_{t,f} \qquad (2)$$

where $\hat{S}_{t,f}$ is an estimate of the speech component, obtained by applying the beamforming vector $\mathbf{w}_f \in \mathbb{C}^{M}$, and $(\cdot)^{\mathsf{H}}$ denotes the Hermitian transpose.

For this work we use the GEV beamformer with blind analytic normalization (BAN), as described in [bliAcoBeaBasOnGenEigDec] and also used in [beaEndToEndTraOfABeaSupMulChaAsrSys]. The beamforming vector of the GEV beamformer is derived by maximizing the a posteriori SNR:

$$\mathbf{w}_{\mathrm{GEV},f} = \operatorname*{arg\,max}_{\mathbf{w}_f} \frac{\mathbf{w}_f^{\mathsf{H}} \boldsymbol{\Phi}_{SS,f}\, \mathbf{w}_f}{\mathbf{w}_f^{\mathsf{H}} \boldsymbol{\Phi}_{NN,f}\, \mathbf{w}_f} \qquad (3)$$

where $\boldsymbol{\Phi}_{SS,f}$ and $\boldsymbol{\Phi}_{NN,f}$ are the spatial covariance matrices of speech and noise, respectively. This results in the generalized eigenvalue problem

$$\boldsymbol{\Phi}_{SS,f}\, \mathbf{w}_{\mathrm{GEV},f} = \lambda\, \boldsymbol{\Phi}_{NN,f}\, \mathbf{w}_{\mathrm{GEV},f} \qquad (4)$$

with $\mathbf{w}_{\mathrm{GEV},f}$ being the eigenvector corresponding to the largest eigenvalue $\lambda$.

The spatial covariance matrices $\boldsymbol{\Phi}_{\nu\nu,f}$ for $\nu \in \{S, N\}$ are computed by applying a mask $M^{\nu}_{t,f}$ to the recorded multi-channel signal:

$$\boldsymbol{\Phi}_{\nu\nu,f} = \sum_{t=1}^{T} M^{\nu}_{t,f}\, \mathbf{X}_{t,f}\, \mathbf{X}_{t,f}^{\mathsf{H}} \qquad (5)$$

A mask estimating neural network is used to estimate $M^{S}_{t,f}$ and $M^{N}_{t,f}$. For both speech and noise, one mask is estimated for every channel; $M^{\nu}_{t,f}$ is then computed as the median mask, which contains the element-wise median of the channel dependent masks, as described e.g. in [blsSupGevBeaFroForThe3rdChiChal].

The BAN post-filter, as described in [bliAcoBeaBasOnGenEigDec], is a frequency dependent scaling of the GEV beamforming vector, such that the final beamforming vector used here is:

$$\mathbf{w}_{\mathrm{GEV+BAN},f} = g_f\, \mathbf{w}_{\mathrm{GEV},f} \qquad (6)$$

with $g_f$ being the scaling factor described in [bliAcoBeaBasOnGenEigDec].
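The following NumPy sketch illustrates the mask-based GEV+BAN front-end of Equations 2 to 6 for a single utterance. It is not the authors' implementation: it uses SciPy's generalized eigensolver instead of the QR-algorithm of Section 2.3, the function name and the diagonal loading term are illustrative choices, and the BAN gain follows one common formulation of [bliAcoBeaBasOnGenEigDec].

```python
import numpy as np
from scipy.linalg import eigh

def gev_ban_beamformer(X, mask_s, mask_n):
    """Mask-based GEV+BAN beamforming of one utterance.

    X:      multi-channel STFT, shape (T, F, M)
    mask_s: speech mask after the median over channels, shape (T, F)
    mask_n: noise mask after the median over channels, shape (T, F)
    Returns the beamformed STFT, shape (T, F).
    """
    T, F, M = X.shape
    S_hat = np.zeros((T, F), dtype=X.dtype)
    for f in range(F):
        Xf = X[:, f, :]                                              # (T, M)
        # Eq. (5): mask-weighted spatial covariance matrices
        phi_ss = np.einsum('t,tm,tn->mn', mask_s[:, f], Xf, Xf.conj())
        phi_nn = np.einsum('t,tm,tn->mn', mask_n[:, f], Xf, Xf.conj())
        phi_nn += 1e-6 * np.real(np.trace(phi_nn)) / M * np.eye(M)   # keep invertible
        # Eq. (3)/(4): principal generalized eigenvector (eigh sorts ascending)
        _, vecs = eigh(phi_ss, phi_nn)
        w = vecs[:, -1]
        # Eq. (6): BAN post-filter, one common formulation of the gain g_f
        num = np.sqrt(np.real(w.conj() @ phi_nn @ phi_nn @ w) / M)
        den = np.real(w.conj() @ phi_nn @ w) + 1e-12
        w = (num / den) * w
        # Eq. (2): filter-and-sum output
        S_hat[:, f] = Xf @ w.conj()
    return S_hat
```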

2.2 Acoustic model

The acoustic model is a bidirectional long short-term memory (BLSTM) hybrid model using log-mel filterbank features as input. Apart from the features, the training pipeline is the same as for the speaker independent model described in [theRwtUpbForSysComForThe4thChiChaEva].

2.3 Beamformer integration into acoustic model

Training of the integrated system presented in Figure 1 is done according to standard error back propagation. The gradient computation for the propagation through the acoustic model, feature extraction, linear filtering of the beamformer, BAN and mask estimator network is straightforward. To propagate the gradient through the computation of the principal eigenvector of

$$\boldsymbol{\Phi}_{NN,f}^{-1}\, \boldsymbol{\Phi}_{SS,f} \qquad (7)$$

as required for computing the beamforming vector according to Equation 4, the derivatives of the eigenvalue problem w.r.t. $\boldsymbol{\Phi}_{SS,f}$ and $\boldsymbol{\Phi}_{NN,f}$ are derived in [optNeuNetSupAcoBeaByAlgDif] and used in [beaEndToEndTraOfABeaSupMulChaAsrSys].

In contrast, here the principal eigenvector of Equation 7 is approximated by applying the QR-algorithm as presented in [theQrTraAUniAnaToTheLrTraPar1]. A matrix $\mathbf{A}^{(i)}$ is decomposed by the QR-decomposition into a product of a unitary matrix $\mathbf{Q}^{(i)}$ and an upper triangular matrix $\mathbf{R}^{(i)}$:

$$\mathbf{A}^{(i)} = \mathbf{Q}^{(i)}\, \mathbf{R}^{(i)} \qquad (8)$$

With $i$ being the iteration index, $\mathbf{A}^{(i+1)}$ is then computed as

$$\mathbf{A}^{(i+1)} = \mathbf{R}^{(i)}\, \mathbf{Q}^{(i)} \qquad (9)$$

It is shown in [theQrTraAUniAnaToTheLrTraPar1] that $\mathbf{A}^{(i)}$ converges to an upper triangular matrix as $i \to \infty$. The diagonal of $\mathbf{A}^{(i)}$ then contains the eigenvalues of $\mathbf{A}^{(0)}$ and the accumulated product $\prod_{j} \mathbf{Q}^{(j)}$ contains the respective eigenvectors. This QR-algorithm is used here to approximate the principal eigenvector of $\boldsymbol{\Phi}_{NN,f}^{-1}\, \boldsymbol{\Phi}_{SS,f}$ by setting

$$\mathbf{A}^{(0)} = \boldsymbol{\Phi}_{NN,f}^{-1}\, \boldsymbol{\Phi}_{SS,f} \qquad (10)$$

The algorithmic differentiation of the QR decomposition is outlined in [onEvaHigOrdDerOfTheQrDecOfTalMatWitFulColRanInForAndRevModAlgDif] and applied here in the error back propagation.
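A minimal PyTorch sketch of this eigenvector approximation is given below. It is a sketch under assumptions, not the authors' code: the function name, the diagonal loading of $\boldsymbol{\Phi}_{NN,f}$ and the selection of the principal column via the largest diagonal entry of $\mathbf{A}^{(i)}$ are illustrative choices. Since torch.linalg.qr is differentiable, gradients from the acoustic model loss can flow through the beamforming vector back into the mask estimator; the number of iterations is fixed to 5 in the experiments of Section 3.

```python
import torch

def principal_eigvec_qr(phi_ss, phi_nn, iterations=5):
    """Approximate the principal eigenvector of phi_nn^{-1} phi_ss (Eq. 7)
    with a fixed number of QR iterations (Eqs. 8-10).

    phi_ss, phi_nn: complex spatial covariance matrices of shape (M, M).
    Returns an (M,) beamforming vector in a differentiable way.
    """
    M = phi_nn.shape[-1]
    eye = torch.eye(M, dtype=phi_nn.dtype, device=phi_nn.device)
    # Eq. (10): A^(0) = phi_nn^{-1} phi_ss, with a small diagonal load
    A = torch.linalg.solve(phi_nn + 1e-6 * eye, phi_ss)
    Q_acc = eye
    for _ in range(iterations):
        Q, R = torch.linalg.qr(A)   # Eq. (8)
        A = R @ Q                   # Eq. (9)
        Q_acc = Q_acc @ Q           # accumulate the eigenvector estimate
    # Pick the column belonging to the largest eigenvalue on the diagonal of A
    idx = torch.argmax(torch.abs(torch.diagonal(A)))
    return Q_acc[:, idx]
```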

2.4 Speaker adaptation of mask estimator

After a first pass recognition an optimal sequence of HMM states is obtained from the decoding process for each of the evaluation segments of the speaker of interest. Those alignments are then used as training targets for an adaptation training of the integrated system. Of the system shown in Figure 1, only the parameters of the mask estimator are adjusted in the adaptation training. The parameters of the remaining pipeline are kept fixed, such that only the mask estimator network is tuned towards optimizing the cost function of the integrated system. Therefore the mask estimator and thus the computation of the beamforming vector are optimized for the speech characteristics of the speaker of interest.
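The adaptation step can be summarized by the following PyTorch-style sketch. It is only an illustration of the procedure described above: the module name mask_estimator, the assumption that the integrated model returns per-frame HMM-state logits, and the choice of optimizer and learning rate are placeholders rather than details given in this paper.

```python
import torch

def adapt_mask_estimator(integrated_model, speaker_utts, epochs=1, lr=1e-4):
    """Second-pass adaptation: fine-tune only the mask estimator on
    first-pass HMM-state alignments of the speaker of interest.

    integrated_model: front-end plus back-end as in Figure 1, assumed to
        expose a .mask_estimator submodule and to map a multi-channel STFT
        to per-frame HMM-state logits of shape (T, num_states).
    speaker_utts: list of (multichannel_stft, first_pass_alignment) pairs,
        where the alignment is a LongTensor of shape (T,).
    """
    # Freeze the whole pipeline, then unfreeze the mask estimator only
    for p in integrated_model.parameters():
        p.requires_grad_(False)
    for p in integrated_model.mask_estimator.parameters():
        p.requires_grad_(True)

    optimizer = torch.optim.Adam(integrated_model.mask_estimator.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()

    for _ in range(epochs):
        for stft, alignment in speaker_utts:
            state_logits = integrated_model(stft)      # (T, num_states)
            loss = criterion(state_logits, alignment)  # frame-wise cross entropy
            optimizer.zero_grad()
            loss.backward()   # gradients flow through BAN, GEV and the masks
            optimizer.step()
```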

Even though this work uses the GEV beamformer with BAN, it is noteworthy that the proposed speaker adaptation method is equally applicable to the mask-based MVDR beamformer presented in [robMvdBeaUsiTimFreMasForOnlOffAsrInNoi], by changing the initialization of $\mathbf{A}^{(0)}$ in Equation 10 and omitting the BAN.

3 Experimental setup

The proposed speaker adaptation scheme for the acoustic beamformer is evaluated on the data of the CHiME-4 speech recognition task [anAnaOfEnvMicAndDatSimMisInRobSpeRec]. The CHiME-4 dataset features real and simulated 16 kHz multi-channel audio data recorded with a six-channel microphone array arranged around a tablet device. Recordings and simulations are based on the 5k WSJ0 corpus, with four different kinds of real-world background noise. The training set contains approximately 18 h of data per channel, recorded from 87 different speakers. Results are provided for the real development and real evaluation sets of the 6-channel track. Both sets contain audio of 4 speakers each, 2 male and 2 female, with no overlap between the development and evaluation set. The amount of data per speaker is approximately 0.7 h in the development set and around 0.5 h in the evaluation set.

The acoustic model used in the experiments is a BLSTM network with 5 layers and 600 units per layer. In contrast to the system in [theRwtUpbForSysComForThe4thChiChaEva], the input features are 80-dimensional log-mel filterbank features computed in the STFT domain, employing a Blackman window with a window size of 25 ms and a frame shift of 10 ms. The input features are unnormalized, but a linear layer with 80 units, employing batch normalization, was added as a first layer to the network. This results in a marginally better baseline system than the one described in [theRwtUpbForSysComForThe4thChiChaEva]. The initial training of the acoustic model is done as described in [theRwtUpbForSysComForThe4thChiChaEva]: at first, alignments for the training set are computed on the data of the close talking microphone by using a GMM-HMM trained only on the close talking data of the training set. Those alignments can then be used for all other channels, since the data is recorded sample synchronized. The training of the BLSTM acoustic model is done by using the unprocessed audio data of the single channels. This has been demonstrated to be beneficial in many submissions to the CHiME challenges, e.g. in [theNttChiSysAdvInSpeEnhAndRecForMobMulMicDev].
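As an illustration of the architecture described above, a minimal PyTorch sketch of the acoustic model follows. It assumes 600 units per BLSTM direction and a placeholder number of HMM states; neither detail is fully specified here, and the actual training recipe of [theRwtUpbForSysComForThe4thChiChaEva] is not reproduced.

```python
import torch
import torch.nn as nn

class BlstmAcousticModel(nn.Module):
    """Hybrid BLSTM acoustic model: an 80-unit linear input layer with batch
    normalization, followed by 5 BLSTM layers and a linear output over the
    HMM states (num_states is a placeholder)."""

    def __init__(self, num_states, feat_dim=80, hidden=600):
        super().__init__()
        self.input_proj = nn.Linear(feat_dim, 80)
        self.input_bn = nn.BatchNorm1d(80)
        self.blstm = nn.LSTM(80, hidden, num_layers=5,
                             bidirectional=True, batch_first=True)
        self.output = nn.Linear(2 * hidden, num_states)

    def forward(self, logmel):                       # logmel: (B, T, 80)
        x = self.input_proj(logmel)
        x = self.input_bn(x.transpose(1, 2)).transpose(1, 2)
        x, _ = self.blstm(x)
        return self.output(x)                        # per-frame state logits
```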

The mask estimator network used in the experiments is similar to the one described in [blsSupGevBeaFroForThe3rdChiChal]. It consists of a BLSTM layer with 256 units, followed by two fully connected layers with 512 units and ReLU activations and another fully connected layer with 402 units and sigmoid activation. Thus, the frequency resolution of the estimated masks is lower than described in [blsSupGevBeaFroForThe3rdChiChal]. This is due to the adjustment of the dimensions of the masks to the DFT size of the feature extraction pipeline of the ASR system used here. The input of the mask estimation network is the magnitude spectrum of a single channel. The output of the network is the concatenation of the noise mask and the speech mask. During decoding, the outputs for the different channels of one utterance are grouped and the median masks are calculated. Those are then applied to all channels to estimate the spatial covariance matrices as described in Section 2.1. The initial mask estimation network is trained on the simulated training data as described in [blsSupGevBeaFroForThe3rdChiChal]. In contrast to [blsSupGevBeaFroForThe3rdChiChal], only the provided baseline configuration of the simulation is used and no additional data augmentation is done. The number of iterations of the QR-algorithm described in Section 2.3 is fixed to 5.
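A minimal PyTorch sketch of this mask estimator is given below. The split of the 402 sigmoid outputs into a 201-bin noise mask and a 201-bin speech mask, as well as the class and argument names, are assumptions for illustration; they are consistent with, but not stated in, the description above.

```python
import torch
import torch.nn as nn

class MaskEstimator(nn.Module):
    """Per-channel mask estimator: one BLSTM layer with 256 units per
    direction, two 512-unit ReLU layers and a 402-unit sigmoid output,
    interpreted here as noise and speech masks of 201 bins each."""

    def __init__(self, num_bins=201):
        super().__init__()
        self.blstm = nn.LSTM(num_bins, 256, bidirectional=True, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 2 * num_bins), nn.Sigmoid(),
        )

    def forward(self, magnitude):                    # magnitude: (B, T, num_bins)
        x, _ = self.blstm(magnitude)
        masks = self.ff(x)
        mask_n, mask_s = masks.chunk(2, dim=-1)      # noise mask, speech mask
        return mask_s, mask_n

# During decoding, the channel-wise masks of one utterance are reduced to a
# single median mask before the covariance estimation of Section 2.1, e.g.
# mask_s = torch.median(torch.stack(per_channel_speech_masks), dim=0).values
```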

The decoding is done with the 5-gram language model provided as the baseline language model of the CHiME-4 dataset. In a post-processing step, lattice rescoring with a recurrent neural network (RNN) language model is performed. The RNN language model is a 3-layer LSTM with highway connections. Details about the language model and the lattice rescoring can be found in [theRwtUpbForSysComForThe4thChiChaEva].

In addition to the acoustic beamforming described in Section 2.1, the baseline beamforming algorithm of the CHiME-4 task (BFIT) is used to provide baseline results. Apart from the beamforming algorithm, the exact same pipeline as described above is used.

The hyper-parameters for the speaker adaptation training such as the learning rate were tuned on the development set and applied to the evaluation set.

4 Experimental results

4.1 Baseline systems

Table 1 shows an overview of the experimental results. Using the GEV front-end described in Section 2.1 yields an improvement of about 20% to 30% relative over the baseline system with the BFIT front-end. Joint training of the GEV front-end and acoustic model improves the performance by another 5% relative. Those results are in line with the results reported in [beaEndToEndTraOfABeaSupMulChaAsrSys]. When comparing the mask output of the mask estimator before and after joint training, only minor differences in the masks can be observed. This is in line with the suggestion of the authors of [beaEndToEndTraOfABeaSupMulChaAsrSys] that a majority of the performance increase stems from the adaptation of the acoustic model towards the specific front-end.

System id | Front-end | Joint training | Speaker adapted | Dev  | Eval
0         | BFIT      | -              | -               | 4.36 | 7.17
1         | GEV       | -              | -               | 3.46 | 5.18
2         | GEV       | +              | -               | 3.32 | 4.84
3         | GEV       | +              | +               | 3.09 | 4.58
Table 1: Average WER (%) for the described systems at different stages of the integrated training.

4.2 Speaker adapted beamforming

Table 1 shows an overall improvement in WER after speaker adaptation, and Table 2 shows that improved performance is obtained for the majority of the speakers, with an improvement in WER of up to 11% and 15% relative for single speakers of the evaluation and development set, respectively. Figure 2 shows an example of the estimated speech mask before and after the speaker adaptation. It can be seen that the speech mask after speaker adaptation shows a stronger emphasis on the fundamental frequency and the harmonics. This can be seen repeatedly between the time marks of 2 s and 3 s. At time mark 4 s, a pattern of fundamental frequency and harmonics can be seen in the mask after adaptation, which is not present in the mask before adaptation and which can also hardly be spotted in the input signal or the clean signal. This could indicate an increased bias of the mask estimator towards this kind of pattern.

Sys. id |         Dev             |          Eval
        | F01   F04   M03   M04   | F05   F06   M05   M06
2       | 4.19  3.23  2.77  3.07  | 6.88  4.09  3.83  4.58
3       | 3.55  3.20  2.48  3.14  | 6.35  4.09  3.38  4.48
Table 2: WER (%) of the separate speakers for the jointly trained system (2) and the speaker adapted system (3).
Figure 2: Two-second snippet of the signal "F01_421C0210_BUS" from the development set, starting at second 2 and showing the frequency range up to 3 kHz. a) log magnitude spectrum of the noisy signal recorded at channel 5, b) log magnitude spectrum of the signal recorded at the close talking microphone, c) estimated speech mask of system 2 (jointly trained but before speaker adaptation), d) estimated speech mask of system 3 (after speaker adaptation).

5 Conclusion

This work describes a method for speaker adaptation of mask-based beamforming in a multi-channel ASR system. The basis of the adaptation method is the integration of the statistically optimal beamformer with the acoustic model, which allows the back propagation of the training errors through the complete system and which has previously been introduced in [beaEndToEndTraOfABeaSupMulChaAsrSys]. Here, an alternative solution for the back propagation of the errors through the computation of the beamforming vector, based on the QR-algorithm, is presented. The system is then used in a two-pass approach to adapt the mask estimator to a speaker of interest during the decoding phase. It was shown that this adaptation method results in speech masks with a stronger emphasis on the fundamental frequency and harmonics of the speaker. Furthermore, a relative WER improvement of up to 11% was shown for single speakers of the real evaluation data of the CHiME-4 ASR task.

6 Acknowledgements

This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program grant agreement No. 694537. This work has also been supported by Deutsche Forschungsgemeinschaft (DFG) under contract No. Schl2043/1-1 and European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement No. 644283. The work reflects only the authors’ views and the European Research Council Executive Agency is not responsible for any use that may be made of the information it contains. The GPU cluster used for the experiments was partially funded by Deutsche Forschungsgemeinschaft (DFG) Grant INST 222/1168-1.

References
