
# DNN-based Source Enhancement to Increase Objective Sound Quality Assessment Score

## Abstract

We propose a training method for deep neural network (DNN)-based source enhancement to increase objective sound quality assessment (OSQA) scores such as the perceptual evaluation of speech quality (PESQ). In many conventional studies, DNNs have been used as a mapping function to estimate time-frequency masks and trained to minimize an analytically tractable objective function such as the mean squared error (MSE). Since OSQA scores have been widely used for sound-quality evaluation, constructing DNNs to increase OSQA scores would be better suited than minimum-MSE training for creating high-quality output signals. However, since most OSQA scores are not analytically tractable, i.e., they are black boxes, the gradient of the objective function cannot be calculated by simply applying back-propagation. To calculate the gradient of the OSQA-based objective function, we formulated a DNN optimization scheme on the basis of black-box optimization, which is used for training a computer that plays a game. For the black-box-optimization scheme, we adopt the policy gradient method, which calculates the gradient on the basis of a sampling algorithm. To simulate output signals using the sampling algorithm, DNNs are used to estimate the probability density function of the output signals that maximize OSQA scores. The OSQA scores are calculated from the simulated output signals, and the DNNs are trained to increase the probability of generating the simulated output signals that achieve high OSQA scores. Through several experiments, we found that OSQA scores significantly increased by applying the proposed method, even though the MSE was not minimized.

**Index Terms**—Sound-source enhancement, time-frequency mask, deep learning, objective sound quality assessment (OSQA) score.

## I Introduction

Sound-source enhancement has been studied for many years [1, 2, 3, 4, 5, 6] because of the high demand for its use in various practical applications such as automatic speech recognition [7, 8, 9], hands-free telecommunication [10, 11], hearing aids [12, 13, 14, 15], and immersive audio field representation [16, 17]. In this study, we aimed at generating an enhanced target source with high listening quality because the processed sounds are assumed to be perceived by humans.

Recently, deep learning [18] has been successfully used for sound-source enhancement [8, 15, 19, 20, 21, 22, 23, 24, 25, 28, 26, 29, 30, 31, 27, 32, 33, 34, 35]. In many of these conventional studies, deep neural networks (DNNs) were used as a regression function to estimate time-frequency (T-F) masks [19, 20, 21, 22] and/or amplitude spectra of the target source [23, 24, 25, 28, 26, 29, 30, 31, 27]. The parameters of the DNNs were trained using back-propagation [36] to minimize an analytically tractable objective function such as the mean squared error (MSE) between supervised outputs and DNN outputs. In recent studies, advanced analytical objective functions were used, such as the maximum likelihood (ML) [32, 31], the combination of multiple types of MSE [25, 26, 27], the Kullback-Leibler and/or Itakura-Saito divergence [33], a modified short-time objective intelligibility (STOI) measure [22], the clustering cost [34], and the discriminative cost between a clean target source and the output signal using a generative adversarial network (GAN) [35].

When output sound is perceived by humans, the objective function that reflects human perception may not be analytically tractable, i.e., it is a black-box function. In the past few years, objective sound quality assessment (OSQA) scores, such as the perceptual evaluation of speech quality (PESQ) [37] and STOI [38], have been commonly used to evaluate output sound quality. Thus, it might be better to construct DNNs to increase OSQA scores directly. However, since typical OSQA scores are not analytically defined (i.e., they are black-box functions), the gradient of the objective function cannot be calculated by simply applying back-propagation.

We previously proposed a DNN training method to estimate T-F masks and increase OSQA scores [39]. To overcome the problem that the objective function for maximizing the OSQA scores is not analytically tractable, we developed a DNN-training method on the basis of the black-box optimization framework [40], as used in predicting the winning percentage of the game Go [41]. The basic idea of black-box optimization is estimating a gradient from randomly simulated output. For example, in the training of a DNN for a Go-playing computer, the computer determines a "move" (where to put a Go stone) depending on the DNN output. Then, if the computer wins the game, a gradient is calculated to increase the selection probability of the selected "moves". We adopt this strategy to increase the OSQA scores; some output signals are randomly simulated, and a DNN is trained to increase the generation probability of the simulated output signals that achieved high OSQA scores. As a first trial, we prepared a finite number of T-F mask templates and trained DNNs to select the template that maximizes the OSQA score. Although we found that the OSQA scores increased using this method, the output performance could be improved further by extending the template-selection scheme to a more flexible T-F mask design scheme.

In this study, to arbitrarily estimate T-F masks, we modified the DNN source enhancement architecture to estimate the latent parameters of a continuous probability density function (PDF) of the T-F-mask-processed output signals, as shown in Fig. 1. To calculate the gradient of the objective function, we adopt the policy gradient method [42] as a black-box optimization scheme. With our method, the estimated latent parameters construct a continuous PDF as the "policy" of T-F-mask estimation to increase OSQA scores. On the basis of this policy, the output signals are directly simulated using a sampling algorithm. Then, the gradient of the DNN is estimated to increase/decrease the generation probability of output signals with high/low OSQA scores, respectively. Sampling from a continuous PDF causes the estimate of the gradient to fluctuate, resulting in unstable training behavior. To avoid this problem, we additionally formulate two tricks: i) score normalization to reduce the variance in the estimated gradient, and ii) a sampling algorithm that simulates output signals satisfying the constraint of T-F mask processing.

The rest of this paper is organized as follows. Section II introduces DNN source enhancement based on the ML approach. In Section III, we propose our DNN training method to increase OSQA scores on the basis of the black-box optimization. After investigating the sound quality of output signals through several experiments in Section IV, we conclude this paper in Section V.

## II Conventional Method

### II-A Sound-source enhancement with time-frequency masks

Let us consider the problem of estimating a target source $S_{\omega,\tau}$, which is surrounded by ambient noise $N_{\omega,\tau}$. A signal observed with a single microphone is assumed to be modeled as

$$X_{\omega,\tau} = S_{\omega,\tau} + N_{\omega,\tau}, \tag{1}$$

where $\omega$ and $\tau$ denote the frequency and time indices, respectively.

In sound-source enhancement using T-F masks, the output signal $\hat{S}_{\omega,\tau}$ is obtained by multiplying a T-F mask $G_{\omega,\tau}$ by the observation $X_{\omega,\tau}$ as

$$\hat{S}_{\omega,\tau} = G_{\omega,\tau} X_{\omega,\tau}, \tag{2}$$

where $G_{\omega,\tau}$ is a T-F mask. The ideal ratio mask (IRM) [8] is one implementation of a T-F mask, defined by

$$G^{\mathrm{IRM}}_{\omega,\tau} = \frac{|S_{\omega,\tau}|}{|S_{\omega,\tau}| + |N_{\omega,\tau}|}. \tag{3}$$

The IRM maximizes the signal-to-noise ratio (SNR) when the phase spectrum of $S_{\omega,\tau}$ coincides with that of $N_{\omega,\tau}$. However, this assumption is almost never satisfied in practical cases. To compensate for this mismatch, the phase-sensitive spectrum approximation (PSA) [19, 20] was proposed:

$$G^{\mathrm{PSA}}_{\omega,\tau} = \min\left(1, \max\left(0, \frac{|S_{\omega,\tau}|}{|X_{\omega,\tau}|} \cos\left(\theta^{(S)}_{\omega,\tau} - \theta^{(X)}_{\omega,\tau}\right)\right)\right), \tag{4}$$

where $\theta^{(S)}_{\omega,\tau}$ and $\theta^{(X)}_{\omega,\tau}$ are the phase spectra of $S_{\omega,\tau}$ and $X_{\omega,\tau}$, respectively. Since the PSA is the T-F mask that minimizes the squared error between $S_{\omega,\tau}$ and $\hat{S}_{\omega,\tau}$ on the complex plane, we use it as our T-F masking scheme.
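As a concrete reference, the two masks above can be sketched in NumPy. This is a minimal illustration with toy values of my own choosing, not numbers from the paper; the small epsilon terms guard against division by zero.

```python
import numpy as np

def irm(s, n):
    """Ideal ratio mask (3): |S| / (|S| + |N|) per T-F bin."""
    return np.abs(s) / (np.abs(s) + np.abs(n) + 1e-12)

def psa_mask(s, x):
    """Phase-sensitive spectrum approximation (4), clipped to [0, 1]."""
    ratio = np.abs(s) / (np.abs(x) + 1e-12)
    cos_term = np.cos(np.angle(s) - np.angle(x))
    return np.clip(ratio * cos_term, 0.0, 1.0)

# Toy single T-F bin: clean component plus additive noise, cf. (1).
s = np.array([1.0 + 1.0j])
n = np.array([0.5 - 0.2j])
x = s + n
s_hat = psa_mask(s, x) * x   # T-F mask processing, cf. (2)
```

Because the PSA projects the clean bin onto the direction of the observation before clipping, the masked output is never farther from the clean bin than the raw observation is.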

### II-B Maximum-likelihood-based DNN training for T-F mask estimation

In many conventional studies of DNN-based source enhancement, DNNs were used as a mapping function to estimate T-F masks. In this section, we explain DNN training based on ML estimation, on which the proposed method is based. Since the ML-based approach explicitly models the PDF of the target source, it becomes possible to simulate output signals by generating random numbers from the PDF.

In ML-based training, the DNNs are constructed to estimate the parameters of the conditional PDF of the target source given the observation, $p(\mathbf{S}|\mathbf{X},\Theta)$, where $\Theta$ denotes the DNN parameters. An example on a fully connected DNN is described later (after (16)). The target source and observation are assumed to be vectorized for all frequency bins as

$$\mathbf{S}_\tau := (S_{1,\tau}, \ldots, S_{\Omega,\tau})^\top, \tag{5}$$
$$\mathbf{X}_\tau := (X_{1,\tau}, \ldots, X_{\Omega,\tau})^\top, \tag{6}$$

where $\top$ denotes transposition. Then $\Theta$ is trained to maximize the expectation of the log-likelihood as

$$\Theta \leftarrow \operatorname*{arg\,max}_{\Theta} \mathcal{J}_{\mathrm{ML}}(\Theta), \tag{7}$$

where the objective function is defined by

$$\mathcal{J}_{\mathrm{ML}}(\Theta) = \mathbb{E}_{\mathbf{S},\mathbf{X}}\left[\ln p(\mathbf{S}|\mathbf{X},\Theta)\right], \tag{8}$$

and $\mathbb{E}_{\mathbf{S},\mathbf{X}}[\cdot]$ denotes the expectation operator over $\mathbf{S}$ and $\mathbf{X}$. However, since (8) is difficult to calculate analytically, the expectation is replaced with the average over the training dataset as

$$\mathcal{J}_{\mathrm{ML}}(\Theta) \approx \frac{1}{T}\sum_{\tau=1}^{T} \ln p(\mathbf{S}_\tau|\mathbf{X}_\tau,\Theta). \tag{9}$$

The back-propagation algorithm [36] is used in training to maximize (9). When $p(\mathbf{S}_\tau|\mathbf{X}_\tau,\Theta)$ is composed of differentiable functions with respect to $\Theta$, the gradient is calculated as

$$\nabla_\Theta \mathcal{J}_{\mathrm{ML}}(\Theta) \approx \frac{1}{T}\sum_{\tau=1}^{T} \nabla_\Theta \ln p(\mathbf{S}_\tau|\mathbf{X}_\tau,\Theta), \tag{10}$$

where $\nabla_\Theta$ is the partial differential operator with respect to $\Theta$.

To calculate (10), $p(\mathbf{S}_\tau|\mathbf{X}_\tau,\Theta)$ is modeled by assuming that the estimation error of $S_{\omega,\tau}$ is independent for all frequency bins and follows a zero-mean complex Gaussian distribution with variance $\sigma^2_{\omega,\tau}$. This assumption is based on state-of-the-art methods, which train DNNs to minimize the MSE between $S_{\omega,\tau}$ and $\hat{S}_{\omega,\tau}$ on the complex plane [19, 20]. The minimum MSE (MMSE) on the complex plane is equivalent to assuming that the errors are independent for all frequency bins and follow a zero-mean complex Gaussian distribution with variance 1. Our assumption relaxes that of the conventional methods; the variance of each frequency bin varies according to the error values so as to maximize the likelihood. Thus, since $\hat{S}_{\omega,\tau}$ is given by $\hat{G}_{\omega,\tau}X_{\omega,\tau}$, $p(\mathbf{S}_\tau|\mathbf{X}_\tau,\Theta)$ is modeled by the following complex Gaussian distribution:

$$p(\mathbf{S}_\tau|\mathbf{X}_\tau,\Theta) = \prod_{\omega=1}^{\Omega} \frac{1}{2\pi\sigma^2_{\omega,\tau}} \exp\left\{ -\frac{\left|S_{\omega,\tau} - \hat{G}_{\omega,\tau}X_{\omega,\tau}\right|^2}{2\sigma^2_{\omega,\tau}} \right\}. \tag{11}$$

In this model, the MSE between $S_{\omega,\tau}$ and $\hat{S}_{\omega,\tau}$ on the complex plane is extended to the likelihood of $S_{\omega,\tau}$ under a complex Gaussian distribution whose mean and variance parameters are $\hat{G}_{\omega,\tau}X_{\omega,\tau}$ and $\sigma^2_{\omega,\tau}$, respectively. Equation (11) includes unknown parameters: the T-F mask $\hat{G}_{\omega,\tau}$ and error variance $\sigma^2_{\omega,\tau}$. Thus, we construct DNNs to estimate $\hat{G}_{\omega,\tau}$ and $\sigma^2_{\omega,\tau}$ from the observation, as shown in Fig. 2. The vectorized T-F masks and error variances for all frequency bins are defined as

$$\mathcal{G}(\mathbf{x}_\tau) := (\hat{G}_{1,\tau}, \ldots, \hat{G}_{\Omega,\tau})^\top, \tag{12}$$
$$\sigma(\mathbf{x}_\tau) := (\sigma^2_{1,\tau}, \ldots, \sigma^2_{\Omega,\tau})^\top. \tag{13}$$

Here, $\mathbf{x}_\tau$ is the input vector of the DNNs, prepared by concatenating several frames of observations to account for previous and future frames, and $\mathcal{G}(\mathbf{x}_\tau)$ and $\sigma(\mathbf{x}_\tau)$ are estimated by

$$\mathcal{G}(\mathbf{x}_\tau) \leftarrow \phi_g\left\{\mathbf{W}^{(\mu)}\mathbf{z}^{(L-1)}_\tau + \mathbf{b}^{(\mu)}\right\}, \tag{14}$$
$$\sigma(\mathbf{x}_\tau) \leftarrow \phi_\sigma\left\{\mathbf{W}^{(\sigma)}\mathbf{z}^{(L-1)}_\tau + \mathbf{b}^{(\sigma)}\right\} + C_\sigma, \tag{15}$$
$$\mathbf{z}^{(l)}_\tau = \phi_h\left\{\mathbf{W}^{(l)}\mathbf{z}^{(l-1)}_\tau + \mathbf{b}^{(l)}\right\}, \tag{16}$$

where $C_\sigma$ is a small positive constant that prevents the variance from becoming very small. Here, $l$, $L$, $\mathbf{W}^{(l)}$, and $\mathbf{b}^{(l)}$ are the layer index, number of layers, weight matrix, and bias vector, respectively. $\mathbf{W}^{(\mu)}$ and $\mathbf{W}^{(\sigma)}$ are the weight matrices, and $\mathbf{b}^{(\mu)}$ and $\mathbf{b}^{(\sigma)}$ are the bias vectors, used to estimate the T-F mask and variance, respectively. The DNN parameters are composed of these weights and biases. The functions $\phi_g$, $\phi_\sigma$, and $\phi_h$ are nonlinear activation functions; in conventional studies, sigmoid and exponential functions were used as implementations of $\phi_g$ [19, 20] and $\phi_\sigma$ [32], respectively. The input vector is passed to the first layer of the network as $\mathbf{z}^{(0)}_\tau = \mathbf{x}_\tau$.
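The two output heads in (14)–(16) can be sketched as a tiny NumPy forward pass. The sizes below are toy values I chose for illustration, far smaller than the paper's 1024-unit network, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy sizes (hypothetical): input dim, hidden dim, number of frequency bins.
D_in, D_h, Omega = 10, 16, 5
params = {
    "W1": rng.normal(0.0, 0.1, (D_h, D_in)), "b1": np.zeros(D_h),
    "W_mu": rng.normal(0.0, 0.1, (Omega, D_h)), "b_mu": np.zeros(Omega),
    "W_sig": rng.normal(0.0, 0.1, (Omega, D_h)), "b_sig": np.zeros(Omega),
}

def forward(x, p, c_sigma=1.0):
    """One hidden layer of (16); the mask head (14) uses a sigmoid so the
    T-F mask stays in (0, 1); the variance head (15) uses an exponential
    plus the small positive constant C_sigma."""
    z = np.maximum(0.0, p["W1"] @ x + p["b1"])        # ReLU hidden units
    g = sigmoid(p["W_mu"] @ z + p["b_mu"])            # T-F mask estimate
    var = np.exp(p["W_sig"] @ z + p["b_sig"]) + c_sigma
    return g, var

x_tau = rng.normal(size=D_in)   # stand-in for concatenated observation frames
g, var = forward(x_tau, params)
```

The sigmoid keeps every mask value inside the valid range, and the exponential plus $C_\sigma$ guarantees a strictly positive variance, mirroring the roles of $\phi_g$ and $\phi_\sigma$ above.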

## III Proposed Method

Our proposed DNN-training method increases OSQA scores. With the proposed method, the policy gradient method [42] is used to statistically calculate the gradient with respect to $\Theta$ by using a sampling algorithm, even though the objective function is not differentiable. However, sampling-based gradient estimation frequently makes DNN training unstable. To avoid this problem, we introduce two tricks: i) score normalization that reduces the variance in the estimated gradient (Sec. III-B), and ii) a sampling algorithm that simulates output signals satisfying the constraint of T-F mask processing (Sec. III-C). Finally, the overall training procedure of the proposed method is summarized in Sec. III-D.

### III-A Policy-gradient-based DNN training for T-F mask estimation

Let $\mathcal{B}(\hat{\mathbf{S}},\mathbf{X})$ be a scoring function that quantifies the sound quality of the estimated signal $\hat{\mathbf{S}}$ defined by (2). To implement $\mathcal{B}$, subjective evaluation would be the most direct choice. However, it would be difficult to use in practice because DNN training requires a massive number of listening-test results. Thus, $\mathcal{B}$ quantifies the sound quality on the basis of OSQA scores, as shown in Fig. 1; the details of its implementation are discussed in Sec. III-B. We assume $\mathcal{B}$ is non-differentiable with respect to $\Theta$, because most OSQA scores are black-box functions.

Let us consider maximizing the expectation of $\mathcal{B}(\hat{\mathbf{S}},\mathbf{X})$ as a metric of source-enhancement performance in terms of OSQA scores:

$$\mathbb{E}_{\hat{\mathbf{S}},\mathbf{X}}\left[\mathcal{B}(\hat{\mathbf{S}},\mathbf{X})\right] = \iint \mathcal{B}(\hat{\mathbf{S}},\mathbf{X})\, p(\hat{\mathbf{S}},\mathbf{X})\, d\hat{\mathbf{S}}\, d\mathbf{X}. \tag{17}$$

Since the output signal $\hat{\mathbf{S}}$ is calculated from the observation $\mathbf{X}$, we decompose the joint PDF into the conditional PDF of the output signal given the observation and the marginal PDF of the observation as $p(\hat{\mathbf{S}},\mathbf{X}) = p(\hat{\mathbf{S}}|\mathbf{X})\,p(\mathbf{X})$. Then, (17) can be rewritten as

$$\mathbb{E}_{\hat{\mathbf{S}},\mathbf{X}}\left[\mathcal{B}(\hat{\mathbf{S}},\mathbf{X})\right] = \int p(\mathbf{X}) \int \mathcal{B}(\hat{\mathbf{S}},\mathbf{X})\, p(\hat{\mathbf{S}}|\mathbf{X})\, d\hat{\mathbf{S}}\, d\mathbf{X}. \tag{18}$$

We use DNNs to estimate the parameters of the conditional PDF of the output signal $p(\hat{\mathbf{S}}|\mathbf{X},\Theta)$, as in ML-based training. For example, the complex Gaussian distribution in (11) can be used as $p(\hat{\mathbf{S}}|\mathbf{X},\Theta)$. To train $\Theta$, (18) is used as an objective function by replacing the conditional PDF $p(\hat{\mathbf{S}}|\mathbf{X})$ with $p(\hat{\mathbf{S}}|\mathbf{X},\Theta)$:

$$\mathcal{J}(\Theta) = \mathbb{E}_{\hat{\mathbf{S}},\mathbf{X}}\left[\mathcal{B}(\hat{\mathbf{S}},\mathbf{X})\right] \tag{19}$$
$$= \int p(\mathbf{X}) \int \mathcal{B}(\hat{\mathbf{S}},\mathbf{X})\, p(\hat{\mathbf{S}}|\mathbf{X},\Theta)\, d\hat{\mathbf{S}}\, d\mathbf{X}. \tag{20}$$

Since $\mathcal{B}$ is non-differentiable with respect to $\Theta$, the gradient of (20) cannot be obtained analytically by simply applying back-propagation. Hence, we apply the policy-gradient method [42], which can statistically calculate the gradient of a black-box objective function. By assuming that the function form of $\mathcal{B}$ is smooth, $\mathcal{B}$ is a continuous function and its derivative exists. In addition, we assume $p(\hat{\mathbf{S}}|\mathbf{X},\Theta)$ is composed of differentiable functions with respect to $\Theta$. Then, the gradient of (20) can be calculated using the log-derivative trick [42] as

$$\nabla_\Theta \mathcal{J}(\Theta) = \int p(\mathbf{X}) \int \mathcal{B}(\hat{\mathbf{S}},\mathbf{X})\, \nabla_\Theta p(\hat{\mathbf{S}}|\mathbf{X},\Theta)\, d\hat{\mathbf{S}}\, d\mathbf{X} \tag{21}$$
$$= \mathbb{E}_{\mathbf{X}}\left[\mathbb{E}_{\hat{\mathbf{S}}|\mathbf{X}}\left[\mathcal{B}(\hat{\mathbf{S}},\mathbf{X})\, \nabla_\Theta \ln p(\hat{\mathbf{S}}|\mathbf{X},\Theta)\right]\right]. \tag{22}$$

Since the expectation in (22) cannot be calculated analytically, the expectation with respect to $\mathbf{X}$ is approximated by averaging over the training data, and the expectation with respect to $\hat{\mathbf{S}}$ is approximated using a sampling algorithm as

$$\nabla_\Theta \mathcal{J}(\Theta) \approx \frac{1}{T}\sum_{\tau=1}^{T} \frac{1}{K}\sum_{k=1}^{K} \mathcal{B}(\hat{\mathbf{S}}^{(k)}_\tau,\mathbf{X}_\tau)\, \nabla_\Theta \ln p(\hat{\mathbf{S}}^{(k)}_\tau|\mathbf{X}_\tau,\Theta), \tag{23}$$
$$\hat{\mathbf{S}}^{(k)}_\tau \sim p(\hat{\mathbf{S}}|\mathbf{X}_\tau,\Theta), \tag{24}$$

where $\hat{\mathbf{S}}^{(k)}_\tau$ is the $k$-th simulated output signal and $K$ is the number of samplings, which is assumed to be sufficiently large. The superscript $(k)$ represents the variable of the $k$-th sampling, and $\sim$ denotes sampling from the distribution on its right side. The details of the sampling process for (24) are described in Sec. III-C.

Most OSQA scores, such as PESQ, are designed to be calculated over several time frames, such as one utterance of a speech sentence. Since $\mathcal{B}$ cannot be obtained for every time frame, the gradient cannot be calculated by (23). Thus, instead of averaging over time frames, we average over utterances. We define the observation of the $i$-th utterance as $\mathbf{X}^{(i)}$, and the $k$-th output signal of the $i$-th utterance as $\hat{\mathbf{S}}^{(i,k)}$. Then the gradient can be calculated as

$$\nabla_\Theta \mathcal{J}(\Theta) \approx \frac{1}{I}\sum_{i=1}^{I} \nabla_\Theta \mathcal{J}^{(i)}(\Theta), \tag{25}$$
$$\nabla_\Theta \mathcal{J}^{(i)}(\Theta) \approx \sum_{k=1}^{K} \frac{\mathcal{B}(\hat{\mathbf{S}}^{(i,k)},\mathbf{X}^{(i)})}{K\,T^{(i)}} \sum_{\tau=1}^{T^{(i)}} \nabla_\Theta \ln p(\hat{\mathbf{S}}^{(i,k)}_\tau|\mathbf{X}^{(i)}_\tau,\Theta), \tag{26}$$

where $T^{(i)}$ is the frame length of the $i$-th utterance, and we assume that the output signal of each time frame is calculated independently. The details of the derivation of (25) are given in Appendix -A.
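The sampled-gradient update of (23)–(24) can be demonstrated on a toy problem. Below, a scalar Gaussian "policy" replaces the DNN, and a hidden quadratic score stands in for the black-box OSQA; both are my illustrative choices, not part of the paper's system. The mean of the policy converges toward the score's maximizer even though the score is never differentiated.

```python
import numpy as np

rng = np.random.default_rng(0)

def black_box_score(s_hat, target=0.8):
    """Toy stand-in for a non-differentiable OSQA score: larger when the
    sampled output is closer to a hidden target value (not a real PESQ)."""
    return -(s_hat - target) ** 2

# Gaussian "policy" over a scalar output; the mean plays the role of the
# DNN output, and sigma is held fixed for simplicity.
mu, sigma, lr, K = 0.2, 0.3, 0.5, 2000
for _ in range(200):
    s = rng.normal(mu, sigma, size=K)        # sampling step, cf. (24)
    b = black_box_score(s)
    b = b - b.mean()                          # baseline subtraction, cf. (27)
    grad_logp = (s - mu) / sigma ** 2         # d/d(mu) of ln N(s; mu, sigma^2)
    mu += lr * np.mean(b * grad_logp)         # sampled gradient ascent, cf. (23)
```

After training, `mu` sits close to the hidden target 0.8: the update raises the probability of samples that scored above average and lowers it for the rest, which is exactly the mechanism the proposed method applies to T-F masks.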

### III-B Scoring-function design for stable training

We now introduce a design of the scoring function $\mathcal{B}$ that stabilizes the training process. Because the expectation for the gradient calculation in (22) is approximated using a sampling algorithm, training may become unstable. One reason for unstable training behavior is that the variance in the estimated gradient becomes large when the variance in the scoring-function output is large [42]. To stabilize training, instead of directly using a raw OSQA score as $\mathcal{B}$, a normalized OSQA score is used to reduce its variance. Hereafter, a raw OSQA score calculated from $\hat{\mathbf{S}}$ and $\mathbf{X}$ is written as $\mathcal{Z}(\hat{\mathbf{S}},\mathbf{X})$ to distinguish it from the normalized OSQA score $\mathcal{B}(\hat{\mathbf{S}},\mathbf{X})$.

From (25) and (26), the total gradient is a weighted sum of the gradients of the log-likelihood functions of the $k$-th simulated outputs, with $\mathcal{B}(\hat{\mathbf{S}}^{(i,k)},\mathbf{X}^{(i)})$ as the weights. Since typical OSQA scores vary not only with the performance of source enhancement but also with the SNR of each input signal, $\mathcal{Z}$ also varies with the OSQA characteristics and the SNR of $\mathbf{X}^{(i)}$. To reduce the variance in the estimate of the gradient, it would be better to remove such external factors according to the input conditions of each input signal, e.g., input SNRs. As a possible solution, the external factors involved in the OSQA score can be estimated by calculating the expectation of the OSQA score given the input signal. Thus, subtracting the conditional expectation of $\mathcal{Z}$ given each input signal from $\mathcal{Z}$ might be effective in reducing the variance:

$$\mathcal{B}(\hat{\mathbf{S}},\mathbf{X}) = \mathcal{Z}(\hat{\mathbf{S}},\mathbf{X}) - \mathbb{E}_{\hat{\mathbf{S}}|\mathbf{X}}\left[\mathcal{Z}(\hat{\mathbf{S}},\mathbf{X})\right]. \tag{27}$$

This implementation is known as "baseline subtraction" [42, 43]. Here, the conditional expectation cannot be calculated analytically, so we replace it with the average of the OSQA scores of the $K$ simulated outputs. The scoring function is then designed as

$$\mathcal{B}(\hat{\mathbf{S}}^{(i,k)},\mathbf{X}^{(i)}) = \mathcal{Z}(\hat{\mathbf{S}}^{(i,k)},\mathbf{X}^{(i)}) - \frac{1}{K}\sum_{k'=1}^{K} \mathcal{Z}(\hat{\mathbf{S}}^{(i,k')},\mathbf{X}^{(i)}). \tag{28}$$
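The sample-mean baseline subtraction described above is a one-liner in practice. The raw scores below are toy numbers of my own choosing, standing in for the OSQA scores of one utterance's $K$ simulated outputs.

```python
import numpy as np

def normalized_scores(raw_scores):
    """Baseline subtraction, cf. (27)-(28): replace each raw OSQA score Z
    with its deviation from the per-utterance sample mean over the K
    simulated outputs, which lowers the variance of the gradient weights."""
    raw = np.asarray(raw_scores, dtype=float)
    return raw - raw.mean()

# K = 3 simulated outputs of one utterance with toy raw scores.
b = normalized_scores([2.1, 2.5, 1.7])
```

The normalized weights sum to zero within an utterance, so outputs scoring above the utterance's own average push the policy up while below-average outputs push it down, regardless of the utterance's absolute SNR level.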

### III-C Sampling algorithm to simulate the T-F-mask-processed output signal

An intuitive implementation of the sampling operator used in (24) is a typical pseudo-random number generator such as the Mersenne Twister [44]. However, this would in fact be difficult to use, because typical sampling algorithms simulate output signals that do not satisfy the constraint of real-valued T-F-mask processing defined by (2). To avoid this problem, we first draw a temporary output signal $\tilde{S}^{(i,k)}_{\omega,\tau}$ using a typical sampling algorithm, and then calculate the T-F mask $\hat{G}^{(i,k)}_{\omega,\tau}$ and output signal $\hat{S}^{(i,k)}_{\omega,\tau}$ from it, so that they satisfy the constraint of T-F-mask processing and minimize the squared error between $\tilde{S}^{(i,k)}_{\omega,\tau}$ and $\hat{S}^{(i,k)}_{\omega,\tau}$.

Figure 3 illustrates the problem and the proposed solution on the complex plane. In this study, we use a real-valued T-F mask within the range $0 \le \hat{G}_{\omega,\tau} \le 1$. Thus, the output signal is constrained to lie on the dotted line in Fig. 3, i.e., T-F mask processing affects only the norm of $X_{\omega,\tau}$. However, since $p(\hat{\mathbf{S}}|\mathbf{X},\Theta)$ is modeled by a continuous PDF such as the complex Gaussian distribution in (11), a typical sampling algorithm may generate output signals that do not satisfy the T-F-mask constraint, i.e., the phase spectrum of the sample does not coincide with that of $X_{\omega,\tau}$. To solve this problem, we formulate a PSA-based T-F-mask re-calculation. First, a temporary output signal $\tilde{S}^{(i,k)}_{\omega,\tau}$ is sampled using a sampling algorithm (arrow (i) in Fig. 3). Then, the T-F mask that minimizes the squared error between $\tilde{S}^{(i,k)}_{\omega,\tau}$ and $\hat{S}^{(i,k)}_{\omega,\tau}$ is calculated using the PSA equation as

$$\hat{G}^{(i,k)}_{\omega,\tau} = \min\left(1, \max\left(0, \frac{|\tilde{S}^{(i,k)}_{\omega,\tau}|}{|X^{(i)}_{\omega,\tau}|} \cos\left(\theta^{(\tilde{S}^{(i,k)})}_{\omega,\tau} - \theta^{(X^{(i)})}_{\omega,\tau}\right)\right)\right), \tag{29}$$

where $\theta^{(\tilde{S}^{(i,k)})}_{\omega,\tau}$ and $\theta^{(X^{(i)})}_{\omega,\tau}$ are the phase spectra of $\tilde{S}^{(i,k)}_{\omega,\tau}$ and $X^{(i)}_{\omega,\tau}$, respectively. Then, the output signal is calculated by

$$\hat{S}^{(i,k)}_{\omega,\tau} = \hat{G}^{(i,k)}_{\omega,\tau} X^{(i)}_{\omega,\tau}, \tag{30}$$

as shown with arrow (ii) in Fig. 3.
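The sample-then-reproject step can be sketched as follows. For simplicity the mask mean and the variance are scalars shared across the toy bins, which is my simplification rather than the paper's per-bin parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_projected_output(x, g_map, var):
    """Draw a temporary output around g_map * x, cf. (31), then re-project
    it onto the real-valued T-F-mask constraint with the PSA formula,
    cf. (29)-(30). g_map and var are scalars here for simplicity."""
    noise = rng.normal(0.0, np.sqrt(var), x.shape) \
        + 1j * rng.normal(0.0, np.sqrt(var), x.shape)
    s_tilde = g_map * x + noise                  # temporary output signal
    ratio = np.abs(s_tilde) / (np.abs(x) + 1e-12)
    g = np.clip(ratio * np.cos(np.angle(s_tilde) - np.angle(x)), 0.0, 1.0)
    return g, g * x                              # re-projected output (30)

x = rng.normal(size=4) + 1j * rng.normal(size=4)  # toy observed T-F bins
g, s_hat = sample_projected_output(x, g_map=0.7, var=0.05)
```

After re-projection, every simulated output shares the phase of the observation and corresponds to a valid mask in $[0, 1]$, so the sample is reachable by real-valued T-F mask processing.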

### III-D Training procedure

We now describe the overall training procedure of the proposed method, as shown in Fig. 4. Hereafter, to simplify the sampling algorithm, we use the complex Gaussian distribution described in (11)–(16).

First, the $i$-th observation utterance $\mathbf{X}^{(i)}$ is simulated by (1) using a randomly selected target-source file and a noise source of equal frame size from the training dataset. Next, the T-F mask $\hat{G}^{(i)}_{\omega,\tau}$ and variance $\sigma^2_{\omega,\tau}$ are estimated by (11)–(16). Then, to simulate the $k$-th output signal $\hat{S}^{(i,k)}_{\omega,\tau}$, the temporary output signal $\tilde{S}^{(i,k)}_{\omega,\tau}$ is sampled from the complex Gaussian distribution using a pseudo-random number generator, such as the Mersenne Twister [44], as

$$\begin{bmatrix} \mathfrak{R}(\tilde{S}^{(i,k)}_{\omega,\tau}) \\ \mathfrak{I}(\tilde{S}^{(i,k)}_{\omega,\tau}) \end{bmatrix} \sim \mathcal{N}_{\mathbb{C}}\left( \hat{G}^{(i)}_{\omega,\tau} \begin{bmatrix} \mathfrak{R}(X^{(i)}_{\omega,\tau}) \\ \mathfrak{I}(X^{(i)}_{\omega,\tau}) \end{bmatrix}, \sigma^2_{\omega,\tau}\mathbf{I} \right), \tag{31}$$

where $\mathbf{I}$ is the identity matrix, and $\mathfrak{R}(\cdot)$ and $\mathfrak{I}(\cdot)$ denote the real and imaginary parts of a complex number, respectively. After that, the T-F mask $\hat{G}^{(i,k)}_{\omega,\tau}$ is calculated using (29). To accelerate convergence, we additionally use the $\epsilon$-greedy algorithm to calculate $\hat{G}^{(i,k)}_{\omega,\tau}$: with probability $\epsilon$ applied to each time-frequency bin, the sampled T-F mask is kept; otherwise, the maximum a posteriori (MAP) T-F mask $\hat{G}^{(i)}_{\omega,\tau}$ estimated by the DNN is used:

$$\hat{G}^{(i,k)}_{\omega,\tau} \leftarrow \begin{cases} \hat{G}^{(i,k)}_{\omega,\tau} & (\text{with prob. } \epsilon) \\ \hat{G}^{(i)}_{\omega,\tau} & (\text{otherwise}). \end{cases} \tag{32}$$

In addition, a large gradient value leads to unstable training. One reason for a large gradient is that the log-likelihood in (26) becomes large. To reduce the gradient of the log-likelihood, the difference between the MAP T-F mask and the simulated T-F mask is truncated to the range $[-\lambda, \lambda]$ as

$$\Delta\hat{G}^{(i,k)}_{\omega,\tau} \leftarrow \hat{G}^{(i,k)}_{\omega,\tau} - \hat{G}^{(i)}_{\omega,\tau}, \tag{33}$$
$$\Delta\hat{G}^{(i,k)}_{\omega,\tau} \leftarrow \begin{cases} \lambda & (\Delta\hat{G}^{(i,k)}_{\omega,\tau} > \lambda) \\ \Delta\hat{G}^{(i,k)}_{\omega,\tau} & (-\lambda \le \Delta\hat{G}^{(i,k)}_{\omega,\tau} \le \lambda) \\ -\lambda & (\Delta\hat{G}^{(i,k)}_{\omega,\tau} < -\lambda), \end{cases} \tag{34}$$
$$\hat{G}^{(i,k)}_{\omega,\tau} \leftarrow \hat{G}^{(i)}_{\omega,\tau} + \Delta\hat{G}^{(i,k)}_{\omega,\tau}. \tag{35}$$

Then, the output signal is calculated by T-F-mask processing (30), and the OSQA scores $\mathcal{Z}$ and the normalized scores $\mathcal{B}$ are calculated by (28). After applying these procedures to $I$ utterances, $\Theta$ is updated with the back-propagation algorithm using the gradient calculated by (25).
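The two stabilization tricks, per-bin $\epsilon$-greedy selection and deviation clipping, can be sketched together. The mask values and the value `lam=0.2` below are illustrative; only `eps=0.05` matches the experimental setting reported later.

```python
import numpy as np

rng = np.random.default_rng(0)

def stabilize_mask(g_sampled, g_map, eps=0.05, lam=0.2):
    """Training tricks (32)-(35): per T-F bin, keep the sampled mask with
    probability eps and otherwise fall back to the MAP mask; then clip the
    deviation from the MAP mask to [-lam, lam]. lam = 0.2 is illustrative."""
    keep = rng.random(g_sampled.shape) < eps   # epsilon-greedy selection (32)
    g = np.where(keep, g_sampled, g_map)
    delta = np.clip(g - g_map, -lam, lam)      # truncation (33)-(34)
    return g_map + delta                       # recombination (35)

g_map = np.full(8, 0.5)                        # MAP T-F masks from the DNN
g_sampled = rng.uniform(0.0, 1.0, 8)           # masks from the sampling step
g = stabilize_mask(g_sampled, g_map)
```

Because most bins fall back to the MAP mask and the rest stay within $\lambda$ of it, the log-likelihood term in (26), and hence the gradient, stays bounded.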

## IV Experiments

We conducted objective experiments to evaluate the performance of the proposed method. The experimental conditions are described in Sec. IV-A. To investigate whether a DNN source-enhancement function can be trained to increase OSQA scores, we first investigated the relationship between the number of updates and the OSQA scores (Sec. IV-B). Second, the source-enhancement performance of the proposed method was compared with those of conventional methods using several objective measurements (Sec. IV-C). Finally, subjective evaluations of sound quality and intelligibility were conducted (Sec. IV-D). For comparison, we used four DNN source-enhancement methods: two T-F-mask mapping functions trained using an MMSE-based objective function [19] and the ML-based objective function described in Sec. II-B, and two T-F-mask selection functions trained to increase the PESQ and STOI [39].

### IV-A Experimental conditions

#### Dataset

The ATR Japanese speech database [45] was used as the training dataset of the target source. The dataset consists of 6640 utterances spoken by 11 males and 11 females. The utterances were randomly separated into 5976 for the development set and 664 for the validation set. As the training dataset of noise, the CHiME-3 noise dataset was used, which consists of four types of background noise recordings: cafes, street junctions, public transport, and pedestrian areas [46]. The noisy-mixture dataset was generated by mixing clean speech utterances under various noise and SNR conditions using the following procedure: i) a noise recording is randomly selected from the noise dataset, ii) the amplitude of the noise is adjusted to the desired SNR level, and iii) the speech and noise sources are added in the time domain. As the test dataset, a Japanese speech database consisting of 300 utterances spoken by 3 males and 3 females was used as the target-source dataset, and an ambient noise database recorded at airports (Airp.), amusement parks (Amuse.), offices (Office), and party rooms (Party) was used as the noise dataset. All samples were recorded at a sampling rate of 16 kHz. The SNR levels of the training/test datasets were -6, 0, 6, and 12 dB.

#### DNN architecture and setup

For the proposed and all conventional methods, a fully connected DNN with 3 hidden layers and 1024 hidden units was used. All input vectors were mean-and-variance normalized using the training-data statistics. The activation functions for the T-F mask $\phi_g$, variance $\phi_\sigma$, and hidden units $\phi_h$ were the sigmoid function, exponential function, and rectified linear unit (ReLU), respectively. The context window size was , and the variance-flooring constant $C_\sigma$ in (15) was 1. The Adam method [47] was used as the gradient method. To avoid over-fitting, input vectors and DNN outputs, i.e., the T-F masks and error variances, were compressed using a Mel-transformation matrix, and the estimated T-F masks and error variances were transformed into the linear frequency domain using the Mel-transform's pseudo-inverse [48].

A PSA objective function [19, 20] was used as the MMSE-based objective function. Since the PSA objective function does not use the variance parameter, the corresponding DNNs estimate only T-F masks. For the ML-based objective function, we used (9) with the complex Gaussian distribution described in Sec. II-B. To train both methods, the dropout algorithm was used, and the DNNs were initialized by layer-by-layer pre-training [49]. An early-stopping algorithm [17] was used for fine-tuning, and L2 regularization was applied.

For the T-F-mask selection-based method [39], to improve the flexibility of T-F-mask selection, we used 128 T-F-mask templates. The DNN architecture, except for the output layer, is the same as MMSE- and ML-based methods.

For the proposed method, the DNN parameters were initialized by ML-based training. The $\epsilon$-greedy parameter was 0.05, and the clipping parameter $\lambda$ was determined according to preliminary informal experiments. As the OSQA scores, we used the PESQ, which is a speech quality measure, and the STOI, which is a speech intelligibility measure. To avoid adjusting the step size of the gradient method for each OSQA score, we normalized the OSQA scores to unify their ranges. In these experiments, each OSQA score was normalized so that its maximum and minimum values were 100 and 0 as

$$\mathcal{Z}_{\mathrm{PESQ}}(\hat{\mathbf{S}},\mathbf{X}) = 20.0 \times \left(\mathrm{PESQ}(\hat{\mathbf{S}},\mathbf{X}) + 0.5\right),$$
$$\mathcal{Z}_{\mathrm{STOI}}(\hat{\mathbf{S}},\mathbf{X}) = 100.0 \times \mathrm{STOI}(\hat{\mathbf{S}},\mathbf{X}).$$

The training algorithm was stopped after 10,000 times of executing the whole parameter update process shown in Fig. 4.

#### Other conditions

T-F-mask processing is known to cause artificial distortion, so-called musical noise [50]. For all methods, to reduce musical noise, flooring [6, 51] and smoothing [52, 53] were applied to $\hat{G}_{\omega,\tau}$ before T-F-mask processing as

$$\hat{G}_{\omega,\tau} \leftarrow \max\left(G_{\mathrm{min}}, \hat{G}_{\omega,\tau}\right), \tag{36}$$
$$\hat{G}_{\omega,\tau} \leftarrow \beta\hat{G}_{\omega,\tau} + (1-\beta)\hat{G}_{\omega,\tau-1}, \tag{37}$$

where $G_{\mathrm{min}}$ is the lower threshold of the T-F mask and $\beta$ is the smoothing parameter. The frame size of the short-time Fourier transform (STFT) was 512 samples, and the frame shift was 256 samples. All of the above conditions are summarized in Table I.
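The flooring and smoothing steps (36)–(37) can be sketched as a short post-processing routine. The values `g_min=0.1` and `beta=0.3` are illustrative choices, not necessarily those listed in Table I, and the smoothing is applied recursively along time, assuming (37) is interpreted that way.

```python
import numpy as np

def postprocess_mask(g, g_min=0.1, beta=0.3):
    """Flooring (36) and recursive smoothing (37) along time to suppress
    musical noise; g has shape (freq, time). g_min and beta are
    illustrative values, not necessarily those of Table I."""
    g = np.maximum(g_min, np.asarray(g, dtype=float))   # flooring (36)
    out = np.empty_like(g)
    out[:, 0] = g[:, 0]
    for t in range(1, g.shape[1]):                      # smoothing (37)
        out[:, t] = beta * g[:, t] + (1.0 - beta) * out[:, t - 1]
    return out

g = postprocess_mask(np.array([[0.0, 1.0, 0.5]]))
```

Flooring prevents bins from being zeroed out entirely, and the low-pass recursion across frames suppresses the isolated spectral peaks that are perceived as musical noise.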