Spatial Source Subtraction Based on Incomplete Measurements of Relative Transfer Function
Abstract
Relative impulse responses between microphones are usually long and dense due to the reverberant acoustic environment. Estimating them from short and noisy recordings poses a longstanding challenge of audio signal processing. In this paper we apply a novel strategy based on ideas of Compressed Sensing. The relative transfer function (RTF) corresponding to the relative impulse response can often be estimated accurately from noisy data, but only for certain frequencies. This means that often only an incomplete measurement of the RTF is available. A complete RTF estimate can be obtained through finding its sparsest representation in the time-domain: that is, through computing the sparsest among the corresponding relative impulse responses. Based on this approach, we propose to estimate the RTF from noisy data in three steps. First, the RTF is estimated using any conventional method such as the nonstationarity-based estimator by Gannot et al. or through Blind Source Separation. Second, frequencies are determined for which the RTF estimate appears to be accurate. Third, the RTF is reconstructed through solving a weighted convex program, which we propose to solve via a computationally efficient variant of the SpaRSA (Sparse Reconstruction by Separable Approximation) algorithm. An extensive experimental study with real-world recordings has been conducted. It shows that the proposed method is capable of improving many conventional estimators used as the first step in most situations.
Relative Transfer Function, Relative Impulse Response, Sparse Approximations, $\ell_1$-norm, Compressed Sensing
I Introduction
Noise reduction, speech enhancement and signal separation have been goals in audio signal processing for decades. Although various methods have been proposed and applied in practice, open problems remain. The main reason is that the propagation of sound in a natural acoustic environment is complex. Acoustical signals are wideband in nature and span a frequency range from 20 Hz to 20 kHz. Typical room impulse responses have thousands of coefficients; this makes them difficult to estimate, especially in noisy conditions.
When dealing with, e.g., noise reduction, the crucial question is “what is the unwanted part of the signal to be removed?” Single-channel methods, most of which were developed earlier than multichannel methods, typically rely on some knowledge of noise or interference spectra. For example, the spectra can be acquired during noise-only periods, provided that information about the target source activity is available; see an overview of single-channel methods, e.g., in [1, 2, 3]. Multichannel methods can also use spatial information [3, 4]. For example, a multichannel filter can be designed to cancel the signal coming from the target’s position. The output of this filter contains only noise and interference components and provides the key reference for the signal enhancement tasks.
Several terms are used in connection with the target signal cancelation, in particular, spatial source subtraction, null beamforming, target cancelation filter, and blocking matrix (BM). The latter refers to one of the building blocks of the minimum variance distortionless response (MVDR) beamformer implemented in a generalized sidelobe canceler structure [5]. The BM block is responsible for steering a null towards the desired source (hence the name blocking), yielding noise-only reference signals that are further used to enhance the desired source through an adaptive interference canceler and/or by a postfilter. Null beamformers were originally designed under the assumption of free-field propagation (no reverberation), knowing the microphone array geometry (e.g. linear or circular). But later they were also designed taking the reverberation into account; see, e.g., [6, 7].
In natural acoustic environments, the reverberation must be taken into account to achieve satisfactory signal cancelation. This can be done knowing relative impulse responses or, equivalently, relative transfer functions (RTFs) between microphones [6]. The RTF depends on the properties of the environment and on the positions of the target source and microphones. It can be easily computed from noise-free recordings when the target is static [8, 9]. However, the environment as well as the position of the target source can change quickly. Therefore, methods capable of estimating the current RTF within short intervals of noisy recordings, during which the target is approximately static, are desirable.
There have been many attempts to estimate the RTF, or (more generally speaking) to design a null beamformer, from noisy recordings [6, 10, 11]. A popular approach is to use Blind Source Separation (BSS) based on Independent Component Analysis (ICA). However, the accuracy of ICA declines with the number of estimated parameters as it is a statistical approach [12]. The blind estimation of the RTF thus poses a challenging problem since there are thousands of coefficients (parameters) to be estimated. The difficulty of this task particularly grows with growing reverberation time and with growing distance of the target source. A recent goal has therefore been to simplify the task through incorporation of prior knowledge. For example, the knowledge of the approximate direction-of-arrival of the target is used in [13, 14], or a set of pre-estimated RTFs for potential positions of the target is assumed in [15, 16, 17].
A novel strategy is used in [18, 19, 20] by considering the fact that relative impulse responses can be replaced or approximated by sparse filters, that is, by filters that have many coefficients equal to zero; see also [21], a recent work on sparse approximations of room impulse responses. The authors of [20] propose a semi-blind approach assuming knowledge of the support of a sparse approximation. Hence only nonzero coefficients are estimated using ICA, which implies a significant dimensionality reduction of the parameter space. Results show that sparse estimates of filters achieve better target cancelation than dense filters that are estimated in a fully blind way. However, the assumption that the filter support is known is rather impractical.
In this paper, we propose a novel method based on the idea that the RTF may be known or accurately estimated only in several frequency bins. An appropriate name for such an observation is an incomplete measurement of the RTF. The entire RTF is then reconstructed by finding a sparse representation of the incomplete measurement in the time-domain. In other words, the relative impulse response between the microphones is replaced by a sparse impulse response whose Fourier transform is, at the known frequencies, (approximately) equal to the incomplete RTF. In fact, the idea draws on Compressed Sensing, usually applied to sparse/compressible signals or images [22] as well as to system identification.
The following Section introduces the audio mixture model. Section III describes several methods to estimate the relative impulse response or the RTF, both when noise is or is not active. Section IV describes the proposed method, in which the incomplete RTF is reconstructed by an algorithm solving a weighted LASSO program with sparsity-inducing regularization. Section V then describes several ways to select the incomplete RTF estimate. Section VI presents an extensive experimental study with real recordings, and Section VII concludes this article.
II Problem Description
II-A Model
We will consider situations where two microphones are available. (In this paper, we focus only on the two-microphone scenario due to its comparative accessibility. The idea, however, may be generalized to more microphones.) A stereo noisy observation of a target signal can be described as
(1) $x_L(n) = (h_L \ast s)(n) + y_L(n)$, $\quad x_R(n) = (h_R \ast s)(n) + y_R(n)$,
where $n$ is the time index taking values $n = 0,\dots,N-1$; $\ast$ denotes the convolution; $x_L(n)$ and $x_R(n)$ are, respectively, the signals from the left and right microphones; and $y_L(n)$ and $y_R(n)$ are the remaining signals (noise and interferences), commonly referred to as noise. Further, $h_L$ and $h_R$ denote the microphone-target acoustical impulse responses, and $s(n)$ denotes the target source signal. The signals as well as the impulse responses are supposed to be real-valued.
This model assumes that the position of the target source remains (approximately) fixed during the recording interval, i.e., that $h_L$ and $h_R$ are constant for $n = 0,\dots,N-1$.
Using the relative impulse response between the microphones, denoted as $g$, (1) can be rewritten as
(2) $x_L(n) = s_L(n) + y_L(n)$, $\quad x_R(n) = (g \ast s_L)(n) + y_R(n)$,
where $s_L(n) = (h_L \ast s)(n)$ and where $g = h_R \ast h_L^{-1}$, with $h_L^{-1}$ denoting the filter inverse to $h_L$. Note that although real-world acoustic channels $h_L$ and $h_R$ are causal, $g$ need not be so.
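As a quick numerical sanity check (a toy sketch with synthetic filters, not data from the paper), the relative impulse response $g$ relating a right-channel response $h_R$ to a left-channel response $h_L$ can be obtained as the inverse DFT of the ratio of their transfer functions, provided the DFT is long enough:

```python
import numpy as np

def relative_impulse_response(h_L, h_R, F=4096):
    """Compute g = h_R * h_L^{-1} via frequency-domain division (toy sketch)."""
    H_L = np.fft.fft(h_L, F)
    H_R = np.fft.fft(h_R, F)
    return np.real(np.fft.ifft(H_R / H_L))

# Synthetic decaying left-channel response and a delayed, attenuated right channel
rng = np.random.default_rng(0)
h_L = np.r_[1.0, 0.5 * rng.standard_normal(63) * 0.9 ** np.arange(63)]
h_R = np.convolve(h_L, np.r_[np.zeros(10), 0.8])   # h_R = h_L delayed by 10, scaled by 0.8

g = relative_impulse_response(h_L, h_R)
# g should be (approximately) a scaled unit impulse at lag 10
print(np.argmax(np.abs(g)), g[10])
```

Here $g$ comes out as a single spike because $h_R$ was chosen as an exactly delayed and scaled copy of $h_L$; with real room responses, $g$ is a dense, slowly decaying sequence.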
The equivalent description of (1) in the short-term frequency-domain is
(3) $X_L(k,\ell) = H_L(k)\,S(k,\ell) + Y_L(k,\ell)$, $\quad X_R(k,\ell) = H_R(k)\,S(k,\ell) + Y_R(k,\ell)$,
where $k$ denotes the frequency, and $\ell$ is the frame index. The analogy to (2) is
(4) $X_R(k,\ell) = G(k)\,S_L(k,\ell) + Y_R(k,\ell)$,
where $S_L(k,\ell) = H_L(k)\,S(k,\ell)$. Here $G(k)$ denotes the Fourier transform of $g$, which is called the relative transfer function (RTF). It holds that $G(k) = H_R(k)/H_L(k)$.
With low impact on generality, we assume that $H_L(k)$ does not have any zeros on the unit circle; see the discussion in [6] on page 1619.
II-B Spatial Subtraction of a Target Source
When $g$ or $G$ is known, an efficient multichannel filter can be designed that cancels the target signal and only passes through the noise signals. Consider the two-input single-output filter defined by the pair $(\hat g, -\delta)$, where $\hat g$ is an estimate of $g$ and $\delta$ is the unit impulse, such that its output is
(5) $z(n) = (\hat g \ast x_L)(n) - x_R(n)$.
According to (2), it holds that
(6) $z(n) = \big((\hat g - g) \ast s_L\big)(n) + (\hat g \ast y_L)(n) - y_R(n)$.
For $\hat g = g$, the target signal leakage vanishes, and
(7) $z(n) = (g \ast y_L)(n) - y_R(n)$.
This is the information provided about the noise signals $y_L$ and $y_R$, which is crucial in signal separation/enhancement or noise reduction applications. For example, the filter defined through (5) serves as the blocking matrix part in systems having the structure of the generalized sidelobe canceler; see, e.g., [6, 8, 9, 23, 24, 25].
To complete the enhancement of the noisy signal, many steps still have to be taken, all of which pose problems of their own. For example, the spectrum of (7) must sometimes be corrected to approach that of the noise in the signal mixture. The noise reduction itself can be done through adaptive interference cancelation (AIC), a task closely related to Acoustic Echo Cancelation (AEC), and/or through postfiltering. For the latter, single-channel noise reduction methods can be used once the noise reference is given [26].
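A minimal numerical sketch of the spatial subtraction above (synthetic white signals and a short synthetic filter, not the paper's data): filtering the left channel with the true relative impulse response and subtracting the right channel cancels the target image and leaves exactly the noise reference of (7).

```python
import numpy as np

rng = np.random.default_rng(1)
N = 8000
s_L = rng.standard_normal(N)              # target image on the left microphone
g = np.r_[np.zeros(5), 1.0, 0.4, -0.2]    # synthetic relative impulse response
y_L = 0.5 * rng.standard_normal(N)        # noise on the left microphone
y_R = 0.5 * rng.standard_normal(N)        # noise on the right microphone

x_L = s_L + y_L
x_R = np.convolve(g, s_L)[:N] + y_R       # right channel: g * s_L plus noise

# Blocking filter output z = g * x_L - x_R  (here with the true g)
z = np.convolve(g, x_L)[:N] - x_R

# The target leakage vanishes: z equals the noise reference g * y_L - y_R
ref = np.convolve(g, y_L)[:N] - y_R
print(np.max(np.abs(z - ref)))
```

With an imperfect estimate of the filter, the difference between `z` and `ref` would instead contain the residual target leakage described by the first term of (6).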
III Survey of Known Solutions
III-A Noise-Free Conditions
When a recording of an active target source is available in which no noise is present, the relative impulse response or the RTF can be easily estimated. Such estimates naturally provide good substitutes for $\hat g$ in (5).
III-A1 Time-domain estimation using least squares
The mixture model (2) without noise takes on the form
$x_R(n) = (g \ast x_L)(n)$,
where $y_L(n) = y_R(n) = 0$. Least squares can be used to estimate the first $L$ coefficients of $g$ as
(8) $\hat{\mathbf g} = \arg\min_{\mathbf g \in \mathbb R^{L}} \|\mathbf x_R - \mathbf X_L\,\mathbf g\|_2^2$,
where $\hat{\mathbf g}$ is the vector of estimated coefficients of $g$, shifted by an integer delay $D$ due to causality; $\mathbf X_L$ is the convolution (Toeplitz) matrix built from the samples of $x_L$; and $\mathbf x_R$ stacks the corresponding samples of $x_R$.
The solution of (8) is
(9) $\hat{\mathbf g} = \mathbf R^{-1}\mathbf r$,
where
(10) $\mathbf R = \mathbf X_L^T\,\mathbf X_L$,
(11) $\mathbf r = \mathbf X_L^T\,\mathbf x_R$.
It is worth noting that the Levinson-Durbin algorithm [27] exploiting the Toeplitz structure of $\mathbf R$ can be used to compute $\hat{\mathbf g}$ for all filter lengths up to $L$ in $\mathcal O(L^2)$ operations. The consistency of the time-domain estimation was studied in [28].
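Under noise-free conditions, (8)-(11) amount to ordinary least squares on a convolution matrix. A small sketch on synthetic data (the delay $D$ is omitted here for simplicity, and `scipy.linalg.toeplitz` builds the convolution matrix):

```python
import numpy as np
from scipy.linalg import toeplitz

rng = np.random.default_rng(2)
N, L = 4000, 32
g_true = np.zeros(L)
g_true[[3, 7, 15]] = [1.0, -0.5, 0.25]    # sparse synthetic relative impulse response

x_L = rng.standard_normal(N)
x_R = np.convolve(g_true, x_L)[:N]        # noise-free right channel

# Convolution (Toeplitz) matrix: column i holds x_L delayed by i samples
X_L = toeplitz(x_L, np.r_[x_L[0], np.zeros(L - 1)])

# Normal equations (9)-(11): g_hat = R^{-1} r
R = X_L.T @ X_L
r = X_L.T @ x_R
g_hat = np.linalg.solve(R, r)
print(np.max(np.abs(g_hat - g_true)))
```

In practice one would solve the normal equations with a Toeplitz-aware routine (as the Levinson-Durbin remark above suggests) rather than forming and inverting $\mathbf R$ explicitly.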
III-A2 Frequency-domain estimation
The noise-free recording, in the short-term frequency-domain, takes on the form
$X_R(k,\ell) = G(k)\,X_L(k,\ell)$.
A straightforward estimate of the RTF is given by
(12) $\hat G(k) = \dfrac{\sum_\ell X_R(k,\ell)\,\overline{X_L(k,\ell)}}{\sum_\ell |X_L(k,\ell)|^2}$.
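A toy sketch of this averaged cross-periodogram estimator on a synthetic spectrogram. (For simplicity the RTF is applied multiplicatively per bin, which idealizes the convolution; with real signals, windowing makes the per-bin model only approximate.)

```python
import numpy as np
from scipy.signal import stft

rng = np.random.default_rng(3)
x_L = rng.standard_normal(16000)
f, t, XL = stft(x_L, nperseg=512)          # 257 one-sided bins x frames

# Synthetic ground-truth RTF of a 3-tap filter, applied per bin in this sketch
G_true = np.fft.rfft(np.r_[1.0, 0.5, -0.25], 512)
XR = G_true[:, None] * XL                  # noise-free "right channel" spectrogram

# Estimator (12): averaged cross-periodogram over frames, one value per bin
G_hat = np.sum(XR * np.conj(XL), axis=1) / np.sum(np.abs(XL) ** 2, axis=1)
print(np.max(np.abs(G_hat - G_true)))
```

The averaging over frames is what makes the estimator robust once noise enters the model, at the price of the bias discussed in Section IV-A.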
III-B Estimators Admitting Presence of Noise
III-B1 Frequency-domain estimator using nonstationarity
A frequency-domain estimator was proposed by Gannot et al. [6]. It admits the presence of noise signals that are stationary or, at least, much less dynamic than the target signal; see also [29].
The model (4) can be written as
(13) $X_R(k,\ell) = G(k)\,X_L(k,\ell) + U(k,\ell)$,
where $U(k,\ell) = Y_R(k,\ell) - G(k)\,Y_L(k,\ell)$. Note that, in this form, $X_L$ and $U$ are not independent. Let this model be valid for a certain interval during which $G(k)$ is approximately constant, and let the interval be split into $M$ frames. By (13), we have
(14) $\phi^i_{x_R x_L}(k) = G(k)\,\phi^i_{x_L x_L}(k) + \phi^i_{u x_L}(k)$, $\quad i = 1,\dots,M$,
where $\phi^i_{ab}(k)$ denotes the (cross-)power spectral density between $a$ and $b$ during the $i$th frame.
According to the assumptions of this method (noise is stationary), $\phi_{u x_L}(k)$ should be independent of $i$ (thus written without the frame index) and the following set of equations holds
(15) $\begin{bmatrix}\phi^1_{x_R x_L}(k)\\ \vdots\\ \phi^M_{x_R x_L}(k)\end{bmatrix} = \begin{bmatrix}\phi^1_{x_L x_L}(k) & 1\\ \vdots & \vdots\\ \phi^M_{x_L x_L}(k) & 1\end{bmatrix}\begin{bmatrix}G(k)\\ \phi_{u x_L}(k)\end{bmatrix}$,
which is solved for $G(k)$ and $\phi_{u x_L}(k)$ in the least-squares sense.
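The nonstationarity trick can be sketched for a single frequency bin with synthetic cross-spectral densities (the values below are invented for the demonstration): because the target's power varies over frames while the noise term does not, the overdetermined system in the style of (15) separates the two unknowns.

```python
import numpy as np

rng = np.random.default_rng(4)
M = 12                                     # number of frames
G_true = 0.7 - 0.3j                        # RTF value at one frequency bin (synthetic)
phi_u = 0.2 + 0.05j                        # stationary noise term phi_{u x_L}(k)

# Nonstationary target: the auto-PSD of x_L varies over frames
phi_xLxL = 1.0 + rng.random(M) * 4.0
phi_xRxL = G_true * phi_xLxL + phi_u       # model (14) with a frame-independent noise term

# Least-squares solution of the overdetermined two-unknown system
A = np.c_[phi_xLxL.astype(complex), np.ones(M)]
sol, *_ = np.linalg.lstsq(A, phi_xRxL, rcond=None)
G_hat, phi_u_hat = sol
print(abs(G_hat - G_true), abs(phi_u_hat - phi_u))
```

If the target were stationary as well, the first column of `A` would be (nearly) constant, the system would be ill-conditioned, and the estimator would fail, which is exactly the method's assumption at work.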
III-B2 Geometric Source Separation (GSS) by [30]
The method described here was originally designed to blindly separate directional sources whose directions of arrival (DOAs) must be given in advance (known or estimated). The method then makes use of constrained BSS so that the separating filters are kept close to a beamformer steering directional nulls in the selected directions. We skip details of this method to save space and refer the reader to [30] or to [31] for a shorter description (pages 674-675); see also a modified variant of GSS in [14].
This method can be used for the RTF estimation as follows. Considering two microphones and two sources, one steered direction is selected as the DOA of the target source. The second direction is either the DOA of the (directional) interferer or, in the case of diffused or omnidirectional noise, a direction that is far apart (say 90°) from that of the target source. Let $\mathbf W(k)$ denote the resulting separating ($2\times 2$) transform that is applied to the mixed signals as
(16) $\mathbf z(k,\ell) = \mathbf W(k)\,\mathbf x(k,\ell)$,
where $\mathbf x(k,\ell) = [X_L(k,\ell),\,X_R(k,\ell)]^T$, and $X_i(k,\ell)$ denotes the short-term Fourier transform of $x_i(n)$. Ideally, the elements of $\mathbf z(k,\ell)$ correspond to the individual signals in the selected directions.
Let the first row of $\mathbf W(k)$ be the filter that steers a directional null towards the target source, which means that the first element of $\mathbf z(k,\ell)$ contains only noise signals. The RTF estimate is then given through
(17) $\hat G(k) = -\,\dfrac{W_{11}(k)}{W_{12}(k)}$,
where $W_{ij}(k)$ denotes the $(i,j)$th element of $\mathbf W(k)$.
IV Proposed Solution
IV-A Motivation and Concept
The estimators described above become biased when the assumptions used in their derivations are violated. For example, the bias in (12) depends on the initial Signal-to-Noise Ratio (SNR), which may vary over time and frequency. Assuming that the SNR is sufficiently high for a given frequency, the estimator is good. But when the SNR is low, the estimator’s accuracy is also low. Rather than using inaccurate estimates, we can ignore those corresponding to frequencies with low SNR values. We thus arrive at incomplete information about the RTF. That is, the estimate of $G(k)$ is known only for some values of $k$.
Based on this idea, our strategy is to construct an appropriate substitute for $\hat g$ in (5) using an incomplete RTF. Typical relative impulse responses are fast-decaying sequences, which are compressible in the time-domain, and can thus be replaced by sparse filters [18, 19, 22, 32]. These are derived through finding sparse solutions of a system built up from incomplete information in a different domain: in our case, the frequency-domain [33, 34].
We thus propose a novel method that consists of three parts. (The proposed method can be modified in many ways since various solutions can be used for each part of it; we could therefore speak of a proposed class of methods. Nevertheless, the term “proposed method” will be used throughout the article.)
1. Pre-estimation of the RTF from a (noisy) recording.
2. Determination of a subset of frequencies where the estimate of the RTF is sufficiently accurate.
3. Computation of a sparse approximation of $g$ using the incomplete RTF.
Various solutions can be used for each part. Potential methods to solve Part 1 have already been described in Section III. Part 2 can be solved in many ways depending on the given scenario, the signal characteristics and the method used within Part 1; we postpone this issue to the next Section. Now we focus on a mathematical description of an appropriate method to solve Part 3.
IV-B Nomenclature and Problem Formulation for Part 3
Consider the Discrete Fourier Transform (DFT) domain where the length of the DFT is $F$ (sufficiently large with respect to the effective length of $g$), and, for simplicity, let $F$ be even. Let $\mathcal K$ denote the set of indices of frequency bins where a given RTF estimate, denoted as $\hat G$, is sufficiently accurate (that is, assume that Parts 1 and 2 have already been resolved). Specifically, let the values of the estimate be
(18) $\hat G(k) \approx G(k)$, $\quad k \in \mathcal K$,
where $\mathcal K \subseteq \{1,\dots,F/2-1\}$.
For simplicity, the frequency bins $k = 0$ and $k = F/2$ can be excluded from $\mathcal K$ for the following symmetry to hold: once $k \in \mathcal K$, then the RTF estimate is also known for the bin $F-k$, namely $\overline{\hat G(k)}$ (the conjugate value of $\hat G(k)$), since $g$ is real-valued.
Let $\mathbf g$ denote an $F\times 1$ column vector stacking the coefficients of $g$, and let $m = |\mathcal K|$, where $|\mathcal K|$ denotes the cardinality of $\mathcal K$. The known estimates of the RTF satisfy
(19) $\mathbf F_{\mathcal K}\,\mathbf g \approx \hat{\mathbf G}_{\mathcal K}$,
where $\mathbf F$ is the $F\times F$ matrix of the DFT, $\mathbf F_{\mathcal K}$ is the $m\times F$ submatrix of $\mathbf F$ comprised of the rows whose indices are in $\mathcal K$, and $\hat{\mathbf G}_{\mathcal K}$ stacks the values $\hat G(k)$, $k\in\mathcal K$. Since $\mathbf g$ is real, the system of linear equations (19) can be written as $2m$ real-valued linear conditions
(20) $\Psi\,\mathbf g = \mathbf b$, $\quad \Psi = \begin{bmatrix}\Re(\mathbf F_{\mathcal K})\\ \Im(\mathbf F_{\mathcal K})\end{bmatrix}$, $\quad \mathbf b = \begin{bmatrix}\Re(\hat{\mathbf G}_{\mathcal K})\\ \Im(\hat{\mathbf G}_{\mathcal K})\end{bmatrix}$,
where $\Re(\cdot)$ and $\Im(\cdot)$ denote, respectively, the real and imaginary parts of the argument.
Since $2m$ is typically smaller than $F$, the system (20) is underdetermined and has many solutions. The key idea is to find sparse solutions that yield efficient sparse approximations of $g$.
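The construction of the real-valued underdetermined system can be sketched as follows (synthetic sparse filter and a random choice of measured bins; in the method itself the bins would come from Part 2):

```python
import numpy as np

F_len = 256                                # DFT length F
rng = np.random.default_rng(5)
g_true = np.zeros(F_len)
g_true[[4, 10, 30]] = [1.0, -0.6, 0.3]     # sparse ground-truth filter

# Indices of "accurately estimated" bins (excluding 0 and F/2)
K = np.sort(rng.choice(np.arange(1, F_len // 2), size=20, replace=False))
Fmat = np.fft.fft(np.eye(F_len), axis=0)   # full F x F DFT matrix
G_hat_K = Fmat[K] @ g_true                 # incomplete (here: exact) RTF measurements

# Real-valued system (20): Psi g = b with 2|K| equations and F unknowns
Psi = np.vstack([Fmat[K].real, Fmat[K].imag])
b = np.r_[G_hat_K.real, G_hat_K.imag]
print(Psi.shape, np.max(np.abs(Psi @ g_true - b)))
```

With 20 measured bins, the system has 40 equations and 256 unknowns, so infinitely many filters match the incomplete RTF; the sparsity prior of the next subsection is what singles out a useful one.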
IV-C Sparse Solutions of (20)
The sparsest solution of (20) is defined as
(21) $\min_{\mathbf g}\ \|\mathbf g\|_0$ subject to $\Psi\mathbf g = \mathbf b$,
where $\|\mathbf g\|_0$ is equal to the number of nonzero elements in $\mathbf g$ (the $\ell_0$ pseudo-norm). Solving this task is an NP-hard problem. Further in the paper, we will therefore consider relaxed variants based on convex programming. Several efficient greedy algorithms to solve (21) exist but cannot guarantee the finding of a global solution in general; see, e.g., [35, 36].
A more tractable formulation is based on the replacement of the $\ell_0$ pseudo-norm in (21) by the $\ell_1$ norm, a sparsity-inducing criterion with which the optimization program becomes convex. The program is called basis pursuit [37] and is defined as
(22) $\min_{\mathbf g}\ \|\mathbf g\|_1$ subject to $\Psi\mathbf g = \mathbf b$.
Using the substitution $\mathbf g = \mathbf u - \mathbf v$, where $\mathbf u \ge 0$ and $\mathbf v \ge 0$, (22) can be recast as
(23) $\min_{\mathbf u,\mathbf v}\ \mathbf 1^T(\mathbf u + \mathbf v)$
under the constraints
$\Psi(\mathbf u - \mathbf v) = \mathbf b$, $\quad \mathbf u \ge 0$, $\quad \mathbf v \ge 0$,
which is indeed a linear programming problem. The solution can be found using the standard Matlab linprog function. Other state-of-the-art optimization tools can also be used, such as the SPGL1 package (http://www.cs.ubc.ca/mpf/spgl1) by Berg et al.; see [38].
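The linear-programming reformulation can be sketched with SciPy's `linprog` standing in for Matlab's (a random Gaussian matrix stands in for the partial-DFT matrix here; the data are synthetic):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(6)
F_len, m2 = 64, 32                         # F unknowns, 2m equations: underdetermined
g_true = np.zeros(F_len)
g_true[[3, 17, 40]] = [1.0, -0.5, 0.8]     # sparse vector to be recovered

Psi = rng.standard_normal((m2, F_len))     # stand-in for the partial-DFT matrix
b = Psi @ g_true

# LP form (23): minimize 1^T(u + v) s.t. Psi(u - v) = b, u >= 0, v >= 0; then g = u - v
c = np.ones(2 * F_len)
A_eq = np.hstack([Psi, -Psi])
res = linprog(c, A_eq=A_eq, b_eq=b, bounds=(0, None))
g_bp = res.x[:F_len] - res.x[F_len:]
print(res.status, np.max(np.abs(g_bp - g_true)))
```

At the LP optimum, at most one of each pair $(u_i, v_i)$ is nonzero, so $\mathbf 1^T(\mathbf u + \mathbf v)$ indeed equals $\|\mathbf g\|_1$.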
However, neither formulation (21) nor (22) takes into account the fact that $\mathbf b$ contains certain estimation errors. It is therefore better to relax the constraint given through (20). One such alternative to (22) is LASSO (Least Absolute Shrinkage and Selection Operator), defined as
(24) $\min_{\mathbf g}\ \tfrac12\|\Psi\mathbf g - \mathbf b\|_2^2 + \lambda\|\mathbf g\|_1$,
where $\lambda > 0$. This formulation is closely related to the basis pursuit denoising program defined as
(25) $\min_{\mathbf g}\ \|\mathbf g\|_1$ subject to $\|\Psi\mathbf g - \mathbf b\|_2 \le \delta$,
with $\delta \ge 0$, which is easy to interpret: the constraint is a relaxation of (20) taking the possible inaccuracy in $\mathbf b$ into account. LASSO is equivalent to (25) in the sense that the sets of solutions for all possible choices of $\lambda$ and $\delta$ are the same. This means that the solution of (25) can be found through solving (24) with the corresponding $\lambda$. Nevertheless, the correspondence between $\lambda$ and $\delta$ is not trivial and is possibly discontinuous [39].
In this paper, we use a weighted formulation of (24) given by
(26) $\min_{\mathbf g}\ \tfrac12\|\Psi\mathbf g - \mathbf b\|_2^2 + \|\mathbf w \odot \mathbf g\|_1$,
where $\mathbf w$ is a vector of nonnegative weights (absorbing $\lambda$), and $\odot$ denotes the elementwise product.
The weights enable us to incorporate a priori knowledge about the solution. Elements of $\mathbf g$ with higher weights tend to be closer to or equal to zero. We use this fact and select the weights to reflect the expected shape of $g$. Our heuristic choice, which is similar to that in [21], is
(27) $w_i = \tau + \beta\,|i - D|^{p}$, $\quad i = 1,\dots,F$,
where $\tau$, $\beta$ and $p$ are positive constants and $D$ is the delay parameter. Fig. 1 shows three examples of this weighting function with three different values of the exponent parameter $p$. The smallest weights are concentrated near $i = D$, because the direct-path peak of $g$ is expected there; the minimum value is $\tau$. The weights grow with the distance from $D$, where the speed of the growth is controlled through $\beta$ and $p$. The growth of the weights should reflect the expected decay in the magnitudes of the coefficients of $g$.
IV-D Algorithm
In this subsection, a proximal gradient algorithm to solve (26) is proposed. It is a modification of SpaRSA (Sparse Reconstruction by Separable Approximation) introduced in [40]; see also the closely related iterative shrinkage/thresholding methods [41]. An advantage of these methods is their fast convergence, especially when they are initialized in the vicinity of the solution. The computational load is reduced using the properties of $\Psi$.
Proximal gradient methods can be seen as a generalization of gradient descent algorithms for convex minimization programs where the objective function has the form
$f(\mathbf g) + h(\mathbf g)$,
where both $f$ and $h$ are closed proper convex and $f$ is differentiable [42]. Indeed, (26) obeys this form with $f(\mathbf g) = \tfrac12\|\Psi\mathbf g - \mathbf b\|_2^2$ and $h(\mathbf g) = \|\mathbf w\odot\mathbf g\|_1$.
One iteration of the proximal gradient method is
(28) $\mathbf g \leftarrow \operatorname{prox}_{\alpha h}\big(\mathbf g - \alpha\,\nabla f(\mathbf g)\big)$,
where
(29) $\operatorname{prox}_{\alpha h}(\mathbf x) = \arg\min_{\mathbf z}\ \tfrac{1}{2\alpha}\|\mathbf z - \mathbf x\|_2^2 + h(\mathbf z)$
is the proximal operator, and $\alpha$ is a step-length parameter. The method is known to converge under very mild conditions; see [42].
By putting $f$ and $h$ from (26) into (28), we arrive at one iteration of the proposed algorithm
(30) $\mathbf g^{(j+1)} = \arg\min_{\mathbf z}\ \tfrac{\alpha_j}{2}\big\|\mathbf z - \mathbf u^{(j)}\big\|_2^2 + \|\mathbf w\odot\mathbf z\|_1$,
where $j$ is the iteration index, $\alpha_j$ is a variable step-length parameter, and
(31) $\mathbf u^{(j)} = \mathbf g^{(j)} - \tfrac{1}{\alpha_j}\,\Psi^T\big(\Psi\mathbf g^{(j)} - \mathbf b\big)$.
The elements of $\mathbf z$ are separable in (30), which allows us to find the solution in closed form [40], that is,
(32) $g_i^{(j+1)} = \operatorname{soft}\big(u_i^{(j)},\,w_i/\alpha_j\big)$,
where $\operatorname{soft}(x,t) = \operatorname{sign}(x)\max(|x| - t,\,0)$. In (32), this “soft-thresholding” function is applied elementwise.
The step-length parameter is chosen as in SpaRSA,
(33) $\alpha_j = \dfrac{\big\|\Psi\big(\mathbf g^{(j)} - \mathbf g^{(j-1)}\big)\big\|_2^2}{\big\|\mathbf g^{(j)} - \mathbf g^{(j-1)}\big\|_2^2}$,
which was derived based on a variant of the Barzilai-Borwein spectral approach; see [40].
To terminate the algorithm, we derive a stopping criterion as follows. It holds that $\mathbf g$ is the solution of (26) if and only if it satisfies [43]
(34) $\big[\Psi^T(\Psi\mathbf g - \mathbf b)\big]_{\mathcal A} = -\,\mathbf w_{\mathcal A}\odot\operatorname{sign}(\mathbf g_{\mathcal A})$,
(35) $\big|\big[\Psi^T(\Psi\mathbf g - \mathbf b)\big]_{\overline{\mathcal A}}\big| \le \mathbf w_{\overline{\mathcal A}}$ (elementwise),
where the subscript denotes the restriction to the indices (columns in the case of a matrix) in the given set; $\operatorname{sign}(\mathbf g_{\mathcal A})$ is the vector of signs of the elements of $\mathbf g_{\mathcal A}$; $\mathcal A$ is the set of indices of the nonzero elements of $\mathbf g$ (the active set), and $\overline{\mathcal A}$ is its complement to $\{1,\dots,F\}$. We define the termination criterion that assesses the degree of validity of (34) as
(36) $C\big(\mathbf g^{(j)}\big) = \Big\|\big[\Psi^T\big(\Psi\mathbf g^{(j)} - \mathbf b\big)\big]_{\mathcal A} + \mathbf w_{\mathcal A}\odot\operatorname{sign}\big(\mathbf g^{(j)}_{\mathcal A}\big)\Big\|_\infty$.
The algorithm stops iterating when $C(\mathbf g^{(j)}) < \epsilon$, where $\epsilon$ is a small positive constant.
Using the fact that the solution satisfies (34) and (35), it can be shown that it is a fixed point of (30). The global convergence of the algorithm (although with a different stopping criterion) was proven in [40].
Most of the computational burden is due to the vector-matrix products by $\Psi$ and $\Psi^T$ in (31) and in (33). Since $\Psi$ represents only a part of the DFT, the products can be computed via the (inverse) Fast Fourier Transform, which also leads to memory savings as $\Psi$ is determined only through $\mathcal K$. The computational complexity of one iteration is thus $\mathcal O(F\log F)$. A pseudocode of the algorithm is summarized in Algorithm 1; a Matlab implementation is available at http://itakura.ite.tul.cz/zbynek/downloads.htm.
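The core iteration (gradient step followed by weighted soft-thresholding) can be sketched compactly on synthetic data. For robustness, this standalone sketch uses a constant safe step length (the squared spectral norm of $\Psi$) instead of the Barzilai-Borwein rule, which SpaRSA pairs with an acceptance test; it is an illustration, not the paper's Algorithm 1, and it does not use the FFT shortcut.

```python
import numpy as np

def soft(x, t):
    """Elementwise soft-thresholding, cf. (32): sign(x) * max(|x| - t, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def weighted_l1_prox_grad(Psi, b, w, n_iter=3000):
    """Proximal-gradient sketch for 0.5*||Psi g - b||^2 + ||w .* g||_1, cf. (26)."""
    alpha = np.linalg.norm(Psi, 2) ** 2        # constant safe step length (Lipschitz bound)
    g = np.zeros(Psi.shape[1])
    for _ in range(n_iter):
        u = g - Psi.T @ (Psi @ g - b) / alpha  # gradient step on the data term, cf. (31)
        g = soft(u, w / alpha)                 # weighted shrinkage, cf. (32)
    return g

rng = np.random.default_rng(7)
F_len, m2 = 64, 32
g_true = np.zeros(F_len)
g_true[[3, 17, 40]] = [1.0, -0.5, 0.8]         # sparse synthetic filter
Psi = rng.standard_normal((m2, F_len)) / np.sqrt(m2)
b = Psi @ g_true
w = np.full(F_len, 0.02)                       # flat weights for this toy run
g_hat = weighted_l1_prox_grad(Psi, b, w)
print(np.max(np.abs(g_hat - g_true)))
```

The recovered support matches the true one, with the small shrinkage bias toward zero that the $\ell_1$ penalty introduces on the active coefficients.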
V Determining the Set $\mathcal K$
This Section is dedicated to solutions of Part 2 of the proposed method. Let the estimates $\hat G(k)$ be given for all $k$. The task is to select the set $\mathcal K$ such that $\hat G(k)$ is sufficiently accurate for $k \in \mathcal K$.
V-A Oracle Inference
For experimental purposes, we define an oracle method that comes from complete knowledge of the SNR in the frequency domain. For simplicity, we can consider the SNR on the left microphone only, which is given by
$\mathrm{SNR}(k) = \dfrac{\sum_\ell |S_L(k,\ell)|^2}{\sum_\ell |Y_L(k,\ell)|^2}$.
This method selects the frequencies for which the SNR is higher than a positive adjustable parameter $\Delta$. The resulting set will be denoted as $\mathcal K_{\mathrm{ora}}$. Specifically, it holds that
(37) $\mathcal K_{\mathrm{ora}} = \{k : \mathrm{SNR}(k) > \Delta\}$.
Now we focus on methods that do not require prior knowledge of the SNR.
V-B Kurtosis-Based Selection
For cases where the target signal is a speaker’s voice while the other sources are non-speech, voice activity detectors (VADs) can be used to infer high-SNR frequency bins [2]. Here we use a simple detector based on kurtosis. Kurtosis is often used as a contrast function reflecting the (non-)Gaussian character of a random variable, because the kurtosis of a Gaussian variable is equal to zero. For example, a VAD using kurtosis was proposed in [44]; a recent method for blind source extraction using kurtosis was proposed in [45].
For a complex-valued random variable $X$, the normalized kurtosis is defined as
(38) $\operatorname{kurt}(X) = \dfrac{\mathrm E\big[|X|^4\big]}{\big(\mathrm E\big[|X|^2\big]\big)^2} - 2$,
where $\mathrm E[\cdot]$ stands for the expectation operator, which is replaced by the sample mean in practice. Speech signals often yield positive kurtosis. We therefore define the set of selected frequencies as
(39) $\mathcal K_{\mathrm{kurt}} = \{k : \operatorname{kurt}(X_L(k,\cdot)) > \Delta\}$.
In other words, frequencies that yield higher kurtosis than $\Delta$ on the left channel are supposed to contain a dominating target (speech) signal.
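A toy sketch of this selection rule on a synthetic spectrogram (the speech-like bins are simulated by sparse on/off activity, which makes them super-Gaussian; real speech spectra behave similarly but less cleanly):

```python
import numpy as np

def normalized_kurtosis(X):
    """Sample version of the normalized kurtosis (38); zero for circular complex Gaussian data."""
    p2 = np.mean(np.abs(X) ** 2, axis=-1)
    p4 = np.mean(np.abs(X) ** 4, axis=-1)
    return p4 / p2 ** 2 - 2.0

rng = np.random.default_rng(8)
n_bins, n_frames = 64, 2000
X = (rng.standard_normal((n_bins, n_frames))
     + 1j * rng.standard_normal((n_bins, n_frames))) / np.sqrt(2)   # Gaussian "noise" bins

speech_bins = np.arange(0, n_bins, 4)       # bins where "speech" dominates
activity = rng.random((len(speech_bins), n_frames)) < 0.25
X[speech_bins] = X[speech_bins] * activity  # sparse activity: expected kurtosis 2/0.25 - 2 = 6

Delta = 1.0
kurt = normalized_kurtosis(X)
K_kurt = np.where(kurt > Delta)[0]          # selection rule (39)
print(len(K_kurt))
```

The Gaussian bins cluster near zero kurtosis while the intermittently active bins sit well above the threshold, so the selection recovers exactly the speech-dominated bins.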
V-C Selection Methods after Applying BSS
V-C1 Divergence
Some BSS methods, such as GSS described in Section III-B2, proceed by numerical optimization of a contrast function that evaluates the independence of the separated outputs. For example, GSS minimizes a criterion for approximate joint diagonalization of covariance matrices of the input signals computed on frames, plus a penalty function ensuring a constraint [30]. When the minimum of the function is shallow, the convergence is slow, which might be indicative of poor separation.
Therefore, the method proposed here rejects the frequencies for which the algorithm did not converge within a selected number of iterations. Thus, the selection is
(40) $\mathcal K_{\mathrm{div}} = \{k : \text{the algorithm converged at bin } k \text{ within the allowed number of iterations}\}$.
V-C2 Coherence-Based Selection
Another way to assess the separation quality without knowing the achieved SNR is to compute the coherence function between the separated signals. As the separated signals should be independent, the coherence, defined as
(41) $\Gamma(k) = \dfrac{\big|\sum_\ell Z_1(k,\ell)\,\overline{Z_2(k,\ell)}\big|}{\sqrt{\sum_\ell |Z_1(k,\ell)|^2\,\sum_\ell |Z_2(k,\ell)|^2}}$,
should be “small”. Here, $Z_i(k,\ell)$ denotes the $i$th separated signal, that is, the $i$th element of $\mathbf z(k,\ell)$ defined in (16). Now, the selection is defined as
(42) $\mathcal K_{\mathrm{coh}} = \{k : \Gamma(k) < \Delta\}$.
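The coherence-based rule can be sketched with synthetic separated outputs, where a few bins are given deliberate residual crosstalk to imitate failed separation:

```python
import numpy as np

def coherence(Z1, Z2):
    """Magnitude coherence between two separated outputs per frequency bin, cf. (41)."""
    num = np.abs(np.sum(Z1 * np.conj(Z2), axis=-1))
    den = np.sqrt(np.sum(np.abs(Z1) ** 2, axis=-1) * np.sum(np.abs(Z2) ** 2, axis=-1))
    return num / den

rng = np.random.default_rng(9)
n_bins, n_frames = 32, 500
A = rng.standard_normal((n_bins, n_frames)) + 1j * rng.standard_normal((n_bins, n_frames))
B = rng.standard_normal((n_bins, n_frames)) + 1j * rng.standard_normal((n_bins, n_frames))

Z1, Z2 = A, B.copy()
bad = np.arange(0, n_bins, 8)              # bins where separation "failed": residual crosstalk
Z2[bad] += 2.0 * A[bad]

Delta = 0.5
Gamma = coherence(Z1, Z2)
K_coh = np.where(Gamma < Delta)[0]         # selection rule (42)
print(len(K_coh))
```

Independent bins show coherence near zero (it decays roughly as one over the square root of the number of frames), while the crosstalk bins stand out clearly above the threshold and are rejected.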
V-D Thresholds
VI Experiments
We present results of experiments evaluating and comparing the ability of several methods to attenuate a target speaker in noisy stereo recordings. Each scenario is simulated using a database of room impulse responses (RIRs) measured in the speech & acoustic lab of the Faculty of Engineering at Bar-Ilan University [46] (available at http://www.eng.biu.ac.il/gannot/downloads/). The lab is a room with variable reverberation time ($T_{60}$ is set, respectively, to 160 ms, 360 ms and 640 ms). The database consists of impulse responses relating eight microphones and a loudspeaker. The microphones are arranged to form a linear array (we use pairs of microphones from this arrangement), and the loudspeaker is placed at various angles and distances; see the setup depicted in Fig. 2. All computations were done in Matlab on a standard PC with a four-core 2.6 GHz processor and 8 GB of RAM.
The noise signals are either diffused and isotropic (shortly referred to as omnidirectional) or simulated to be directional (one channel of an original noise signal is convolved with the RIRs corresponding to the interferer’s position). A sample of omnidirectional babble noise is taken from the database recorded in the lab. Signals for directional sources are taken from the task of the SiSEC 2013 evaluation campaign [47] (http://sisec.wiki.irisa.fr/) titled “Two-channel mixtures of speech and real-world background noise.” We use a female and a male utterance and a sample of babble noise recorded in a cafeteria. (This sample is used to simulate a directional babble noise although typical babble noise is diffused and isotropic; its purpose is to have another directional source besides the Gaussian noise.) The signals are 10 s long, and the sampling frequency is 16 kHz.
Once the microphone responses of the sources are prepared, they are mixed together at a specified SNR averaged over both microphones. Specifically,
(43) $\mathrm{SNR} = \dfrac{\sum_n \big(s_L(n)^2 + s_R(n)^2\big)}{\sum_n \big(y_L(n)^2 + y_R(n)^2\big)}$,
where $n$ spans a given interval of data and $s_R = h_R \ast s$ denotes the response of the target signal on the right microphone.
The testing sample (10 s) is split into intervals with 75% overlap; the experiments are always conducted on each interval (37 independent trials when the interval length is 1 s) and the results are averaged. For a particular interval, the SNR at the output of (6) is computed as
(44) $\mathrm{SNR}_{\mathrm{out}} = \dfrac{\sum_n \big((\hat g \ast s_L)(n) - s_R(n)\big)^2}{\sum_n \big((\hat g \ast y_L)(n) - y_R(n)\big)^2}$,
where $\hat g$ denotes the estimate of $g$. The numerator of (44) corresponds to the leakage of the target signal in (6) while the denominator contains the desired noise reference.
The final criterion is the attenuation rate, evaluated as the ratio between $\mathrm{SNR}_{\mathrm{out}}$ and $\mathrm{SNR}$. The more negative the value (in dB) of this criterion is, the better the evaluated filter performs.
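The evaluation criterion can be sketched as follows (synthetic signals and filter; in the experiments the component signals are known because the mixtures are simulated):

```python
import numpy as np

def attenuation_rate_db(g_hat, s_L, s_R, y_L, y_R):
    """Ratio of the output SNR (44) to the input SNR (43), in dB; more negative is better."""
    N = len(s_L)
    leak = np.convolve(g_hat, s_L)[:N] - s_R       # target leakage, numerator of (44)
    noise = np.convolve(g_hat, y_L)[:N] - y_R      # noise reference, denominator of (44)
    snr_out = np.sum(leak ** 2) / np.sum(noise ** 2)
    snr_in = np.sum(s_L ** 2 + s_R ** 2) / np.sum(y_L ** 2 + y_R ** 2)
    return 10.0 * np.log10(snr_out / snr_in)

rng = np.random.default_rng(10)
N = 8000
g = np.r_[np.zeros(4), 1.0, 0.3]           # synthetic relative impulse response
s_L = rng.standard_normal(N)               # target image, left microphone
s_R = np.convolve(g, s_L)[:N]              # target image, right microphone
y_L, y_R = rng.standard_normal(N), rng.standard_normal(N)

g_near = g + 1e-3 * rng.standard_normal(len(g))   # nearly correct estimate
att_good = attenuation_rate_db(g_near, s_L, s_R, y_L, y_R)
att_poor = attenuation_rate_db(np.r_[1.0], s_L, s_R, y_L, y_R)
print(att_good, att_poor)
```

A nearly correct filter yields a strongly negative attenuation rate, while a trivial one-tap filter leaves the target essentially unattenuated (a rate near 0 dB).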
We compare several variants of the proposed method combining different approaches to solve Part 1 and Part 2; Part 3 is the same in all instances. The methods used in Part 1 (FD, NS-FD and GSS) are always compared with those obtained after Parts 2 and 3, as the main goal is that the latter improve the former; see the list of compared methods in Table I.

Method Acronym | Method used in Part 1 | Method used in Part 2
FD | freq.-domain estimator (12) | none
FD-ora | freq.-domain estimator (12) | oracle selector (37)
FD-kurt | freq.-domain estimator (12) | kurtosis-based selector (39)
NS-FD | nonstationarity-based freq.-domain estimator (15) | none
NS-FD-ora | nonstationarity-based freq.-domain estimator (15) | oracle selector (37)
NS-FD-kurt | nonstationarity-based freq.-domain estimator (15) | kurtosis-based selector (39)
GSS | BSS estimator (17) | none
GSS-div | BSS estimator (17) | divergence-based selector (40)
GSS-coh | BSS estimator (17) | coherence-based selector (42)
If not specified otherwise, the parameters are set to the default values shown in Table II. Note that the microphone distances are selected differently for FD and NS-FD than for GSS in order to provide setups that are preferable for each method (optimized based on the results).

Parameter | Value [units]
Sampling frequency | 16 kHz
Data interval length per trial | 1 s
$T_{60}$ | 360 ms
SNR | 0 dB
Target angle |
Directional interferer angle |
Distance of sources to microphones | 2 m
Length of DFT | 2048
Window shift in short-term DFT | 64
Delay parameter $D$ | 100
Microphone pair when using FD or NS-FD | [3 4] (3 cm)
Microphone pair when using GSS | [4 5] (8 cm)
$\tau$, $\beta$, $p$ in (27) | 0.1, 0.11, 0.3
Frame length in NS-FD | 1000
Number of blocks in GSS | 4
VI-A Attenuation Rate vs. Percentage
The number of selected frequencies within Part 2 (the parameter we refer to as the percentage) has a particular influence on the resulting estimator. (Results of methods that do not allow the choice of the percentage are shown in the graphs as constant lines.) On the one hand, the attenuation rate is always poor when the percentage is lower than a certain threshold (depending on the method and experiment). On the other hand, the rate always returns to that of the initial estimator as the percentage approaches 100%. It is desirable that the rate be improved, at least for some values in between these two extremes.
VI-A1 Diffused and isotropic noise
Figures 3(a) and 3(b) show the results of two experiments in which the target signal (female speech) is contaminated, respectively, by stationary Gaussian white noise that is spatially white (independently generated on each channel) and by the omnidirectional babble noise.
The white noise situation (Fig. 3(a)) favors NS-FD as it obeys the assumed model [6]. Here, NS-FD-ora and NS-FD-kurt perform approximately the same as NS-FD or marginally improve the attenuation rate (by at most 1 dB) unless the percentage goes below 15%. The methods based on FD behave similarly but do not outperform those based on NS-FD. The original NS-FD is hard to outperform in this scenario as its performance is close to optimal.
In babble noise, the attenuation achieved by NS-FD is modest, while FD fails almost entirely. The proposed methods successfully improve these results for a wide range of percentage values. The best improvements are achieved through the oracle variants NS-FD-ora (70%) and FD-ora (20-80%), which improve the attenuation rates of NS-FD and FD by a substantial margin. The optimum improvement by the kurtosis-based variants NS-FD-kurt (45%) and FD-kurt (45%) is 4-6 dB, which is only moderately lower compared to that of the oracle-based frequency selections. The results confirm that the kurtosis-based selection is efficient in detecting frequencies with high SNR when the noise is Gaussian or babble. Examples of the estimated relative impulse responses in this experiment are shown in Fig. 4.
We also examined the case when the target source was shifted to a 60° angle. The results, not shown here due to space constraints, were comparable with those for the original target position.
ViA2 Directional noise
Fig. 5 shows results of experiments when noise signals were played from a loudspeaker placed at and the target was placed at an angle of or .
By comparing Fig. 3(a) with Figures 5(a) and 5(c), FD and NSFD perform worse by 5–6 dB and by 11–13 dB, respectively, when the Gaussian noise is directional and the target speaker stands at angles of and . This means the directional noise scenario is now less favorable for both FD and NSFD than in the previous scenario. To explain, note that within the frequency bins with low activity of the target source, these methods, in fact, estimate the RTF of the (directional) noise source. When applying such estimated RTF to attenuate the target signal, part of the noise source is attenuated as well, which causes loss in terms of the attenuation rate.
It should also be noted that the performance loss may be even higher when the target is spatially more separated from the noise source (), because the higher the spatial separation of the directional noise source, the higher the bias in the RTF estimates by FD and NSFD could be.
NSFD and NSFD, as well as FD and FD, improve on their base methods, especially as the percentage value approaches 15%. Moreover, these methods yield an attenuation rate close to that achieved with the spatially white Gaussian noise in Fig. 3(a). In contrast to FD and NSFD, the proposed methods do not attenuate the directional noise in frequency bins with low target-source activity.
VI-A3 A speaking interferer
A more difficult situation occurs when the interference is a speech signal. We demonstrate this in an experiment where male speech (the interferer) impinges on the microphones from the direction of , while a female speaker (the target loudspeaker) is placed at ; both are at a distance of m, and T60 is ms. The results are shown in Fig. 6.
Compared to previous experiments, the interfering signal here has similar dynamics and kurtosis as the target signal, which violates the prerequisites of NSFD and of the kurtosisbased selection procedure. Neither FD, NSFD nor FD and NSFD can distinguish the target speaker from the interfering one and, therefore, all of them perform much worse than FD and NSFD (for a large range of percentage values).
A closer look at FD and NSFD reveals that they actually try to attenuate both signals by eliminating the dominant signal within each frequency. To demonstrate this, we performed a simple experiment using only the first trial interval of this experiment. Here, FD and NSFD achieved attenuation rates of and dB, respectively, with a percentage of 25%. When the roles of the target and interfering speaker were interchanged, so that the oracle procedures took 25% of frequencies where the interferer was dominant, FD and NSFD attenuated the interferer by and dB, respectively. The fact that both results were obtained from the same RTF estimates, only with different frequency bins selected, confirms that FD and NSFD tend to attenuate both signals.
In this experiment, we further consider GSS, which is capable of blindly separating the target signal from the interference and vice versa (we apply GSS using known DOAs in this experiment). The RTF estimate can be obtained as described in Section III-B2. Then we can also apply the proposed method based on the selection procedures (40) and (42).
The results in Fig. 6 show that GSS outperforms NSFD as well as FD. Next, GSS (here with ) attenuates the target by about dB, which improves on GSS by dB. GSS also improves the attenuation rate achieved by GSS, with the best improvement obtained for 70–80%. Hence, GSS appears to be better than GSS. However, other experiments, not shown here due to space limitations, show that this comparison does not hold in general.
VI-B Attenuation Rate versus Length of Data
Fig. 7 shows the results of experiments repeated with temporally and spatially white Gaussian noise and with omnidirectional babble noise, respectively. The selection percentage of the proposed methods was fixed at 25% and 45%, respectively, while the data length was varied from ms to s.
The attenuation rates of FD and NSFD improve slowly with growing interval length. The performance of the proposed variants also improves with the data length. On the other hand, the improvement is not necessarily monotonic, since the attenuation rate also depends on the percentage, which is fixed in this experiment. An example of this non-monotonic behavior is NSFD in Fig. 7(b). Next, NSFD and FD perform even worse than NSFD and FD for the data length of ms. This may be remedied by increasing the percentage in the latter methods closer to 100%. The performance of NSFD and FD remains stable for all data lengths, which points to room for possible improvements (e.g., more robust selection procedures).
VI-C Varying SNR
Here, the experiments where the babble noise was played from a loudspeaker (Fig. 5(b)) and with the male interferer (Fig. 6) are repeated with the percentage fixed at 45% and 55%, respectively; the SNR was varied from to dB. Fig. 8 shows the resulting attenuation rates.
The performance of FD and NSFD improves with growing SNR. For SNRs below about 0 dB, their attenuation rate rises above zero, because the interfering source becomes dominant and FD and NSFD attenuate it more than the target signal.
The proposed methods achieve a better attenuation rate than FD and NSFD for almost all values of SNR. An exception occurs when SNR dB. Here, NSFD (and also NSFD in Fig. 8(a)) perform worse than NSFD. This is again due to the fixed percentage value, which should be chosen close to 100% when SNR is high. For SNR dB, NSFD appears to be efficient.
In the experiment of Fig. 8(b), GSS and the variants derived from it perform almost uniformly across SNR values and improve only slightly with growing SNR. This is due to the blind separation of the sources by GSS, which is very efficient when the sources are close to the microphones ( m here) and the reverberation time is low (T60 ms).
VI-D Varying T60
The last experiment considers varying reverberation time, with T60 equal to , and ms, respectively (the values available in the database [46]). The experiment with two speakers is repeated here with the percentage fixed at 55%.
FD, NSFD and their kurtosis-based variants do not succeed here for any value of T60, for the same reasons as in the experiment of Section VI-A3. By contrast, the attenuation rates of NSFD and FD depend only slightly on T60, which underlines the importance of distinguishing the target's and the interferer's frequencies correctly. The performance of FD even improves with T60, but this is again due to the fixed percentage, whose optimum value differs from situation to situation.
The attenuation rate achieved by GSS, GSS and GSS drops as T60 grows, because blind separation becomes more difficult with increasing reverberation time. Nevertheless, both GSS and GSS improve the attenuation rate of GSS by up to dB, even in the most difficult case when T60 is ms.
VII Conclusions and Discussion
We have proposed a novel approach to estimating the RTF from noisy data. The experiments have shown that, in most situations, the proposed approach yields RTF estimates that are better than those of conventional estimators in terms of the capability to cancel the target signal. The crucial parameter to select is the percentage. The optimum percentage depends on many circumstances and is hard to predict. Nevertheless, the experiments with a fixed percentage have shown that the performance of the method is not overly sensitive to this parameter. The performance gain remains positive when a reasonable percentage is chosen, e.g., based on prior experience.
The proposed method is flexible and leaves room for future modifications and improvements, some of which we list below.
The algorithms used for particular steps of the method can be replaced by novel ones, especially the conventional estimators used in the first step. These can be tailored to particular scenarios, signal mixtures or noise conditions. For example, we have demonstrated through experiments that NSFD is effective for the first step when the noise is isotropic and less dynamic than the target speech signal, while GSS can be efficient when the noise is a competing speech signal.
If some prior knowledge of SNR (or other information) is available, the selection of frequencies (the second step) could be done before or simultaneously with the RTF estimation (the first step). This could lead to computational savings, as only the incomplete RTF estimate would need to be computed.
In the method proposed here, the RTF estimate is reconstructed by searching for the sparsest representation of the incomplete RTF in the discrete time domain. Besides the fact that faster methods for solving (26) may appear in the future, the weighted convex program is far from the only way to reconstruct the RTF estimate [48]. For example, it is possible to reconstruct the RTF in an oversampled discrete time domain or in the continuous time domain; see [49, 50].
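As an illustration of this reconstruction step, the following sketch recovers a sparse impulse response from RTF samples known only at a subset of DFT bins. It uses plain ISTA for the unweighted l1-regularized least-squares problem; it is not the paper's weighted SpaRSA variant, and the function name and parameter values are illustrative assumptions.

```python
import numpy as np

def reconstruct_rir(H_meas, bins, n_taps, lam=0.5, n_iter=2000):
    """Recover a sparse (relative) impulse response h from RTF samples
    H_meas that are known only at the DFT bins listed in `bins`.

    Solves min_h 0.5*||A h - H_meas||^2 + lam*||h||_1 by ISTA, where A
    consists of the selected rows of the n_taps-point DFT matrix.
    """
    A = np.fft.fft(np.eye(n_taps))[bins, :]   # partial DFT matrix
    L = np.linalg.norm(A, 2) ** 2             # Lipschitz constant of the gradient
    h = np.zeros(n_taps)
    for _ in range(n_iter):
        grad = (A.conj().T @ (A @ h - H_meas)).real  # h is real-valued
        z = h - grad / L
        h = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return h
```

A weighted variant would simply scale the threshold per tap; the complete RTF estimate is then obtained as the DFT of the recovered response.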
Online or batchonline implementations of the proposed methods can be the subject of future developments. For each part, it is possible to select an appropriate online or adaptive method to solve the corresponding task.
References
 [1] P. C. Loizou, Speech Enhancement: Theory and Practice, CRC Press, 2007.
 [2] I. Tashev, Sound Capture and Processing: Practical Approaches, John Wiley & Sons Ltd., 2009.
 [3] S. Gannot and I. Cohen, “Springer Handbook of Speech Processing and Speech Communication,” ch. “Adaptive beamforming and postfiltering,” New York: SpringerVerlag, 2007.
 [4] J. Benesty, S. Makino, and J. Chen (Eds.), Speech Enhancement, 1st edition, SpringerVerlag, Heidelberg, 2005.
 [5] L. Griffiths and C. Jim, “An alternative approach to linearly constrained adaptive beamforming,” IEEE Trans. Antennas Propag., vol. 30, no. 1, pp. 27–34, Jan. 1982.
 [6] S. Gannot, D. Burshtein, and E. Weinstein, “Signal enhancement using beamforming and nonstationarity with applications to speech,” IEEE Trans. on Signal Processing, vol. 49, no. 8, pp. 1614–1626, Aug. 2001.
 [7] S. Affes, Y. Grenier, “A signal subspace tracking algorithm for microphone array processing of speech,” IEEE Transactions on Speech and Audio Processing, vol. 5, no. 5, pp. 425–437, Sept. 1997.
 [8] A. Krueger, E. Warsitz, and R. Haeb-Umbach, “Speech enhancement with a GSC-like structure employing eigenvector-based transfer function ratios estimation,” IEEE Trans. Audio, Speech, Language Process., vol. 19, no. 1, pp. 206–219, Jan. 2011.
 [9] S. Doclo and M. Moonen, “GSVDbased optimal filtering for single and multimicrophone speech enhancement,” IEEE Trans. Signal Process., vol. 50, no. 9, pp. 2230–2244, Sep. 2002.
 [10] S. Markovich, S. Gannot and I. Cohen, “Multichannel Eigenspace Beamforming in a Reverberant Noisy Environment with Multiple Interfering Speech Signals,” IEEE Transactions on Audio, Speech and Language Processing, vol. 17, no. 6, pp. 1071–1086, Aug. 2009.
 [11] K. Yen and Y. Zhao, “Adaptive Co-Channel Speech Separation and Recognition,” IEEE Transactions on Speech and Audio Processing, vol. 7, no. 2, pp. 138–151, March 1999.
 [12] J.-F. Cardoso, “Blind signal separation: statistical principles,” Proceedings of the IEEE, vol. 90, no. 8, pp. 2009–2026, October 1998.
 [13] F. Nesta and M. Omologo, “Convolutive underdetermined source separation through weighted interleaved ICA and spatio-temporal source correlation,” Proc. of The 10th International Conference on Latent Variable Analysis and Source Separation (LVA/ICA 2012), pp. 222–230, Tel-Aviv, Israel, March 12–15, 2012.
 [14] K. Reindl, S. Markovich-Golan, H. Barfuss, S. Gannot, W. Kellermann, “Geometrically Constrained TRINICON-based relative transfer function estimation in underdetermined scenarios,” IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–4, 2013.
 [15] Z. Koldovský, P. Tichavský, D. Botka, “Noise Reduction in Dual-Microphone Mobile Phones Using A Bank of Pre-Measured Target-Cancellation Filters,” Proc. of ICASSP 2013, pp. 679–683, Vancouver, Canada, May 2013.
 [16] Z. Koldovský, J. Málek, P. Tichavský, and F. Nesta, “Semi-blind Noise Extraction Using Partially Known Position of the Target Source,” IEEE Trans. on Speech, Audio and Language Processing, vol. 21, no. 10, pp. 2029–2041, Oct. 2013.
 [17] R. Talmon and S. Gannot, “Relative transfer function identification on manifolds for supervised GSC beamformers,” in Proc. of the 21st European Signal Processing Conference (EUSIPCO), Marrakech, Morocco, Sep. 2013.
 [18] Y. Lin, J. Chen, Y. Kim and D. Lee, “Blind channel identification for speech dereverberation using l1-norm sparse learning,” Advances in Neural Information Processing Systems 20, Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems, MIT Press, Vancouver, British Columbia, Canada, December 3–6, 2007.
 [19] M. Yu, W. Ma, J. Xin, S. Osher, “Multichannel l1 Regularized Convex Speech Enhancement Model and Fast Computation by the Split Bregman Method,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 2, pp. 661–675, Feb. 2012.
 [20] J. Málek and Z. Koldovský, “Sparse Target Cancellation Filters with Application to Semi-Blind Noise Extraction,” Proc. of the 41st IEEE International Conference on Audio, Speech, and Signal Processing (ICASSP 2014), Florence, Italy, pp. 2109–2113, May 2014.
 [21] A. Benichoux, L. S. R. Simon, E. Vincent and R. Gribonval, “Convex Regularizations for the Simultaneous Recording of Room Impulse Responses,” IEEE Transactions on Signal Processing, vol. 62, no. 8, pp. 1976–1986, April 2014.
 [22] D. L. Donoho, “Compressed sensing,” IEEE Transactions on Information Theory, vol. 52, no. 4, pp. 1289–1306, April 2006.
 [23] O. Hoshuyama, A. Sugiyama, A. Hirano, “A robust adaptive beamformer for microphone arrays with a blocking matrix using constrained adaptive filters,” IEEE Transactions on Signal Processing, vol. 47, no. 10, pp. 2677–2684, Oct. 1999.
 [24] Y. Takahashi, T. Takatani, K. Osako, H. Saruwatari, K. Shikano, “Blind Spatial Subtraction Array for Speech Enhancement in Noisy Environment,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 4, pp. 650–664, May 2009.
 [25] K. Reindl, Y. Zheng, A. Schwarz, S. Meier, R. Maas, A. Sehr, W. Kellermann “A Stereophonic Acoustic Signal Extraction Scheme for Noisy and Reverberant Environments,” Computer Speech and Language, vol. 27, no. 3, pp. 726–745, May 2012.
 [26] E. Habets and S. Gannot, “Dual-microphone speech dereverberation using a reference signal,” IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Honolulu, Hawaii, USA, vol. 4, pp. 901–904, Apr. 2007.
 [27] N. Levinson, “The Wiener RMS error criterion in filter design and prediction,” J. Math. Phys., vol. 25, pp. 261–278, 1947.
 [28] L. Tong, G. Xu, and T. Kailath, “Blind identification and equalization based on second-order statistics: A time domain approach,” IEEE Trans. Information Theory, vol. 40, no. 2, pp. 340–349, 1994.
 [29] O. Shalvi and E. Weinstein, “System identification using nonstationary signals,” IEEE Trans. Signal Processing, vol. 44, no. 8, pp. 2055–2063, Aug. 1996.
 [30] L. C. Parra and Ch. V. Alvino, “Geometric Source Separation: Merging Convolutive Source Separation With Geometric Beamforming,” IEEE Trans. on Signal Processing, vol. 10, no. 6, Sept. 2002.
 [31] H. Saruwatari, T. Kawamura, T. Nishikawa, A. Lee, and K. Shikano, “Blind source separation based on a fastconvergence algorithm combining ICA and beamforming,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no.2, pp. 666–678, March 2006.
 [32] E. J. Candès and T. Tao, “Near-Optimal Signal Recovery From Random Projections: Universal Encoding Strategies?,” IEEE Transactions on Information Theory, vol. 52, no. 12, pp. 5406–5425, Dec. 2006.
 [33] E. J. Candès and T. Tao, “Decoding by linear programming,” IEEE Transactions on Information Theory, vol. 51, no. 12, pp. 4203–4215, Dec. 2005.
 [34] M. Rudelson and R. Vershynin, “On sparse reconstruction from Fourier and Gaussian measurements,” Communications on Pure and Applied Mathematics, vol. 61, no. 8, pp. 1025–1045, August 2008.
 [35] J. A. Tropp, “Greed is good: Algorithmic results for sparse approximation,” IEEE Trans. Inf. Theory, vol. 50, no. 10, pp. 2231–2242, Oct. 2004.
 [36] D. Needell and J. A. Tropp, “CoSaMP: Iterative signal recovery from incomplete and inaccurate samples,” Applied and Computational Harmonic Analysis, vol. 26, no. 3, pp. 301–321, May 2009.
 [37] S. S. Chen, D. L. Donoho, M. A. Saunders, “Atomic Decomposition by Basis Pursuit,” SIAM Journal on Scientific Computing, vol. 20, no. 1, pp. 33–61, 1999.
 [38] E. van den Berg and M. P. Friedlander, “Probing the Pareto frontier for basis pursuit solutions,” SIAM J. on Scientific Computing, vol. 31, no. 2, pp.890–912, Nov. 2008.
 [39] D. L. Donoho, Y. Tsaig, “Fast Solution of l1-Norm Minimization Problems When the Solution May Be Sparse,” IEEE Transactions on Information Theory, vol. 54, no. 11, pp. 4789–4812, 2008.
 [40] S. J. Wright, R. D. Nowak, M. A. T. Figueiredo, “Sparse Reconstruction by Separable Approximation,” IEEE Transactions on Signal Processing, vol. 57, no. 7, pp. 2479–2493, July 2009.
 [41] P. Combettes and V. Wajs, “Signal recovery by proximal forward-backward splitting,” SIAM J. Multiscale Model. Simul., vol. 4, no. 4, pp. 1168–1200, 2005.
 [42] N. Parikh and S. Boyd, “Proximal Algorithms,” Foundations and Trends in Optimization, vol. 1, no. 3, pp. 123–231, Nov. 2013.
 [43] M. S. Asif and J. Romberg, “Fast and accurate algorithms for reweighted L1norm minimization,” submitted to IEEE Trans. on Signal Process., arXiv:1208.0651, July 2012.
 [44] E. Nemer, R. Goubran, and S. Mahmoud, “Robust voice activity detection using higher-order statistics in the LPC residual domain,” IEEE Transactions on Speech and Audio Processing, vol. 9, no. 3, pp. 217–231, March 2001.
 [45] S. Javidi, D. P. Mandic, C. C. Took, and A. Cichocki, “Kurtosis-based blind source extraction of complex non-circular signals with application in EEG artifact removal in real-time,” Frontiers in Neuroscience, vol. 5, no. 105, 2011.
 [46] E. Hadad, F. Heese, P. Vary, and S. Gannot, “Multichannel audio database in various acoustic environments,” International Workshop on Acoustic Signal Enhancement 2014 (IWAENC 2014), Antibes, France, Sept. 2014.
 [47] N. Ono, Z. Koldovský, S. Miyabe, N. Ito, “The 2013 Signal Separation Evaluation Campaign,” Proc. of IEEE International Workshop on Machine Learning for Signal Processing, Southampton, UK, Sept. 2013.
 [48] V. Chandrasekaran, B. Recht, P. Parrilo, and A. Willsky, “The Convex Geometry of Linear Inverse Problems,” Foundations of Computational Mathematics, vol. 12, no. 6, pp. 805–849, 2012.
 [49] B. N. Bhaskar, T. Gongguo, and B. Recht, “Atomic Norm Denoising With Applications to Line Spectral Estimation,” IEEE Transactions on Signal Processing, vol. 61, no. 23, pp. 5987–5999, Dec. 2013.
 [50] Z. Koldovský and P. Tichavský, “Sparse Reconstruction of Incomplete Relative Transfer Function: Discrete and Continuous Time Domain,” submitted to a special session of EUSIPCO 2015, Feb. 2015.
Zbyněk Koldovský (S’03–M’04) was born in Jablonec nad Nisou, Czech Republic, in 1979. He received the M.S. and Ph.D. degrees in mathematical modeling from the Faculty of Nuclear Sciences and Physical Engineering at the Czech Technical University in Prague in 2002 and 2006, respectively.
He is currently an associate professor at the Institute of Information Technology and Electronics, Technical University of Liberec. He has also been with the Institute of Information Theory and Automation of the Academy of Sciences of the Czech Republic since 2002. His main research interests are focused on audio signal processing, blind source separation, statistical signal processing, compressed sensing, and multilinear algebra.
Dr. Koldovský serves as a reviewer for several journals, such as the IEEE Transactions on Audio, Speech, and Language Processing, the IEEE Transactions on Signal Processing, and Elsevier Signal Processing, and for several conferences and workshops in the field of (acoustic) signal processing. He served as a co-chair of the fourth community-based Signal Separation Evaluation Campaign (SiSEC 2013) and as the general co-chair of the twelfth International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA 2015).
Jiří Málek received his master's and Ph.D. degrees from the Technical University of Liberec (TUL, Czech Republic) in 2006 and 2011, respectively, both in technical cybernetics. He currently holds a postdoctoral position at the Institute of Information Technology and Electronics, TUL. His research interests include blind source separation and speech enhancement.
Sharon Gannot (S’92–M’01–SM’06) received his B.Sc. degree (summa cum laude) from the Technion – Israel Institute of Technology, Haifa, Israel, in 1986 and the M.Sc. (cum laude) and Ph.D. degrees from Tel-Aviv University, Israel, in 1995 and 2000, respectively, all in electrical engineering. In 2001 he held a post-doctoral position at the Department of Electrical Engineering (ESAT-SISTA) at K.U. Leuven, Belgium. From 2002 to 2003 he held a research and teaching position at the Faculty of Electrical Engineering, Technion – Israel Institute of Technology, Haifa, Israel. Currently, he is an Associate Professor at the Faculty of Engineering, Bar-Ilan University, Israel, where he heads the Speech and Signal Processing Laboratory. Prof. Gannot is the recipient of the Bar-Ilan University outstanding lecturer award for 2010 and 2014.
Prof. Gannot served as an Associate Editor of the EURASIP Journal on Advances in Signal Processing in 2003–2012 and as an Editor of two special issues on multi-microphone speech processing of the same journal. He has also served as a guest editor of the Elsevier Speech Communication and Signal Processing journals. Prof. Gannot served as an Associate Editor of the IEEE Transactions on Audio, Speech, and Language Processing in 2009–2013. Currently, he is a Senior Area Chair of the same journal. He also serves as a reviewer for many IEEE journals and conferences. Prof. Gannot has been a member of the Audio and Acoustic Signal Processing (AASP) technical committee of the IEEE since January 2010. He has also been a member of the Technical and Steering Committee of the International Workshop on Acoustic Signal Enhancement (IWAENC) since 2005 and was the general co-chair of IWAENC held in Tel-Aviv, Israel, in August 2010. Prof. Gannot served as the general co-chair of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) in October 2013. He was selected (with colleagues) to present tutorial sessions at ICASSP 2012, EUSIPCO 2012, ICASSP 2013 and EUSIPCO 2013. Prof. Gannot's research interests include multi-microphone speech processing, specifically distributed algorithms for ad hoc microphone arrays for noise reduction and speaker separation; dereverberation; single-microphone speech enhancement; and speaker localization and tracking.