Listening for Sirens: Locating and Classifying Acoustic Alarms in City Scenes.
This paper is about alerting acoustic event detection and sound source localisation in an urban scenario. Specifically, we are interested in spotting the presence of horns, and sirens of emergency vehicles. In order to obtain a reliable system able to operate robustly despite the presence of traffic noise, which can be copious, unstructured and unpredictable, we propose to treat the spectrograms of incoming stereo signals as images, and apply semantic segmentation, based on a Unet architecture, to extract the target sound from the background noise. In a multi-task learning scheme, together with signal denoising, we perform acoustic event classification to identify the nature of the alerting sound. Lastly, we use the denoised signals to localise the acoustic source on the horizon plane, by regressing the direction of arrival of the sound through a CNN architecture. Our experimental evaluation shows an average classification rate of , and a median absolute error on the localisation of when operating on audio frames of , and of when operating on frames of . The system offers excellent performance in particularly challenging scenarios, where the noise level is remarkably high.
Autonomous vehicles are largely deaf: they typically make little, if any, use of auditory inputs, and this paper starts to address that shortcoming. Here we approach the problem of using auditory perception in intelligent transportation to spot the presence of, and localise, “alerting events” which carry crucial information to enable safe navigation of urban areas, and which, in some cases (e.g. a car honking), could not be perceived by other sensing means. Specifically, we aim to detect and recognise anomalous sounds, such as car horns and sirens of emergency vehicles, and localise the respective acoustic sources. Autonomous vehicles would clearly benefit from the ability to identify and interpret those signals. An emergency vehicle approaching an intersection could be detected long before it reaches the crossing point and despite occlusions. Advance information of this kind would considerably increase the time available for a safe response from the driver and, for a smart vehicle working in a semi-autonomous regime, could also be used to trigger manual intervention. Furthermore, people with hearing impairments are potentially more prone to accidents which could be avoided if these acoustic cues could be perceived.
One of the greatest challenges in the identification of auditory events lies in the copious and unstructured traffic noise which characterises the acoustic urban scene, and, against which, filtering techniques, traditionally used in signal processing literature, struggle to perform well. In one of our previous works  we introduced the idea of treating tempo-spectral representations of the incoming audio signals (e.g. spectrograms) as images, applying segmentation techniques for signal denoising. Identification of the various sound types was, then, performed on the filtered clean signals, showing a high level of accuracy, and proving the efficacy of the segmentation as a denoising method.
Building on the analysis and the results of , we extend that contribution in two directions. Firstly, rather than relying on the use of K-means to perform image segmentation, we now employ deep learning, in the form of a Unet  architecture in a multi-task learning scheme, to simultaneously identify the nature of the acoustic event, and extract the corresponding target signal from the background noise. The conspicuous advantage that this kind of approach offers over other filtering techniques is its flexibility. Traditional signal processing methods need to estimate the behaviour and the characteristics of the background noise to be able to discard it, which might be extremely challenging in urban scenarios, where the traffic noise is of a varied nature, does not present a clear structure, and the geometry of the sound sources is unknown and unpredictable. The use of the Unet makes noise modelling unnecessary, as it attempts to retrieve the target signals directly. Furthermore, this method overcomes the limitations and constraints of our previous approach, where the segmentation procedure was purely based on the energy characterising the stimuli in the audio mixture, and assumptions on their relationships were required. We are now able to robustly address scenarios where the noise is particularly powerful compared to the target signal, and where our previous method had major difficulties in recovering the shape of the sound of interest. Additionally, thanks to the multi-task learning scheme, the segmentation, and the consequent signal extraction, are now tailored to the class of the signal analysed.
Secondly, we utilise a Convolutional Neural Network (CNN) architecture to localise the acoustic source. Specifically, from a stereo combination (i.e. as perceived by two different microphones, separated in space) of the recovered target signals, we estimate the direction of arrival (DoA) of the sound on the horizon plane, also known as Horizontal Localisation (HL). Horizontal sound source localisation techniques rely on the analysis of two main auditory cues to establish differences between the two signals in the stereo combination . Those differences are expressed as the Interaural Time Difference (ITD), and the Interaural Level Difference (ILD). The former refers to the difference in the time necessary for the acoustic wave to reach the two channels. The latter refers to the difference in the intensity of the signals in the two channels. For this kind of analysis to be successful, the sound should not be corrupted by noise. Intuitively, the cleaner the two signals are, the more accurate the resulting localisation will be; which is why spectrogram segmentation plays such a crucial role in this context. Nevertheless, small inaccuracies in the segmentation can lead to further inaccuracies in the estimation of the ITD and the ILD, which, in turn, can lead to important errors in the computation of the sound source position. In order to avoid this risk, rather than recovering the DoA of the sound from the extracted signals, we opt for learning a direct, and more robust to interference, mapping between those and the location of the sound source, through the use of CNNs.
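As an illustration of the two auditory cues described above, the following minimal sketch estimates an ITD from the lag that maximises the cross-correlation of the two channels, and an ILD from the ratio of their energies. It is not the method proposed in this paper, which regresses the DoA with a CNN instead; the function name `estimate_itd_ild`, the sampling rate and the toy signal are assumptions made for illustration only.

```python
import math


def estimate_itd_ild(left, right, sample_rate, max_lag):
    """Return (itd_seconds, ild_db) for two equal-length channels.

    ITD: lag (in seconds) maximising the cross-correlation; positive
    means the right channel lags the left one.
    ILD: ratio of channel energies in decibels (left over right).
    """
    n = len(left)
    best_lag, best_score = 0, -math.inf
    for lag in range(-max_lag, max_lag + 1):
        # Correlate left[i] with right[i + lag] over the valid overlap.
        score = sum(
            left[i] * right[i + lag]
            for i in range(max(0, -lag), min(n, n - lag))
        )
        if score > best_score:
            best_lag, best_score = lag, score
    energy_left = sum(v * v for v in left)
    energy_right = sum(v * v for v in right)
    ild_db = 10.0 * math.log10(energy_left / energy_right)
    return best_lag / sample_rate, ild_db


# Toy example: a 500 Hz tone that reaches the right microphone
# 5 samples later and at half the amplitude.
fs = 16000  # hypothetical sampling rate, not the one used in the paper
delay = 5
left = [math.sin(2 * math.pi * 500 * t / fs) for t in range(400)]
right = [0.0] * delay + [0.5 * v for v in left[:-delay]]
itd, ild = estimate_itd_ild(left, right, fs, max_lag=10)
```

With noisy inputs the correlation peak and the energy ratio both become unreliable, which is the motivation given above for denoising first and then learning the mapping directly.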
II Related Work
The literature reports few attempts to detect sirens and, more generally, alerting urban sounds . All those attempts follow one of two main strategies to spot the presence of the sound of interest in the acoustic scene: they either model the characteristics of the background noise , or those of the target signal . In the case of the former, adaptive filtering techniques (e.g. ) are applied. In the case of the latter, while  employs peak searching to detect the pitch frequency of the siren in the background noise, in  and  sirens are detected through more traditional machine learning paradigms. Our work is closer in spirit to , as it aims to learn the characteristics of the sound of interest directly, and independently of the nature and the features of all maskers potentially present. Yet, with respect to those, we are able to successfully address extremely challenging scenarios characterised by a remarkably low Signal-to-Noise Ratio (SNR) (). In previous studies, such scenarios are either not directly examined  or lead to a substantial degradation in the classification performance , even at a relatively high SNR ().
Semantic segmentation has been widely investigated both in robotics and computer vision literature. Recent advancements in deep learning , especially, have brought tremendous improvements in the performance of segmentation algorithms in a wide range of application domains. Audio analysis and understanding, on the other hand, has only partially benefited from such improvements in image processing. Indeed, the idea of exploiting image processing techniques operating on tempo-spectral representations of acoustic signals (e.g. spectrograms) is still in its infancy, with only a few works (e.g.  and ) exploring this path for speaker and music classification purposes. In , we investigated the possibility of utilising image segmentation as a noise cancelling method, obtaining promising results. We now enhance that approach, which relied on the use of K-means , by employing a more powerful segmentation method, based on a Unet architecture . As already discussed in Section I, this allows us to extract the sound of interest from a noisy background, even when this noise is extremely powerful, down to , where  was not able to work properly with SNRs lower than .
Sound source localisation in robotics has been mainly concerned with human-robot interaction applications. Recently, more traditional geometry-based methods  have been replaced by deep learning techniques , which employ cross-correlation information to model the sound source location. This work is close in spirit to both these studies. Yet, with respect to those, which focus on indoor environments and multi-speaker localisation, where the noise is either low () or structured in the form of competing speakers, we analyse and are able to cope with outdoor scenarios characterised by the presence of a variety of unknown noise sources of different natures, and where the SNR can be remarkably low.
III Technical Approach
In this paper, we employ two different learning schemes to detect the acoustic events and localise the respective sound sources. A full representation of the framework is provided in Fig. 1. The network we use to perform segmentation and event classification is reported in the blue area of the figure (cf. SeC), while the one used to regress the direction of arrival of the sound is reported in the pink area (cf. SL). The acoustic events we are interested in analysing are sirens of emergency vehicles and horns. Specifically, we consider three types of siren: “yelp”, “wail” and “hi-low”. A description of the features employed is presented in Section III-A, while a more detailed illustration of the deep architectures utilised is given in Sections III-B and III-C.
III-A Feature Representation
Traditionally, audio classification tasks have been approached relying on the use of Mel-Frequency Cepstral Coefficients (MFCCs) . Recent works , demonstrated that MFCCs do not provide an acoustic signature which is robust to interference, leading to a deterioration in the classification performance when operating in noisy scenarios. Given the potentially high level of noise we might encounter in traffic scenes, we here choose a different signal representation, based on the use of gammatone filterbanks , which were originally introduced in  as an approximation to the human cochlear frequency response and which, as such, can be used to generate a tempo-spectral representation of audio signals, where the frequency range is parsed in a human-like fashion. The sounds we are interested in spotting (i.e. horns and sirens) are explicitly designed to be heard by humans, even in the presence of conspicuous traffic noise. Exploiting features that mimic human auditory perception can, then, be particularly convenient, as they provide an additional pre-filtering of the signals. The impulse response of a gammatone filter centred at frequency is:
where indicates the order of the filter, and is the bandwidth. The bandwidth increases as the centre frequency increases, generating narrower filters at low frequencies and broader filters at high frequencies. Following , we utilise fourth-order filters (i.e. ), and approximate as:
The centre frequencies are selected by applying the Equivalent Rectangular Bandwidth (ERB) scale . Let be the original waveform audio signal, the output response of a filter characterised by the centre frequency can, then, be computed as:
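For reference, the gammatone formulation commonly found in the literature, and consistent with the definitions used in this subsection, can be written as below. The symbol names ($g$, $f_c$, $N$, $b$, $\phi$, $x$, $y$) and the numerical constants (the factor 1.019 and the ERB parameters of Glasberg and Moore) are assumptions taken from that common form, not values confirmed by this text.

```latex
% Impulse response of a gammatone filter of order N centred at f_c
g_{f_c}(t) = t^{N-1} \, e^{-2\pi b(f_c)\, t} \cos(2\pi f_c t + \phi), \qquad t \ge 0

% ERB-based bandwidth approximation, typically used with N = 4
b(f_c) = 1.019 \, \mathrm{ERB}(f_c), \qquad
\mathrm{ERB}(f_c) = 24.7 \left( \frac{4.37 \, f_c}{1000} + 1 \right)

% Output of the filter for an input waveform x(t) (convolution)
y_{f_c}(t) = x(t) * g_{f_c}(t)
```

Since $\mathrm{ERB}(f_c)$ grows linearly with $f_c$, the bandwidth $b(f_c)$ is small at low centre frequencies and large at high ones, which matches the narrow-to-broad behaviour described above.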
Extending the filtering to the entire bank, across overlapping time frames, we obtain a gammatone-like spectrogram, also known as gammatonegram. The gammatonegrams of a stereo combination (corresponding to two different receivers, i.e. two different channels) of the original signals are computed and used in the semantic segmentation and event classification network, as well as in the acoustic source localisation one. Examples of gammatonegrams are provided in Fig. 2.
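The construction of a gammatonegram can be sketched as follows: ERB-spaced centre frequencies, one gammatone impulse response per channel, filtering by convolution, and log-energy per time frame. This is a minimal, unoptimised sketch under stated assumptions, not the implementation used in the paper: the ERB constants, the unit-peak normalisation, the rectangular (unwindowed) framing, the sampling rate, the frequency range and the number of channels are all hypothetical choices.

```python
import math


def erb(fc):
    """Equivalent Rectangular Bandwidth in Hz (Glasberg and Moore form)."""
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)


def erb_spaced_centres(f_low, f_high, n_channels):
    """Centre frequencies equally spaced on the ERB-rate scale."""
    def to_erb_rate(f):
        return 21.4 * math.log10(4.37 * f / 1000.0 + 1.0)

    def from_erb_rate(e):
        return (10.0 ** (e / 21.4) - 1.0) * 1000.0 / 4.37

    lo, hi = to_erb_rate(f_low), to_erb_rate(f_high)
    step = (hi - lo) / (n_channels - 1)
    return [from_erb_rate(lo + k * step) for k in range(n_channels)]


def gammatone_ir(fc, fs, duration=0.025, order=4):
    """Sampled gammatone impulse response, scaled to unit peak amplitude."""
    b = 1.019 * erb(fc)
    ir = []
    for k in range(int(duration * fs)):
        t = k / fs
        ir.append(t ** (order - 1) * math.exp(-2 * math.pi * b * t)
                  * math.cos(2 * math.pi * fc * t))
    peak = max(abs(v) for v in ir)
    return [v / peak for v in ir]


def gammatonegram(signal, fs, centres, frame_len, hop):
    """Rows: frequency channels; columns: log-energy (dB) per time frame."""
    rows = []
    for fc in centres:
        ir = gammatone_ir(fc, fs)
        # Direct-form convolution, truncated to the input length.
        out = [
            sum(ir[j] * signal[i - j] for j in range(min(len(ir), i + 1)))
            for i in range(len(signal))
        ]
        row = []
        for start in range(0, len(out) - frame_len + 1, hop):
            energy = sum(v * v for v in out[start:start + frame_len])
            row.append(10.0 * math.log10(energy + 1e-12))
        rows.append(row)
    return rows


fs = 8000  # hypothetical sampling rate
tone = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(800)]
centres = erb_spaced_centres(100.0, 3500.0, 8)
gram = gammatonegram(tone, fs, centres, frame_len=200, hop=100)
```

For a stereo recording, the same function would be applied to each channel separately, giving the two gammatonegrams that the networks consume.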
III-B Acoustic Event Classification and Signal Denoising
We perform acoustic event classification and signal denoising utilising a multi-task learning (MTL) scheme. Multi-task learning has been successfully employed following various implementation strategies, and in several application domains, such as language processing , and traffic flow forecasting . By taking advantage of information in training samples of related tasks, it has proved to be a valuable tool to reduce overfitting and, consequently, improve models’ generalisation capabilities. In this work, we opt for hard parameter sharing, which was first introduced by , having the tasks directly share some of the architecture.
We implement noise removal by treating the gammatonegrams of the incoming sound as intensity images, and feeding them to a Unet  to carry out semantic segmentation. Specifically, we rely on an architecture similar to the one defined in . As the result of the segmentation will later be used to localise the sound source, we use as input to the network a concatenation of the gammatonegrams of the two signals in the stereo combination, rather than analysing one channel at a time. We make this choice as we believe it will help capture inter-channel information, and allow us to obtain a more stereo-aware segmentation. The Unet can be seen as an autoencoder, relying on two main processing phases: encoding, followed by decoding. The encoding generates a more compact representation of the input (i.e. the code), while the decoding attempts to reconstruct the original input, filtered depending on the specific task. In this case, the output of the decoding step will be a segmented version of the original gammatonegrams. In order to perform acoustic event classification, the network is augmented by fully connected layers, which, operating on the code generated in the encoding step, assign the input signal to one of three classes of interest: siren of an emergency vehicle, horn, and any other kind of traffic noise. The MTL scheme aims to simultaneously assign a label (i.e. siren, horn, other sound) to each pixel (corresponding to a time-frequency slot of the gammatonegrams) of the input image (i.e. segmentation), and to the entire image (i.e. event classification).
The complete structure of the network is reported in Fig. 1. The encoding part of the network (i.e. left side of the network) consists of three layers, where each layer applies two successive convolutions, followed by a max pooling operation with stride 2 for downsampling. The first layer is characterised by feature channels, which double at each layer. The decoding part of the network (i.e. right side of the network), instead, presents an upsampling step, which halves the number of feature channels, followed by the application of two successive convolutions. After each upsampling operation, the resulting feature map is concatenated with the respective one from the left side of the network. All convolutions are followed by an Exponential Linear Unit (ELU) . A final convolution is applied to assign a label to each pixel. Acoustic event classification is obtained by adding two fully connected layers at the bottom of the Unet (cf. yellow area in Fig. 1). We train the multi-task learning architecture by minimising the loss , defined as:
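To make the encoder and decoder bookkeeping concrete, the sketch below traces feature-map shapes through three encoding layers and the mirrored decoding path. It is only a shape calculator, not a network; the helper name `unet_shapes`, the "same" padding, the input size, the base channel count of 16, and the doubled channel count at the code are assumptions chosen for illustration.

```python
def unet_shapes(height, width, in_channels, base_channels, depth=3):
    """Trace (height, width, channels) through encoder and decoder.

    Assumes 'same'-padded convolutions, so only pooling and upsampling
    change the spatial size; height and width must be divisible by
    2 ** depth.
    """
    shapes = [("input", (height, width, in_channels))]
    h, w, c = height, width, base_channels
    skips = []
    for level in range(depth):
        shapes.append((f"enc{level + 1}_conv", (h, w, c)))
        skips.append((h, w, c))
        h, w = h // 2, w // 2           # max pooling with stride 2
        shapes.append((f"enc{level + 1}_pool", (h, w, c)))
        c *= 2                          # channels double at each layer
    # Bottleneck convolutions (assumed) produce the code.
    shapes.append(("code", (h, w, c)))
    for level in reversed(range(depth)):
        h, w, c = h * 2, w * 2, c // 2  # upsampling halves the channels
        skip_c = skips[level][2]
        # Concatenate with the matching encoder feature map.
        shapes.append((f"dec{level + 1}_concat", (h, w, c + skip_c)))
        shapes.append((f"dec{level + 1}_conv", (h, w, c)))
    return shapes


# Two input channels: the concatenated gammatonegrams of a stereo pair.
trace = dict(unet_shapes(64, 64, in_channels=2, base_channels=16))
```

The fully connected classification layers described above would operate on the tensor labelled `code` in this trace.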
where and refer to the loss related to the segmentation and classification tasks, respectively, and are computed applying a soft-max combined with a cross-entropy loss function. In the case of , the soft-max is applied pixel-wise over the final feature map, and defined as:
where indicates the activation in feature map at pixel , and is the number of classes. Training is performed by minimising the loss with regularisation, using back-propagation.
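The loss described above can be sketched in a few lines: a soft-max followed by cross-entropy is applied per pixel for the segmentation task and once per image for the classification task, and the two terms are added. The pixel-wise soft-max is assumed here to take the usual form $p_k(x) = \exp(a_k(x)) / \sum_{k'=1}^{K} \exp(a_{k'}(x))$. The unweighted sum, the averaging over pixels and the omission of the regularisation term are further assumptions of this sketch.

```python
import math


def softmax(logits):
    """Numerically stable soft-max over a list of activations."""
    m = max(logits)
    exps = [math.exp(a - m) for a in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Cross-entropy of the soft-max output against a class index."""
    return -math.log(softmax(logits)[target])


def multitask_loss(pixel_logits, pixel_labels, image_logits, image_label):
    """Unweighted sum of segmentation and classification losses.

    pixel_logits: one list of K activations per pixel.
    pixel_labels: one class index per pixel.
    """
    seg = sum(
        cross_entropy(a, y) for a, y in zip(pixel_logits, pixel_labels)
    ) / len(pixel_labels)
    cls = cross_entropy(image_logits, image_label)
    return seg + cls


# Toy example with K = 3 classes (siren, horn, other) and 4 pixels.
# Uniform activations give every class probability 1/3.
uniform = [0.0, 0.0, 0.0]
loss = multitask_loss([uniform] * 4, [0, 1, 2, 0], uniform, 1)
```

Because both terms are back-propagated through the shared encoder, the segmentation is shaped by the class of the signal, which is the effect attributed to the multi-task scheme in Section I.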
III-C Sound Source Localisation
In this work, we are interested in horizontal acoustic source localisation, relying on a stereo composition of the sound (i.e. as perceived by two different, spatially separated, microphones). Specifically, we are interested in learning a direct mapping between the clean gammatonegrams of the stereo signal and the direction of arrival of the sound. Once the segmentation has been performed, as described in Section III-B, the gammatonegrams of the two target signals are recovered by applying the output of the segmentation as a mask on the original gammatonegrams. The cross-correlation between the resulting clean gammatonegrams, the cross-gammatonegram, is then used as input to the CNN to regress the DoA. If and are the noisy gammatonegrams for the first and second channel of the stereo signal, and and the respective segmented images, the input to the network is given by:
where and denotes the complex conjugate. The network consists of two convolutional layers, followed by a max pool, and two fully connected layers. All layers are equipped with an ELU. We employ the network to regress the direction of arrival of the sound as the respective angle on the horizon plane, and define our loss function as , where and are the ground truth values, and the predictions of the network, respectively. Training is performed by minimising the Euclidean loss with regularisation, using back-propagation.
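One plausible reading of the masking and cross-correlation steps is sketched below: the segmentation output is applied as a binary mask to each channel, and the two masked gammatonegrams are cross-correlated per frequency channel over a range of time-frame lags. The exact operator used in the paper is not recoverable from the text, so the function names, the binary mask and the per-channel lag correlation are assumptions. For real-valued gammatonegrams the complex conjugate is a no-op and is therefore omitted.

```python
def apply_mask(gram, mask):
    """Keep time-frequency cells labelled as target (mask == 1)."""
    return [
        [g * m for g, m in zip(g_row, m_row)]
        for g_row, m_row in zip(gram, mask)
    ]


def cross_gammatonegram(clean_a, clean_b, max_lag):
    """Per-channel cross-correlation over time-frame lags.

    Output row f, column k holds the correlation at lag k - max_lag.
    """
    result = []
    for row_a, row_b in zip(clean_a, clean_b):
        n = len(row_a)
        corr = []
        for lag in range(-max_lag, max_lag + 1):
            corr.append(sum(
                row_a[t] * row_b[t + lag]
                for t in range(max(0, -lag), min(n, n - lag))
            ))
        result.append(corr)
    return result


# Toy stereo pair: 2 frequency channels, 6 time frames. The target
# event sits in channel 0 and reaches the second microphone one
# frame later; everything else is background.
gram_1 = [[1, 5, 1, 1, 1, 1], [2, 2, 2, 2, 2, 2]]
gram_2 = [[1, 1, 5, 1, 1, 1], [2, 2, 2, 2, 2, 2]]
mask_1 = [[0, 1, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0]]
mask_2 = [[0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 0, 0]]
cross = cross_gammatonegram(
    apply_mask(gram_1, mask_1), apply_mask(gram_2, mask_2), max_lag=2
)
```

In this toy case the only non-zero entry of `cross` lies in channel 0 at lag +1, which is the inter-channel delay the CNN would have to map to an angle.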
IV Experimental Evaluation
We evaluate our framework by analysing the performance of both networks and comparing their behaviour with that of two other architectures. We show that the particular configuration chosen in this work, while providing comparable performance in the acoustic event classification task, yields significantly better performance in the sound source localisation one. Specifically, we analyse two alternative networks:
Full-Sharing (FS): the two tasks in the MTL scheme share both the encoding and decoding side of the network. Classification does not take place at the bottom of the Unet, as in Fig. 1, but at the last decoding layer.
Mono: the multi-task learning scheme is identical to the one of the SeC network but, in this case, it does not operate on the gammatonegrams of the stereo sound; instead, the gammatonegrams of the different channels are considered as separate samples.
We remind the reader that we apply semantic segmentation to the gammatonegrams of the stereo sound as a denoising technique, which allows us to recover the target signal from the background noise. As such, we are not interested in the performance of the segmentation per se, but rather in the accuracy we obtain when regressing the DoA from the gammatonegrams which have been cleaned by the segmentation. Thus, in this context, we will focus on the performance of the SL network and the classification task of the SeC one, and discuss the performance of the semantic segmentation only through the impact this has on the following DoA estimation.
IV-A The Dataset
To evaluate the performance of our framework, we collected four hours of data by driving around Oxford, UK, on different kinds of road, and at different times of the day (i.e. different traffic conditions). The data was gathered using two Knowles omnidirectional boom microphones mounted on the roof of the car and an ALESIS IO4 audio interface. The data was recorded at a sampling frequency of at a resolution of bits. Furthermore, to obtain accurate ground truth values in the sound source localisation task against various levels of masking noise, we corrupted a stereo composition of specific target signals with the traffic noise recorded, generating samples at various SNRs. This kind of approach is commonly used in acoustics literature (e.g.  among others), especially when the impact of noise on classification and identification tasks has to be isolated and accurately quantified. Lastly, it allows us to address scenarios where no other sensors can be used to provide ground truth, as either the target sound is purely acoustic (e.g. horns), or the sound source is too far away and, thus, out of the field of view of any additional sensors potentially present. The additional data has been extracted from the Urban Sound Dataset , and from other publicly available databases, such as www.freesound.org. We are interested in clean signals, as these will represent our ground truth data. Thus, we select only samples where any background noise is either absent or can be easily removed through traditional filtering. The clean signals obtained are, then, mixed with the traffic noise recorded, simulating different directions of arrival of the sound, with the acoustic source moving at different velocities, following different paths. Frequency shifts, due to the Doppler effect, are applied accordingly. The simulation also takes into account additional propagation effects, such as echoes (i.e. delayed, less powerful copies of the original signal), and small perturbations (i.e. variation in the power of the perceived signal depending on the direction of arrival) to consider potential reflections, and different kinds of microphone response patterns. A schematic representation of the microphone configuration is given in Fig. 3. The direction of arrival of the sound, computed as the angle between the sound source and the vehicle, is denoted by . The current framework operates on a space; yet it can be easily extended to with an additional microphone. In total we generate more than samples, equally distributed among the three classes: sirens, horns, and others (i.e. any other traffic sound, which is neither a siren nor a horn). Each sample refers to one frame of .
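Mixing a clean target with recorded noise at a prescribed SNR amounts to rescaling the noise so that the ratio of the two average powers matches the target value. The sketch below illustrates that single step on one channel; the function name `mix_at_snr`, the synthetic tone, the Gaussian stand-in for traffic noise and the value of -10 dB are hypothetical, and the Doppler, echo and perturbation effects described above are not modelled.

```python
import math
import random


def mix_at_snr(target, noise, snr_db):
    """Scale the noise so that the mixture has the requested SNR (dB).

    Returns the mixture and the scaled noise; both inputs must have
    the same length.
    """
    p_target = sum(v * v for v in target) / len(target)
    p_noise = sum(v * v for v in noise) / len(noise)
    # SNR_dB = 10 * log10(p_target / (gain ** 2 * p_noise))
    gain = math.sqrt(p_target / (p_noise * 10.0 ** (snr_db / 10.0)))
    scaled = [gain * v for v in noise]
    return [t + n for t, n in zip(target, scaled)], scaled


rng = random.Random(0)
fs = 8000  # hypothetical sampling rate
target = [math.sin(2 * math.pi * 700 * n / fs) for n in range(1600)]
noise = [rng.gauss(0.0, 1.0) for _ in range(1600)]
mixture, scaled_noise = mix_at_snr(target, noise, snr_db=-10.0)

# Verify the achieved SNR from the two components of the mixture.
achieved = 10.0 * math.log10(
    sum(v * v for v in target) / sum(v * v for v in scaled_noise)
)
```

Keeping the clean target alongside the mixture is what provides the ground truth for both the segmentation mask and the direction of arrival.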
IV-B Implementation Details
We trained the networks using mini-batch gradient descent based on back-propagation, employing the Adam optimisation algorithm . We applied dropout  to each non-shared layer for both tasks’ architectures with a keeping probability of . The models were implemented using the TensorFlow  libraries. We confine our frequency analysis to a range between and , corresponding to the maximum reliable frequency resolution available, and utilise frequency channels in the gammatone filterbank. The filtering is computed on time domain frames of with overlap, after applying a Hamming window to reduce spectral leakage. Similarly to previous works on deep learning in the auditory domain (cf. , ), we randomly split our dataset into training set () and test set ().
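The framing step mentioned above (overlapping time-domain frames weighted by a Hamming window) can be sketched as follows. The frame length and overlap used in the paper are not reproduced here; the values in the example are placeholders, and the function names are hypothetical.

```python
import math


def hamming(n):
    """Hamming window of length n (n > 1)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1))
            for k in range(n)]


def windowed_frames(signal, frame_len, overlap):
    """Split a signal into overlapping, Hamming-weighted frames.

    overlap is a fraction in [0, 1) of the frame length.
    """
    hop = max(1, int(round(frame_len * (1.0 - overlap))))
    window = hamming(frame_len)
    return [
        [s * w for s, w in zip(signal[start:start + frame_len], window)]
        for start in range(0, len(signal) - frame_len + 1, hop)
    ]


# Placeholder values: 20 samples, frames of 8 samples, 50% overlap.
frames = windowed_frames([1.0] * 20, frame_len=8, overlap=0.5)
```

The window tapers each frame towards its edges, so the discontinuities introduced by cutting the signal contribute less energy to neighbouring frequency bands.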
IV-C Acoustic Event Classification
Table I reports the confusion matrix obtained by employing the SeC network. The average classification rate for all classes is shown along the diagonal of the matrix. Fig. 4 shows the classification accuracy, averaged over the three classes, as the SNR varies, when applying the SeC network and the two benchmarks: the FS and the Mono networks. Results suggest that all three architectures are able to provide high classification accuracy, despite the presence of copious noise in the original gammatonegrams. Specifically, SeC provides an average accuracy of , while FS and Mono provide an accuracy of and , respectively. Furthermore, Table I confirms that this accuracy is stable among the different classes.
IV-D Sound Source Localisation
Table II reports the median absolute error obtained in the regression of the DoA, when the preliminary segmentation is carried out through the SeC network (cf. ), and through the two benchmarks: the FS (cf. ), and the Mono networks (cf. ). The corresponding normalised histograms of the absolute error are shown in Fig. 5. In all three cases, the DoA estimation relies on the SL network. We observe that our framework (SeC followed by SL) is able to accurately localise the acoustic source, successfully coping with scenes characterised by extremely low SNRs (). Furthermore, it yields the best performance when compared to the FS and the Mono networks. In particular, while losing only , and in the event classification task accuracy, we obtain a improvement on and on in the DoA estimation, when considering both sirens and horns. We conclude that:
SeC vs FS: performing classification at the bottom of the Unet allows the segmentation to learn more task-specific patterns, allowing an increase in the DoA estimation accuracy.
SeC vs Mono: performing segmentation on the stereo gammatonegrams, rather than on the one of each channel independently, does allow us to learn a more stereo-aware representation of the sound, which makes the DoA regression more robust and accurate.
We also observe that, in all the frameworks, the estimation process is more accurate with the sirens than with the horns. This is as expected, as horns have characteristics similar to those of pure tones, and consist of a dominant frequency component whose pattern tends to vary only slightly over time. Such characteristics reduce the ability of the system to detect the auditory cues necessary to correctly regress the direction of arrival of the sound, which are based on the difference between the gammatonegrams of the two signals in the stereo combination. From Table II, we see that, in the case of the horns, employing SeC for segmentation provides a improvement on and on .
All the experiments considered so far rely on a random split of the dataset into training and test sets. This scenario, however, is particularly challenging, and does not adhere faithfully to reality, where the testing frames will be part of the same data stream and additional processing can be applied. In this last experiment, we build an additional test dataset, consisting of consecutive audio frames, and apply median filters of different orders to the DoA estimates to remove potential outliers. Results are shown in Fig. 6. We observe that the system becomes extremely reliable: we obtain, for instance, a median absolute error of when employing audio frames of , which is an acceptable time frame in our scenario, as our system also works properly at very low SNRs, which imply that the sound source is still at a considerable distance from the microphones.
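The outlier-removal step can be illustrated with a sliding median over consecutive DoA estimates, scored by the median absolute error used throughout this section. This is a minimal sketch: the filter order, the handling of the sequence edges, the angle values and the function names are assumptions, and angles are treated as plain numbers (no wrap-around at the ends of the angular range).

```python
from statistics import median


def median_filter(values, order):
    """Sliding median of odd order; edges use a shrunken window."""
    half = order // 2
    return [
        median(values[max(0, i - half):i + half + 1])
        for i in range(len(values))
    ]


def median_absolute_error(estimates, truths):
    """Median of |estimate - truth| over paired samples."""
    return median(abs(e - t) for e, t in zip(estimates, truths))


# Hypothetical stream: a stationary source at 45 degrees, with two
# gross outliers among otherwise small estimation errors.
truth = [45.0] * 9
raw = [44.0, 47.0, 45.0, 120.0, 43.0, 46.0, -40.0, 45.0, 44.0]
smoothed = median_filter(raw, order=5)

raw_error = median_absolute_error(raw, truth)
smoothed_error = median_absolute_error(smoothed, truth)
worst_raw = max(abs(e - t) for e, t in zip(raw, truth))
worst_smoothed = max(abs(e - t) for e, t in zip(smoothed, truth))
```

A higher filter order suppresses longer bursts of outliers but spans more frames, which is the trade-off between reliability and response time discussed above.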
In this paper we proposed a framework to detect alerting sound events in urban environments, and localise the respective sound source. As traffic scenarios are characterised by copious, unstructured and unpredictable noise, we proposed a new denoising method based on semantic segmentation of the stereo gammatonegram of the signals in a multi-task learning scheme, to simultaneously recover the original clean sound, and identify its nature. The direction of arrival of the sound is, then, obtained by training a CNN with the cross-gammatonegrams of the denoised signals. Our experimental evaluation, which included challenging scenarios characterised by extremely low SNRs (), showed an average classification rate of , and a median absolute error of when operating on audio frames of , and of when operating on frames of .
-  M. Hersh and M. A. Johnson, Assistive technology for visually impaired and blind people. Springer Science & Business Media, 2010.
-  L. Marchegiani and I. Posner, “Leveraging the urban soundscape: Auditory perception for smart vehicles,” in Robotics and Automation (ICRA), 2017 IEEE International Conference on. IEEE, 2017, pp. 6547–6554.
-  J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
-  O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention. Springer, 2015, pp. 234–241.
-  S. Argentieri, P. Danès, and P. Souères, “A survey on sound source localization in robotics: From binaural to array processing methods,” Computer Speech & Language, vol. 34, no. 1, pp. 87–112, 2015.
-  B. Fazenda, H. Atmoko, F. Gu, L. Guan, and A. Ball, “Acoustic based safety emergency vehicle detection for intelligent transport systems,” in Proceedings of the ICROS-SICE International Joint Conference 2009. IEEE Xplore, 2009.
-  F. Meucci, L. Pierucci, E. Del Re, L. Lastrucci, and P. Desii, “A real-time siren detector to improve safety of guide in traffic environment,” in Signal Processing Conference, 2008 16th European. IEEE, 2008, pp. 1–5.
-  J. Schröder, S. Goetze, V. Grutzmacher, and J. Anemüller, “Automatic acoustic siren detection in traffic noise by part-based models.” in ICASSP, 2013, pp. 493–497.
-  S. Ntalampiras, I. Potamitis, and N. Fakotakis, “An adaptive framework for acoustic monitoring of potential hazards,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2009, p. 13, 2009.
-  J. Salamon, C. Jacoby, and J. P. Bello, “A dataset and taxonomy for urban sound research,” in 22nd ACM International Conference on Multimedia (ACM-MM’14), Orlando, FL, USA, Nov. 2014.
-  B. Widrow and S. D. Stearns, “Adaptive signal processing,” Englewood Cliffs, NJ, Prentice-Hall, Inc., 1985, 491 p., vol. 1, 1985.
-  V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmentation,” arXiv preprint arXiv:1511.00561, 2015.
-  G. Papandreou, L.-C. Chen, K. Murphy, and A. L. Yuille, “Weakly-and semi-supervised learning of a dcnn for semantic image segmentation,” arXiv preprint arXiv:1502.02734, 2015.
-  H. Deshpande, R. Singh, and U. Nam, “Classification of music signals in the visual domain,” in Proceedings of the COST-G6 Conference on Digital Audio Effects. sn, 2001, pp. 1–4.
-  T. Dutta, “Text dependent speaker identification based on spectrograms,” Proceedings of Image and vision computing, pp. 238–243, 2007.
-  K.-S. Fu and J. Mui, “A survey on image segmentation,” Pattern recognition, vol. 13, no. 1, pp. 3–16, 1981.
-  B. Rudzyn, W. Kadous, and C. Sammut, “Real time robot audition system incorporating both 3d sound source localisation and voice characterisation,” in Robotics and Automation, 2007 IEEE International Conference on. IEEE, 2007, pp. 4733–4738.
-  L. Marchegiani, F. Pirri, and M. Pizzoli, “Multimodal speaker recognition in a conversation scenario,” in International Conference on Computer Vision Systems. Springer, 2009, pp. 11–20.
-  W. He, P. Motlicek, and J.-M. Odobez, “Deep neural networks for multiple speaker detection and localization,” arXiv preprint arXiv:1711.11565, 2017.
-  N. Ma, T. May, and G. J. Brown, “Exploiting deep neural networks and head movements for robust binaural localization of multiple sources in reverberant environments,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 25, no. 12, pp. 2444–2453, 2017.
-  X. Zhuang, X. Zhou, T. S. Huang, and M. Hasegawa-Johnson, “Feature analysis and selection for acoustic event detection,” in 2008 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2008, pp. 17–20.
-  D. Chakrabarty and M. Elhilali, “Abnormal sound event detection using temporal trajectories mixtures,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 216–220.
-  R. F. Lyon, A. G. Katsiamis, and E. M. Drakakis, “History and future of auditory filter models,” in Proceedings of 2010 IEEE International Symposium on Circuits and Systems. IEEE, 2010, pp. 3809–3812.
-  J. Holdsworth, I. Nimmo-Smith, R. Patterson, and P. Rice, “Implementing a gammatone filter bank,” Annex C of the SVOS Final Report: Part A: The Auditory Filterbank, vol. 1, pp. 1–5, 1988.
-  I. Toshio, “An optimal auditory filter,” in Applications of Signal Processing to Audio and Acoustics, 1995., IEEE ASSP Workshop on. IEEE, 1995, pp. 198–201.
-  B. R. Glasberg and B. C. Moore, “Derivation of auditory filter shapes from notched-noise data,” Hearing research, vol. 47, no. 1, pp. 103–138, 1990.
-  R. Collobert and J. Weston, “A unified architecture for natural language processing: Deep neural networks with multitask learning,” in Proceedings of the 25th international conference on Machine learning. ACM, 2008, pp. 160–167.
-  F. Jin and S. Sun, “Neural network multitask learning for traffic flow forecasting,” in Neural Networks, 2008. IJCNN 2008.(IEEE World Congress on Computational Intelligence). IEEE International Joint Conference on. IEEE, 2008, pp. 1897–1901.
-  R. Caruana, “Multitask learning,” in Learning to learn. Springer, 1998, pp. 95–133.
-  D. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accurate deep network learning by exponential linear units (elus),” CoRR, vol. abs/1511.07289, 2015. [Online]. Available: http://arxiv.org/abs/1511.07289
-  W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 4960–4964.
-  L. Marchegiani and X. Fafoutis, “On cross-language consonant identification in second language noise,” The Journal of the Acoustical Society of America, vol. 138, no. 4, pp. 2206–2209, 2015.
-  K. Noda, N. Hashimoto, K. Nakadai, and T. Ogata, “Sound source separation for robot audition using deep learning,” in Humanoid Robots (Humanoids), 2015 IEEE-RAS 15th International Conference on. IEEE, 2015, pp. 389–394.
-  D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014, published as a conference paper at the 3rd International Conference for Learning Representations (ICLR) 2015.
-  G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors,” arXiv preprint arXiv:1207.0580, 2012.
-  M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015, software available from tensorflow.org. [Online]. Available: http://tensorflow.org/
-  S. Deng, J. Han, C. Zhang, T. Zheng, and G. Zheng, “Robust minimum statistics project coefficients feature for acoustic environment recognition,” in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 8232–8236.
-  N. Takahashi, M. Gygli, and L. Van Gool, “Aenet: Learning deep audio features for video analysis,” arXiv preprint arXiv:1701.00599, 2017.