Adversarially Training for Audio Classifiers
Abstract
In this paper, we investigate the effect of adversarially training on the robustness of six advanced deep neural networks against a variety of targeted and non-targeted adversarial attacks. We first show that the ResNet56 model trained on the 2D representation of the discrete wavelet transform appended with the tonnetz chromagram outperforms the other models in terms of recognition accuracy. We then demonstrate the positive impact of adversarially training on this model, as well as on the other deep architectures, against six types of attack algorithms (white-box and black-box), at the cost of reduced recognition accuracy and under a limited adversarial perturbation. We run our experiments on two benchmark environmental sound datasets and show that, without any limitation on the budget allocated to the adversary, the fooling rate of the adversarially trained models can exceed 90%. In other words, adversarial attacks exist at any scale, but they might require larger adversarial perturbations than those needed against non-adversarially trained models.
I Introduction
The existence of adversarial attacks has been characterized for data-driven audio and speech recognition models in both the waveform and representation domains [5, 6]. In recent years, many strong white-box and black-box adversarial algorithms have been introduced, which essentially cast the attack as a costly optimization problem against the victim classifier. Unfortunately, these attacks effectively degrade the classification performance of almost all data-driven models, from conventional classifiers (e.g., support vector machines) to state-of-the-art deep neural networks [7]. This raises a serious and growing concern about the security and reliability of these classifiers.
The typical approach to crafting an adversarial example is to solve an optimization problem that finds the smallest possible perturbation of a legitimate sample, undetectable by a human, that fools the classifier. The commonly used measures to compare the altered sample with the original one are or similarity metrics. The computational complexity of this optimization depends on the dimensionality of the input samples. Consequently, it incurs considerable computational overhead for high-dimensional data, even for short audio signals [5]. However, regardless of the computational cost of the attacks, the threat exists for any end-to-end audio and speech classifier. Since the highest recognition accuracies have been reported on 2D representations of audio signals [6, 2], optimized attack algorithms developed for computer vision applications, such as the fast gradient sign method (FGSM) [9], have raised security concerns for audio classifiers as well [7].
Although some approaches have been introduced for defending victim models against adversarial attacks, there is not yet a reliable framework that achieves the required efficiency. Based on the detailed discussion in [1], common defense algorithms usually obfuscate gradient information, but stronger attack algorithms consistently fool them. Vulnerability to adversarial attacks remains an open problem in data-driven classification: although the generated fake examples look very similar to noisy samples, they lie in dissimilar subspaces [7, 18]. It has been shown that adversarial examples lie in manifolds marginally over the decision boundary of the victim classifier, where the model lacks generalizability [7]. Therefore, integrating these examples into the training set of the victim classifier could improve its robustness. This approach, known as adversarially training [9], might be a more reasonable defense because it does not shatter gradient vectors [1]. However, there is no guarantee for the safety of adversarially trained classifiers [33].
Although there are some discussions in the computer vision domain about the negative effect of adversarially training on the recognition performance of the victim classifiers [22], to the best of our knowledge, this potential side effect has not yet been studied for 2D representations of audio signals. We address this issue in this paper and report our results on two benchmark environmental sound datasets. Specifically, our main contributions are:

characterizing the impact of adversarially training on six advanced deep neural network architectures for diverse audio representations,

demonstrating that deep neural networks, especially those with residual blocks, achieve higher recognition performance on tonnetz features concatenated with DWT spectrograms than on STFT representations,

showing that the adversarially trained AlexNet model outperforms the ResNets when the perturbation magnitude is limited,

showing experimentally that, although adversarially training reduces the recognition accuracy of the victim model, it makes the attack more costly for the adversary in terms of required perturbation.
The rest of this paper is organized as follows. In Section II, we review related work on adversarial attacks developed for 2D domains. Details about signal transformation and 2D representation production are provided in Sections III and IV, respectively. In Section V, we briefly introduce our selected front-end audio classifiers, which are state-of-the-art deep learning architectures. The adversarial attack procedures and the budget allocation for the adversary are discussed in Section VI. Section VII explains the adversarially training framework, and the obtained results are summarized in Section VIII.
II Related Works
A large volume of published work studies attacks on classifiers that use different optimization techniques to disrupt their recognition performance. In this paper, we focus on five strong white-box targeted and non-targeted attack algorithms which have been reported to be very destructive against advanced deep learning models trained on audio representations [6]. We also use a black-box adversarial attack, based on gradient approximation, against the victim classifiers.
The fast gradient sign method is a well-established baseline for targeted adversarial attacks. The computational cost of this one-shot approach at runtime is low, since it takes advantage of the linear characteristics of deep neural networks. Kurakin et al. [17] introduced an iterative version of FGSM, known as the basic iterative method (BIM), for running stronger attacks against victim classifiers. It is formulated as:
(1) 
where the legitimate and its associated adversarial examples are represented by and , respectively. The initial state in this recursive formulation is in the neighbourhood (the distance measured by a similarity metric such as ) of the legitimate manifold. This is followed by a clipping operation for keeping the adversarial perturbation within . Moreover, and stand for the label of and the general sign function. In Eq. 1, the step size , though it is tunable according to the adversary’s wishes. Two types of optimizations can be used with Eq. 1: (1) optimizing up to reach the first adversarial example (BIMa) and (2) continuing the optimization up to a predefined number of iterations (BIMb). For measuring the , two similarity metrics are suggested: and . In this work, we focus on the latter.
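As an illustration of the iterative scheme described above, the following minimal sketch applies BIM to a toy logistic-regression classifier in plain Python. The model, the helper names (`logit`, `predict`, `loss_gradient`, `bim`), the non-targeted update (ascending the loss of the true label), and all numeric values are assumptions made for this example; they are not the classifiers or the settings used in this paper. The `stop_at_first` flag mimics the two stopping rules: halting at the first adversarial example (BIMa) or running the predefined number of iterations (BIMb).

```python
import math


def sign(v):
    """General sign function: -1, 0 or +1."""
    return (v > 0) - (v < 0)


def logit(x, w, b):
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(x, w, b):
    """Hard label of the toy logistic model."""
    return 1 if logit(x, w, b) > 0 else 0


def loss_gradient(x, y, w, b):
    """Gradient of the cross-entropy loss with respect to the input x."""
    p = 1.0 / (1.0 + math.exp(-logit(x, w, b)))
    return [(p - y) * wi for wi in w]


def bim(x, y, w, b, eps, alpha, steps, stop_at_first=False):
    """Basic iterative method with a max-norm bound eps around x."""
    x_adv = list(x)
    for _ in range(steps):
        grad = loss_gradient(x_adv, y, w, b)
        # One signed-gradient step of size alpha.
        x_adv = [xa + alpha * sign(g) for xa, g in zip(x_adv, grad)]
        # Clip so that the perturbation stays within [-eps, eps].
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
        if stop_at_first and predict(x_adv, w, b) != y:
            break
    return x_adv


# Toy model and sample (illustrative values only).
w, b = [1.0, -2.0], 0.0
x, y = [0.6, -0.2], 1
x_a = bim(x, y, w, b, eps=0.4, alpha=0.05, steps=10, stop_at_first=True)
x_b = bim(x, y, w, b, eps=0.4, alpha=0.05, steps=10)
```

In this toy setting both variants flip the predicted label, and the early-stopping variant ends with a perturbation no larger than the one produced by the full run.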
The gradient information of a deep neural network captures the direction of intensity variation associated with the model's decision boundary. Exploiting these information vectors to find the least likely probability distribution is the key idea of the Jacobian-based saliency map attack (JSMA) [23]. For the adversarial label , this iterative attack algorithm runs against the model and strives to achieve . The JSMA increases the probability of the target label while minimizing those of the other classes, including the ground truth, using a saliency map as shown in Eq. 2.
(2) 
where denotes the forward derivative of the model for the feature computed as:
(3) 
the Jacobian vectors associated with label and values of the saliency map less or greater than zero (no variation shield), . This whitebox attack algorithm searches, iteratively, the feature index on which the perturbation will be applied in order to fool the model toward the target label using the similarity metric .
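To make the role of the saliency map concrete, the sketch below computes it from a given Jacobian in plain Python. It follows the increasing-feature rule as it is commonly stated for JSMA [23]: a feature is salient only when raising it increases the target output and decreases the summed outputs of the other classes. The function name `saliency_map`, the list-of-lists Jacobian layout (one row per class, one column per feature), and the toy values are assumptions for illustration, not the paper's implementation.

```python
def saliency_map(jacobian, target):
    """Saliency of each input feature for pushing the model toward `target`.

    jacobian[k][i] is the derivative of output k with respect to feature i.
    """
    n_features = len(jacobian[0])
    saliency = []
    for i in range(n_features):
        d_target = jacobian[target][i]
        d_others = sum(row[i] for k, row in enumerate(jacobian) if k != target)
        if d_target < 0 or d_others > 0:
            # Perturbing this feature would not help the target label.
            saliency.append(0.0)
        else:
            saliency.append(d_target * abs(d_others))
    return saliency


# Toy Jacobian for 3 classes and 3 features (illustrative values only).
J = [
    [0.2, -0.5, 0.1],
    [-0.1, 0.4, 0.3],
    [-0.3, 0.1, -0.6],
]
s = saliency_map(J, target=0)
best_feature = max(range(len(s)), key=s.__getitem__)
```

The attack would then perturb `best_feature`, recompute the Jacobian, and repeat until the target label is reached or the iteration budget is spent.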
The perturbation required for pushing a sample over the decision boundary of the victim classifier should be as small as possible. In a white-box scenario, the optimization process uses local properties of the decision boundary. It has been shown that linearizing the boundary in the subspace of the original samples can yield an adversarial perturbation smaller than that of the FGSM attack. This approach, known as the DeepFool attack, is shown in Eq. 4 [21]:
(4) 
where the refers to the weight function of the recognition model. Unlike the other above-mentioned adversarial attacks, DeepFool is a non-targeted attack: it iterates as many times as needed to push random samples marginally over the locally linearized decision boundary, maximizing the prediction likelihood toward any label other than the ground truth. Though both or measurement metrics can be used in the DeepFool attack, we use the latter in accordance with the BIM algorithms.
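For intuition, when the classifier is an affine binary function f(x) = w·x + b, the linearization is exact and the smallest Euclidean perturbation that reaches the decision boundary has the closed form r = -f(x) w / ||w||^2. The sketch below implements this single step in plain Python. The function names, the small `overshoot` factor that pushes the sample marginally past the boundary, and the toy values are assumptions for this example; for a deep network the step would be repeated on successive local linearizations.

```python
import math


def l2_norm(v):
    return math.sqrt(sum(vi * vi for vi in v))


def deepfool_linear(x, w, b, overshoot=0.02):
    """One DeepFool step for an affine binary classifier f(x) = w.x + b."""
    f_x = sum(wi * xi for wi, xi in zip(w, x)) + b
    norm_sq = sum(wi * wi for wi in w)
    # Project x onto the hyperplane f = 0, then overshoot slightly.
    r = [-(1.0 + overshoot) * f_x / norm_sq * wi for wi in w]
    x_adv = [xi + ri for xi, ri in zip(x, r)]
    return x_adv, r


# Toy classifier and sample (illustrative values only).
w, b = [1.0, 1.0], -1.0
x = [2.0, 1.0]
x_adv, r = deepfool_linear(x, w, b)
```

Here the perturbation norm equals the distance from `x` to the boundary scaled by `1 + overshoot`, which is the smallest change that can flip the sign of f along this direction.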
Presumably, a straightforward way to keep an adversarial perturbation undetectable is to reduce its magnitude and distribute it over all input features. Additionally, not every feature should be perturbed, and the gradient vectors should not be shattered. Following these two assumptions, the Carlini and Wagner attack (CWA) was introduced [4]. The general framework of the algorithm is based on the following minimization problem:
(5) 
where the constant is obtainable through a binary search. Finding the most appropriate value for this hyperparameter is very challenging since it may easily dominate the distance function and push the sample too far away from the adversarial subspace. Although in Eq. 5 the similarity metric for computing the adversarial perturbation is employed, CWA properly generalizes for both and . In the configuration of this adversarial attack, the loss function is defined over the logits of for the trained model as shown in the following equation:
(6) 
where controls the effectiveness and the adjacency of the adversarial examples to the decision boundary of the model. In this regard, higher values for this parameter in conjunction with a minimum neighbourhood results in adversarial examples with higher confidence.
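The logit-based loss and the overall objective can be sketched as follows. This is a hedged illustration of the targeted loss as it is commonly stated for the Carlini and Wagner attack [4], namely the largest non-target logit minus the target logit, floored at the negative confidence margin. The function names, the squared Euclidean distance term, and the toy logits are assumptions for this example and are not taken from the paper's configuration.

```python
def cw_logit_loss(logits, target, kappa=0.0):
    """Targeted loss on the logits: non-positive once the target wins."""
    best_other = max(z for k, z in enumerate(logits) if k != target)
    return max(best_other - logits[target], -kappa)


def cw_objective(delta, logits_adv, target, c, kappa=0.0):
    """Squared l2 size of the perturbation plus c times the logit loss."""
    distance = sum(d * d for d in delta)
    return distance + c * cw_logit_loss(logits_adv, target, kappa)


# Toy logits for a 3-class model (illustrative values only).
logits = [1.0, 3.0, 0.5]
loss_wrong_target = cw_logit_loss(logits, target=0)          # target not yet reached
loss_hit = cw_logit_loss(logits, target=1)                   # target already predicted
loss_hit_margin = cw_logit_loss(logits, target=1, kappa=5.0)  # with a confidence margin
total = cw_objective([0.1, -0.2], logits, target=0, c=2.0)
```

With a zero margin the loss stops decreasing as soon as the target label wins; a larger margin keeps rewarding the optimizer until the target logit leads by that margin, which is how higher-confidence adversarial examples are obtained.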
For achieving the overall unrestricted adversarial perturbation () with small enough magnitude, CWA solves Eq. 5 through the following optimization framework:
(7) 
where and the unrestricted approximate perturbation is as the following.
(8) 
This perturbation is unrestricted and it should be tuned for feature values by measuring . For feature intensities with negligible gradient values, the actual adversarial perturbation truncates to zero, and for the rest: .
Attacking a victim classifier with unrestricted access to its details, including the training dataset, hyperparameters, architecture, and, more importantly, gradient information, as all the above-mentioned attack algorithms assume, is less challenging than attacking it in a black-box scenario. In the latter scheme, the adversary usually estimates the gradient by querying the classifier, for example to train a surrogate model. In this paper, the chosen black-box attack uses the natural evolution strategy (NES) [34], which has been employed for gradient approximation in [14]. This iterative algorithm is known as the partial information attack (PIA), and it encodes similarity metric as part of its targeted optimization problem. Finding the proper adversarial perturbation bound for PIA is somewhat challenging and requires a very large number of queries to the victim model.
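The gradient approximation at the core of such a black-box attack can be sketched as below. This is a minimal illustration of an NES-style estimator with antithetic Gaussian sampling, in which the gradient is inferred from pairs of function queries only. The function names, the search variance `sigma`, the number of query pairs, and the toy linear score are assumptions for this example; the sketch does not reproduce the PIA algorithm or the settings used in this paper.

```python
import random


def nes_gradient(f, x, sigma=0.1, n_pairs=2000, seed=0):
    """Estimate the gradient of f at x using function queries only."""
    rng = random.Random(seed)
    grad = [0.0] * len(x)
    for _ in range(n_pairs):
        u = [rng.gauss(0.0, 1.0) for _ in x]
        x_plus = [xi + sigma * ui for xi, ui in zip(x, u)]
        x_minus = [xi - sigma * ui for xi, ui in zip(x, u)]
        diff = f(x_plus) - f(x_minus)  # two queries per search direction
        for j, uj in enumerate(u):
            grad[j] += diff * uj
    return [g / (2.0 * sigma * n_pairs) for g in grad]


def score(x):
    """Toy black-box score whose true gradient is [1.0, -2.0, 0.5]."""
    return 1.0 * x[0] - 2.0 * x[1] + 0.5 * x[2]


estimate = nes_gradient(score, [0.3, -0.1, 0.8])
```

Each estimate here costs 4000 queries for a three-dimensional input, which hints at why the query count grows quickly for spectrogram-sized inputs.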
Before discussing how the adversarial attacks and adversarially training have been implemented on various deep neural network architectures, we first provide a brief overview of the transformation of an audio signal into 2D representations. The next section describes spectrogram generation using the short-time Fourier transformation (STFT), the discrete wavelet transformation (DWT), and tonnetz feature extraction. We then train our classifiers on these representations and investigate how adversarially training impacts their robustness to adversarial attacks.
III Audio Transformation
Since audio and speech signals have high dimensionality in the time domain, their 2D representations with lower dimensionality have been widely used for training advanced classifiers originally developed for 2D computer vision applications [8]. In this work, we use STFT and DWT, both with and without tonnetz features, for generating 2D representations of audio signals. This section briefly reviews the transformations required by this work.
For a discrete signal distributed over time using the Hann window function , we can compute the complex Fourier transformation using the following equation:
(9) 
where $m$ is the time scale and . Additionally, $\omega$ stands for the continuous frequency coefficient. This transformation is applied to shorter overlapping sub-signals with a predefined sampling rate and forms the STFT spectrogram as shown in the following equation:
$$SP_{STFT}\{a[n]\}(m,\omega) = \left| \sum_{n=-\infty}^{\infty} a[n]\, w[n-m]\, e^{-j\omega n} \right|^{2}$$
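The spectrogram computation described above can be sketched in plain Python as a windowed discrete Fourier transform evaluated frame by frame. The function names, the small window and hop sizes, and the test tone are assumptions chosen to keep the example fast; they are not the settings used for the experiments, and a practical pipeline would rely on an FFT-based library routine instead.

```python
import cmath
import math


def hann(n_win):
    """Periodic Hann window of length n_win."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / n_win) for n in range(n_win)]


def stft_spectrogram(signal, n_win=64, hop=16):
    """Squared magnitude of the windowed DFT, one row per frame."""
    window = hann(n_win)
    frames = []
    for start in range(0, len(signal) - n_win + 1, hop):
        segment = [signal[start + n] * window[n] for n in range(n_win)]
        row = []
        for k in range(n_win // 2 + 1):  # non-negative frequency bins
            acc = sum(
                segment[n] * cmath.exp(-2j * math.pi * k * n / n_win)
                for n in range(n_win)
            )
            row.append(abs(acc) ** 2)
        frames.append(row)
    return frames


# A pure tone centred on frequency bin 8 (illustrative signal only).
n_win, hop = 64, 16
signal = [math.sin(2.0 * math.pi * 8 * n / n_win) for n in range(256)]
spec = stft_spectrogram(signal, n_win, hop)
peak_bins = [max(range(len(row)), key=row.__getitem__) for row in spec]
```

Each row of `spec` is one time frame and each column one frequency bin, so stacking the rows gives the 2D image on which a classifier can be trained.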
There are several variants of the STFT transformation, such as mel-scale and cepstral coefficients, that produce even lower dimensionality and have been widely used for various speech processing tasks [25, 15]. However, in this work, we use the standard STFT representation for training the front-end dense classifiers.
Generating a DWT spectrogram is very similar to the Fourier transformation, as both employ continuous and differentiable basis functions. For the wavelet transformation, several functions have been studied, and their effectiveness for audio signals has been investigated in [20, 26]. The general form of this transformation for a continuous function is shown in Eq. 10.
(10) 
where and refer to the time variations in the transformation and the wavelet scale, respectively. Moreover, stands for the basis mother functions. Common choices for this function are Haar, Mexican Hat, and complex Morlet. The latter has been extensively used in signal processing, mainly because of its nonlinear characteristics [8] (see Eq. 11).
(11) 
The complex Morlet is continuous in its conjugate manifold. The convolution of this function with overlapping chunks of the given audio signal results in its spectral visualization as described in Eq. 12.
(12) 
where and are integer numbers associated with scales of .
The two aforementioned transformations represent spatiotemporal modulation features of a signal in the frequency domain, regardless of its harmonic characteristics. It has been demonstrated that the harmonic change detection function (HCDF) provides distinctive features for the audio signal [12]. This function derives a chromagram from the constant-Q transformation (CQT); the resulting features are also known as tonnetz features. According to [12], there are four major steps in an HCDF system. First, the audio signal is converted into logarithmic spectrum vectors using the CQT. Then, pitch-class vectors are extracted from the tonal transformation based on the quantized chromagram. In the third step, 6-dimensional centroid vectors form a tensor from the tonal transformation. Finally, a smoothing operation post-processes this tensor for distance calculation.
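The third step above, mapping a 12-bin pitch-class vector to a 6-dimensional centroid, can be sketched as follows. The radii and angles are written as the mapping is commonly stated for the tonal centroid of [12] (circle of fifths, minor thirds, and major thirds) and should be read as assumptions of this sketch, as should the function name and the handling of silent frames.

```python
import math

# Radii and angular steps of the three circles (assumed values).
RADII = (1.0, 1.0, 0.5)
ANGLES = (7.0 * math.pi / 6.0, 3.0 * math.pi / 2.0, 2.0 * math.pi / 3.0)


def tonal_centroid(chroma):
    """Map a 12-bin chroma vector to a 6-dimensional tonal centroid."""
    total = sum(abs(c) for c in chroma)
    if total == 0:
        return [0.0] * 6  # silent frame: no harmonic content
    centroid = []
    for radius, angle in zip(RADII, ANGLES):
        # Each circle contributes one sine and one cosine coordinate.
        centroid.append(
            sum(radius * math.sin(l * angle) * c for l, c in enumerate(chroma)) / total
        )
        centroid.append(
            sum(radius * math.cos(l * angle) * c for l, c in enumerate(chroma)) / total
        )
    return centroid


# A frame containing only pitch class 0, and a C-major-like triad.
single = tonal_centroid([1.0] + [0.0] * 11)
triad = tonal_centroid([1.0, 0, 0, 0, 1.0, 0, 0, 1.0, 0, 0, 0, 0])
silent = tonal_centroid([0.0] * 12)
```

Computing this centroid for every frame yields a 6-row feature map over time, which is the kind of tonnetz representation that gets resized and appended to a spectrogram.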
We use the HCDF system to generate spectrograms from audio signals in order to enhance the recognition performance of the classifiers. In the next section, we provide details of this process for two benchmark environmental sound datasets.
IV Spectrogram Production
We produce the STFT representation following the instructions provided by the open-source Python library Librosa [19]. We set the window size and the hop length ( and in Eq. 9) to 2048 and 512, respectively. Additionally, we initialize the number of filters to 2048, which is the standard value for the environmental sound task [8]. Audio chunks associated with each window are padded in order to reduce the potential negative effect of losing temporal dependencies. Furthermore, the frames are overlapped using a ratio of 50%.
For generating DWT spectrograms, we use our modified version of the wavelet sound explorer [11] with the complex Morlet mother function. As proposed by [2], we set the DWT sampling frequency to 16 kHz for ESC50 and 8 kHz for UrbanSound8K, with a uniform 50% overlapping ratio. For enhancement purposes, we apply logarithmic visualization to the generated spectrograms to better characterize high-frequency areas.
For the tonnetz chromagram, we use the default settings provided by Librosa with a sampling rate of 22.05 kHz. We resize the resulting chromagrams so that they comply with the aforementioned representations. Inspired by [31], we append these features to the STFT and DWT spectrograms and organize them into two additional representations. In the next section, we provide more details about the training of the front-end classifiers using these four spectrogram sets.
V Classification Models
Since the adversary runs the attack against the classifier, the choice of the victim network architecture affects the fooling rate. This issue has been studied in [6] for the advanced GoogLeNet [32] and AlexNet [16] architectures trained on DWT (with linear, logarithmic, and logarithmic real visualizations), STFT, and their pooled spectrograms. Since our main objective is to investigate the impact of adversarially training on advanced deep learning classifiers, we additionally include the ResNetX architectures with X ∈ {18, 34, 56} [13] and the VGG16 [30] architecture.
The pretrained models of these six classifiers have been used, and the input and output layers have been fine-tuned as described in [6]. The computational hardware used for all experiments consists of two NVIDIA GTXTi with GB memory in addition to a bit Intel Corei ( GHz) CPU with GB RAM. We carry out our experiments using a five-fold cross-validation setup for all the spectrogram sets. As is common practice in model performance analysis, we reserve 70% of the samples for training and development, followed by early stopping. We report the recognition accuracy of these models on the remaining 30% of the samples.
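The data partitioning described above can be sketched as a shuffled 70/30 split followed by five folds over the development portion. The function name, the fixed random seed, and the interleaved fold assignment are assumptions for this example; the sketch works on sample indices only and is independent of the datasets used here.

```python
import random


def split_and_fold(n_samples, train_ratio=0.7, n_folds=5, seed=0):
    """Return (train, validation) index pairs per fold and the held-out test indices."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_dev = int(round(train_ratio * n_samples))
    dev, test = indices[:n_dev], indices[n_dev:]
    # Interleaved assignment keeps the folds equally sized (up to one sample).
    folds = [dev[k::n_folds] for k in range(n_folds)]
    splits = []
    for k in range(n_folds):
        val = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        splits.append((train, val))
    return splits, test


splits, test = split_and_fold(100)
```

The test indices never enter any fold, so the accuracy reported on them is not influenced by model selection or early stopping on the development portion.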
In the next section, we provide the detailed setup for the adversarial algorithms mentioned in Section II. We additionally discuss the budget allocations required by the adversary for successfully attacking the six finely trained victim models.
VI Adversarial Attack Setup
To attack the classifiers effectively, the adversary should tune the hyperparameters required by the attack algorithms, such as the number of iterations, the perturbation limit, and the number of line searches within the manifold, which we collectively refer to as the budget allocation. To find the optimal required budgets, we bind the fooling rates of the attack algorithms to a predefined threshold associated with the area under the curve of the attack success. In other words, we allocate as much budget as needed for reaching the for all attacks against the victim models. This is a critical threshold for demonstrating the extreme vulnerability of neural networks against adversarial attacks.
In accordance with the above, we use Foolbox [28], a freely available Python package that provides uniform, reproducible implementations of the attack algorithms. For the BIMa and BIMb algorithms, we define the with the confidence of (). In the JSMA framework, we set the number of iterations to a maximum of 1000 and the scaling factor within (with equivalent displacement of 50). The number of iterations in the DeepFool attack is initialized to 100, with a supremum of 600 and a static step of 100. For the costly CWA attack, we set the search step within the number of iteration associated with every . Except for DeepFool, which is a non-targeted attack, we randomly select targeted wrong labels for the rest of the algorithms.
There are four hyperparameters required for the blackbox PIA algorithm. We empirically limit the perturbation bound to followed by an iterative line search to find the most approximately optimal variance in the NES gradient estimation. We initialize the number of iteration to 500 with decay rate of and the learning rate .
In our framework for attacking the front-end audio classifiers, we run the algorithms on shuffled batches of 500 samples, up to 50 batches of 100 samples, randomly selected from the clean spectrograms in every step toward spanning the entire datasets. These attacks are performed with the above-mentioned allocated budgets, once before and once after adversarially training, in order to measure the robustness of the models. Section VII provides details on how adversarially training has been implemented.
VII Adversarially Training
The idea of adversarially training was first proposed in [9], where the authors showed that augmenting the training dataset with one-shot FGSM adversarial examples improves the robustness of the victim models. The main advantage of this simple approach is that it neither shatters nor obfuscates gradient information while running a fast non-iterative procedure. This has made adversarially training a relatively reliable defense approach. However, it may not confidently defend against stronger white-box adversarial algorithms [33].
Many adversarial defense approaches have been introduced in recent years and have been reported to outperform FGSM-based adversarially training [24, 3, 10]. However, some studies have reported that these advanced defense approaches shatter gradient vectors and might easily break against strong adversarial attacks that do not rely on the exact gradient information, such as the backward pass differentiable approximation [1].
Augmenting the clean training dataset with adversarial examples in the adversarially trained framework is shown in Eq. 13 [9].
(13) 
where is a subjective weight scalar definable by the adversary. Additionally, and denote the loss function and the derived weight vector of the victim model, respectively. Moreover and refer to the legitimate and adversarial example associated with the genuine label . Adversarially training using a costly attack algorithm is very timeconsuming and memory prohibitive in practice. Therefore, we use the FGSM for augmenting the original spectrogram datasets with the adversarial examples according to the assumption of .
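For reference, the FGSM-based adversarial training objective is commonly written in [9] as the weighted sum below, where the second term evaluates the loss on the one-shot FGSM example. The symbols used here ($\alpha$ for the weight, $\theta$ for the model weights, $\epsilon$ for the perturbation magnitude, $J$ for the loss) follow that common formulation and are assumptions of this sketch; they may differ from the notation of Eq. 13.

```latex
\tilde{J}(\theta, x, y) = \alpha\, J(\theta, x, y)
  + (1 - \alpha)\, J\big(\theta,\; x + \epsilon\, \mathrm{sign}\big(\nabla_{x} J(\theta, x, y)\big),\; y\big)
```

Setting the weight to one recovers ordinary training on clean samples, while smaller weights put more emphasis on the adversarial examples.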
In the next section, we report our achieved results for the dense neural network models about the adversarial attacks and adversarially training on four different representations, namely STFT, DWT, STFT appended with tonnetz features, and DWT appended with tonnetz chromagrams.
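Table I: Recognition accuracies of the classifiers and the effect of adversarially training on them
Dataset  Representations  GoogLeNet  AlexNet  ResNet18  ResNet34  ResNet56  VGG16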

ESC50  STFT  ,  ,  ,  ,  69.77,  , 
DWT  ,  ,  ,  ,  71.56,  ,  
STFT Tonnetz  ,  ,  ,  ,  70.22,  ,  
DWT Tonnetz  ,  ,  ,  ,  71.79,  ,  
UrbanSound8K  STFT  ,  ,  ,  ,  88.77,  , 
DWT  ,  ,  ,  ,  90.14,  ,  
STFT Tonnetz  ,  ,  ,  ,  ,  89.42,  
DWT Tonnetz  ,  ,  ,  ,  91.36,  , 
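Table II: Fooling rates of the adversarially trained models under the limited adversarial perturbation
Dataset  Representations  GoogLeNet  AlexNet  ResNet18  ResNet34  ResNet56  VGG16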

ESC50  STFT  50.97  
DWT  51.03  
STFT Tonnetz  50.46  
DWT Tonnetz  49.33  
UrbanSound8K  STFT  53.24  
DWT  51.92  
STFT Tonnetz  50.71  
DWT Tonnetz  52.23 
Table III: Average perturbation ratio for the victim models trained on different representations
Dataset  Representations  GoogLeNet  AlexNet  ResNet18  ResNet34  ResNet56  VGG16

ESC50  STFT  2.312  
DWT  2.307  
STFT Tonnetz  2.161  
DWT Tonnetz  2.609  
UrbanSound8K  STFT  2.439  
DWT  2.892  
STFT Tonnetz  2.308  
DWT Tonnetz  2.501 
VIII Experimental Results
We conduct our experiments on two environmental sound datasets: UrbanSound8K [29] and ESC50 [27]. The first dataset contains 8732 short recordings arranged in 10 classes (car horn, dog bark, drilling, jackhammer, street music, siren, children playing, air conditioner, engine idling, and gun shot) with the audio length of seconds. The ESC50 dataset contains 2,000 audio signals with an equal length of five seconds, organized in 50 classes.
To enhance both the quality and the quantity of these datasets, especially ESC50, we filter samples using the pitch-shifting operation in the temporal domain, as proposed in [8]. Following their 1D filtration setup, we use the scales of . This increases the size of the datasets by a factor of 4.
Following the explanations provided in Section IV about the spectrogram production, the dimension of each resulting spectrogram is for both STFT and DWT (the logarithmic scale) representations on the two datasets. Moreover, the dimensions of the resulting chromagrams are , which will be appended to the aforementioned representations. Table I summarizes the recognition accuracies of the classifiers trained on these spectrograms. Additionally, this table shows the effect of adversarially training on the recognition performance of these models.
The classifiers in Table I have been selected for evaluation on the test sets after running the five-fold cross-validation scenario on the randomized development portion of the training datasets. According to this table, the different deep neural network architectures show competitive performance. However, in the majority of cases, ResNet56 outperforms the other classifiers, averaged over 10 repeated experiments on the spectrograms. The highest recognition accuracy has been achieved by the ResNet56 architecture trained on the appended representation of DWT and tonnetz chromagrams, for both the UrbanSound8K and ESC50 datasets. The number of parameters in ResNet56 is 11.3% and 14.26% higher than in its rival models VGG16 and ResNet34, respectively.
Fig. 29 visually compares the adversarial examples crafted against the best-performing classifier, ResNet56, using the six adversarial attacks on a randomly selected audio sample represented with the four spectrogram approaches described earlier. Although the generated spectrograms are visually very similar to their legitimate counterparts, they all cause the classifier to predict wrong labels.
Table I also shows the drop ratio of the recognition accuracies after adversarially training the models following the procedure explained in Section VII. The maximum required adversarial perturbation for complying with the fooling rate of is achieved at , averaged over all the attacks. In attacking the adversarially trained models, the procedures outlined in Section VI have been implemented individually for every audio classifier. According to the obtained results, adversarially training considerably reduces the performance of all models. For ESC50, the neural networks trained on the appended representation of STFT and tonnetz features (STFT Tonnetz) have experienced the most negative impact compared to the other representations. For the UrbanSound8K dataset, the average drop ratio for models adversarially trained on the DWT Tonnetz representations is slightly higher than for their STFT Tonnetz counterparts. However, for both datasets, these ratios for models trained on the DWT spectrograms are considerably higher than for those trained on the STFT representations.
We measure the fooling rate of adversarially trained models after attacking them using the same six adversarial algorithms following the procedure explained in section VI with the imposed condition of for the adversarial perturbation. This experiment uncovers the impact of adversarially training on the robustness of the audio classifiers (see Table II). We applied the aforementioned condition to make this table comparable with Table I. Regarding the results reported in Table II, adversarially training has improved the robustness of all the classifiers, particularly AlexNet.
For investigating the overall impact of the adversarially training on the robustness of audio classifiers, we attack the adversarially trained models using the same six attack algorithms without the condition of . Unfortunately, we could achieve the fooling rate with for all the classifiers following the attack procedure explained in section VI. However, attacking the adversarially trained models requires larger values for the adversarial perturbation () compared to attacking the original models and consequently, increases the number of callbacks to the original spectrogram with extra batch gradient computations. This might degrade the quality of the generated spectrograms. In order to analytically compare the maximum adversarial perturbation required for the original and the adversarially trained models, we compute the average perturbation ratio as shown in Eq. 14:
(14) 
where and denote the average adversarial perturbation required for successfully attacking the adversarially trained and original models (both with ), respectively. Table III summarizes values for for the victim models trained on different representations.
Note that an indicates the positive impact of adversarially training on the robustness of the audio classifiers via increasing the computational cost of the attack by expanding the magnitude of the required adversarial perturbation. With respect to the measured metric for all the frontend classifiers, the ResNet56 architecture showed better robustness against adversarial attacks in average for 50% of the experiments. In other words, attacking this model adds additional cost for the adversary in crafting adversarial examples with the .
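The comparison behind Table III can be sketched as a ratio of two averages. This minimal example assumes, consistently with the description of Eq. 14, that the ratio divides the mean perturbation needed against the adversarially trained model by the mean perturbation needed against the original model. The function name and the numbers are hypothetical and are not values from the experiments.

```python
def average_perturbation_ratio(adv_trained, original):
    """Mean perturbation against the adversarially trained model over the original one."""
    mean_adv = sum(adv_trained) / len(adv_trained)
    mean_orig = sum(original) / len(original)
    return mean_adv / mean_orig


# Hypothetical per-attack perturbation magnitudes (not measured values).
ratio = average_perturbation_ratio([0.5, 0.7], [0.2, 0.3])
```

A ratio above one means that, on average, the adversary needs a larger perturbation to fool the adversarially trained model than to fool the original one.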
IX Conclusion
In this paper, we presented the impact of adversarially training as a gradient-obfuscation-free defense approach against adversarial attacks. We trained six advanced deep learning classifiers on four different 2D representations of environmental audio signals and ran five white-box and one black-box attack algorithms against these victim models. We demonstrated that adversarially training considerably reduces the recognition accuracy of the classifier but improves its robustness against six types of targeted and non-targeted adversarial examples by constraining over the maximum required adversarial perturbation to . In other words, adversarially training is not a remedy for the threat of adversarial attacks; however, it escalates the cost of the attack for the adversary by demanding larger adversarial perturbations than those needed against non-adversarially trained models.
Acknowledgment
This work was funded by the Natural Sciences and Engineering Research Council of Canada (NSERC) under Grant RGPIN 201604855 and Grant RGPIN 201606628.
References
[1] (2018) Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420.
[2] (2017) Classifying environmental sounds using image recognition networks. Procedia Computer Science 112, pp. 2048–2056.
[3] (2018) Thermometer encoding: one hot way to resist adversarial examples. In International Conference on Learning Representations.
[4] (2017) Towards evaluating the robustness of neural networks. In IEEE Symp Secur Priv, pp. 39–57.
[5] (2018) Audio adversarial examples: targeted attacks on speech-to-text. arXiv preprint arXiv:1801.01944.
[6] (2020) A robust approach for securing audio classification against adversarial attacks. IEEE Transactions on Information Forensics and Security 15, pp. 2147–2159.
[7] (2020) Detection of adversarial attacks and characterization of adversarial subspace. In IEEE Intl Conf on Acoustics, Speech and Signal Processing (ICASSP), pp. 3097–3101.
[8] (2020) Unsupervised feature learning for environmental sound classification using weighted cycle-consistent generative adversarial network. Applied Soft Computing 86, pp. 105912.
[9] (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.
[10] (2017) Countering adversarial images using input transformations. arXiv preprint arXiv:1711.00117.
[11] (2008) Wavelet sound explorer software. http://stevehanov.ca/wavelet/
[12] (2006) Detecting harmonic change in musical audio. In Proceedings of the 1st ACM workshop on Audio and music computing multimedia, pp. 21–26.
[13] (2016) Deep residual learning for image recognition. In Proc. IEEE conference on computer vision and pattern recognition, pp. 770–778.
[14] (2018) Black-box adversarial attacks with limited queries and information. arXiv preprint arXiv:1804.08598.
[15] (2018) Speech waveform synthesis from MFCC sequences with generative adversarial networks. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5679–5683.
[16] (2012) ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105.
[17] (2016) Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533.
[18] (2018) Characterizing adversarial subspaces using local intrinsic dimensionality. arXiv preprint arXiv:1801.02613.
[19] (2015) Librosa: audio and music signal analysis in Python. In 14th Python in Science Conf, Vol. 8.
[20] (2008) Content based audio classification: a neural network approach. Soft Computing 12 (7), pp. 639–646.
[21] (2016) DeepFool: a simple and accurate method to fool deep neural networks. In IEEE Conf Comp Vis Patt Recog, pp. 2574–2582.
[22] (2016) Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. arXiv preprint arXiv:1605.07277.
[23] (2016) The limitations of deep learning in adversarial settings. In 2016 IEEE European Symposium on Security and Privacy (EuroS&P), pp. 372–387.
[24] (2016) Distillation as a defense to adversarial perturbations against deep neural networks. In IEEE Symposium on Security and Privacy, pp. 582–597.
[25] (2010) Speech recognition using hidden Markov model with MFCC-subband technique. In 2010 International Conference on Recent Trends in Information, Telecommunication and Computing, pp. 168–172.
[26] (2014) Classification of cardiac sound signals using constrained tunable-Q wavelet transform. Expert Systems with Applications 41 (16), pp. 7161–7170.
[27] (2015) ESC: dataset for environmental sound classification. In Proc. 23rd ACM international conference on Multimedia, pp. 1015–1018.
[28] (2017) Foolbox: a Python toolbox to benchmark the robustness of machine learning models. arXiv preprint arXiv:1707.04131.
[29] (2014, Nov.) A dataset and taxonomy for urban sound research. In 22nd ACM Intl Conf on Multimedia, Orlando, FL, USA.
[30] (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[31] (2019) Environment sound classification using a two-stream CNN based on decision-level fusion. Sensors 19 (7), pp. 1733.
[32] (2015) Going deeper with convolutions. In IEEE Conf Comp Vis Patt Recog, pp. 1–9.
[33] (2017) Ensemble adversarial training: attacks and defenses. arXiv preprint arXiv:1705.07204.
[34] (2008) Natural evolution strategies. In 2008 IEEE Congress on Evolutionary Computation (IEEE World Congress on Computational Intelligence), pp. 3381–3387.