Adversarially Training for Audio Classifiers


In this paper, we investigate the potential effect of adversarially training on the robustness of six advanced deep neural networks against a variety of targeted and non-targeted adversarial attacks. We first show that the ResNet-56 model trained on the 2D representation of the discrete wavelet transform appended with the tonnetz chromagram outperforms the other models in terms of recognition accuracy. We then demonstrate the positive impact of adversarially training on this model, as well as on the other deep architectures, against six types of attack algorithms (white-box and black-box), at the cost of reduced recognition accuracy and under a limited adversarial perturbation. We run our experiments on two benchmarking environmental sound datasets and show that, without any limitation imposed on the budget allocated to the adversary, the fooling rate of the adversarially trained models can exceed 90%. In other words, adversarial attacks exist at any scale, but they might require higher adversarial perturbations compared to non-adversarially trained models.

Spectrogram, Chromagram, Tonnetz Features, Discrete Wavelet Transformation (DWT), Short-Time Fourier Transformation (STFT), Sound Classification, Deep Neural Network, ResNet, VGG, AlexNet, GoogLeNet, Adversarial Attack, Adversarially Training.

I Introduction

The existence of adversarial attacks has been characterized for data-driven audio and speech recognition models in both the waveform and the representation domains [5, 6]. In recent years, many strong white-box and black-box adversarial algorithms have been introduced, which essentially solve costly optimization problems against victim classifiers. Unfortunately, these attacks effectively degrade the classification performance of almost all data-driven models, from conventional classifiers (e.g., support vector machines) to state-of-the-art deep neural networks [7]. This poses a serious and growing concern about the security and reliability of these classifiers.

The typical approach in crafting an adversarial example is to solve an optimization problem in order to obtain the smallest possible perturbation for a legitimate sample, undetectable by a human, that fools the classifier. The measures commonly used to compare the altered sample with the original one are $l_p$-norm similarity metrics such as $l_2$ or $l_\infty$. The computational complexity of this optimization process depends on the dimensions of the given input samples. Consequently, it requires considerable computational overhead for high-dimensional data, even in the case of short audio signals [5]. However, regardless of the computational cost of the attacks, this threat actively exists for any end-to-end audio and speech classifier. Since the highest recognition accuracies have been reported on 2D representations of audio signals [6, 2], the optimized attack algorithms developed for computer vision applications, such as the fast gradient sign method (FGSM) [9], have led to security concerns for audio classifiers [7].

Although some approaches have been introduced for defending victim models against adversarial attacks, there is not yet a reliable framework achieving the required efficiency. Based on the detailed discussion in [1], common defense algorithms usually obfuscate gradient information, but running stronger attack algorithms against them consistently fools them. Unfortunately, vulnerability to adversarial attacks is an open problem in data-driven classification, and although the generated fake examples look very similar to noisy samples, they lie in dissimilar subspaces [7, 18]. It has been shown that adversarial examples lie in manifolds marginally over the decision boundary of the victim classifier, where the model lacks generalizability [7]. Therefore, integrating these examples into the training set of the victim classifier could improve its robustness. This approach, known as adversarially training [9], might be a more reasonable defense approach since it does not shatter gradient vectors [1]. However, there is no guarantee for the safety of adversarially trained classifiers [33].

Although there are some discussions in the computer vision domain about the negative effect of adversarially training on the recognition performance of the victim classifiers [22], to the best of our knowledge, this potential side effect has not yet been studied for 2D representations of audio signals. We address this issue in this paper and report our results on two benchmarking environmental sound datasets. Specifically, our main contributions in this paper are:

  • characterizing the impact of adversarially training on six advanced deep neural network architectures for diverse audio representations,

  • demonstrating that deep neural networks, especially those with residual blocks, have higher recognition performance on tonnetz features concatenated with DWT spectrograms compared to STFT representations,

  • showing that the adversarially trained AlexNet model outperforms ResNets when the perturbation magnitude is limited,

  • experimentally showing that, although adversarially training reduces the recognition accuracy of the victim model, it makes the attack more costly for the adversary in terms of the required perturbation.

The rest of this paper is organized as follows. In Section II, we review related works on adversarial attacks developed for 2D domains. Details about signal transformation and the production of 2D representations are provided in Sections III and IV, respectively. In Section V, we briefly introduce our selected front-end audio classifiers, which are state-of-the-art deep learning architectures. The adversarial attack procedures and the budget allocation for the adversary are discussed in Section VI. Section VII explains the adversarially training framework, and the obtained results are summarized in Section VIII.

II Related Works

There is a large volume of published studies on attacking classifiers using different optimization techniques that aim to effectively disrupt their recognition performance. In this paper, we focus on five strong white-box targeted and non-targeted attack algorithms which have been reported to be very destructive when used on advanced deep learning models trained on audio representations [6]. Moreover, we also use a black-box adversarial attack, based on gradient approximation, against the victim classifiers.

The fast gradient sign method is a well-established baseline among targeted adversarial attacks. The computational cost of this one-shot approach at runtime is low, as it takes advantage of the linear characteristics of deep neural networks. Kurakin et al. [17] introduced an iterative version of FGSM, known as the basic iterative method (BIM), for running stronger attacks against victim classifiers. It is formulated as:

$$x'_{n+1} = \mathrm{clip}_{x,\epsilon}\left\{ x'_n + \alpha\,\mathrm{sign}\left(\nabla_x J(x'_n, l)\right) \right\} \tag{1}$$

where the legitimate example and its associated adversarial example are represented by $x$ and $x'$, respectively. The initial state $x'_0$ of this recursive formulation lies in the $\epsilon$-neighbourhood (the distance measured by a similarity metric such as $l_\infty$) of the legitimate manifold. Each update is followed by a clipping operation that keeps the adversarial perturbation within $[x-\epsilon, x+\epsilon]$. Moreover, $J$ denotes the loss function of the model, while $l$ and $\mathrm{sign}$ stand for the label of $x$ and the general sign function. In Eq. 1, the step size $\alpha$ is tunable according to the adversary's wishes. Two types of optimization can be used with Eq. 1: (1) optimizing until the first adversarial example is reached (BIM-a) and (2) continuing the optimization up to a predefined number of iterations (BIM-b). For measuring the perturbation, two similarity metrics are suggested: $l_2$ and $l_\infty$. In this work, we focus on the latter.
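As a minimal sketch of the iterative procedure described above (in its BIM-b form, with a fixed number of iterations), the following NumPy code repeats a signed-gradient step and clips the result back into the neighbourhood of the legitimate sample. The toy logistic-regression victim, the names `bim_attack` and `loss_grad`, and the values of `eps`, `alpha` and `n_iter` are illustrative assumptions and not the settings used in this work.

```python
import numpy as np

def bim_attack(x, label, grad_fn, eps=0.1, alpha=0.01, n_iter=20):
    """Basic iterative method under an l_inf bound (illustrative sketch)."""
    x_adv = x.copy()
    for _ in range(n_iter):
        # Ascend the loss along the sign of its input gradient.
        x_adv = x_adv + alpha * np.sign(grad_fn(x_adv, label))
        # Clip back into the eps-neighbourhood of the legitimate sample.
        x_adv = np.clip(x_adv, x - eps, x + eps)
    return x_adv

# Hypothetical victim: logistic regression with fixed weights.
w = np.array([1.5, -2.0, 0.5])

def loss_grad(x, label):
    # Gradient of the binary cross-entropy with respect to the input x.
    p = 1.0 / (1.0 + np.exp(-(w @ x)))
    return (p - label) * w

x = np.array([0.2, -0.1, 0.4])
x_adv = bim_attack(x, label=1, grad_fn=loss_grad)
```

Stopping the loop as soon as the predicted label changes would give the BIM-a variant instead.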

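Gradient information of a deep neural network contains the direction of intensity variation associated with the model decision boundary. Exploiting these information vectors for finding the least likely probability distribution is the key idea of the Jacobian-based saliency map attack (JSMA) [23]. For the adversarial label $l'$, this iterative attack algorithm runs against the model $f$ and strives to achieve $\arg\max_j f_j(x') = l'$. The JSMA increases the probability of the target label while minimizing those of the other classes, including the ground truth, using a saliency map as shown in Eq. 2.

$$S(x, l')[i] = \begin{cases} 0, & \text{if } \dfrac{\partial f_{l'}(x)}{\partial x_i} < 0 \ \text{or} \ \sum_{j \neq l'} \dfrac{\partial f_j(x)}{\partial x_i} > 0 \\[2ex] \dfrac{\partial f_{l'}(x)}{\partial x_i} \left| \sum_{j \neq l'} \dfrac{\partial f_j(x)}{\partial x_i} \right|, & \text{otherwise} \end{cases} \tag{2}$$

where $\partial f_j(x) / \partial x_i$ denotes the forward derivative of the model for the feature $x_i$, computed as:

$$\nabla f(x) = \frac{\partial f(x)}{\partial x} = \left[ \frac{\partial f_j(x)}{\partial x_i} \right]_{i,j} \tag{3}$$

Eq. 3 collects the Jacobian vectors associated with every label $j$, and the saliency map is set to zero wherever the derivative toward the target label is less than zero or the summed derivative toward the other labels is greater than zero. This white-box attack algorithm iteratively searches for the feature index $i$ on which the perturbation will be applied in order to fool the model toward the target label, using the similarity metric $l_0$.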
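The saliency-map selection described above can be illustrated with a short sketch, assuming the standard formulation of [23] in which a feature scores zero if it lowers the target class or raises the remaining classes. The function name `saliency_map` and the small hand-written Jacobian are hypothetical and serve only as an example.

```python
import numpy as np

def saliency_map(jacobian, target):
    """jacobian[j, i] holds d f_j / d x_i; returns one score per feature i."""
    d_target = jacobian[target]
    d_others = jacobian.sum(axis=0) - d_target
    # Keep only features that raise the target and lower the other classes.
    keep = (d_target >= 0) & (d_others <= 0)
    return np.where(keep, d_target * np.abs(d_others), 0.0)

# Hypothetical Jacobian of a 3-class model with 3 input features.
jac = np.array([[0.2, -0.1, 0.3],
                [-0.5, 0.4, 0.1],
                [0.1, 0.2, -0.6]])
scores = saliency_map(jac, target=0)
best_feature = int(np.argmax(scores))  # feature to perturb next
```

In an iterative attack, the selected feature would be perturbed, the Jacobian recomputed, and the selection repeated until the target label is reached or the budget is exhausted.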

The perturbation required for pushing a sample over the decision boundary of the victim classifier should be as small as possible. In a white-box scenario, the optimization process uses local properties of the decision boundary. It has been shown that linearizing the boundary in the subspace of the original samples can yield adversarial perturbations smaller than those of the FGSM attack. This approach, known as the DeepFool attack, is shown in Eq. 4 [21]:

$$r_{*}(x) = -\frac{f(x)}{\|w\|_2^2}\, w \tag{4}$$

where $w$ refers to the weight function of the recognition model $f$ and $r_{*}(x)$ is the minimal perturbation toward the locally linearized boundary. Unlike the other abovementioned adversarial attacks, DeepFool is a non-targeted attack, and it iterates as many times as needed to push random samples marginally over the locally linearized decision boundary, with the condition of maximizing the prediction likelihood toward any label other than the ground truth. Though both the $l_2$ and $l_\infty$ metrics can be used in the DeepFool attack, we use the latter in accordance with the BIM algorithms.
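To make the linearization idea concrete, the following sketch applies one DeepFool step to an affine binary classifier, for which the closest point on the decision boundary has a closed form. It uses the $l_2$ projection for simplicity (this work uses the $l_\infty$ variant), and the classifier parameters, the function name `deepfool_linear_step` and the `overshoot` value are assumptions made for illustration.

```python
import numpy as np

def deepfool_linear_step(x, w, b, overshoot=0.02):
    """Smallest l2 move taking x across the hyperplane w.x + b = 0."""
    f_x = w @ x + b
    r = -f_x / (w @ w) * w             # projection onto the boundary
    return x + (1.0 + overshoot) * r   # small overshoot to actually cross it

# Hypothetical affine classifier and legitimate sample.
w = np.array([1.0, 2.0])
b = -1.0
x = np.array([2.0, 1.0])
x_adv = deepfool_linear_step(x, w, b)
```

For a deep network, the same step would be repeated with `w` replaced by the local gradient of the model until the predicted label changes.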

Presumably, a straightforward approach for keeping an adversarial perturbation undetectable is to reduce its magnitude and distribute it over all input features. Additionally, not every feature should be perturbed and their gradient vectors should not be shattered. Following these two assumptions, the Carlini and Wagner attack (CWA) has been introduced [4]. The general framework of their proposed algorithm is based on the following minimization problem:

$$\min_{\delta} \; \|\delta\|_2^2 + c \cdot \mathcal{L}(x + \delta) \tag{5}$$

where the constant $c$ is obtainable through a binary search. Finding the most appropriate value for this hyperparameter is very challenging since it may easily dominate the distance function and push the sample too far away from the adversarial subspace. Although Eq. 5 employs the similarity metric $l_2$ for computing the adversarial perturbation $\delta$, CWA properly generalizes to both $l_0$ and $l_\infty$. In the configuration of this adversarial attack, the loss function $\mathcal{L}$ is defined over the logits $Z$ of the trained model as shown in the following equation:

$$\mathcal{L}(x') = \max\left( \max_{i \neq l'} Z(x')_i - Z(x')_{l'},\; -\kappa \right) \tag{6}$$

where $l'$ is the target label and $\kappa$ controls the effectiveness and the adjacency of the adversarial examples to the decision boundary of the model. In this regard, higher values for this parameter in conjunction with a minimum $\epsilon$-neighbourhood result in adversarial examples with higher confidence.

For achieving the overall unrestricted adversarial perturbation ($\delta$) with a small enough magnitude, CWA solves Eq. 5 through the following optimization framework:

$$\min_{\omega} \; \left\| \tfrac{1}{2}\left(\tanh(\omega) + 1\right) - x \right\|_2^2 + c \cdot \mathcal{L}\left( \tfrac{1}{2}\left(\tanh(\omega) + 1\right) \right) \tag{7}$$

where $\omega$ is an unconstrained auxiliary variable and the unrestricted approximate perturbation is as follows:

$$\delta = \tfrac{1}{2}\left(\tanh(\omega) + 1\right) - x \tag{8}$$

This perturbation is unrestricted and it should be tuned for feature values by measuring the gradient of the loss. For feature intensities with negligible gradient values, the actual adversarial perturbation truncates to zero, while the remaining features keep their perturbation.
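The following sketch evaluates the objective that this attack minimizes, assuming the standard formulation of [4]: a margin loss over the logits with confidence parameter `kappa`, and a tanh change of variables that keeps the adversarial example inside the unit box. The names `cw_loss`, `tanh_perturbation`, `cw_objective` and the toy `logits_fn` are hypothetical, and the optimizer over `omega` and the binary search over `c` are omitted.

```python
import numpy as np

def cw_loss(logits, target, kappa=0.0):
    """Margin between the best non-target logit and the target logit."""
    other = np.max(np.delete(logits, target))
    return max(other - logits[target], -kappa)

def tanh_perturbation(omega, x):
    """Perturbation implied by the change of variables x' = (tanh(omega) + 1) / 2."""
    return 0.5 * (np.tanh(omega) + 1.0) - x

def cw_objective(omega, x, logits_fn, target, c=1.0, kappa=0.0):
    delta = tanh_perturbation(omega, x)
    # Squared l2 size of the perturbation plus the weighted margin loss.
    return np.sum(delta ** 2) + c * cw_loss(logits_fn(x + delta), target, kappa)

# Hypothetical two-class model whose logits depend on the feature sum.
def logits_fn(z):
    return np.array([z.sum(), 1.0 - z.sum()])

x = np.array([0.2, 0.7])
omega0 = np.arctanh(2.0 * x - 1.0)   # starting point with zero perturbation
value = cw_objective(omega0, x, logits_fn, target=1)
```

A gradient-based optimizer would start from `omega0` and decrease `value` until the margin loss reaches its floor of `-kappa`.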

All the abovementioned attack algorithms assume unrestricted access to the details of the attacked models, including the training dataset, hyperparameters, architecture, and, more importantly, gradient information. Attacking victim classifiers under this assumption is less challenging than in black-box attack scenarios. Usually, in the latter scheme, the adversary estimates gradients by querying the classifier or by training a surrogate model. In this paper, the chosen black-box attack relies on the natural evolution strategy (NES) [34], which has been employed for gradient approximation in [14]. This iterative algorithm is known as the partial information attack (PIA), and it encodes the $l_\infty$ similarity metric as part of its targeted optimization problem. Finding the proper adversarial perturbation bound for PIA is to some extent challenging and requires a very high number of queries to the victim model.
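The query-based gradient approximation mentioned above can be sketched as follows, assuming the usual NES estimator with Gaussian search directions and antithetic sampling. The function name `nes_gradient`, the values of `sigma` and `n_pairs`, and the linear toy loss are illustrative assumptions; the partial-information logic of PIA itself is not modelled here.

```python
import numpy as np

def nes_gradient(loss_fn, x, sigma=0.01, n_pairs=500, seed=0):
    """Estimate the gradient of loss_fn at x using only function queries."""
    rng = np.random.default_rng(seed)
    grad = np.zeros_like(x)
    for _ in range(n_pairs):
        u = rng.standard_normal(x.shape)
        # Antithetic pair: query the model at x + sigma*u and x - sigma*u.
        grad += (loss_fn(x + sigma * u) - loss_fn(x - sigma * u)) * u
    return grad / (2.0 * sigma * n_pairs)

# Hypothetical black-box loss whose true gradient is the vector a.
a = np.array([1.0, -2.0, 0.5])
estimate = nes_gradient(lambda z: a @ z, np.zeros(3), n_pairs=2000)
cosine = estimate @ a / (np.linalg.norm(estimate) * np.linalg.norm(a))
```

Each estimate costs `2 * n_pairs` queries, which illustrates why this kind of attack requires a very high number of queries to the victim model.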

Before discussing how the adversarial attacks and adversarially training have been implemented on various deep neural network architectures, we first provide a brief overview of the transformation of an audio signal into 2D representations. The next section describes spectrogram generation using the short-time Fourier transformation (STFT), the discrete wavelet transformation (DWT), and tonnetz feature extraction. We then train our classifiers using these representations and investigate how adversarially training impacts their robustness against adversarial attacks.

III Audio Transformation

Since audio and speech signals have high dimensionality in the time domain, their 2D representations with lower dimensionality have been widely used for training advanced classifiers originally developed for 2D computer vision applications [8]. In this work, we use STFT and DWT, both with and without tonnetz features, for generating 2D representations of audio signals. This section briefly reviews the transformations required by this work.

For a discrete signal $a[n]$ distributed over time, using the Hann window function $w[n]$, we can compute the complex Fourier transformation using the following equation:

$$\mathrm{STFT}\{a[n]\}(m, \omega) = \sum_{n=-\infty}^{\infty} a[n]\, w[n-m]\, e^{-j\omega n} \tag{9}$$

where $m$ is the time scale. Additionally, $\omega$ stands for the continuous frequency coefficient. This transformation is applied to shorter overlapping sub-signals with a predefined sampling rate and forms the STFT spectrogram as shown in the following equation:

$$SP_{\mathrm{STFT}}\{a[n]\}(m, \omega) = \left| \sum_{n=-\infty}^{\infty} a[n]\, w[n-m]\, e^{-j\omega n} \right|^{2}$$

There are several variants of the STFT, such as mel-scale and cepstral coefficients, which produce even lower dimensionality and have been widely used for various speech processing tasks [25, 15]. However, in this work, we use the standard STFT representation for training the front-end dense classifiers.
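As an illustration of the STFT spectrogram defined above, the following NumPy sketch windows the signal with a Hann function, takes the one-sided discrete Fourier transform of each frame, and squares its magnitude. The window and hop lengths follow the values reported in Section IV; the function name `stft_power_spectrogram`, the synthetic 1 kHz tone and the absence of padding are simplifying assumptions.

```python
import numpy as np

def stft_power_spectrogram(signal, win_length=2048, hop_length=512):
    """Squared-magnitude STFT with a Hann window (no padding, for brevity)."""
    window = np.hanning(win_length)
    n_frames = 1 + (len(signal) - win_length) // hop_length
    frames = np.stack([
        signal[i * hop_length: i * hop_length + win_length] * window
        for i in range(n_frames)
    ])
    # Rows are frequency bins and columns are time frames.
    return np.abs(np.fft.rfft(frames, axis=1)).T ** 2

# Hypothetical input: one second of a 1 kHz tone sampled at 8 kHz.
sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000.0 * t)
spec = stft_power_spectrogram(tone)
```

With these settings, frequency bin `k` corresponds to `k * sr / win_length` Hz, so the energy of the tone concentrates in bin 256.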

Generating a DWT spectrogram is very similar to the Fourier transformation, as they both employ continuous and differentiable basis functions. For the wavelet transformation, several functions have been studied and their effectiveness for audio signals has been investigated in [20, 26]. The general form of this transformation for a continuous function $a(t)$ is shown in Eq. 10.

$$\mathrm{DWT}\{a(t)\}(\tau, s) = \frac{1}{\sqrt{|s|}} \int_{-\infty}^{\infty} a(t)\, \psi\!\left( \frac{t - \tau}{s} \right) dt \tag{10}$$

where $\tau$ and $s$ refer to the time variations in the transformation and the wavelet scale, respectively. Moreover, $\psi$ stands for the basis mother function. Common choices for this function are the Haar, Mexican hat, and complex Morlet wavelets. The latter has been extensively used in signal processing, mainly because of its nonlinear characteristics [8] (see Eq. 11).


The complex Morlet is continuous in its conjugate manifold. The convolution of this function with overlapping chunks of the given audio signal results in its spectral visualization as described in Eq. 12.


where and are integer numbers associated with scales of .

The two aforementioned transformations represent spatiotemporal modulation features of a signal in the frequency domain, regardless of its harmonic characteristics. It has been demonstrated that using the harmonic change detection function (HCDF) provides distinctive features for the audio signal [12]. This function provides a chromagram from the constant-Q transformation (CQT), whose features are also known as tonnetz features. According to [12], there are four major steps in an HCDF system. Firstly, the audio signal is converted into logarithmic spectrum vectors using the CQT. Then, pitch-class vectors are extracted from the tonal transformation based on the quantized chromagram. In the third step, 6-dimensional centroid vectors form a tensor from the tonal transformation. Finally, a smoothing operation postprocesses this tensor for distance calculation.

We use the HCDF system for generating chromagrams from audio signals in order to enhance the recognition performance of the classifiers. In the next section, we provide details of this process for two benchmarking environmental sound datasets.

Fig. 29: Crafted adversarial examples for the ResNet-56 using the six optimization-based attack algorithms. The rows of the figure correspond, from top to bottom, to the DWT, DWT Tonnetz, STFT, and STFT Tonnetz representations. The first column of the figure denotes the original representations for the randomly selected sample from the class of 'children playing' in the UrbanSound8K dataset. The other columns are associated with the attack algorithms, namely BIM-a, BIM-b, JSMA, DeepFool, CWA, and PIA, respectively. Adversarial perturbation values are written at the bottom of each adversarial spectrogram.

IV Spectrogram Production

We produce the STFT representation following the instructions provided by the open-source Python library Librosa [19]. We set the window size and the hop length (see Eq. 9) to 2048 and 512, respectively. Additionally, we initialize the number of filters to 2048, which is the standard value for the environmental sound task [8]. Audio chunks associated with each window are padded in order to reduce the potential negative effect of losing temporal dependencies. Furthermore, the frames are overlapped using a ratio of 50%.

For generating DWT spectrograms, we use our modified version of the wavelet sound explorer [11] with the complex Morlet mother function. As proposed by [2], we set the DWT sampling frequency to 16 kHz for ESC-50 and 8 kHz for UrbanSound8K with a uniform 50% overlapping ratio. For enhancement purposes, we use the logarithmic visualization of the generated spectrograms to better characterize high-frequency areas.

For the tonnetz chromagram, we use the default settings provided by Librosa with a sampling rate of 22.05 kHz. We resize the resulting chromagrams so that they comply with the aforementioned representations. Inspired by [31], we append these features to the STFT and DWT spectrograms and organize them into two additional representations. In the next section, we provide more details about the training of the front-end classifiers using these four spectrogram sets.
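The appending step can be sketched as below, under the assumptions that the chromagram is resized along the time axis by nearest-neighbour resampling and that it is stacked along the frequency axis of the spectrogram; neither choice is specified in the text. The function name `append_tonnetz` and the array sizes are hypothetical.

```python
import numpy as np

def append_tonnetz(spectrogram, tonnetz):
    """Stack a (6, T') tonnetz matrix under an (F, T) spectrogram."""
    n_frames = spectrogram.shape[1]
    # Nearest-neighbour resampling along time so both arrays have T columns.
    idx = np.linspace(0, tonnetz.shape[1] - 1, n_frames).round().astype(int)
    return np.concatenate([spectrogram, tonnetz[:, idx]], axis=0)

# Hypothetical inputs standing in for a spectrogram and its tonnetz features.
spec = np.random.default_rng(0).random((128, 100))
ton = np.random.default_rng(1).random((6, 44))
combined = append_tonnetz(spec, ton)
```

The six tonnetz rows correspond to the 6-dimensional tonal centroid vectors described in Section III.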

V Classification Models

Since an adversary runs the adversarial attack against the classifier, the choice of the victim network architecture affects the fooling rate of the model. This issue has been studied in [6] for the advanced GoogLeNet [32] and AlexNet [16] architectures trained on DWT (with linear, logarithmic, and logarithmic real visualizations), STFT, and their pooled spectrograms. Since our main objective is to investigate the impact of adversarially training on advanced deep learning classifiers, we additionally include the ResNet-X architectures with $X \in \{18, 34, 56\}$ [13] and the VGG-16 [30] architecture.

The pretrained models of these six classifiers have been used, and their input and output layers have been fine-tuned as described in [6]. All experiments are run on two NVIDIA GTX Ti-series GPUs together with an Intel Core-i CPU. We carry out our experiments using a five-fold cross-validation setup for all the spectrogram sets. As is common practice in model performance analysis, we reserve 70% of the samples for training and development, with early stopping. We report the recognition accuracy of these models on the remaining 30% of the samples.
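The data partitioning described above can be sketched as follows, assuming a single random 70/30 split followed by five folds over the development portion. The function name `split_indices`, the seed and the sample count are hypothetical; class stratification and the early stopping logic are left out.

```python
import numpy as np

def split_indices(n_samples, train_ratio=0.7, n_folds=5, seed=0):
    """Return (train, validation) index pairs per fold and held-out test indices."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_dev = int(round(train_ratio * n_samples))
    dev, test = order[:n_dev], order[n_dev:]
    folds = np.array_split(dev, n_folds)
    cv = [
        (np.concatenate(folds[:k] + folds[k + 1:]), folds[k])
        for k in range(n_folds)
    ]
    return cv, test

cv_folds, test_idx = split_indices(2000)
```

Each of the five pairs in `cv_folds` trains on four folds and validates on the remaining one, while `test_idx` is used only for the final accuracy report.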

In the next section, we provide the detailed setup for the adversarial algorithms mentioned in section II. We additionally discuss budget allocations required by the adversary for successfully attacking the six finely trained victim models.

VI Adversarial Attack Setup

For effectively attacking the classifiers, the adversary should tune the hyperparameters required by the attack algorithms, such as the number of iterations, the perturbation limit, and the number of line searches within the manifold, all of which we refer to as the budget allocations. For finding the optimal required budgets, we bind the fooling rates of the attack algorithms to a predefined threshold associated with the area under the curve (AUC) of the attack success. In other words, we allocate as much budget as needed for reaching this threshold for all attacks against the victim models. This is a critical threshold for demonstrating the extreme vulnerability of neural networks against adversarial attacks.

In accordance with the above note, we use Foolbox [28], a freely available Python package that supports uniform and reproducible implementations of the attack algorithms. For the BIM-a and BIM-b algorithms, we define the with the confidence of (). In the JSMA framework, we set the number of iterations to a maximum of 1000 and the scaling factor within (with equivalent displacement of 50). The number of iterations in the DeepFool attack is initialized to 100, with a supremum of 600 and a static step of 100. For the costly CWA attack, we set the search step within the number of iteration associated with every . Except for DeepFool, which is a non-targeted attack, we randomly select targeted wrong labels for the rest of the algorithms.

There are four hyperparameters required for the black-box PIA algorithm. We empirically limit the perturbation bound to followed by an iterative line search to find the most approximately optimal variance in the NES gradient estimation. We initialize the number of iteration to 500 with decay rate of and the learning rate .

In the framework in which we attack the front-end audio classifiers, we run the algorithms on shuffled batches (from batches of 500 samples up to 50 batches of 100 samples) randomly selected from the clean spectrograms at every step, until the entire datasets are spanned. These attacks are performed with the abovementioned allocated budgets once before and once after adversarially training in order to measure the robustness of the models. Section VII provides details on how adversarially training has been implemented.

VII Adversarially Training

The idea of adversarially training was first proposed in [9], where the authors showed that augmenting the training dataset with one-shot FGSM adversarial examples improves the robustness of the victim models. As is commonly known, the main advantage of this simple approach is that it neither shatters nor obfuscates gradient information, while it runs a fast non-iterative procedure. This has made adversarially training a relatively reliable defense approach. However, it may not confidently defend against stronger white-box adversarial algorithms [33].

Many adversarial defense approaches have been introduced in recent years which have been reported to outperform FGSM-based adversarially training [24, 3, 10]. However, some studies have reported that these advanced defense approaches shatter gradient vectors and might easily break against strong adversarial attacks which do not incorporate the exact gradient information, such as the backward pass differentiable approximation [1].

Augmenting the clean training dataset with adversarial examples in the adversarially training framework is shown in Eq. 13 [9].

$$\tilde{J}(\theta, x, l) = \alpha\, J(\theta, x, l) + (1 - \alpha)\, J(\theta, x', l) \tag{13}$$

where $\alpha$ is a subjective weight scalar definable by the adversary. Additionally, $J$ and $\theta$ denote the loss function and the derived weight vector of the victim model, respectively. Moreover, $x$ and $x'$ refer to the legitimate and adversarial examples associated with the genuine label $l$. Adversarially training using a costly attack algorithm is very time-consuming and memory-prohibitive in practice. Therefore, we use the FGSM for augmenting the original spectrogram datasets with adversarial examples, under a fixed setting of $\alpha$.
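A minimal sketch of this training objective is given below for a logistic-regression victim, assuming the FGSM formulation of [9] in which the adversarial example is the clean input shifted by `eps` along the sign of the input gradient. The function names, the toy data, and the values `eps=0.1` and `alpha=0.5` are illustrative assumptions and not the settings of this work.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(theta, x, y):
    """Mean binary cross-entropy of a logistic-regression model."""
    p = np.clip(sigmoid(x @ theta), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def fgsm(theta, x, y, eps):
    # One-shot step along the sign of the per-sample input gradient of the loss.
    grad_x = (sigmoid(x @ theta) - y)[:, None] * theta[None, :]
    return x + eps * np.sign(grad_x)

def adversarial_objective(theta, x, y, eps=0.1, alpha=0.5):
    """Weighted sum of the clean loss and the loss on FGSM examples."""
    x_adv = fgsm(theta, x, y, eps)
    return alpha * bce(theta, x, y) + (1 - alpha) * bce(theta, x_adv, y)

# Hypothetical data: 8 samples with 2 features and separable labels.
rng = np.random.default_rng(0)
theta = np.array([1.0, -1.0])
x = rng.standard_normal((8, 2))
y = (x @ theta > 0).astype(float)
mixed_loss = adversarial_objective(theta, x, y)
```

Minimizing `mixed_loss` with respect to `theta`, while regenerating the adversarial examples at every step, corresponds to training on the augmented dataset.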

In the next section, we report our achieved results for the dense neural network models about the adversarial attacks and adversarially training on four different representations, namely STFT, DWT, STFT appended with tonnetz features, and DWT appended with tonnetz chromagrams.

Dataset | Representation | GoogLeNet | AlexNet | ResNet-18 | ResNet-34 | ResNet-56 | VGG-16
ESC-50 | STFT | | | | | 69.77 |
ESC-50 | DWT | | | | | 71.56 |
ESC-50 | STFT Tonnetz | | | | | 70.22 |
ESC-50 | DWT Tonnetz | | | | | 71.79 |
UrbanSound8K | STFT | | | | | 88.77 |
UrbanSound8K | DWT | | | | | 90.14 |
UrbanSound8K | STFT Tonnetz | | | | | | 89.42
UrbanSound8K | DWT Tonnetz | | | | | 91.36 |
TABLE I: Recognition performance (%) of the audio classifiers trained on the original spectrogram datasets (without adversarial example augmentation). Values inside the parentheses indicate the drop in recognition percentage after adversarially training the models at the predefined fooling rate, under a fixed upper bound on the perturbation. Outperforming accuracies are shown in boldface.
Dataset Representations GoogLeNet AlexNet ResNet-18 ResNet-34 ResNet-56 VGG-16
ESC-50 STFT 50.97
DWT 51.03
STFT Tonnetz 50.46
DWT Tonnetz 49.33
UrbanSound8K STFT 53.24
DWT 51.92
STFT Tonnetz 50.71
DWT Tonnetz 52.23
TABLE II: Robustness comparison (average fooling rate, %) of the adversarially trained models attacked under the perturbation constraint. Victim models with lower fooling rates are indicated in bold.
Dataset Representations GoogLeNet AlexNet ResNet-18 ResNet-34 ResNet-56 VGG-16
ESC-50 STFT 2.312
DWT 2.307
STFT Tonnetz 2.161
DWT Tonnetz 2.609
UrbanSound8K STFT 2.439
DWT 2.892
STFT Tonnetz 2.308
DWT Tonnetz 2.501
TABLE III: Comparison of the average perturbation ratio (Eq. 14) for attacking the original and adversarially trained models under the same fooling-rate constraint. Higher ratios associated with each representation are shown in bold.

VIII Experimental Results

We conduct our experiments on two environmental sound datasets: UrbanSound8K [29] and ESC-50 [27]. The first dataset contains 8732 short recordings arranged in 10 classes (car horn, dog bark, drilling, jackhammer, street music, siren, children playing, air conditioner, engine idling, and gun shot) with audio lengths of up to four seconds. The ESC-50 dataset contains 2K audio signals with an equal length of five seconds, organized in 50 classes.

For enhancing both the quality and the quantity of these datasets, especially ESC-50, we augment the samples using the pitch-shifting operation in the temporal domain as proposed in [8], following their proposed 1D filtration setup and its pitch-shifting scales. This increases the size of the datasets by a factor of 4.

Following the explanations provided in Section IV about the spectrogram production, the resulting spectrograms have identical dimensions for both the STFT and DWT (logarithmic scale) representations on the two datasets. Moreover, the resulting chromagrams are appended to the aforementioned representations. Table I summarizes the recognition accuracies of the classifiers trained on these spectrograms. Additionally, this table shows the effect of adversarially training on the recognition performance of these models.

The classifiers in Table I have been selected for evaluation on the test sets after running the five-fold cross-validation scenario on the randomized development portion of the training datasets. According to this table, the different deep neural network architectures show competitive performance. However, in the majority of the cases, the ResNet-56 outperforms the other classifiers, averaged over 10 repeated experiments on the spectrograms. The highest recognition accuracy has been achieved by the ResNet-56 architecture trained on the appended representation of DWT and tonnetz chromagrams, for both the UrbanSound8K and ESC-50 datasets. The number of parameters in the ResNet-56 is 11.3% and 14.26% higher than in its rival models VGG-16 and ResNet-34, respectively.

Fig. 29 visually compares the adversarial examples crafted against the outperforming classifier, the ResNet-56, using the six adversarial attacks on a randomly selected audio sample represented with the four spectrogram approaches described earlier. Although the generated spectrograms are visually very similar to their legitimate counterparts, they all make the classifier predict wrong labels.

Table I also shows the drop ratio of the recognition accuracies after adversarially training the models following the procedure explained in Section VII. The reported maximum adversarial perturbation required for reaching the predefined fooling rate is averaged over all the attacks. In attacking the adversarially trained models, the procedures outlined in Section VI have been implemented individually for every audio classifier. According to the obtained results, adversarially training considerably reduces the performance of all models. For ESC-50, the neural networks trained on the appended representation of STFT and tonnetz features (STFT Tonnetz) have experienced the most negative impact compared to the other representations. The average drop ratio for adversarially trained models on the DWT Tonnetz representations is slightly higher than for their STFT Tonnetz counterparts on the UrbanSound8K dataset. However, for both datasets, these ratios for models trained on the DWT spectrograms are considerably higher than for those trained on the STFT representations.

We measure the fooling rate of the adversarially trained models after attacking them using the same six adversarial algorithms, following the procedure explained in Section VI, with the same upper bound imposed on the adversarial perturbation. This experiment uncovers the impact of adversarially training on the robustness of the audio classifiers (see Table II). We applied the aforementioned condition to make this table comparable with Table I. According to the results reported in Table II, adversarially training has improved the robustness of all the classifiers, particularly AlexNet.

For investigating the overall impact of adversarially training on the robustness of audio classifiers, we attack the adversarially trained models using the same six attack algorithms without this condition on the perturbation. Unfortunately, we could still reach the predefined fooling rate for all the classifiers following the attack procedure explained in Section VI. However, attacking the adversarially trained models requires larger values of the adversarial perturbation compared to attacking the original models and, consequently, increases the number of callbacks to the original spectrogram with extra batch gradient computations. This might degrade the quality of the generated spectrograms. In order to analytically compare the maximum adversarial perturbation required for the original and the adversarially trained models, we compute the average perturbation ratio as shown in Eq. 14:

$$\rho = \frac{\bar{\epsilon}_{\mathrm{adv}}}{\bar{\epsilon}_{\mathrm{org}}} \tag{14}$$

where $\bar{\epsilon}_{\mathrm{adv}}$ and $\bar{\epsilon}_{\mathrm{org}}$ denote the average adversarial perturbation required for successfully attacking the adversarially trained and the original models (both at the same fooling rate), respectively. Table III summarizes the values of $\rho$ for the victim models trained on different representations.
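The average perturbation ratio described above amounts to a one-line computation, sketched here under the assumption that the per-attack perturbation magnitudes are available as two lists. The function name `perturbation_ratio` and the numeric values are hypothetical and are not results from this work.

```python
import numpy as np

def perturbation_ratio(eps_adv_trained, eps_original):
    """Ratio of the mean perturbation needed after and before adversarially training."""
    return float(np.mean(eps_adv_trained) / np.mean(eps_original))

# Hypothetical per-attack perturbations for one model and one representation.
ratio = perturbation_ratio([0.4, 0.6], [0.2, 0.3])
```

A ratio of 2.0, as in this toy case, would mean that fooling the adversarially trained model requires on average twice the perturbation needed for the original model.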

Note that a ratio greater than one indicates the positive impact of adversarially training on the robustness of the audio classifiers, since it increases the computational cost of the attack by expanding the magnitude of the required adversarial perturbation. With respect to this ratio, measured for all the front-end classifiers, the ResNet-56 architecture showed on average better robustness against adversarial attacks in 50% of the experiments. In other words, attacking this model adds additional cost for the adversary in crafting adversarial examples at the target fooling rate.

IX Conclusion

In this paper, we presented the impact of adversarially training as a gradient-obfuscation-free defense approach against adversarial attacks. We trained six advanced deep learning classifiers on four different 2D representations of environmental audio signals and ran five white-box attack algorithms and one black-box attack algorithm against these victim models. We demonstrated that adversarially training considerably reduces the recognition accuracy of the classifier but improves its robustness against six types of targeted and non-targeted adversarial examples when the maximum adversarial perturbation is constrained. In other words, adversarially training is not a remedy for the threat of adversarial attacks; however, it escalates the cost of the attack for the adversary by demanding larger adversarial perturbations compared to non-adversarially trained models.


This work was funded by the Natural Sciences and Engineering Research Council of Canada (NSERC) under Grant RGPIN 2016-04855 and Grant RGPIN 2016-06628.


  1. A. Athalye, N. Carlini and D. Wagner (2018) Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420.
  2. V. Boddapati, A. Petef, J. Rasmusson and L. Lundberg (2017) Classifying environmental sounds using image recognition networks. Procedia Computer Science 112, pp. 2048–2056.
  3. J. Buckman, A. Roy, C. Raffel and I. Goodfellow (2018) Thermometer encoding: one hot way to resist adversarial examples. In International Conference on Learning Representations.
  4. N. Carlini and D. Wagner (2017) Towards evaluating the robustness of neural networks. In IEEE Symp Secur Priv, pp. 39–57.
  5. N. Carlini and D. Wagner (2018) Audio adversarial examples: targeted attacks on speech-to-text. arXiv preprint arXiv:1801.01944.
  6. M. Esmaeilpour, P. Cardinal and A. L. Koerich (2020) A robust approach for securing audio classification against adversarial attacks. IEEE Transactions on Information Forensics and Security 15, pp. 2147–2159.
  7. M. Esmaeilpour, P. Cardinal and A. L. Koerich (2020) Detection of adversarial attacks and characterization of adversarial subspace. In IEEE Intl Conf on Acoustics, Speech and Signal Processing (ICASSP), pp. 3097–3101.
  8. M. Esmaeilpour, P. Cardinal and A. L. Koerich (2020) Unsupervised feature learning for environmental sound classification using weighted cycle-consistent generative adversarial network. Applied Soft Computing 86, pp. 105912.
  9. I. Goodfellow, J. Shlens and C. Szegedy (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.
  10. C. Guo, M. Rana, M. Cisse and L. Van Der Maaten (2017) Countering adversarial images using input transformations. arXiv preprint arXiv:1711.00117.
  11. S. Hanov (2008) Wavelet sound explorer software.
  12. C. Harte, M. Sandler and M. Gasser (2006) Detecting harmonic change in musical audio. In Proceedings of the 1st ACM workshop on Audio and music computing multimedia, pp. 21–26.
  13. K. He, X. Zhang, S. Ren and J. Sun (2016) Deep residual learning for image recognition. In Proc. IEEE conference on computer vision and pattern recognition, pp. 770–778.
  14. A. Ilyas, L. Engstrom, A. Athalye and J. Lin (2018) Black-box adversarial attacks with limited queries and information. arXiv preprint arXiv:1804.08598.
  15. L. Juvela, B. Bollepalli, X. Wang, H. Kameoka, M. Airaksinen, J. Yamagishi and P. Alku (2018) Speech waveform synthesis from mfcc sequences with generative adversarial networks. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5679–5683.
  16. A. Krizhevsky, I. Sutskever and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105.
  17. A. Kurakin, I. Goodfellow and S. Bengio (2016) Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533.
  18. X. Ma, B. Li, Y. Wang, S. M. Erfani, S. Wijewickrema, M. E. Houle, G. Schoenebeck, D. Song and J. Bailey (2018) Characterizing adversarial subspaces using local intrinsic dimensionality. arXiv preprint arXiv:1801.02613.
  19. B. McFee, C. Raffel, D. Liang, D. P. Ellis, M. McVicar, E. Battenberg and O. Nieto (2015) Librosa: audio and music signal analysis in python. In 14th Python in Science Conf, Vol. 8.
  20. V. Mitra and C. Wang (2008) Content based audio classification: a neural network approach. Soft Computing 12 (7), pp. 639–646.
  21. S. Moosavi-Dezfooli, A. Fawzi and P. Frossard (2016) Deepfool: a simple and accurate method to fool deep neural networks. In IEEE Conf Comp Vis Patt Recog, pp. 2574–2582.
  22. N. Papernot, P. McDaniel and I. Goodfellow (2016) Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. arXiv preprint arXiv:1605.07277.
  23. N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik and A. Swami (2016) The limitations of deep learning in adversarial settings. In 2016 IEEE European Symposium on Security and Privacy (EuroS&P), pp. 372–387.
  24. N. Papernot, P. McDaniel, X. Wu, S. Jha and A. Swami (2016) Distillation as a defense to adversarial perturbations against deep neural networks. In IEEE Symposium on Security and Privacy, pp. 582–597.
  25. I. Patel and Y. S. Rao (2010) Speech recognition using hidden markov model with mfcc-subband technique. In 2010 International Conference on Recent Trends in Information, Telecommunication and Computing, pp. 168–172.
  26. S. Patidar and R. B. Pachori (2014) Classification of cardiac sound signals using constrained tunable-q wavelet transform. Expert Systems with Applications 41 (16), pp. 7161–7170.
  27. K. J. Piczak (2015) ESC: dataset for environmental sound classification. In Proc. 23rd ACM international conference on Multimedia, pp. 1015–1018.
  28. J. Rauber, W. Brendel and M. Bethge (2017) Foolbox: a python toolbox to benchmark the robustness of machine learning models. arXiv preprint arXiv:1707.04131.
  29. J. Salamon, C. Jacoby and J. P. Bello (Nov. 2014) A dataset and taxonomy for urban sound research. In 22nd ACM Intl Conf on Multimedia, Orlando, FL, USA.
  30. K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  31. Y. Su, K. Zhang, J. Wang and K. Madani (2019) Environment sound classification using a two-stream cnn based on decision-level fusion. Sensors 19 (7), pp. 1733.
  32. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke and A. Rabinovich (2015) Going deeper with convolutions. In IEEE Conf Comp Vis Patt Recog, pp. 1–9.
  33. F. Tramèr, A. Kurakin, N. Papernot, I. Goodfellow, D. Boneh and P. McDaniel (2017) Ensemble adversarial training: attacks and defenses. arXiv preprint arXiv:1705.07204.
  34. D. Wierstra, T. Schaul, J. Peters and J. Schmidhuber (2008) Natural evolution strategies. In 2008 IEEE Congress on Evolutionary Computation (IEEE World Congress on Computational Intelligence), pp. 3381–3387.