Deep Learning and Music Adversaries
An adversary is essentially an algorithm intent on making a classification system perform in some particular way given an input, e.g., increase the probability of a false negative. Recent work builds adversaries for deep learning systems applied to image object recognition, which exploits the parameters of the system to find the minimal perturbation of the input image such that the network misclassifies it with high confidence. We adapt this approach to construct and deploy an adversary of deep learning systems applied to music content analysis. In our case, however, the input to the systems is magnitude spectral frames, which requires special care in order to produce valid input audio signals from network-derived perturbations. For two different train-test partitionings of two benchmark datasets, and two different deep architectures, we find that this adversary is very effective in defeating the resulting systems. We find the convolutional networks are more robust, however, compared with systems based on a majority vote over individually classified audio frames. Furthermore, we integrate the adversary into the training of new deep systems, but do not find that this improves their resilience against the same adversary.
Deep learning is impacting the research domain of music content analysis and music information retrieval (MIR) [34, 28, 63, 31, 57, 41, 44, 19, 65], but recent developments raise the spectre that the high performance of these systems does not reflect how well they have learned to solve high-level problems of music listening. MIR aims to produce systems that help make “music, or information about music, easier to find”. This is of principal importance for confronting the vast amount of music data that exists and continues to be created. Listening machines that can flexibly produce accurate, meaningful and searchable descriptions of music can greatly reduce the cost of processing music data, and can facilitate a diversity of applications. These extend from music identification, author attribution, recommendation, transcription, and playlist generation, to extracting semantic descriptors such as genre and mood [9, 64, 49], to computational musicology, and even synthesis and music composition.
Recent surveys of the domain of deep learning record impressive results for several benchmark problems [17, 6]. In addition to these major successes, deep learning methods are very attractive for three other reasons: there now exist efficient and effective training algorithms for deep learning, not to mention completely free and open cross-platform implementations, e.g., Theano [4, 8]; they entail jointly optimising feature learning and classification, thus allowing one to forgo many difficulties inherent to formally encoding expert knowledge into a machine; and their layered structure seems to favour hierarchical representations of structures in data. One caveat, however, is that these methods require a lot of data in order to estimate parameters and generalise well.
In MIR, the works in [34, 28, 35] are among the first to apply deep learning to music content analysis, and each describes results pointing to the conclusion that these systems can automatically learn features relevant for complex music listening tasks, e.g., recognition of genre or style. Results since then point to the same conclusion [63, 41, 44, 19, 65]. Humphrey et al.  highlight this fact to argue deep learning is naturally suited to learn relevant abstractions for music content analysis, provided enough data is available. Since music can be seen as a “whole greater than the sum of its parts” , deep learning can help MIR narrow the “semantic gap” , and move beyond what has been called a “glass ceiling” in performance .
However, it is now known how deceiving the appearance of high performance can be: an MIR system can appear to be very successful in solving a high-level music listening problem when in fact it is just exploiting some independent variables of questionable relevance unknowingly confounded with the ground truth of a music dataset by a poor experimental design [39, 22, 23, 47, 56, 49, 48, 49, 52, 51, 53]. In addition, recent work in machine learning has demonstrated deep learning systems behaving in ways that contradict their appearance of solving content-recognition problems. Nguyen et al.  show how a high-performing image object recognition system can label with high confidence non-sensical synthetic images. In a similar direction, we have shown  how a deep system that appears highly capable of recognising different musical rhythms confidently classifies synthesised rhythms, though they bear little similarity to the rhythms they supposedly represent. Szegedy et al.  show how deep high-performing image object recognition systems are highly sensitive to imperceptible perturbations created by an adversary: an agent that actively seeks to fool a classifier by perturbing the input such that it results in an incorrect output but with high confidence .
All of these results motivate several timely questions of deep learning systems for music content analysis specifically, and multimedia in general. First, how do the adversaries of Szegedy et al.  translate to the context of deep learning applied to music content analysis? The input of the systems studied by Szegedy et al.  is raw pixel data; however, in music content analysis only the system studied in  takes as input raw audio samples. The inputs to other deep learning systems have been features: windowed magnitude spectra [28, 44], sonograms [34, 57], autocorrelations of spectral energies [41, 51], or statistics of features [63, 65]. Second, can we generate an adversary for such deep learning music content analysis systems that produce adversarial examples that are perceptually identical to the originals? Third, can we “harness” an adversary to train deep learning systems that are robust to its “malfeasance”? Finally, and more broadly, what is deep learning contributing to music content analysis? Can we use adversaries to reveal whether these deep systems are using better models of the content than other state of the art systems using hand-crafted features?
Our preliminary work  shows that it is possible to create highly effective adversaries of the music content analysis deep neural networks (DNN) studied in [28, 44]. These adversaries can make the systems always wrong, always right, and anywhere in-between, with high confidence by applying only minor perturbations of the input magnitude spectra. Furthermore, we created an ensemble of adversaries that can coax the DNN into assigning with high confidence any label to the same music by perturbing the input by very small amounts (e.g., dB SNR). In this article, we greatly expand upon our prior work  to include convolutional deep learning systems, more extensive testing in a larger benchmark MIR dataset, and the results of incorporating an adversary into the training of these different deep learning systems.
In the next section, we provide an overview of work applying deep learning to music content analysis and MIR. We then review two different deep learning architectures, and our construction of several music content analysis systems using two partitions of two MIR benchmark datasets. In Section III we review adversaries, and design an adversary for our deep systems. We then present in Section IV a series of experiments using our adversary. In Section V we provide a discussion of our work in wider contexts. We conclude in Section VI. Some of our results can be reproduced with the software here: https://github.com/coreyker/dnn-mgr.
II Deep Learning for Music Content Analysis
We first provide an overview of research in applying deep learning approaches to music content analysis. We then discuss two different architectures, train two music content analysis systems, and test them in two benchmark MIR datasets. These systems are the subjects of our experiments in Section IV.
Artificial neural networks have been applied to many music content analysis problems, for instance, fingerprinting, genre recognition, emotion recognition, artist recognition, and even composition. Advances in training have enabled the creation of more advanced and deeper architectures. Deng and Yu (Chapter 7) provide a review of successful applications of deep learning to the analysis of audio, highlighting in particular its significant contributions to speech recognition in conversational settings. Humphrey et al. provide a review for applications to music in particular, and motivate the capacity of deep architectures to automatically learn hierarchical relationships in accordance with the hierarchical nature of music: “pitch and loudness combine over time to form chords, melodies and rhythms.” They argue that this is key for moving beyond the reliance on “shallow” and hand-designed features that were designed for different tasks.
Lee et al.  are perhaps the first to apply deep learning to music content analysis, specifically genre and artist recognition. They train a convolutional deep belief network (CDBN) with two hidden layers in an unsupervised manner in an attempt to make the hidden layer activations produce meaningful features from a pre-processed spectrogram input computed using 20 ms 50%-overlapped windows. The spectrogram is “PCA-whitened”, which involves projecting it onto a lower-dimensional space using scaled eigenvectors. Important details are missing in the description of the work, but it appears they use the activations as features in some train/test task using a standard machine learning approach. A table of their experimental results, using some portion of the dataset ISMIR2004, shows higher accuracies for their deep learned features compared to those for standard MFCCs. For genre recognition, Li et al.  use convolutional deep neural networks (CDNN) with three hidden layers, into which they input a sequence of 190 13-dimensional MFCC feature vectors. The architecture of their CDNN is such that the first hidden layer considers data from 127 ms duration, and the last hidden layer is capable of summarising events over a 2.2 s duration. van den Oord et al.  apply CDNN to mel-frequency spectrograms for automatic music content analysis.
For genre recognition and more general descriptors, Hamel and Eck  train a DNN with three hidden layers of 50 units each, taking as input 513 discrete Fourier transform (DFT) magnitudes computed from a single 46 ms audio frame. They use a train/valid/test partition of the benchmark music genre dataset GTZAN [55, 49]. They also explore “aggregated” features, which are the mean and variance in each dimension of activations over 5 second durations. They find in the test set, and for both short-term and aggregated features, that SVM classifiers trained with features built from hidden layer activations reproduce more ground truth than an SVM classifier trained with features built from MFCCs. They report an accuracy of over 0.84 for features that aggregate activations of all three hidden layers. Sigtia and Dixon  explore modifications to the system in , in particular using different combinations of architectures, training procedures, and regularisation. They use the activations of their trained DNN as features for a train/test task using a random forest classifier. They report an accuracy of about 0.83 using features aggregating activations of all hidden layers of 500 units each. For genre recognition, Yang et al.  combine 263-dimensional modulation features with a DBN. For music rhythm classification, Pikrakis  employs a DBN, which we studied further in [53, 52, 51].
Dieleman et al.  build and apply CDBN to music key detection, artist recognition, and genre recognition. There are three major differences with respect to the work above [34, 63, 28, 35, 44]. First, Dieleman et al. employ 24-dimensional input features computed by averaging short-time chroma and timbre features over the time scales of single musical beats. Second, they employ expert musical knowledge to guide decisions about the architecture of the system. Finally, they use the output posteriors of their system for classification, instead of using the hidden layer activations as features for a separate classifier. Their experiments in a portion of the “million song dataset”  show large differences in classification accuracies between their systems and a naive Bayesian classifier using the same input features. In a unique direction for audio, Dieleman and Schrauwen  explore “end-to-end” learning, where a CDNN is trained with input of about 3 s of raw audio samples for a music content analysis task (autotagging). They find that the lowest layer of the trained CDNN appears to learn some filters that are frequency selective. They evaluate this system for a multilabel problem.
To recognise music mood, Weninger et al.  use recurrent DNN with input constructed of several statistics of low-level features computed over second-long excerpts of music recordings. Battenberg and Wessel  apply DBN for identifying the beat numbers over several measures of percussive music, with input features consisting of quantised onset times and magnitudes. Boulanger-Lewandowski et al.  train a recurrent neural network to produce chord classifications using input of PCA-whitened magnitude DFT. In a similar direction, Humphrey and Bello  build a DNN that maps input spectrogram features to guitar-specific fingerings of chords.
II-B Two types of deep architectures
We now review two different architectures of deep learning systems, and the way they are trained. A DNN is an artificial neural network with several hidden layers . The output of each layer is a non-linear function of its inputs, obtained by a matrix multiplication cascaded with a non-linearity, e.g., tanh, sigmoid and rectifier. By chaining together several hidden layers, composite representations of the input emerge in deeper layers. This fact can give deep networks greater representational power than shallower networks containing an equivalent number of parameters .
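As a minimal sketch of the layer computation described above, the following NumPy code chains three hidden layers of rectified linear units with a softmax output. The layer sizes (513 inputs, 50 units per hidden layer, 10 classes), the random weights, and the function names are assumptions chosen for illustration; they are not the trained systems of this article.

```python
import numpy as np

def relu(z):
    """Rectifier non-linearity."""
    return np.maximum(z, 0.0)

def softmax(z):
    """Row-wise softmax, shifted for numerical stability."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def dnn_forward(x, weights, biases):
    """Return class posteriors for a batch of inputs x (frames x dims)."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ W + b)          # matrix multiplication, then non-linearity
    return softmax(h @ weights[-1] + biases[-1])

# Hypothetical sizes: 513 magnitudes in, three hidden layers of 50, 10 labels out.
rng = np.random.default_rng(0)
sizes = [513, 50, 50, 50, 10]
weights = [rng.normal(0.0, 0.1, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]
posteriors = dnn_forward(rng.random((4, 513)), weights, biases)
```

Each row of `posteriors` is a distribution over the ten labels for one input frame.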
A CDNN is a special type of DNN with weights that are shared between multiple points between adjacent layers. The weight sharing in CDNNs not only reduces the number of trainable parameters, but also causes matrix multiplications to reduce to convolutions, which can be implemented efficiently. Furthermore, many natural signals have local spatial or temporal structures that are repeated globally. For example, natural images often consist of oriented edges; and audio signals often consist of harmonic and repetitive structures. CDNNs can learn these types of structures very well. Figure 1 illustrates our CDNN, which we discuss in the following subsection.
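The link between weight sharing and convolution can be checked numerically. The sketch below, with hypothetical sizes (a 16-sample input and a 4-tap kernel), builds the banded weight matrix in which every row reuses the same kernel, and confirms that multiplying by it gives the same result as sliding the kernel along the input.

```python
import numpy as np

def shared_weight_matrix(kernel, n_in):
    """Banded matrix whose rows all reuse the same kernel weights."""
    k = len(kernel)
    n_out = n_in - k + 1
    W = np.zeros((n_out, n_in))
    for i in range(n_out):
        W[i, i:i + k] = kernel
    return W

rng = np.random.default_rng(0)
x = rng.random(16)
kernel = rng.random(4)

W = shared_weight_matrix(kernel, len(x))
dense_out = W @ x
# 'valid' correlation slides the (unflipped) kernel along x.
conv_out = np.correlate(x, kernel, mode="valid")

n_entries_dense = W.size        # entries a fully connected layer would train
n_entries_shared = kernel.size  # entries actually trained with weight sharing
```

In this toy case the fully connected layer would have 208 weights, whereas the shared-weight layer has only 4.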
The contemporary success of deep learning comes with computationally efficient training methods. Systems that have such deep architectures are usually trained using gradient descent, which consists of backpropagating error derivatives from the cost function through the network. There are a plethora of useful tips and tricks to augment training, including stochastic gradient descent, dropout regularisation, weight decay, momentum, learning rate decay, and so on .
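To make a few of these training ingredients concrete, here is a small sketch of a gradient descent update with momentum, weight decay, and learning rate decay. The toy quadratic objective and all hyperparameter values are assumptions for illustration only, and the gradient is exact rather than stochastic; none of this is the training configuration used in this article.

```python
import numpy as np

def momentum_step(w, v, grad, lr, momentum=0.9, weight_decay=0.0):
    """One update: the velocity accumulates past gradients, weight decay shrinks w."""
    v = momentum * v - lr * (grad + weight_decay * w)
    return w + v, v

# Toy objective 0.5 * ||w - target||^2, whose gradient is (w - target).
target = np.array([1.0, -2.0, 3.0])
w = np.zeros(3)
v = np.zeros(3)
for epoch in range(300):
    lr = 0.1 * 0.99 ** epoch     # learning rate decay
    w, v = momentum_step(w, v, w - target, lr)
```

In a real system the gradient would come from backpropagation on a mini-batch rather than from a closed-form expression.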
II-C Deep learning with two music genre benchmarks
We now build DNNs and CDNNs using two music genre benchmarks: GTZAN [49, 55] and the Latin Music Database (LMD). GTZAN consists of 100 30-second music recording excerpts in each of ten categories, and is the most-used public dataset in MIR research. LMD is a private dataset, consisting of 3,229 full-length music track recordings non-uniformly distributed among ten categories, and has been used in the annual MIREX audio Latin music genre classification evaluation campaign since 2008 (http://www.music-ir.org/mirex/wiki/MIREX_HOME). We use the first 30 seconds of each track in LMD.
We build several DNNs and CDNNs using different partitionings of these datasets. One partitioning of GTZAN we create by randomly selecting 500/250/250 excerpts for training/validation/testing. The other partitioning of GTZAN is “fault-filtered,” which we construct by hand to include 443/197/290 excerpts. This involves removing files including exact replicas, recording replicas, and distorted files, and then dividing the excerpts such that no artist is repeated across the training, validation, and test partitions. We partition LMD in two ways: 1) partitioning by 60/20/20% sampling in each class; 2) a hand-constructed artist-filtered partitioning containing approximately the same division of excerpts in each class. We retain all 213 replicas in LMD (https://highnoongmt.wordpress.com/2014/02/08/faults_in_the_latin_music_database).
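The constraint that no artist appears in more than one partition can be illustrated with a short sketch. The partitions in this article are constructed by hand; the greedy procedure below is a swapped-in automatic stand-in, it ignores class balance, and the function name, the split fractions, and the toy excerpt list are all hypothetical.

```python
import random
from collections import defaultdict

def artist_filtered_split(excerpts, fractions=(0.5, 0.25, 0.25), seed=0):
    """Split (excerpt_id, artist) pairs into three lists with disjoint artists."""
    by_artist = defaultdict(list)
    for excerpt_id, artist in excerpts:
        by_artist[artist].append(excerpt_id)
    artists = sorted(by_artist)
    random.Random(seed).shuffle(artists)
    targets = [f * len(excerpts) for f in fractions]
    splits = [[], [], []]
    for artist in artists:
        # Put all excerpts of this artist in the split furthest below its target size.
        i = max(range(3), key=lambda j: targets[j] - len(splits[j]))
        splits[i].extend(by_artist[artist])
    return splits

# Toy data: 40 excerpts by 7 hypothetical artists.
excerpts = [(f"excerpt{i}", f"artist{i % 7}") for i in range(40)]
train, valid, test = artist_filtered_split(excerpts)
```

Because whole artists are assigned at once, the resulting split sizes only approximate the requested fractions.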
The input to our systems is derived from the short-time Fourier transform (STFT) of a sampled audio signal :
where the parameter defines both the window length and the number of frequency bins. We define as a Hann window of length , which corresponds to a duration of 46ms for recordings sampled at 22050 Hz. The window is hopped along with a stride of samples (adjacent windows overlap by 50%).
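A minimal sketch of this analysis in NumPy follows. The window length of 1024 samples is an assumption inferred from the stated 46 ms duration at 22050 Hz (1024/22050 ≈ 46.4 ms), and the function name and test tone are hypothetical; only the Hann window and the 50% overlap come from the description above.

```python
import numpy as np

def magnitude_stft(x, n_fft=1024):
    """Magnitude STFT with a Hann window hopped by half its length.

    Returns an array of shape (frames, n_fft // 2 + 1).
    """
    hop = n_fft // 2                       # 50% overlap between adjacent windows
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[m * hop:m * hop + n_fft] * window
                       for m in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

# One second of a 440 Hz tone sampled at 22050 Hz.
sr = 22050
t = np.arange(sr) / sr
X = magnitude_stft(np.sin(2 * np.pi * 440.0 * t))
```

With these assumed settings each frame holds 513 non-negative magnitudes, and the tone peaks in bin 20, the bin nearest to 440 Hz at a resolution of about 21.5 Hz.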
Since audio signals can be of any duration, we define the input to our systems as a sequence , where the sequence length depends on the input audio’s duration. We define the nth element of the input sequence to be
where for each DNN and for each CDNN. Thus, when , is a sequence of vectors; when , is a sequence of matrices.
A (C)DNN processes each element in this sequence independently, outputting a sequence from the final (softmax) layer. The output vector , , is the posterior distribution of labels assigned to the nth element in the input sequence by the network. Therefore, we may write where represents the trainable network parameters, i.e., the set of weights and biases. We define the confidence of a (C)DNN in a particular label for an input sequence as the sum of all posteriors, i.e.,
We apply a label to an input sequence as the one maximising the confidence
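The decision rule just described can be sketched in a few lines. The function names and the toy posteriors are hypothetical; the rule itself (sum the frame-level posteriors of each label, then pick the label with the largest sum) follows the text above.

```python
import numpy as np

def sequence_confidence(posteriors):
    """Sum frame-level posteriors (frames x labels) into one confidence per label."""
    return posteriors.sum(axis=0)

def classify_sequence(posteriors):
    """Return the index of the label with the largest confidence."""
    return int(np.argmax(sequence_confidence(posteriors)))

# Three frames, three labels: the first frame favours label 0, the others label 1.
posteriors = np.array([[0.6, 0.3, 0.1],
                       [0.2, 0.7, 0.1],
                       [0.1, 0.8, 0.1]])
confidence = sequence_confidence(posteriors)
label = classify_sequence(posteriors)
```

Dividing the sum by the number of frames gives a mean posterior and leaves the selected label unchanged.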
Paralleling the work in , we build DNNs with 3 fully connected hidden layers, and either 50 or 500 units per layer. Our CDNN has two convolutional layers (accompanied by max pooling layers) followed by a fully connected hidden layer with 50 units. Figure 1 illustrates the architecture of our CDNN. Its first convolutional layer contains 32 filters, each arranged in a rectangular grid. We choose this long rectangular shape instead of the small square patches typically used when training on images based on our knowledge that many sounds exhibit strong harmonic structures that span a large portion of the audible spectrum. The second convolutional layer contains 32 filters, each connected in an pattern. Our two pooling neighborhoods are and have strides of . All of our deep learning systems use rectified linear units (ReLUs), and have a softmax unit in the final layer. As is typical, we standardise the (C)DNN inputs by subtracting the training set mean and dividing by the standard deviation in each of the input dimensions. We perform this with a linear layer above the input layer of each network. The raw inputs to the network are still .
Also paralleling , we build several music classification systems treating our DNN as a feature extractor. In this case, we construct a set of features by concatenating the activations from the DNN’s three hidden layers, and aggregating them over 5-second texture windows (hopped by 50%). The aggregation summarises the mean and standard deviation of the feature dimensions over the texture window and may be seen as a form of late-integration of temporal information. We use this new set of features to train a random forest (RF) classifier  with 500 trees. Thus, to classify a music audio recording from its set of aggregated features, we use majority voting over all classifications, which is also used in .
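The aggregation and voting steps can be sketched as follows. The window of 215 frames approximates 5 seconds under an assumed hop of 512 samples at 22050 Hz, and the function names and random activations are hypothetical. The random forest itself is omitted; `majority_vote` would be applied to its per-window predictions.

```python
import numpy as np

def aggregate_features(activations, win, hop):
    """Mean and standard deviation of activations (frames x dims) per texture window."""
    starts = range(0, activations.shape[0] - win + 1, hop)
    return np.stack([
        np.concatenate([activations[s:s + win].mean(axis=0),
                        activations[s:s + win].std(axis=0)])
        for s in starts
    ])

def majority_vote(window_labels):
    """Most frequent label among the per-window classifications."""
    values, counts = np.unique(window_labels, return_counts=True)
    return values[np.argmax(counts)]

# Hypothetical activations: 1290 frames (about 30 s), 3 hidden layers x 50 units.
rng = np.random.default_rng(0)
activations = rng.random((1290, 150))
features = aggregate_features(activations, win=215, hop=107)
recording_label = majority_vote(np.array([2, 5, 5, 1, 5]))
```

Each row of `features` concatenates 150 means and 150 standard deviations for one texture window.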
II-D Preliminary evaluation
Figure 2 and Table I show the results of RF classification using the features produced by the DNN when trained on GTZAN with the two different partitioning strategies; and Fig. 3 shows those for the (C)DNNs we train and test in LMD. Across the partitioning strategies we see significant differences in performance. The mean recall in each class in Figure 2 on the fault-filtered partition is much lower than that on the random test partition, with drops of more than 30 percentage points in most cases. Table I shows similar drops in performance that persist with the inclusion of dropout regularisation. Such significant drops in performance from partitioning based on artists are not unusual, and have been studied before as a bias coming from the experimental design [39, 23, 49]. Partitioning a music genre recognition dataset along artist lines has been recommended to avoid this bias [39, 23], and is in fact used in several MIREX audio classification tasks (http://www.music-ir.org/mirex/wiki/MIREX_HOME). Fault-filtered partitioning has not been used in many benchmark experiments with GTZAN because its artist information has only recently been made available.
III Adversaries in music content analysis
An adversary is an agent that tries to defeat a classification system in order to maximise its gain, e.g., SPAM detection. Dalvi et al.  pose this problem as a game between a classifier and adversary, and analyse the strategies involved for an adversary with complete knowledge of the classification system, and for a classifier to adapt to such an adversary. Szegedy et al.  propose using adversaries for testing the assumption that deep learning systems are “smooth classifiers,” i.e., stable in their classification to small perturbations around examples in the training data. They define an adversary of a classifier as an algorithm using complete knowledge of the classifier to perturb an observation such that , where is some small perturbation. Specifically, their adversary solves the constrained optimisation problem for a given :
For , Szegedy et al.  employ a line search along the direction of the loss function of the network starting from until the classifier produces the requested class. They find that adversarial examples of one classifier can fool other classifiers trained on independent data; hence, one need not have complete knowledge of a classifier in order to fool it.
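Table I: RF classification results using DNN features in GTZAN, by number of hidden units per layer and hidden layer used (values in parentheses: fault-filtered partition).
|50||1||76.00 (40.69)||80.40 (45.17)|
|50||2||78.80 (45.17)||80.40 (43.10)|
|50||3||79.60 (43.79)||78.80 (44.48)|
|50||All||80.40 (43.79)||80.00 (43.79)|
|500||1||68.40 (40.34)||75.60 (40.69)|
|500||2||74.40 (40.69)||80.00 (50.34)|
|500||3||77.60 (43.79)||79.20 (48.62)|
|500||All||76.00 (42.41)||81.20 (48.97)|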
Goodfellow et al.  provide an intuitive explanation of these adversaries: even though the perturbations in each dimension might be small, their contribution to the magnitude of a projection grows linearly with input dimensionality. With a deep neural network involving many such projections in each layer, a small perturbation at its high-dimensional input layer can create major consequences at the output layer. Goodfellow et al.  show that adversarial examples can be easily generated by making the perturbation proportional to the sign of the partial derivative of the loss function used to train a particular network, evaluated with the requested class. They also find that the direction of perturbation is important, not necessarily its size. Hence, it seems adversarial examples of one model will likely fool other models because they occur in large volumes in high-dimensional spaces. This is also found by Gu and Rigazio .
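A gradient-sign perturbation of this kind can be sketched for a single softmax layer, for which the input gradient of the cross-entropy loss has a closed form. The model, its random weights, the step size, and the function names are all assumptions for illustration; a deep network would obtain the same gradient by backpropagation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def input_gradient(x, W, b, label):
    """Gradient of the cross-entropy loss for `label` with respect to the input x."""
    p = softmax(x @ W + b)
    p[label] -= 1.0            # derivative of the loss with respect to the logits
    return W @ p               # chain rule back to the input

def sign_step(x, W, b, target, eps):
    """Move every input dimension by eps in the direction that lowers the target loss."""
    return x - eps * np.sign(input_gradient(x, W, b, target))

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.1, (100, 5))   # hypothetical 100-dimensional input, 5 labels
b = np.zeros(5)
x = rng.random(100)
target = int(np.argmin(softmax(x @ W + b)))   # request the least likely label
x_adv = sign_step(x, W, b, target, eps=0.01)
p_before = softmax(x @ W + b)[target]
p_after = softmax(x_adv @ W + b)[target]
```

No single dimension changes by more than `eps`, yet the posterior of the requested label rises because the small changes add up across all 100 dimensions.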
Like Szegedy et al., we are interested in the robustness of our deep learning music content analysis systems to an adversary. Do these systems suffer just as dramatically as the image content recognition systems in [54, 27, 24]? In other words, can we find imperceptible perturbations of audio recordings that nonetheless make the systems produce any label with high confidence? If so, can we adapt the training of the systems such that they become more robust? In the next subsections, we define an adversary as an optimisation problem, taking care of the fact that the input to our deep learning systems is the magnitude STFT (2). We then present an approach to integrate adversaries into the training of our systems. We present our experimental results in Section IV.
III-A Adversaries for music audio
The explicit goal of our adversary is to perturb a music recording such that a system will confidently classify it with some class . Specifically, we define the adversary as the constrained optimisation problem:
where we define the feasible set of adversarial examples to input sequence as:
with the parameter
limiting the maximum acceptable perturbation caused by the adversary. The loss function in (6) is the cross-entropy loss function, , which we use in training our (C)DNNs. Given the network parameters , this adversary can compute the derivative of this loss function by backpropagating derivatives through the network. This suggests that our adversary can accomplish its goal by searching for a new input sequence via gradient descent on the loss function with any label that differs from the ground truth. This is the approach used by Szegedy et al.  in the context of image object recognition.
A local minimum of (6) can be found using projected gradient descent, initialised with the exemplar , and iterating
where the scalar is the gradient descent step size, and computes the least squares projection of its argument onto the set defined in (7). Note that we define operations on sequences element-wise, e.g., .
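One iteration of this procedure might look like the sketch below. It assumes that the feasible set is the set of non-negative sequences whose SNR relative to the exemplar, taken here as 10 log10(‖X‖² / ‖X′ − X‖²), is at least a given minimum; the exact form of (7) may differ. The clip-then-rescale step returns a point inside that set but is not claimed to be the exact least squares projection. All names are hypothetical.

```python
import numpy as np

def snr_db(exemplar, perturbed):
    """SNR of the perturbation in dB, relative to the exemplar energy (assumed form)."""
    noise = perturbed - exemplar
    return 10.0 * np.log10(np.sum(exemplar ** 2) / np.sum(noise ** 2))

def project_to_feasible(candidate, exemplar, min_snr_db):
    """Clip to non-negative magnitudes, then shrink the perturbation if it is too large."""
    delta = np.maximum(candidate, 0.0) - exemplar
    max_norm = np.linalg.norm(exemplar) / 10.0 ** (min_snr_db / 20.0)
    norm = np.linalg.norm(delta)
    if norm > max_norm:
        delta = delta * (max_norm / norm)
    return exemplar + delta

def projected_gradient_step(current, exemplar, grad, step, min_snr_db):
    """Gradient descent step on the loss, followed by the return to the feasible set."""
    return project_to_feasible(current - step * grad, exemplar, min_snr_db)

rng = np.random.default_rng(0)
exemplar = rng.random((42, 513))              # hypothetical magnitude frames
grad = rng.standard_normal(exemplar.shape)    # stand-in for a backpropagated gradient
result = projected_gradient_step(exemplar, exemplar, grad, step=1.0, min_snr_db=15.0)
unchanged = project_to_feasible(exemplar, exemplar, 15.0)
```

Here the step is deliberately too large, so the result lands on the boundary of the set at the minimum SNR.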
The main difficulty with this approach is that not all sequences can be mapped back to valid time-domain signals . This is because the analysis in (1) uses overlapping windows, which causes adjacent elements in the sequence to become dependent. This means that individual elements from the sequence cannot be adjusted arbitrarily if we want to have an analog in the time-domain. Therefore, in order to generate valid adversarial examples, we include an additional processing step that projects the sequence onto the space of time-frequency coefficients arising from valid time-domain sequences. This is done using the Griffin and Lim algorithm , which seeks to minimise
where denotes the set of all valid sequences. This minimization can be performed using alternating projections, and we have found that in practice it is sufficient to apply a single set of projections. We do this by first rebuilding a complex valued time-frequency representation from the sequence
where and is the phase from the exemplar’s Fourier transform. The inverse Fourier transform is a time-domain signal, and so the Fourier transform of this signal, , will yield a valid DFT spectrum that can be used to build a valid input sequence for our (C)DNN, i.e., by replacing by in (2).
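The single set of projections described above can be sketched with NumPy alone. The window length, hop size, periodic Hann window, and least squares overlap-add inverse are assumptions for illustration, as are all function names; only the overall recipe (attach the exemplar phase, invert, re-analyse) is taken from the text.

```python
import numpy as np

N_FFT, HOP = 1024, 512
WINDOW = np.hanning(N_FFT + 1)[:-1]          # periodic Hann window

def stft(x):
    n_frames = 1 + (len(x) - N_FFT) // HOP
    frames = np.stack([x[m * HOP:m * HOP + N_FFT] * WINDOW for m in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(S, length):
    """Windowed overlap-add inverse, normalised by the summed squared window."""
    x = np.zeros(length)
    norm = np.zeros(length)
    for m, frame in enumerate(np.fft.irfft(S, n=N_FFT, axis=1)):
        x[m * HOP:m * HOP + N_FFT] += frame * WINDOW
        norm[m * HOP:m * HOP + N_FFT] += WINDOW ** 2
    return x / np.maximum(norm, 1e-12)

def make_valid(magnitude, exemplar_phase, length):
    """Attach the exemplar phase, go to the time domain, and re-analyse."""
    x = istft(magnitude * np.exp(1j * exemplar_phase), length)
    return np.abs(stft(x)), x

rng = np.random.default_rng(0)
x = rng.standard_normal(N_FFT + 20 * HOP)
S = stft(x)
# A magnitude that already comes from a real signal passes through unchanged.
same_mag, _ = make_valid(np.abs(S), np.angle(S), len(x))
# An arbitrarily perturbed magnitude is replaced by one that a real signal can produce.
perturbed = np.abs(S) + 0.1 * rng.random(S.shape)
valid_mag, x_adv = make_valid(perturbed, np.angle(S), len(x))
```

The returned time-domain signal `x_adv` is what would be written to an audio file, and `valid_mag` is what the network would actually see.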
The pseudo-code in Alg. 1 summarises this approach. The algorithm may be terminated when the mean posterior of the target adversarial label exceeds the threshold , or after a maximum number of epochs (in which case an adversary cannot be found above the minimum SNR).
III-B Training with adversaries for music audio
As per  and , we can attempt to use our adversary as a regulariser, and to create systems robust against adversarial inputs. In particular, we create adversaries for the (C)DNN discussed above, and use them to generate a (possibly) infinite supply of new samples during training. The iterative procedure for generating adversaries in Alg 1 is too slow to be practical for training, which requires on the order of 50 to 200 training epochs. Therefore, we apply the single gradient step procedure suggested in . In our experience, this procedure often generates inputs that confuse the network, although not typically with a high confidence. The pseudo-code in Alg. 2 illustrates our training algorithm, where represent the training data, i.e., the set of input audio sequences and their labels, and is a set of adversarial labels.
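The idea of augmenting each training batch with single-step adversarial examples can be sketched on a softmax regression model. This is not Alg. 2: the model, the use of a gradient-sign step, the clipping to non-negative inputs, the hyperparameters, and the function names are all assumptions made so that the example is self-contained.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def adversarial_batch(X, W, b, adv_labels, eps):
    """One gradient-sign step per example towards its adversarial label."""
    P = softmax(X @ W + b)
    P[np.arange(len(X)), adv_labels] -= 1.0
    grad_X = P @ W.T                       # loss gradient with respect to the inputs
    return np.maximum(X - eps * np.sign(grad_X), 0.0)

def train_step(X, y, W, b, adv_labels, lr=0.1, eps=0.01):
    """Gradient step on the clean batch plus its adversarial copies (true labels kept)."""
    X_all = np.vstack([X, adversarial_batch(X, W, b, adv_labels, eps)])
    y_all = np.concatenate([y, y])
    P = softmax(X_all @ W + b)
    P[np.arange(len(y_all)), y_all] -= 1.0
    return W - lr * X_all.T @ P / len(y_all), b - lr * P.mean(axis=0)

rng = np.random.default_rng(0)
X = rng.random((32, 20))                          # hypothetical non-negative inputs
y = rng.integers(0, 3, 32)
W = rng.normal(0.0, 0.1, (20, 3))
b = np.zeros(3)
adv_labels = (y + rng.integers(1, 3, 32)) % 3     # any label other than the true one
X_adv = adversarial_batch(X, W, b, adv_labels, eps=0.01)
W_new, b_new = train_step(X, y, W, b, adv_labels)
```

Because the adversarial copies are regenerated from the current parameters at every step, the supply of new training samples never runs out.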
|Classification in GTZAN|
|Little Richard, “Last Year’s Race Horse”||32 (23)||29 (23)||36 (25)||36 (26)||36 (25)||33 (24)||32 (24)||31 (25)||42 (26)||36 (25)|
|Rossini, “William Tell Overture”||32 (25)||37 (30)||40 (29)||43 (28)||34 (24)||36 (29)||33 (25)||34 (26)||37 (26)||37 (28)|
|Willie Nelson, “A Horse Called Music”||25 ( )||25 (20)||30 (27)||30 (20)||26 (19)||30 (25)||27 (23)||21 (20)||30 (23)||29 (23)|
|Simian Mobile Disco, “10000 Horses Can’t Be Wrong”||31 (30)||36 (31)||38 (32)||45 (34)||41 (33)||40 (32)||33 (31)||47 (34)||42 (33)||38 (33)|
|Rubber Bandits, “Horse Outside”||27 (27)||27 (27)||36 (29)||42 (31)||38 (29)||34 (28)||32 (28)||37 (29)||36 (29)||35 (29)|
|Leonard Gaskin, “Riders in the Sky”||32 (23)||30 (25)||32 (23)||35 (25)||31 (22)||35 (29)||34 (23)||26 (23)||35 (25)||35 (24)|
|Jethro Tull, “Heavy Horses”||29 (26)||28 (26)||40 (29)||42 (29)||38 (28)||36 (28)||34 (28)||34 (28)||37 (28)||36 (29)|
|Echo and The Bunnymen, “Bring on the Dancing Horses”||29 (25)||28 (26)||38 (28)||43 (28)||35 (26)||34 (26)||33 (26)||33 (26)||36 (27)||38 (28)|
|Count Prince Miller, “Mule Train”||32 (30)||29 (30)||41 (33)||37 (34)||43 (33)||36 (31)||33 (31)||42 (34)||40 (33)||33 (33)|
|Rolling Stones, “Wild Horses”||30 (22)||32 (24)||37 (25)||40 (25)||31 (22)||34 (25)||31 (26)||32 (23)||37 (25)||37 (26)|
IV Experimental Results
We can design an adversary (Alg. 1) such that it will attempt to make a system behave in different ways. For instance, an adversary could attempt to perturb an input within some limit (SNR) such that the (C)DNN makes a high-confidence classification () that is correct with probability . Another adversary could attempt to make the system label any input using the same label. We can also make an ensemble of adversaries such that they produce adversarial examples that a (C)DNN classifies in every possible way.
We define our adversaries (Alg. 1) using: , SNR dB, , and , and with the directive to make the (C)DNN correct with probability . More concretely, for each test observations, the adversary draws uniformly one of the dataset labels , then seeks to find in no more than iterations using step size a valid perturbation no larger than dB SNR, and which the (C)DNN labels as with confidence . Figure 4(a) shows the FoM of the DNN-based classification system in Fig. 2(b), but with input intercepted by this adversary. Note that in this case the classification is performed by the same random forest classifier using the aggregated hidden layer activations, but the adversary is unaware of this. In other words, it is only trying to force the DNN to misclassify inputs that have been subject to minor perturbations. Compared with a normalised accuracy of in Fig. 2(b), we see our adversary has successfully confused the random forest classifier to be no better than random. Figure 5 shows one of the adversarial examples from this experiment. Apart from some significant high-frequency deviations, the spectrum of the adversary is very similar to that of the original. The SNR in this example is dB.
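The effect of drawing target labels uniformly can be checked with a short simulation. It assumes ten equally likely ground truth labels and an adversary that always succeeds in eliciting its drawn label; both are idealisations, and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_labels, n_tests = 10, 10000
true_labels = rng.integers(0, n_labels, size=n_tests)
target_labels = rng.integers(0, n_labels, size=n_tests)   # drawn uniformly by the adversary
# The system is "correct" only when the drawn target happens to equal the truth.
accuracy = float(np.mean(true_labels == target_labels))
```

Under these assumptions the accuracy settles near one in ten, which is what a system choosing labels at random would achieve.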
Figure 4(b) shows the FoM of the CDNN classification system in Fig. 3(d) attacked by the same adversary. In this case, the CDNN proved more difficult to fool, but still the adversary is able to significantly reduce the normalised classification accuracy from to with high confidence classifications at rather high SNR. If we reduce the minimum confidence and lessen the SNR constraint to dB, then the adversary makes the CDNN perform even worse: a normalised accuracy of with a mean SNR of dB.
For the same system in Fig. 2(b), and using , SNR dB, and , we show in  that we are able to create adversaries that make the system always right, always wrong, and always select “Jazz.” Table II shows the results of two ensembles of adversaries, each intent on making the system in Fig. 2(b) choose one of every label in GTZAN for the same music with SNR dB, and . The adversaries of one ensemble insist upon a classification confidence of at least ; and in the other of at least . These music recordings are the same 30-second excerpts used in . We see that in all cases but one, the ensembles are able to elicit high confidence classifications from the system with minor perturbations of the input. We also see that larger perturbations are produced on average when the adversaries insist on a higher minimum confidence: dB for a confidence of at least , and dB for a confidence of at least .
These results can be heard here: http://www.eecs.qmul.ac.uk/~sturm/research/DNN_adversaries. We find that the perturbations caused by these adversaries are certainly perceptible, unlike those found for image data in  and ; however, the distortion is very minor, and the music remains exactly the same, e.g., pitches, rhythm, lyrics, instrumentation, dynamics, and style all remain the same.
|Deep Learning System||Norm. Acc.||Norm. Acc. w/ Adversary||SNR (dB), mean ± std. dev.|
|DNN-LMD Fig. 3(b)||0.63||0.03||37.8 ± 4.6|
|CDNN-LMD Fig. 3(d)||0.63||0.21||9.6 ± 25.8|
We now perform an experiment to compare (C)DNNs trained with adversarial examples (as per Alg. 2) to the systems in Fig. 3(b,d). To do this we test the response of these systems against an adversary aimed at always eliciting an incorrect response. (This is different from the adversary used above, which seeks to make the system correct with probability .) For this experiment, we set and SNR to dB in order to allow arbitrarily large perturbations to force misclassifications. Table III illustrates the results of this experiment from which we observe several interesting results. Column 1 shows the normalised accuracy on the original test set (with no adversary present). We see that training against adversarial examples leads to a slight deflation in accuracy on new test data. Column 2 shows the normalised accuracy of these systems against our adversary intent on forcing a 100% error rate. We see that the CDNN systems are more robust to this adversary, and that the systems trained against adversarial examples confer little to no advantage. Column 3 shows the average perturbation size of the adversarial examples that led to misclassifications. We notice that larger perturbations (corresponding to lower SNRs) were required to get the CDNN systems to misclassify test inputs. The minimum SNR produced was dB, while the maximum was dB. The results of this experiment point to the conclusions that a) the CDNN systems are more robust to this adversary; and b) training against adversarial examples (contrary to what we hypothesized) does not seem to reduce the misclassification rate against new adversarial examples. A possible explanation for the latter result is that, due to the high-dimensional nature of the input space, the set of possible adversarial examples is densely packed, so that training on a small number of these points is not sufficient to allow the systems to generalize to new adversarial examples.
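For readers who wish to reproduce the first two columns of such a comparison, the following sketch computes a normalised accuracy from true and predicted labels. We assume here that "normalised accuracy" means the mean of the per-class recalls; the function name and the integer label encoding are ours.

```python
import numpy as np

def normalised_accuracy(y_true, y_pred, n_classes):
    """Mean per-class recall; classes absent from y_true are skipped."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = []
    for c in range(n_classes):
        mask = y_true == c
        if mask.any():
            recalls.append(np.mean(y_pred[mask] == c))
    return float(np.mean(recalls))
```

Under this assumption, a system that always predicts the majority class of an unbalanced test set scores no better than chance, even when its raw accuracy is high.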
Returning to the broadest question motivating our work, we seek to measure the contribution of deep learning to music content analysis. The previous sections describe a series of experiments we have conducted using deep learning systems of a variety of architectures, which we have trained and tested in two different partitions of two benchmark music datasets. We have evaluated the robustness of these systems to an adversary that has complete knowledge of the classifiers, and have also investigated the use of an adversary in the training of deep learning systems.
Our experimental results in Fig. 2 and Table I are essentially reproductions of those reported in . Based on the results of their experiments with random partitionings of GTZAN, Sigtia et al.  claim that their DNN-based systems learn features that “better represent the audio” than standard or “hand-crafted” features, e.g., those referenced in  like MFCCs. Similar conclusions are made about the deep learning systems in , also based on experiments using a random partitioning of GTZAN. However, we see in Fig. 2 and Table I that when we consider the faults in the GTZAN dataset and partition it along artist lines, as for the LMD dataset in Fig. 3, our deep learning systems perform significantly worse. This is an expected outcome [39, 23, 49], but the artist information in GTZAN was not available until 2012 .
This motivates the question of whether DNN-based systems really do perform better than classifiers using standard, low-level and “hand-crafted” features. To examine this, we build baseline systems that use low-level features, and train and test them in the same fault-filtered partition of GTZAN as in Fig. 2(b), and the artist-filtered partition of LMD as in Fig. 3(b,d). Mimicking [28, 44], we compute these features based on a short-time analysis using 46 ms frames hopped by 50%. From each frame we extract the first 13 Mel-frequency cepstral coefficients (MFCCs) and zero-crossings, and compute their mean and variance over five-second texture windows (which are also hopped by 50%). We combine the features of the training and validation sets of the fault-filtered partition of GTZAN, and the artist-filtered partition of LMD. Both systems use a minimum Mahalanobis distance classifier, and assign a class by majority vote from the classifications of the individual texture windows. Figure 6 shows the FoM produced by these baseline systems. We see that for GTZAN the baseline actually reproduces more ground truth than the DNN in Fig. 2(b) and all but one in Table I. Our simple baseline system for LMD reproduces much less ground truth than the (C)DNN in Fig. 3(b,d). Nonetheless, we have no reason to accept the conclusion that deep learning features “perform better” than “hand-crafted” features for the particular architectures considered here and those in [28, 44]. Different experiments are needed to address such a conclusion.
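The back end of such a baseline can be sketched as follows. The sketch takes per-frame features as given (e.g., 13 MFCCs plus a zero-crossing count per frame, computed elsewhere) and covers only the texture-window statistics, the minimum Mahalanobis distance rule, and the majority vote. The single covariance matrix pooled over all training data, its small regularisation term, and all names are our assumptions.

```python
import numpy as np

def texture_windows(frame_feats, frames_per_window):
    """Mean and variance of frame features over windows hopped by 50%."""
    hop = max(1, frames_per_window // 2)
    out = []
    for start in range(0, len(frame_feats) - frames_per_window + 1, hop):
        w = frame_feats[start:start + frames_per_window]
        out.append(np.concatenate([w.mean(axis=0), w.var(axis=0)]))
    return np.array(out)

class MinMahalanobis:
    """Assigns a vector to the class whose mean is nearest in Mahalanobis distance."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.means_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        # Assumption: one covariance pooled over all training data, lightly regularised.
        cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])
        self.prec_ = np.linalg.inv(cov)
        return self

    def predict(self, X):
        diff = X[:, None, :] - self.means_[None, :, :]
        # Squared Mahalanobis distance of every vector to every class mean.
        d2 = np.einsum("ncd,de,nce->nc", diff, self.prec_, diff)
        return self.classes_[d2.argmin(axis=1)]

def classify_excerpt(clf, frame_feats, frames_per_window):
    """Majority vote over the texture windows of one excerpt."""
    votes = clf.predict(texture_windows(frame_feats, frames_per_window))
    labels, counts = np.unique(votes, return_counts=True)
    return labels[counts.argmax()]
```

With 46 ms frames hopped by 50%, a five-second texture window spans roughly 216 frames, so `frames_per_window` would be set near that value.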
A tempting conclusion is that since the normalised classification accuracies in Figs. 2(b) and 3(d) are extremely unlikely to arise by chance ( for GTZAN and for LMD by a Binomial test) it is therefore entirely reasonable to reject the hypothesis that our (C)DNN are choosing outputs at random. Hence, one might argue that these (C)DNN must have learned features that are “relevant” to music genre recognition [28, 44, 31]. This argument appears throughout the MIR research discipline , and turns on the strong assumption that there are only two ways a system can reproduce the ground truth of a dataset: by chance or by learning to solve a specific problem thought to be well-posed by a cleanly labeled dataset . In fact, there is a third way a system can reproduce the ground truth of a music dataset: by learning to exploit characteristics shared between the training and testing datasets that arise not from a relationship in the real world, but from the curation and partitioning of a dataset in the experimental design of an evaluation [48, 49, 53]. Since the evaluations producing Figs. 2 and 3, as well as all results in [28, 44], not to mention a significant number of published studies in MIR , do not control for this third way, we cannot validly conclude upon the “relevance” of whatever has been learned by these music content analysis systems.
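The kind of Binomial test referred to above can be sketched as an exact upper-tail probability. The numbers in the usage line (300 test excerpts, 10 classes, 150 correct) are purely hypothetical and are not those of our experiments.

```python
from math import comb

def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical example: probability of at least 150 correct labels out of
# 300 when guessing uniformly among 10 classes.
p_value = binomial_tail(n=300, k=150, p=0.1)
```

A very small value of this probability justifies rejecting only the hypothesis of random guessing. It says nothing about which characteristics of the data the system is exploiting.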
A notion of this problem is given by the significant decreases in the FoM we measure when partitioning GTZAN and LMD along artist lines. By doing so, we are controlling for some independent variables that a system might be exploiting to reproduce ground truth, but which arguably have little relevance to the high-level labels of the dataset . More concretely, consider that all 100 excerpts labeled Pop in GTZAN come from recordings of music by four artists, 25 from each artist. If we train and test a system on a random partition of GTZAN, we cannot know whether the system is recognising Pop, recognising the artist, or recognising other aspects that may or may not be related to Pop. If we instead train a system with Pop excerpts by three artists, and test with the Pop excerpts by the fourth artist, then we might be testing something closer to Pop recognition. This all depends on defining what knowledge is relevant to the problem.
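The Pop example above can be made concrete with a small sketch of an artist-filtered partition. The excerpt identifiers and artist names are placeholders, and the function is only one assumed way of enforcing the filter.

```python
def artist_filtered_split(items, test_artists):
    """Split (excerpt_id, label, artist) tuples so no artist is on both sides."""
    train, test = [], []
    for item in items:
        (test if item[2] in test_artists else train).append(item)
    return train, test

# Placeholder data mirroring the Pop example: four artists, 25 excerpts each.
excerpts = [
    (f"pop_{artist}_{i:02d}", "Pop", artist)
    for artist in ("artist_A", "artist_B", "artist_C", "artist_D")
    for i in range(25)
]
train, test = artist_filtered_split(excerpts, {"artist_D"})
```

A random partition of the same 100 excerpts would almost surely place every artist on both sides, which is exactly the confound the filter removes.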
A common retort to these arguments is that a system should be able to reproduce ground truth “by any means.” One thereby defines “relevant knowledge” as any correlations that help a system reproduce an amount of dataset ground truth that is inconsistent with chance. However, this can lead to circular reasoning: system X has learned “relevant knowledge” because it reproduces Y amount of ground truth; system X reproduces Y amount of ground truth because it has learned “relevant knowledge.” It is also deaf to one of the major aims of research in music content analysis : “to make music, or information about music, easier to find.” If a music content analysis system is describing music in ways that do not align with those of its users, then its usability is in jeopardy no matter its FoM in benchmark datasets [56, 42]. Finally, this means that the problem thought to be well-posed by a cleanly labeled dataset can be many things simultaneously, which leads to the problem of how to validly compare apples and oranges . In other words, why compare systems when they are solving different problems? This also applies to the comparisons above with the FoM in Fig. 6.
While we have no idea whether our (C)DNN systems in Fig. 3 are exploiting “irrelevant” characteristics in LMD, our experimental results with adversaries in Figs. 4 and 5, and Tables II and III, indicate that their decision machinery is incredibly sensitive in very strange ways. Our adversaries are able to fool the high-performing deep learning systems by perturbing their input in minor ways. Auditioning the results in Table II shows that while the music in each recording remains exactly the same, and the perturbations are very small, the DNN is nearly always fooled into choosing with high confidence every class it has supposedly learned. The CDNN is similarly defeated by our adversary; however, it is quite notable that it requires perturbations of far lower SNR than does the DNN. We are currently studying the reasons for this.
Our application of adversaries here is close to the “method of irrelevant transformations” that we apply in [48, 53, 52] to assess the internal models of music content analysis systems, and to test the hypothesis, “the system is using relevant criteria to make its decisions.” In , we take a brute force approach whereby we apply random but linear time-invariant and minor filtering to inputs of systems trained in three different music recording datasets until their FoM becomes perfect or random. We also make each system apply every one of its classes to the same music recordings in Table II (these results can be auditioned at http://www.eecs.qmul.ac.uk/~sturm/research/TM_expt2/index.html). In , we instead apply subtle pitch-preserving time-stretching of music recordings to fool a deep learning system trained in the benchmark music dataset BALLROOM . We find that through such a transformation we can make the system perform perfectly or no better than random by applying tempo changes of at most 6% to test dataset recordings. We find a similar result for the same kind of deep learning system but trained in LMD .
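The brute-force filtering approach can be illustrated by the sketch below. It is a loose reading of the idea rather than the procedure of the cited work: the filter is a short random FIR filter kept close to a unit impulse, and `n_taps`, `strength`, `max_tries` and the `classify` callable are all hypothetical.

```python
import numpy as np

def random_minor_filter(x, rng, n_taps=9, strength=0.05):
    """Filter x with a random FIR filter close to the identity (a unit impulse)."""
    h = strength * rng.standard_normal(n_taps)
    h[n_taps // 2] += 1.0  # dominant centre tap keeps the filter near identity
    return np.convolve(x, h, mode="same")

def find_fooling_filter(x, classify, wanted_label, rng, max_tries=1000):
    """Draw random minor filters until `classify` returns `wanted_label`.

    Returns the filtered signal, or None if the budget is exhausted.
    """
    for _ in range(max_tries):
        y = random_minor_filter(x, rng)
        if classify(y) == wanted_label:
            return y
    return None
```

Unlike the adversary of Alg. 1, this search uses no knowledge of the system's parameters. It only queries the system's output.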
Our adversary in Alg. 1 moves instead right to the Achilles heel of a deep learning system, coaxing it to behave in arbitrary ways for an input simply by making minor perturbations to the sampled audio waveform that have no effect on the music content it possesses. We observe in Fig. 5, and by auditioning the results in Table II, that the low- to mid-frequency content of adversarial examples differs very little from the original recordings, but find more significant differences in the high-frequency spectra. This suggests that the distribution of energy in the high-frequency spectrum has significant impact on the decision machinery of our (C)DNN. The apparent high relevance of such slight characteristics in proportion to that of the actual musical content of a music recording does not bode well for one of the most important aims of machine learning: generalisation.
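One simple way to quantify where in frequency an adversarial example deviates from the original is to compare perturbation energy to original energy in a few contiguous bands. The sketch below does this for a single magnitude spectrum; the number of bands and the uniform band layout are our assumptions.

```python
import numpy as np

def band_deviation_db(mag_orig, mag_adv, n_bands=4, eps=1e-12):
    """Perturbation energy relative to original energy in each band, in dB."""
    bands_o = np.array_split(np.asarray(mag_orig, dtype=float), n_bands)
    bands_a = np.array_split(np.asarray(mag_adv, dtype=float), n_bands)
    return np.array([
        10.0 * np.log10((np.sum((a - o) ** 2) + eps) / (np.sum(o ** 2) + eps))
        for o, a in zip(bands_o, bands_a)
    ])
```

Bands are ordered from low to high frequency, so a deviation concentrated in the last entries corresponds to the high-frequency differences described above.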
As observed by Goodfellow et al.  in their deep learning systems taught to recognise objects in images, the impressive FoM we measure of our deep learning systems may be merely a colourful “Potemkin village.” Employing an adversary to scratch a little below the surface reveals the FoM to be curiously hollow. A system that appears to be solving a complex problem but actually is not is what we term a “horse” , which is a nod to the famous horse Clever Hans: a real horse that appeared to be a capable mathematician but was merely responding to involuntary cues that went undetected because his public demonstrations had no validity to attest to such an ability. Measuring the number of correct answers Hans gives in an uncontrolled environment does not give reason to conclude he comprehends what he appears to be doing. It is the same with the experiments we perform above with systems labelling observations in GTZAN and LMD. In fact, Goodfellow et al.  come to the same conclusion: “The existence of adversarial examples suggests that … being able to correctly label the test data does not imply that our models truly understand the tasks we have asked them to perform” . This observation is now well-known in MIR [47, 48, 49, 50], but deserves to be repeated.
In this article, we have shown how to adapt the adversary of Szegedy et al.  to work within the context of music content analysis using deep learning. We have shown how our adversary is effective at fooling deep learning systems of different architectures, trained on different benchmark datasets. We find our convolutional networks are more robust against this adversary than our deep neural networks. We have also sought to employ the adversary as part of the training of these systems, but find it results in systems that remain as sensitive to the same adversary.
It is of course not very popular for one to be an “adversary” to research, moving quickly to refute conclusions and break systems reported in the literature; however, we insist that breaking systems leads ultimately to progress. Considerable insight can be gained by looking behind the veil of performance metrics in an attempt to determine the mechanisms by which a system operates, and whether the evaluation is any valid reflection of the qualities we wish to measure. Such probing is necessary if we are truly interested in ascertaining what a system has learned to do, what its vulnerabilities might be, how it compares to competing systems supposedly solving the same problem, and how well we can expect it to perform when used in real-world applications.
CK and JL were supported in part by the Danish Council for Strategic Research of the Danish Agency for Science Technology and Innovation under the CoSound project, case number 11-115328. This publication only reflects the authors’ views.
-  J. B. Allen and L. Rabiner. A unified approach to short-time Fourier analysis and synthesis. Proc. IEEE, 65(11):1558–1564, Nov. 1977.
-  J.-J. Aucouturier and F. Pachet. Scaling up music playlist generation. In Multimedia and Expo, 2002. ICME ’02. Proceedings. 2002 IEEE International Conference on, volume 1, pages 105–108 vol.1, 2002.
-  J.-J. Aucouturier and F. Pachet. Improving timbre similarity: How high is the sky? J. of Negative Results in Speech and Audio Sciences, 1(1), 2004.
-  Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian J. Goodfellow, Arnaud Bergeron, Nicolas Bouchard, and Yoshua Bengio. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012.
-  E. Battenberg and D. Wessel. Analyzing drum patterns using conditional deep belief networks. In Proc. ISMIR, 2012.
-  Y. Bengio, I. Goodfellow, and A. Courville. Deep Learning. MIT Press, 2015 (in preparation).
-  Yoshua Bengio. Learning deep architectures for AI. Foundations and trends in Machine Learning, 2(1):1–127, 2009.
-  James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010. Oral Presentation.
-  T. Bertin-Mahieux, D. Eck, and M. Mandel. Automatic tagging of audio: The state-of-the-art. In W. Wang, editor, Machine Audition: Principles, Algorithms and Systems. IGI Publishing, 2010.
-  T. Bertin-Mahieux, D. P.W. Ellis, B. Whitman, and P. Lamere. The million song dataset. In Proc. ISMIR, 2011.
-  N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent. Audio chord recognition with recurrent neural networks. In Proc. ISMIR, 2013.
-  C. J. C. Burges, J. C. Platt, and S. Jana. Distortion discriminant analysis for audio fingerprinting. IEEE Trans. Speech Audio Process., 11(3):165–174, May 2003.
-  M. Casey, C. Rhodes, and M. Slaney. Analysis of minimum distances in high-dimensional musical spaces. IEEE Trans. Audio, Speech, Lang. Process., 16(5):1015–1028, July 2008.
-  M. Casey, R. Veltkamp, M. Goto, M. Leman, C. Rhodes, and M. Slaney. Content-based music information retrieval: Current directions and future challenges. Proc. IEEE, 96(4):668–696, Apr. 2008.
-  Nick Collins. Computational analysis of musical influence: A musicological case study using MIR tools. In ISMIR, pages 177–182, 2010.
-  N. Dalvi, P. Domingos, Mausam, S. Sanghai, and D. Verma. Adversarial classification. KDD, 2004.
-  L. Deng and D. Yu. Deep Learning: Methods and Applications. Now Publishers, 2014.
-  S. Dieleman, P. Brakel, and B. Schrauwen. Audio-based music classification with a pretrained convolutional network. In Proc. ISMIR, 2011.
-  S. Dieleman and B. Schrauwen. End-to-end learning for music audio. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages 6964–6968, May 2014.
-  S. Dixon, F. Gouyon, and G. Widmer. Towards characterisation of music via rhythmic patterns. In Proc. ISMIR, pages 509–517, 2004.
-  S. Ewert, B. Pardo, M. Muller, and M.D. Plumbley. Score-informed source separation for musical audio recordings: An overview. Signal Processing Magazine, IEEE, 31(3):116–124, May 2014.
-  A. Flexer. A closer look on artist filters for musical genre classification. In Proc. ISMIR, pages 341–344, Sep. 2007.
-  A. Flexer, D. Schnitzer, M. Gasser, and T. Pohle. Combining features reduces hubness in audio similarity. In Proc. Int. Symp. Music Info. Retrieval, 2010.
-  I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In Proc. ICLR, 2015.
-  Daniel Griffin and Jae S Lim. Signal estimation from modified short-time fourier transform. Acoustics, Speech and Signal Processing, IEEE Transactions on, 32(2):236–243, 1984.
-  Niall Griffith and Peter M Todd. Musical networks: Parallel distributed perception and performance. MIT Press, 1999.
-  S. Gu and L. Rigazio. Towards Deep Neural Network Architectures Robust to Adversarial Examples. ArXiv e-prints, December 2014.
-  P. Hamel and D. Eck. Learning features from music audio with deep belief networks. In Proc. ISMIR, 2010.
-  T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, 2 edition, 2009.
-  M. Henaff, K. Jarrett, K. Kavukcuoglu, and Y. LeCun. Unsupervised learning of sparse features for scalable audio classification. In Proc. Int. Soc. Music Info. Retrieval, Miami, FL, Oct. 2011.
-  E. J. Humphrey, J. P. Bello, and Y. LeCun. Feature learning and deep architectures: New directions for music informatics. J. Intell. Info. Systems, 41(3):461–481, 2013.
-  E.J. Humphrey and J.P. Bello. From music audio to chord tablature: Teaching deep convolutional networks to play guitar. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages 6974–6978, May 2014.
-  C. Kereliuk, B. L. Sturm, and J. Larsen. Deep learning, audio adversaries, and music content analysis. In Proc. WASPAA, 2015.
-  H. Lee, Y. Largman, P. Pham, and A. Y. Ng. Unsupervised feature learning for audio classification using convolutional deep belief networks. In Proc. Neural Info. Process. Systems, Vancouver, B.C., Canada, Dec. 2009.
-  T. LH. Li, A. B. Chan, and A. HW. Chun. Automatic musical pattern feature extraction using convolutional neural network. In Proc. Int. Conf. Data Mining and Applications, 2010.
-  B. Matityaho and M. Furst. Neural network based model for classification of music type. In Proc. Conv. Electrical and Elect. Eng. in Israel, pages 1–5, Mar. 1995.
-  G. Montavon, G. B. Orr, and K.-R. Müller, editors. Neural Networks, Tricks of the Trade, Reloaded. Lecture Notes in Computer Science (LNCS 7700). Springer, 2012.
-  A. Nguyen, J. Yosinski, and J. Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proc. NIPS, 2014.
-  E. Pampalk, A. Flexer, and G. Widmer. Improvements of audio-based music similarity and genre classification. In Proc. Int. Soc. Music Info. Retrieval, pages 628–633, Sep. 2005.
-  G. Papadopoulos and G. Wiggins. AI methods for algorithmic composition: A survey, a critical view and future prospects. In Proc. AISB Symposium on Musical Creativity, pages 110–117, 1999.
-  A. Pikrakis. A deep learning approach to rhythm modeling with applications. In Proc. Int. Workshop Machine Learning and Music, 2013.
-  M. Schedl, A. Flexer, and J. Urbano. The neglected user in music information retrieval research. J. Intell. Info. Systems, 41(3):523–539, 2013.
-  D. Schwarz. Concatenative sound synthesis: The early years. J. New Music Research, 35(1):3–22, Mar. 2006.
-  S. Sigtia and S. Dixon. Improved music feature learning with deep neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages 6959–6963, May 2014.
-  C. N. Silla, A. L. Koerich, and C. A. A. Kaestner. The Latin music database. In Proc. ISMIR, 2008.
-  B. L. Sturm. An analysis of the GTZAN music genre dataset. In Proc. ACM MIRUM Workshop, pages 7–12, Nara, Japan, Nov. 2012.
-  B. L. Sturm. Classification accuracy is not enough: On the evaluation of music genre recognition systems. J. Intell. Info. Systems, 41(3):371–406, 2013.
-  B. L. Sturm. A simple method to determine if a music information retrieval system is a “horse”. IEEE Trans. Multimedia, 16(6):1636–1644, 2014.
-  B. L. Sturm. The state of the art ten years after a state of the art: Future research in music information retrieval. J. New Music Research, 43(2):147–172, 2014.
-  B. L. Sturm. A survey of evaluation in music genre recognition. In A. Nürnberger, S. Stober, B. Larsen, and M. Detyniecki, editors, Adaptive Multimedia Retrieval: Semantics, Context, and Adaptation, volume LNCS 8382, pages 29–66, Oct. 2014.
-  B. L. Sturm. “horse” inside: Seeking causes of the behaviours of music content analysis systems. ACM Computers in Entertainment, 2015 (submitted).
-  B. L. Sturm, C. Kereliuk, and J. Larsen. ¿El caballo viejo? Latin genre recognition with deep learning and spectral periodicity. In Proc. Int. Conf. on Mathematics and Computation in Music, 2015.
-  B. L. Sturm, C. Kereliuk, and A. Pikrakis. A closer look at deep learning neural networks with low-level spectral periodicity features. In Proc. Int. Workshop on Cognitive Info. Process., 2014.
-  C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. In Proc. ICLR, 2014.
-  G. Tzanetakis and P. Cook. Musical genre classification of audio signals. IEEE Trans. Speech Audio Process., 10(5):293–302, July 2002.
-  J. Urbano, M. Schedl, and X. Serra. Evaluation in music information retrieval. J. Intell. Info. Systems, 41(3):345–369, Dec. 2013.
-  A. van den Oord, S. Dieleman, and B. Schrauwen. Deep content-based music recommendation. In Proc. NIPS, 2013.
-  N. Vempala and F. Russo. Predicting emotion from music audio features using neural networks. In Proc. CMMR, 2012.
-  A. Wang. An industrial strength audio search algorithm. In Proc. Int. Soc. Music Info. Retrieval, Oct. 2003.
-  F. Weninger, F. Eyben, and B. Schuller. On-line continuous-time music mood regression with deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages 5412–5416, May 2014.
-  B. Whitman, G. Flake, and S. Lawrence. Artist detection in music with minnowmatch. Proc. IEEE Workshop on Neural Networks for Signal Processing, pages 559–568, 2001.
-  G. A. Wiggins. Semantic gap?? Schemantic schmap!! Methodological considerations in the scientific study of music. In Proc. IEEE Int. Symp. Multimedia, pages 477–482, Dec. 2009.
-  X. Yang, Q. Chen, S. Zhou, and X. Wang. Deep belief networks for automatic music genre classification. In Proc. INTERSPEECH, pages 2433–2436, 2011.
-  Y.-H. Yang and H. H. Chen. Music Emotion Recognition. CRC Press, 2011.
-  Chiyuan Zhang, G. Evangelopoulos, S. Voinea, L. Rosasco, and T. Poggio. A deep representation for invariance and music classification. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages 6984–6988, May 2014.