# An End-to-End Neural Network for Polyphonic Piano Music Transcription

###### Abstract

We present a supervised neural network model for polyphonic piano music transcription. The architecture of the proposed model is analogous to speech recognition systems and comprises an acoustic model and a music language model. The acoustic model is a neural network used for estimating the probabilities of pitches in a frame of audio. The language model is a recurrent neural network that models the correlations between pitch combinations over time. The proposed model is general and can be used to transcribe polyphonic music without imposing any constraints on the polyphony. The acoustic and language model predictions are combined using a probabilistic graphical model. Inference over the output variables is performed using the beam search algorithm. We perform two sets of experiments. We investigate various neural network architectures for the acoustic models and also investigate the effect of combining acoustic and music language model predictions using the proposed architecture. We compare the performance of the neural network based acoustic models with two popular unsupervised acoustic models. Results show that convolutional neural network acoustic models yield the best performance across all evaluation metrics. We also observe improved performance with the application of the music language models. Finally, we present an efficient variant of beam search that improves performance and reduces run-times by an order of magnitude, making the model suitable for real-time applications.

EDICS Category: AUD-MSP, AUD-MIR, MLR-DEEP

## I Introduction

Automatic Music Transcription (AMT) is a fundamental problem in Music Information Retrieval (MIR). AMT aims to generate a symbolic, score-like transcription, given a polyphonic acoustic signal. Music transcription is considered to be a difficult problem even by human experts and current music transcription systems fail to match human performance [klapuri2007signal]. Polyphonic AMT is a difficult problem because concurrently sounding notes from one or more instruments cause a complex interaction and overlap of harmonics in the acoustic signal. Variability in the input signal also depends on the specific type of instrument being used. Additionally, AMT systems with unconstrained polyphony have a combinatorially very large output space, which further complicates the modeling problem. Typically, variability in the input signal is captured by models that aim to learn the timbral properties of the instrument being transcribed [berg2014unsupervised, benetos2012shift], while the issues relating to a large output space are dealt with by constraining the models to have a maximum polyphony [klapuri2003multiple, emiya2008automatic].

The majority of current AMT systems are based on the principle of describing the input magnitude spectrogram as a weighted combination of basis spectra corresponding to pitches. The basis spectra can be estimated by various techniques such as non-negative matrix factorisation (NMF) and sparse decomposition. Unsupervised NMF approaches [smaragdis2003non, abdallah2004polyphonic] aim to learn a dictionary of pitch spectra from the training examples. However purely unsupervised approaches can often lead to bases that do not correspond to musical pitches, therefore causing issues with interpreting the results at test time. These issues with unsupervised spectrogram factorisation methods are addressed by incorporating harmonic constraints in the training algorithm [vincent2010adaptive, bertin2010enforcing]. Spectrogram factorisation based techniques were extended with the introduction of probabilistic latent component analysis (PLCA) [smaragdis2006probabilistic]. PLCA aims to fit a latent variable probabilistic model to normalised spectrograms. PLCA based models are easy to train with the expectation-maximisation (EM) algorithm and have been extended and applied extensively to AMT problems [grindlay2010probabilistic, benetos2012shift].

As an alternative to spectrogram factorisation techniques, there has been considerable interest in discriminative approaches to AMT. Discriminative approaches aim to directly classify features extracted from frames of audio to the output pitches. This approach has the advantage that instead of constructing instrument specific generative models, complex classifiers can be trained using large amounts of training data to capture the variability in the inputs. When using discriminative approaches, the performance of the classifiers is dependent on the features extracted from the signal. Recently, neural networks have been applied to raw data or low level representations to jointly learn the features and classifiers for a task [lecun2015deep]. Over the years there have been many experiments that evaluate discriminative approaches for AMT. Poliner and Ellis [poliner2007discriminative] use support vector machines (SVMs) to classify normalised magnitude spectra. Nam et al. [nam2011classification] superimpose an SVM on top of a deep belief network (DBN) in order to learn the features for an AMT task. Similarly, a bi-directional recurrent neural network (RNN) is applied to magnitude spectrograms for polyphonic transcription in [bock2012polyphonic].

In large vocabulary speech recognition systems, the information contained in the acoustic signal alone is often not sufficient to resolve ambiguities between possible outputs. A language model is used to provide a prior probability of the current word given the previous words in a sentence. Statistical language models are essential for large vocabulary speech recognition [rabiner1993fundamentals]. Similarly to speech, musical sequences exhibit temporal structure. In addition to an accurate acoustic model, a model that captures the temporal structure of music or a music language model (MLM), can potentially help improve the performance of AMT systems. Unlike speech, language models are not common in most AMT models due to the challenging problem of modelling the combinatorially large output space of polyphonic music. Typically, the outputs of the acoustic models are processed by pitch specific, two-state hidden Markov models (HMMs) that enforce smoothing and duration constraints on the output pitches [benetos2012shift, poliner2007discriminative]. However, extending this to modelling the high-dimensional outputs of a polyphonic AMT system has proved to be challenging, although there are some studies that explore this idea. A dynamic Bayesian network is used in [raczynski2013dynamic], to estimate prior probabilities of note combinations in an NMF based transcription framework. Similarly in [sigtiarnn], a recurrent neural network (RNN) based MLM is used to estimate prior probabilities of note sequences, alongside a PLCA acoustic model. A sequence transduction framework is proposed in [boulanger2013high], where the acoustic and language models are combined in a single RNN.

The ideas presented in this paper are extensions of the preliminary experiments in [sigtia2014hybrid]. We propose an end-to-end architecture for jointly training both the acoustic and the language models for an AMT task. We evaluate the performance of the proposed model on a dataset of polyphonic piano music. We train neural network acoustic models to identify the pitches in a frame of audio. The discriminative classifiers can in theory be trained on complex mixtures of instrument sources, without having to account for each instrument separately. The neural network classifiers can be directly applied to the time-frequency representation, eliminating the need for a separate feature extraction stage. In addition to the deep feed-forward neural network (DNN) and RNN architectures in [sigtia2014hybrid], we explore using convolutional neural nets (ConvNets) as acoustic models. ConvNets were initially proposed as classifiers for object recognition in computer vision, but have found increasing application in speech recognition [abdel2012applying, abdel2013exploring]. Although ConvNets have been applied to some problems in MIR [schluter2014improved, humphrey2012rethinking], they remain unexplored for transcription tasks. We also include comparisons with two state-of-the-art spectrogram factorisation based acoustic models [benetos2012shift, vincent2010adaptive] that are popular in AMT literature. As mentioned before, the high dimensional outputs of the acoustic model pose a challenging problem for language modelling. We propose using RNNs as an alternative to state space models like factorial HMMs [vincent2004music] and dynamic Bayesian networks [raczynski2013dynamic], for modeling the temporal structure of notes in music. RNN based language models were first used alongside a PLCA acoustic model in [sigtiarnn]. 
However, in that setup, the language model is used to iteratively refine the predictions in a feedback loop, resulting in a non-causal and theoretically unsatisfactory model. In the hybrid framework, approximate inference over the output variables is performed using beam search. However, beam search can be computationally expensive when used to decode long temporal sequences. We apply the efficient hashed beam search algorithm proposed in [sigtiachords] for inference. The new inference algorithm reduces decoding time by an order of magnitude and makes the proposed model suitable for real-time applications. Our results show that convolutional neural network acoustic models outperform the remaining acoustic models over a number of evaluation metrics. We also observe improved performance with the application of the music language models.

The rest of the paper is organised as follows: Section II describes the neural network models used in the experiment, Section III discusses the proposed model and the inference algorithm, Section IV details model evaluation and experimental results. Discussion, future work and conclusions are presented in Section V.

## II Background

In this section we describe the neural network models used for the acoustic and language modelling. Although neural networks are an old concept, they have recently been applied to a wide range of machine learning problems with great success [lecun2015deep]. One of the primary reasons for their recent success has been the availability of large datasets and large-scale computing infrastructure [dean2012large], which makes it feasible to train networks with millions of parameters. The parameters of any neural network architecture are typically estimated with numerical optimisation techniques. Once a suitable cost function has been defined, the derivatives of the cost with respect to the model parameters are found using the backpropagation algorithm [rumelhart1988learning] and parameters are updated using stochastic gradient descent (SGD) [lecun2012efficient]. SGD has the useful property that the model parameters are iteratively updated using small batches of data. This allows the training algorithm to scale to very large datasets. The layered, hierarchical structure of neural nets makes end-to-end training possible, which implies that the network can be trained to predict outputs from low-level inputs without extracting features. This is in contrast to many other machine learning models whose performance is dependent on the features extracted from the data. Their ability to jointly learn feature transformations and classifiers makes neural networks particularly well suited to problems in MIR [humphrey2013feature].

### II-A Acoustic Models

#### II-A1 Deep Neural Networks

DNNs are powerful machine learning models that can be used for classification and regression tasks. DNNs are characterised by having one or more layers of non-linear transformations. Formally, one layer of a DNN performs the following transformation:

$$h_{l+1} = f(W_l h_l + b_l) \tag{1}$$

In Equation 1, $W_l, b_l$ are the weight matrix and bias for layer $l$, and $f$ is some non-linear function that is applied element-wise. For the first layer, $h_0 = x$, where $x$ is the input. In all our experiments, we fix $f$ to be the sigmoid function ($f(x) = \frac{1}{1 + e^{-x}}$). The output of the final layer $h_L$ is transformed according to the given problem to yield a posterior probability distribution over the output variables $P(y|x, \theta)$. The parameters $\theta = \{W_l, b_l\}$ are numerically estimated with the backpropagation algorithm and SGD. Figure 1(a) shows a graphical representation of the DNN architecture; the dashed arrows represent intermediate hidden layers. For acoustic modelling, the input to the DNN is a frame of features, for example a magnitude spectrogram or the constant Q transform (CQT), and the DNN is trained to predict the probability of pitches present in the frame, $P(y_t|x_t)$, at some time $t$.
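As a rough illustration of the layer transformation in Equation 1, the following sketch stacks sigmoid layers in plain Python. The layer sizes, the weight values and the helper names `affine` and `dnn_forward` are invented for this example and are not taken from the paper's implementation.

```python
import math


def sigmoid(z):
    """Element-wise logistic function 1 / (1 + exp(-z))."""
    return [1.0 / (1.0 + math.exp(-v)) for v in z]


def affine(W, h, b):
    """Compute W h + b for a weight matrix W (list of rows) and vectors h, b."""
    return [sum(w * x for w, x in zip(row, h)) + bias for row, bias in zip(W, b)]


def dnn_forward(x, layers):
    """Apply h_{l+1} = sigmoid(W_l h_l + b_l) for each (W_l, b_l), starting from h_0 = x."""
    h = x
    for W, b in layers:
        h = sigmoid(affine(W, h, b))
    return h


# Hypothetical toy network: 3 spectral bins -> 2 hidden units -> 2 pitch outputs.
layers = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]),
    ([[1.0, -1.0], [-0.7, 0.4]], [0.0, 0.0]),
]
pitch_probs = dnn_forward([0.2, 0.9, 0.4], layers)
```

Because the final layer is also a sigmoid, each output can be read as an independent per-pitch probability, which matches the multi-label nature of polyphonic transcription.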

#### II-A2 Recurrent Neural Networks

DNNs are good classifiers for stationary data, like images. However, they are not designed to account for sequential data. RNNs are natural extensions of DNNs, designed to handle sequential or temporal data. This makes them more suited for AMT tasks, since consecutive frames of audio exhibit both short-term and long-term temporal patterns [eck2002finding]. RNNs are characterised by recursive connections between the hidden layer activations at some time $t$ and the hidden layer activations at $t-1$, as shown in Figure 1(b). Formally, the hidden layer of an RNN at time $t$ performs the following computation:

$$h_{l+1}^{t} = f(W_l^{f} h_l^{t} + W_l^{r} h_{l+1}^{t-1} + b_l) \tag{2}$$

In Equation 2, $W_l^{f}$ is the weight matrix from the input to the hidden units, $W_l^{r}$ is the weight matrix for the recurrent connection and $b_l$ are the biases for layer $l$. From Equation 2, we can see that the recursive update of the hidden state at time $t$ implies that $h^{t}$ is implicitly a function of all the inputs till time $t$, $x_0^t$. Similar to DNNs, RNNs are made up of one or more layers of hidden units. The outputs of the final layer are transformed with a suitable function to yield the desired distribution over the outputs. The RNN parameters are estimated using the backpropagation through time (BPTT) algorithm [werbos1990backpropagation] and SGD. For acoustic modelling, the RNN acts on a sequence of input features to yield a probability distribution over the outputs $P(y_t|x_0^t)$, where $x_0^t = \{x_0, x_1, \ldots, x_t\}$.
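The recurrence in Equation 2 can be sketched for a single hidden layer as follows. This is a minimal illustration only: the zero initial state, the parameter names `W_f`, `W_r` and `b`, and the toy values are assumptions made for the example.

```python
import math


def sigmoid(z):
    """Element-wise logistic function."""
    return [1.0 / (1.0 + math.exp(-v)) for v in z]


def matvec(W, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rnn_forward(xs, W_f, W_r, b):
    """Single-layer recurrence h_t = sigmoid(W_f x_t + W_r h_{t-1} + b)."""
    h = [0.0] * len(b)  # assumption: the hidden state starts at zero
    states = []
    for x in xs:
        pre = [f + r + bias
               for f, r, bias in zip(matvec(W_f, x), matvec(W_r, h), b)]
        h = sigmoid(pre)
        states.append(h)
    return states


# The same input frame presented twice gives different hidden states,
# because the second state also depends on the first one.
states = rnn_forward([[1.0], [1.0]], W_f=[[1.0]], W_r=[[1.0]], b=[0.0])
# Setting the recurrent weights to zero removes the dependence on history.
no_memory = rnn_forward([[1.0], [1.0]], W_f=[[1.0]], W_r=[[0.0]], b=[0.0])
```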

#### II-A3 Convolutional Networks

ConvNets are neural nets with a unique structure. Convolutional layers are specifically designed to preserve the spatial structure of the inputs. In a convolutional layer, a set of weights acts on a local region of the input. These weights are then repeatedly applied to the entire input to produce a feature map. Convolutional layers are characterised by the sharing of weights across the entire input. As shown in Figure 1(c), ConvNets comprise alternating convolutional and pooling layers, followed by one or more fully connected layers (same as DNNs). Formally, the repeated application of the shared weights to the input signal constitutes a convolution operation:

$$h_{j}[u, v] = f\Big(\sum_{c}\sum_{r}\sum_{s} W_{j,c,r,s}\, x_{c}[u + r, v + s] + b_j\Big) \tag{3}$$

In Equation 3, $h_j$ is the $j$-th feature map, $c$ indexes the input channels and $(r, s)$ range over the local region covered by the shared weights. The input $x$ is a vector of inputs from different channels, for example RGB channels for images. Formally, $x = \{x_0, x_1, \ldots\}$, where each input $x_c$ represents an input channel. Each input band $x_c$ has an associated weight matrix. All the weights of a convolutional layer are collectively represented as a four dimensional tensor. Given an $m \times m$ region from a feature map $h$, the max pooling function returns the maximum activation in the region. At any time $t$, the input to the ConvNet is a window of $2k+1$ feature frames $x_{t-k}^{t+k}$. The outputs of the final layer yield the posterior distribution $P(y_t|x_{t-k}^{t+k})$.
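To make the weight sharing and pooling steps concrete, here is a small one-dimensional sketch along a single (for instance log-frequency) axis. It is a simplification of the multi-channel, two-dimensional case described above: the function names and toy inputs are hypothetical, the non-linearity is omitted, and, as in most ConvNet implementations, the kernel is applied without flipping (cross-correlation).

```python
def conv1d_valid(signal, kernel, bias=0.0):
    """Slide one shared kernel over the input, keeping only fully covered positions."""
    k = len(kernel)
    return [sum(kernel[r] * signal[i + r] for r in range(k)) + bias
            for i in range(len(signal) - k + 1)]


def max_pool(feature_map, size):
    """Non-overlapping max pooling; a trailing partial region is dropped."""
    return [max(feature_map[i:i + size])
            for i in range(0, len(feature_map) - size + 1, size)]


kernel = [1, 2, 1]
# The same spectral pattern at two different positions on the axis.
low = [0, 1, 2, 1, 0, 0, 0, 0]
high = [0, 0, 0, 1, 2, 1, 0, 0]

map_low = conv1d_valid(low, kernel)
map_high = conv1d_valid(high, kernel)
pooled_low = max_pool(map_low, 2)
pooled_high = max_pool(map_high, 2)
```

The response to the shifted pattern is a shifted copy of the original response, which is the property that lets shared weights detect the same harmonic pattern at different pitches on a log-frequency axis.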

There are several motivations for using ConvNets for acoustic modelling. There are many experiments in MIR that suggest that rather than classifying a single frame of input, better prediction accuracies can be achieved by incorporating information over several frames of inputs [sigtiachords, boulanger2013audio, bergstra2006aggregate]. Typically, this is achieved either by applying a context window around the input frame or by aggregating information over time by calculating statistical moments over a window of frames. Applying a context window around a frame of low level spectral features, like the short-time Fourier transform (STFT), would lead to a very high dimensional input, which is impractical. Secondly, taking the mean, standard deviation or other statistical moments makes very simplistic assumptions about the distribution of data over time in neighbouring frames. ConvNets, due to their architecture [lecun2015deep], can be directly applied to several frames of inputs to learn features along both the time and the frequency axes. Additionally, when using an input representation like the CQT, ConvNets can learn pitch-invariant features, since inter-harmonic spacings in music signals are constant across log-frequency. Finally, the weight sharing and pooling architecture leads to a reduction in the number of ConvNet parameters, compared to a fully connected DNN. This is a useful property given that very large quantities of labelled data are difficult to obtain for most MIR problems, including AMT.

### II-B Music Language Models

Given a sequence $y = y_0^t$, we use the MLM to define a prior probability distribution $P(y)$. Here $y_t$ is a high-dimensional binary vector that represents the notes being played at time $t$ (one time-step of a piano-roll representation). The high dimensional nature of the output space makes modelling $y_t$ a challenging problem. Most post-processing algorithms make the simplifying assumption that all the pitches are independent and model their temporal evolution with independent models [poliner2007discriminative]. However, for polyphonic music, the pitches that are active concurrently are highly correlated (harmonies, chords). In this section, we describe the RNN music language models first introduced in [boulanger2012modeling].

#### II-B1 Generative RNN

The RNNs defined in the earlier sections were used to map a sequence of inputs $x$ to a sequence of outputs $y$. At each time-step $t$, the RNN outputs the conditional distribution $P(y_t|x_0^t)$. However, RNNs can be used to define a distribution over some sequence $y$ by connecting the outputs of the RNN at $t-1$ to the inputs of the RNN at $t$, resulting in a distribution of the form:

$$P(y) = P(y_0) \prod_{t > 0} P(y_t|y_0^{t-1}) \tag{4}$$

Although an RNN predicts $y_t$ conditioned on the high dimensional inputs $y_0^{t-1}$, the individual pitch outputs $y_t(i)$ are independent, where $i$ is the pitch index (Section IV-C). As mentioned earlier, this is not true for polyphonic music. Boulanger-Lewandowski et al. [boulanger2012modeling] demonstrate that rather than predicting independent distributions, the parameters of a more complicated parametric output distribution can be conditioned on the RNN hidden state. In our experiments, we use the RNN to output the biases of a neural autoregressive distribution estimator (NADE) [boulanger2012modeling].

#### II-B2 Neural Autoregressive Distribution Estimator

The NADE is a distribution estimator for high dimensional binary data [larochelle2011neural]. The NADE was initially proposed as a tractable alternative to the restricted Boltzmann machine (RBM). The NADE estimates the joint distribution over high dimensional binary variables as follows:

$$P(x) = \prod_{i} P(x_i|x_0^{i-1})$$

The NADE is similar to a fully visible sigmoid belief network [neal1992connectionist], since the conditional probability of $x_i$ is a non-linear function of $x_0^{i-1}$. The NADE computes the conditional distributions according to:

$$h_i = \sigma(W_{:,<i}\, x_0^{i-1} + b_h) \tag{5}$$

$$P(x_i|x_0^{i-1}) = \sigma(V_i h_i + b_v^{i}) \tag{6}$$

where $W, V$ are weight matrices, $W_{:,<i}$ is the submatrix of $W$ made up of the columns with index less than $i$, and $b_h, b_v$ are the hidden and visible biases, respectively. The gradients of the likelihood function with respect to the model parameters can be found exactly, which is not possible with RBMs [larochelle2011neural]. This property allows the NADE to be readily combined with other models and the models can be jointly trained with gradient based optimisers.
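The autoregressive computation of the NADE conditionals can be sketched as below. This is a minimal illustration under assumed conventions: the hidden weights `W` are stored with shape (hidden, D), the output weights `V` with shape (D, hidden), and all parameter values are made up for the example.

```python
import itertools
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def nade_log_prob(x, W, V, b_h, b_v):
    """Log-probability of a binary vector x as a product of sigmoid conditionals."""
    n_hidden = len(b_h)
    act = list(b_h)  # running hidden pre-activation over the variables seen so far
    log_p = 0.0
    for i, x_i in enumerate(x):
        h = [sigmoid(a) for a in act]
        p_i = sigmoid(sum(V[i][j] * h[j] for j in range(n_hidden)) + b_v[i])
        log_p += math.log(p_i if x_i else 1.0 - p_i)
        # Fold x_i into the pre-activation used by the next conditional.
        for j in range(n_hidden):
            act[j] += W[j][i] * x_i
    return log_p


# Hypothetical parameters for D = 3 visible and 2 hidden units.
W = [[0.5, -1.0, 0.3], [0.2, 0.7, -0.4]]
V = [[1.0, -0.5], [0.3, 0.8], [-1.2, 0.6]]
b_h = [0.1, -0.2]
b_v = [0.0, 0.5, -0.3]

# The probabilities of all 2^3 configurations sum to one by construction.
total = sum(math.exp(nade_log_prob(list(bits), W, V, b_h, b_v))
            for bits in itertools.product([0, 1], repeat=3))
```

Updating the pre-activation incrementally keeps the cost of one likelihood evaluation linear in the number of variables, which is part of what makes the model tractable.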

#### II-B3 RNN-NADE

In order to learn high dimensional, temporal distributions for the MLM, we combine the NADE and an RNN, as proposed in [boulanger2012modeling]. The resulting model yields a sequence of NADEs conditioned on an RNN, which describes a distribution over sequences of polyphonic music. The joint model is obtained by letting the parameters of the NADE at each time step be a function of the RNN hidden state $h^{t}$, where $h^{t}$ is the hidden state of the final layer of the RNN (Equation 2) at time $t$. In order to limit the number of free parameters in the model, we only allow the NADE biases to be functions of the RNN hidden state, while the remaining parameters ($W, V$) are held constant over time. We compute the NADE biases as a linear transformation of the RNN hidden state plus an added bias term [boulanger2012modeling]:

$$b_v^{t} = b_v + W_1 h^{t} \tag{7}$$

$$b_h^{t} = b_h + W_2 h^{t} \tag{8}$$

$W_1$ and $W_2$ are weight matrices from the RNN hidden state to the visible and hidden biases, respectively. The gradients with respect to all the model parameters can be easily computed using the chain rule and the joint model is trained using the BPTT algorithm [boulanger2012modeling].
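The conditioning step can be illustrated with a few lines of Python. The sketch assumes that the time-dependent visible and hidden biases are the static biases plus a linear map of the RNN hidden state; the names `W_1`, `W_2` and `conditioned_biases`, and the numbers used, are chosen for this example only.

```python
def matvec(W, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def conditioned_biases(h_t, b_v, b_h, W_1, W_2):
    """Return the time-dependent biases (b_v + W_1 h_t, b_h + W_2 h_t)."""
    b_v_t = [b + d for b, d in zip(b_v, matvec(W_1, h_t))]
    b_h_t = [b + d for b, d in zip(b_h, matvec(W_2, h_t))]
    return b_v_t, b_h_t


# Toy sizes: 2 RNN hidden units, 3 pitches, 1 NADE hidden unit.
b_v_t, b_h_t = conditioned_biases(
    h_t=[1.0, 0.0],
    b_v=[0.0, 0.0, 0.0],
    b_h=[0.5],
    W_1=[[1, 0], [0, 1], [2, 2]],
    W_2=[[-1, 3]],
)
```

Only the biases change from frame to frame, so the number of parameters that depend on the RNN state grows with the bias dimensions rather than with the size of the NADE weight matrices.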

## III Proposed Model

In this section we review the proposed neural network model for polyphonic AMT. As mentioned earlier, the model comprises an acoustic model and a music language model. In addition to the acoustic models in [sigtia2014hybrid], we propose the use of ConvNets for identifying pitches present in the input audio signal and compare their performance to various other acoustic models (Section IV-F). The acoustic and language models are combined under a single training objective using a hybrid RNN architecture, yielding an end-to-end model for AMT with unconstrained polyphony. We first describe the hybrid RNN model, followed by a description of the proposed inference algorithm.

### III-A Hybrid RNN

The hybrid RNN is a graphical model that combines the predictions of any arbitrary frame level acoustic model with an RNN-based language model. Let $x = x_0^T$ be a sequence of inputs and let $y = y_0^T$ be the corresponding transcriptions. The joint probability of $y, x$ can be factorised as follows:

$$\begin{aligned} P(y, x) &= P(y_0 \ldots y_T, x_0 \ldots x_T) \\ &= P(y_0) P(x_0|y_0) \prod_{t=1}^{T} P(y_t|y_0^{t-1}) P(x_t|y_t) \end{aligned} \tag{9}$$

The factorisation in Equation 9 makes the following independence assumptions:

$$P(y_t|y_0^{t-1}, x_0^{t-1}) = P(y_t|y_0^{t-1}) \tag{10}$$

$$P(x_t|y_0^{t}, x_0^{t-1}) = P(x_t|y_t) \tag{11}$$

These independence assumptions are similar to the assumptions made in HMMs [rabiner1989tutorial]. Figure 2 is a graphical representation of the hybrid model. In Equation 9, $P(x_t|y_t)$ is the emission probability of an input, given output $y_t$. Using Bayes' rule, the conditional distribution can be written as follows:

$$P(y|x) \propto P(y_0|x_0) \prod_{t=1}^{T} P(y_t|y_0^{t-1}) P(y_t|x_t) \tag{12}$$

where the marginals $P(y_t)$ and priors $P(x_t)$ are assumed to be fixed w.r.t. the model parameters.

With this reformulation of the joint distribution, we observe that the conditional distribution is directly proportional to the product of two distributions. The prior distribution $P(y_t|y_0^{t-1})$ is obtained using a generative RNN (Section II-B1) and the posterior distribution over note-combinations $P(y_t|x_t)$ can be modelled using any frame based classifier. The hybrid RNN graphical model is similar to an HMM, where the state transition probabilities for the HMM, $P(y_t|y_{t-1})$, have been generalised to include connections from all previous outputs, resulting in the $P(y_t|y_0^{t-1})$ terms in Equation 12.
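In log space, the product in Equation 12 becomes a sum of per-frame acoustic and language model terms. The sketch below scores a candidate piano-roll under that form. It assumes independent per-pitch acoustic probabilities, and `toy_lm_log_prob` is a hypothetical stand-in for the RNN-NADE that simply favours repeating the previous frame.

```python
import math


def acoustic_log_prob(y_t, probs_t):
    """log P(y_t | x_t) under independent per-pitch probabilities."""
    return sum(math.log(p if y else 1.0 - p) for y, p in zip(y_t, probs_t))


def toy_lm_log_prob(history, y_t):
    """Hypothetical language model: each pitch keeps its previous state with prob. 0.8."""
    same = sum(a == b for a, b in zip(history[-1], y_t))
    return same * math.log(0.8) + (len(y_t) - same) * math.log(0.2)


def hybrid_log_score(ys, acoustic_probs, lm_log_prob):
    """Unnormalised log P(y | x): acoustic terms for all t, language model terms for t >= 1."""
    score = 0.0
    for t, (y_t, probs_t) in enumerate(zip(ys, acoustic_probs)):
        score += acoustic_log_prob(y_t, probs_t)
        if t > 0:
            score += lm_log_prob(ys[:t], y_t)
    return score


# Two frames, two pitches; the second frame is acoustically ambiguous.
acoustic_probs = [[0.9, 0.1], [0.6, 0.4]]
smooth = [[1, 0], [1, 0]]
jumpy = [[1, 0], [0, 1]]
smooth_score = hybrid_log_score(smooth, acoustic_probs, toy_lm_log_prob)
jumpy_score = hybrid_log_score(jumpy, acoustic_probs, toy_lm_log_prob)
```

In this toy case both the acoustic and the language model terms prefer the sequence that sustains the first pitch, so it receives the higher combined score.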

For the problem of automatic music transcription, the input time-frequency representation forms the input sequence $x$, while the output piano-roll sequence $y$ denotes the transcriptions. The priors $P(y_t|y_0^{t-1})$ are obtained from the RNN-NADE MLM, while the posterior distributions $P(y_t|x_t)$ are obtained from the acoustic models. The models can then be trained by finding the derivatives of the acoustic and language model objectives with respect to the model parameters and training using gradient descent. The independent training of the acoustic and language models is a useful property since datasets available for music transcription are considerably smaller in size as compared to datasets in computer vision and speech. However, large corpora of MIDI music are relatively easy to find on the internet. Therefore, in theory, the MLMs can be trained on large corpora of MIDI music, analogous to language model training in speech.

### III-B Inference

At test time, we would like to find the mode of the conditional output distribution:

$$y^{*} = \operatorname*{argmax}_{y} P(y|x) \tag{13}$$

From Equation 12, we observe that the priors $P(y_t|y_0^{t-1})$ tie the predictions of the acoustic model $P(y_t|x_t)$ to all the predictions made till time $t$. This prior term encourages coherence between predictions over time and allows musicological structure learnt by the language models to influence successive predictions. However, this more general structure leads to a more complex inference (or decoding) procedure at test time. This is due to the fact that at time $t$, the history $y_0^{t-1}$ has not been optimally determined. Therefore, the optimum choice of $y_t$ depends on all the past model predictions. Proceeding greedily in a chronological manner by selecting the $y_t$ that optimises $P(y_t|x_t)$ does not necessarily yield good solutions. We are interested in solutions that globally optimise $P(y|x)$. But exhaustively searching for the best sequence is intractable since the number of possible configurations of $y_t$ is exponential in the number of output pitches ($2^n$ for $n$ pitches).

Beam search is a graph search algorithm that is commonly used to decode the conditional outputs of an RNN [graves2012sequence, boulanger2013high, sigtiachords]. Beam search scales to arbitrarily long sequences and the computational cost versus accuracy trade-off can be controlled via the width of the beam. The inference algorithm comprises the following steps: at any time $t$, the algorithm maintains at most $w$ partial solutions, where $w$ is the beam width or the beam capacity. The solutions in the beam at $t$ correspond to sub-sequences of length $t$. Next, all possible descendants of the $w$ partial solutions in the beam are enumerated and then sorted in decreasing order of log-likelihood. From these candidate solutions, the top $w$ solutions are retained as beam entries for further search. Beam search can be readily applied to problems where the number of candidate solutions at each step is limited, like speech recognition [boulanger2014phone] and audio chord estimation [sigtiachords]. However, using beam search for decoding sequences with a large output space is prohibitively inefficient.

When the space of candidate solutions is large, the algorithm can be constrained to consider only $K$ new candidates for each partial solution in the beam, where $K$ is known as the branching factor. The procedure for selecting the $K$ candidates can be designed according to the given problem. For the hybrid architecture, from Equation 12 we note:

$$P(y_0^{t}|x_0^{t}) \propto P(y_0^{t-1}|x_0^{t-1})\, P(y_t|y_0^{t-1})\, P(y_t|x_t) \tag{14}$$

At time $t$, the partial solutions in the beam correspond to configurations of $y_0^{t-1}$. Therefore, given $P(y_0^{t-1}|x_0^{t-1})$, the $K$ configurations that maximise $P(y_t|y_0^{t-1}) P(y_t|x_t)$ would be a suitable choice of candidates for $y_t$. However, for many families of distributions, it might not be possible to enumerate $y_t$ in decreasing order of likelihood. In [boulanger2013high], the authors propose forming a pool of $K$ candidates by drawing random samples from the conditional output distributions. However, random sampling can be inefficient and obtaining independent samples can be very expensive for many types of distributions. As an alternative, we propose to sample solutions from the posterior distribution of the acoustic model $P(y_t|x_t)$ [sigtia2014hybrid]. There are two main motivations for doing this. Firstly, the outputs of the acoustic model are independent class probabilities. Therefore, it is easy to enumerate samples in decreasing order of log-likelihood [boulanger2013high]. Secondly, we avoid the accumulation of errors in the RNN predictions over time [bengio2015scheduled]. The RNN models are trained to predict $y_t$, given the true outputs $y_0^{t-1}$. However, at test time, outputs sampled from the RNN are fed back as inputs at the next time step. This discrepancy between the training and test objectives can cause prediction errors to accumulate over time.
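When the acoustic outputs are independent per-pitch probabilities, candidate configurations can be listed in decreasing order of log-likelihood without enumerating all of them. One possible way to do this, sketched below, starts from the rounded probabilities and uses a heap over sets of flipped pitches. This is an illustrative procedure and not necessarily the one used in the cited works; it assumes at least one candidate is requested and every probability lies strictly between 0 and 1.

```python
import heapq
import math


def top_k_configurations(probs, k):
    """Return up to k (log_prob, bits) pairs in decreasing order of log-probability."""
    # Most likely configuration: round every probability.
    best = [1 if p >= 0.5 else 0 for p in probs]
    best_lp = sum(math.log(max(p, 1.0 - p)) for p in probs)
    # Non-negative cost of flipping each pitch away from its best value, ascending.
    costs = sorted((abs(math.log(p) - math.log(1.0 - p)), i)
                   for i, p in enumerate(probs))
    results = [(best_lp, best)]
    heap = [(costs[0][0], 0, (0,))] if costs else []
    while heap and len(results) < k:
        cost, last, flipped = heapq.heappop(heap)
        bits = list(best)
        for j in flipped:
            bits[costs[j][1]] ^= 1
        results.append((best_lp - cost, bits))
        if last + 1 < len(costs):
            nxt = costs[last + 1][0]
            # Either flip one more pitch (the next cheapest one) ...
            heapq.heappush(heap, (cost + nxt, last + 1, flipped + (last + 1,)))
            # ... or swap the most expensive flipped pitch for the next one.
            heapq.heappush(heap, (cost - costs[last][0] + nxt, last + 1,
                                  flipped[:-1] + (last + 1,)))
    return results


example_probs = [0.9, 0.2, 0.6]
ranked = top_k_configurations(example_probs, 8)
```

Each pop yields the next most likely configuration, so requesting the top $K$ candidates costs on the order of $K \log K$ heap operations rather than a scan over all $2^n$ configurations.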

Although generating candidates from the acoustic model yields good results, it requires the use of large beam widths. This makes the inference procedure computationally slow and unsuitable for real-time applications [sigtia2014hybrid]. In this study, we propose using the hashed beam search algorithm proposed in [sigtiachords]. Beam search is fundamentally limited when decoding long temporal sequences. This is due to the fact that solutions that differ at only a few time-steps can saturate the beam. This causes the algorithm to search a very limited space of possible solutions. This issue can be solved by efficient pruning. The hashed beam search algorithm improves efficiency by pruning solutions that are similar to solutions with a higher likelihood. The metric that determines the similarity of sequences can be chosen in a problem dependent manner and is encoded in the form of a locality sensitive hash function [sigtiachords]. In Algorithm 1, we outline the beam search algorithm used for our experiments, while Algorithm 2 describes the hash table beam object. Algorithms 1 and 2 are expressed in terms of a sequence, the log-likelihood of that sequence, the acoustic and language model objects, and the hash function.

There are two key differences between Algorithm 1 and the algorithm in [sigtia2014hybrid]. First, the priority queue that stores the beam is replaced by a hash table beam object (see Algorithm 2). Secondly, for each entry in the beam we evaluate $K$ candidate solutions, where $K$ is the branching factor. This is in contrast to the algorithm in [sigtia2014hybrid], where once the beam is full, only $w$ candidate solutions are evaluated per iteration, $w$ being the beam width. It might appear that the hashed beam search algorithm is more expensive, since it evaluates $wK$ candidates instead of $w$ candidates. However, by efficiently pruning similar solutions, the algorithm yields better results for much smaller values of $w$, resulting in a significant increase in efficiency (Section IV-F, Figure 3).

Algorithm 2 describes the hash table beam object. The hashed beam search algorithm offers several advantages compared to the standard beam search algorithm. The notion of similarity of solutions can be encoded in the form of hash functions. For music transcription, we choose the similarity function to be the last $n$ frames in a sequence. Setting $n = 1$ corresponds to a dynamic programming like decoding (similar to HMMs) where all sequences with the same final state are considered to be equivalent, and the sequence with the highest log-likelihood is retained. Setting $n$ to the length of the sequence corresponds to regular beam search. Additionally, the hashed beam search algorithm can maintain more than one solution per hash key through a process called chaining [cormen2001introduction].
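The pruning idea can be sketched in a few lines. This is a simplified illustration rather than a transcription of the paper's algorithms: it keeps a single solution per hash key (no chaining), takes the per-frame acoustic candidates as precomputed `(frame, log-probability)` pairs, requires `n_hash >= 1`, and uses a hypothetical `toy_lm_log_prob` in place of the RNN-NADE.

```python
import math


def toy_lm_log_prob(history, y_t):
    """Hypothetical language model: each pitch keeps its previous state with prob. 0.8."""
    if not history:
        return 0.0
    same = sum(a == b for a, b in zip(history[-1], y_t))
    return same * math.log(0.8) + (len(y_t) - same) * math.log(0.2)


def hashed_beam_search(frame_candidates, lm_log_prob, beam_width, n_hash):
    """Decode a sequence, keeping only the best partial solution per hash key."""
    beam = [((), 0.0)]  # entries are (sequence so far, log-likelihood)
    for candidates in frame_candidates:
        table = {}  # hash key -> best (sequence, log-likelihood) sharing that key
        for seq, ll in beam:
            for y_t, acoustic_lp in candidates:
                new_seq = seq + (y_t,)
                new_ll = ll + acoustic_lp + lm_log_prob(seq, y_t)
                key = new_seq[-n_hash:]  # similarity: the last n_hash frames
                if key not in table or new_ll > table[key][1]:
                    table[key] = (new_seq, new_ll)
        beam = sorted(table.values(), key=lambda e: e[1], reverse=True)[:beam_width]
    return beam[0]


# Two frames, two pitches; the acoustic model alone slightly prefers a jump.
frame_candidates = [
    [((1, 0), math.log(0.9)), ((0, 1), math.log(0.1))],
    [((0, 1), math.log(0.55)), ((1, 0), math.log(0.45))],
]
best_seq, best_ll = hashed_beam_search(frame_candidates, toy_lm_log_prob,
                                       beam_width=2, n_hash=1)
```

With `n_hash=1`, all partial solutions ending in the same frame compete for one slot, which mirrors the dynamic programming like behaviour described above; larger values of `n_hash` keep more distinct histories alive.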

## IV Evaluation

In this section we describe how the performance of the proposed model is evaluated for a polyphonic transcription task.

### IV-A Dataset

We evaluate the proposed model on the MAPS dataset [emiya2010multipitch]. The dataset consists of audio and corresponding annotations for isolated sounds, chords and complete pieces of piano music. For our experiments, we use only the full musical pieces for training and testing the neural network acoustic models and MLMs. The dataset consists of 270 pieces of classical music and MIDI annotations. There are 9 categories of recordings corresponding to different piano types and recording conditions, with 30 recordings per category. 7 categories of audio are produced by software piano synthesisers, while 2 sets of recordings are obtained from a Yamaha Disklavier upright piano. Therefore the dataset consists of 210 synthesised recordings and 60 real recordings.

We perform 2 sets of investigations in this paper. The first set of experiments investigates the effect of the RNN MLMs on the predictions of the acoustic models. For this task, we divide the entire dataset into 4 disjoint train/test splits, so as to ensure that the folds are music piece-independent. Specifically, for some of the musical pieces in the dataset, audio for each piece is rendered using more than one piano. Therefore, while creating the splits, we ensure that the training and test data do not contain any overlapping pieces (details available at: http://www.eecs.qmul.ac.uk/~sss31/TASLP/info.html). For each split, we select of the data for training ( musical pieces) and the remaining for testing ( pieces). From each training split, we hold out tracks as a validation set for selecting the hyper-parameters for the training algorithm (Section IV-D). All the reported results are mean values of the evaluation metrics over the splits. From now on, this evaluation configuration will be referred to as Configuration 1.
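The piece-independence constraint can be illustrated with a short sketch. It is not the procedure used to build the published splits (those are documented at the URL given in the text). The function name, the `(track_id, piece_name)` input format and the round-robin assignment of pieces to folds are all assumptions made for illustration.

```python
def piece_independent_folds(tracks, n_folds):
    """Assign tracks to folds so that all renderings of a piece share one fold.

    `tracks` is a list of (track_id, piece_name) pairs.
    """
    pieces = sorted({piece for _, piece in tracks})
    # Round-robin assignment of pieces (not tracks) to folds.
    fold_of_piece = {piece: i % n_folds for i, piece in enumerate(pieces)}
    folds = [[] for _ in range(n_folds)]
    for track_id, piece in tracks:
        folds[fold_of_piece[piece]].append(track_id)
    return folds


# Piece "p1" is rendered on two pianos; both renderings land in the same fold.
tracks = [("a_p1", "p1"), ("b_p1", "p1"), ("a_p2", "p2"), ("b_p3", "p3"), ("a_p4", "p4")]
folds = piece_independent_folds(tracks, n_folds=4)
```

Because folds are built from piece names rather than track identifiers, no piece can appear on both sides of a train/test boundary.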

Although the above experimental setup is useful for investigating the effectiveness of the RNN MLMs, the training set contains examples from piano models which are used for testing. This is usually not true in practice, where the instrument models/sources at test time are unknown and usually do not coincide with the instruments used for training. A majority of experiments with the MAPS dataset train and test models on disjoint instrument types [benetos2012shift, berg2014unsupervised, o2014polyphonic]. We thus perform a second set of experiments to compare the performance of the different neural network acoustic models in a more realistic setting. We train the acoustic models using the 210 tracks created using synthesised pianos (180 tracks for training and 30 tracks for validation) and we test the acoustic models on the 60 audio recordings obtained from Yamaha Disklavier piano recordings (models ‘ENSTDkAm’ and ‘ENSTDkCl’ in the MAPS database). In this experiment, we do not apply the language models since the train and test sets contain overlapping musical pieces. In addition to the neural network acoustic models, we include comparisons with two state-of-the-art unsupervised acoustic models [benetos2012shift, vincent2010adaptive] for both experiments. This instrument source-independent evaluation configuration will be referred to from now on as Configuration 2.

### IV-B Metrics

We use both frame-based and note-based metrics to assess the performance of the proposed system [bay2009evaluation]. Frame-based evaluations are made by comparing the transcribed binary output and the MIDI ground truth frame-by-frame. For note-based evaluation, the system returns a list of notes, along with the corresponding pitches, onset and offset times. We use the F-measure, precision, recall and accuracy for both frame-based and note-based evaluation. Formally, the frame-based metrics are defined as:

$$\text{Precision} = \frac{\sum_{t} TP[t]}{\sum_{t} \left(TP[t] + FP[t]\right)}$$

$$\text{Recall} = \frac{\sum_{t} TP[t]}{\sum_{t} \left(TP[t] + FN[t]\right)}$$

$$\text{Accuracy} = \frac{\sum_{t} TP[t]}{\sum_{t} \left(TP[t] + FP[t] + FN[t]\right)}$$

$$\text{F-measure} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

where $TP[t]$ is the number of true positives for the event at $t$, $FP[t]$ is the number of false positives and $FN[t]$ is the number of false negatives. The summation over $t$ is carried out over the entire test data. Analogous note-based metrics can be defined similarly [bay2009evaluation]. A note event is assumed to be correct if its predicted pitch onset is within a range of the ground truth onset.
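As an illustration of the frame-based metrics, the sketch below counts true positives, false positives and false negatives over all frames and pitches, then derives precision, recall, accuracy and F-measure. It assumes the standard definitions of [bay2009evaluation], with accuracy taken as TP / (TP + FP + FN). The function name and the list-of-lists piano-roll format are hypothetical.

```python
def frame_metrics(predicted, reference):
    """Frame-based metrics for binary piano rolls (lists of 0/1 frames)."""
    tp = fp = fn = 0
    for pred_frame, ref_frame in zip(predicted, reference):
        for p, r in zip(pred_frame, ref_frame):
            if p and r:
                tp += 1
            elif p:
                fp += 1
            elif r:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f_measure": f_measure}


# Two frames, three pitches: 2 true positives, 1 false positive, 1 false negative.
predicted = [[1, 0, 1], [0, 1, 0]]
reference = [[1, 1, 0], [0, 1, 0]]
scores = frame_metrics(predicted, reference)
```

Counts are pooled over the whole test set before the ratios are taken, which matches the summation over all frames described in the text.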

### IV-C Preprocessing

We transform the input audio to a time-frequency representation which is then input to the acoustic models. In [sigtia2014hybrid], we used the magnitude short-time Fourier transform (STFT) as input to the acoustic models. However, here we experiment with the constant Q transform (CQT) as the input representation. There are two motivations for this. Firstly, the CQT is fundamentally better suited as a time-frequency representation for music signals, since the frequency axis is linear in pitch [brown1991calculation]. Another advantage of using the CQT is that the resulting representation is much lower dimensional than the STFT. Having a lower dimensional representation is useful when using neural network acoustic models as it reduces the number of parameters in the model.

We downsample the audio to kHz from kHz. We then compute CQTs over octaves with bins per octave and a hop size of samples, resulting in a dimensional input vector of real values, with a frame rate of frames per second. Additionally, we compute the mean and standard deviation of each dimension over the training set and transform the data by subtracting the mean and dividing by the standard deviation. These pre-processed vectors are used as inputs to the acoustic model. For the language model training, we sample the MIDI ground truth transcriptions of the training data at the same rate as the audio ( ms). We obtain sequences of 88-dimensional binary vectors for training the RNN-NADE language models. The outputs correspond to notes A0-C8 on a piano.
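The per-dimension standardisation step can be sketched as follows. The function names are hypothetical, frames are plain lists of floats, and the guard against zero standard deviation is an added assumption that the text does not mention. The statistics are estimated on the training set only and then reused for any other data.

```python
import math


def fit_standardiser(train_frames):
    """Compute the per-dimension mean and standard deviation of the training frames."""
    n = len(train_frames)
    dims = len(train_frames[0])
    means = [sum(f[d] for f in train_frames) / n for d in range(dims)]
    stds = []
    for d in range(dims):
        var = sum((f[d] - means[d]) ** 2 for f in train_frames) / n
        # Assumption: constant dimensions are left unscaled to avoid dividing by zero.
        stds.append(math.sqrt(var) or 1.0)
    return means, stds


def standardise(frames, means, stds):
    """Subtract the mean and divide by the standard deviation, per dimension."""
    return [[(x - m) / s for x, m, s in zip(f, means, stds)] for f in frames]


train = [[1.0, 10.0], [3.0, 10.0]]
means, stds = fit_standardiser(train)
scaled = standardise(train, means, stds)
```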

The test audio is sampled at a frame rate of Hz yielding frames per test file. For test files over splits, we obtain a total of frames at test time. (It should be noted that carrying out statistical significance tests on a track level is an over-simplification in the context of multi-pitch detection, as argued in [BenetosThesis].)

### IV-D Network Training

In this section we describe the details of the training procedure for the various acoustic model architectures and the RNN-NADE language model. All the acoustic models have 88 units in the output layer, corresponding to the output pitches. The outputs of the final layer are transformed by a sigmoid function and yield independent pitch probabilities. All the models are trained by maximising the log-likelihood over all the examples in the training set.

#### IV-D1 DNN Acoustic Models

For DNN training, we constrain all the hidden layers of the model to have the same number of units to simplify searching for good model architectures. We perform a grid search over the following parameters: number of layers , number of hidden units , hidden unit activations where ReLU is the rectified linear unit activation function [glorot2011deep]. We found Dropout [srivastava2014dropout] to be essential for improving generalisation performance. A Dropout rate of was used for the input layer and all the hidden layers of the network. Rather than using learning rate and momentum update schedules, we use ADADELTA [zeiler2012adadelta] to adapt the learning rate over iterations. In addition to Dropout, we use early stopping to minimise overfitting. Training was stopped if the cost over the validation set did not decrease for epochs. We used mini batches of size for the SGD updates.
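The early stopping rule described above can be sketched in a few lines. The function name and the `patience` parameter are hypothetical, and the validation costs are supplied as a precomputed list purely for illustration. In practice each cost would come from evaluating the model after one epoch of training.

```python
def early_stopping_epoch(validation_costs, patience):
    """Return (best_epoch, best_cost), stopping once the validation cost
    has not decreased for `patience` consecutive epochs."""
    best_cost = float("inf")
    best_epoch = -1
    for epoch, cost in enumerate(validation_costs):
        if cost < best_cost:
            best_cost, best_epoch = cost, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return best_epoch, best_cost


# The cost improves until epoch 1, then fails to improve for two epochs.
costs = [1.0, 0.8, 0.9, 0.85, 0.7]
stopped = early_stopping_epoch(costs, patience=2)
```

Note that the later, lower cost of 0.7 is never reached in this toy run, which is the usual trade-off when choosing a small patience value.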

#### IV-D2 RNN Acoustic Models

For RNN training, we constrain all the hidden layers to have the same number of units. We perform a grid search over the following parameters: , . We fix the hidden activations of the recurrent layers to be the hyperbolic tangent function. We found that ADADELTA was not particularly well suited for training RNNs. We use an initial learning rate of and linearly decrease it to over iterations. We use a constant momentum rate of . The training sequences are further divided into sub-sequences of length . The SGD updates are made one sub-sequence at a time, without any mini batching. Similar to the DNNs, we use early stopping and stop training if the validation cost does not decrease after iterations. In order to prevent gradient explosion in the early stages of training, we use gradient clipping [bengio2013advances]. We clip the gradients when the norm of the gradient is greater than 5.
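A minimal sketch of norm-based gradient clipping is given below. The text only states that gradients are clipped when their norm exceeds 5. Rescaling the gradient so that its norm equals the threshold is the common choice and is assumed here. The function name and the flat-list gradient format are hypothetical.

```python
import math


def clip_by_norm(gradient, max_norm):
    """Rescale `gradient` so that its Euclidean norm does not exceed `max_norm`."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= max_norm:
        return list(gradient)
    scale = max_norm / norm
    return [g * scale for g in gradient]


# A gradient of norm 10 is rescaled to norm 5; its direction is unchanged.
clipped = clip_by_norm([6.0, 8.0], max_norm=5.0)
unchanged = clip_by_norm([0.3, 0.4], max_norm=5.0)
```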

#### IV-D3 ConvNet Acoustic Models

The input to the ConvNet is a context window of frames and the target is the central frame in the window [sigtiachords]. The frames at the beginning and end of the audio are zero padded so that a context window can be applied to each frame. Although pooling can be performed along both axes, we only perform pooling over the frequency axis. We performed a grid search over the following parameters: window size , number of convolutional layers , number of filters per layer , number of fully connected layers , number of hidden units in fully connected layers . The convolution activation functions were fixed to be the hyperbolic tangent function, while all the fully connected layer activations were set to the sigmoid function. The pooling size is fixed to be for all convolutional layers. Dropout with rate is applied to all convolutional layers. We tried a large number of window shapes for the convolutional layer and the following subset of window shapes yielded good results: . We observed that classification performance deteriorated sharply for longer filters along the frequency axis. Dropout was applied to all the fully connected layers. The model parameters were trained with SGD and a batch size of . An initial learning rate of was linearly decreased to over iterations. A constant momentum rate was used for all the updates. We stopped training if the validation error did not decrease after iterations over the entire training set.
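The construction of zero-padded context windows can be sketched as follows. The function name is hypothetical, the window size is assumed to be odd so that a central frame exists, and the one-dimensional frames in the example are toy values.

```python
def context_windows(frames, window_size):
    """Return one window of `window_size` frames centred on each input frame.

    The sequence is zero padded at both ends so that every frame, including
    the first and last, sits at the centre of its own window.
    """
    assert window_size % 2 == 1, "an odd window size gives a central frame"
    half = window_size // 2
    zero = [0.0] * len(frames[0])
    padded = [zero] * half + list(frames) + [zero] * half
    return [padded[i:i + window_size] for i in range(len(frames))]


frames = [[1.0], [2.0], [3.0]]
windows = context_windows(frames, window_size=3)
```

Each window is paired with the label of its central frame during training, so the number of training examples equals the number of frames.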

Acoustic Model | Thresholding (Frame) | Thresholding (Note) | HMM (Frame) | HMM (Note) | Hybrid Architecture (Frame) | Hybrid Architecture (Note) |
---|---|---|---|---|---|---|
Benetos [benetos2012shift] | | | | | | |
Vincent [vincent2010adaptive] | | | | | | |
DNN | | | | | | |
RNN | | | | | | |
ConvNet | | | | | | |

Acoustic Model | Frame | Note | Frame | Note | Frame | Note |
---|---|---|---|---|---|---|
Benetos [benetos2012shift] | | | | | | |
Vincent [vincent2010adaptive] | | | | | | |
DNN | | | | | | |
RNN | | | | | | |
ConvNet | | | | | | |

Acoustic Model | Benetos [benetos2012shift] | Vincent [vincent2010adaptive] | DNN | RNN | ConvNet |
---|---|---|---|---|---|
F-measure (Frame) | | | | | |
F-measure (Note) | | | | | |

#### IV-D4 RNN-NADE Language Models

The RNN-NADE models were trained with SGD and with sequences of length . We performed a grid search over the following parameters: number of recurrent units and number of hidden units for the NADE . The model was trained with an initial learning rate of which was linearly reduced to over iterations. A constant momentum rate of was applied throughout training.
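The linearly decreasing learning rate used for the SGD-trained models can be written as a small schedule function. This is a sketch only: the function name is hypothetical and the numerical values in the example are placeholders, not the settings used in the experiments.

```python
def linear_schedule(initial, final, n_iterations, step):
    """Learning rate at `step`, moving linearly from `initial` to `final`
    over `n_iterations` and staying at `final` afterwards."""
    if step >= n_iterations:
        return final
    return initial + (final - initial) * step / n_iterations


# Placeholder values: decay from 0.1 to 0.0 over 100 iterations.
start_rate = linear_schedule(0.1, 0.0, 100, step=0)
mid_rate = linear_schedule(0.1, 0.0, 100, step=50)
end_rate = linear_schedule(0.1, 0.0, 100, step=150)
```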

We selected the model architectures by performing a grid search over the parameter values described earlier in the section. The various models were evaluated on one train/test split and the best performing architecture was then used for all other experiments.

### IV-E Comparative Approaches

For comparative purposes, two state-of-the-art polyphonic music transcription methods were used for experiments [benetos2012shift, vincent2010adaptive]. In both cases, the non-binary pitch activation output of the aforementioned methods was extracted, for performing an in-depth comparison with the proposed neural network models. The multi-pitch detection method of [vincent2010adaptive] is based on non-negative matrix factorization (NMF) and operates by decomposing an input time-frequency representation as a series of basis spectra (representing pitches) and component activations (indicating pitch activity across time). This method models each basis spectrum as a weighted sum of narrowband spectra representing a few adjacent harmonic partials, enforcing harmonicity and spectral smoothness. As input time-frequency representation, an Equivalent Rectangular Bandwidth (ERB) filterbank is used. Since the method relies on a dictionary of (hand-crafted) narrowband harmonic spectra, system parameters remain the same for the two evaluation configurations.

The multiple-instrument transcription method of [benetos2012shift] is based on shift-invariant PLCA (a convolutive and probabilistic counterpart of NMF). In this model, the input time-frequency representation is decomposed into a series of basis spectra per pitch and instrument source which are shifted across log-frequency, thus supporting tuning changes and frequency modulations. Outputs include the pitch activation distribution and the instrument source contribution per pitch. Contrary to the parametric model of [vincent2010adaptive], the basis spectra are pre-extracted from isolated musical instrument sounds. As in the proposed method, the input time-frequency representation of [benetos2012shift] is the CQT. For the investigations with MLMs (Configuration 1), the PLCA models are trained on isolated sound examples from all 9 piano models from the MAPS database (in order for the experiments to be comparable with the proposed method). For the second set of experiments, which investigate the generalisation capabilities of the models (Configuration 2), the PLCA acoustic model is trained on isolated sounds from the synthesised pianos and tested on recordings created using the Yamaha Disklavier piano.

### IV-F Results

In this section we present results from the experiments on the MAPS dataset. As mentioned before, all results are the mean values of various metrics computed over the different train/test splits. The acoustic models yield a sequence of probabilities for the individual pitches being active (posteriograms). The post-processing methods are used to transform the posteriograms to a binary piano-roll representation. The various performance metrics (both frame and note based) are then computed by comparing the outputs of the systems to the ground truth.

Model | Architecture |
---|---|
DNN | |
RNN | |
ConvNet | |
RNN-NADE | |

We consider 3 kinds of post-processing methods. The simplest post-processing method is to apply a threshold to the output pitch probabilities obtained from the acoustic model. We select the threshold that maximises the F-measure over the entire training set and use this threshold for testing. Pitches with probabilities greater than the threshold are set to 1, while the remaining pitches are set to 0. The second post-processing method uses individual pitch HMMs, similar to [poliner2007discriminative]. The HMM parameters (transition probabilities, pitch marginals) are obtained by counting the frequency of each event over the MIDI ground truth data. The binary pitch outputs are obtained using Viterbi decoding [rabiner1989tutorial], where the scaled likelihoods are used as emission probabilities. Finally, we combine the acoustic model predictions with the RNN-NADE MLMs and obtain binary transcriptions using beam search.
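The first two post-processing methods can be contrasted with a small sketch for a single pitch. It assumes a two-state (off/on) HMM per pitch and takes the scaled likelihood to be the acoustic posterior divided by the state prior, which is the usual hybrid formulation. The function names and all parameter values are hypothetical. In the experiments the HMM parameters are estimated from the MIDI ground truth.

```python
import math


def threshold_pitch(posteriors, threshold):
    """Binarise per-frame pitch probabilities with a fixed threshold."""
    return [1 if p > threshold else 0 for p in posteriors]


def viterbi_pitch(posteriors, prior_on, stay_on, stay_off):
    """Decode one pitch's on/off sequence with a two-state HMM.

    posteriors: per-frame probabilities that the pitch is active.
    prior_on:   marginal probability of the pitch being active (0 < prior_on < 1).
    stay_on, stay_off: self-transition probabilities of the two states.
    """
    eps = 1e-12
    prior = [1.0 - prior_on, prior_on]
    trans = [[stay_off, 1.0 - stay_off], [1.0 - stay_on, stay_on]]  # trans[from][to]

    def emission(p, state):
        posterior = p if state == 1 else 1.0 - p
        # Scaled likelihood in the log domain: log posterior - log prior.
        return math.log(max(posterior, eps)) - math.log(prior[state])

    score = [math.log(prior[s]) + emission(posteriors[0], s) for s in (0, 1)]
    back = []
    for p in posteriors[1:]:
        new_score, pointers = [], []
        for s in (0, 1):
            prev = max((0, 1), key=lambda q: score[q] + math.log(trans[q][s]))
            pointers.append(prev)
            new_score.append(score[prev] + math.log(trans[prev][s]) + emission(p, s))
        score = new_score
        back.append(pointers)
    # Trace back from the most likely final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


posteriors = [0.9, 0.9, 0.4, 0.9, 0.9]
thresholded = threshold_pitch(posteriors, threshold=0.5)
smoothed = viterbi_pitch(posteriors, prior_on=0.5, stay_on=0.9, stay_off=0.9)
```

In this toy example the threshold splits the note at the third frame, whereas the HMM bridges the short dip because leaving and re-entering the active state is penalised by the transition probabilities.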

In Table I, we present F-scores (both frame and note based) for all the acoustic models and the 3 post-processing methods using Configuration 1. From the table, we note that all the neural network models outperform the PLCA and NMF models in terms of frame-based F-measure by . The DNN and RNN acoustic model performance is similar, while the ConvNet acoustic model clearly outperforms all the other models. The ConvNets yield an absolute improvement of over the other neural network models, while outperforming the spectrogram factorisation models by in frame-wise F-measure. For the note-based F-measure, the RNN and ConvNet models perform better than the DNN acoustic model. This is largely due to the fact that these models include context information in their inputs, which implicitly smooths the output predictions.

We compare the different post-processing methods for Configuration 1 by observing the rows of Table I. We note that the MLM leads to improved performance on both frame-based and note-based F-measure for all the acoustic models. The performance increase is larger on the note-based F-measure. The relative improvement in performance is largest for the DNN acoustic model, compared to the RNN and the ConvNet. This could be due to the fact that the independence assumption in Equation 11 is violated by the RNN and ConvNet, which include context information while making predictions. This leads to some factors being counted twice and we observe a smaller performance improvement in this case. From Rows 1 and 2 of Table I we observe that the RNN-NADE MLM yields a performance increase for the PLCA and NMF acoustic models, though the relative improvement is smaller than for the neural network acoustic models. This might be due to the fact that, unlike the neural network models, these models are not trained to maximise the conditional probability of output pitches given the acoustic inputs. Another contributing factor is that the PLCA and NMF posteriograms represent the energy distribution over pitches rather than explicit pitch probabilities, which results in many activations being greater than 1. This discrepancy in the scale of the acoustic and language predictions leads to an unequal weighting of predictions when used in the hybrid RNN framework. In Table I we observe that the acoustic model in [vincent2010adaptive] outperforms all other acoustic models on the note-based F-measure, while the frame-based F-measure is significantly lower. This can be attributed to the use of an ERB filterbank input representation, which offers improved temporal resolution over the CQT for lower frequencies.

In Table II, we present additional metrics (precision, recall and accuracy) for all the acoustic models after decoding with an RNN-MLM, using Configuration 1. We observe that the NMF and PLCA models have low frame-based precision and high recall, and the converse for the note-based precision. For the neural network models, we observe smaller differences between both the frame-based and note-based precision and recall values. Amongst all the neural network models, we observe that the ConvNet outperforms all the other models on all the metrics.

In Table III, we present F-measures for experiments where the acoustic models are trained on synthesised data and tested on real data (Configuration 2). From the table we note that the frame-based F-measure for the DNN and RNN models is similar to that of the PLCA model and the model in [vincent2010adaptive]. We note that the ConvNet outperforms all other models on the frame-based F-measure by . On the note-based evaluations, we observe that both the RNN and DNN are outperformed by all the other models. The ConvNet performance is similar to the PLCA model, while the acoustic model from [vincent2010adaptive] again has the best performance on the note-based metrics.

We now discuss details of the inference algorithm. The high dimensional hashed beam search algorithm has the following parameters: the beam width , the branching factor , number of entries per hash table entry and the similarity metric (Algorithm ). We observed that a value of produced good results. Larger values of do not yield a significant performance increase and result in much longer run times, therefore we set for all experiments. We observed that small values of (number of solutions per hash table entry), produced good results. Decoding accuracies deteriorate sharply for large values of , as observed in [sigtiachords]. Therefore, we set the number of entries per hash key for all experiments. We let the similarity metric be the last emitted symbols, . We experimented with varying the values of and observed that we were able to achieve good performance for small , . We did not observe any performance improvement for large , therefore for all experiments we fix . Figure 3 is a plot showing the effect of beam width on transcription performance. The results are average values of decoding accuracies over splits. We compare performance of the hashed beam search with the high dimensional beam search in [sigtia2014hybrid]. From Figure 3 we observe that the hashed beam search algorithm is able to achieve performance improvement with significantly smaller beam-widths. For instance, the high dimensional beam search algorithm takes hours to decode the entire test set with , while the hashed beam search takes minutes, with and achieves better decoding accuracy.

Figure 4 is a graphical representation of the outputs of a ConvNet acoustic model. We observe that some of the longer notes are fragmented and the offsets are estimated incorrectly. One reason for this is that the ground truth offsets do not necessarily correspond to the offsets in the acoustic signal (due to the effects of the sustain pedal), implying noisy offsets in the ground truth. We also observe that the model does not make many harmonic errors in its predictions.

## V Conclusions and Future Work

In this paper, we present a hybrid RNN model for polyphonic AMT of piano music. The model comprises a neural network acoustic model and an RNN based music language model. We propose using a ConvNet for acoustic modelling, which to the best of the authors’ knowledge, has not been attempted before for AMT. Our experiments on the MAPS dataset demonstrate that the neural network acoustic models, especially the ConvNet, outperform 2 popular acoustic models from the AMT literature. We also observe that the RNN MLMs consistently improve performance on all evaluation metrics. The proposed inference algorithm with the hash beam search is able to yield good decoding accuracies with significantly shorter run times, making the model suitable for real-time applications.

We now discuss some of the limitations of the proposed model. As discussed earlier, one of the main contributing factors to the success of deep neural networks has been the availability of very large datasets. However, datasets available for AMT research are considerably smaller than datasets available in speech, computer vision and natural language processing (NLP). Therefore, the applicability of deep neural networks for acoustic modelling is limited to datasets with large amounts of labelled data, which is not common in AMT (at least in non-piano music). Although the neural network acoustic models perform competitively, their performance could be further improved in many ways. Noise or deformations can be added to training examples to encourage the classifiers to be invariant to commonly encountered input transformations. Additionally, the CQT input representation can be replaced by a representation with higher temporal resolution (like the ERB or a variable-Q transform), to improve performance on note-based metrics.

The abundance of musical score data and recent progress in NLP tasks with neural networks provide strong motivation for further investigations into MLMs for AMT. Although our results demonstrate some improvement in transcription performance with MLMs, there are several limitations and open questions that remain. The MLMs are trained on binary vectors sampled from the MIDI ground truth. Depending on the sampling rate, most note events are repeated many times in this representation. The MLMs are trained to predict the next frame of notes, given an input sequence of binary note combinations. In cases where the same notes are repeated many times, log-likelihood can be trivially maximised by repeating previous inputs. This causes the MLM to perform a smoothing operation, rather than imposing any kind of musical structure on the outputs. A potential solution would be to perform beat-aligned language modelling for the training and the test data, rather than sampling the MIDI at some arbitrary sampling rate. Additionally, RNNs can be extended to include duration models for each of their pitch outputs, similar to second order HMMs. However, this is a challenging problem and currently remains unexplored. It would also be interesting to encourage RNNs to learn longer temporal note patterns by interfacing RNN controllers with external memory units [grefenstette2015learning] and also to incorporate a notion of timing or metre in the input representation for the MLMs.

The effect of tonality on the performance of the MLMs should be further investigated. The MLMs should ideally be invariant to transpositions of a musical piece to different pitches. The MIDI ground truth can be easily transposed to any tonality. MLMs can be trained on inputs with transposed tonalities, or individual MLMs can be trained for each key. Additionally, the fully connected input layer of the RNN MLM can be substituted with a convolutional layer, with convolutions along the pitch axis to encourage the network to be invariant to pitch transpositions.
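Transposing the MIDI ground truth, as suggested above, amounts to shifting every frame of the binary piano roll along the pitch axis. The sketch below is illustrative only: the function name is hypothetical, the example uses a toy 4-pitch range instead of the full piano range, and dropping notes that leave the range is an assumption.

```python
def transpose_roll(piano_roll, semitones):
    """Shift every frame of a binary piano roll by `semitones` pitch bins.

    Notes that would fall outside the pitch range are dropped.
    """
    n_pitches = len(piano_roll[0])
    shifted = []
    for frame in piano_roll:
        new_frame = [0] * n_pitches
        for pitch, active in enumerate(frame):
            target = pitch + semitones
            if active and 0 <= target < n_pitches:
                new_frame[target] = 1
        shifted.append(new_frame)
    return shifted


roll = [[1, 0, 0, 1], [0, 1, 0, 0]]
up_one = transpose_roll(roll, 1)
down_one = transpose_roll(roll, -1)
```

Applying several such shifts to each training sequence gives the transposed training inputs mentioned in the text, at the cost of a proportionally larger training set.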

Another limitation of the proposed hybrid model is that the conditional probability in Equation 11 is derived by assuming that the predictions at time $t$ are only a function of the input at $t$ and independent of all other inputs and outputs. The violation of this assumption leads to certain factors being counted twice and therefore reduces the impact of the MLMs. The results clearly demonstrate that the improvements with the MLM are largest when the acoustic model is frame-based. The improvements are comparatively lower when the MLM is combined with predictions from an RNN or ConvNet acoustic model. This is problematic since the ConvNet acoustic models yield the best performance.