Comparing Probabilistic Models for Melodic Sequences
Abstract
Modeling the real-world complexity of music is a challenge for machine learning. We address the task of modeling melodic sequences from the same music genre. We perform a comparative analysis of two probabilistic models: a Dirichlet Variable Length Markov Model (DirichletVMM) and a Time Convolutional Restricted Boltzmann Machine (TCRBM). We show that the TCRBM learns descriptive music features, such as underlying chords and typical melody transitions and dynamics. We assess the models for future prediction and compare their performance to a VMM, which is the current state of the art in melody generation. We show that both models perform significantly better than the VMM, with the DirichletVMM marginally outperforming the TCRBM. Finally, we evaluate the short order statistics of the models, using the Kullback-Leibler divergence between test sequences and model samples, and show that our proposed methods match the statistics of the music genre significantly better than the VMM.
Keywords:
melody modeling, music feature extraction, time convolutional restricted Boltzmann machine, variable length Markov model, Dirichlet prior
1 Introduction
In this paper we are interested in learning a generative model for melody directly from musical sequences. This task is challenging for machine learning methods. Repetition of musical phrases, which is essential for Western music, can occur in almost arbitrary points in time and with different degrees of variation. Furthermore, although pieces from the same genre are built using the same structural principles, the statistical relations among and within melodies from different pieces are highly complex, as melody depends on several different components, such as scale, rhythm and meter, which in many cases interdepend on each other.
Capturing the statistical regularities within a musical genre is a first step towards realistic music generation. Additionally, identifying and representing these dependencies in an unsupervised manner is particularly desirable, as descriptive features of the underlying structure of music can not only help in the analysis and synthesis of music, but also enhance the performance on a variety of musical tasks such as genre classification and music retrieval.
In this work we consider two methods for the problem of melody modeling: a Time Convolutional Restricted Boltzmann Machine (TCRBM) and a Dirichlet Variable Length Markov Model (DirichletVMM). The first is an adaptation of the Convolutional RBM (Lee et al., 2009) for modeling sequential data and is motivated by the ability of RBM type models to extract high quality latent features from the input space. The second is a non-latent variable model and a novel form of the VMM, which is regarded as state of the art in melody generation (Paiement, 2008).
Our purpose is to answer the following questions. Are these probabilistic models able to learn the inherent structure in melodic sequences and generate samples that respect the statistics of the music genre? What aspects of the musical structure can each of the models learn? Can melodies be decomposed into a set of musical features in the same way that images can be decomposed into sets of edges and documents into sets of topics?
We train the models on a set of traditional reel tunes and perform a comparative analysis of these with a standard VMM. We show that the TCRBM learns descriptive music features, such as underlying chordal structure, musical motifs and transformations of those. We assess the models on future prediction and find that our proposed methods perform significantly better than the standard VMM and are comparable to each other, with the DirichletVMM having slightly higher log-likelihood. Likewise, we evaluate the short order statistics of model samples, using the Kullback-Leibler divergence, and show that samples from the TCRBM and the DirichletVMM match the statistics of the test data significantly better than samples from the VMM.
2 Related Work
In many cases, the difficulties associated with modeling music have been dealt with by incorporating domain knowledge in the models. In this line of research, Paiement (2008) proposes modeling different aspects of music, such as chord progressions, rhythm and melody, using graphical models and Input-Output HMMs. The structure of the models and the data representations used are based on musical theory. Additionally, Weiland et al. (2005) propose a Hierarchical Hidden Markov Model (HHMM) for pitch. The HHMM is structurally simple and its internal states are predefined with respect to music assumptions.
A different course of research examines more general machine learning methods, which are able to automatically capture complex relations in sequential data, without introducing much prior knowledge. In this paper we are taking this approach and consider models that do not make assumptions explicit to music.
Lavrenko and Pickens (2003) propose Markov Random Fields (MRFs) for modeling polyphonic music. In order for the MRF to remain tractable, much information needs to be discarded, thus making the model less suitable for realistic music.
Eck and Schmidhuber (2002) show that a Long Short-Term Memory (LSTM) Recurrent Neural Network can successfully model long-term structure in two simple musical tasks. In Eck and Lapalme (2008) the LSTM is extended to include meter information. The output of the network is conditioned on the current chord and specific previous timesteps, chosen according to the metrical boundaries. Trained on a set of traditional Irish reels, the LSTM is shown to generate pieces that respect the reel style.
Finally, Dubnov et al. (2003) propose Incremental Parsing (IP) and Prediction Suffix Trees (PSTs) for modeling melodies, the latter being the data structure used to represent VMMs. Both algorithms train simple dictionary-based predictors that parse music into a lexicon of phrases or motifs. Paiement (2008) argues that despite their simple nature, these two models generate impressive musical results when sampled and can be considered state of the art in melody generation.
3 Preliminaries
3.1 Musical Motifs
Before describing the models, we explain the concept of motifs and their importance to music modeling, as we believe it is useful in understanding the types of structures that the VMM and the TCRBM are trying to capture.
In Western Music, the smallest building block of a piece is called a motif. Motifs typically comprise three, four or more notes and most pieces can be expressed as a combination of different motifs and their transformations. Frequent transformations include replacement, splitting and merging of notes, and typically respect the metrical boundaries of a piece. We believe that successful capturing of music motifs can be very useful when modeling melodies, as specific motifs and their transformations are highly likely to be repeated within a piece, as well as among pieces from the same musical form.
3.2 Variable Length Markov Model
The VMM (Ron et al., 1994) is a statistical model for discrete sequential data and has been shown to generate state of the art musical results when modeling melodies (Dubnov et al., 2003). Its advantage over a standard Markov Model (n-gram) is that its order is not fixed, but instead depends on the observed context.
A VMM is represented by a Prediction Suffix Tree. The edges of the tree are labeled with symbols from the alphabet, in this case the different music notes. Each node defines the conditional probability distribution of the next symbol given the context we acquire by concatenating all the edge symbols from the root to the node.
To learn the tree, we start from a single root node labeled by the empty string and ‘grow’ the tree using a breadth-first search for contexts that satisfy the following criteria:

The length of a context is upper bounded by a fixed length

The frequency counts of a context exceed a fixed threshold

The ratio of the conditional probability distribution defined at a node with that defined at its parent node exceeds a fixed threshold
The resulting tree comprises contexts corresponding to musical phrases that appear frequently in the data and convey significant information about the value of future timesteps. After the tree is built, the empirical conditional probability distributions are smoothed by adding a constant probability to all symbols in the alphabet and renormalizing.
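The growth criteria and the smoothing step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the function names (`learn_contexts`, `predict`), the parameter names (`max_len`, `min_count`, `ratio`, `smooth`) and the exact form of the ratio test (some symbol whose conditional probability exceeds the parent's by a factor of at least `ratio`) are all hypothetical. For brevity, the sketch enumerates candidate contexts by length instead of growing a tree breadth-first, and it does not prune the descendants of rejected contexts.

```python
from collections import Counter, defaultdict

def learn_contexts(seqs, alphabet, max_len=3, min_count=2, ratio=1.5, smooth=0.01):
    """Return {context tuple: smoothed next-symbol distribution}."""
    counts = defaultdict(Counter)              # counts[context][next_symbol]
    for seq in seqs:
        for t in range(len(seq)):
            for n in range(min(t, max_len) + 1):   # criterion 1: bounded length
                counts[tuple(seq[t - n:t])][seq[t]] += 1

    def empirical(ctx):
        total = sum(counts[ctx].values())
        return {s: counts[ctx][s] / total for s in alphabet}

    kept = {(): empirical(())}                 # root node: empty context
    for ctx in sorted(counts, key=len):        # shorter contexts first
        if not ctx or sum(counts[ctx].values()) < min_count:
            continue                           # criterion 2: frequent enough
        p_ctx, p_par = empirical(ctx), empirical(ctx[1:])  # parent drops oldest symbol
        # criterion 3 (assumed form): the context changes some probability enough
        if any(p_par[s] > 0 and p_ctx[s] / p_par[s] >= ratio for s in alphabet):
            kept[ctx] = p_ctx

    # smoothing: add a constant to every symbol and renormalize
    z = 1.0 + smooth * len(alphabet)
    return {ctx: {s: (p[s] + smooth) / z for s in alphabet} for ctx, p in kept.items()}

def predict(model, history):
    """Next-symbol distribution from the longest stored suffix of `history`."""
    for n in range(len(history), -1, -1):
        ctx = tuple(history[len(history) - n:])
        if ctx in model:
            return model[ctx]
```

The root context is always stored, so `predict` falls back to the smoothed marginal distribution when no longer suffix of the history was retained.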
3.3 Restricted Boltzmann Machine
The Restricted Boltzmann Machine (RBM) is a two-layer undirected graphical model with a set of visible and a set of hidden units. It is a special, bipartite form of the Boltzmann Machine (Ackley et al., 1985), in which the interaction terms are restricted to units from different layers. The joint distribution over observed and latent variables is defined through an energy function, which assigns a scalar energy to every possible configuration of the variables:
$$p(\mathbf{v}, \mathbf{h}) = \frac{1}{Z(\theta)} \exp\left(-E(\mathbf{v}, \mathbf{h}; \theta)\right) \tag{1}$$
where $Z(\theta)$ is a normalizing constant called the partition function and $\theta$ is used to denote the set of model parameters.
In its original form, an RBM has binary, logistic units in both layers, and the energy function is
$$E(\mathbf{v}, \mathbf{h}; \theta) = -\mathbf{a}^\top \mathbf{v} - \mathbf{b}^\top \mathbf{h} - \mathbf{v}^\top W \mathbf{h} \tag{2}$$
where $\mathbf{a}$ and $\mathbf{b}$ are the biases for the visible and hidden units, respectively, and $W$ is the weight matrix for the interaction terms.
Inference in this model can be performed efficiently using block Gibbs sampling, as due to the bipartite structure of the model, the conditional distributions of the hidden units given the visibles and of the visible units given the hiddens factorize.
Maximum Likelihood learning in the RBM is difficult due to the partition function, which is typically intractable.
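To make the factorized conditionals concrete, the following is a minimal sketch of one block Gibbs step for a binary RBM. It assumes the standard binary energy $E(\mathbf{v}, \mathbf{h}) = -\mathbf{a}^\top \mathbf{v} - \mathbf{b}^\top \mathbf{h} - \mathbf{v}^\top W \mathbf{h}$, with visible biases `a`, hidden biases `b` and weights `W[i][j]`; the function names are illustrative and the code is written with plain lists for readability rather than speed.

```python
import math
import random

def sigmoid(x):
    # numerically stable logistic function
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def sample_hidden(v, W, b, rng):
    """p(h_j = 1 | v) = sigmoid(b_j + sum_i v_i W_ij); units are independent given v."""
    probs = [sigmoid(b[j] + sum(v[i] * W[i][j] for i in range(len(v))))
             for j in range(len(b))]
    return [1 if rng.random() < p else 0 for p in probs], probs

def sample_visible(h, W, a, rng):
    """p(v_i = 1 | h) = sigmoid(a_i + sum_j W_ij h_j); units are independent given h."""
    probs = [sigmoid(a[i] + sum(W[i][j] * h[j] for j in range(len(h))))
             for i in range(len(a))]
    return [1 if rng.random() < p else 0 for p in probs], probs

def gibbs_step(v, W, a, b, rng):
    """One block Gibbs update: sample all hidden units, then all visible units."""
    h, _ = sample_hidden(v, W, b, rng)
    v_new, _ = sample_visible(h, W, a, rng)
    return v_new, h
```

Because each layer is sampled as a block, one full update costs two matrix-vector products regardless of the number of units.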
4 Models
4.1 DirichletVMM
The VMM is similar to an n-gram model in that its performance is significantly influenced by the smoothing technique used. An alternative to a standard form of variable length Markov model is a hierarchical model, where each conditional multinomial distribution in the tree is sampled from a Dirichlet distribution, centered at the sample multinomial for the parent node. In this model smoothing is performed implicitly by taking a Bayesian approach and introducing an appropriate prior distribution at each node while building the tree.
More formally, let be defined by
Then we model each conditional distribution as:
(3) 
This forms a hierarchical tree with the marginal distribution as the root node, and successively more specific conditional distributions as we traverse down the tree. The intermediate nodes, though each identified with a particular distribution, are not used directly to model the data; that is done by the leaf nodes.
Learning this hierarchical distribution involves learning the posterior distributions at each level of the hierarchy from the data associated with the given node (i.e. the data that satisfies the conditional distribution).
(4) 
where the function counts the number of occurrences of sequence in the dataset where the last element is in state , and denotes expectation.
The mean of the posterior Dirichlet at each node is the prior Dirichlet for the data at the child nodes. Note the top levels of the hierarchy have a large amount of associated data, but as we progress down the tree the amount of data reduces. In the limit where there is no data the posterior distribution for that node is just given by the posterior for the parent node.
This model is directly related to the sequence memoizer (Wood et al., 2009), but is a finite model using Dirichlet distributions, instead of a Pitman-Yor model. Using Dirichlet distributions makes the inference procedure entirely conjugate and thus no sampling is required. We call this model a DirichletVMM in this paper.
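The conjugate update can be illustrated with a short sketch. It assumes a particular parameterization that the text does not spell out: the prior at each node is a Dirichlet with a single global concentration `alpha` and mean equal to the posterior mean of the parent node, and the root uses a uniform prior mean. The names `context_counts`, `posterior_mean` and `alpha` are hypothetical.

```python
from collections import Counter, defaultdict

def context_counts(seqs, max_len):
    """counts[context][next_symbol] for all contexts up to length max_len."""
    counts = defaultdict(Counter)
    for seq in seqs:
        for t in range(len(seq)):
            for n in range(min(t, max_len) + 1):
                counts[tuple(seq[t - n:t])][seq[t]] += 1
    return counts

def posterior_mean(context, counts, alphabet, alpha=1.0):
    """Posterior mean of the next-symbol distribution for `context`.

    The prior mean is the parent's posterior mean (the parent drops the
    oldest symbol); the root uses a uniform prior mean (an assumption).
    """
    context = tuple(context)
    if context:
        prior = posterior_mean(context[1:], counts, alphabet, alpha)
    else:
        prior = {s: 1.0 / len(alphabet) for s in alphabet}
    c = counts.get(context, Counter())
    total = sum(c.values())
    # Dirichlet-multinomial conjugacy: add observed counts to the pseudo-counts
    return {s: (c[s] + alpha * prior[s]) / (total + alpha) for s in alphabet}
```

With no data for a context, `total` is zero and the expression reduces to the parent's posterior mean, which mirrors the limiting behaviour described above. Nodes with many observations are dominated by their own counts.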
4.2 Time Convolutional RBM
We propose a Time Convolutional RBM (TCRBM) as a new way of modeling sequential data with an RBM type network. We believe that models based on the RBM are particularly suitable for capturing the componential structure of music, as they can learn distributed representations of the input space, decoupling the different factors of variation into features being “on” or “off”. The TCRBM is an adaptation of the Convolutional RBM for sequences and it is motivated by the successful application of such models in static image data (Lee et al., 2009; Norouzi et al., 2009).
Previous RBM approaches to sequence modeling use the RBM to model a single timestep and attempt to capture the temporal relations in the data by introducing different types of directed connections from units in previous timesteps (Taylor et al., 2007; Sutskever and Hinton, 2007; Taylor and Hinton, 2009). In contrast, the TCRBM is a fully undirected network and attempts to capture the structure of music at a motif level rather than a single timestep.
The TCRBM is depicted in Fig. 1. Local temporal dependencies are captured by learning an RBM on visible subsequences of fixed length  instead of single data points. This allows the hidden units to learn valid configurations for a whole subsequence and thus capture frequent motifs and their transformations. Longer sequences are modelled by applying convolution through time. This weight sharing mechanism allows us to better model boundary effects and provides the model with translation invariance along time, which is desirable as motifs can appear anywhere in a musical piece.
The energy function of the TCRBM is defined as:
(5) 
where is a visible sequence, is the hidden configuration for that sequence and is the size of the filter we apply
Similarly to an RBM, the joint probability distribution of the observed and hidden sequence under the TCRBM is defined as .
The conditional probability distributions of this model factorize over time and units and are given by softmax and logistic functions:
(6) 
(7) 
Inference can be performed using block Gibbs sampling. The computation of (6) and (7) can be performed efficiently by convolving along the time dimension the appropriate slice of the weight tensor with the hidden and visible sequence respectively. As in the RBM, learning can be performed using the Contrastive Divergence rule.
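As an illustration of how the conditionals can be computed by sliding the filters through time, here is a minimal sketch. The indexing convention is an assumption, not taken from the paper: `W[j][d][k]` connects hidden unit `j` at hidden position `t` to visible symbol `k` at time `t + d`, the visible sequence `v` is a list of one-hot vectors, hidden positions exist only where the whole filter fits inside the sequence, and `a`, `b` are visible and hidden biases. All function names are hypothetical.

```python
import math

def sigmoid(x):
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def hidden_probs(v, W, b):
    """Logistic p(h[t][j] = 1 | v) for every filter position t."""
    D, K, T = len(W[0]), len(v[0]), len(v)
    return [[sigmoid(b[j] + sum(W[j][d][k] * v[t + d][k]
                                for d in range(D) for k in range(K)))
             for j in range(len(W))]
            for t in range(T - D + 1)]

def visible_probs(h, W, a, T):
    """Softmax p(v[t] = k | h) for every timestep t of a length-T sequence."""
    D, K = len(W[0]), len(a)
    out = []
    for t in range(T):
        act = list(a)
        for j in range(len(W)):
            for d in range(D):
                s = t - d                      # hidden position covering t at offset d
                if 0 <= s < len(h):
                    for k in range(K):
                        act[k] += W[j][d][k] * h[s][j]
        m = max(act)                           # subtract the max for stability
        e = [math.exp(x - m) for x in act]
        z = sum(e)
        out.append([x / z for x in e])
    return out
```

Since the same filter is applied at every position, a motif produces the same hidden activation wherever it occurs in the sequence, which is the translation invariance along time discussed above.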
5 Experiments
In the following section we want to assess the ability of the models to learn the inherent structure of melodic sequences belonging to the same genre. An appropriate measure for this evaluation is the marginal likelihood of the data under each model. However, computing the marginal likelihood under the TCRBM is intractable.
In the music modeling literature, evaluation is primarily based on qualitative analysis, like listening to model generations. To our knowledge, the only quantitative measures used so far are next-step prediction accuracy (Paiement, 2008; Lavrenko and Pickens, 2003) and perplexity (Lavrenko and Pickens, 2003). In this work, we broaden this evaluation framework to consider longer future prediction, instead of only next-step, as this provides an insight regarding model performance through time.
To make our comparative analysis more rigorous, we also examine the short order statistics of the models and compare them with the data statistics. To perform this analysis we compute the Kullback-Leibler divergence between the frequency distribution of events in test sequences and in model samples, which measures how well the model statistics match the data, or put differently, how much a model has yet to learn.
Besides the quantitative evaluation, we are also interested in assessing the capabilities of the models to identify and represent the statistical regularities of the data. In the VMM models, the learned lexicon of phrases determines the frequent musical motifs, but does not provide any information regarding the underlying structure, as the encoded patterns are fixed. On the other hand, the TCRBM learns a distributed representation of the input space; a set of latent features that are ‘on’ or ‘off’ depending on the input signal. We demonstrate that these features are music descriptors extracted from the data and convey information regarding music components such as scale, octaves and chords.
5.1 Data Processing and Representation
In the following experiments we use a dataset comprising 117 traditional reels collected from the Nottingham Folk Music Database.
Our representation is depicted in Fig. 2. The components we wish to model are pitch and duration of the notes in the melody. Duration is modelled implicitly by discretizing time in eighth-note intervals. At each timestep, pitch is encoded using a 1-of-K vector. We use only two octaves, , giving rise to a dimensional vector. Values outside this octave range are truncated to the nearest octave.
Finally, we augment the 1-of-K vector with two more values. The first one is used to represent music silence. The second one is used to represent ‘continuation’ of an event and allows us to keep more accurate information concerning the duration of notes.
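The encoding can be sketched as below. The concrete numbers are assumptions chosen for illustration only: a chromatic two-octave range of 24 pitches starting at MIDI pitch 60, followed by one slot for silence and one for continuation. The input format (a list of `(midi_pitch or None, duration in eighth notes)` pairs) and the names `fold` and `encode` are also hypothetical.

```python
LOW, N_PITCH = 60, 24            # assumed range: two octaves from MIDI pitch 60
REST, CONT = N_PITCH, N_PITCH + 1
DIM = N_PITCH + 2                # pitches + silence + continuation

def fold(pitch):
    """Move an out-of-range pitch by whole octaves into the modelled range."""
    while pitch < LOW:
        pitch += 12
    while pitch >= LOW + N_PITCH:
        pitch -= 12
    return pitch

def encode(notes):
    """notes: list of (midi_pitch or None for a rest, duration in eighth notes)."""
    steps = []
    for pitch, dur in notes:
        steps.append(REST if pitch is None else fold(pitch) - LOW)
        steps.extend([CONT] * (dur - 1))   # the event is held for dur - 1 more steps
    return [[1 if i == s else 0 for i in range(DIM)] for s in steps]
```

For example, `encode([(67, 2), (None, 1)])` yields three timesteps: the onset of the note, a continuation, and a rest. The continuation symbol distinguishes one held note from the same pitch struck twice.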
5.2 Implementation Details
We trained a VMM, a DirichletVMM and a TCRBM. To set the three parameters of the VMM (the maximum context length and the two thresholds) we applied grid search over the product space of the parameters and chose the values that maximize the data log-likelihood using leave-one-out cross validation on the training data. We used the same grid search procedure to set the parameter of the Dirichlet prior in the DirichletVMM.
For the TCRBM, we used 50 hidden units. We chose the size of the filters to be 8 timesteps, which corresponds to the length of a music bar. For learning the model we used the following settings: CD-5, 500 epochs, 0.5 learning rate decreasing on a fixed schedule, 0.0002 weight decay. We additionally used a sparsity term.
5.3 Learning Musical Features
In the TCRBM each hidden unit is connected with all the visible units from eight subsequent timesteps. This gives rise to a dimensional filter for each hidden unit.
The filters corresponding to 6 different hidden units from the learned TCRBM are depicted in Fig. 3. We can notice that all units prefer visible configurations with notes from the G major scale.
For instance, filter (6) is fairly broad and may respond to several different configurations of notes from the G major scale, whereas filter (5) is highly selective, responding primarily to the downwards-upwards movement through the scale and certain variations of it.
An interesting property of the top two filters is their relation with respect to the octave. Both units respond to similar music phrases. For instance, both units respond to the motif starting at either position 1 or 3. However, the left unit operates in the lower octave (), whereas the right one operates in the higher octave ().
Another interesting property is the relation of the filters to the tone chords of the scale.
In order to better understand how the filters behave, we looked at random visible configurations that tend to activate a hidden unit during sampling. Figure 4 shows two such visible configurations for the hidden unit corresponding to filter (5). Although the two configurations seem fairly different, they both contain the motif in positions to with either a pass through or ‘continuation’ of in position 3. Filter (5) is highly responsive to this motif, and although timesteps 6 to 8 in the visible configurations are not highly preferable, the unit is still very likely to turn ‘on’.
Overall, we can see that the learned filters encode familiar musical movements, such as arpeggios and scales.
5.4 Prediction Task
Given an observed test subsequence we want to evaluate how well a model can predict the following timesteps. We define the prediction log-likelihood of a test sequence under a model with parameters , as the log probability of the actual future timestep given timesteps up to , averaged over all timesteps of the test sequence. More specifically:
(8) 
We use the empirical marginal distribution as a baseline.
Computing Prediction under the VMM and the DirichletVMM.
For we can compute (8) exactly under the VMM models. For we need to marginalize over the future timesteps that are between and , ie:
(9) 
We approximate this distribution by drawing a number of sampled paths from the VMM and averaging over the conditional probability distributions defined by these paths which are given exactly under the VMM:
(10) 
We use 100 sampled paths in the experiments reported here.
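The sampled-path approximation can be sketched for any model that exposes an exact next-step distribution. In this illustration `next_dist(history)` is a hypothetical callable returning a `{symbol: probability}` dictionary (for a VMM it would look up the longest matching context), and `delta` is the number of steps ahead; neither name comes from the paper.

```python
import random

def predict_ahead(next_dist, history, delta, n_paths=100, rng=None):
    """Approximate p(symbol at step t + delta | history up to t).

    For delta = 1 the result is exact. For delta > 1 the intermediate
    timesteps are sampled from the model and the exact conditional
    distributions at the end of each sampled path are averaged.
    """
    rng = rng or random.Random(0)
    avg = {}
    for _ in range(n_paths):
        path = list(history)
        for _ in range(delta - 1):             # sample the intermediate timesteps
            dist = next_dist(path)
            symbols = list(dist)
            weights = [dist[s] for s in symbols]
            path.append(rng.choices(symbols, weights=weights)[0])
        for s, p in next_dist(path).items():   # exact conditional on this path
            avg[s] = avg.get(s, 0.0) + p / n_paths
    return avg
```

Averaging the exact conditional distributions, instead of counting sampled symbols at the target step, keeps the variance of the estimate lower for the same number of paths.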
Computing Prediction under the TCRBM.
In order to evaluate (8) under the TCRBM, we need to marginalize over future visible timesteps that are between and for and over the possible configurations of hidden units for timesteps to . To avoid this computation we approximate the predictive distribution using samples from the model. The sampling procedure is given in Algorithm 1.
In our experiments, we use 100 chains and run 15 Gibbs iterations within each chain. Overall, we use 500 samples to approximate the predictive distribution, discarding the first 10 samples from each chain.
Results.
Figure 5 shows the log-likelihood of predicting the true succession given an observed subsequence from a test tune under different models. As already mentioned, our baseline for assessing model performance is the empirical marginal distribution. The log-likelihood of the test data under the empirical marginal corresponds to the black curve.
Compared to the empirical marginal distribution, the standard VMM (green x) performs significantly better in predicting the first two future timesteps, only slightly better for timesteps 3 and 4 and significantly worse than the empirical marginal after the 5th timestep.
Both the DirichletVMM (cyan crosses) and the TCRBM (blue stars) perform significantly better than the standard VMM in predicting all future timesteps. These two models have similar performance in the prediction task, with the DirichletVMM outperforming the TCRBM for the first two timesteps and their prediction log-likelihood being almost the same from the 3rd timestep onwards.
We should note that the performance of the TCRBM in prediction may be compromised by the fact that the block Gibbs procedure samples the future subsequence as a whole at each iteration. This means that due to the convolutional structure of the model, the timestep we are trying to predict receives information not only from the past, which is clamped to the observed context, but also from the future which is initialized randomly and can thus drive the samples into different energy basins.
Compared to the empirical marginal distribution, both the DirichletVMM and the TCRBM perform better for the first 10 timesteps. The prediction log-likelihood under the models is initially much higher than the one under the empirical marginal distribution, but decays as we try to predict further into the future. The models slowly forget the information upon which they have been conditioned and after the 10th timestep converge to a steady-state distribution, which is slightly worse than the empirical marginal distribution for prediction.
While long-term prediction is useful for characterizing model behaviour through time, it is not adequate for evaluating the generative capabilities of the models. For instance, even if a musical phrase is highly predictable given a certain context, the models can get bad predictive performance if they are not able to determine the correct starting timestep for the phrase.
Nevertheless, we can note that in contrast to the standard VMM, our proposed models converge to the empirical marginal distribution over time and thus are better in capturing the statistical regularities in the data, which is the first step towards realistic music generation.
5.5 Using the Kullback-Leibler divergence to compare statistics
The Kullback-Leibler (KL) divergence is a measure of how different two probability distributions, P and Q, are. For discrete random variables, it is defined as $KL(P \,\|\, Q) = \sum_i P(i) \log_2 \frac{P(i)}{Q(i)}$ and gives the average number of extra bits needed to encode events from a distribution P with a code based on an approximating distribution Q. If the true distribution that generated the data is P and the model distribution is Q, then the lower the KL-divergence the better the model matches the data.
To compare model statistics with data statistics, we compute the frequency distribution of events in samples generated by each of the models and in test sequences, and compute the KL-divergence between the normalized data and model frequencies. More specifically, let denote the observation of a single timestep at time . Then to compare first-order statistics we estimate the KL-divergence between and by computing:
(11) 
where is the empirical marginal distribution of data sequences and is the marginal distribution of samples generated by model . Similarly, for pairwise statistics we compute , for third order statistics , and so on.
Since the true distribution that generated the data is unknown, we perform a bootstrapping procedure for the estimation of the KL-divergence. More specifically, we compute the KL-divergence for each statistic times, each time using a different data resample, obtained by random sampling with replacement from the original test dataset. In our results, we report the mean and variance of the KL-divergence for each statistic.
The number of possible events grows exponentially with the order we consider, which makes the statistics for higher orders less reliable, given that we have a finite test set. In order to get a better understanding of how the models perform through time, we additionally consider pairwise statistics with lags, that is statistics of events comprising two timesteps which are not adjacent in time. For instance for lag we consider the frequencies of events , for lag we consider and so on.
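The comparison of statistics can be sketched as follows. Several details are assumptions made for the illustration: events of a given order are counted as overlapping n-grams, a lagged pairwise event is the pair of symbols `lag` timesteps apart, the model frequencies are smoothed with a small constant `eps` so that events unseen in the model samples give a finite divergence, and the variance is the population variance over bootstrap resamples. The function names are hypothetical.

```python
import math
import random
from collections import Counter

def event_freqs(seqs, order=1, lag=None):
    """Count n-gram events of the given order, or lagged pairs if lag is set."""
    c = Counter()
    for seq in seqs:
        if lag is None:
            for t in range(len(seq) - order + 1):
                c[tuple(seq[t:t + order])] += 1
        else:
            for t in range(len(seq) - lag):
                c[(seq[t], seq[t + lag])] += 1
    return c

def kl_divergence(p_counts, q_counts, eps=1e-6):
    """KL(P || Q) in bits between normalized counts; Q is smoothed by eps."""
    events = set(p_counts) | set(q_counts)
    p_tot = sum(p_counts.values())
    q_tot = sum(q_counts.values()) + eps * len(events)
    kl = 0.0
    for e in events:
        p = p_counts[e] / p_tot
        if p > 0:
            q = (q_counts[e] + eps) / q_tot
            kl += p * math.log2(p / q)
    return kl

def bootstrap_kl(test_seqs, sample_seqs, order=1, n_boot=100, rng=None):
    """Mean and variance of KL(test || samples) over resamples of the test set."""
    rng = rng or random.Random(0)
    q = event_freqs(sample_seqs, order)
    vals = []
    for _ in range(n_boot):
        resample = [rng.choice(test_seqs) for _ in test_seqs]  # with replacement
        vals.append(kl_divergence(event_freqs(resample, order), q))
    mean = sum(vals) / n_boot
    var = sum((x - mean) ** 2 for x in vals) / n_boot
    return mean, var
```

Resampling whole sequences rather than individual events preserves the dependencies within each tune, which matters for the higher-order and lagged statistics.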
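[Table 1: Mean and variance of the KL-divergence between the statistics of test sequences and those of the train set (Trainset) and of samples from the TCRBM, the DirichletVMM (DirVMM) and the VMM. Upper part: statistics of order one to six. Lower part: pairwise statistics at six lags. The cell values are missing from this copy.]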
Results.
Table 1 shows the mean and variance of the KL-divergence between the statistics of test sequences and a priori samples for various models. The first row compares test sequences to train sequences and is used as a reference for interpreting the results. Looking at the first order statistics we can note that the TCRBM and the DirichletVMM have much lower KL-divergence than the VMM, with the DirichletVMM having the lowest amongst the models. In fact the KL-divergence for the former two models is very close to the KL-divergence between test sequences and train sequences, which indicates that samples generated from these models match the statistics of the test data well.
For the second, third and fourth order statistics, the TCRBM has the lowest KL-divergence, with the DirichletVMM following closely and the VMM lagging behind. Interestingly, the KL-divergence of these statistics for the TCRBM and the DirichletVMM is even lower than the one for the train data. We believe that this stems from the fact that the models are capturing the underlying structure that characterizes the whole musical genre, and to some extent ignore the finer structure that characterizes each individual music piece. This can result in model samples that have higher inter-piece and lower intra-piece similarity than a set of real music sequences.
For fifth and sixth order statistics, the KL-divergence for the TCRBM and the VMM is close to the KL-divergence for the train data, whereas for the DirichletVMM it is lower. As mentioned earlier, the estimates for higher order statistics are less reliable, since the number of possible configurations is exponentially large and thus very difficult to characterize from a finite set of samples.
Finally, for the pairwise statistics with lags, the KL-divergence for both the TCRBM and the DirichletVMM is low, very close to the one for the train data, whereas for the VMM it is considerably higher. This suggests that our proposed methods respect the short order statistics of the musical genre and are better than the VMM in capturing the statistical regularities of the data through time.
6 Discussion
We addressed the problem of learning a generative model for music melody by considering two probabilistic models, the DirichletVMM and the Time Convolutional RBM. We showed that the TCRBM, trained on a dataset of tunes from the same genre, learns descriptive musical features that can be used to decompose the underlying structure of the data into musical components such as scale, octave and chord.
We performed a comparative analysis of the two models with the standard VMM, which, to our knowledge, is state of the art in melody generation. We showed that in a long-term prediction task both models perform significantly better than the VMM and comparably with each other. The DirichletVMM is a better next-step predictor, which can be partially attributed to its main strength, namely its ability to use shorter or longer contexts depending on whether they provide useful information or not.
We evaluated the short order statistics of the models by comparing the Kullback-Leibler divergence between test sequences and model samples. We demonstrated that sampled generations from our proposed methods match the statistics of the test sequences considerably better than samples from the VMM and respect the genre statistics, as the KL-divergence for the TCRBM and the DirichletVMM is very close to the KL-divergence between test and train sequences.
The ability of the TCRBM to extract descriptive musical features allows us to consider hierarchical approaches for melody generation, which can help modulate the appearance of features through time. We are currently experimenting with deeper TCRBM architectures, where TCRBMs are stacked on top of one another in a greedy manner (see Hinton et al. (2006) for the RBM case). Deep models have been shown to learn hierarchical representations of the input space, where more abstract features are captured in higher layers, which according to the tonal music theory (Lerdahl and Jackendoff, 1983) is how music composition should be understood.
Finally, an interesting direction for future research in music modeling involves exploration of methods that can distinguish between inter-piece and intra-piece similarity. The methods examined in this work can learn the statistical relations within a musical genre, but are not able to effectively model piecewise variation. Considering methods that enable us to sample a prior distribution for each piece, such as topic models, would be a first step in this direction.
Acknowledgements.
Athina Spiliopoulou is partly funded by an EPSRC scholarship.
Footnotes
 email: {a.spiliopoulou,a.storkey}@ed.ac.uk
 Note that during prediction only the conditional probability distributions defined at the leaf nodes are used.
 The complete tree would represent a standard Markov Model of order .
 However, see for example Welling et al. (2004) on how to define RBMs with realvalued units.
 Computing the partition function involves a sum over all possible configurations of visible and hidden units.
 The filter size is the number of visible timesteps that a hidden unit receives input from.
 Each slice of the tensor is the weight matrix for the connections of hidden units at time with the visible units at time .
 Computing the data likelihood involves a sum over all possible configurations of visible and hidden units.
 We use the MIDI toolbox (Eerola and Toiviainen, 2004) to read and write MIDI files.
 In the VMM, the maximum length was set to a very large value (), which resulted in the depth of the tree being controlled by the parameter for the frequency counts. The resulting depth for the optimal tree is 13. In the DirichletVMM, we used a global parameter and applied grid search over the product space of , and .
 It has been suggested (Lee et al., 2009; Norouzi et al., 2009) that due to the overcomplete hidden representation of convolutional RBMs, encouraging sparsity is important and can facilitate learning.
 The filter for hidden unit is the slice of the weight tensor.
 Notes from the G major scale: G, A, B, C, D, E, F♯.
 These are chords of three, four or five notes built from alternate scale notes of G Major.
 These can be loosely defined as groups of subsequent scale notes, either going up or going down.
 The empirical distribution of the training data under the assumption that all timesteps are iid (independently and identically distributed). This distribution is the best predictor in the absence of temporal dependencies.
References
 Ackley, D.H., Hinton, G.E., Sejnowski, T.J. (1985). A learning algorithm for Boltzmann machines. Cognitive Science 9(1), 147–169.
 Dubnov, S., Assayag, G., Lartillot, O., and Bejerano, G. (2003). Using machine-learning methods for musical style modeling. Computer 36(10), 73–80.
 Eck, D. and Lapalme, J. (2008). Learning musical structure directly from sequences of music. Technical report, Université de Montreal.
 Eck, D. and Schmidhuber, J. (2002). Learning the long-term structure of the blues. In: Dorronsoro, J.R. (ed.) ICANN. LNCS, vol. 2415, pp. 284–289. Springer.
 Eerola, T. and Toiviainen, P. (2004). MIDI Toolbox: MATLAB Tools for Music Research. University of Jyväskylä, Jyväskylä, Finland, www.jyu.fi/musica/miditoolbox/
 Hinton, G.E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation 14(8), 1771–1800.
 Hinton, G.E., Osindero, S., and Teh, Y.W. (2006). A fast learning algorithm for deep belief nets. Neural Computation 18(7), 1527–1554.
 Lavrenko, V. and Pickens, J. (2003). Polyphonic music modeling with random fields. In: Rowe, L.A., Vin, H.M., Plagemann, T., Shenoy, P.J., Smith, J.R. (eds) ACM Multimedia. Proceedings of the Eleventh ACM International Conference on Multimedia, p. 120–129. ACM.
 Lee, H., Ekanadham, C., and Ng, A. Y. (2008). Sparse deep belief net model for visual area V2. In: Platt, J.C., Koller, D., Singer, Y., Roweis, S.T. (eds.) NIPS. Advances in NIPS 20. MIT Press.
 Lee, H., Grosse, R., Ranganath, R., and Ng, A.Y. (2009). Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In: Danyluk, A.P., Bottou, L., Littman, M.L. (eds.) ICML. ACM ICPS, vol. 382, p. 77. ACM.
 Lerdahl, F. and Jackendoff, R. (1983). A Generative Theory of Tonal Music. The MIT Press, Cambridge, Massachusetts, London, England.
 Norouzi, M., Ranjbar, M., and Mori, G. (2009). Stacks of convolutional restricted Boltzmann machines for shift-invariant feature learning. In: CVPR. 2009 IEEE Computer Society Conference on CVPR, pp. 2735–2742. IEEE.
 Paiement, J.F. (2008). Probabilistic Models for Music. PhD thesis, Ecole Polytechnique Fédérale de Lausanne (EPFL).
 Ron, D., Singer, Y., and Tishby, N. (1994). The power of amnesia. In: Cowan, J.D., Tesauro, G., Alspector, J. (eds.) NIPS. Advances in NIPS 6, p. 176–183. Morgan Kaufmann.
 Sutskever, I. and Hinton, G.E. (2007). Learning multilevel distributed representations for high-dimensional sequences. Journal of Machine Learning Research - Proceedings Track 2, 548–555.
 Taylor, G.W. and Hinton, G.E. (2009). Factored conditional restricted Boltzmann machines for modeling motion style. In: Danyluk, A.P., Bottou, L., Littman, M.L. (eds.) ICML. ACM ICPS, vol. 382, p. 129. ACM.
 Taylor, G.W., Hinton, G.E., and Roweis, S.T. (2007). Modeling human motion using binary latent variables. In: Schölkopf, B., Platt, J.C., Hoffman, T. (eds.) NIPS. Advances in NIPS 19, p. 1345–1352. MIT Press.
 Weiland, M., Smaill, A., and Nelson, P. (2005). Learning musical pitch structures with hierarchical hidden Markov models. Technical report, University of Edinburgh.
 Welling, M., Rosen-Zvi, M., and Hinton, G.E. (2004). Exponential family harmoniums with an application to information retrieval. In: NIPS. Advances in NIPS 17.
 Wood, F., Archambeau, C., Gasthaus, J., James, L., and Teh, Y.W. (2009). A stochastic memoizer for sequence data. In: Danyluk, A.P., Bottou, L., Littman, M.L. (eds.) ICML. ACM ICPS, vol. 382, p. 142. ACM.