This Time with Feeling:
Learning Expressive Musical Performance
Abstract: Music generation has generally been focused on either creating scores or interpreting them. We discuss differences between these two problems and propose that, in fact, it may be valuable to work in the space of direct performance generation: jointly predicting the notes and also their expressive timing and dynamics. We consider the significance and qualities of the data set needed for this. Having identified both a problem domain and characteristics of an appropriate data set, we show an LSTM-based recurrent network model that subjectively performs quite well on this task. Critically, we provide generated examples. We also include feedback from professional composers and musicians about some of these examples.
Keywords: music generation, deep learning, recurrent neural networks, artificial intelligence
Recognizing that “talking about music is like dancing about architecture” [Footnote 1: This quote has been attributed to a range of individuals, from Laurie Anderson to Miles Davis and numerous others.], we kindly ask the reader to listen to the linked audio in order to effectively understand the motivation, data, results, and conclusions of this paper. As this research is ultimately about producing music, we believe the actual results are most effectively perceived (indeed, only perceived) in the audio domain. This will provide necessary context for the verbal descriptions in the rest of the paper.
In this work, we discuss training a machine-learning system to generate music. The first two key words in the title are time and feeling: not coincidentally, our central thesis is that, given the current state of the art in music generation systems, it is effective to generate the expressive timing and dynamics information concurrently with the music. Here we do this by directly generating improvised performances rather than creating or interpreting scores. We begin with an exposition of some relevant musical concepts.
1.1 Scores, Performances and Musical Abstraction
Music exists in the audio domain, and is experienced through individuals’ perceptual systems. Any “music” that is not in the audio domain (e.g. a text or binary file of any sort) is of course a representation of music: if it is not physically vibrating, it is not (yet) sound, and if it is not sound, it is certainly not music. The obvious implication is that for any representation, there are additional steps to transform that representation, whatever it might be, into sound. Those steps might be as local as the conversion from digital to analog waves, or as global as the human performance of a written score, for example. In generating music [Footnote 2: In this text, we use the term “generation” to refer to computational generation, as opposed to human creation or performance.], therefore, one must be aware of which of those steps is addressed directly by their generative system, which ones must be addressed in other ways, and, importantly, the impact of all of those choices on the listener’s perception of the music, where it is ultimately experienced.
A defining characteristic of a representation, then, is what is omitted: what still needs to be added or done to it in order to create music from it, and the relation of that abstraction to our perceptual experience. With that consideration in mind, we now discuss some common symbolic representations.
Figure 1 is an example of a musical score chopin-1830 (). It shows which notes to play and when to play them relative to each other. The timing in a score is aligned to an implicit and relative metrical grid. For example, quarter notes are the same duration as quarter note rests, twice the duration of eighth notes, and so on. Some scores additionally specify an absolute tempo, e.g. in quarter notes per minute.
And yet, by the time the music is heard as audio, most of this timing information will have been intentionally not followed exactly! For example, in classical music from the 1800s onwards, rubato developed: an expressive malleability of timing that overrides metrical accuracy (i.e. can deviate very far from the grid), and this device is both frequent and essential for making perceptual sense of certain pieces. Another example of a rhythmic construct that is not written in scores is swing, a defining quality of many African American music traditions [Footnote 3: While explaining swing is outside the current scope, we do note that it is occasionally incorrectly described in terms of triplets.].
But tempo is not the only way in which the score is not followed exactly. Dynamics refers to how the music gets louder and quieter. While scores do give information about dynamics, in this respect, too, their effectiveness relies heavily on conventions that are not written into the score. For example, where the above score says “p” it means to play quietly, but that does not tell us how quietly, nor will all the notes be equally quiet. When there is a crescendo marking indicating to get louder, in some cases the performer will at first get momentarily quieter, creating space from which to build. Furthermore, when playing polyphonic piano music, notes played at the same time will usually be played at different dynamic levels and articulated differently from one another in order to bring out some voices over others.
Phrasing includes a joint effect of both expressive timing and dynamics. For example, there is a natural correlation between the melody rising, getting louder, and speeding up. These are not rules, however; skilled performers may deliberately choose to counteract such patterns to great effect.
We can think of a score as a highly abstract representation of music. The effective use of scores, i.e. the assumption by a composer that a score will subsequently be well-rendered as music, relies on the existence of conventions, traditions, and individual creativity. For example, Chopin wrote scores where the pianist’s use of rubato is expected, indeed the score requires it in order to make sense. Similarly, the melodies in jazz lead sheets were written with the understanding that they will be swung and probably embellished in various ways. There are numerous other instrument-specific aspects that scores do not explicitly represent, from the vibrato imbued by a string player to the tone of a horn player. Sometimes, the score just won’t really make perceptual sense without these.
In short, the mapping from score to music is full of subtlety and complexity, all of which turns out to be very important in the perceptual impact that the music will have. To get a sense of the impact of these concepts, we recommend that the reader listen:
first to a direct rendering of the above score here: https://clyp.it/jhdkghso, played according to the written grid and quantized to notes. Then,
to a MIDI performance of the same passage here: https://clyp.it/x24hp1pq.
MIDI is a communication protocol for digital musical instruments: a symbolic representation, transmitted serially, that indicates Note_On and Note_Off events and allows for a high temporal sampling rate. The loudness of each note is encoded in a discrete quantity referred to as velocity (the name is derived from how fast a piano key is pressed). While MIDI encodes note timing and duration, it does not encode qualities such as timbre; instead, MIDI events are used to trigger playback of audio samples.
MIDI can be visualized as a piano roll, a digital version of the old player piano rolls. Figure 2 is an example of a MIDI piano roll corresponding to the score shown in Figure 1. Each row corresponds to one of the 128 possible MIDI pitches. Each column corresponds to a uniform time step. If note $n$ is ON at time $t$ and had been pressed with velocity $v$, then element $(n, t) = v$. So, at 125 Hz, six seconds of MIDI data would be represented on a grid of size $128 \times 750$. Actual MIDI sampling can be faster than this, so even at 125 Hz we are still subsampling from the finest available temporal grid.
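The piano-roll layout described above can be sketched in a few lines of Python. This is only an illustration: the note tuple layout `(pitch, velocity, onset, release)` and the function name `piano_roll` are assumptions made here, not something the paper specifies.

```python
from typing import List, Tuple

SAMPLE_RATE_HZ = 125   # one column every 8 ms, as in the text
NUM_PITCHES = 128      # one row per MIDI pitch

# Hypothetical note layout: (pitch, velocity, onset in seconds, release in seconds)
Note = Tuple[int, int, float, float]


def piano_roll(notes: List[Note], duration_s: float) -> List[List[int]]:
    """Return a NUM_PITCHES x T grid whose cell (n, t) holds the velocity
    of pitch n if it is sounding at time step t, and 0 otherwise."""
    num_steps = round(duration_s * SAMPLE_RATE_HZ)
    roll = [[0] * num_steps for _ in range(NUM_PITCHES)]
    for pitch, velocity, start_s, end_s in notes:
        first = round(start_s * SAMPLE_RATE_HZ)
        last = min(round(end_s * SAMPLE_RATE_HZ), num_steps)
        for step in range(first, last):
            roll[pitch][step] = velocity
    return roll


# Two overlapping notes inside a six-second window.
roll = piano_roll([(60, 80, 0.0, 0.5), (64, 50, 0.25, 1.0)], duration_s=6.0)
print(len(roll), len(roll[0]))  # rows, columns
```

Six seconds at 125 Hz gives 750 columns, almost all of which are zero for a short excerpt. That sparsity is one motivation for the event-based representation introduced later.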
We refer to a score that has been rendered directly into a MIDI file as a MIDI Score. That is, it is rendered with no dynamics and exactly according to the written metrical grid. As given earlier, https://clyp.it/jhdkghso is an example of this.
If, instead, a score has been performed, by a musician for example, and that performance has been encoded into a MIDI stream, we refer to that as a MIDI Performance. https://clyp.it/x24hp1pq is an example (also given previously) of a MIDI performance.
2 Factoring the Music Generation Process: Related Work
Figure 3 shows one way of factoring the music generation process. The first stage shown in this figure is composition, which yields a score. The score is then performed. The performance is rendered as sound, and finally that sound is perceived. In the analog world, of course, performance and rendering the sound are the same on a physical instrument, but in the digital world, those steps are often separate. While other views of the process are possible, this one provides us a helpful context for considering much of the existing relevant work. Noting that sound generation and perception (the last two steps in Figure 3) are outside our scope, in the rest of this section we focus primarily on composition and performance.
Perhaps it is precisely because music is so often perceived as a profoundly human endeavour that there has also been, in parallel, an ongoing fascination with automating its creation. This fascination long predates notions such as the Turing test (ostensibly for discriminating automation of the most human behaviour), and has spawned a range of efforts: from attempts at the formalization of unambiguously strict rules of composition to incorporation of complete random chance into scores and performances. The use of rules exemplifies the algorithmic (and largely deterministic) approach to music generation, one that is interesting and outside the scope of the current work; for background on this we refer the reader, for example, to the text by Nierhaus nierhaus-2009 (). Our present work, on the other hand, lies in a part of the spectrum that incorporates probability and sampling.
Aleatory refers to music or art that involves elements of randomness, derived from the Latin alea (alee), meaning “die (dice)”. Dice were used in the 1700s to create music in a game referred to as Musikalisches Würfelspiel nierhaus-2009 (); hedges-78 (); boehmer-67 (): the rolled numbers were used to select from pre-composed fragments of music. Some of these compositions were attributed to Mozart and Haydn, though this has not been authenticated.
Two centuries later, as the foundations of AI were being set, the notion of automatically understanding (and therefore generating) music was among the earliest applications to capture the imagination of researchers, with papers on computational approaches to perception, interpretation and generation of music by Simon, Longuet-Higgins and others linblom-sundberg-70 (); longuet-higgins-76 (); longuet-higgins-78 (); longuet-higgins-steedman-71 (); simon-sumner-68 (). Since then, many interesting efforts have been made griffith-todd (); todd-loy (); concert-94 (); eck-schmidhuber-2002 (); pachet-2003 (); hild-91 (), and it is clear that in recent years both interest and progress in score generation have continued to advance, e.g. Lattner et al lattner-2017 (), Boulanger-Lewandowski et al boulanger-lewandowski-et-al-2012 (), Bretan et al bretan-2017a (), Herremans et al herremans-chew-2017 (), Roberts et al roberts-2016 (), Sturm sturm-2016 (), to name only a few. Briot et al briot-2017 () provide a survey of generative music models that involve machine learning. Herremans et al herremans-chuan-chew-2017 () provide a comprehensive survey and satisfying taxonomy of music generation systems. McDonald mcdonald-2017 () gives an overview highlighting some key examples of such work.
Corresponding to the second step in Figure 3 is a body of work often referred to as EMP (Expressive Musical Performance) systems. For example, the work by Chacon and Grachten chacon-grachten-2016 (), inspired by the Linear Basis Models proposed by Grachten and Widmer grachten-widmer-2012 (), involves defining a set of hand-engineered features, some of which depend on having a score with dynamic expression marks, others on heuristics for musical analysis (e.g. a basis function indicating whether the note falls on the first beat of a measure of 4/4). Widmer and Goebl widmer-goebl-2004 () and Kirke and Miranda kirke-miranda-2013 () both present extensive and detailed surveys of work done in the field of computational EMPs. In the latter survey, the authors also provide a tabular comparison of 29 systems that they have reviewed. Out of those systems, two use neural networks (one of which also uses performance rules) and a few more use PCA, linear regression, KCCA, etc. Some of the other systems that involve learning do so by learning rules in some way. For example, the KTH model friberg-et-al-2006 () consists of a top-down approach for predicting performance characteristics from rules based on local musical context. Bresin bresin-1998 () presents two variations of a neural network-based system for learning how to add dynamics and timing to MIDI piano performance.
Grachten and Krebs grachten-krebs-2014 () use a variety of unsupervised learning techniques to learn features with which they then predict expressive dynamics. Building on that work, van Herwaarden et al vanherwaarden-et-al-2014 () use an interesting combination of an RBM-based architecture, a note-centered input representation, and multiple datasets to—again—predict expressive dynamics. In both of these cases, the dynamics predictions appear to depend on the micro-timing rather than being predicted jointly as in the present work.
Teramura et al teramura-et-al-2008 () observe that many previous performance rendering systems “often consist of many heuristic rules and tend to be complex. It makes [it] difficult to generate and select the useful rules, or perform the optimization of parameters in the rules.” They thus present a method that uses Gaussian Processes to achieve this, where some parameters can be learned. In their ostensibly simpler system, “for each single note, three outputs and corresponding thirteen input features are defined, and three functions each of which returns one of three outputs and receive the thirteen input features, are independently learned”. However, some of these features, too, depend on certain information, e.g. they compute the differences between successive pitches, and this only works in compositions where the voice leading is absolutely clear; in the majority of classical piano repertoire, this is not the case. In Laminae okumura-sako-kitamura-2014 (), Okumura et al systematize a set of context-dependent models, building a decision tree which allows rendering a performance by combining contextual information.
Moulieras and Pachet moulieras-pachet-2016 () use a maximum entropy model to generate expressive music, but their focus is again monophonic plus simple harmonic information. They also explicitly assume that “musical expression consists in local texture, rather than long-range correlations”. While this is fairly reasonable at this point, and indeed it is hard to say how much long-range correlation is captured by our model, we wished to choose a model which, at least in principle, allowed the possibility of modeling long-range correlation: ultimately, we believe that these correlations are of fundamental importance. Malik and Ek malik-ek-2017 () use a neural network to learn to predict the dynamic levels of individual notes while assuming quantized and steady timing.
3 Choosing Assumptions and a Problem Domain
In the case of both score production and interpretation, any computational model naturally makes assumptions. Let us review potential implications of some of these when generating music, and identify some of the choices we make in our own model in these respects.
Metric Abstraction Many systems abstract rhythm in relation to an underlying grid, with metric-based units such as eighth notes and triplets. Often this is further restricted to step sizes at powers of two. Such abstraction is oblivious to many essential musical devices, including e.g. rubato and swing as described in Section 1.1.1. Some EMP systems allow for variations in the global tempo, but this would not be able to represent common performance techniques such as playing of the melody slightly staggered from accompaniment (i.e. creating an asynchrony beyond what is written in the score).
We choose a temporal representation based on absolute time intervals between events, rounded to 8ms.
No Dynamics Nearly every compositional system represents notes as ON or OFF. This binary representation ignores dynamics, which constitute an essential aspect of how music is perceived. The EMP systems do tend to focus on dynamics. While many systems do not have audio readily available, we point out that listening to, e.g. the work of Malik and Ek malik-ek-2017 () where a binned velocity value is predicted for each note, the abstracted and static tempo is still quite noticeable. When dynamic level is treated in some EMPs as a global parameter applied equally to simultaneous notes, this defeats the ability of dynamics to differentiate between voices, or to compensate for a dense accompaniment (that is best played quietly) underneath a sparse melody.
We allow each note to have its own dynamic level.
Monophony Some systems only generate monophonic sequences. Admittedly, one must start somewhere: the need to limit to monophonic output is in this sense entirely understandable. This can work very well for instruments such as voice and violin, where the performer also has sophisticated control beyond quantized pitch and the velocity of the note attack. The perceived quality of monophonic sequences may be inextricably tied to these other dimensions that are difficult to capture and usually absent from MIDI sequences.
In our experience, the leap from monophonic to polyphonic generation is a significant one. A survey of the literature shows that most systems that admit polyphony still make assumptions about its nature—either that it is separable into chords, or that it is separable into voices, or that any microvariation in tempo applies to all voices at once (as opposed to allowing one voice to come in ahead of the beat), and so forth. Each of these assumptions is correct only sometimes. We settled on a representation that turned out to be simpler and more agnostic than this, in that it does not make any of these assumptions:
We specify note events one at a time, but allow the system to predict an arbitrary number of simultaneous notes, should it be so inclined.
Generally speaking, in contrast to many of the methods discussed in Section 2, our approach makes no assumptions about the features other than the information that is known to exist in MIDI files: velocity, timing and duration of each note. We do not require computing or knowing the time signature, we do not require knowing the voice leading, we do not require inferring the chord, and so on. While additional information could be both useful and interesting, given the current state of the art and available data, we are focused on showing how much can be done without defining any rules or heuristics at all; we simply try to model the distribution of the existing data. Listening to some of the examples, one hears that our system generates a variety of natural time feels, including 3/4, 4/4 and odd time signatures, and they never feel rhythmically heavy-handed.
3.2 Problem Domain: Simultaneously Composing and Performing
In Figure 4, we show a few different possible entry points to the music generation process. For example, at one extreme, we can subsume all steps into a single mechanism so as to predict audio directly, as is done by WaveNet, with impressive results van-den-oord-et-al-2016 (). Another approach is to focus only on the instrument synthesis aspect engel-et-al-2017 (), which is an interesting problem outside the scope of our present work. As described in Section 2, the compositional systems generate scores that require performances, while the EMP systems require scores in order to generate performances.
Here, we demonstrate that jointly predicting composition and performance with expressive timing and dynamics, as illustrated in Figure 4(d), is another effective domain for music generation given the current state of the art. Furthermore, it creates output that can be listened to without requiring additional steps beyond audio synthesis as provided by a piano sample library.
While the primary evidence for this will be found simply by listening to the results, we mention two related discussion points about the state of the art:
Music with very long-term, fully coherent structure is still elusive. In “real” compositions, long-term structure spans the order of many minutes and is coherent on many levels. There is no current system that is able to learn such structure effectively. That is, if $\Delta t = 8$ ms, then even for just 2 minutes, $P(e_{i+15000} \mid e_i)$ should be different from $P(e_{i+15000})$. There is no current system that effectively achieves anywhere near this for symbolic MIDI representation.
Metrics for evaluating generated music are very limited. Theis and others theis-et-al-2016 (); van-den-oord-dambre-2015 () have given clear arguments about the limitations of metrics for evaluating the quality of generative models in the case of visual models, and their explanations extend naturally to the case of musical and audio models. In particular, they point out that ultimately, “models need to be evaluated directly with respect to the application(s) they were intended for”. In the case of the generative music models that we are considering, this involves humans listening.
Taken together, what this means is that systems that generate musical scores face a significant evaluation dilemma. Since by definition any listening-based evaluation must operate in the audio space, either a) the scores must be rendered directly and will lack expression entirely, or b) a human or other system must perform the scores, in which case the quality of the generated score is hard to disentangle from the quality of the performance. [Footnote 4: For example, listening to the direct score and performance clips given above, it should be clear that other than perhaps very experienced musicians, it would be extremely difficult for a listener to hear the audio of the MIDI Score and intuitively understand that that same passage could sound as it does in the MIDI Performance.] Furthermore, the lack of long-term structure compounds the difficulty of evaluation, because one of the primary qualities of a good score is precisely in its long-term structure. This implicitly bounds the potential significance of evaluating a short and context-free compositional fragment.
With these considerations in mind, we generate directly in the domain of musical performance. A side benefit of this is that informal evaluation becomes potentially more meaningful: musicians and non-musicians alike can listen to clips of generated performances while (1) not being put off by the lack of expressiveness and (2) not needing to disentangle the different elements that contributed to what they hear, since both the notes and how they are played were all generated by the system. [Footnote 5: We emphasize that these observations do not apply to the development of tools for composers, where score fragment generation might be appropriate. Also, we reiterate that this discussion is made in relation to the current state of the art.] We also note that our approach is consistent with many of the points and arguments recently made by Widmer widmer-16 ().
If we wish to predict expressive performance, we need to have the appropriate data. We use the International Piano-e-Competition dataset piano-competition (), which contains MIDI captures of roughly 1400 performances by skilled pianists. The pianists were playing a Disklavier, which is a real piano that also has internal sensors that record MIDI events corresponding to the performer’s actions. The critical importance of good data is well-known for machine learning in general, but here we note some particular aspects of this data set that made it well-suited for our task.
The data set was homogeneous in a set of important ways. It might be easy to underestimate the importance of any of the following criteria, and so we list them all explicitly here with some discussion:
First, it was all classical music.
This helps the coherence of the output.
Second, it was all solo instrumental music.
If one includes data that is for two or more instruments, then it no longer makes sense to train a generative model that is expected to generate for a solo instrument; there will be many (if not most) passages where what one instrument is doing is entirely dependent on what the other instrument is doing. The text analogy would be hoping for a system to learn to write novels by training it on only one character’s dialogue from movies and plays. There will occasionally be self-sufficient monologues, but generally speaking, well-written dialogue has already been distilled by the playwright, and makes more sense when voices are not removed from it.
Third, that solo instrument was consistently piano.
Classical composers generally write in a way that is very specific to whichever instrument they are writing for. Each instrument has its own natural characteristics, and classical music scores (i.e. that which is captured in the MIDI representation) are very closely related to the timbre of that instrument (i.e. how those notes will be “rendered”). One exception to this is that Bach’s music tends to sound quite good on any instrument, e.g. it is OK to train a piano system on Bach vocal chorales.
Fourth, the piano performances were all done by humans.
The system did not have to contend with learning from a dataset where some of the examples were synthesized, some were “hand-synthesized” to appear like human performances, etc. Each of those classes has its own patterns of micro-timing and dynamics, and each may be well-suited for a variety of music-related tasks, but for training a system on performances, it is very helpful that all the performances are indeed performances.
Finally, all of those humans were experts.
If we wish the system to learn about human performance, that human performance must match the listener’s concept of what “human performance” sounds like, which is usually performances by experts. The casual evaluator might find themselves slightly underwhelmed were they to listen to a system that has learned to play like a beginning pianist, even if the system has done so with remarkable fidelity to the dynamic and velocity patterns that occur in that situation.
The fact that the solo instrument was piano had additional advantages. Synthesizing audio from MIDI can be a challenging problem for some instruments. For example, having velocities and note durations and timing of violin music would not immediately lead to good-sounding violin audio at all. The problems are even more evident if one considers synthesizing vocals from MIDI. That the piano is a percussive instrument buys us an important benefit: synthesizing piano music from MIDI can sound quite good. Thus, when we generate data we can properly realize it in audio space and therefore have a good point of comparison. Conversely, capturing the MIDI data of piano playing provides us with a sufficiently rich set of parameters that we can later learn enough in order to be able to render audio. Note that with violin or voice, for example, we would need to capture many more parameters than those typically available in the MIDI protocol in order to get a sufficiently meaningful set of parameters for expressive performance.
5 RNN Model
We modeled the performance data with an LSTM-based Recurrent Neural Network. The model consisted of three layers of 512 cells each, although the network did not seem particularly sensitive to this hyperparameter. We used a temporally non-uniform representation of the data, as described next.
5.0.1 Representation: Time-shift
A MIDI excerpt is represented as a sequence of events from the following vocabulary of 413 different events:
128 NOTE-ON events: one for each of the 128 MIDI pitches. Each one starts a new note.
128 NOTE-OFF events: one for each of the 128 MIDI pitches. Each one releases a note.
125 TIME-SHIFT events: each one moves the time step forward by increments of 8 ms up to 1 second.
32 VELOCITY events: each one changes the velocity applied to all subsequent notes (until the next velocity event).
The neural network operates on a one-hot encoding over this event vocabulary. Thus, at each step, the input to the RNN is a single one-hot 413-dimensional vector. For the piano-e-competition dataset, a 15-second clip typically contains 600 such one-hot vectors, although this varies considerably (and roughly linearly with the number of notes in the clip).
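As a rough illustration of the vocabulary above, the following sketch maps each event to an index and builds the one-hot input vector. The ordering of the four event families inside the vocabulary is an assumption made here for illustration, since the text gives only the counts; `event_index` and `one_hot` are hypothetical helper names.

```python
from typing import List

# Sizes of the four event families, as listed in the text.
SIZES = {"NOTE_ON": 128, "NOTE_OFF": 128, "TIME_SHIFT": 125, "VELOCITY": 32}

# Assumed ordering: each family occupies a contiguous block of indices.
OFFSETS = {}
_next = 0
for _kind, _size in SIZES.items():
    OFFSETS[_kind] = _next
    _next += _size
VOCAB_SIZE = _next  # 128 + 128 + 125 + 32


def event_index(kind: str, value: int) -> int:
    """Map an event to its vocabulary index.

    `value` is 0-based within its family: a pitch for NOTE_ON/NOTE_OFF,
    a velocity bin for VELOCITY, and k for a TIME_SHIFT of (k + 1) * 8 ms.
    """
    if not 0 <= value < SIZES[kind]:
        raise ValueError(f"{kind} value out of range: {value}")
    return OFFSETS[kind] + value


def one_hot(index: int) -> List[float]:
    """Return the one-hot vector the network would receive for this event."""
    vec = [0.0] * VOCAB_SIZE
    vec[index] = 1.0
    return vec


x = one_hot(event_index("NOTE_ON", 60))  # middle C pressed
print(VOCAB_SIZE, len(x))
```

With this layout the four families sum to the 413 dimensions quoted in the text, and any other fixed ordering would serve equally well.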
While the minimal time step is a fixed absolute size (8 ms), the model can skip forward in time to the next note event. Thus, any time steps that contain rests or simply hold existing notes can be skipped with a single event. The largest possible single time shift in our case is 1 second but time shifts can be applied consecutively to allow effectively longer shifts. The combination of fine quantization and time-shift events helps maintain expressiveness in note timings while greatly reducing sequence length compared to an uncompressed representation.
This fine quantization is able to maintain expressiveness in note timings while not being as sparse as a grid-based representation. This sequence representation uses more events in sections with higher note density, which matches our intuition.
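To make the time-shift mechanism concrete, here is a minimal encoder sketch that turns a list of notes into an event sequence. It assumes notes arrive as `(pitch, velocity_bin, onset_ms, release_ms)` tuples, which is a hypothetical layout, and it emits a VELOCITY event only when the bin changes. It is not the authors' implementation.

```python
from typing import List, Tuple

STEP_MS = 8             # temporal resolution from the text
MAX_SHIFT_STEPS = 125   # 125 * 8 ms = 1 second, the longest single shift

Event = Tuple[str, int]
# Hypothetical note layout: (pitch, velocity_bin, onset_ms, release_ms)
Note = Tuple[int, int, int, int]


def encode(notes: List[Note]) -> List[Event]:
    # Flatten notes into timed actions: (step, is_onset, pitch, velocity_bin).
    actions = []
    for pitch, vel_bin, on_ms, off_ms in notes:
        actions.append((round(on_ms / STEP_MS), 1, pitch, vel_bin))
        actions.append((round(off_ms / STEP_MS), 0, pitch, vel_bin))
    # Sort by time step; at the same step, releases (0) precede onsets (1).
    actions.sort()

    events: List[Event] = []
    now = 0
    current_velocity = None
    for step, is_onset, pitch, vel_bin in actions:
        gap = step - now
        # Gaps longer than one second are expressed as consecutive shifts.
        while gap > 0:
            shift = min(gap, MAX_SHIFT_STEPS)
            events.append(("TIME_SHIFT", shift))
            gap -= shift
        now = step
        if is_onset:
            if vel_bin != current_velocity:
                events.append(("VELOCITY", vel_bin))
                current_velocity = vel_bin
            events.append(("NOTE_ON", pitch))
        else:
            events.append(("NOTE_OFF", pitch))
    return events


# A two-note chord, then a single quieter note after a 1.6 s rest.
events = encode([(60, 20, 0, 400), (64, 20, 0, 400), (67, 12, 2000, 2400)])
for event in events:
    print(event)
```

In this example the 1.6-second silence costs only two TIME_SHIFT events (125 steps, then 75), whereas a uniform 8 ms grid would need 200 empty columns.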
5.1 Training and Data Augmentation
We train the models by first separating the data into 30-second clips, from which we then select shorter segments. We train using stochastic gradient descent with a mini-batch size of 64 and a learning rate of 0.001 and teacher forcing.
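The segment selection and teacher forcing mentioned above can be sketched as follows. The segment length is not stated in the text, so `segment_len` is a placeholder; random selection of the segment start is likewise an assumption, and both function names are hypothetical.

```python
import random
from typing import List, Tuple


def select_segment(event_ids: List[int], segment_len: int,
                   rng: random.Random) -> List[int]:
    """Pick a contiguous segment of at most `segment_len` events from a clip."""
    if len(event_ids) <= segment_len:
        return list(event_ids)
    start = rng.randrange(len(event_ids) - segment_len + 1)
    return event_ids[start:start + segment_len]


def teacher_forcing_pair(segment: List[int]) -> Tuple[List[int], List[int]]:
    """Inputs are the true events; targets are the same events shifted by one.

    At every step the network is fed the ground-truth previous event rather
    than its own prediction, which is what teacher forcing refers to.
    """
    return segment[:-1], segment[1:]


clip = list(range(600))  # stand-in for the event indices of one clip
segment = select_segment(clip, segment_len=100, rng=random.Random(0))
inputs, targets = teacher_forcing_pair(segment)
print(len(inputs), len(targets))
```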
We augment the data in two different ways, for different runs:
Less augmentation:
Each example is transposed up and down all intervals up to a major third, resulting in 8 new examples plus the original.
Each example is stretched in time uniformly by ±2.5% and ±5%, resulting in 4 new examples plus the original.
More augmentation:
Each example is transposed up and down all intervals up to 5 or 6 semitones to span a full octave, resulting in 11 new examples plus the original.
Each example is stretched in time uniformly by up to ±10%.
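The two augmentation operations can be sketched as simple transformations on a note list. The note layout `(pitch, velocity, onset_s, release_s)`, the function names, and the choice to drop notes that leave the MIDI pitch range are assumptions made for illustration; the stretch factor is left as a caller-supplied parameter, and the value used in the example is purely illustrative.

```python
from typing import List, Tuple

# Hypothetical note layout: (pitch, velocity, onset in seconds, release in seconds)
Note = Tuple[int, int, float, float]


def transpose(notes: List[Note], semitones: int) -> List[Note]:
    """Shift every pitch; notes pushed outside 0..127 are dropped (an assumption)."""
    shifted = [(p + semitones, v, s, e) for p, v, s, e in notes]
    return [n for n in shifted if 0 <= n[0] <= 127]


def stretch(notes: List[Note], factor: float) -> List[Note]:
    """Scale all onsets and releases uniformly; factor > 1 slows the clip down."""
    return [(p, v, s * factor, e * factor) for p, v, s, e in notes]


def transpositions(notes: List[Note], down: int, up: int) -> List[List[Note]]:
    """All transpositions from -down to +up semitones, including the original."""
    return [transpose(notes, k) for k in range(-down, up + 1)]


example = [(60, 80, 0.0, 0.5), (64, 70, 0.5, 1.0)]
third = transpositions(example, down=4, up=4)    # up to a major third each way
octave = transpositions(example, down=5, up=6)   # spans a full octave
slower = stretch(example, 1.05)                  # illustrative factor only
print(len(third), len(octave))
```

Transposing up to a major third in each direction yields 9 variants (8 new plus the original), and the full-octave setting yields 12 (11 new plus the original), matching the counts in the text.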
In Section 3 we describe several forms of quantization that can be harmful to perceived musical quality. Our models also operate on quantized data; however, unlike much prior work we aim for quantization levels that are below noticeable perceptual thresholds.
Friberg and Sundberg friberg-sundberg-1992 () found that the just noticeable difference (JND) when temporally displacing a single tone in a sequence was generally no finer than roughly 10 ms. Other studies have found that the JND for change in tempo is no finer than roughly 5%. We note that for a tempo of 120 bpm, each beat lasts for 500 ms, and therefore this corresponds to a change of roughly 25 ms. Given that at that tempo beats will frequently still be subdivided into 2 or triplets, that would correspond to a change of roughly 8 ms per subdivided unit. We therefore assume that using a sampling rate of 125 Hz (i.e. 8 ms) should generally be below the typical perceptual threshold.
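The arithmetic behind the 8 ms figure can be written out step by step. The sketch below assumes a tempo of 120 beats per minute and applies the 5% tempo JND to a single beat before dividing by a triplet subdivision; the symbol names are introduced here only for illustration.

```latex
\begin{align*}
T_{\text{beat}} &= \frac{60\ \text{s}}{120} = 500\ \text{ms} \\
\Delta T_{\text{beat}} &\approx 0.05 \times 500\ \text{ms} = 25\ \text{ms} \\
\Delta T_{\text{triplet}} &\approx \frac{25\ \text{ms}}{3} \approx 8.3\ \text{ms} \\
f_s &= \frac{1}{8\ \text{ms}} = 125\ \text{Hz}
\end{align*}
```

A duple subdivision would give 12.5 ms instead, so the triplet case is the tighter of the two bounds.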
Working with piano music, we have found that 32 different “steps” of velocity are sufficient. Note that there are about 8 levels of common dynamic marking in classical music (from ppp to fff), so it may well be the case that we could do with fewer than 32 bins, but our objective was not to find the lower bound here.
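One simple way to obtain 32 velocity steps is uniform binning of the 0..127 MIDI velocity range. The text does not say how the bins are laid out, so the uniform width and the choice of bin centre when decoding are assumptions, as are the function names.

```python
NUM_BINS = 32
BIN_WIDTH = 128 // NUM_BINS  # 4 MIDI velocity values per bin


def velocity_to_bin(velocity: int) -> int:
    """Quantize a MIDI velocity (0..127) to one of NUM_BINS uniform bins."""
    if not 0 <= velocity <= 127:
        raise ValueError(f"velocity out of range: {velocity}")
    return velocity // BIN_WIDTH


def bin_to_velocity(bin_index: int) -> int:
    """Decode a bin back to a representative velocity (the bin centre)."""
    if not 0 <= bin_index < NUM_BINS:
        raise ValueError(f"bin out of range: {bin_index}")
    return bin_index * BIN_WIDTH + BIN_WIDTH // 2


print(velocity_to_bin(64), bin_to_velocity(velocity_to_bin(64)))
```

Under this scheme the round-trip error is at most 2 velocity units.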
5.1.3 Predicting Pedal
In the RNN model, we experimented with predicting sustain pedal. We applied Pedal_On by directly extending the lengths of the notes: for any notes on during or after a Pedal_On signal, we delay their corresponding Note_Off events until the next Pedal_Off signal. This made it a lot easier for the system to accurately predict a whole set of Note_Off events all at once, as well as to predict the corresponding delay preceding this. Doing so may have also freed up resources to focus on better prediction of other events as well. Finally, as one might expect, including pedal made a significant subjective improvement in the quality of the resulting output.
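The pedal preprocessing described above can be sketched as a pass over the notes that pushes each release falling inside a pedal-down span out to the end of that span. The `(pitch, onset, release)` note layout and the representation of the pedal as a list of `(pedal_on, pedal_off)` spans are assumptions for illustration. Re-striking a pitch that is still sustained is not handled here.

```python
from typing import List, Tuple

Note = Tuple[int, float, float]      # hypothetical: (pitch, onset_s, release_s)
PedalSpan = Tuple[float, float]      # hypothetical: (pedal_on_s, pedal_off_s)


def apply_sustain(notes: List[Note], pedal_spans: List[PedalSpan]) -> List[Note]:
    """Delay each Note_Off that occurs while the pedal is down until Pedal_Off."""
    result = []
    for pitch, start, end in notes:
        for pedal_on, pedal_off in pedal_spans:
            if pedal_on <= end < pedal_off:
                end = pedal_off   # the note keeps ringing until the pedal lifts
                break
        result.append((pitch, start, end))
    return result


notes = [
    (60, 0.0, 0.5),   # released before the pedal goes down: unchanged
    (62, 0.8, 1.1),   # held when the pedal goes down: extended
    (64, 1.2, 1.4),   # struck while the pedal is down: extended
    (67, 3.0, 3.5),   # after the pedal is lifted: unchanged
]
sustained = apply_sustain(notes, pedal_spans=[(1.0, 2.0)])
print(sustained)
```

After this step the notes caught by the pedal all release at the same instant, which is consistent with the observation that the model can then predict a whole set of Note_Off events at once.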
We begin with the most important indicator of performance: generated audio examples.
In these examples, our systems generated all MIDI events: timing and duration of notes as well as note velocities. We then used freely available piano samples to synthesize audio from the resulting MIDI file.
A small set of examples is available at https://clyp.it/user/3mdslat4. We strongly encourage the reader to listen. These examples are representative of the general output of the model. We comment on a few samples in particular, to give a sense of the kind of musical structure that we observe:
RNN Sample 4: This starts off with a slower segment that goes through a very natural harmonic progression in G minor, pauses on the dominant chord, and then breaks into a faster section that starts with a G major chord, then passes through major chords related to G minor (Bb, etc.). Harmonically, this shows structural coherence even while the tempo and feel shift. At around 12s, the “left hand” uses dynamics to bring out an inner voice in a very natural and appropriate way.
RNN Sample 7: This excerpt begins in a manner very reminiscent of a Schubert Impromptu, although it is sufficiently different that the model has clearly not memorized one. There is a small rubato at the very beginning of the phrase, especially on the first note, which is musically appropriate. The swells in the phrasing make musical sense, as do the slight pauses right before some of the isolated notes in the left hand (e.g. the E at 10 seconds, the F at around 12.5 seconds).
RNN Sample 2: This excerpt begins in a classical style (e.g. Haydn or Mozart). Interestingly, one note (an F) is repeated in the right hand in the first few seconds; after a pause, the next phrase begins, and at around 8 seconds the left hand mirrors that articulation pattern with a set of descending repeated notes (A, G, F).
We begin by noting that objective evaluation of these kinds of generative models is fundamentally very difficult, and measures such as log-likelihood can be quite misleading theis-et-al-2016 (). Nevertheless, we provide comparisons here over several different hyperparameter configurations for the RNN.
|Model|Log-loss|Description|
|---|---|---|
|RNN|.765|baseline RNN trained on 15-second clips|
|RNN-NV|.619|baseline without velocity|
|RNN-SUS|.663|baseline with pedaled notes extended|
|RNN-AUG+|.755|baseline with more data augmentation|
|RNN-AUG-|.784|baseline with less data augmentation|
|RNN-30s|.750|baseline trained on 30-second clips|
|RNN-SUS-30s|.664|baseline + pedal + 30-second clips|
Table 1 contains the per-time-step log-loss of several RNN model variants. The baseline model is trained on 15-second performance clips, ignoring sustain pedal and with the two forms of data augmentation described in Section 5.1.
Note that while RNN-NV has the best log-loss, its task is inherently easier, as the model does not need to predict velocities. In the RNN-SUS variant, sustain pedal is used to extend note durations until the pedal is lifted; this aids prediction as discussed in Section 5.1.3.
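For readers unfamiliar with the metric, a per-time-step log-loss is the average negative log-probability the model assigns to the event that actually occurred at each step. The sketch below is a generic illustration under stated assumptions: the function name is hypothetical, and the use of the natural logarithm is an assumption, since the unit is not restated here.

```python
import math
from typing import Sequence


def per_step_log_loss(probs: Sequence[Sequence[float]], targets: Sequence[int]) -> float:
    """Mean negative log-probability of the true event across time steps.

    probs[t][k] is the predicted probability of event k at step t.
    targets[t] is the index of the event that actually occurred at step t.
    """
    if len(probs) != len(targets) or not targets:
        raise ValueError("probs and targets must be non-empty and of equal length")
    total = -sum(math.log(step[target]) for step, target in zip(probs, targets))
    return total / len(targets)


# Toy two-step sequence over a two-event vocabulary (values are illustrative).
toy_probs = [[0.5, 0.5], [0.25, 0.75]]
toy_targets = [0, 1]
toy_loss = per_step_log_loss(toy_probs, toy_targets)
```

Lower values mean the model placed more probability on the events that occurred. This is also why a variant with a smaller event vocabulary, such as one without velocity events, is not directly comparable to the others.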
6.3 Informal Feedback From Professional Composers and Musicians
We gave a small set of clips to professional musicians and composers for informal comments. We were not trying to do a Turing test, so we mentioned that the clips were generated by an automated system, and simply asked for any initial reactions/comments. Here is a small, representative subset of the comments we received (musical background in bold, some particularly interesting excerpts are italicized for later discussion):
Fantastic!!!! How many hours of learning  here?
This  absolutely blows the stuff I’ve heard online out of the solar system. The melodic sense is still foggy, in my view, but it’s staggering that it makes nice pauses with some arcing chord progressions quite nicely. I think that it’s not far from actually coming up with a worthwhile melody.  How does it know what “inspirational emotion” to draw from? or is it mostly doing things “in the likeness of”?
Composer & Professional Musician
In terms of performance I’m quite impressed with the results. It sounds more expressive than any playback feature I’ve worked with when using composition software.
In terms of composition, I think there is more room for improvement. The main issue is lack of consistency in rhythmic structure and genre or style. For example, Sample 1 starts with a phrase in Mozart’s style, then continues with a phrase in Walton’s style perhaps, which then turns into Scott Joplin. Sample 2 uses the harmonic language of a late Mahler symphony, along with the rhythmic language of a free jazz improvisation (I couldn’t make a time signature out of this clip). Sample 3 starts with a phrase that could be the opening of a Romantic composition, and then takes off with a rhythmic structure that resembles a Bach composition, while keeping the Romantic harmonic language. Sample 4 is the most consistent of all. It sounds like a composition in the style of one of the Romantic piano composers (such as Liszt perhaps) and remains in that style throughout the clip.
I’d guess human because of a couple of “errors” in there, but maybe the AI has learned to throw some in! 
Pianist, TV & Film Composer:
Sample 1: resembles music in the style of Robert Schumann’s Kinderszenen or some early romantic salon music. I’m fond of the rest after the little initial chord and melody structure. The tempo slows down slightly before the rest which sounds really lively and realistic - almost a little rubato. Then the distinct hard attack. Nice sense of dynamics. Also nice ritardando at the end of the snippet. Not liking the somewhat messy run but this almost seems as if someone had to study a little bit harder to get it right - it seems wrong in a human way.
Sample 2: reminds me of some kind of Chopin waltz, rhythm is somewhat unclear. The seemingly wrong harmony at the beginning seems to be a misinterpretation of grace notes. The trill is astonishing and feels light and airy.
Sample 3: Could be some piece by Franz Schubert. Nice loosely feeling opening structure which shifts convincingly into fierce sequence with quite static velocity. This really reminds me of Schubert because Johann Sebastian Bach shines through the harmonic structure as it would have with Schubert. Interesting effort to change the dynamic focus from the right to the left hand and back again.
This is really interesting!
Sample 1: Sounded almost Bach-like for about the first bar, then turned somewhat rag-timey for the rest
Sample 2: Here we have a very drunken Chopin, messing around a bit with psychedelics
Does that help at all? Also, what do you mean by a regular piano sample library? Did you play these clips as composed by the AI system?
Overall, we note that the comments were quite consistent in terms of perceiving a human quality to the performance. Indeed, even though we made an effort to explain that all aspects of the MIDI file were generated by the computer, some people still wanted to double check whether in fact these were human performances.
While acknowledging the human quality of the performances, many of the musicians also questioned the strength of the long-term compositional structure. Indeed, creating music with long-term structure (e.g. more than several seconds of structure) is still a very challenging problem.
Many musicians identified the ‘style’ as the mix of classical composers of which the data indeed consisted.
We have considered various approaches to the question of generating music, and propose that it is currently effective to generate in the space of MIDI performances. We describe the characteristics of an effective data set for doing so, and demonstrate a system that achieves this quite effectively.
Our resulting system creates audio that sounds, to our ears, like a pianist who knows very well how to play, but has not yet figured out exactly what they want to play, nor is quite able to remember what they just played. Professional composers and musicians have provided feedback that is consistent with the notion that the system generates music which, on one hand, does not yet demonstrate long-term structure, but where the local structure, e.g. phrasing and dynamics, is very strong. Indeed, even though we did not frame the question as a Turing test, a number of the musicians assumed that (or asked whether) the samples were performed by a human.
We gratefully acknowledge all of the musicians who provided feedback on the samples.
We thank members and visitors at Google Brain and specifically the Magenta team for discussions, including Adam Roberts, Anna Huang, Colin Raffel, Curtis Hawthorne, David Ha, David So, Fred Bertch, George Dahl, Jesse Engel, Kory Mathewson, Kyle Kastner, Natasha Jaques and Tim Cooijmans. Finally, we thank the reviewers for their useful feedback.
-  International Piano-e-Competition. http://www.piano-e-competition.com/. Accessed: 2018-02-15.
-  Konrad Boehmer. Zur Theorie der offenen Form in der neuen Musik. Darmstadt: Edition Tonos, 1967.
-  Nicolas Boulanger-Lewandowski, Yoshua Bengio, and Pascal Vincent. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. In Proceedings of the 29th International Conference on Machine Learning, 2012.
-  R. Bresin. Artificial neural networks based models for automatic performance of musical scores. J. New Music Res., 27:239–270, 1998.
-  Mason Bretan, Sageev Oore, Jesse Engel, Douglas Eck, and Larry Heck. Deep music: Towards musical dialogue. In Proc. AAAI, 2017.
-  Jean-Pierre Briot, Gaëtan Hadjeres, and François Pachet. Deep learning techniques for music generation - A survey. CoRR, abs/1709.01620, 2017.
-  C. E. Cancino Chacón and M. Grachten. The basis mixer: A computational romantic pianist. In Late-Breaking Demo Session of the 17th International Society for Music Information Retrieval Conference. New York, NY, 2016.
-  Frédéric Chopin. Piano Concerto No. 1 in E minor, Op. 11, 1830.
-  D. Eck and J. Schmidhuber. Finding temporal structure in music: blues improvisation with lstm recurrent networks. In Proceedings of the 12th IEEE Workshop on Neural Networks for Signal Processing, pages 747–756, Sept 2002.
-  Jesse Engel, Cinjon Resnick, Adam Roberts, Sander Dieleman, Douglas Eck, Karen Simonyan, and Mohammad Norouzi. Neural audio synthesis of musical notes with wavenet autoencoders. arXiv: https://arxiv.org/abs/1704.01279, 2017.
-  A. Friberg, R. Bresin, and J. Sundberg. Overview of the KTH rule system for musical performance. Advances in Cognitive Psychology, 2(2–3):145–161, 2006.
-  Anders Friberg and Johan Sundberg. Perception of just noticeable time displacement of a tone presented in a metrical sequence at different tempos, 1992.
-  M. Grachten and F. Krebs. An assessment of learned score features for modeling expressive dynamics in music. IEEE Transactions on Multimedia, 16(5):1211–1218, 2014.
-  Maarten Grachten and Gerhard Widmer. Linear basis models for prediction and analysis of musical expression. Journal of New Music Research, 41(4):311–322, 2012.
-  Niall Griffith and Peter M. Todd, editors. Musical Networks: Parallel Distributed Perception and Performance. 1999.
-  Stephen Hedges. Dice music in the eighteenth century. Music and Letters, 59, 1978.
-  D. Herremans and E. Chew. Morpheus: Automatic music generation with recurrent pattern constraints and tension. IEEE Transactions on Affective Computing.
-  Dorien Herremans, Ching-Hua Chuan, and Elaine Chew. A functional taxonomy of music generation systems. ACM Comput. Surv., 50(5):69:1–69:30, September 2017.
-  Hermann Hild, Johannes Feulner, and Wolfram Menzel. Harmonet: A neural net for harmonizing chorales in the style of J. S. Bach. In Proceedings of the 4th International Conference on Neural Information Processing Systems, NIPS’91, pages 267–274, San Francisco, CA, USA, 1991. Morgan Kaufmann Publishers Inc.
-  K. Teramura, H. Okuma, Y. Taniguchi, S. Makimoto, and S. Maeda. Gaussian process regression for rendering music performance. In Proceedings of the 10th International Conference on Music Perception and Cognition (ICMPC 10). Japan, 2008.
-  A. Kirke and E. R. Miranda. An overview of computer systems for expressive music performance. In Guide to Computing for Expressive Music Performance. Springer-Verlag, London, 2013.
-  Stefan Lattner, Maarten Grachten, and Gerhard Widmer. Imposing higher-level structure in polyphonic music generation using convolutional restricted boltzmann machines and constraints, 2017.
-  B Lindblom and J Sundberg. Towards a generative theory of melody. Svensk Tidskrift för Musikforskning, (52), 1970.
-  H C Longuet-Higgins. The perception of melodies. Nature, 263, 1976.
-  H C Longuet-Higgins. The perception of music. Interdisciplinary Science Review, 3, 1978.
-  H C Longuet-Higgins and M J Steedman. On interpreting Bach. Machine Intelligence, 6, 1971.
-  Iman Malik and Carl Henrik Ek. Neural translation of musical style. CoRR, abs/1708.03535, 2017.
-  Kyle McDonald. Neural nets for generating music. Medium (https://medium.com/artists-and-machine-intelligence/neural-nets-for-generating-music-f46dffac21c0), 2017. Accessed 15-November-2017.
-  S. Moulieras and F. Pachet. Maximum entropy models for generation of expressive music. arXiv: http://arxiv.org/abs/1610.03606, 2016.
-  Michael C. Mozer. Neural network composition by prediction: Exploring the benefits of psychophysical constraints and multiscale processing, 1994.
-  Gerhard Nierhaus. Algorithmic Composition: Paradigms of Automated Music Generation. Springer Vienna, 2009.
-  K. Okumura, S. Sako, and T. Kitamura. Laminae: A stochastic modeling-based autonomous performance rendering system that elucidates performer characteristics. In International Computer Music Conference (ICMC). Athens, Greece, 2014.
-  Sageev Oore. Recording of Chopin Piano Concerto No. 1 in E Minor, Op. 11, movement, 2017. (unreleased).
-  François Pachet. The continuator: Musical interaction with style. Journal of New Music Research, 32(3):333–341, 2003.
-  A. Roberts, J. Engel, C. Hawthorne, I. Simon, E. Waite, S. Oore, N. Jaques, C. Resnick, and D. Eck. Interactive musical improvisation with magenta. In Demonstration Track in Neural Information Processing Systems (NIPS). 2016.
-  H A Simon and R K Sumner. Pattern in music. In B Kleinmuntz, editor, Formal Representation of Human Judgement. John Wiley, New York, 1968.
-  Bob Sturm, Joao Felipe Santos, Oded Ben-Tal, and Iryna Korshunova. Music transcription modelling and composition using deep learning. In Proc. 1st Conf. Computer Simulation of Musical Creativity, Huddersfield, UK, July 2016.
-  Lucas Theis, Aäron van den Oord, and Matthias Bethge. A note on the evaluation of generative models. In ICLR, 2016.
-  Peter M. Todd and Gareth Loy, editors. Music and Connectionism. MIT Press, 1991.
-  A. van den Oord and J. Dambre. Locally-connected transformations for deep gmms. In Proceedings of the International Conference on Machine Learning. Lille, France, 2015.
-  Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. CoRR, abs/1609.03499, 2016.
-  S. van Herwaarden, M. Grachten, and W. B. de Haas. Predicting expressive dynamics using neural networks. In Proceedings of the 15th Conference of the International Society for Music Information Retrieval, pages 47–52. 2014.
-  G. Widmer and W. Goebl. Computational models of expressive music performance: The state of the art. Journal of New Music Research, 33(3).
-  Gerhard Widmer. Getting closer to the essence of music: The con espressione manifesto. CoRR, abs/1611.09733, 2016.