A database linking piano and orchestral MIDI scores with application to automatic projective orchestration
This article introduces the Projective Orchestral Database (POD), a collection of MIDI scores composed of pairs linking piano scores to their corresponding orchestrations. To the best of our knowledge, this is the first database of its kind that supports piano or orchestral prediction and, more importantly, the learning of correlations between piano and orchestral scores. Hence, we also introduce the projective orchestration task, which consists in learning how to perform the automatic orchestration of a piano score. We show how this task can be addressed using learning methods and also provide methodological guidelines in order to properly use this database.
Léopold Crestel Philippe Esling Lena Heng Stephen McAdams
Music Representations, IRCAM, Paris, France
Schulich School of Music, McGill University, Montréal, Canada
Orchestration is the subtle art of writing musical pieces for the orchestra by combining the properties of various instruments in order to achieve a particular musical idea [koechli_orch, Rimsky-Korsakov:1873aa]. Among the variety of writing techniques for orchestra, we define as projective orchestration [esling2010dynamic] the technique which consists in first writing a piano score and then orchestrating it (akin to a projection operation, as depicted in \figref{fig:orch}). This technique has been used by classic composers for centuries. One such example is the orchestration by Maurice Ravel of Pictures at an Exhibition, a piano work written by Modest Mussorgsky. This paper introduces the first dataset of musical scores dedicated to projective orchestrations. It contains pairs of piano pieces associated with their orchestration written by famous composers. Hence, the purpose of this database is to offer a solid basis for studying the correlations involved in the transformation from a piano to an orchestral score.
The remainder of this paper is organized as follows. First, the motivations for a scientific investigation of orchestration are exposed (section 2). By reviewing the previous attempts, we highlight the specific need for a symbolic database of piano and corresponding orchestral scores. In an attempt to fill this gap, we built the Projective Orchestral Database (POD) and detail its structure in section 3. In section 4, the automatic projective orchestration task is proposed as an evaluation framework for automatic orchestration systems. We report our experiment with a set of learning-based models derived from the Restricted Boltzmann Machine [taylor2009factored] and introduce their performance in the previously defined evaluation framework. Finally, in section 5 we provide methodological guidelines and conclusions.
2 A scientific investigation of orchestration
Over the past centuries, several treatises have been written by renowned composers in an attempt to decipher some guiding rules in orchestration [koechli_orch, piston-orch, Rimsky-Korsakov:1873aa]. Even though they present a remarkable set of examples, none of them builds a systemic set of rules towards a comprehensive theory of orchestration. The reason behind this lack lies in the tremendous complexity that emerges from orchestral works. A large number of possible sounds can be created by combining the pitch and intensity ranges of each instrument in a symphonic orchestra. Furthermore, during a performance, the sound produced by a mixture of instruments is also the result of highly non-linear acoustic effects. Finally, the way we perceive those sounds involves complex psychoacoustic phenomena [lembke2012timbre, tardieu2012perception, mcadams2009perception]. It seems almost impossible for a human mind to grasp in its entirety the intertwined mechanisms of an orchestral rendering.
Hence, we believe that a thorough scientific investigation could help disentangle the multiple factors involved in orchestral works. This could provide a first step towards a greater understanding of this complex and widely uncharted discipline. Recently, major works have refined our understanding of the perceptual and cognitive mechanisms specifically involved when listening to instrumental mixtures [pressnitzer2000perception, tardieu2012perception, mcadams2013timbre]. Orchids, an advanced tool for assisting composers in the search for a particular sonic goal, has been developed [esling2010dynamic]. It relies on the multi-objective optimization of several spectro-temporal features such as those described in [peeters2011timbre].
However, few attempts have been made to tackle a scientific exploration of orchestration based on the study of musical scores. Yet, symbolic representations implicitly convey high-level information about the spectral knowledge composers have exploited for timbre manipulations. In [cookerly2010complete] a generative system for orchestral music is introduced. Given a certain style, the system is able to generate a melodic line and its accompaniment by a full symphonic orchestra. Their approach relies on a set of templates and hand-designed rules characteristic of different styles. [Pachet:2016:JOA:3004291.2897738] is a case study of how to automatically transfer the Ode to Joy to different styles. Unfortunately, very few details are provided about the models used, but it is interesting to observe that different models are used for different styles. Automatic arrangement, which consists in reducing an orchestral score to a piano version that can be played by a two-hand pianist, has been tackled in [huang2012towards] and [automatic_arranging_smc]. The proposed systems rely on an automatic analysis of the orchestral score in order to split it into structuring elements. Then, each element is assigned a role which determines whether it is played or discarded in the reduction. To the best of our knowledge, the inverse problem of automatically orchestrating a piano score has never been tackled. However, we believe that unknown mechanisms of orchestration could be revealed by observing how composers perform projective orchestration, which essentially consists in highlighting an existing harmonic, rhythmic and melodic structure of a piano piece through a timbral structure.
Even though symbolic data are generally regarded as a more compact representation than a raw signal in the computer music field, the number of pitch combinations that a symphonic orchestra can produce is extremely large. Hence, the manipulation of symbolic data still remains costly from a computational point of view. Even through computer analysis, an exhaustive investigation of all the possible combinations is not feasible. For that reason, the approaches found in the literature rely heavily on heuristics and hand-designed rules to limit the number of possible solutions and decrease the complexity. However, recent advances in machine learning have brought techniques that can cope with the dimensionality of symbolic orchestral data. Besides, even if a wide range of orchestrations exist for a given piano score, all of them will share strong relations with the original piano score. Therefore, we make the assumption that projective orchestration might be a relatively simple and well-structured transformation lying in a complex high-dimensional space. Neural networks have precisely demonstrated a spectacular ability for extracting a structured lower-dimensional manifold from a high-dimensional entangled representation [LeCun:2015aa]. Hence, we believe that statistical tools are now powerful enough to lead a scientific investigation of projective orchestration based on symbolic data.
These statistical methods require an extensive amount of data, but there is no symbolic database dedicated to orchestration. This dataset is a first attempt to fill this gap by building a freely accessible symbolic database of piano scores and corresponding orchestrations.
3.1 Structure of the Database
The database can be found on the companion website (https://qsdfo.github.io/LOP/index.html).
The Projective Orchestral Database (POD) contains 392 MIDI files. Those files are grouped in pairs containing a piano score and its orchestral version. Each pair is stored in a folder indexed by a number. The files have been collected from several free-access databases [imslp] or created by professional orchestration teachers.
As the files gathered in the database have various origins, different instrument names were found under a variety of aliases and abbreviations. Hence, we provide a comma-separated value (CSV) file associated with each MIDI file in order to normalize the corresponding instrumentations. In these files, the track names of the MIDI files are linked to a normalized instrument name.
For each folder, a CSV file with the name of the folder contains the relative path from the database root directory, the composer name and the piece name for the orchestral and piano works. A list of the composers present in the database can be found in table 1. It is important to note the imbalanced representativeness of composers in the database. It can be problematic in the learning context we investigate, because a kind of stylistic consistency is a priori necessary in order to extract a coherent set of rules. Picking a subset of the database would be one solution, but another possibility would be to add to the database this stylistic information and use it in a learning system.
Figure 2 highlights the activation ratio of each pitch in the orchestration scores (defined as a ratio of set cardinalities) over the whole dataset. Note that this activation ratio does not take the duration of notes into consideration, but only their number of occurrences. The pitch range of each instrument can be observed beneath the horizontal axis.
Two different kinds of imbalance can be observed in figure 2. First, a given pitch is rarely played. Second, some pitches are played more often compared with others. Class imbalance is known as being problematic for machine learning systems, and these two observations highlight how challenging the projective orchestration task is. More statistics about the whole database can be found on the companion website.
Both the metadata and instrumentation CSV files have been automatically generated but manually checked. We followed a conservative approach by automatically rejecting any score with the slightest ambiguity between a track name and a possible instrument (for instance bass can refer to double-bass or voice bass).
To facilitate the research work, we provide pre-computed piano-roll representations such as the one displayed in \figref{fig:piano-roll}. In this case, all the MIDI files of piano (respectively orchestra) works have been transformed and concatenated into a unique two-dimensional matrix. The starting and ending time of each track is indicated in the metadata.pkl file. These matrices can be found in Lua/Torch (.t7), Matlab (.m), Python (.npy) and raw (.csv) data formats.
Two versions of the database are provided. The first version contains unmodified MIDI files. The second version contains MIDI files automatically aligned using the Needleman-Wunsch [NEEDLEMAN1970443] algorithm as detailed in section 3.2.
3.2 Automatic Alignment
Given the diverse origins of the MIDI files, a piano score and its corresponding orchestration are almost never aligned temporally.
These misalignments are very problematic for learning or mining tasks, and in general for any processing which intends to take advantage of the joint information provided by the piano and orchestral scores. Hence, we propose a method to automatically align two scores, and release its Python implementation on the companion website.
The Needleman-Wunsch (NW) algorithm [NEEDLEMAN1970443] is a dynamic programming technique, which finds the optimal alignment between two symbolic sequences by allowing the introduction of gaps (empty spaces) in the sequences. An application of the NW algorithm to the automatic alignment of musical performances is introduced in [grachten2013automatic]. As pointed out in that article, NW is the best-suited technique for aligning two sequences with important structural differences, such as skipped parts.
The application of the NW algorithm relies solely on the definition of a cost function, which allows the pairwise comparison of elements from the two sequences, and the cost of opening or extending a gap in one of the two sequences.
To measure the similarity between two chords, we propose the following process:
1. Discard intensities by representing notes being played as one and zero otherwise.
2. Compute the pitch-class representation of the two vectors, which flattens all notes to a single octave vector (12 notes). In our case, we set the pitch-class to one if at least one note of the class is played. For instance, we set the pitch-class of C to one if there is any note with pitch C played in the piano-roll vector. This provides an extremely rough approximation of the harmony, which proved to be sufficient for aligning two scores. After this step, the dimension of each vector is 12.
3. If one of the vectors is only filled with zeros, it represents a silence, and the similarity is automatically set to zero (note that the score function can take negative values).
4. For two pitch-class vectors $A$ and $B$, we define the score as
$$S(A, B) = C \times \frac{\sum_{i=1}^{12} \delta(A_i + B_i)}{\max(\|A + B\|_1, 1)} \qquad (1)$$
where $\delta$ is defined as:
$$\delta(x) = \begin{cases} 0 & \text{if } x = 0 \\ -1 & \text{if } x = 1 \\ 1 & \text{if } x = 2 \end{cases}$$
$C$ is a tunable parameter and $\|x\|_1 = \sum_i |x_i|$ is the $L_1$ norm.
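The chord-similarity steps above can be sketched as follows. This is only an illustration, not the released implementation: it assumes that frames are plain lists indexed by MIDI pitch, and that the score has the form $C \cdot \sum_i \delta(A_i + B_i) / \max(\|A + B\|_1, 1)$ with $\delta(0) = 0$, $\delta(1) = -1$, $\delta(2) = 1$. The function names and the `make_frame` helper are hypothetical, and the default `c=10.0` is the value of C reported in the text.

```python
def pitch_class_vector(frame):
    """Fold a piano-roll frame (index = MIDI pitch, value = velocity)
    into 12 binary pitch classes, discarding intensity and octave."""
    classes = [0] * 12
    for pitch, velocity in enumerate(frame):
        if velocity > 0:
            classes[pitch % 12] = 1
    return classes


def delta(x):
    """Assumed weighting: +1 for a shared pitch class, -1 for an unshared one."""
    return {0: 0, 1: -1, 2: 1}[x]


def chord_score(frame_a, frame_b, c=10.0):
    """Similarity between two frames; a silence on either side scores zero."""
    a = pitch_class_vector(frame_a)
    b = pitch_class_vector(frame_b)
    if not any(a) or not any(b):
        return 0.0
    matches = sum(delta(x + y) for x, y in zip(a, b))
    l1_norm = sum(x + y for x, y in zip(a, b))  # L1 norm of A + B
    return c * matches / max(l1_norm, 1)


def make_frame(*pitches, size=128):
    """Hypothetical helper: build a frame with the given MIDI pitches on."""
    frame = [0] * size
    for pitch in pitches:
        frame[pitch] = 100
    return frame
```

Under these assumptions, two chords with identical pitch classes score C/2, whatever their octaves, while two chords sharing no pitch class score -C.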
Based on the values recommended in [NEEDLEMAN1970443] and our own experiments, we set C to 10. The gap-open parameter, which defines the cost of introducing a gap in one of the two sequences, is set to 3 and the gap-extend parameter, which defines the cost of extending a gap in one of the two sequences, is set to 1.
4 An application: projective automatic orchestration
In this section, we introduce and formalize the automatic projective orchestration task (\figref{fig:orch}). In particular, we propose a system based on statistical learning and define an evaluation framework for using the POD database.
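To make the dynamic programming concrete, here is a minimal sketch of Needleman-Wunsch alignment. It is a simplification and not the released implementation: it uses a single linear gap cost (`gap`) instead of the separate gap-open and gap-extend costs described above, and the function name, the `score` callback and the output format are assumptions made for illustration.

```python
def needleman_wunsch(seq_a, seq_b, score, gap=3.0):
    """Globally align two sequences with a linear gap penalty.

    Returns (total_score, pairs), where pairs lists aligned indices
    (i, j) and uses None on the side where a gap was inserted.
    """
    n, m = len(seq_a), len(seq_b)
    # best[i][j]: best score for aligning seq_a[:i] with seq_b[:j]
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = best[i - 1][0] - gap
    for j in range(1, m + 1):
        best[0][j] = best[0][j - 1] - gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(
                best[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1]),
                best[i - 1][j] - gap,   # gap in seq_b
                best[i][j - 1] - gap,   # gap in seq_a
            )
    # Trace back from the bottom-right corner to recover one optimal path.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and best[i][j] ==
                best[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1])):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and best[i][j] == best[i - 1][j] - gap:
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    pairs.reverse()
    return best[n][m], pairs
```

For score alignment, `seq_a` and `seq_b` would be the piano and orchestral frame sequences and `score` a chord-similarity function. An affine-gap variant, with distinct open and extend costs, would keep three matrices instead of one.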
4.1 Task Definition
For each orchestral piece, we define as $O$ and $P$ the aligned sequences of column vectors from the piano-rolls of the orchestra and piano parts. We denote as $T$ the length of the aligned sequences $O$ and $P$.
The objective of this task is to infer the present orchestral frame knowing both the recent past of the orchestra sequence and the present piano frame. Mathematically, it consists in designing a function $f$ where
$$\hat{O}(t) = f\left[P(t), O(t-1), \ldots, O(t-N)\right] \quad \forall t \in [N, \ldots, T] \qquad (2)$$
and $N$ defines the order of the model.
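As an illustration of the task definition, the sketch below builds the inputs and target of the inference function for every admissible time index: the present piano frame, the N past orchestral frames, and the present orchestral frame to predict. The function name and the tuple layout are assumptions, and indices are zero-based.

```python
def training_points(piano, orch, order):
    """Build (P(t), [O(t-1), ..., O(t-N)], O(t)) for each t >= N.

    piano and orch are aligned sequences of frames of equal length;
    order is N, the number of past orchestral frames given as context.
    """
    if len(piano) != len(orch):
        raise ValueError("piano and orchestra sequences must be aligned")
    points = []
    for t in range(order, len(orch)):
        past = [orch[t - k] for k in range(1, order + 1)]
        points.append((piano[t], past, orch[t]))
    return points
```

The first N frames have an incomplete past, so they are only ever used as context and never as targets.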
We propose a quantitative evaluation framework based on a one-step predictive task. As discussed in [conklin1995multiple], we make the assumption that an accurate predictive model will be able to generate original acceptable works. Whereas evaluating the generation of a complete musical score is subjective and difficult to quantify, a predictive framework provides us with a quantitative evaluation of the performance of a model. Indeed, many satisfying orchestrations can be created from the same piano score. However, the number of reasonable inferences of an orchestral frame given its context (as described in equation 2) is much more limited.
As suggested in [boulanger2012modeling, lavrenko2003polyphonic], the accuracy measure [bay2009evaluation] can be used to compare an inferred frame $\hat{O}(t)$ drawn from (2) to the ground-truth frame $O(t)$ from the original file.
$$\text{Accuracy}(t) = \frac{TP(t)}{TP(t) + FP(t) + FN(t)}$$
where $TP(t)$ (true positives) is the number of notes correctly predicted (note played in both $\hat{O}(t)$ and $O(t)$). $FP(t)$ (false positives) is the number of notes predicted that are not in the original sequence (note played in $\hat{O}(t)$ but not in $O(t)$). $FN(t)$ (false negatives) is the number of unreported notes (note absent in $\hat{O}(t)$, but played in $O(t)$).
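A minimal sketch of this frame-level accuracy, assuming binary frames stored as lists, is given below. The value returned when both frames are silent (no true positive, false positive or false negative) is not specified in the text; returning 1.0 in that case is a convention chosen here for illustration only.

```python
def frame_accuracy(predicted, truth):
    """Accuracy TP / (TP + FP + FN) between two binary frames."""
    tp = sum(1 for p, o in zip(predicted, truth) if p and o)
    fp = sum(1 for p, o in zip(predicted, truth) if p and not o)
    fn = sum(1 for p, o in zip(predicted, truth) if not p and o)
    total = tp + fp + fn
    if total == 0:
        return 1.0  # assumed convention: two silences agree perfectly
    return tp / total
```

Note that true negatives do not appear in the measure, so the many pitches that are off in both frames do not inflate the result.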
When the quantization gets finer, we observed that a model which simply repeats the previous frame gradually obtains the best accuracy as displayed in \tabref{tab:results}. To correct this bias, we recommend using an event-level evaluation framework where the comparisons between the ground truth and the model's output are only performed for time indices in $T_e$, defined as the set of indices $t_e$ such that
$$O(t_e) \neq O(t_e - 1)$$
The definition of event-level indices can be observed in \figref{fig:event_level_generation}.
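The event-level indices can be computed with a few lines of code. The sketch below assumes they are the time indices at which the orchestral frame differs from the previous one, which is the reading suggested by the bias being corrected (a model that repeats the previous frame should not be rewarded on unchanged frames). The function name is hypothetical.

```python
def event_indices(frames):
    """Return the indices t >= 1 where frames[t] differs from frames[t - 1]."""
    return [t for t in range(1, len(frames)) if frames[t] != frames[t - 1]]
```

With this definition the first frame is never an event, since it has no predecessor to compare with.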
In the context of learning algorithms, splitting the database between disjoint train and test subsets is highly recommended [bishop2006pattern, pg.32-33], and the performance of a given model is only assessed on the test subset. Finally, the mean accuracy measure over the dataset is given by
$$\text{Accuracy} = \frac{1}{K} \sum_{s \in D_{test}} \sum_{t_e \in T_e(s)} \frac{TP(t_e)}{TP(t_e) + FP(t_e) + FN(t_e)}$$
where $D_{test}$ defines the test subset, $T_e(s)$ the set of event-time indices for a given score $s$, and $K = \sum_{s \in D_{test}} |T_e(s)|$.
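The normalization matters here: assuming, as the closing "and" clause suggests, that the sum is divided by the total number of events over all test scores, the mean is taken over events and not over scores. A short sketch under that assumption, with a hypothetical function name and per-event accuracies supplied as nested lists:

```python
def mean_event_accuracy(per_score_accuracies):
    """Average per-event accuracies pooled over all test scores.

    per_score_accuracies holds one list per test score, each containing
    the accuracy obtained at every event-level index of that score.
    """
    total = 0.0
    count = 0  # total number of events over the test subset
    for accuracies in per_score_accuracies:
        total += sum(accuracies)
        count += len(accuracies)
    if count == 0:
        raise ValueError("the test subset contains no event")
    return total / count
```

For example, a score with three perfectly predicted events and a score with one missed event give 0.75, whereas averaging the two per-score means would give 0.5. Long scores therefore weigh more than short ones.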
4.2 Proposed Model
In this section, we propose a learning-based approach to tackle the automatic orchestral inference task.
We present the results for two models called conditional Restricted Boltzmann Machine (cRBM) and Factored Gated cRBM (FGcRBM). The models we explored are defined in a probabilistic framework, where the vectors $O(t)$ and $P(t)$ are represented as binary random variables. The orchestral inference function is a neural network that expresses the conditional dependencies between the different variables: the present orchestral frame $O(t)$, the present piano frame $P(t)$ and the past orchestral frames $O(t-1), \ldots, O(t-N)$. Hidden units are introduced to model the co-activation of these variables. Their number is a hyper-parameter with an order of magnitude of 1000. A theoretical introduction to these models can be found in [taylor2009factored], whereas their application to projective orchestration is detailed in [lop_smc].
In order to process the scores, we import them as piano-roll matrices (see \figref{fig:piano-roll}). Their extension to orchestral scores is obtained by concatenating the piano-rolls of each instrument along the pitch dimension.
Then, new events are extracted from both piano-rolls as described in section 4.1. A consequence is that the trained model treats the scores as a succession of events with no rhythmic structure. This is a simplification that considers the rhythmic structure of the projected orchestral score to be exactly the same as the one of the original piano score. This is false in the general case, since a composer can decide to add events in an orchestration that do not exist in the piano score. However, this provides a reasonable approximation that is verified in a vast majority of cases. During the generation of an orchestral score given a piano score, the next orchestral frame is predicted in the event-level framework, but inserted at the temporal location of the corresponding piano frame as depicted in \figref{fig:event_level_generation}.
Automatic alignment of the two piano-rolls is performed on the event-level representations, as described in section 3.2.
In order to reduce the input dimensionality, we systematically remove any pitch which is never played in the training database for each instrument. With that simplification the dimension of the orchestral vector typically decreases from 3584 to 795 and the piano vector dimension from 128 to 89. Also, we follow the usual orchestral simplifications used when writing orchestral scores by grouping together all the instruments of the same section. For instance, the violin section, which might be composed of several instrumentalists, is written as a single part. Finally, the velocity information is discarded, since we use binary units that solely indicate if a note is on or off.
Finally, we observed that an important proportion of the frames are silences, which mathematically correspond to a column vector filled with zeros in the piano-roll representation. A consequence of the over-representation of silences is that a model trained on this database will lean towards orchestrating any piano input with a silence, which is statistically the most relevant choice. Therefore, orchestrations of silences in the piano score ($P(t) = 0$) are not used as training points. However, it is important to note that they are not removed from the piano-rolls. Hence, silences can still appear in the past sequence of a training point, since they carry valuable information regarding the structure of the piece. During generation time, the silences in the piano score are automatically orchestrated with a silence in the orchestra score. Besides, silences are taken into consideration when computing the accuracy.
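The pitch-pruning and binarization steps can be sketched as follows, assuming piano-rolls stored as lists of frames whose values are velocities. The function names are hypothetical, and this is an illustration, not the preprocessing code released with the database.

```python
def played_pitches(training_frames):
    """Indices of the pitches played at least once in the training frames."""
    if not training_frames:
        return []
    dimension = len(training_frames[0])
    return [p for p in range(dimension)
            if any(frame[p] for frame in training_frames)]


def prune_and_binarize(frames, kept):
    """Keep only the selected pitches and replace velocities by 0/1."""
    return [[1 if frame[p] else 0 for p in kept] for frame in frames]
```

The list of kept pitches should be computed on the training subset only and then applied unchanged to the test subset, so that both share the same vector layout.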
The results of the cRBM and FGcRBM on the orchestral inference task are compared to two naïve models. The first model is a random generation of the orchestral frames obtained by sampling a Bernoulli distribution with a fixed parameter. The second model predicts an orchestral frame at time $t$ by simply repeating the frame at time $t-1$. The results are summed up in \tabref{tab:results}.
As expected, the random model obtains very poor results. The repeat model outperforms the three other models, surprisingly even in the event-level framework. Indeed, we observed that repeated notes still occur frequently in the event-level framework. For instance, if between two successive events only one note out of five is replaced by another, the repeat model obtains four true positives, one false positive and one false negative on this frame, so its accuracy is equal to $4/(4+1+1) = 2/3$.
While the FGcRBM model outperforms the cRBM model in the frame-level framework, the cRBM is slightly better than the FGcRBM model in the event-level framework.
Generations from both models can be listened to on the companion website.
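The two naïve baselines are simple enough to sketch directly. The function names are hypothetical, and the default Bernoulli parameter `p=0.5` is an assumption made for illustration, since its value is not stated here.

```python
import random


def random_model(dimension, p=0.5, rng=None):
    """Sample each note of an orchestral frame independently with probability p."""
    rng = rng or random.Random()
    return [1 if rng.random() < p else 0 for _ in range(dimension)]


def repeat_model(orch, t):
    """Predict the orchestral frame at time t by copying the frame at t - 1."""
    if t < 1:
        raise ValueError("the repeat model needs a previous frame")
    return list(orch[t - 1])
```

Passing a seeded `random.Random` instance as `rng` makes the random baseline reproducible across runs.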
5 Conclusion and future work
We introduced the Projective Orchestral Database (POD), a collection of MIDI files dedicated to the study of the relations between piano scores and corresponding orchestrations. We believe that recent advances in machine learning and data mining have provided the proper tools to take advantage of this important mass of information and investigate the correlations between a piano score and its orchestrations. We provide all MIDI files freely, along with aligned and non-aligned pre-processed piano-roll representations on the website https://qsdfo.github.io/LOP/index.html.
We proposed a task called automatic orchestral inference. Given a piano score and a corresponding orchestration, it consists in trying to predict orchestral time frames, knowing the corresponding piano frame and the recent past of the orchestra. Then, we introduced an evaluation framework for this task based on a train and test split of the database, and the definition of an accuracy measure. We finally present the results of two models (the cRBM and FGcRBM) in this framework.
We hope that the POD will be useful for many researchers. Besides the projective orchestration task we defined in this article, the database can be used in several other applications, such as generating data for a source-separation model. Even if small errors still persist, we thoroughly checked the database manually and guarantee its good quality. However, the number of files collected is still small for the purpose of conducting statistical investigations. Hence, we also hope that people will contribute to enlarge this database by sharing files and helping us gather the missing information.