Cross-Modal Music Retrieval and Applications

©2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. The published version is available online:


There has been a rapid growth of digitally available music data including audio recordings, digitized images of sheet music, album covers and liner notes, and video clips. This huge amount of data calls for retrieval strategies that allow users to explore large music collections in a convenient way. More precisely, there is a need for cross-modal retrieval algorithms that, given a query in one modality (e. g., a short audio excerpt), find corresponding information and entities in other modalities (e. g., name of piece and sheet music). This goes beyond exact audio identification (and subsequent retrieval of meta-information) as performed by commercial applications like Shazam [1].

In this paper, we review several cross-modal retrieval scenarios, with a particular focus on sheet music (visual domain) and audio (acoustic domain). First, we discuss a traditional approach where the sheet music and audio representations are converted into common mid-level feature representations that capture musical properties related to pitches and harmony. The resulting feature sequences can then be compared using standard alignment algorithms [2, 3]. Second, we review an approach based on symbolic fingerprinting techniques. Originally, audio fingerprinting refers to a procedure that allows for a robust identification of exact replicas of audio recordings [4]. In our cross-modal scenario, we discuss tempo- and transposition-invariant symbolic fingerprinting methods based on note parameters extracted via audio transcription techniques [5, 6]. Third, employing deep learning methods, we describe an end-to-end cross-modal retrieval strategy that works without needing manually crafted feature representations [7]. Given snippets of sheet music (in the form of pixel images) and corresponding audio excerpts (in the form of spectrograms), a neural network learns a joint embedding space, on which cross-modal retrieval can be performed using simple distance measures and nearest-neighbor search.

Using these three approaches as illustrative examples, the primary objective of this paper is to discuss principles and challenges encountered in general music processing, such as designing musically motivated features and similarity measures to cope with semantic data variability. Furthermore, to illustrate the potential of cross-modal retrieval techniques, we describe some navigation and browsing applications including a prototype system called the Piano Music Companion, while indicating future research directions.

Music Representations

Figure 1: The different representations for music data and data transformations relevant for cross-modal music retrieval.

Before we delve into the various cross-modal retrieval approaches, we first introduce some basic notions, following [3, Chapter 1]. As indicated by Figure 1, music can be represented in many different ways and formats. For example, a composer may write a composition in the form of a musical score, where musical symbols are used to visually encode which notes are to be played, and how. The printed form of a musical score is also referred to as sheet music. The original medium of this representation is paper, although it is now also accessible on computer screens in the form of digital images. In electronic instruments and computers, music can be communicated by means of standard protocols (such as the widely used MIDI protocol), where event messages specify note pitches, note intensities (velocities), and other parameters to generate the intended sounds. Often, the term symbolic is used to refer to any data format that explicitly represents musical entities. The musical entities may range from timed note events, as is the case in MIDI files, to graphical shapes with attached musical meaning, as is the case in music engraving systems. In contrast to such symbolic representations, the musical events are not given explicitly in audio representations such as WAV or MP3 files. These encode acoustic waves that are generated when, e. g., playing an instrument, and travel from the sound sources to the human ear as air pressure oscillations.

At this point it is important to note that each of these representations reflects certain aspects of a musical entity, but no single representation encompasses all properties. For example, rather than giving strict specifications, a musical score only serves as a guide for performing a piece of music, leaving room for different interpretations. Reading the instructions in the score, a musician shapes the music by varying the tempo, dynamics, articulation, and other parameters, thus creating a personal interpretation of the piece. Furthermore, while sheet music visually encodes the musical notes, such information is hidden in an audio recording, which is basically a time series of samples. In summary, even if they refer to the same piece of music, there may be a significant gap—technically as well as semantically—between different representations such as sheet music and audio.

The boundaries between the various music representations are not sharp. As illustrated by Figure 1, symbolic representations (depending on their specific format and intended application) may be closer to sheet music or audio representations. For example, symbolic representations such as MusicXML are used for rendering sheet music, where the shape of the note objects and their arrangement on a page are determined. Optical music recognition (OMR) can be seen as the inverse process with the objective to transform sheet music into a symbolic representation. Furthermore, symbolic representations such as MIDI are used for synthesizing audio, where the note objects are transformed into musical tones and real sounds. The inverse process is known as automatic music transcription (AMT) and aims at extracting note events, key signature, time signature, instrumentation, and other score parameters from a given music recording [3]. Both transformations, OMR as well as AMT, are far from straightforward. For example, correctly recognizing and interpreting the meaning of all the musical symbols in complex sheet music is easy for a trained human, but hard for a computer. Even though current OMR software is reported to yield highly accurate results, manual postprocessing is necessary to obtain a high-quality symbolic representation [8]. Similarly, converting a music recording into a note-based representation is a largely unsolved problem, in particular for multi-voiced music involving different instruments [9].

For relating different types of data (e. g., sheet music and audio data) to each other, traditional methods are often based on mid-level representations that exploit specific domain knowledge. As an important example, we first consider mid-level representations that capture musical properties related to pitches and harmony. We then discuss symbolic fingerprints that are based on note-level descriptors. Both of these approaches require expert knowledge in the transformation process. As an alternative, we present an end-to-end learning approach based on deep neural networks, where the idea is to circumvent the explicit definition of a mid-level representation. In the following sections, we address benefits and limitations of these conceptually different approaches in the context of cross-modal music retrieval.

Chroma-Based Approach

Figure 2: Some chromagrams obtained from (a) monophonic and (c) polyphonic sheet music and (b) polyphonic audio representations for the beginning of Frédéric Chopin’s Nocturne in B Major, Op. 9, No. 3.

To make music data algorithmically accessible, traditional music processing tries to extract suitable features that capture relevant key aspects while suppressing irrelevant details. For music-related retrieval and analysis tasks, chroma features have turned out to be a powerful mid-level representation [10, 3].

Due to their central importance in music processing, we give a short introduction to the basics of chroma features following [3, Chapter 1]. Recall that playing a note on an instrument results in a (more or less) periodic sound of a certain fundamental frequency. This fundamental frequency is closely related to what is called the pitch of a note. This notion allows us to order pitched sounds from “lower” to “higher”, similarly to the keys of a piano keyboard ordered from left to right. Two notes with fundamental frequencies in a ratio equal to any power of two (e. g., half, twice, or four times) are perceived as very similar (or musically/harmonically equivalent, in some sense). This observation leads to the fundamental notion of an octave, which is defined as the interval between one musical note and another with half or double its fundamental frequency. In Western music, the “space” within one octave is generally subdivided into twelve scale steps with fundamental frequencies equally spaced on a logarithmic frequency axis, resulting in what is known as the twelve-tone equal-tempered scale. In this scale, each pitch can be separated into two components, which are referred to as tone height (or octave number) and chroma (or pitch spelling attribute denoted by C, C♯, D, ..., B in Western music notation).
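
The decomposition of a pitch into tone height and chroma can be made concrete with a small sketch. It assumes that pitches are given as MIDI note numbers (with 69 denoting the concert pitch A4), a reference tuning of 440 Hz, and the common octave-numbering convention in which MIDI note 60 is C4; none of these conventions is prescribed by the text above.

```python
# Minimal sketch: equal-tempered pitch -> frequency, chroma, and octave number.
# Assumptions: MIDI note numbers, A4 = MIDI 69 = 440 Hz, and C4 = MIDI 60.
A4_MIDI = 69
A4_FREQ = 440.0
CHROMA_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def pitch_to_frequency(pitch):
    """Fundamental frequency in Hz; twelve equal steps per octave (factor 2)."""
    return A4_FREQ * 2.0 ** ((pitch - A4_MIDI) / 12.0)


def split_pitch(pitch):
    """Return (chroma name, octave number) of a MIDI pitch."""
    return CHROMA_NAMES[pitch % 12], pitch // 12 - 1
```

For instance, `split_pitch(57)` and `split_pitch(69)` share the chroma "A" and differ only in the octave number, while the corresponding frequencies differ by a factor of two.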

Chroma features rely on this perception of octave equivalence and map absolute pitch into twelve octave-independent pitch classes, where a pitch class consists of all pitches that share the same chroma. Thus, a chroma feature is represented by a 12-dimensional vector $x = (x(1), x(2), \ldots, x(12))^\top$, where $x(1)$ corresponds to chroma C, $x(2)$ to C♯, and so on. In the feature extraction step, a given audio signal is converted into a sequence of chroma vectors (also called chromagram), where each vector expresses how the short-time energy of the signal is spread over the twelve chroma bands. A chromagram closely correlates to the melodic and harmonic progression of the music, while exhibiting a high degree of robustness to variations in instrumentation and dynamics.
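
As a rough illustration of the feature extraction step, the following sketch computes a chromagram with a short-time Fourier transform and assigns each frequency bin to the pitch class of its nearest equal-tempered pitch. The frame length, hop size, low-frequency cutoff, and 440 Hz reference are illustrative assumptions, and the plain binning shown here is only one of many possible chroma variants.

```python
import numpy as np


def stft_chromagram(signal, sample_rate, frame_len=4096, hop=2048, f_ref=440.0):
    """Return a (12, n_frames) array of short-time energy per pitch class."""
    window = np.hanning(frame_len)
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    valid = freqs >= 27.5  # assumed cutoff: skip DC and bins below A0
    # Nearest equal-tempered MIDI pitch of every bin, reduced to its pitch class.
    midi = np.round(69 + 12 * np.log2(freqs[valid] / f_ref)).astype(int)
    pitch_class = midi % 12
    n_frames = 1 + (len(signal) - frame_len) // hop
    chroma = np.zeros((12, n_frames))
    for n in range(n_frames):
        frame = signal[n * hop : n * hop + frame_len] * window
        power = np.abs(np.fft.rfft(frame))[valid] ** 2
        # Sum the spectral energy of all bins that share a pitch class.
        chroma[:, n] = np.bincount(pitch_class, weights=power, minlength=12)
    return chroma
```

Row 0 of the result corresponds to chroma C and row 9 to chroma A. Because of the fixed linear frequency resolution, the lowest octaves are resolved only coarsely in this sketch.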

There are many ways for computing chroma-based features from audio recordings, e. g., using short-time Fourier transforms (STFT) in combination with binning strategies [10] or by employing suitable multirate filter banks [11]. Furthermore, the properties of chroma features can be significantly changed by introducing suitable pre- and post-processing steps modifying spectral, temporal, and dynamical aspects. As an example, Figure 2 (center part) shows two different chromagram variants extracted from a piano audio recording. While the first one is a traditional chromagram, the second version is enhanced such that certain important frequencies that relate to melody notes, as specified by the upper staff, are emphasized, which can be important, e. g., for melody-based retrieval. When given a symbolic music representation (such as MIDI or MusicXML files), it is straightforward to derive chromagrams from the explicitly encoded note parameters (pitches, note onsets, note durations). Figure 2 shows symbolic chromagrams obtained from a monophonic (left part) and a polyphonic (right part) sheet music representation. While symbolic chromagrams are based on “pure” note information, audio-based chromagrams tend to be “noisy”, reflecting the full range of the signal’s acoustic properties (including partials, transients, room acoustics). Still, as also demonstrated by Figure 2, chroma features mainly capture melodic and harmonic properties and are suited to serve as a mid-level feature representation for comparing and relating acoustical and symbolic music.
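
Deriving a chromagram from symbolic data can be sketched in a few lines. The example assumes that the note parameters are available as (onset, duration, MIDI pitch) triples measured in seconds and that a binary activation per frame is sufficient; the frame rate of 10 Hz is an arbitrary choice for illustration.

```python
import numpy as np


def symbolic_chromagram(notes, frame_rate=10.0):
    """notes: iterable of (onset_sec, duration_sec, midi_pitch) triples.

    Returns a binary (12, n_frames) array; row 0 is chroma C.
    """
    notes = list(notes)
    end = max(onset + duration for onset, duration, _ in notes)
    n_frames = int(np.ceil(end * frame_rate))
    chroma = np.zeros((12, n_frames))
    for onset, duration, pitch in notes:
        start = int(np.floor(onset * frame_rate))
        # Every note covers at least one frame, even if it is very short.
        stop = max(start + 1, int(np.ceil((onset + duration) * frame_rate)))
        chroma[pitch % 12, start:stop] = 1.0
    return chroma
```

Notes that are an octave apart (e. g., MIDI pitches 60 and 72) activate the same row, which is exactly the loss of tone-height information that makes chroma features robust, but also less specific.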

To demonstrate the applicability and potential of chroma-based features, we consider a cross-modal retrieval scenario motivated by Barlow and Morgenstern’s book A Dictionary of Musical Themes published in 1949 [12]. This book contains about 10,000 musical themes of well-known instrumental pieces from the corpus of Western classical music. These monophonic themes (usually four bars long) are typically the most memorable parts of a piece of music. This motivates the retrieval scenario as considered in [13, 14], where the objective is to retrieve all audio recordings from a music collection that contain a specified musical theme. More formally, let $\mathcal{Q}$ be the collection of musical themes, where each element $Q \in \mathcal{Q}$ is regarded as a query. Furthermore, let $\mathcal{D}$ be a set of audio recordings, which we regard as a database collection consisting of documents $D \in \mathcal{D}$. Given a query $Q$, the retrieval task is to identify the semantically corresponding documents $D$. One approach, as illustrated by Figure 3, is to first transform a query (possibly using OMR as an intermediate step) and each of the documents into chromagrams. Based on these mid-level representations, one computes a matching function by locally comparing the query chromagram to the audio chromagram using a subsequence variant of dynamic time warping (DTW), see [11, Chapter 4]. For each position of the audio recording $D$, such a matching function indicates the local cost of aligning the query chromagram with a segment ending at that position of the audio chromagram. In other words, each local minimum of the matching function that is close to the value zero points to a location where the query (musical theme) is similar to a local segment of the document (audio recording). Thus, for a given query, the retrieval task can be solved by computing matching curves for all documents and screening for local minima that are below a certain threshold in these curves. The costs of the local minima yield a natural ranking of the retrieved documents and their relevant sections, which can then be presented in the form of a ranked list, see Figure 3 (right side).
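
The matching procedure can be sketched as follows. This is a minimal subsequence DTW under stated assumptions rather than the implementation used in [13, 14]: the local cost is one minus the cosine similarity of chroma vectors, the step sizes are the three basic ones (horizontal, vertical, diagonal), and the accumulated cost is normalized by the query length.

```python
import numpy as np


def cosine_cost(X, Y):
    """C[n, m] = 1 - cosine similarity of the columns X[:, n] and Y[:, m]."""
    Xn = X / np.maximum(np.linalg.norm(X, axis=0, keepdims=True), 1e-9)
    Yn = Y / np.maximum(np.linalg.norm(Y, axis=0, keepdims=True), 1e-9)
    return 1.0 - Xn.T @ Yn


def matching_function(query, doc):
    """Cost of aligning the whole query with a doc segment ending at each frame."""
    C = cosine_cost(query, doc)
    N, M = C.shape
    D = np.empty((N, M))
    D[0, :] = C[0, :]              # the alignment may start at any doc frame
    D[:, 0] = np.cumsum(C[:, 0])   # first column: no choice but to accumulate
    for n in range(1, N):
        for m in range(1, M):
            D[n, m] = C[n, m] + min(D[n - 1, m], D[n, m - 1], D[n - 1, m - 1])
    return D[-1, :] / N            # last row, normalized by the query length
```

Positions where the returned curve has a local minimum below a chosen threshold are candidate end positions of the query within the document. The double loop keeps the sketch readable; it makes no attempt at efficiency.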

Figure 3: (a) An illustration of the matching procedure with chroma-based representations. (b) The costs of the local minima yield a natural ranking of the retrieved documents and their relevant sections, which are shown in the form of a ranked list.

As detailed in [13, 14], there are various challenges that need to be addressed, including tempo deviations, OMR extraction errors, musical tunings, key transpositions, and differences in the degree of polyphony between the symbolic query and the audio recordings. For some of these challenges, there already exist reliable compensation strategies. For example, key transpositions are simulated by a cyclic shift of the query’s chromagram, or local and global tempo deviations are compensated by using sequence alignment techniques such as DTW. Handling differences in the degree of polyphony is still subject to ongoing research. One strategy to bridge the “polyphony gap” is to first extract the predominant melody of the audio recording using harmonic summation [15] and source-filter models [16]. From the resulting salience representations, enhanced audio chromagrams that better match the monophonic theme may be derived (see Figure 2 for an illustration).
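
Simulating key transpositions by cyclic shifts is simple enough to show directly. The sketch assumes a chromagram stored as a NumPy array with twelve rows (row 0 being chroma C); the helper name is hypothetical.

```python
import numpy as np


def cyclic_shifts(chroma):
    """Return the twelve transposed versions of a (12, n_frames) chromagram.

    Entry k is the chromagram transposed upwards by k semitones.
    """
    return [np.roll(chroma, shift, axis=0) for shift in range(12)]
```

In a retrieval setting, one would typically compare all twelve shifted query chromagrams against a document and keep the shift with the lowest matching cost.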

Obviously, computing matching curves for each database document results in a retrieval procedure that does not scale to large music collections. Indexing techniques based on short audio excerpts (so-called audio shingles) can help speed up the retrieval procedure [17, 18]. In the next section, we discuss an alternative approach that is based on symbolic fingerprints and permits extremely efficient retrieval.

Symbolic Fingerprinting Approach

We have seen that chroma features are a very convenient mid-level representation for comparing music data of different modalities. One main benefit is that both symbolic and audio data can be easily converted into chromagrams. Furthermore, capturing only the coarse harmonic/melodic progression, chromagrams are highly robust to musical and acoustic variations. However, the reduction onto the chroma level also leads to a loss of valuable information that may be contained in the input data (such as accurate timing and pitch parameters as encoded by sheet music). As a consequence, chroma-based retrieval strategies often become problematic for short input sequences (e. g., covering only a couple of notes). Furthermore, reducing pitch information to the twelve chroma bands renders the comparison of monophonic and polyphonic versions difficult. An alternative to using chroma-based features is to exploit the high specificity of note parameters and of resulting time–pitch patterns of occurring notes. To this end, both the visual and acoustic data need to be transformed into the symbolic music domain. In the following, we discuss such an approach based on symbolic fingerprints and highlight the resulting benefits and limitations.

Traditionally, in music processing, audio fingerprinting refers to methods for identifying exact replicas of audio recordings, which are possibly distorted in some way (e. g., compression artifacts or background noise). For this problem, also known as audio identification, powerful algorithms exist and are in everyday use in commercial applications (see, e. g., [4, 3, 19, 1]). In the identification process, the audio material is compared by means of so-called audio fingerprints, which are compact and discriminative audio features. There are many different ways of designing and computing audio fingerprints, and the suitability of a specific type of fingerprint very much depends on the requirements imposed by the intended application. For example, in the pioneering work by Wang [1], a fingerprinting approach is described that operates on spectral peaks extracted from a time–frequency representation. Recent work such as [20, 19] has focused on making fingerprinting algorithms more robust to transformations of the time scale (playback speed of the audio) and the frequency scale (transpositions). Classical fingerprinting approaches, combined with indexing techniques, allow for a very efficient (scalable to huge fingerprint datasets) and effective (high precision even for short queries) identification of audio material. However, being based on audio-specific spectro-temporal patterns, these techniques are not suited for handling music-specific variations as required for cross-modal music retrieval or related tasks such as cover song retrieval [3, 21].

Figure 4: An illustration of symbolic fingerprints.

Inspired by classical fingerprinting techniques, Arzt et al. [5, 6] introduced a symbolic fingerprinting approach, which not only allows for the identification of exact replicas of recordings, but also for fast retrieval of different versions of the same piece of music including differently performed audio recordings and score representations. In the following, we summarize the main idea of this approach. For the moment, we start with a symbolic music representation where all note events are encoded explicitly. As illustrated by Figure 4, we assume that each note event is specified by an onset time $t$ and a pitch $p$. To obtain fingerprints, we consider triples consisting of three events $(t_1, p_1)$, $(t_2, p_2)$, and $(t_3, p_3)$ with $t_1 < t_2 < t_3$. For each such triple, we define the time differences $d_{1,2} := t_2 - t_1$ and $d_{2,3} := t_3 - t_2$ as well as the pitch differences $p_{1,2} := p_2 - p_1$ and $p_{2,3} := p_3 - p_2$. Furthermore, we set $r := d_{2,3} / d_{1,2}$. Finally, a symbolic fingerprint is defined to be a list of the following numbers:

$$(p_{1,2},\ p_{2,3},\ r) \tag{1}$$

Considering time and pitch relations in a relative fashion, each fingerprint is invariant with regard to musical transpositions (pitch shifts) and tempo changes. To obtain local descriptors, fingerprints are computed only from note events within a certain neighborhood (typically a few seconds). This not only facilitates short query lengths, but also reduces the number of fingerprints to be stored in the database. Also observe that since each individual fingerprint encodes relative timing information, we need to assume that the onset times of a triple are distinct. As a result, simultaneous note events (as occurring in a chord) may not be encoded by a single fingerprint. However, such co-occurring events can be captured by considering several fingerprints. In summary, being discriminative yet compact descriptors of fixed length, such fingerprints have turned out to be suitable for indexing symbolically encoded music data.
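
The construction of such fingerprints and their invariance properties can be illustrated with a short sketch. It assumes note events given as (onset in seconds, MIDI pitch) pairs, describes each triple by its two pitch differences and the ratio of the second to the first time difference (the direction of the ratio is an assumption here), and uses illustrative bounds `min_dt` and `max_dt` to restrict triples to a local neighborhood.

```python
from itertools import combinations


def symbolic_fingerprints(notes, min_dt=0.05, max_dt=2.0):
    """notes: list of (onset_sec, midi_pitch). Returns (p12, p23, r) tuples."""
    notes = sorted(notes)  # combinations() then yields triples in time order
    fingerprints = []
    for (t1, p1), (t2, p2), (t3, p3) in combinations(notes, 3):
        d12, d23 = t2 - t1, t3 - t2
        # Bounds keep fingerprints local and exclude simultaneous onsets.
        if not (min_dt <= d12 <= max_dt and min_dt <= d23 <= max_dt):
            continue
        fingerprints.append((p2 - p1, p3 - p2, d23 / d12))
    return fingerprints
```

Transposing all pitches by a constant leaves the pitch differences unchanged, and scaling all onset times by a constant factor leaves the ratio unchanged. Note, however, that the absolute bounds on the time differences are not tempo-invariant, so a strongly different tempo can change which triples are generated. Enumerating all triples is cubic in the number of notes and only meant for illustration.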

Figure 5: An illustration of cross-modal retrieval via piano transcription and symbolic fingerprinting. (Photo of Werner Goebl courtesy of Clemens Chmelar.)

We now discuss how the symbolic fingerprints can be used for cross-modal music retrieval. As a challenging example scenario, we consider a combined sheet-music identification and score following application tailored to piano music [6]. Given a short excerpt of an audio recording (used as query), the task is to identify the underlying sheet music document as well as the exact score position (see Figure 5). Accordingly, the database for this task consists of sheet music representations of all pieces to be potentially identified. In a preprocessing step, all sheet music documents are first transformed into a suitable symbolic format (e. g., by applying OMR or by extracting note parameters from a MusicXML file). From this encoding, symbolic fingerprints are extracted for each document by considering all possible triples of note events that obey certain constraints. For example, to avoid a combinatorial explosion, one typically imposes constraints in the form of minimum and maximum values for the two time differences between consecutive note events of a triple. The resulting fingerprints along with links to suitable metadata (e. g., corresponding piece and sheet music positions) are stored in a fingerprint database that is equipped with efficient search structures based on indexing techniques.

Similarly, an incoming audio query is also transformed into a set of symbolic fingerprints. This, however, involves a non-trivial transcription step to convert the recording into a symbolic representation. In general, automatic music transcription is still an unsolved problem, in particular for polyphonic music recordings with many different instruments (e. g., orchestral music), see [9, 22, 23]. In the case of single-instrument polyphonic music (such as piano music), state-of-the-art algorithms provide reasonable, albeit far from perfect, transcriptions. In our scenario, we employ a recent transcription algorithm based on a recurrent neural network (RNN) [22]. The use of bidirectional hidden layers enables the system to better model the context of the notes, which exhibit a very characteristic envelope during their decay phase, in particular for piano music. The network was trained on a collection of several hundred piano pieces recorded on various (virtual and real) pianos, see [22] for further details. To make the transcriber applicable also in online scenarios, instead of preprocessing the whole piece of audio at a time, the signal is split into blocks that consist of several subsequent frames centered around the current frame. Using such blocks (each covering roughly  ms of audio) is a trade-off between keeping the system’s ability to model the context of the notes and keeping the introduced delay at a minimum. The network outputs a transcription of the audio query consisting of a list of note onsets and pitches, which can be further transformed into a set of audio fingerprints. Finally, the score fingerprint database is searched for subsets that approximately fit the query’s set of audio fingerprints. The best matching subset in the database yields the sheet music document along with the score position, see Figure 5.
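
The interplay of fingerprint database and query can be sketched with a plain hash table. This is a strongly simplified illustration and not the search procedure of [5, 6]: fingerprints are assumed to be (pitch difference, pitch difference, time ratio) tuples, the ratio is quantized with an arbitrary step so that slightly inaccurate transcriptions still hit the same bucket, and pieces are ranked by a simple vote count without checking the temporal consistency of the matches.

```python
from collections import Counter, defaultdict


def quantize(fingerprint, ratio_step=0.25):
    """Map a fingerprint to a hashable key; ratio_step is an assumed tolerance."""
    p12, p23, r = fingerprint
    return (p12, p23, round(r / ratio_step))


def build_index(score_fingerprints):
    """score_fingerprints: {piece_id: [(score_time, fingerprint), ...]}."""
    index = defaultdict(list)
    for piece_id, entries in score_fingerprints.items():
        for score_time, fingerprint in entries:
            index[quantize(fingerprint)].append((piece_id, score_time))
    return index


def identify(index, query_fingerprints):
    """Return (piece_id, number of matching fingerprints), best piece first."""
    votes = Counter()
    for fingerprint in query_fingerprints:
        for piece_id, _score_time in index.get(quantize(fingerprint), ()):
            votes[piece_id] += 1
    return votes.most_common()
```

The stored score times are not used in this sketch; a fuller system would use them to verify that the matched fingerprints form a temporally coherent subset and to report the score position.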

In contrast to chroma-based mid-level representations, symbolic fingerprints are compact, possess a high discriminative power, and are well suited for indexing techniques. As a result, these techniques scale well to large amounts of data in terms of memory requirements, accuracy, and efficiency. However, there is also a price to be paid. The necessary transcription from audio signals into the symbolic domain is a hard problem that is solvable well enough only for certain classes of music (e. g., piano music recorded under reasonable acoustic conditions). Even though a small proportion of the fingerprints extracted from the query may suffice to identify the correct piece, symbolic fingerprinting may fail if the input representation becomes too noisy. For general music recordings including many instruments (e. g., orchestra), there is still a long way to go; here one requires strategies that better adapt to the multitude of musical aspects including harmony, melody, rhythm, dynamics, and instrumentation. In this context, recent advances in deep learning may help to make further progress in this area. In the subsequent section, we discuss such a deep learning approach that tries to learn sheet music and audio correspondences directly from raw input representations without the need for mid-level representations that explicitly exploit musical knowledge.

Deep Learning Approach

In the previous sections, we have seen two more traditional approaches for linking audio and sheet music data using musically informed mid-level representations—once using chroma features and once symbolic fingerprints. Such representations not only require expert knowledge at the design stage, but are also problematic when relying on error-prone (pre-)processing steps such as automatic music transcription on the audio side or optical music recognition on the sheet music side. As an alternative, we now present a methodology to directly learn correspondences between audio data and sheet music images from a set of training observations, thus circumventing the explicit definition of a mid-level representation. This approach builds on the current success of artificial neural networks, nowadays often referred to as deep learning, which have proven to be powerful tools for automatic feature learning [24]. Given snippets of sheet music images and corresponding audio excerpts, we introduce a cross-modal neural network that learns an embedding space in which both modalities are represented as low-dimensional vectors [7]. In this embedding space, cross-modal music retrieval can then be easily performed by using a simple similarity measure.

The general principle of supervised feature learning is to learn latent representations in an end-to-end fashion from a set of raw training observations. Such approaches are not only generally applicable, but also have the advantage of automatically adapting the learned representations to the given problem. One limitation, however, is that supervised learning requires a sufficiently large set of training data to arrive at models that generalize well to unseen data. In our scenario, we need training pairs that consist of sheet music snippets and corresponding audio excerpts. Typical examples as used in our system are shown in Figure 6(a)–(d). Note that for creating such training pairs, we need to first establish correspondences between individual pixel locations of the note heads in a score and their respective counterparts (note onset events) in the corresponding audio recording. Establishing the correspondences can be done either in a manual annotation process or by relying on synthetic training data generated from digital sheet music formats such as MuseScore or LilyPond. Based on these relationships, one can generate corresponding snippets of sheet music images (in our case pixels) and short excerpts of audio (in our case represented by log-frequency spectrograms with bins frames). These are the pairs presented to the multi-modal network for training.

Figure 6: (a)–(d) Four training pairs, each consisting of a sheet music snippet and an audio excerpt. The pair in (d) was obtained from the pair in (c) by applying data augmentation techniques.

To improve the generalization ability of the resulting network, one can further apply data augmentation techniques to (synthetically) increase the effective size of the training set and to better account for relevant data variability. In this setting, different transformations for sheet music augmentation (e. g., image scaling and translation) and audio augmentation (e. g., using different sound fonts and tempo scaling) are applied. At this point, we emphasize that data augmentation is a crucial component for learning cross-modal representations that generalize to unseen music, especially when little data is available. In this process, augmenting the dataset using data transformations is conceptually different and more promising than automatically generating random scores. First, rendering (synthesizing) sheet music typically results in images with strong regularities (e. g., same scale or perfectly centered staff lines). By applying image transformations, these regularities are disturbed, thus making the embedding networks robust to small distortions as occurring in realistic scenarios, e. g., images of printed sheet music scanned under different conditions and sheet music originating from different publishers using varying type settings. Second, note that music and hence also sheet music follows musical rules. Therefore, augmentation by adding randomly generated music may distort the inherent data distribution of “realistic” music, which may have a negative impact on embedding space learning.

Based on such training pairs, the retrieval task is formulated as an embedding problem with the aim of learning a joint embedding space of the two different modalities [7]. This approach is inspired by a similar text-to-image retrieval problem, where a pairwise ranking loss is introduced as an optimization target [25]. In the following, let $(x, y)$ denote a training pair consisting of a sheet image snippet $x$ and an audio excerpt $y$. As shown in Figure 7, the network consists of two separate pathways. One pathway processes $x$ and is represented by the function $f(\cdot\,; \Theta_f)$, where $\Theta_f$ are the network parameters to be trained. The other pathway, which is represented by the function $g(\cdot\,; \Theta_g)$ with parameters $\Theta_g$, is responsible for $y$. The two functions map $x$ and $y$, respectively, to a $K$-dimensional vector, where $K$ denotes the embedding dimension. To define the loss function, we need a scoring function $s$ to measure similarity in the embedding space. In our scenario, $s$ is chosen to be the cosine measure (i. e., the cosine of the angle between two vectors). Furthermore, for each given training pair $(x, y)$, we assume that there are additional contrasting examples $y_j$ for $j = 1, \ldots, J$. Then, the pairwise ranking loss (also known as max-margin hinge loss, see [25]) is defined as follows:

$$\mathcal{L}_{\mathrm{rank}} = \sum_{(x, y)} \sum_{j=1}^{J} \max\bigl\{0,\ \alpha - s\bigl(f(x; \Theta_f),\, g(y; \Theta_g)\bigr) + s\bigl(f(x; \Theta_f),\, g(y_j; \Theta_g)\bigr)\bigr\} \tag{2}$$

In this formula, the first sum is taken over a set of training pairs (a training batch), where each such pair comes with a separate set of contrasting examples (in practice all remaining audio samples of the current training batch). The purpose of this loss function is to encourage an embedding where the distance between matching samples $(x, y)$ is lower than the distance between mismatching samples $(x, y_j)$. The parameter $\alpha$ is the margin parameter of the hinge loss and, in combination with the maximum function, imposes a penalty on poorly embedded training pairs. More precisely, if the elements of a matching pair are already close in the learned embedding space and, in addition, the elements of the mismatching pairs are embedded far enough apart, the second term in the $\max$-operator goes below zero and the respective pairs do not contribute to the overall loss. On the contrary, if the embedded elements of a matching pair are still far apart, the second term is usually above zero and will yield a substantial contribution to the overall loss. In the training stage, the pairwise ranking loss in Equation (2) is minimized via stochastic gradient descent with respect to the network parameters $\Theta_f$ and $\Theta_g$. Once the networks represented by the functions $f$ and $g$ are learned, the elements of matching pairs are close in the embedding space, while those of contrasting pairs are far apart (in the ideal case). For further details concerning the network topology and the training procedure, we refer to [7, 26].
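
To make the behavior of the pairwise ranking loss tangible, the following sketch evaluates it on a batch of already computed embeddings. It assumes that row i of the two input matrices holds the embeddings of the i-th matching pair, that all other audio rows of the batch serve as contrasting examples, and that the cosine measure is used for scoring; the margin value of 0.7 is an arbitrary placeholder, and the networks as well as the gradient computation are omitted.

```python
import numpy as np


def l2_normalize(Z):
    """Scale every row to unit length so that dot products equal cosines."""
    return Z / np.maximum(np.linalg.norm(Z, axis=1, keepdims=True), 1e-12)


def pairwise_ranking_loss(sheet_emb, audio_emb, margin=0.7):
    """Max-margin hinge loss over a batch of (sheet, audio) embedding pairs."""
    S = l2_normalize(sheet_emb) @ l2_normalize(audio_emb).T  # S[i, j] = cos(x_i, y_j)
    matching = np.diag(S)[:, None]                 # scores of the true pairs
    hinge = np.maximum(0.0, margin - matching + S)
    np.fill_diagonal(hinge, 0.0)                   # a pair is not its own contrast
    return hinge.sum()
```

With well-separated embeddings all hinge terms are clipped to zero, whereas a degenerate embedding that maps every audio excerpt to the same vector is penalized for every contrasting example.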

Figure 7: An illustration of the network used for learning a cross-modal embedding space. At application time, the learned functions $f$ (sheet music pathway) and $g$ (audio pathway) are used to project the sheet music snippets and audio excerpts, respectively, to the joint embedding space.

Given this learned embedding space, cross-modal retrieval can be performed based on a retrieval-by-embedding paradigm, see Figure 7. It is important to note that, although the network pathways are trained simultaneously on pairs of sheet music snippets and audio excerpts, both modalities are required only at training time. At application time, the two network pathways operate independently of each other. This is a major benefit for the cross-modal retrieval applications discussed in the previous sections. For example, in sheet music identification and score following applications, one can first compute an embedding of an entire collection of sheet music snippets using the image embedding function $f$. The resulting $K$-dimensional embedding vectors can be further processed and stored using suitable index structures that allow for an efficient neighborhood search. Then, given an audio excerpt as a query, the search can be performed by first projecting the query into the joint embedding space using the audio embedding function $g$ of the network, and then performing a nearest-neighbor search.
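The retrieval-by-embedding step can be sketched as follows. This is a minimal example that assumes the embeddings have already been computed by the two pathways. The helper names `build_index` and `retrieve` are hypothetical, and the exhaustive cosine-similarity search is only a stand-in for the index structures mentioned above.

```python
import numpy as np


def build_index(sheet_embeddings):
    """Store unit-length embeddings of all sheet music snippets (the database)."""
    emb = np.asarray(sheet_embeddings, dtype=float)
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)


def retrieve(index, audio_embedding, top_k=3):
    """Return the indices and cosine similarities of the top_k closest snippets."""
    query = np.asarray(audio_embedding, dtype=float)
    query = query / np.linalg.norm(query)
    sims = index @ query                 # cosine similarity to every database entry
    order = np.argsort(-sims)[:top_k]    # highest similarity first
    return order, sims[order]


# Toy database of three snippet embeddings in a 2-dimensional embedding space.
index = build_index([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
ranked, scores = retrieve(index, [0.9, 0.1], top_k=2)
```

Because the database embeddings are computed once in advance, only the query has to be passed through the audio pathway at application time. For large collections, the brute-force matrix product would typically be replaced by an approximate nearest-neighbor index.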

The experiments reported in [7], which are based on 26 classical piano pieces (including the composers Bach, Haydn, Beethoven, and Chopin) and roughly 20,000 training pairs, demonstrate that the end-to-end learning approach yields reasonable retrieval results for sheet music of medium complexity (e. g., piano scores) and synthesized audio (used for evaluation to establish the ground truth). In particular, when retrieval based on snippets/excerpts is combined with a subsequent majority voting step, the approach is capable of correctly relating sheet music and audio recordings on the piece level with high accuracy. However, on the level of sheet music snippets (consisting of one or two bars) and audio excerpts (lasting a couple of seconds), the proposed system is not yet competitive with engineered approaches that exploit musical knowledge or are based on symbolic representations (see the approaches presented in the two previous sections).
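The majority voting step mentioned above can be illustrated with a few lines of Python. This is a simplified sketch, not the procedure from [7]: it assumes that each retrieved snippet has already been mapped to the identifier of the piece it belongs to, and the function name `identify_piece` and the piece labels are made up for illustration.

```python
from collections import Counter


def identify_piece(retrieved_piece_ids):
    """Return the piece with the most votes and its share of all votes."""
    votes = Counter(retrieved_piece_ids)
    piece, count = votes.most_common(1)[0]
    return piece, count / len(retrieved_piece_ids)


# Piece labels of the best-matching snippet for five consecutive audio excerpts.
piece, share = identify_piece(["sonata_a", "etude_b", "sonata_a", "waltz_c", "sonata_a"])
```

Even though two of the five excerpt-level matches are wrong in this toy example, the vote still points to the correct piece. This is why piece-level accuracy can be high while snippet-level retrieval remains error-prone.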

At this stage, one may draw the conclusion that, even when comparatively little training data is available, it is still possible to use deep learning models by designing appropriate (task-specific) data augmentation strategies. First experiments showed that, when trained on only one composer, the model started to generalize to unseen scores by other composers. Therefore, we may expect that the described model will develop its full potential when provided with a comprehensive dataset that consists of millions of training pairs comprising different editions and layouts of sheet music and different recorded performances.

Applications and Future Directions

In this paper, we have introduced different approaches for cross-modal music retrieval aiming to bridge the gap between various music representations. Despite the remaining challenges, current technology enables a variety of music navigation and browsing applications of educational and commercial relevance. For example, in the context of modern digital music libraries, cross-modal retrieval strategies have become an important component for content-based analysis, synchronization, indexing, and navigation in heterogeneous music collections [27]. Other cross-modal applications are often subsumed under the umbrella of score following, where the computer “listens” to a live performance and tries to “read along” in the sheet music. The output of a score following algorithm can be used for highlighting the current measure in a digital score, automatic page turning (see footnote 6), or automatic accompaniment (see, e. g., [28]).

In the following, we describe one specific example of a prototype system to give a concrete impression of what is already possible. The Piano Music Companion is a versatile system focused on piano music, intended to be useful for both pianists and music lovers (see [29]). The system is able to identify, follow, and synchronize live performances of classical piano music, in real time. The Piano Music Companion is a permanent listener. Whenever the pianist starts playing (regardless of which piece, or where within the piece), the companion identifies the piece, the position within the score, and continues to follow along. This allows triggering various actions synchronized to the performed music—for instance, the current position in the sheet music is highlighted. While this is helpful for the performer and listener, further information about important themes, musical structures, and chords can be provided. In a concert setting, the system may also give hints to the listener about what to focus on at specific moments. The system may also give additional background information on the piece or composer, while telling the user where to acquire (additional) recordings of the current or related pieces.

Technically, the Piano Music Companion is based on two main components that run in parallel. The first is responsible for identifying the piece being played. To this end, symbolic fingerprinting as described above is used to continuously match the most recently detected notes of the live performance to a database of symbolically encoded sheet music (see Figure 5). Currently, the database includes the complete solo piano works by Chopin and the complete Beethoven piano sonatas, and consists of roughly 1,000,000 notes in total (about 330 pieces). Once the piece and the rough position within the sheet music representation have been identified, the actual score following is conducted using a separate chroma-based tracking procedure, which is realized as an online variant of the matching procedure shown in Figure 3. In this way, the system combines the strengths of the respective components. The fingerprinting component is flexible, works globally across different pieces, and scales to large datasets. However, since the fingerprinter’s transcription step is generally error-prone, the component often produces outliers and local misalignments. This weakness is compensated by the separate chroma-based tracking component, which is less efficient but introduces a high degree of robustness (due to the chroma features). This second component is applied only locally for tracking the score once the piece and the rough position are known.

By combining these two components, the Piano Music Companion continuously re-evaluates its hypothesis and tries to match the current input stream to the complete database. Thus, even if the musician suddenly jumps to a different score position or starts playing a completely different piece, the system is able to follow as long as the piece is part of the database. The Piano Music Companion is also highly tolerant to deviations from the notated score (due to performance errors, transcription errors, or intentional variations), and to tempo changes. A video demonstration of our system can be found at

Our vision is to extend this scenario towards a Complete Classical Music Companion. Such a system will be at one’s fingertips anytime and anywhere, possibly as an application on a mobile device. Whatever source of music—be it a live concert, a DVD, a video stream, or a radio program—whatever piece of classical music, whatever instrumentation, and whoever the performers are, the companion will detect what it is listening to, inform about the music, the historical context of the piece, famous interpretations, etc., thus guiding the user in the listening process.

Figure 8: Some sources of freely accessible music data distributed over the internet. (Video image courtesy of the Warner Music Group.)

Beyond this specific music companion scenario, cross-modal music processing techniques are essential for organizing and searching information distributed over the internet (see Figure 8). For example, there are millions of digitized pages of sheet music publicly available on sites such as the Petrucci Music Library (IMSLP). On the audio side, widely accessible music and video platforms such as YouTube offer a vast and rapidly growing corpus of music recordings. Furthermore, music-related websites such as those available on Wikipedia contain information of various types including text, score, images, and audio. Finally, community-driven encyclopedias such as MusicBrainz collect and provide music-related metadata in a systematic fashion. For example, structured websites can be used to automatically derive text-, score-, and audio-based queries to look for other musically related documents on the Web [13, 30]. Furthermore, YouTube videos may be automatically enriched with manually or automatically generated musical annotations, as recently demonstrated in [31].

This rich application potential, demonstrated in concrete application scenarios, makes cross-modal music retrieval a very active research field, which also drives research on other music processing tasks. For instance, one key challenge is to improve transformation techniques such as OMR and AMT, which are a bottleneck in many of the current approaches. Also, deep neural networks that directly learn to relate different data modalities are a very promising alternative that is getting a lot of attention now. We hope that these prospects will serve as an inspiration for the signal processing community to pay even more attention to music as a promising (and beautiful) object of study.


The International Audio Laboratories Erlangen are a joint institution of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) and the Fraunhofer-Institut für Integrierte Schaltungen IIS. Meinard Müller and Stefan Balke are supported by the German Research Foundation (DFG MU 2686/11-1). Matthias Dorfer acknowledges financial support for early versions of this work by the Austrian Ministries Federal Ministry of Transport, Innovation, and Technology and the Federal Ministry of Science, Research, and Economy and the Province of Upper Austria via the Competence Centers for Excellent Technologies, Software Competence Center Hagenberg. The work of Andreas Arzt and Gerhard Widmer was supported by the European Research Council (ERC) under the European Union’s Horizon 2020 Framework Programme (H2020, 2014-2020) / ERC Advanced Grant Agreement n.670035, project “Con Espressione”.

Biographies of Authors

Meinard Müller

Meinard Müller received the Diploma degree in mathematics and the Ph.D. degree in computer science from the University of Bonn, Germany, in 1997 and 2001, respectively. Since 2012, he has held a professorship for Semantic Audio Signal Processing at the International Audio Laboratories Erlangen. His recent research interests include music processing, music information retrieval, and audio signal processing. Meinard Müller has coauthored more than 100 peer-reviewed scientific papers and has written a monograph titled “Information Retrieval for Music and Motion” (Springer, 2007) as well as a textbook titled “Fundamentals of Music Processing” (Springer, 2015,

Andreas Arzt

Andreas Arzt is an assistant professor at the Johannes Kepler University Linz. He studied computer science at the University of Vienna and at the Vienna University of Technology, in 2006 and 2008, respectively. In 2016, he finished his PhD thesis with the topic “Flexible and Robust Music Tracking” at the Department of Computational Perception at the Johannes Kepler University in Linz, Austria. His research interests include real-time music tracking, music synchronization and music identification.

Stefan Balke

Stefan Balke studied electrical engineering at the Leibniz Universität Hannover, Germany, in 2013. In early 2018, he completed his PhD in the Semantic Audio Signal Processing Group headed by Prof. Meinard Müller at the International Audio Laboratories Erlangen. Afterwards, he joined the Institute of Computational Perception led by Gerhard Widmer at the Johannes Kepler University Linz. His research interests include music information retrieval, machine learning, and multimedia retrieval. In his spare time, he plays trumpet in several local bands and projects.

Matthias Dorfer

Matthias Dorfer studied Medical Informatics at Vienna University of Technology with a focus on computational image analysis and machine learning. After finishing his Master studies, he worked for two years as an industrial researcher in the field of medical image analysis. Since April 2015, he has been a PhD student at the Department of Computational Perception at Johannes Kepler University Linz under the supervision of Professor Gerhard Widmer. His research interests are artificial neural networks, especially multimodal deep learning, and audio-visual representation learning.

Gerhard Widmer

Gerhard Widmer received his Ph.D. degree in computer science in 1989 from Technische Universität Wien, Austria. He is a professor and head of the Department of Computational Perception at Johannes Kepler University, Linz, Austria, and head of the Intelligent Music Processing and Machine Learning Group at the Austrian Research Institute for Artificial Intelligence (OFAI), Vienna, Austria. His research interests include AI, machine learning, and intelligent music processing. He is a Fellow of the European Association for Artificial Intelligence (EurAI), has been awarded Austria’s highest research awards, the START Prize (1998) and the Wittgenstein Award (2009), and currently holds an ERC Advanced Grant for research on computational models of expressivity in music.


  6. A page turner is a person with the task of turning sheet music pages for a soloist during a performance.


  1. A. Wang, “An industrial strength audio search algorithm,” in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), Baltimore, Maryland, USA, 2003, pp. 7–13.
  2. F. Kurth, M. Müller, C. Fremerey, Y. Chang, and M. Clausen, “Automated synchronization of scanned sheet music with audio recordings,” in Proceedings of the International Conference on Music Information Retrieval (ISMIR), Vienna, Austria, Sep. 2007, pp. 261–266.
  3. M. Müller, Fundamentals of Music Processing. Springer Verlag, 2015.
  4. P. Cano, E. Batlle, T. Kalker, and J. Haitsma, “A review of audio fingerprinting,” The Journal of VLSI Signal Processing, vol. 41, no. 3, pp. 271–284, Nov. 2005.
  5. A. Arzt, S. Böck, and G. Widmer, “Fast identification of piece and score position via symbolic fingerprinting,” in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), Porto, Portugal, 2012, pp. 433–438.
  6. A. Arzt, G. Widmer, and R. Sonnleitner, “Tempo- and transposition-invariant identification of piece and score position,” in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), Taipei, Taiwan, 2014, pp. 549–554.
  7. M. Dorfer, A. Arzt, and G. Widmer, “Learning audio-sheet music correspondences for score identification and offline alignment,” in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), Suzhou, China, 2017, pp. 115–122.
  8. D. Byrd and J. G. Simonsen, “Towards a standard testbed for optical music recognition: Definitions, metrics, and page images,” Journal of New Music Research, vol. 44, no. 3, pp. 169–195, 2015.
  9. E. Benetos, S. Dixon, D. Giannoulis, H. Kirchhoff, and A. Klapuri, “Automatic music transcription: challenges and future directions,” Journal of Intelligent Information Systems, vol. 41, no. 3, pp. 407–434, 2013.
  10. E. Gómez, “Tonal description of music audio signals,” PhD Thesis, Universitat Pompeu Fabra, Barcelona, Spain, 2006.
  11. M. Müller, Information Retrieval for Music and Motion. Springer Verlag, 2007.
  12. H. Barlow and S. Morgenstern, A Dictionary of Musical Themes, revised edition, third printing. Crown Publishers, Inc., 1975.
  13. S. Balke, S. P. Achankunju, and M. Müller, “Matching musical themes based on noisy OCR and OMR input,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Brisbane, Australia, 2015, pp. 703–707.
  14. S. Balke, V. Arifi-Müller, L. Lamprecht, and M. Müller, “Retrieving audio recordings using musical themes,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Shanghai, China, 2016, pp. 281–285.
  15. J. Salamon, J. Serrà, and E. Gómez, “Tonal representations for music retrieval: from version identification to query-by-humming,” International Journal of Multimedia Information Retrieval, vol. 2, no. 1, pp. 45–58, 2013.
  16. J. J. Bosch, R. M. Bittner, J. Salamon, and E. Gómez, “A comparison of melody extraction methods based on source-filter modelling,” in Proceedings of the International Conference on Music Information Retrieval (ISMIR), New York City, USA, 2016, pp. 571–577.
  17. M. A. Casey, C. Rhodes, and M. Slaney, “Analysis of minimum distances in high-dimensional musical spaces,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 5, pp. 1015–1028, 2008.
  18. P. Grosche and M. Müller, “Toward characteristic audio shingles for efficient cross-version music retrieval,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Kyoto, Japan, 2012, pp. 473–476.
  19. R. Sonnleitner and G. Widmer, “Robust quad-based audio fingerprinting,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 24, no. 3, pp. 409–421, 2016.
  20. J. Six and M. Leman, “Panako – A scalable acoustic fingerprinting system handling time-scale and pitch modification,” in Proceedings of the International Conference on Music Information Retrieval (ISMIR), Taipei, Taiwan, 2014, pp. 259–264.
  21. J. Serrà, E. Gómez, and P. Herrera, “Audio cover song identification and similarity: Background, approaches, evaluation and beyond,” in Advances in Music Information Retrieval, ser. Studies in Computational Intelligence, Z. W. Ras and A. A. Wieczorkowska, Eds. Berlin, Germany: Springer, 2010, vol. 274, ch. 14, pp. 307–332.
  22. S. Böck and M. Schedl, “Polyphonic piano note transcription with recurrent neural networks,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan, March 2012, pp. 121–124.
  23. S. Sigtia, E. Benetos, and S. Dixon, “An end-to-end neural network for polyphonic piano music transcription,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 5, pp. 927–939, 2016.
  24. J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural Networks, vol. 61, pp. 85–117, 2015.
  25. R. Kiros, R. Salakhutdinov, and R. S. Zemel, “Unifying visual-semantic embeddings with multimodal neural language models,” arXiv preprint (arXiv:1411.2539), 2014.
  26. M. Dorfer, J. Schlüter, A. Vall, F. Korzeniowski, and G. Widmer, “End-to-end cross-modality retrieval with CCA projections and pairwise ranking loss,” International Journal of Multimedia Information Retrieval, 2017.
  27. D. Damm, C. Fremerey, V. Thomas, M. Clausen, F. Kurth, and M. Müller, “A digital library framework for heterogeneous music collections: from document acquisition to cross-modal interaction,” International Journal on Digital Libraries: Special Issue on Music Digital Libraries, vol. 12, no. 2-3, pp. 53–71, 2012.
  28. R. B. Dannenberg and C. Raphael, “Music score alignment and computer accompaniment,” Communications of the ACM, Special Issue: Music Information Retrieval, vol. 49, no. 8, pp. 38–43, 2006.
  29. A. Arzt, S. Böck, S. Flossmann, H. Frostel, M. Gasser, C. C. S. Liem, and G. Widmer, “The piano music companion,” in Proceedings of the European Conference on Artificial Intelligence, Prague, Czech Republic, 2014, pp. 1221–1222.
  30. M. Gasser, A. Arzt, T. Gadermaier, M. Grachten, and G. Widmer, “Classical music on the web - user interfaces and data representations,” in Proceedings of the International Conference on Music Information Retrieval (ISMIR), 2015, pp. 571–577.
  31. S. Balke, C. Dittmar, J. Abeßer, K. Frieler, M. Pfleiderer, and M. Müller, “Bridging the Gap: Enriching YouTube videos with jazz music annotations,” Frontiers in Digital Humanities, vol. 5, 2018.