Cover Song Identification with Timbral Shape Sequences
Abstract
We introduce a novel low-level feature for identifying cover songs which quantifies the relative changes in the smoothed frequency spectrum of a song. Our key insight is that a sliding window representation of a chunk of audio can be viewed as a time-ordered point cloud in high dimensions. For corresponding chunks of audio between different versions of the same song, these point clouds are approximately rotated, translated, and scaled copies of each other. If we treat MFCC embeddings as point clouds and cast the problem as a relative shape sequence, we are able to correctly identify 42/80 cover songs in the “Covers 80” dataset. By contrast, all other work to date on cover songs exclusively relies on matching note sequences from Chroma-derived features.
1 Introduction
Automatic cover song identification is a surprisingly difficult classical problem that has long been of interest to the music information retrieval community [5]. This problem is significantly more challenging than traditional audio fingerprinting because a combination of tempo changes, musical key transpositions, embellishments in time and expression, and changes in vocals and instrumentation can all occur simultaneously between the original version of a song and its cover. Hence, low-level features used in this task need to be robust to all of these phenomena, ruling out raw forms of popular features such as MFCC, CQT, and Chroma.
One prior approach, as reviewed in Section 2, is to compare beat-synchronous sequences of chroma vectors between candidate covers. The beat-syncing makes this comparison invariant to tempo, but it is still not invariant to key. However, many schemes have been proposed to deal with this, up to and including a brute force check over all key transpositions.
Chroma representations factor out some timbral information by folding together all octaves, which is sensible given the effect that different instruments and recording environments have on timbre. However, valuable non-pitch information which is preserved between cover versions, such as spectral fingerprints from drum patterns, is obscured in the Chroma representation. This motivated us to take another look at whether timbre-based features could be used at all for this problem. Our idea is that even if absolute timbral information is vastly different between two versions of the same song, the relative evolution of timbre over time should be comparable.
With careful centering and normalization within small windows to combat differences in global timbral drift between the two songs, we are indeed able to design shape features which are approximately invariant across cover versions. These features, which are based on self-similarity matrices of MFCC coefficients, can be used on their own to effectively score cover songs. This, in turn, demonstrates that even if absolute pitch is obscured and blurred, cover song identification is still possible.
2 Prior Work
To the best of our knowledge, all prior low-level feature design for cover song identification has focused on Chroma-based representations alone. The cover songs problem statement began with the work of [5], which used FFT-based cross-correlation of all key transpositions of beat-synchronous chroma between two songs. A follow-up work [8] showed that high-passing such cross-correlation can lead to better results. In general, however, cross-correlation is not robust to changes in timing, and it is also a global alignment technique. Serra [22] extended this initial work by considering dynamic programming local alignment of chroma sequences, with follow-up work, rigorous parameter testing, and an “optimal key transposition index” estimation presented in [23]. The same authors also showed that a delay embedding of statistics spanning multiple beats before local alignment improves classification accuracy [25]. In a different approach, [14] compared modeled covariance statistics of all chroma bins, as well as comparing covariance statistics for all pairwise differences of beat-level chroma features, which is not unlike the “bag of words” and bigram representations, respectively, in text analysis. Other work tried to model sequences of chords [2] as a slightly higher level feature than chroma. Slightly later work concentrated on fusing the results of music separated into melody and accompaniment [11] and melody, bass line, and harmony [21], showing improvements over matching chroma on the raw audio. The most recent work on cover song identification has focused on fast techniques for large-scale pitch-based cover song identification, using a sparse set of approximate nearest neighbors [28] and low dimensional projections [12].
The authors of [9] and [17] also use the magnitude of the 2D Fourier Transform of a sequence of chroma vectors treated as an image, so the resulting coefficients are automatically invariant to key and time shifting without any extra computation, at the cost of some discriminative power.
Outside of cover song identification, there are other works which examine gappy sequences of MFCC in music, such as [4]. However, these works look at matched sequences of MFCC-like features in their original feature space. By contrast, in our work, we examine the relative shape of such features. Finally, we are not the first to consider shape in an applied musical context. For instance, [29] turns sequences of notes in sheet music into plane curves, whose curvature is then examined. To our knowledge, however, we are the first to explicitly model shape in musical audio for version identification.
3 Time-Ordered Point Clouds from Blocks of Audio
The first step of our algorithm uses a timbre-based method to turn a block of audio into what we call a time-ordered point cloud. We can then compare to other time-ordered point clouds in a rotation, translation, and scale invariant manner using normalized Euclidean self-similarity matrices (Section 3.3). The goal is to then match up the relative shape of musical trajectories between cover versions.
3.1 Point Clouds from Blocks and Windows
We start with a song, which is a function of time that has been discretized as some vector $x$. In the following discussion, the symbol $x[a, b]$ means the song portion beginning at time index $a$ and ending at time index $b$. Given $x$, there are many ways to summarize a chunk of audio $x[a, b]$, which we call a window, as a point in some feature space. We use the classical Mel-Frequency Cepstral Coefficient representation [3], which is based on a perceptually motivated log-frequency and log-power short-time Fourier transform that preserves timbral information. In our application, we perform an MFCC with 20 coefficients, giving rise to a 20-dimensional point:

(1) $F(x[a, b]) = \mathrm{MFCC}(x[a, b]) \in \mathbb{R}^{20}$
Given a longer chunk of audio, which we call a block, we can use the above embedding on a collection of windows that cover the block to construct a collection of points, or a point cloud, representing that block. More formally, given a block covering a range $[t_1, t_2]$, we want a set of window intervals $\{[a_i, b_i]\}_{i=1}^{N}$, with $a_i < a_{i+1}$, so that

$a_1 = t_1, \quad b_N = t_2, \quad a_{i+1} \leq b_i + 1,$

where $t_1$, $t_2$, $a_i$, and $b_i$ are all discrete time indices into the sampled audio $x$. Hence, our final operator takes a set of time-ordered intervals which cover a block and turns them into a 20-dimensional point cloud in $\mathbb{R}^{20}$:

(2) $\{ F(x[a_i, b_i]) \}_{i=1}^{N} \subset \mathbb{R}^{20}$
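The windowing scheme above can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's code: `toy_embed` is a simple stand-in for a real 20-coefficient MFCC (which would come from a DSP library), and all function and parameter names are our own.

```python
import numpy as np

def window_intervals(t1, t2, win_len, hop):
    """Evenly advanced [a_i, b_i) windows of length win_len covering [t1, t2)."""
    starts = np.arange(t1, t2 - win_len + 1, hop)
    return [(int(a), int(a) + win_len) for a in starts]

def point_cloud(x, t1, t2, win_len, hop, embed):
    """Map each window of the block x[t1:t2] to a point via `embed`,
    yielding a time-ordered point cloud (one row per window)."""
    return np.array([embed(x[a:b]) for a, b in window_intervals(t1, t2, win_len, hop)])

def toy_embed(w):
    """Stand-in for the 20-dim MFCC: log power summed in 20 frequency bands."""
    spec = np.abs(np.fft.rfft(w)) ** 2
    bands = np.array_split(spec, 20)
    return np.log(np.array([b.sum() for b in bands]) + 1e-12)
```

With a window length of 200 samples and a hop of 100, a block of 1000 samples yields 9 overlapping windows, hence a 9-point cloud in 20 dimensions.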
3.2 Beat-Synchronous Blocks
As many others in the MIR community have done, including [5] and [8] for the cover songs application, we compute our features synchronized within beat intervals. We use a simple dynamic programming beat tracker developed in [6]. Similarly to [8], we bias the beat tracker with three initial tempo levels: 60 BPM, 120 BPM, and 180 BPM, and we compare the embeddings from all three levels against each other when comparing two songs, taking the best score out of the 9 combinations. This mitigates the tendency of the beat tracker to double or halve the true beat intervals of different versions of the same song when there are tempo changes between the two. The tradeoff is, of course, additional computation. We should note that other cover song works, such as [23], avoid the beat tracking step altogether, hence bypassing these problems. However, it is important for us to align our sequences as well as possible in time so that shape features are in correspondence, and this is a straightforward way to do so.
Given a set of beat intervals, the union of which makes up the entire song, we take blocks to be all contiguous groups of $B$ beat intervals. In other words, we create a sequence of overlapping blocks such that each block is made up of $B$ time-contiguous beat intervals, and consecutive blocks differ only by their starting and finishing beats. Hence, given $n$ beat intervals, there are $n - B + 1$ blocks total. Note that computing an embedding over more than one beat is similar in spirit to the chroma delay embedding approach in [25]. Intuitively, examining patterns over a group of beats gives more information than one beat alone, the effect of which is empirically evaluated in Section 5. For all blocks, we take the window size to be the length of the average tempo period, and we advance the window intervals evenly from the beginning of the block to the end of the block with a hop size much smaller than the window length. Hence, there is a large overlap between adjacent windows. We were inspired by theory on raw 1D time series signals [18], which shows that matching the window length to be just under the length of the period in a delay embedding maximizes the roundness of the embedding. Here we would like to match beat-level periodicities and fluctuations therein, so it is sensible to choose a window size corresponding to the tempo. This is in contrast to most other applications that use MFCC sliding window embeddings, which use a much smaller window size on the order of 10s of milliseconds, generally with about 50% overlap, to ensure that the frequency statistics are stationary in each window. In our application, however, we have found that a longer window size makes our self-similarity matrices (Section 3.3) smoother, allowing for more reliable matches of beat-level musical trajectories, while having more windows per beat (high overlap) leads to more robust matching of SSMs using L2 (Section 4.1).
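The construction of overlapping blocks from beat intervals can be sketched as follows. The function name and the representation of beats as boundary times are our own illustrative choices, not the paper's code.

```python
def beat_blocks(beat_onsets, B):
    """All contiguous groups of B beat intervals.
    beat_onsets: sorted list of n + 1 beat boundary times (n beat intervals).
    Returns n - B + 1 (start, end) time ranges; consecutive blocks are
    shifted by exactly one beat, so they overlap in B - 1 beats."""
    n = len(beat_onsets) - 1
    return [(beat_onsets[i], beat_onsets[i + B]) for i in range(n - B + 1)]
```

For example, 5 beat intervals with $B = 3$ produce $5 - 3 + 1 = 3$ overlapping blocks.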
Figure 1 shows the first three principal components of an MFCC embedding with a traditional small window size versus our longer window embedding, illustrating the smoothing effect.
3.3 Euclidean Self-Similarity Matrices
For each beat-synchronous block spanning $B$ beats, we have a 20-dimensional point cloud extracted from the sliding window MFCC representation. Given such a time-ordered point cloud, there is a natural way to create an image which represents the shape of this point cloud in a rotation and translation invariant way, called the self-similarity matrix (SSM) representation.
Definition 1.
A Euclidean Self-Similarity Matrix (SSM) over an ordered point cloud $X_1, X_2, \ldots, X_N$ is an $N \times N$ matrix $D$ so that

(3) $D_{ij} = \left\| X_i - X_j \right\|_2$
In other words, an SSM is an image representing all pairwise distances between points in a point cloud ordered by time. SSMs have been used extensively in the MIR community already, spearheaded by the work of Foote in 2000 for note segmentation in time [10]. They are now often used in general segmentation tasks [24] [15]. They have also been successfully applied in other communities, such as computer vision, to recognize activity classes in videos from different points of view and by different actors [13]. Inspired by this work, we use self-similarity matrices as isometry invariant descriptors of local shape in our sliding windows of beat blocks, with the goal of capturing relative shape. In our case, the “activities” are musical expressions over small intervals, and the “actors” are different performers or groups of instruments.
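A minimal sketch of the SSM computation in Definition 1, using the standard vectorized pairwise-distance trick; the function name is ours, not the paper's code.

```python
import numpy as np

def ssm(X):
    """Euclidean self-similarity matrix of an ordered point cloud X (N x d):
    D[i, j] = ||X[i] - X[j]||_2. Pairwise distances are unchanged by any
    rotation or translation of X, which is the invariance the paper exploits."""
    sq = np.sum(X ** 2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2 * X @ X.T   # squared distances
    return np.sqrt(np.maximum(D2, 0))              # clamp tiny negatives
```

Rotating the point cloud leaves the SSM exactly unchanged, which is easy to verify numerically.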
To help normalize for loudness and other changes in relationships between instruments, we first center the point cloud within each block on its mean and scale each point to have unit norm before computing the SSM. That is, we compute the SSM on the points $\tilde{X}_i$, where

(4) $\tilde{X}_i = \dfrac{X_i - \bar{X}}{\left\| X_i - \bar{X} \right\|_2}, \qquad \bar{X} = \dfrac{1}{N} \sum_{j=1}^{N} X_j$
Also, not every beat block has the same number of samples, due to natural variations of tempo in real songs. Thus, to allow comparisons between all blocks, we resize each SSM to a common image dimension $d \times d$, where $d$ is a parameter chosen in advance, the effects of which are explored empirically in Section 5.
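The normalization of Equation 4 and the resize step can be sketched as follows. Nearest-neighbor resizing is used here for simplicity; the paper does not specify its interpolation scheme, and the function names are our own.

```python
import numpy as np

def normalize_block(X):
    """Center a block's point cloud on its mean, then scale each
    point to unit norm (Equation 4)."""
    Y = X - X.mean(axis=0)
    norms = np.linalg.norm(Y, axis=1, keepdims=True)
    return Y / np.maximum(norms, 1e-12)   # guard against zero-norm points

def resize_ssm(D, d):
    """Resize an N x N SSM to d x d (nearest-neighbor index sampling) so
    blocks with different numbers of windows can be compared."""
    idx = np.round(np.linspace(0, D.shape[0] - 1, d)).astype(int)
    return D[np.ix_(idx, idx)]
```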
Figure 2 shows examples of SSMs of 4-beat blocks pulled from the Covers 80 dataset that our algorithm matches between two different versions of the same song. Visually, similarities in the matched regions are evident. In particular, viewing the images as height functions, many of the critical points are close to each other. The “We Can Work It Out” example shows how this can work even for live performances, where the overall acoustics are quite different. Even more strikingly, the “Don’t Let It Bring You Down” example shows how similar shape patterns emerge even with an opposite gender singer and radically different instrumentation. Of course, in both examples, there are subtle differences due to embellishments, local time stretching, and imperfect normalization between the different versions, but as we show in Section 5, there are often enough similarities to match up blocks correctly in practice.
4 Global Comparison of Two Songs
Once all of the beat-synchronous SSMs have been extracted from two songs, we perform a global comparison between all of their SSMs to score the pair as a potential cover match. Figure 3 shows a block diagram of our system. After extracting beat-synchronous timbral shape features as SSMs, we compute a binary cross-similarity matrix based on the L2 distance between all pairs of self-similarity matrices between the two songs. We subsequently apply the Smith-Waterman algorithm to the binary cross-similarity matrix to score a match between the two songs.
4.1 Binary Cross-Similarity and Local Alignment
Given a set of $M$ beat-synchronous block SSMs for a song A and a set of $N$ beat-synchronous block SSMs for a song B, we compute a song-level matching between songs A and B by comparing all pairs of SSMs between the two songs. For this we create an $M \times N$ cross-similarity matrix (CSM), where

(5) $\mathrm{CSM}_{ij} = \left\| D^A_i - D^B_j \right\|_F$

is the Frobenius norm (L2 image norm) of the difference between the SSM $D^A_i$ for the $i$th beat block from song A and the SSM $D^B_j$ for the $j$th beat block from song B. Given this cross-similarity information, we then compute a binary cross-similarity matrix $B$. A binary matrix is necessary so that we can apply the Smith-Waterman local alignment algorithm [27] to score the match between songs A and B, since Smith-Waterman only works on a discrete, quantized alphabet, not real values [23]. To compute $B$, we take the mutual $\kappa$-fraction nearest neighbors between song A and song B, as in [25]. That is, $B_{ij} = 1$ if $\mathrm{CSM}_{ij}$ is within the $\kappa N$ smallest values in row $i$ of the CSM and $\mathrm{CSM}_{ij}$ is within the $\kappa M$ smallest values in column $j$ of the CSM, and $B_{ij} = 0$ otherwise. As in [25], we found that a dynamic distance threshold for mutual nearest neighbors per element worked significantly better than a fixed distance threshold for the entire matrix.
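The mutual fraction nearest neighbor thresholding can be sketched as below. This is an illustrative implementation of the idea, with our own function name; tie-handling details may differ from the paper's code.

```python
import numpy as np

def binary_csm(C, kappa):
    """B[i, j] = 1 iff C[i, j] is among the kappa-fraction smallest values in
    row i AND among the kappa-fraction smallest in column j (mutual nearest
    neighbors with a per-row/per-column dynamic threshold)."""
    M, N = C.shape
    kr = max(1, int(kappa * N))   # neighbors kept per row
    kc = max(1, int(kappa * M))   # neighbors kept per column
    row_thresh = np.sort(C, axis=1)[:, kr - 1][:, None]
    col_thresh = np.sort(C, axis=0)[kc - 1, :][None, :]
    return ((C <= row_thresh) & (C <= col_thresh)).astype(int)
```

On a CSM whose smallest entries lie on the diagonal, this keeps exactly the diagonal for small `kappa`.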
Once we have the matrix $B$, we can feed it to the Smith-Waterman algorithm, which finds the best local alignment between the two songs, allowing for time shifting and gaps. Local alignment is a more appropriate choice than global alignment for the cover songs problem, since it is possible that different versions of the same song may have intros, outros, or bridge sections that were not present in the original song, even though there are otherwise many sections in common. We choose a version of Smith-Waterman with diagonal constraints, which was shown to work well for aligning binary cross-similarity matrices for chroma in cover song identification [23]. In particular, we recursively compute a matrix $D$ so that

(6) $D_{ij} = \max \left\{ 0, \; D_{i-1, j-1} + \mu(i, j), \; D_{i-2, j-1} + \mu(i, j) + g(1), \; D_{i-1, j-2} + \mu(i, j) + g(1) \right\}$

where $\mu(i, j) = 2\,\delta(B_{ij}, 1) - 1$, $\delta$ is the Kronecker delta function, and

(7) $g(k) = g_o + g_e (k - 1)$

The $\mu(i, j)$ term in each line is such that there will be a +1 score for a match and a -1 score for a mismatch. The function $g$ is the so-called “affine gap penalty,” which gives a (negative) score of $g(k)$ for a gap of length $k$. The local constraints are to bias Smith-Waterman toward choosing paths along near-diagonals of $B$. This is important since in musical applications, we do not expect large gaps in time in one song that are not in the other, which would show up as horizontal or vertical paths through the matrix. Rather, we prefer gaps that occur nearly simultaneously in time for a poorly matched beat or set of beats in an otherwise well-matching section. Figure 6 shows a visual representation of the paths considered through $D$.
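The diagonally constrained recursion can be sketched as follows. The gap penalty value here is illustrative only, not the paper's tuned parameter, and the function name is our own.

```python
import numpy as np

def smith_waterman_diag(B, gap=-0.5):
    """Diagonally constrained Smith-Waterman score on a binary CSM B.
    Moves: (1,1) plain diagonal step, plus (2,1)/(1,2) near-diagonal steps
    that skip one beat in one song at cost `gap`. A match scores +1 and a
    mismatch -1, as in the text; `gap` is an illustrative placeholder."""
    M, N = B.shape
    D = np.zeros((M + 1, N + 1))
    best = 0.0
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            mu = 2 * B[i - 1, j - 1] - 1           # +1 match, -1 mismatch
            cands = [0.0, D[i - 1, j - 1] + mu]
            if i >= 2:
                cands.append(D[i - 2, j - 1] + mu + gap)
            if j >= 2:
                cands.append(D[i - 1, j - 2] + mu + gap)
            D[i, j] = max(cands)
            best = max(best, D[i, j])
    return best
```

A perfect diagonal of matches in a 5x5 binary matrix accumulates a score of 5, while an all-zero matrix scores 0.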
Figure 4 shows an example of a CSM, $B$, and the resulting Smith-Waterman matrix for a true cover song pair. Several long diagonals are visible, indicating that large chunks of the two songs are in correspondence, and this gives rise to a large score between the two songs. Figure 5 shows the CSM, $B$, and Smith-Waterman matrix for two songs which are not versions of each other. By contrast, there are no long diagonals, and this pair only receives a score of 8.
5 Results
5.1 Covers 80
To benchmark our algorithm, we apply it to the standard “Covers 80” dataset [7], which consists of 80 pairs of two versions of the same song, most of which are pop songs from the past three decades. The songs are divided into two designated sets, A and B, each containing exactly one version from every pair. To benchmark our algorithm on this dataset, we follow the scheme in [5] and [8]. That is, given a song from set A, we compute the Smith-Waterman score against all songs from set B and declare the cover song to be the one with the maximum score. Note that a random classifier would only get 1/80 correct in expectation under this scheme. The best score reported on this dataset is 72/80 [20], using a support vector machine on several different chroma-derived features.
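The evaluation scheme above reduces to an argmax per row of a score matrix; a minimal sketch (our own function name, assuming the true cover of song $i$ in set A is song $i$ in set B):

```python
import numpy as np

def covers80_accuracy(scores):
    """scores[i, j] = match score between song i of set A and song j of set B.
    The predicted cover of A_i is argmax_j scores[i, j]; with the true cover
    at j == i, accuracy is the fraction of correct argmaxes."""
    pred = np.argmax(scores, axis=1)
    return np.mean(pred == np.arange(scores.shape[0]))
```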
Table 1 shows the number of correctly identified songs based on the maximum score, given variations of the parameters in our algorithm. We achieve a maximum score of 42/80 for a variety of parameter combinations. The nearest neighbor fraction $\kappa$ and the dimension $d$ of the SSM image have very little effect, but increasing the number of beats per block $B$ has a positive effect on performance. The stability with respect to $\kappa$ and $d$ is encouraging from a robustness standpoint, and the positive effect of increasing the number of beats per block suggests that the shapes of medium-scale musical expressions are more discriminative than those of smaller ones.
Kappa = 0.05 | B = 8 | B = 10 | B = 12 | B = 14
d = 100      |  30   |  33    |  36    |  40
d = 200      |  31   |  33    |  36    |  39
d = 300      |  31   |  34    |  36    |  40

Kappa = 0.1  | B = 8 | B = 10 | B = 12 | B = 14
d = 100      |  35   |  39    |  41    |  42
d = 200      |  36   |  38    |  42    |  42
d = 300      |  36   |  38    |  41    |  41

Kappa = 0.15 | B = 8 | B = 10 | B = 12 | B = 14
d = 100      |  36   |  42    |  41    |  42
d = 200      |  36   |  41    |  41    |  42
d = 300      |  38   |  42    |  42    |  41
In addition to the Covers 80 benchmark, we apply our cover songs score to a recent popular music controversy, the “Blurred Lines” controversy [16]. Marvin Gaye’s estate argues that Robin Thicke’s recent pop song “Blurred Lines” is a copyright infringement of Gaye’s “Got To Give It Up.” Though the note sequences differ between the two songs, ruling out any chance of a high chroma-based score, Robin Thicke has said that his song was meant to “evoke an era” (Marvin Gaye’s era) and that he derived significant inspiration from “Got To Give It Up” specifically [16]. Without making a statement about any legal implications, we note that our timbral shape-based score between “Blurred Lines” and “Got To Give It Up” is in a high percentile of all scores between songs in group A and group B in the Covers 80 dataset. Unsurprisingly, when comparing “Blurred Lines” with all other songs in the Covers 80 database plus “Got To Give It Up,” “Got To Give It Up” was the highest ranked. For reference, binary cross-similarity matrices are shown in Figure 7, both for our timbre shape-based technique and for the delay embedding chroma technique in [25]. The timbre-based cross-similarity matrix is densely populated with diagonals, while the pitch-based one is not.
6 Conclusions And Future Work
We have shown that timbral information in the form of MFCC can indeed be used for cover song identification. Most prior approaches have used Chroma-based features averaged over beat intervals. By contrast, we show that an analysis of the fine relative shape of MFCC features over such intervals is another way to achieve good performance. This opens up the possibility for MFCC to be used in much more flexible music information retrieval scenarios than traditional audio fingerprinting.
On the more technical side, we should note that, for comparing shape, the L2 distance between SSMs used in our cross-similarity is fairly simple and not robust to local reparameterizations in time between versions, though we tried many other isometry invariant shape descriptors that were significantly slower and yielded inferior performance in initial implementations. In particular, we tried curvature descriptors (ratio of arc length to chord length), Gromov-Hausdorff distance after fractional iterative closest points alignment of MFCC block curves [19], and Earth Mover’s distance between SSMs [26]. If we are able to find another shape descriptor which performs better than our current scheme but is slower, we may still be able to make it computationally feasible by using the “Generalized PatchMatch” algorithm [1] to reduce the number of pairwise block comparisons needed by exploiting coherence in time. This is similar in spirit to the approximate nearest neighbor schemes proposed in [28] for large-scale cover song identification, and we could adapt their sparse Smith-Waterman algorithm to our problem. In an initial implementation of Generalized PatchMatch for our current scheme, we found we only needed to query about 15% of the block pairs.
7 Supplementary Material
We have documented our code and uploaded directions for performing all experiments run in this paper. We also created an open source graphical user interface which can be used to interactively view crosssimilarity matrices and to examine the shape of blocks of audio after 3D PCA using OpenGL. All code can be found in the ISMIR2015 directory at
8 Acknowledgements
Chris Tralie was supported under NSF-DMS 1045133 and an NSF Graduate Fellowship. Paul Bendich was supported by NSF 144749. John Harer and Guillermo Sapiro are thanked for valuable feedback. The authors would also like to thank the Information Initiative at Duke (iiD) for stimulating this collaboration.
References
[1] Connelly Barnes, Eli Shechtman, Dan B Goldman, and Adam Finkelstein. The generalized PatchMatch correspondence algorithm. In Computer Vision – ECCV 2010, pages 29–43. Springer, 2010.
[2] Juan Pablo Bello. Audio-based cover song retrieval using approximate chord sequences: Testing shifts, gaps, swaps and beats. In ISMIR, volume 7, pages 239–244, 2007.
[3] Bruce P Bogert, Michael JR Healy, and John W Tukey. The quefrency alanysis of time series for echoes: Cepstrum, pseudo-autocovariance, cross-cepstrum and saphe cracking. In Proceedings of the Symposium on Time Series Analysis, volume 15, pages 209–243, 1963.
[4] Michael Casey and Malcolm Slaney. The importance of sequences in musical similarity. In 2006 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume 5, pages V–V. IEEE, 2006.
[5] Daniel PW Ellis. Identifying ‘cover songs’ with beat-synchronous chroma features. MIREX 2006, pages 1–4, 2006.
[6] Daniel PW Ellis. Beat tracking by dynamic programming. Journal of New Music Research, 36(1):51–60, 2007.
[7] Daniel PW Ellis. The “covers80” cover song data set. URL: http://labrosa.ee.columbia.edu/projects/coversongs/covers80, 2007.
[8] Daniel PW Ellis and Courtenay Valentine Cotton. The 2007 LabROSA cover song detection system. MIREX 2007, 2007.
[9] Daniel PW Ellis and Thierry Bertin-Mahieux. Large-scale cover song recognition using the 2D Fourier transform magnitude. In The 13th International Society for Music Information Retrieval Conference, pages 241–246, 2012.
[10] Jonathan Foote. Automatic audio segmentation using a measure of audio novelty. In 2000 IEEE International Conference on Multimedia and Expo (ICME), volume 1, pages 452–455. IEEE, 2000.
[11] Rémi Foucard, J-L Durrieu, Mathieu Lagrange, and Gaël Richard. Multimodal similarity between musical streams for cover version detection. In 2010 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5514–5517. IEEE, 2010.
[12] Eric J Humphrey, Oriol Nieto, and Juan Pablo Bello. Data driven and discriminative projections for large-scale cover song identification. In ISMIR, pages 149–154, 2013.
[13] Imran N Junejo, Emilie Dexter, Ivan Laptev, and Patrick Pérez. Cross-view action recognition from temporal self-similarities. In Proceedings of the 10th European Conference on Computer Vision: Part II, pages 293–306. Springer-Verlag, 2008.
[14] Samuel Kim, Erdem Unal, and Shrikanth Narayanan. Music fingerprint extraction for classical music cover song identification. In 2008 IEEE International Conference on Multimedia and Expo, pages 1261–1264. IEEE, 2008.
[15] Brian McFee and Daniel PW Ellis. Analyzing song structure with spectral clustering. In 15th International Society for Music Information Retrieval (ISMIR) Conference, 2014.
[16] Emily Miao and Nicole E Grimm. The blurred lines of what constitutes copyright infringement of music: Robin Thicke v. Marvin Gaye’s estate. Westlaw Journal of Intellectual Property, 20:1, 2013.
[17] Oriol Nieto and Juan Pablo Bello. Music segment similarity using 2D-Fourier magnitude coefficients. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 664–668. IEEE, 2014.
[18] Jose A Perea and John Harer. Sliding windows and persistence: An application of topological methods to signal analysis. Foundations of Computational Mathematics, pages 1–40, 2013.
[19] Jeff M Phillips, Ran Liu, and Carlo Tomasi. Outlier robust ICP for minimizing fractional RMSD. In Sixth International Conference on 3-D Digital Imaging and Modeling (3DIM 2007), pages 427–434. IEEE, 2007.
[20] Suman Ravuri and Daniel PW Ellis. Cover song detection: From high scores to general classification. In 2010 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 65–68. IEEE, 2010.
[21] Justin Salamon, Joan Serrà, and Emilia Gómez. Melody, bass line, and harmony representations for music version identification. In Proceedings of the 21st International Conference Companion on World Wide Web, pages 887–894. ACM, 2012.
[22] J Serra. Music similarity based on sequences of descriptors: Tonal features applied to audio cover song identification. Department of Information and Communication Technologies, Universitat Pompeu Fabra, Barcelona, Spain, 2007.
[23] Joan Serra, Emilia Gómez, Perfecto Herrera, and Xavier Serra. Chroma binary similarity and local alignment applied to cover song identification. IEEE Transactions on Audio, Speech, and Language Processing, 16(6):1138–1151, 2008.
[24] Joan Serra, Meinard Müller, Peter Grosche, and Josep Lluis Arcos. Unsupervised detection of music boundaries by time series structure features. In Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012.
[25] Joan Serra, Xavier Serra, and Ralph G Andrzejak. Cross recurrence quantification for cover song identification. New Journal of Physics, 11(9):093017, 2009.
[26] Sameer Shirdhonkar and David W Jacobs. Approximate earth mover’s distance in linear time. In 2008 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8. IEEE, 2008.
[27] Temple F Smith and Michael S Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.
[28] Romain Tavenard, Hervé Jégou, and Mathieu Lagrange. Efficient cover song identification using approximate nearest neighbors. 2012.
[29] Julián Urbano, Juan Lloréns, Jorge Morato, and Sonia Sánchez-Cuadrado. Melodic similarity through shape similarity. In Exploring Music Contents, pages 338–355. Springer, 2011.