# Sequential Complexity as a Descriptor for Musical Similarity

## Abstract

We propose string compressibility as a descriptor of temporal structure in audio, for the purpose of determining musical similarity. Our descriptors are based on computing track-wise compression rates of quantised audio features, using multiple temporal resolutions and quantisation granularities. To verify that our descriptors capture musically relevant information, we incorporate our descriptors into similarity rating prediction and song year prediction tasks. We base our evaluation on a dataset of 15 500 track excerpts of Western popular music, for which we obtain 7 800 web-sourced pairwise similarity ratings. To assess the agreement among similarity ratings, we perform an evaluation under controlled conditions, obtaining a rank correlation of 0.33 between intersected sets of ratings. Combined with bag-of-features descriptors, we obtain performance gains of 31.1% and 10.9% for similarity rating prediction and song year prediction. For both tasks, analysis of selected descriptors reveals that representing features at multiple time scales benefits prediction accuracy.

## I Introduction

We are concerned with the task of quantifying musical similarity, which has received considerable interest in the field of audio-based music content analysis [casey2008content, fu2011survey]. Owing to the proliferation of music in digital formats and the expansion of web-based music databases, there is an impetus to develop novel search, navigation and recommendation systems. Music content analysis has found application in such information retrieval systems as an alternative to manual annotation processes, when the latter are infeasible, unavailable or would benefit from being supplemented [celma2009music].

We may distinguish between music content analysis applications such as audio fingerprinting [cano2005review], version identification [serra2011identification], genre classification [scaringella2006automatic] and mood identification [kim2010music]. Given a query track, audio fingerprinting typically should identify a unique track deemed similar with respect to a collection. In contrast, for genre and mood classification, the set of tracks deemed similar with respect to a collection is typically large. Thus, we may distinguish between music classification tasks according to the degree of specificity associated with the measure of musical similarity [casey2008content].

In this work, we consider two low-specificity tasks, namely similarity rating prediction and song year prediction. An important issue in our considered domain surrounds feature representation. In particular, we address the problem of representing temporal structure in audio features. We refer to summary statistics of audio features extracted from a song as descriptors. Descriptors may be characterised according to how temporal structure is accounted for [fu2011survey]. We may distinguish between bag-of-features representations [aucouturier2007bag], which discard information on temporal structure, and sequential representations. As a sequential representation, we propose to estimate the complexity of audio feature time series, where we quantify complexity in terms of string compressibility. As a result, we obtain scalar-valued summary statistics which retain information on temporal structure.

Our evaluations involving similarity rating prediction and song year prediction test the hypothesis that our complexity descriptors capture temporal information in audio features and that such information is relevant for determining musical similarity. For similarity rating prediction, our ground truth is given by human similarity judgements and we assume that an objective musical similarity correlates with subjects’ degree of perceived musical similarity, based on a five-point rating scale. For song year prediction, our ground truth is readily given by chart entry times of songs and we assume that musical similarity correlates with chart entry time proximity. Although song year prediction has received little attention in the literature, the song year is important in determining musical preference [barrett2010music]. Thus, song year prediction might be applied in music recommendation [bertin2011million]. Song year prediction might furthermore be incorporated in genre classification tasks, since musical genres are associated with particular years.

Section II provides an overview of methods and descriptors for computing low-specificity similarity. In Section III, we describe our approach. In Section IV, we detail our experimental method and results, with separate accounts for similarity rating prediction and song year prediction. Finally, we provide conclusions in the closing section.

## II Background

For a detailed review of recent literature on methods for determining musical similarity, from the perspective of classification, we refer to the work of Fu et al. [fu2011survey]. To determine musical similarity, one possible approach involves computing pairwise distances between tracks. The obtained distances may then be used for classification. A second approach consists in applying track-wise descriptors directly for classification.

Based on the second approach, Tzanetakis and Cook [tzanetakis2002musical] compute first and second-order moments on spectral features including MFCCs, to perform genre classification using the $k$-nearest neighbours (KNN) algorithm and Gaussian mixture models (GMMs) estimated on each target class. Li and Ogihara [li2006toward] propose to classify Daubechies wavelet histograms using GMMs and KNN for genre and mood classification. Using spectral features, West et al. [west2006incorporating] propose methods for learning similarity functions based on constructing decision trees for genre classification. Slaney et al. [slaney2008learning] propose feature transformations based on supervised learning and using onset and loudness features, for the purpose of album and artist classification.

Based on the approach of determining distances between descriptors, Logan and Salomon [logan2001music] propose to estimate GMMs on individual tracks. Pairwise track distances are then computed using a combination of Kullback-Leibler divergence (KLD) and earth mover’s distance, where the KLD is used to compare pairs of track centroids. The approach based on KLD assumes that each centroid follows a Gaussian distribution; thus the KLD may be computed in closed form as

$$D_{\mathrm{KL}}(p \,\|\, q) = \frac{1}{2}\left[\log\frac{|\Sigma_q|}{|\Sigma_p|} + \operatorname{tr}\!\left(\Sigma_q^{-1}\Sigma_p\right) + (\mu_p - \mu_q)^{\top}\Sigma_q^{-1}(\mu_p - \mu_q) - d\right] \tag{1}$$

where $\mu_p$, $\mu_q$ and $\Sigma_p$, $\Sigma_q$ respectively denote the means and covariances of two multivariate Gaussian distributions $p$, $q$ with dimensionality $d$. Aucouturier and Pachet [aucouturier2002music] in contrast compute cross-likelihoods between GMMs using Monte Carlo approximations for the purpose of genre classification, whereas Berenzweig et al. [berenzweig2004large] consider the asymptotic likelihood approximation of the KLD and centroid distances for the task of similarity rating prediction. Mandel and Ellis [mandel2005song] instead represent tracks as single Gaussians and use (1) as a distance measure between track pairs. The obtained distances are then applied to artist identification, using support vector machines (SVMs) for classification. An alternative approach to computing the KLD is based on computing histograms of quantised features, as proposed by Vignoli and Pauws [vignoli2005music] for playlist recommendation; Levy and Sandler [levy2006lightweight] compare approaches in the context of genre classification.
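As an illustration of the closed-form divergence between two Gaussians, the following minimal Python sketch uses NumPy. The function name `gaussian_kld` and its argument names are hypothetical choices made for this example and are not taken from any of the cited systems; the computation follows the standard closed form for the divergence between two multivariate Gaussians, measured in nats.

```python
import numpy as np

def gaussian_kld(mu_p, cov_p, mu_q, cov_q):
    """Closed-form KL(p || q) between two multivariate Gaussians, in nats.

    Illustrative helper (hypothetical name), not code from the cited work.
    """
    mu_p, mu_q = np.asarray(mu_p, float), np.asarray(mu_q, float)
    cov_p, cov_q = np.asarray(cov_p, float), np.asarray(cov_q, float)
    d = mu_p.shape[0]
    diff = mu_p - mu_q
    # Solve linear systems instead of forming an explicit inverse of cov_q.
    trace_term = np.trace(np.linalg.solve(cov_q, cov_p))
    maha_term = diff @ np.linalg.solve(cov_q, diff)
    # slogdet avoids overflow/underflow in the determinants.
    _, logdet_p = np.linalg.slogdet(cov_p)
    _, logdet_q = np.linalg.slogdet(cov_q)
    return 0.5 * (logdet_q - logdet_p + trace_term + maha_term - d)

def symmetric_kld(mu_p, cov_p, mu_q, cov_q):
    """Symmetrised divergence, one common way to obtain a pairwise distance."""
    return (gaussian_kld(mu_p, cov_p, mu_q, cov_q)
            + gaussian_kld(mu_q, cov_q, mu_p, cov_p))
```

Since the divergence is not symmetric in its arguments, a symmetrised variant such as `symmetric_kld` is one possible choice when a pairwise track distance is required; whether a given cited system symmetrises in this way is not stated here.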

The previously described techniques are commonly referred to as bag-of-features approaches, since they discard information on temporal structure. Yet, the relative convenience of bag-of-features approaches stands in contrast to the importance of temporal structure in perception of musical timbre, as observed by McAdams et al. [mcadams1995perceptual]. Aucouturier and Pachet [aucouturier2007bag] argue that the bag-of-features approach is insufficient to model polyphonic music for determining similarity. Sequential representations based on mid-level features are widely applied for the purpose of version identification [serra2011identification]. For low-specificity classification, one possible approach to mitigating the shortcoming of the bag-of-features approach involves the intermediate step of aggregating features locally, before summarising anew using obtained summary statistics. Tzanetakis and Cook [tzanetakis2002musical] propose to estimate the local mean and variance of features contained in a 1s window. For the task of predicting musical similarity, Seyerlehner et al. [seyerlehner2010fusing] apply a single, global summarisation step to overlapping windows, computing variance and percentiles. For the purpose of local aggregation, alternative pooling functions are considered by Mörchen et al. [morchen2006modeling], Hamel et al. [hamel2011temporal], and Wülfing and Riedmiller [wulfing2012unsupervised].

An alternative approach is based on retaining the temporal order of features at each window position. Spectral analysis may be applied to the original features, resulting in a new feature sequence. Pampalk [pampalk2006computational] proposes fluctuation patterns describing loudness modulation across frequency bands, whereas Lee et al. [lee2009automatic] propose statistics based on modulation spectral analysis. Mörchen et al. [morchen2006modeling] consider a variety of statistics based on spectral analysis and autocorrelation. Meng et al. [meng2007temporal] and Coviello et al. [coviello2012multivariate] apply multivariate autoregressive modelling to windowed features, for the tasks of genre and tag classification.

To account for temporal structure, statistical modelling may be applied to quantised features. For genre classification, Li and Sleep [li2005genre] propose an SVM kernel in which pairwise distances are obtained by comparing dictionaries generated using the Lempel-Ziv compression algorithm [ziv1978compression]. Reed and Lee [reed2009importance] apply latent semantic analysis to unigram and bigram counts for classification using SVMs, whereas Langlois and Marques [langlois2009music] propose to estimate language models for computing sequence cross-likelihoods for genre and artist classification. Ren and Jang [ren2012discovering] propose an algorithm for computing histograms of feature codeword sequences for genre classification.

Recent approaches attempt to model temporal structure using representations constructed at multiple time scales. Based on a bag-of-features approach, Foucard et al. [foucard2011multi] propose an ensemble of classifiers, where each classifier is trained on features at a given time scale. Features at successive resolutions are aggregated using averaging. Applied to tag and instrument classification, results indicate that a multiscale approach benefits performance. Dieleman and Schrauwen [dielemanmultiscale] apply feature learning based on spherical $k$-means clustering to tag classification. Evaluated aggregation techniques are based on varying the spectrogram window size, in addition to Gaussian and Laplacian pyramid smoothing techniques. Although not applied to classification, Mauch and Levy [mauch2011structural] propose a similar smoothing approach for characterising structural change at multiple time scales. Finally, convolutional neural networks have been proposed for modelling temporal structure: Dieleman et al. [dieleman2011audio] propose deep learning architectures for genre, artist and key classification tasks. Hamel et al. [hamel2011temporal] propose a deep learning architecture incorporating multiple feature aggregation functions for tag classification.

The approach proposed in this work resembles methods applying statistical models to quantised feature sequences [li2005genre, reed2009importance, langlois2009music, ren2012discovering]. In contrast, we propose to compute summary statistics directly from estimated sequential models. Since the obtained statistics may be compared using a metric, our approach has the potential to be combined with indexing and hashing schemes for computationally efficient retrieval [slaney2008locality, rhodes2010investigating, schluter2013], while retaining information on temporal structure. Our method of computing multiple representations using downsampling resembles the approach proposed by Dieleman and Schrauwen [dielemanmultiscale].

Note that our approach differs from Cilibrasi et al. [cilibrasi2004algorithmic], who propose pairwise sequence compressibility to quantify similarity. We did not pursue this approach for low-specificity tasks, based on the results for the pairwise prediction approach reported in our similarity rating prediction evaluation. We may furthermore take compression rates as estimates of sequential Shannon entropy rates, inviting comparison or combination with related measures of sequential complexity [dubnov2008unified, abdallah2009information, james2011anatomy]. Such measures have to date not been evaluated quantitatively in music content analysis; we leave this investigation beyond the scope of this work.

## III Approach

Assume that we are given the audio feature vector sequence $V = (v_1, \ldots, v_T)$. Similar to the descriptor proposed in [streich2005automatic], as a means of quantifying the sequential complexity of $V$, we compute the compression rate $R(V, q)$,

$$R(V, q) = \frac{C(V, q)}{T} \tag{2}$$

where $C(V, q)$ denotes the number of bits required to represent $V$, given a quantisation scheme with $q$ levels and using a specified sequential compression scheme. To obtain a length-invariant measure of sequential complexity, we normalise with respect to the sequence length $T$.

Given the $i$th track in our collection, we compute compression rates for feature sequences extracted from musical audio. We refer to the set of compression rates as feature complexity descriptors (FCDs). For features based on constant frame rate, we compute FCDs using the original feature sequence, in addition to FCDs computed on downsampled versions of the original sequence; we consider downsampling factors 1, 2, 4 and 8. We distinguish among temporal resolutions using the labels FCD1, FCD2, FCD4, FCD8, respectively. For features based on variable frame rate, we compute FCDs with no further downsampling applied.
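The computation of FCDs for a single scalar-valued feature may be sketched as follows. This is a minimal sketch under several assumptions that are not taken from the text: equal-width quantisation, downsampling by averaging non-overlapping blocks, `zlib` (DEFLATE) standing in for the unspecified sequential compression scheme, and the granularities `(2, 4, 8, 16)`. All function names are hypothetical.

```python
import math
import zlib

def quantise(seq, levels):
    """Map a scalar sequence onto `levels` equal-width bins (assumed scheme)."""
    lo, hi = min(seq), max(seq)
    if hi == lo:
        return [0] * len(seq)
    width = (hi - lo) / levels
    # The maximum value would fall into bin `levels`, so clip it.
    return [min(int((x - lo) / width), levels - 1) for x in seq]

def downsample(seq, factor):
    """Average non-overlapping blocks of `factor` frames (assumed scheme)."""
    return [sum(seq[i:i + factor]) / factor
            for i in range(0, len(seq) - factor + 1, factor)]

def compression_rate(seq, levels, factor=1):
    """Bits per frame needed to encode the quantised sequence.

    zlib is a stand-in compressor; one byte per symbol limits levels to 256.
    """
    symbols = quantise(downsample(seq, factor), levels)
    compressed = zlib.compress(bytes(symbols), 9)
    return 8 * len(compressed) / len(symbols)

def feature_complexity_descriptors(seq, factors=(1, 2, 4, 8),
                                   levels=(2, 4, 8, 16)):
    """One descriptor vector per resolution, one component per granularity."""
    return {f"FCD{f}": [compression_rate(seq, q, f) for q in levels]
            for f in factors}

# Usage on a synthetic periodic feature sequence of 1024 frames.
feature = [math.sin(2 * math.pi * t / 16) for t in range(1024)]
descriptors = feature_complexity_descriptors(feature)
for label, vector in descriptors.items():
    print(label, [round(r, 3) for r in vector])
```

For short sequences, the fixed header overhead of a general-purpose compressor inflates the rate, so absolute values from this sketch are only meaningful when comparing sequences of similar length.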

To illustrate, consider FCDs computed on a hypothetical scalar-valued feature sequence exhibiting a high amount of temporal structure, either due to periodicity or locally constant regions (Fig. 1 (a), (b)). For such sequences, we obtain low compression rates, since the quantised feature sequence may be encoded efficiently. Conversely, if we discard temporal structure by randomly shuffling the original feature sequence (Fig. 1 (c)), we obtain high compression rates, since the quantised feature sequence no longer admits an efficient encoding. In contrast to FCDs, feature moments such as mean and variance are invariant to any such re-ordering of features. We observe that feature moments have been widely applied for low-specificity content analysis tasks. Considering that FCDs have similar dimensionality to feature moments, and assuming that the temporal order of features is informative for our considered tasks, we expect that FCDs may be used to improve prediction accuracy over using feature moments alone.
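The contrast between order-sensitive compression rates and order-invariant moments can be checked with a small experiment. The sketch below is illustrative only: the three synthetic sequences are already quantised to four levels, `zlib` is an assumed stand-in for the compression scheme, and the helper name `bits_per_frame` is hypothetical.

```python
import random
import statistics
import zlib

def bits_per_frame(symbols):
    """Compressed size in bits divided by sequence length (zlib stand-in)."""
    return 8 * len(zlib.compress(bytes(symbols), 9)) / len(symbols)

n_frames = 2048
# Periodic structure: a staircase pattern repeating every 8 frames.
periodic = [(t // 2) % 4 for t in range(n_frames)]
# Locally constant structure: long runs of a single level.
piecewise = [(t // 256) % 4 for t in range(n_frames)]
# Same values as `periodic`, with temporal order destroyed.
shuffled = periodic[:]
random.Random(0).shuffle(shuffled)

for name, seq in [("periodic", periodic),
                  ("piecewise", piecewise),
                  ("shuffled", shuffled)]:
    print(f"{name:10s} mean={statistics.mean(seq):.3f} "
          f"var={statistics.pvariance(seq):.3f} "
          f"rate={bits_per_frame(seq):.3f} bits/frame")
```

The periodic and shuffled sequences share identical mean and variance by construction, so any difference in their compression rates is attributable to temporal order alone.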

### III-A Similarity rating prediction

For the task of similarity rating prediction, assume that we have a distance metric which we use to compare descriptor vectors computed on pairs of tracks. We hypothesise that the pairwise distance between descriptors correlates with the similarity rating associated with track pairs. To predict similarity ratings, we take as our feature space pairwise distances between descriptor vectors and apply multinomial regression. We use $\mathbf{x}_i^{(k)}$ to denote the $k$th descriptor vector computed for the $i$th track in our collection, with $1 \le k \le K$ and given a set of $K$ available descriptor vectors. We compute separate descriptor vectors across audio features and across FCD resolutions, with each vector component in $\mathbf{x}_i^{(k)}$ corresponding to a quantisation granularity $q$. We denote with $\mathbf{d}_{ij}$ the distances between $\mathbf{x}_i^{(k)}$ and $\mathbf{x}_j^{(k)}$, obtained across all $K$ descriptor vectors, using our assumed distance measure. Given the pair of tracks $(i, j)$ whose similarity rating we seek to predict, we estimate the probability of similarity score $s$ as

$$P(s \mid i, j) = \frac{\exp\!\left(\alpha_s + \boldsymbol{\beta}_s^{\top}\mathbf{d}_{ij}\right)}{\sum_{s'=1}^{S}\exp\!\left(\alpha_{s'} + \boldsymbol{\beta}_{s'}^{\top}\mathbf{d}_{ij}\right)} \tag{3}$$

where $\alpha_s$, $\boldsymbol{\beta}_s$ are the model parameters associated with outcome $s$, given a total of $S$ similarity scores. We predict similarity ratings by determining the value of $s$ which maximises $P(s \mid i, j)$. We describe our model estimation method in Section IV.
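The prediction step may be sketched as follows, assuming a softmax parameterisation of the multinomial regression model and Euclidean distance as the metric; neither choice is confirmed by the text. The names `alpha`, `beta`, `pairwise_distances`, `rating_probabilities` and `predict_rating` are hypothetical, and the parameter values in the usage example are invented for illustration rather than estimated from data.

```python
import numpy as np

def pairwise_distances(desc_i, desc_j):
    """One distance per descriptor vector (Euclidean metric assumed)."""
    return np.array([np.linalg.norm(np.asarray(a, float) - np.asarray(b, float))
                     for a, b in zip(desc_i, desc_j)])

def rating_probabilities(d, alpha, beta):
    """Softmax over S outcomes; alpha has shape (S,), beta has shape (S, K)."""
    logits = np.asarray(alpha, float) + np.asarray(beta, float) @ np.asarray(d, float)
    logits = logits - logits.max()  # shift for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def predict_rating(d, alpha, beta):
    """Return the rating in 1..S with the highest estimated probability."""
    return int(np.argmax(rating_probabilities(d, alpha, beta))) + 1

# Usage with K = 2 descriptor vectors and S = 5 ratings (invented parameters).
track_i = [[0.2, 0.4, 0.6], [1.0, 1.5]]
track_j = [[0.2, 0.5, 0.9], [1.0, 0.5]]
d_ij = pairwise_distances(track_i, track_j)
alpha = np.zeros(5)
beta = np.array([[2.0, 2.0], [1.0, 1.0], [0.0, 0.0], [-1.0, -1.0], [-2.0, -2.0]])
print(rating_probabilities(d_ij, alpha, beta), predict_rating(d_ij, alpha, beta))
```

In this invented parameterisation, larger distances favour lower ratings; in practice the sign and magnitude of each weight would be determined by model estimation.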

### III-B Song year prediction

For the task of song year prediction, we hypothesise that descriptor values correlate with the chart entry date of tracks. Following [bertin2011million], we apply a linear regression model. Given the $i$th track in our collection, we predict the associated chart entry date $\hat{y}_i$ using a linear combination of components in descriptor vectors $\mathbf{x}_i^{(k)}$,

$$\hat{y}_i = \gamma_0 + \sum_{k=1}^{K} \boldsymbol{\gamma}_k^{\top}\mathbf{x}_i^{(k)} \tag{4}$$

where $\boldsymbol{\gamma}_k$ denotes regression coefficients for the $k$th of $K$ descriptor vectors, as specified for similarity rating prediction, and where $\gamma_0$ denotes the model intercept. We describe our model estimation method for song year prediction in Section IV. We use both multinomial and linear regression as a straightforward means of evaluating the utility of FCDs for determining similarity based on a metric space. We perform our evaluation by considering predictive accuracy, in addition to interpreting estimated coefficients as feature utilities.
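A linear model of this kind may be sketched as below, assuming ordinary least squares as the estimation method; the estimation procedure actually used is described elsewhere and may differ. The descriptor vectors of each track are assumed to be concatenated into a single row, and the function names and synthetic data are hypothetical.

```python
import numpy as np

def fit_year_model(X, years):
    """Ordinary least squares with an intercept; X has shape (N, D)."""
    X = np.asarray(X, float)
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, np.asarray(years, float), rcond=None)
    return coef[0], coef[1:]  # intercept, weights

def predict_year(x, intercept, weights):
    """Predicted chart entry year for one concatenated descriptor vector."""
    return intercept + np.asarray(x, float) @ weights

# Usage on synthetic data: 50 tracks, 2 descriptor components (invented).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
years = 1980.0 + X @ np.array([3.0, -2.0])
intercept, weights = fit_year_model(X, years)
print(intercept, weights, predict_year([1.0, 1.0], intercept, weights))
```

Because the weights multiply individual descriptor components directly, their estimated values can be inspected as a rough indication of which components contribute to the prediction, provided the components are on comparable scales.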

## IV Evaluation

For our evaluations, we use a collection of 15 473 entries from the American Billboard Hot 100 singles popularity chart.^{1}

For each track excerpt in the dataset, we extract a set of 25 audio features, using MIRToolbox [lartillot2007matlab] version 1.3.2 and using the framewise chromagram representation proposed by Ellis and Poliner [ellis2007identifyingcover]. With the exception of rhythmic features, which are computed using predicted onsets, features are based on a constant frame rate of Hz. Table LABEL:tab:featuresummary summarises the set of evaluated audio features.