Report on ACM Recommender Systems Challenge 2018: Automatic Music Playlist Continuation
ACM RecSys Challenge 2018: Automatic Music Playlist Continuation (Final Report)
The ACM RecSys Challenge 2018 on Automatic Music Playlist Continuation
An Analysis of Approaches Taken in the ACM RecSys Challenge 2018 for Automatic Music Playlist Continuation
The ACM Recommender Systems Challenge 2018 focused on the task of automatic music playlist continuation, which is a form of the more general task of sequential recommendation. Given a playlist of arbitrary length with some additional meta-data, the task was to recommend up to 500 tracks that fit the target characteristics of the original playlist. For the RecSys Challenge, Spotify released a dataset of one million user-generated playlists. Participants could compete in two tracks: the main track and the creative track. Participants in the main track were only allowed to use the provided training set; in the creative track, however, the use of external public sources was permitted. In total, 113 teams submitted 1,228 runs to the main track; 33 teams submitted 239 runs to the creative track. The highest performing team in the main track achieved an R-precision of , an NDCG of , and an average number of recommended songs clicks of . In the creative track, the best team obtained an R-precision of , an NDCG of , and an average number of recommended songs clicks of . This article provides an overview of the challenge, including motivation, task definition, dataset description, and evaluation. We further report and analyze the results obtained by the top performing teams in each track and explore the approaches taken by the winners. Finally, we summarize our key findings, discuss the generalizability of approaches and results to domains other than music, and list open avenues and possible future directions in the area of automatic playlist continuation.
According to a study carried out in 2016 by the Music Business Association (https://musicbiz.org/news/playlists-overtake-albums-listenership-says-loop-study) as part of their Music Biz Consumer Insights program (https://musicbiz.org/resources/tools/music-biz-consumer-insights/consumer-insights-portal), playlists accounted for 31% of music listening time among listeners in the United States, which is more than albums (22%), but less than single tracks (46%). A 2017 study conducted by Nielsen (http://nielsen.com/us/en/insights/reports/2017/music-360-2017-highlights.html) found that 58% of users in the United States create their own playlists and 32% share them with others. Other studies, conducted by MIDiA (https://midiaresearch.com/blog/announcing-midias-state-of-the-streaming-nation-2-report), show that 55% of music streaming service subscribers create music playlists using streaming services. Studies like these suggest a growing importance of playlists as a mode of music consumption, which is also reflected in the fact that the music streaming service Spotify currently hosts over 2 billion playlists (https://press.spotify.com/us/about).
In its most generic definition, a playlist is simply a sequence of tracks intended to be listened to together. The task of automatic playlist generation then refers to the automated creation of these sequences of tracks (bonnin2015, ). In this context, the ordering of songs in a playlist is often highlighted as a key characteristic of automatic playlist generation, which makes the task a highly complex endeavor. (In this paper, the terms “song” and “track” are used interchangeably.) Some authors have therefore proposed approaches based on Markov chains to model the transitions between songs in playlists, e.g. (chen_etal:kdd:2012, ; mcfee_lanckriet:ismir:2011, ). While these approaches have been shown to outperform approaches agnostic of the song order in terms of log likelihood, recent research has found little evidence that the exact order of songs actually matters to users (Tintarev:2017:SDS:3079628.3079633, ), while the ensemble of songs in a playlist (vall_etal:recsys:2017, ) and direct song-to-song transitions (kamehkhosh_etal:milc:2018, ) do seem to matter.
Considered a variation of automatic playlist generation, the task of automatic playlist continuation (APC) consists of adding one or more tracks to a playlist in a way that fits the same target characteristics of the original playlist (Schedl2018, ; bonnin2015, ). This has benefits in both the listening and creation of playlists: users can enjoy listening to continuous sessions beyond the end of a finite-length playlist, while also finding it easier to create longer, more compelling playlists without a need to have extensive musical familiarity.
Schedl et al. (Schedl2018, ) have recently identified the task of automatic music playlist continuation as one of the grand challenges in music recommender systems research. A large part of the APC task is to accurately infer the intended purpose of a given playlist. This is challenging not only because of the broad range of these intended purposes (when they even exist), but also because of the diversity in the underlying features or characteristics that might be needed to infer those purposes.
An extreme cold start scenario for this task is where a playlist is created with some meta-data (e.g., the title of a playlist), but no song has been added to the playlist. This problem can be cast as an ad-hoc information retrieval task, where the task is to rank songs in response to a user-provided meta-data query.
Given the importance of playlists in improving the user experience within the context of music streaming services, the ACM Recommender Systems Challenge 2018 (Chen:2018, ) focused on an automatic music playlist continuation task (http://2018.recsyschallenge.com). The ACM Recommender Systems Challenge, or RecSys Challenge for short, is an annual competition that has been organized in conjunction with the ACM Conference on Recommender Systems since 2010; for more information, refer to (Said:2016, ) or visit http://recsyschallenge.com/. This paper provides an overview of the challenge, the results achieved by over 100 participating teams as well as the winning and most innovative approaches, and future directions and open avenues in this research area.
1.1. Task: Automatic Playlist Continuation
As mentioned earlier, automatic playlist continuation is a useful feature for music streaming services not only because it can extend listening session length, but also because it can increase engagement of users on their platform by making it easier for users to create playlists that they can enjoy and share. ACM Recommender Systems Challenge 2018 focused on the task of automatic playlist continuation (APC). This task consists of adding one or more tracks to a music playlist in a way that fits the target characteristics of the original playlist (Schedl2018, ; bonnin2015, ). To formally define the task, let $T$ be the universe of tracks in the underlying music catalog. Given a playlist $P$ created by a user $u$ that contains $k$ music tracks $t_1, t_2, \dots, t_k$, the task is to rank the music tracks from $T$ to be recommended to the user for completing the playlist. In addition, each playlist includes some meta-data information, such as its title. It should be noted that $k$ can be equal to zero for some playlists, meaning that the user has created the playlist but no music track has yet been added to it.
1.2. Competition: Main and Creative Tracks
ACM Recommender Systems Challenge 2018 invited participants to submit their solutions for the APC task in two distinct tracks: main track and creative track. Participants in the main track were only allowed to use the dataset provided by the challenge for training their models. In contrast, participants in the creative track were required to use external resources, such as public datasets, for solving the same task. The submitted solutions for both tracks were evaluated using the same dataset, which will be explained in the following subsection.
1.3. Data: Million Playlist Dataset
For algorithm development and testing, we released a dataset of one million user-created playlists from the Spotify platform, dubbed the Million Playlist Dataset (MPD). These playlists were created between January 2010 and November 2017. Statistics of the MPD are reported in Table 1. The dataset includes, for each playlist, its title as well as the list of tracks (including album and artist names), and some additional meta-data such as Spotify URIs and the playlist’s number of followers. The playlist titles in the dataset were unmodified; for reporting in Table 1, however, playlist titles were lightly normalized by converting to lowercase and removing spaces and common non-alphanumeric symbols. A truncated sample playlist is shown in Appendix B.
Table 1. Statistics of the Million Playlist Dataset (MPD).
|Statistic|Value|
|Number of playlists|1,000,000|
|Number of tracks|66,346,428|
|Number of unique tracks|2,262,292|
|Number of unique albums|734,684|
|Number of unique artists|295,860|
|Number of unique playlist titles|92,944|
|Number of unique normalized playlist titles|17,381|
|Average playlist length (tracks)|66.35|
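The light title normalization used for the Table 1 statistics can be illustrated with a minimal sketch. The function name `normalize_title` and the sample titles are hypothetical, and treating "common non-alphanumeric symbols" as "every character that is not a letter or digit" is an assumption, not the exact rule used for the MPD.

```python
def normalize_title(title: str) -> str:
    """Lowercase a playlist title and drop spaces and non-alphanumeric symbols."""
    lowered = title.lower()
    # Assumption: every character that is not a letter or digit is removed,
    # which also removes spaces.
    return "".join(ch for ch in lowered if ch.isalnum())


# Hypothetical titles, not taken from the MPD.
titles = ["Road Trip!", "road-trip", "ROADTRIP", "Chill Vibes"]
unique_normalized = {normalize_title(t) for t in titles}
print(len(set(titles)), len(unique_normalized))
```

Under this reading, several raw titles collapse onto one normalized title, which is consistent with the gap between the two title counts in Table 1.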
A separate challenge dataset was used to evaluate the quality of the submitted algorithms. It consisted of a set of playlists from which a number of tracks had been withheld. The challenge set was composed of 10,000 incomplete playlists and covered a total of 10 scenarios (1,000 playlists for each): (1) title only, no track, (2) title and the first 5 tracks, (3) the first 5 tracks, (4) title and the first 10 tracks, (5) the first 10 tracks, (6) title and the first 25 tracks, (7) title and 25 random tracks, (8) title and the first 100 tracks, (9) title and 100 random tracks, and (10) title and the first track.
The task was then to predict the missing tracks in those playlists, and participating teams were required to submit their predictions for those missing tracks (as a list of 500 ordered predictions). The withheld tracks were used by the organizers as ground truth, i.e. to compute the performance measures for each submission.
Note that the data provided by the challenge does not contain acoustic information or features. However, participants in the creative track were able to use the Spotify API (or other sources) to retrieve such information.
In order to foster reproducibility and further research in music recommendation, the dataset will be made available for researchers on the Spotify Research website (https://research.spotify.com/datasets).
To assess the quality of submissions, we computed three metrics and averaged them across all playlists in the challenge dataset: R-precision, normalized discounted cumulative gain (NDCG), and recommended songs clicks. The formal definition of these metrics is presented in Appendix A.
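To make the three metrics concrete, the following is a minimal sketch under simplifying assumptions; Appendix A remains the authoritative definition. It assumes track-level binary relevance, a `log2(rank + 1)` discount for NDCG, and that recommended songs clicks counts how many 10-track pages must be skipped before the first relevant track appears. The fallback value `max_clicks=51` for lists without any relevant track is also an assumption.

```python
import math


def r_precision(recommended, relevant):
    """Share of the withheld tracks found in the first |relevant| recommendations."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    top = recommended[:len(relevant)]
    return len(set(top) & relevant) / len(relevant)


def ndcg(recommended, relevant):
    """Binary-relevance NDCG; assumes `recommended` contains no duplicates."""
    relevant = set(relevant)
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, track in enumerate(recommended, start=1)
              if track in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, len(relevant) + 1))
    return dcg / ideal if ideal > 0 else 0.0


def recommended_songs_clicks(recommended, relevant, page_size=10, max_clicks=51):
    """Number of page refreshes needed before a relevant track is shown."""
    relevant = set(relevant)
    for index, track in enumerate(recommended):
        if track in relevant:
            return index // page_size
    return max_clicks  # assumed penalty when no relevant track is recommended


recommended = ["a", "x", "b", "y"]
withheld = ["a", "b"]
print(r_precision(recommended, withheld))
print(ndcg(recommended, withheld))
print(recommended_songs_clicks(recommended, withheld))
```

In the challenge, each metric is computed per playlist and then averaged over all playlists in the challenge set.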
The higher the R-precision and NDCG, the better. However, lower recommended songs clicks indicates better performance. To aggregate the individual scores for the three metrics, Borda rank aggregation (borda:1781, ) is used, i.e. scores are converted to ranks, which are then summed up over the three measures to obtain a single performance score.
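The rank aggregation step can be sketched as follows. This is an illustrative reading of Borda aggregation, not the organizers' script: the function name `borda_scores`, the point scheme (best team earns one point per team ranked below it), and the tie handling are assumptions, and the numbers are made up rather than taken from the leaderboard.

```python
def borda_scores(results, higher_is_better):
    """results: {team: {metric: value}} -> {team: total Borda points}."""
    teams = list(results)
    totals = {team: 0 for team in teams}
    for metric, higher in higher_is_better.items():
        # Order teams from best to worst for this metric.
        ordered = sorted(teams, key=lambda t: results[t][metric], reverse=higher)
        for position, team in enumerate(ordered):
            # Assumption: the best team earns len(teams) - 1 points, the worst 0.
            # Ties are broken by input order, which the text does not specify.
            totals[team] += len(teams) - 1 - position
    return totals


# Hypothetical scores for three hypothetical teams.
results = {
    "team_a": {"r_precision": 0.22, "ndcg": 0.39, "clicks": 1.9},
    "team_b": {"r_precision": 0.21, "ndcg": 0.38, "clicks": 1.7},
    "team_c": {"r_precision": 0.20, "ndcg": 0.37, "clicks": 2.1},
}
direction = {"r_precision": True, "ndcg": True, "clicks": False}
totals = borda_scores(results, direction)
print(sorted(totals.items(), key=lambda item: -item[1]))
```

Because only ranks enter the final score, a team can lead on one metric and still lose overall if it trails on the other two.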
The RecSys Challenge was well received: 1,791 people registered; 1,430 with an academic affiliation and 361 from industry. These people formed a total of 410 teams. Out of these, 117 teams were active, i.e., submitted at least one run (113 and 33, respectively, to the main and to the creative track). The number of active teams per country for the top 20 countries (in terms of the number of teams) is plotted in Figure 1. As depicted, the United States has the highest number of active teams followed by Austria and Italy.
In total we received 1,467 submissions, out of which 1,228 were submitted to the main track and 239 to the creative track. The number of submissions made by each active team is plotted in Figure 2.
The final results achieved by the participating teams for both main and creative tracks are available online (main track leaderboard: http://www.recsyschallenge.com/2018/leaderboard-main.html; creative track leaderboard: http://www.recsyschallenge.com/2018/leaderboard-creative.html). Tables 2 and 3 summarize the results achieved by the top 10 teams in the main and creative tracks, respectively. Note that the test set is the same for both tracks; the only difference is that the teams were allowed to use external resources (other than the MPD training set) in the creative track.
As shown in Tables 2 and 3, the team vl6 ranked first in both tracks, followed by teams hello world! and Avito in the main track and Creamy Fireflies and KAENEN in the creative track. The first ranked team achieved the best results in terms of all evaluation metrics, except for the recommended songs clicks metric in the main track, where it was beaten by team Avito.
Figures 3 and 4 show the highest performance achieved in the leaderboard over time for the main and the creative tracks, respectively. (The starting date for the plots corresponding to recommended songs clicks differs from the starting dates in the other plots. This is due to an error in our evaluation script, which was fixed on 2018-06-01.) As expected, there is an increasing trend in terms of R-precision and NDCG and a decreasing trend in terms of recommended songs clicks over time. We also plot the performance of the first ranked team (team vl6) per submission over time in Figure 5.
To gain a deeper understanding of the performance of the models, we report the results for the 10 different types of playlists separately (see Tables 4 and 5 for the main and creative tracks, respectively). As mentioned earlier in Section 1.3, the challenge set includes 10,000 playlists; 1,000 playlists from each of the following playlist types: (1) title only, no track, (2) title and the first 5 tracks, (3) the first 5 tracks, (4) title and the first 10 tracks, (5) the first 10 tracks, (6) title and the first 25 tracks, (7) title and 25 random tracks, (8) title and the first 100 tracks, (9) title and 100 random tracks, and (10) title and the first track.
As expected, performance generally increases with the number of tracks given as input. There are some exceptions, especially when 100 tracks are given. One possible reason is the way the teams handle the relation between the playlists. Many learning models are known to struggle with long sequences, which may also affect the APC task.
Surprisingly, the models perform worse when the title is also given as meta-data for the playlist. For instance, the only difference between Type 2 and Type 3 is that the former contains the playlist title. We believe that this counterintuitive behavior is observed because titles are highly sparse and models overfit on the titles appearing in the training set. In summary, the models fail to model the titles effectively.
Interestingly, APC given random tracks produces much better results than APC given the first tracks in the playlist (see the results for Type 6 vs. Type 7 and Type 8 vs. Type 9). A likely explanation is that adjacent tracks in a playlist tend to share similar information, such as genre, artist, album, etc. Random tracks therefore provide more useful information about the focus of the playlist, and thus more accurate APC performance is achieved.
When the number of given tracks is at least 5, the recommended songs clicks for all the models is less than 1. This means that most users can find a relevant track in the top 10 recommended list and do not need to reload the recommended track list.
By increasing the number of given tracks, the standard deviation of the performances obtained by the top 10 teams generally increases. In other words, most approaches perform closely when a few tracks are given. However, when several tracks are given for each playlist (e.g., more than or equal to 25 tracks), a substantial difference between the performance of different approaches is observed.
Even one track matters: comparing the results of the playlists from Type 1 and Type 10, we observe a significant increase in performance from adding only the first track of the playlist. This might also be due to the fact that the proposed solutions could not handle the title effectively.
In general, the team hello world! performed well when the first tracks of the playlists are given. However, the teams vl6 and MIPT_MSU achieved the best results when the tracks are given in a random order. The team Avito also achieved the highest performance multiple times for some of the playlists that contain a few tracks.
The performance of the models in the main track is slightly higher than that in the creative track. The reason might be that adding external resources increases the complexity of the models and that, given the amount of training data, the models could not take advantage of external resources effectively.
The approaches used by the top performing teams are briefly described in the next two sections.
Tables 4 and 5. Column layout of the per-type results for the main and creative tracks, respectively.
|Team|Type 1|Type 2|Type 3|Type 4|Type 5|Type 6|Type 7|Type 8|Type 9|Type 10|
| |title only|title + first 5 tracks|only first 5 tracks|title + first 10 tracks|only first 10 tracks|title + first 25 tracks|title + 25 random tracks|title + first 100 tracks|title + 100 random tracks|title + first track|
4. Top-performing Approaches: Main Track
In this section, we provide a brief analysis of the approaches taken by the top 10 teams in the main track. We further explain the approaches used by the top 3 teams in more detail.
High-level characteristics of the winning approaches are presented in Table 6. As shown in the table, several teams took advantage of a two-stage architecture for the playlist continuation task. In such an architecture, the first stage model retrieves a small set of tracks (compared to the total number of tracks in the dataset), while the second stage focuses on re-scoring or re-ranking the output of the first stage model with the goal of improving accuracy. Therefore, a high-recall model is desired for the first stage, whereas a high-precision model is preferred for the second stage. This design is motivated mainly by efficiency; however, the two-stage architecture can also improve the APC performance. Among the top 10 teams in the main track, vl6 (Volkovs:vl6, ), Avito (Rubtsov:avito, ), HAIR (Zhu:hair, ), BachPropagate (Kallumadi:bach-propagate, ), and IN3PD (Faggioli:in3pd, ) took advantage of a multi-stage architecture. Multi-stage models have been extensively explored for improving efficiency and effectiveness in various retrieval and recommendation settings (Chen:2017, ; Dang:2013, ; Lampropoulos:2012, ; Li:2011, ; Wang:2011, ).
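The two-stage idea can be sketched in a few lines. This is not any team's pipeline: the co-occurrence retrieval, the two features, and the fixed linear weights are hypothetical stand-ins (the top teams used models such as matrix factorization for retrieval and learned rankers for re-ranking), and the playlists are toy data.

```python
from collections import Counter


def first_stage(seed_tracks, playlists, num_candidates):
    """High-recall retrieval: tracks that co-occur with any seed track."""
    seeds = set(seed_tracks)
    counts = Counter()
    for playlist in playlists:
        overlap = len(seeds & set(playlist))
        if overlap:
            for track in playlist:
                if track not in seeds:
                    counts[track] += overlap
    return counts.most_common(num_candidates)


def second_stage(candidates, popularity, weights=(1.0, -0.1)):
    """Re-rank the small candidate set with a stand-in linear scorer."""
    w_cooc, w_pop = weights  # hypothetical weights; a learned ranker would go here
    scored = [(w_cooc * cooc + w_pop * popularity.get(track, 0), track)
              for track, cooc in candidates]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [track for _, track in scored]


playlists = [["a", "b", "c"], ["a", "c", "d"], ["b", "e"], ["x", "y"]]
popularity = Counter(track for playlist in playlists for track in playlist)
candidates = first_stage(["a"], playlists, num_candidates=3)
reranked = second_stage(candidates, popularity)
print(candidates, reranked)
```

The first stage only needs to avoid missing good tracks, so it can be cheap; the second stage then spends more computation on far fewer candidates.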
In addition, matrix factorization, as a dominant approach in collaborative filtering (CF), was also employed by several top performing teams, including vl6 (Volkovs:vl6, ), Avito (Rubtsov:avito, ), KAENEN (Ludewig:kaenen, ), and IN3PD (Faggioli:in3pd, ). These models mostly create an incomplete playlist-track matrix and use matrix factorization to learn a low-dimensional dense representation for each playlist and track. They learn similar representations for the tracks that often occur together in user-created playlists. Therefore, the tracks from a single artist (band), an album, or a music genre may be assigned close representations. The matrix factorization algorithms used by the top teams include weighted regularized matrix factorization (WRMF) (Hu:2008, ), LightFM with a weighted approximate-rank pairwise (WARP) loss (Kula:2015, ), and Bayesian personalized ranking (BPR) (Rendle:2009, ). Interestingly, some teams, including HAIR (Zhu:hair, ) and Definitive Turtles (Kelen:definitive-turtles, ), were able to achieve promising results using simple neighborhood-based collaborative filtering methods.
Moreover, due to the high capacity of neural networks to learn task-specific representations, a number of top performing teams used neural network models to produce accurate predictions for the APC task. These neural approaches include: (1) simple feed-forward networks for predicting tracks given each playlist (e.g., a word2vec-style model (Mikolov:2013, )) or for neural collaborative filtering (He:2017, ), (2) convolutional models for playlist embedding or extracting useful information from playlist titles, (3) recurrent neural networks and in particular long short-term memory networks for modeling the sequence of tracks in the playlists, and (4) autoencoders for learning playlist representations.
Most top-performing teams that used a two-stage architecture built their second stage based on (mostly pairwise) learning to rank models. These models were designed to re-rank a small number of tracks given a set of features produced by different models, including the first-stage model, as well as several heuristic hand-crafted features. The tree-based models, such as XGBoost (Chen:2016, ), GBDT (Friedman:2001, ), and LambdaMART (Burges:2010, ), were the popular learning to rank algorithms among the top teams in the challenge.
It is notable that some top performing teams used information retrieval techniques mainly developed for the ad-hoc retrieval task. For instance, inverse document frequency (IDF) weighting (Jones:1972, ), TF-IDF weighting (Salton:1988, ), BM25 weighting (Robertson:1994, ), and relevance model (Lavrenko:2001, ) (a pseudo-relevance feedback model) were respectively employed by teams Definitive Turtles (Kelen:definitive-turtles, ), KAENEN (Ludewig:kaenen, ), Creamy Fireflies (Antenucci:creamy-fireflies, ), and BachPropagate (Kallumadi:bach-propagate, ).
An important challenge in the APC task is dealing with cold-start playlists, i.e., playlists with only a title (no track). Some teams handled such special cases separately by trying to learn a relationship between a playlist’s title and its tracks. Notable among these approaches are neural networks and matrix factorization models that predict the tracks in a playlist given its title.
In the following, we detail the approaches taken by the top three teams in the main track:
vl6 team: The vl6 team used a two-stage architecture, where the first stage is based on Weighted Regularized Matrix Factorization (WRMF) (Hu:2008, ), and the second stage is implemented using XGBoost (Chen:2016, ), a gradient boosting learning to rank model. In addition to the output of the WRMF model, a few other models were used to produce features for the XGBoost model. These include a convolutional neural network for playlist embedding, user-user and item-item neighborhood-based collaborative filtering models, and a set of hand-crafted features. Note that the cold-start instances (those that consist only of a title with no track) were handled separately. For such cases, the vl6 team used a matrix factorization on top of the playlist titles. For a detailed description of the approach used by the vl6 team, refer to (Volkovs:vl6, ).
hello world! team: The team hello world! linearly combined the results produced by two different models: an autoencoder model and a convolutional neural network. The autoencoder model tries to reconstruct track lists and artist lists for each playlist. To model both marginal and joint information across playlist and contents, the model was trained using a “hide-and-seek” idea. In other words, either the track list or the artist list was randomly deactivated in the input of the autoencoder. To use the title of playlist, especially for the cold-start situations, a character-level convolutional neural network (charCNN) was used to learn a representation from the playlist’s title. This can be viewed as a classification model: predicting the tracks in each playlist given its title. In the linear combination, the output of the charCNN was weighted higher for shorter playlists. For a detailed description of the approach used by the team hello world!, we refer the reader to (Yang:hello-world, ).
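Two ideas from the hello world! description, the “hide-and-seek” input masking and the length-dependent linear combination, can be sketched as follows. The function names, the masking probability of 0.5, and the weighting formula `1 / (1 + num_seed_tracks)` are assumptions made for illustration; the team's actual settings are described in their paper.

```python
import random


def hide_and_seek(track_input, artist_input, rng, p_hide_tracks=0.5):
    """Deactivate (zero out) either the track part or the artist part of the input."""
    if rng.random() < p_hide_tracks:
        return [0] * len(track_input), list(artist_input)
    return list(track_input), [0] * len(artist_input)


def blend(autoencoder_score, charcnn_score, num_seed_tracks):
    """Linear combination in which the title model counts more for short playlists."""
    # Hypothetical weighting: 1.0 for a title-only playlist, decaying with length.
    title_weight = 1.0 / (1.0 + num_seed_tracks)
    return (1.0 - title_weight) * autoencoder_score + title_weight * charcnn_score


rng = random.Random(0)
tracks, artists = hide_and_seek([1, 0, 1, 1], [0, 1], rng)
print(tracks, artists)
print(blend(0.2, 0.8, 0), blend(0.2, 0.8, 100))
```

Masking one input group at a time forces the autoencoder to reconstruct a playlist from either tracks or artists alone, which matches the stated goal of modeling both marginal and joint information.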
Avito team: Similar to the first team, the team Avito also used a two-stage architecture. The first stage is based on a matrix factorization model with the weighted approximate-rank pairwise (WARP) loss, implemented in LightFM (Kula:2015, ). Two separate models were trained, one based on playlist-track information and the other one based on the playlist titles. The union of the outputs of these two models was re-ranked by the second stage model, which is an XGBoost learning to rank model (Chen:2016, ). In addition to the LightFM features, some additional feature engineering was done to boost the performance. For a detailed description of the approach used by the Avito team, refer to (Rubtsov:avito, ).
5. Top-performing Approaches: Creative Track
In this section, we provide a brief analysis of the approaches taken by the top 10 teams in the creative track, in which teams were allowed to use external resources. (When teams started to submit the same approaches to the creative and main tracks, due to the lower popularity of the creative one, we required submissions to the creative track to exploit external data.) We further explain the approaches followed by the top 3 teams in more detail.
A first observation when reviewing the algorithms of the top performers in the creative track reveals that most of the teams only slightly altered their algorithms for the main track, e.g., by adding to their pipeline a final audio content-based re-ranking approach (Ludewig:kaenen, ) or by extending their content-based filtering approaches by enriching the provided meta-data with audio information (Antenucci:creamy-fireflies, ). Most of what was said above for the main track therefore also holds for the approaches taken in the creative track, in particular the superior performance of two-stage architectures, use of neural networks, and special handling of cold-start situations.
Interestingly, except for one team (spotif.ai), all top 10 teams participating in the creative track also participated in the main track (see Table 3). However, their ranks most often differed between the main and creative tracks: vl6 (ranked 1st in main track), Creamy Fireflies (4th in main), KAENEN (7th in main), cocoplaya (11th in main), BachPropagate (7th in main), Trailmix (13th in main), teamrozik (63rd in main), Freshwater Sea (19th in main), Team Radboud (21st in main), and Avito (3rd in main). The spotif.ai team, which solely participated in the creative track, employed a recurrent neural network architecture (long short-term memory (doi:10.1162/neco.1997.9.8.1735, )) that was particularly designed to cope with sequential data, in addition to a weighted regularized matrix factorization (WRMF) approach (Hu:2008, ).
Remarkably, almost all teams participating in the creative track used the Spotify API (https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features) as external data source and downloaded the provided audio content features. A notable exception was team cocoplaya (Ferraro:cocoplaya, ), who retrieved 30-second snippets of each track from Spotify and computed their own audio-based features, in particular the output of a probabilistic genre classifier for each of 13 genres (bogdanov_etal:ismir:2016, ). Others included external information when filtering playlist titles using stopword lists or pre-defined lists of music-related terms (e.g., playlist, songs, music) (Zhao:trailmix, ). Still others used pre-trained word embedding models, such as the CBOW model from word2vec (Mikolov:2013, ), to create track embeddings (Kallumadi:bach-propagate, ).
In the following, we detail the approaches taken by the top three teams in the creative track:
vl6 team: The vl6 team also ranked first in the creative track. Their approach taken here largely resembles the one taken in the main track (see Section 4). The only difference is that the feature set used in the second stage of their approach (feature selection using an XGBoost model) was extended by content-based music descriptors of tracks. These descriptors were acquired through the Spotify Audio API and comprise acousticness, danceability, energy, instrumentalness, key, liveness, loudness, mode, speechiness, tempo, time signature, and valence. However, no substantial and consistent improvement was achieved by adding these features (compare Tables 2 and 3). For a detailed description of the approach used by the vl6 team, refer to (Volkovs:vl6, ).
Creamy Fireflies team: This team used an ensemble of known techniques, selecting and tuning the individual techniques depending on the underlying playlist characteristics (from only title to 100 tracks). Five base approaches were used: (1) popularity-based recommendation, (2) track- and (3) playlist-based collaborative filtering (on the playlist-track matrix), as well as (4) track- and (5) playlist-based content-based filtering, where (4) uses artist and album identifiers as features and (5) uses additional features derived from playlist titles. More precisely, playlist features were created by applying techniques from information retrieval and natural language processing to clean and enrich the playlist titles (e.g., tokenization, normalization, and stemming). In a tuning step, the authors then sought optimal parameters for each combination of algorithm and playlist category (cf. Section 1.3). Their base ensemble approach subsequently weighted the five algorithms for each playlist category and other playlist characteristics (e.g., length and track positions). The final score was computed as the weighted sum of the scores given by each algorithm and playlist category. The authors also investigated another ensemble model, based on a proposed measure of artist heterogeneity. Clustering the playlists according to this measure and performing a cluster-based filtering slightly improved NDCG and R-precision. Finally, several boosts depending on the playlist category were investigated. For instance, assuming that the last tracks in a (long) seed playlist are the most important ones with respect to the continuation, candidate tracks more similar to those last ones in the seed playlist were given higher weight.
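The category-dependent weighted sum can be illustrated with a small sketch. The algorithm names, category names, weights, and scores below are all hypothetical; the team tuned its own weights per playlist category.

```python
def ensemble_scores(algorithm_scores, weights):
    """Weighted sum of per-algorithm track scores for one playlist category."""
    combined = {}
    for name, scores in algorithm_scores.items():
        for track, score in scores.items():
            combined[track] = combined.get(track, 0.0) + weights[name] * score
    return combined


def rank(combined):
    """Tracks ordered by descending combined score (ties broken by track id)."""
    return sorted(combined, key=lambda track: (-combined[track], track))


# Hypothetical per-category weights: rely on popularity when only a title is given.
CATEGORY_WEIGHTS = {
    "title_only": {"popularity": 1.0, "track_cf": 0.0},
    "first_5": {"popularity": 0.25, "track_cf": 0.75},
}
algorithm_scores = {
    "popularity": {"a": 1.0, "b": 0.5},
    "track_cf": {"b": 1.0, "c": 0.5},
}
for category, weights in CATEGORY_WEIGHTS.items():
    print(category, rank(ensemble_scores(algorithm_scores, weights)))
```

The same base scores produce different rankings per category, which is the point of tuning the weights separately for each type of seed playlist.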
In the creative track, team Creamy Fireflies additionally used the Spotify API to acquire the following features for each track: acousticness, danceability, energy, instrumentalness, liveness, loudness, speechiness, tempo, valence, and popularity. They extended their content-based filtering and collaborative filtering models described above to include track-level similarity. To this end, a sparse representation of track clusters was used, in which clusters were generated by grouping tracks into four equally sized clusters based on the values of each audio feature. For a detailed description of the approach used by the Creamy Fireflies team, refer to (Antenucci:creamy-fireflies, ).
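The grouping of tracks into four equally sized clusters per audio feature amounts to quartile binning, sketched below. The function names, the toy feature values, and the use of Jaccard overlap as a track-level similarity are assumptions for illustration only.

```python
def quartile_clusters(feature_values):
    """Assign each track to cluster 0..3 so that clusters are (nearly) equal in size."""
    ordered = sorted(feature_values, key=lambda track: (feature_values[track], track))
    n = len(ordered)
    return {track: (rank * 4) // n for rank, track in enumerate(ordered)}


def sparse_cluster_features(audio_features):
    """{feature: {track: value}} -> {track: set of (feature, cluster) memberships}."""
    representation = {}
    for name, values in audio_features.items():
        for track, cluster in quartile_clusters(values).items():
            representation.setdefault(track, set()).add((name, cluster))
    return representation


def cluster_similarity(a, b):
    """Jaccard overlap of two tracks' cluster memberships (an assumed similarity)."""
    return len(a & b) / len(a | b)


# Toy values, not real Spotify audio features.
audio_features = {
    "energy": {"t1": 0.1, "t2": 0.2, "t3": 0.3, "t4": 0.4,
               "t5": 0.5, "t6": 0.6, "t7": 0.7, "t8": 0.8},
    "valence": {"t1": 0.9, "t2": 0.8, "t3": 0.1, "t4": 0.2,
                "t5": 0.3, "t6": 0.4, "t7": 0.5, "t8": 0.6},
}
features = sparse_cluster_features(audio_features)
print(features["t1"], cluster_similarity(features["t1"], features["t2"]))
```

Binning by rank rather than by value keeps the clusters balanced even when a feature is heavily skewed.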
KAENEN team: The KAENEN team also showed that it is possible to achieve remarkable results without using very complex approaches. They combined nearest-neighbor techniques with common matrix factorization algorithms, which were adapted to the application domain. More precisely, they adapted an item-based CF approach, treating playlists as users and computing cosine similarity between item vectors (binary, over all playlists). To alleviate the popularity bias that affects such co-occurrence-based similarities, inverse document frequency (IDF) weighting is applied to each candidate track, i.e., tracks that appear in many playlists are downweighted. As a second approach, the team proposed a playlist-based nearest neighbor method, which uses the same framework as the item-based CF approach, but this time computing similarities over binary playlist vectors instead of track vectors. Each candidate track $t$ is then ranked with respect to its similarity to the most similar playlists in which $t$ occurs, again considering the IDF weighting. As a third approach, the team adapted a standard matrix factorization technique using alternating least squares (ALS) optimization. To compute the ranking of a candidate track $t$ with respect to a seed playlist $P$, the latent factors of all tracks in $P$ are IDF-weighted, and the dot product of the arithmetic mean of this set of latent factors (comprising all tracks in $P$) and the latent factors of $t$ is used as the final score. To address the cold-start scenario (only playlist title given), the team used a simple string matching technique applied to tokenized and stemmed playlist titles to identify the playlists most similar to $P$. In addition, they used a matrix factorization approach (with ALS optimization) treating unique playlist names as users and occurrences of tracks in the corresponding playlists as “ratings”. The latent factors were then used to identify the playlists most similar to $P$.
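The item-based CF approach with IDF weighting can be sketched as follows. This is a simplified illustration rather than the team's implementation: the IDF formula `log(N / df)`, the way per-seed similarities are summed, and the toy playlists are assumptions.

```python
import math
from collections import defaultdict


def build_index(playlists):
    """Map each track to the set of playlist ids containing it (a binary item vector)."""
    index = defaultdict(set)
    for pid, tracks in enumerate(playlists):
        for track in tracks:
            index[track].add(pid)
    return index


def cosine(a, b):
    """Cosine similarity between two binary vectors represented as sets."""
    return len(a & b) / math.sqrt(len(a) * len(b))


def recommend(seed, playlists, top_n=500):
    """Rank non-seed tracks by IDF-weighted item-item similarity to the seed tracks."""
    index = build_index(playlists)
    num_playlists = len(playlists)
    scores = {}
    for candidate, candidate_pids in index.items():
        if candidate in seed:
            continue
        similarity = sum(cosine(candidate_pids, index[s]) for s in seed if s in index)
        # Assumption: IDF = log(N / document frequency), applied to the candidate,
        # so tracks that occur in many playlists are downweighted.
        idf = math.log(num_playlists / len(candidate_pids))
        scores[candidate] = similarity * idf
    return sorted(scores, key=lambda track: (-scores[track], track))[:top_n]


playlists = [["a", "b", "hit"], ["a", "b", "hit"], ["c", "hit"], ["d", "hit"]]
print(recommend({"a"}, playlists))
```

In this toy data, `hit` co-occurs with the seed track but appears in every playlist, so the IDF weight pushes it below the more specific track `b`.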
The individual approaches described above were subsequently combined into a hybrid recommender system, using switching and weighting hybridization schemes (Burke2002, ). In cold-start cases where the string matching approaches did not produce enough results (i.e., 500 tracks), the missing ones were filled with the most popular tracks of the MPD.
For the creative track, like the other top performers, team KAENEN retrieved audio features using the Spotify API and applied the following re-ranking strategy. If the mean standard deviation of the audio features of the seed playlist’s tracks fell below a threshold (low content diversity), the original score of each candidate track with respect to the seed playlist was re-weighted by the cosine similarity between the candidate’s content features and the mean of the content features of all tracks in the seed playlist. For a detailed description of the approach used by the KAENEN team, refer to (Ludewig:kaenen, ).
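The re-ranking rule lends itself to a short sketch. The version below is an illustration under stated assumptions rather than the team's code: the threshold value, the use of the population standard deviation, the multiplicative re-weighting, and all function and parameter names are hypothetical.

```python
import math
from statistics import mean, pstdev


def cosine(u, v):
    """Cosine similarity of two equal-length feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rerank_by_content(scores, features, seed_tracks, threshold=0.1):
    """Re-weight candidate scores when the seed playlist is acoustically homogeneous.

    scores: dict candidate track -> original score.
    features: dict track -> list of audio feature values (same order for all tracks).
    threshold: hypothetical cut-off on the mean per-feature standard deviation.
    """
    seed_vectors = [features[t] for t in seed_tracks]
    per_feature = list(zip(*seed_vectors))  # one tuple of values per audio feature
    mean_std = mean(pstdev(values) for values in per_feature)
    if mean_std >= threshold:
        return dict(scores)  # diverse seed playlist: keep the original scores
    centroid = [mean(values) for values in per_feature]
    return {t: s * cosine(features[t], centroid) for t, s in scores.items()}


# Hypothetical toy data with two audio features per track.
features = {
    "s1": [0.9, 0.1], "s2": [0.9, 0.1], "s3": [0.1, 0.9],
    "x": [0.9, 0.1], "y": [0.1, 0.9],
}
original = {"x": 1.0, "y": 1.0}
homogeneous = rerank_by_content(original, features, ["s1", "s2"])
diverse = rerank_by_content(original, features, ["s1", "s3"])
```

With the homogeneous seed playlist, candidate `x` keeps its score while `y` is penalized. With the diverse seed playlist, the scores are left untouched.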
6. Other Notable Approaches
In the previous sections, we discussed the approaches of the top teams in each of the challenge tracks. A detailed analysis of all 117 active teams’ approaches is infeasible, due to the sheer number of teams and the fact that only some of them published their approach in detail or sufficiently documented the code they shared (with many teams not sharing their code at all). However, based on a review of some of the teams that did not achieve top scores, we see a similar variety of techniques as in the top performing submissions. Combinations of collaborative filtering, word embedding approaches, deep neural network architectures, information retrieval techniques, and ensembles thereof are used by teams who achieved both higher and lower scores. This raises the question of what makes one approach score better at the task than another. We can expect implementation details such as hyperparameter tuning, dataset preprocessing, and sampling strategies to have a significant impact on the performance of an approach. Different formulations of objective functions, different approaches to extracting features from the dataset, as well as different architectures and sequencing of operations could also affect the overall results. To provide some context towards answering this question, we present two teams which did not achieve scores in the top 10, but which took different approaches to solving the automatic playlist continuation task:
Unconscious Bias team: The Unconscious Bias team placed 43rd in the main track. Their approach is based on applying adversarial autoencoders (makhzani2015adversarial, ) to the playlist continuation task. On the surface, this approach shares similarities with the approach taken by team hello world!, which came in 2nd place in the main track. Team hello world! (Yang:hello-world, ) used a combination of a content-aware autoencoder and a convolutional neural network on playlist titles to arrive at their score. In contrast to hello world!’s various novel dropout strategies for training an autoencoder network, the Unconscious Bias team uses an adversarial approach as a regularization technique, which allows the network to generalize from the training set to unseen examples in a way that also matches the prior distribution. Interestingly, Unconscious Bias evaluated the general autoencoder approach as a baseline in their experiments, and found its performance to be lower than that of their proposed adversarial autoencoder approach. Clearly, there are very significant differences between the two approaches, even though both utilize autoencoders. To delve deeper into these differences and how they might have resulted in such a large difference in scores, we recommend reading both (Vagliano:unconscious-bias, ) and (Yang:hello-world, ).
D2KLab team: Like many other teams, D2KLab took an ensemble approach to the problem, combining several methods to solve the task, including a specialized method to handle the cold-start (title-only) use case. Their core approach involves an ensemble of multiple Recurrent Neural Networks (RNNs), in particular, Long Short-Term Memory (LSTM) cells trained to predict the next track given a sequence of tracks. The inputs to the system are word2vec embeddings at the track, album, and artist level. To deal with playlist titles, and particularly to address the cold-start use case, they also derived title embeddings using the fastText (bojanowski2016enriching, ) algorithm, trained on n-grams of playlist titles included in groups of playlists that are clustered in the playlist embedding space.
For their creative track submission, D2KLab also included lyric metadata by linking the MPD tracks with the WASABI lyric corpus (meseguer-brocal-wasabi, ). They developed a suite of lyric features that describe the different stylistic and linguistic dimensions of a song text, for example, vocabulary and emotion. These features were vectorized and concatenated with the other embedding-based features as inputs to the RNN network.
In the main track, their submission achieved an R-Precision of 0.1808, NDCG of 0.3252, and Clicks score of 3.086, which ranks them in the 37th position. In the creative track, their approach achieved an R-Precision of 0.1852, NDCG of 0.3334, and Clicks score of 3.026, putting them in 13th place. The improved scores in the creative track suggest that their use of lyric features adds valuable information for the playlist continuation task. For a detailed description of the approach used by the D2KLab team, refer to (Monti:d2klab, ).
7. Summary of Key Findings
In this section, we briefly summarize our key findings from the challenge and the submitted solutions. Most approaches ensemble the results obtained by several well-known methods, including matrix factorization models, neighborhood-based collaborative filtering models, basic information retrieval techniques, and learning to rank models. The results show that the models work best when a sufficient number of tracks per playlist is provided and these tracks are randomly selected from the playlist (as opposed to taken in sequential order from the beginning of the playlist). The submitted solutions could not effectively use playlist titles for APC. This might be due to the sparseness of the titles as well as the scale of the training data. In addition, none of the submitted solutions tried to infer user intent from the playlist titles. The results also demonstrate that the performance of different models is close when few tracks per playlist are given. However, when the number of tracks increases, a more diverse set of results is observed.
In the creative track, most teams exclusively used the descriptors from the Spotify API, and only a few of them tried to extract their own features from the audio. Surprisingly, there is no significant gap between the results in the main and creative tracks. Indeed, the results for the creative track are marginally worse than those obtained for the main track. This might be due to the fact that the inclusion of side information makes the problem more complex and that the submitted solutions could not successfully generalize the information obtained from the exploited external resources.
8. Generalizability of Approaches and Results
The RecSys Challenge 2018 focused on the topic of sequence-aware music recommendation and was deliberately and necessarily a narrow and clearly defined task (playlist continuation), as is usual for such a competition. Nevertheless, some of the best-performing approaches are transferable to target domains other than music, though to different extents, depending on the target domain. Most straightforwardly, the approaches submitted to the main track, which do not use any external side information, could easily be adapted to multimedia domains such as (short) video, where users of platforms like YouTube create and share playlists of video clips. Likewise, in the online learning and training domains, curated sequences of exercises or tasks are made available by teachers and students. Both share similar characteristics with music, in the sense that the sequence of items matters and consumption times are comparable in magnitude to those of songs. Both factors, i.e., importance of sequences and similar consumption time (Schedl2018, ), may prevent the immediate applicability of these approaches to other targets such as story lines of images (much shorter consumption time) or book reading lists (much longer consumption times and sequence often not important).
Nonetheless, for such domains which are further away from music, other ways of adopting the proposed approaches might be viable. The models constructed from the provided dataset by some teams, most notably the two top performing ones in the main track (vl6 and hello world!) which are based on deep neural networks, could be used in a transfer learning setting to re-purpose the model for related tasks (Goodfellow2016, ).
Most solutions submitted to the creative track are harder to generalize, in particular if they are closely tied to content-based features. However, the level of generalizability depends on the nature of the leveraged content features, which were used at the song and at the playlist level. Notably, all top 3 teams (and many others) in the creative track used the Spotify API to extract audio descriptors (tempo, loudness, danceability, etc.). As an example, team Creamy Fireflies relied in the creative track on artist and album identifiers, but also on Spotify’s audio content descriptors to implement content-based filtering. While the former (identifiers) are practically available in almost all other domains too, audio content features are limited to a few domains (e.g., podcasts or videos).
As for the achieved results in terms of performance metrics, they strongly depend on the dataset used and vary according to the type of playlist in the challenge set on which they are computed. R-precision, NDCG, and number of clicks are therefore not comparable to results achieved on similar tasks in domains other than music. We are also not aware of existing research works or benchmarking challenges that easily compare to the RecSys Challenge 2018 in terms of the nature of the dataset and the distinction between different types of input playlists used in the evaluation of approaches. A detailed investigation of approaches and achievable results in other target domains using different kinds of playlists and target items therefore remains an avenue for future research.
Another avenue for generalization is given by the fact that the problem for playlist type 1 (title only) resembles a standard search or retrieval task, in which the query is expressed as text, i.e., the name of the playlist to create. Successful approaches, taken in the RecSys Challenge 2018, which particularly address playlists of this type could therefore lead to improved capabilities to search and retrieve music by arbitrary natural language input. This would complement the current research on text-based music retrieval, which most often leverages (user-generated or expert-created) annotations or tags.
9. Future Directions and Open Avenues
Even though the RecSys Challenge 2018 has stimulated a wealth of ideas and creative solutions, we contemplate several directions for additional research that might be worth pursuing.
Integration of additional content and context features:
Given that solutions in the creative track did not outperform those in the main track, the question arises whether the right or good external data sources have been exploited by the algorithms submitted to the creative track. Almost all submissions relied on content features provided by the Spotify API, omitting the time-consuming task of computing other (maybe better) content descriptors from audio (snippets) of the tracks. Also additional contextual information about tracks, albums, or artists, e.g., Wikipedia articles or album reviews, could be integrated in the future.
Explicit inference of intent or purpose:
In cases where a playlist title is given, sophisticated natural language processing (NLP) techniques could be applied to uncover the listener’s intent or the purpose of the playlist. However, identifying such user intents to listen to music, the most important of which are arousal and mood regulation, achieving self-awareness, and expressing social relatedness (schaefer:frpsy:2013, ), is challenging. Therefore, NLP techniques will likely have to be complemented by insights gained from gratification (lonsdale11bjp, ) and other psychological theories.
Modeling and transferring sequence-specific characteristics:
We also see great potential for future approaches that analyze and model certain sequence-specific characteristics of user-generated playlists, formalize them, and integrate them into the sequential recommendation process. Similarly to the artist heterogeneity measure proposed by team Creamy Fireflies (Antenucci:creamy-fireflies, ), aspects of overall playlist coherence (e.g., in terms of genre, style, or acoustic descriptors), coherence of direct song-to-song transitions, or item diversity measures could be computed from user-generated playlists and considered as (weak) constraint in the process of APC, i.e., the seed playlist should be continued in a way that maintains the same level of coherence, diversity, etc.
Evaluation in terms of perceived recommendation quality:
In addition to the mostly accuracy-related performance measures used to gauge performance of submissions, user-centric measures of perceived recommendation quality should be adopted in the future, in order to obtain a truly user-centric perspective of recommendation quality. Such measures of perceived recommendation quality can be assessed through questionnaires in online evaluation settings. Existing questionnaires such as (Ekstrand2014UPD, ; Knijnenburg2012, ) should be extended to the sequence-aware music domain and may eventually include aspects of perceived accuracy, diversity, coherence, satisfaction, novelty, serendipity, and level of personalization.
We would like to thank everyone at Spotify who was involved in the RecSys Challenge, including Ben Carterette, Christophe Charbuillet, Cedric de Boom, Jean Garcia-Gathright, James Kirk, James McInerney, Vidhya Murali, Hugh Rawlinson, Sravana Reddy, Marc Romejin, Romain Yon, and Yu Zhao. Furthermore, we greatly appreciate the help provided by previous organizers of the RecSys Challenge, in particular by Yashar Deldjoo, Mehdi Elahi, and Alan Said.
This work was supported in part by the Center for Intelligent Information Retrieval. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor.
-  S. Antenucci, S. Boglio, E. Chioso, E. Dervishaj, S. Kang, T. Scarlatti, and M. F. Dacrema. Artist-driven layering and user’s behaviour impact on recommendations in a playlist continuation scenario. In Proceedings of the 2018 ACM Recommender Systems Challenge, RecSysChallenge ’18, Vancouver, BC, Canada, 2018.
-  D. Bogdanov, A. Porter, P. Herrera, and X. Serra. Cross-collection Evaluation for Music Classification Tasks. In Proceedings of the 17th International Society for Music Information Retrieval Conference, ISMIR ’16, 2016.
-  P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606, 2016.
-  G. Bonnin and D. Jannach. Automated generation of music playlists: Survey and experiments. ACM Computing Surveys, 47(2):26, 2015.
-  C. J. Burges. From ranknet to lambdarank to lambdamart: An overview. Technical report, June 2010.
-  R. Burke. Hybrid recommender systems: Survey and experiments. User Modeling and User-Adapted Interaction, 12(4):331–370, Nov 2002.
-  C.-W. Chen, P. Lamere, M. Schedl, and H. Zamani. Recsys challenge 2018: Automatic music playlist continuation. In Proceedings of the 12th ACM Conference on Recommender Systems, RecSys ’18, Vancouver, BC, Canada, 2018.
-  R.-C. Chen, L. Gallagher, R. Blanco, and J. S. Culpepper. Efficient cost-aware cascade ranking in multi-stage retrieval. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’17, pages 445–454, New York, NY, USA, 2017. ACM.
-  S. Chen, J. L. Moore, D. Turnbull, and T. Joachims. Playlist prediction via metric embedding. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’12, pages 714–722, New York, NY, USA, 2012. ACM.
-  T. Chen and C. Guestrin. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pages 785–794, New York, NY, USA, 2016. ACM.
-  V. Dang, M. Bendersky, and W. B. Croft. Two-stage learning to rank for information retrieval. In Advances in Information Retrieval, pages 423–434, Berlin, Heidelberg, 2013. Springer.
-  J.-C. de Borda. Mémoire sur les élections au scrutin. Histoire de l’Académie Royale des Sciences, 1781.
-  M. D. Ekstrand, F. M. Harper, M. C. Willemsen, and J. A. Konstan. User perception of differences in recommender algorithms. In Proceedings of the 8th ACM Conference on Recommender Systems, RecSys ’14, pages 161–168, New York, NY, USA, 2014. ACM.
-  G. Faggioli, M. Polato, and F. Aiolli. Efficient similarity based methods for the playlist continuation task. In Proceedings of the 2018 ACM Recommender Systems Challenge, RecSysChallenge ’18, Vancouver, BC, Canada, 2018.
-  A. Ferraro, D. Bogdanov, J. Yoon, K. Kim, and X. Serra. Automatic playlist continuation using a hybrid recommender system combining features from text and audio. In Proceedings of the 2018 ACM Recommender Systems Challenge, RecSysChallenge ’18, Vancouver, BC, Canada, 2018.
-  J. H. Friedman. Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5):1189–1232, 2001.
-  I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
-  X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T.-S. Chua. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, WWW ’17, pages 173–182, Republic and Canton of Geneva, Switzerland, 2017. International World Wide Web Conferences Steering Committee.
-  S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
-  Y. Hu, Y. Koren, and C. Volinsky. Collaborative filtering for implicit feedback datasets. In Proceedings of the Eighth IEEE International Conference on Data Mining, ICDM ’08, pages 263–272, 2008.
-  K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of ir techniques. ACM Trans. Inf. Syst., 20(4):422–446, Oct. 2002.
-  K. S. Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28:11–21, 1972.
-  S. Kallumadi, B. Mitra, and T. Iofciu. A line in the sand: Recommendation or ad-hoc retrieval? In Proceedings of the 2018 ACM Recommender Systems Challenge, RecSysChallenge ’18, Vancouver, BC, Canada, 2018.
-  I. Kamehkhosh, D. Jannach, and G. Bonnin. How Automated Recommendations Affect the Playlist Creation Behavior of Users. In Joint Proceedings of the 23rd ACM Conference on Intelligent User Interfaces Workshops: Intelligent Music Interfaces for Listening and Creation, MILC ’18, Tokyo, Japan, March 2018.
-  M. Kaya and D. Bridge. Automatic playlist continuation using subprofile-aware diversification. In Proceedings of the 2018 ACM Recommender Systems Challenge, RecSysChallenge ’18, Vancouver, BC, Canada, 2018.
-  D. M. Kelen, D. Berecz, F. Béres, and A. A. Benczúr. Efficient k-nn for playlist continuation. In Proceedings of the 2018 ACM Recommender Systems Challenge, RecSysChallenge ’18, Vancouver, BC, Canada, 2018.
-  J. Kim, M. Won, C. C. Liem, and A. Hanjalic. Towards seed-free music playlist generation: Enhancing collaborative filtering with playlist title information. In Proceedings of the 2018 ACM Recommender Systems Challenge, RecSysChallenge ’18, Vancouver, BC, Canada, 2018.
-  B. P. Knijnenburg, M. C. Willemsen, Z. Gantner, H. Soncu, and C. Newell. Explaining the user experience of recommender systems. User Modeling and User-Adapted Interaction, 22(4-5):441–504, 2012.
-  M. Kula. Metadata embeddings for user and item cold-start recommendations. In Proceedings of the 2nd Workshop on New Trends on Content-Based Recommender Systems co-located with 9th ACM Conference on Recommender Systems, pages 14–21, 2015.
-  A. S. Lampropoulos, P. S. Lampropoulou, and G. A. Tsihrintzis. A cascade-hybrid music recommender system for mobile services based on musical genre classification and personality diagnosis. Multimedia Tools and Applications, 59(1):241–258, Jul 2012.
-  V. Lavrenko and W. B. Croft. Relevance based language models. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’01, pages 120–127, New York, NY, USA, 2001. ACM.
-  L. Li, D. Wang, T. Li, D. Knox, and B. Padmanabhan. Scene: A scalable two-stage personalized news recommendation system. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’11, pages 125–134, New York, NY, USA, 2011. ACM.
-  A. J. Lonsdale and A. C. North. Why do we listen to music? a uses and gratifications analysis. British Journal of Psychology, 102:108–134, 2011.
-  M. Ludewig, I. Kamehkhosh, N. Landia, and D. Jannach. Effective nearest-neighbor music recommendations. In Proceedings of the 2018 ACM Recommender Systems Challenge, RecSysChallenge ’18, Vancouver, BC, Canada, 2018.
-  A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
-  B. McFee and G. Lanckriet. The Natural Language of Playlists. In Proceedings of the 12th International Society for Music Information Retrieval Conference, ISMIR ’11, Miami, FL, USA, October 2011.
-  G. Meseguer-Brocal, G. Peeters, G. Pellerin, M. Buffa, E. Cabrio, C. Faron Zucker, A. Giboin, I. Mirbel, R. Hennequin, M. Moussallam, F. Piccoli, and T. Fillon. Wasabi: a two million song database project with audio and cultural metadata plus webaudio enhanced client applications. In Web Audio Conference 2017 - Collaborative Audio, Queen Mary University of London, London, United Kingdom, 2017.
-  T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, NeurIPS ’13, pages 3111–3119. Curran Associates, Inc., 2013.
-  D. Monti, E. Palumbo, G. Rizzo, P. Lisena, R. Troncy, M. Fell, E. Cabrio, and M. Morisio. An ensemble approach of recurrent neural networks using pre-trained embeddings for playlist completion. In Proceedings of the 2018 ACM Recommender Systems Challenge, RecSysChallenge ’18, Vancouver, BC, Canada, 2018.
-  S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme. Bpr: Bayesian personalized ranking from implicit feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI ’09, pages 452–461, Arlington, Virginia, United States, 2009. AUAI Press.
-  S. E. Robertson and S. Walker. Some simple effective approximations to the 2-poisson model for probabilistic weighted retrieval. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’94, pages 232–241, New York, NY, USA, 1994. Springer-Verlag New York, Inc.
-  V. Rubtsov, M. Kamenshikov, I. Valyaev, V. Leksin, and D. I. Ignatov. A hybrid two-stage recommender system for automatic playlist continuation. In Proceedings of the 2018 ACM Recommender Systems Challenge, RecSysChallenge ’18, Vancouver, BC, Canada, 2018.
-  A. Said. A short history of the recsys challenge. 37:102–104, 12 2016.
-  G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513 – 523, 1988.
-  T. Schäfer, P. Sedlmeier, C. Städtler, and D. Huron. The Psychological Functions of Music Listening. Frontiers in psychology, 4, 2013.
-  M. Schedl, H. Zamani, C.-W. Chen, Y. Deldjoo, and M. Elahi. Current challenges and visions in music recommender systems research. International Journal of Multimedia Information Retrieval, 7(2):95–116, Jun 2018.
-  N. Tintarev, C. Lofi, and C. C. Liem. Sequences of diverse song recommendations: An exploratory study in a commercial system. In Proceedings of the 25th Conference on User Modeling, Adaptation and Personalization, UMAP ’17, pages 391–392, New York, NY, USA, 2017. ACM.
-  I. Vagliano, L. Galke, F. Mai, and A. Scherp. Using adversarial autoencoders for automatic playlist continuation. In Proceedings of the 2018 ACM Recommender Systems Challenge, RecSysChallenge ’18, Vancouver, BC, Canada, 2018.
-  A. Vall, M. Quadrana, M. Schedl, G. Widmer, and P. Cremonesi. The Importance of Song Context in Music Playlists. In Proceedings of the 11th ACM Conference on Recommender Systems, RecSys ’17, Como, Italy, 2017.
-  T. van Niedek and A. de Vries. Random walk with restart for automatic playlist continuation and query-specific adaptations. In Proceedings of the 2018 ACM Recommender Systems Challenge, RecSysChallenge ’18, Vancouver, BC, Canada, 2018.
-  M. Volkovs, H. Rai, Z. Cheng, G. Wu, Y. Lu, and S. Sanner. Two-stage model for automatic playlist continuation at scale. In Proceedings of the 2018 ACM Recommender Systems Challenge, RecSysChallenge ’18, Vancouver, BC, Canada, 2018.
-  L. Wang, J. Lin, and D. Metzler. A cascade ranking model for efficient ranked retrieval. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’11, pages 105–114, New York, NY, USA, 2011. ACM.
-  H. Yang, Y. Jeong, M. Choi, and J. Lee. Mmcf: Multimodal collaborative filtering for automatic playlist continuation. In Proceedings of the 2018 ACM Recommender Systems Challenge, RecSysChallenge ’18, Vancouver, BC, Canada, 2018.
-  X. Zhao, Q. Song, J. Caverlee, and X. Hu. Trailmix: An ensemble recommender system for playlist curation and continuation. In Proceedings of the 2018 ACM Recommender Systems Challenge, RecSysChallenge ’18, Vancouver, BC, Canada, 2018.
-  L. Zhu, B. He, M. Ji, C. Ju, and Y. Chen. Automatic music playlist continuation via neighbor-based collaborative filtering and discriminative reweighting/reranking. In Proceedings of the 2018 ACM Recommender Systems Challenge, RecSysChallenge ’18, Vancouver, BC, Canada, 2018.
Appendix A Evaluation Metrics
As mentioned earlier in Section 1.4, the quality of submissions was assessed based on the values of three different evaluation metrics: R-precision, normalized discounted cumulative gain (NDCG), and recommended songs clicks. In this appendix, we provide a detailed description of each of these metrics.
R-precision measures the fraction of recommended relevant items among all known relevant items (i.e., the number of withheld tracks) and is invariant to the order in which tracks are retrieved. R-precision is calculated on both the track and the artist level, with artist matches contributing a partial score (of 0.25) even if the predicted track is incorrect. Let $G_T$ and $G_A$ be the sets of unique track IDs and artist IDs in the ground truth, respectively. Let $S_T$ be the set of track IDs in the top $|G_T|$ tracks recommended in the submitted playlist, and $S_A$ be the set of unique artist IDs in the same set. Then:
$$\text{R-precision} = \frac{|S_T \cap G_T| + 0.25 \cdot |S_A \cap G_A|}{|G_T|}$$
The higher the R-precision, the better.
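As an illustration, the track- and artist-level scoring described above can be transcribed into a few lines of code. This sketch reads the description literally (track matches count 1, artist matches count 0.25, normalized by the number of withheld tracks) and is not the official evaluation script; the function name and the `track_artist` lookup table are assumptions.

```python
def r_precision(recommended, ground_truth, track_artist, artist_weight=0.25):
    """R-precision with partial credit for artist matches.

    recommended: ranked list of recommended track IDs.
    ground_truth: withheld track IDs of the playlist.
    track_artist: dict mapping every track ID to its artist ID (assumed lookup).
    """
    truth_tracks = set(ground_truth)
    if not truth_tracks:
        return 0.0
    truth_artists = {track_artist[t] for t in truth_tracks}
    # Only the top-|G_T| recommendations are considered; their order is irrelevant.
    top = recommended[: len(truth_tracks)]
    top_tracks = set(top)
    top_artists = {track_artist[t] for t in top}
    track_hits = len(top_tracks & truth_tracks)
    artist_hits = len(top_artists & truth_artists)
    return (track_hits + artist_weight * artist_hits) / len(truth_tracks)


# Hypothetical toy data: t3 is a wrong track by a correct artist.
track_artist = {"t1": "A", "t2": "B", "t3": "B", "t4": "C"}
score = r_precision(["t1", "t3", "t2"], ["t1", "t2"], track_artist)
```

Here the top two recommendations contain one correct track and cover both ground-truth artists, giving (1 + 0.25 · 2) / 2 = 0.75. The third recommendation is ignored because only two tracks were withheld.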
NDCG (Jarvelin:2002, ) assesses the ranking quality of the recommended tracks and increases when relevant tracks are placed higher in the recommendation list. This metric was originally proposed to evaluate the effectiveness of information retrieval systems. Nowadays, it is also frequently used for evaluating (music) recommender systems. Assuming that tracks for each playlist are sorted according to their recommendation score in descending order, the discounted cumulative gain (DCG) is then defined as follows:
$$DCG = \sum_{i=1}^{N} \frac{rel_i}{\log_2(i+1)}$$
where $rel_i$ is the label (as found in the ground truth) for the item ranked at position $i$ for the playlist, and $N$ is the length of the recommendation list (here, $N = 500$). DCG is normalized by IDCG, the DCG value for the best possible ranking obtained by ordering the tracks by true ratings in descending order. NDCG is then calculated as:
$$NDCG = \frac{DCG}{IDCG}$$
The higher the NDCG, the better.
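A compact sketch of the NDCG computation may help clarify how ranking position enters the score. Several details here are assumptions rather than facts from the challenge: binary relevance labels (1 for a withheld track, 0 otherwise), a logarithmic discount of log2(i + 1) at rank i, and an ideal ranking that places all withheld tracks first. Other DCG variants discount differently, so absolute values may not match the official scorer.

```python
import math


def dcg(labels):
    """Discounted cumulative gain of a ranked list of relevance labels."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(labels, start=1))


def ndcg(recommended, ground_truth, k=500):
    """NDCG of the top-k recommendations with binary relevance (assumed)."""
    relevant = set(ground_truth)
    labels = [1.0 if track in relevant else 0.0 for track in recommended[:k]]
    # Ideal ranking (assumption): every withheld track appears at the top.
    ideal = [1.0] * min(len(relevant), k)
    ideal_dcg = dcg(ideal)
    return dcg(labels) / ideal_dcg if ideal_dcg > 0 else 0.0


perfect = ndcg(["a", "b", "x"], ["a", "b"])
delayed = ndcg(["a", "x", "b"], ["a", "b"])
```

Both lists contain the same two relevant tracks, but `delayed` scores lower than `perfect` because the second relevant track appears one position later. This order sensitivity is what R-precision lacks.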
Recommended songs clicks (or “clicks” for short) is a user-centric beyond-accuracy measure that relates to a Spotify feature called Recommended Songs. Given a playlist title and/or set of tracks in a playlist, this feature recommends 10 tracks to add to the playlist. The list can be refreshed to produce 10 more tracks. The recommended songs clicks metric is the number of refreshes needed before the first relevant track is encountered. It is formalized as shown in the following equation, where $R$ is the list of recommended tracks and $G$ is the ground truth, i.e., the omitted tracks from the real playlist.
$$\text{clicks} = \left\lfloor \frac{\arg\min_i \{R_i : R_i \in G\} - 1}{10} \right\rfloor$$
If there is no relevant track in $R$, a value of 51 is picked, which is 1 plus the maximum number of clicks possible. The lower the recommended songs clicks, the better.
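The clicks metric is simple enough to state directly in code. The following is a minimal sketch based on the description above, with 10 tracks per page, up to 500 recommendations, and a value of 51 when no relevant track is found; the function and parameter names are assumptions.

```python
def recommended_songs_clicks(recommended, ground_truth, page_size=10, max_tracks=500):
    """Number of refreshes needed before the first relevant track is shown."""
    relevant = set(ground_truth)
    # enumerate() is 0-based, so position // page_size equals floor((rank - 1) / 10).
    for position, track in enumerate(recommended[:max_tracks]):
        if track in relevant:
            return position // page_size
    # No relevant track at all: one more than the maximum possible number of clicks.
    return max_tracks // page_size + 1


fillers = [f"x{i}" for i in range(500)]
first_page = recommended_songs_clicks(["hit"] + fillers, ["hit"])
second_page = recommended_songs_clicks(fillers[:10] + ["hit"], ["hit"])
never = recommended_songs_clicks(fillers, ["hit"])
```

A relevant track anywhere among the first ten recommendations yields 0 clicks, one at rank 11 yields 1 click, and a list with no relevant track yields 51.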
Appendix B Sample Playlist from the Dataset
A sample truncated playlist from the MPD dataset is presented below.