Word2vec applied to Recommendation: Hyperparameters Matter
Skip-gram with negative sampling, a popular variant of Word2vec originally designed and tuned to create word embeddings for Natural Language Processing, has been used to create item embeddings with successful applications in recommendation. While these fields do not share the same type of data, nor evaluate on the same tasks, recommendation applications tend to reuse the already tuned hyperparameter values, even though optimal hyperparameter values are known to be data and task dependent. We thus investigate the marginal importance of each hyperparameter in a recommendation setting through an extensive joint hyperparameter optimization on various datasets. Results reveal that optimizing neglected hyperparameters, namely the negative sampling distribution, the number of epochs, the subsampling parameter and the window-size, significantly improves performance on a recommendation task, and can increase it severalfold.
Word2vec (W2V) methods (Mikolov et al., 2013a; Mikolov et al., 2013b) come from the Natural Language Processing (NLP) community. They were designed to produce low-dimensional distributional word representations. They were successfully applied to recommendation (Grbovic et al., 2015) to generate user and product embeddings as they can scale to millions of items.
Word corpora and sequences of items are two radically different types of data. Text data has a particular linguistic structure (Bybee and Hopper, 2001), constrained by grammatical and conjugation rules. Sequences of items, such as listening sessions or e-commerce purchase histories, have a different structure induced by the user's behaviour and the items' nature (Greer et al., 1973). Moreover, linguistic and recommendation tasks differ. Intuitively, having accurate embeddings for popular items is crucial to perform well on recommendation-related tasks (Steck, 2011), where top items often represent most of the content users interacted with. On the contrary, for linguistic tasks, the frequent words, mostly linking words, are not relevant (Mikolov et al., 2013b), and neither are their embeddings. Since hyperparameter choices are generally known to be data and task dependent (Hutter et al., 2014), we expect optimal hyperparameter configurations to differ between NLP and recommendation.
W2V methods depend on several hyperparameters, some of which were tuned to some extent by the algorithms' designers to perform well on NLP tasks such as word similarity and analogy detection (Levy et al., 2015), such that most renowned implementations (Řehůřek and Sojka, 2010) use these values as defaults. In previous work that used W2V for recommendation (Grbovic et al., 2015; Barkan and Koenigstein, 2016; Musto et al., 2015; Ozsoy, 2016; Vasile et al., 2016; Nedelec et al., 2016), the values of these hyperparameters are rarely discussed, so we assume the original values were used.
Thus, we study the marginal importance of each hyperparameter of Skip-gram with negative sampling (SGNS) in a recommendation setting, using Next Event Prediction (NEP) as an offline proxy for a recommendation task. We conduct an extensive joint hyperparameter optimization on different types of recommendation datasets (two music, one e-commerce and one click-stream). This allows us to identify hyperparameters, namely the negative sampling distribution, number of epochs, subsampling parameter and window-size, whose optimization can significantly improve performance on the NEP task. This confirms that optimal values for these hyperparameters are data and task dependent, and that the best configurations for recommendation tasks are radically different from those for NLP tasks, especially regarding the negative sampling distribution.
We first describe W2V methods and associated hyperparameters in Section 2. Then, we present the experiments in Section 3 and results in Section 4 before concluding in Section 5.
W2V (Mikolov et al., 2013a; Mikolov et al., 2013b) is a group of word embedding algorithms that provides state-of-the-art results on various linguistic tasks (Levy et al., 2015). They are based on the Distributional Hypothesis (Sahlgren, 2008), which states that words that appear in the same contexts tend to purport similar meanings. The most common method, SGNS, used in the remainder of the paper, seeks to represent each word $w$ as a $d$-dimensional vector $\mathbf{w} \in \mathbb{R}^d$, such that words that are similar to each other have similar vector representations. It does so by maximizing a function of products $\mathbf{w} \cdot \mathbf{c}$ for pairs $(w, c)$ where word $c$ appears in the context of word $w$ (a window around $w$ of maximum size $L$), and minimizing it for negative examples $(w, c_N)$, where $c_N$ does not necessarily appear in the context of $w$. The objective to maximize is

$$\ell = \sum_{(w,c) \in D} \Big( \log \sigma(\mathbf{w} \cdot \mathbf{c}) + \sum_{i=1}^{n} \log \sigma(-\mathbf{w} \cdot \mathbf{c}_{N_i}) \Big) \qquad (1)$$

where $\sigma$ is the sigmoid function $\sigma(x) = 1/(1 + e^{-x})$ and $D$ is the set of observed (word, context) pairs. For each observation of $(w, c)$, SGNS forms $n$ negative examples by sampling (hence the term "negative sampling") words from the corpus according to an $\alpha$-smoothed unigram distribution:

$$P_\alpha(w) = \frac{f(w)^\alpha}{\sum_{w'} f(w')^\alpha} \qquad (2)$$

Here $f(w)$ denotes the frequency of word $w$, and the parameter $\alpha$ is empirically set to $3/4$ following (Mikolov et al., 2013b). This makes frequent words $\alpha$-smoothly sampled more often than rare words when creating the negative examples. Training can be made faster by using a dynamic window-size (i.e., randomly sampling the effective window size between 1 and $L$) or by randomly removing (sub-sampling) words that are more frequent than some threshold $t$, each occurrence being discarded with probability

$$p(w) = 1 - \sqrt{\frac{t}{f(w)}} \qquad (3)$$
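To make the sub-sampling rule concrete, here is a minimal Python sketch (illustrative only, not gensim's implementation; `freqs` maps items to relative frequencies and `t` is the threshold):

```python
import math
import random

def discard_prob(freq, t=1e-3):
    """Probability of dropping an occurrence of a word whose relative
    frequency is `freq`, following p(w) = 1 - sqrt(t / f(w)).
    Words rarer than the threshold t are never dropped."""
    return max(0.0, 1.0 - math.sqrt(t / freq))

def subsample(sequence, freqs, t=1e-3, rng=random.random):
    """Randomly remove frequent items from a sequence before training."""
    return [w for w in sequence if rng() >= discard_prob(freqs[w], t)]

freqs = {"the": 0.05, "embedding": 0.0001}
print(discard_prob(freqs["the"]))        # frequent word: often dropped
print(discard_prob(freqs["embedding"]))  # rare word: never dropped
```

With $t = 10^{-3}$, an item covering 5% of all events is dropped roughly 86% of the time, while items rarer than the threshold are always kept.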
In recommendation settings, such as music consumption or online shopping, a revised version of the Distributional Hypothesis is adopted to justify the use of SGNS, stating that items that appear in the same contexts share similarities. Grbovic et al. (Grbovic et al., 2015) proposed applying SGNS to sequences of items to form item embeddings employed in recommendation applications. W2V-based item embeddings have since been successfully used in numerous recommendation scenarios (Barkan and Koenigstein, 2016; Vasile et al., 2016; Ozsoy, 2016; Musto et al., 2015). Since then, the method has been adapted to handle problems specific to recommendation. For example, Meta-Prod2vec (Vasile et al., 2016) improves upon Prod2Vec by using item meta-data side information to regularize the final item embedding, and its authors show that it outperforms Prod2Vec on NEP for music, globally and especially in a cold-start regime.
In the following, we describe the role and classically used values of the hyperparameters in the investigated literature (Grbovic et al., 2015; Barkan and Koenigstein, 2016; Musto et al., 2015; Ozsoy, 2016; Vasile et al., 2016; Nedelec et al., 2016), for which joint optimization significantly improved NEP performance.
Negative pairs of items are sampled from the negative sampling distribution, which is parametrized by $\alpha$ in Equation (2). The original smoothed unigram distribution, proposed in (Mikolov et al., 2013b), samples items proportionally to their frequency raised to the power $\alpha = 3/4$. This value was empirically chosen because it outperformed the uniform ($\alpha = 0$) and unigram ($\alpha = 1$) distributions on every linguistic task tested by the authors. This result was further confirmed in (Levy et al., 2015), where the authors extensively studied the marginal effect of each hyperparameter of W2V by performing a joint hyperparameter search for two linguistic tasks. Consequently, widely used implementations of W2V (e.g. gensim (Řehůřek and Sojka, 2010)) use this value by default and do not clearly present it as tunable. We assume that works that do not discuss this parameter rely on its commonly accepted default value.
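The effect of the smoothing exponent can be illustrated with a few lines of Python (a toy sketch with made-up counts, not tied to any implementation):

```python
from collections import Counter

def neg_sampling_dist(counts, alpha):
    """alpha-smoothed unigram distribution: P(w) proportional to f(w)^alpha."""
    weights = {w: c ** alpha for w, c in counts.items()}
    z = sum(weights.values())
    return {w: v / z for w, v in weights.items()}

counts = Counter({"hit_song": 1000, "deep_cut": 10})
for alpha in (1.0, 0.75, 0.0, -0.5):
    p = neg_sampling_dist(counts, alpha)
    # deep_cut's probability of being drawn as a negative grows as alpha decreases
    print(alpha, round(p["deep_cut"], 3))
```

Lowering $\alpha$ toward 0 flattens the distribution, and a negative $\alpha$ inverts it, making rare items the most likely negative examples; recent gensim releases expose this exponent as the `ns_exponent` argument of `Word2Vec`.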
The number of epochs controls the total number of times SGNS goes over each item of the dataset, which has a direct impact on the duration and quality of the training. Its default value in gensim (Řehůřek and Sojka, 2010) is 5. Some works report a hyperparameter search on the number of epochs (Ozsoy, 2016; Vasile et al., 2016), but stop their investigation too early or do not detail the method or the final values used.
The window-size is sampled randomly between 1 and the maximum window-size $L$. It controls how wide the gap between two items in a sequence can be for them to still be considered part of the same context. The default value in gensim (Řehůřek and Sojka, 2010) is 5. Some authors claim, without empirical or theoretical verification, that it is best to use an "infinite" window-size (Barkan and Koenigstein, 2016), meaning that the whole session is considered as one context, but most arbitrarily use a fixed value without further discussion.
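The dynamic window can be sketched as follows (illustrative; `max_window` plays the role of the maximum window-size $L$ above):

```python
import random

def sample_context(sequence, i, max_window):
    """Dynamic window: draw an effective size uniformly in [1, max_window],
    then return the items around position i within that window."""
    w = random.randint(1, max_window)
    return sequence[max(0, i - w): i] + sequence[i + 1: i + 1 + w]

session = ["a", "b", "c", "d", "e"]
print(sample_context(session, 2, max_window=2))
```

Because the effective size is drawn uniformly in [1, max_window], items closer to the query position appear in more sampled contexts, which implicitly weights nearby items more heavily.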
We study the influence of 4 hyperparameters (negative sampling distribution parameter $\alpha$, number of epochs, sub-sampling parameter $t$, and window-size $L$) on final performance by evaluating SGNS on a recommendation task based on item embeddings, with 4 recommendation datasets coming from diverse sources. Our code is available online at https://github.com/anonymous-authors-recsys/w2v_reco_hyperparameters_matter.
3.1.1. Music datasets
We rely on 2 sets of listening sessions. The former, "30Music" (Turrin et al., 2015), composed of listening and playlist data retrieved from Internet radio stations, is open and commonly used for recommendation (Vasile et al., 2016; Jannach et al., 2017; Ben-Elazar et al., 2017; Brusamento et al., 2016). The latter is a private dataset of listening sessions from an on-demand music streaming service. Both are composed of an equal number of sessions sampled from the original datasets. We refer to these datasets as Music 1 and Music 2, respectively. Log count distributions for these datasets are shown in Figure 1. The tail of the log count distribution is sharp and the distribution is left-skewed: there is an important discrepancy between popular and unpopular items. We notice a strong resemblance between the two distributions, which suggests similar music consumption habits across users of the two platforms.
3.1.2. E-commerce dataset
We use an open Online Retail dataset (Chen et al., 2012) composed of transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer. It is composed of user purchase histories. Compared to the music data, the tail of the log count distribution is heavier and the distribution is less left-skewed: the discrepancy between popular and unpopular items is smaller.
3.1.3. Click-stream dataset
We use the "kosarak" dataset (Bodon, 2003), which contains anonymized click-stream data from a Hungarian online news portal. It is composed of user click-stream histories. The log count distribution is comparable to the two music datasets: sharp tail and left-skewness.
3.2. Task and metrics
We evaluate the item embeddings on the Next Event Prediction (NEP) task, a common way to assess the quality of item embeddings for recommendation (Vasile et al., 2016; Letham et al., 2013; Rendle et al., 2010). We consider time-ordered sequences of user interactions with the items and split each sequence into training, validation and test sets. We first fit the SGNS model on the initial elements of each user sequence; then, we use the performance on randomly sampled held-out (query item, next item) pairs (validation set) to benchmark the hyperparameters; finally, we report our final results on a disjoint set of randomly sampled held-out pairs (test set). For prediction, we use the last item in the training sequence as the query item and predict the closest items to it using a nearest-neighbor approach (Cover and Hart, 1967). We evaluate with the following metrics:
Hit ratio at K (HR@K). It is equal to 1 if the test product appears in the list of K predicted products and 0 otherwise (Le et al., 2007).
Normalized Discounted Cumulative Gain (NDCG@K). It favors higher ranks in the ordered list of predicted products (Järvelin and Kekäläinen, 2002). The formula is

$$\mathrm{DCG@K} = \sum_{i=1}^{K} \frac{2^{rel_i} - 1}{\log_2(i + 1)}$$

where $rel_i$ is the graded relevance of the result at position $i$. Among the K proposed items, if one of them is equal to the next item, its relevance $rel_i$ at its position is 1, all other items having a relevance of 0; the score is then normalized by the ideal DCG, which equals 1 in this setting.
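As an illustration, both metrics can be computed from a nearest-neighbor prediction step over the learned embeddings. A toy NumPy sketch (not the paper's implementation; the embeddings below are placeholders):

```python
import math
import numpy as np

def top_k(query_idx, item_vecs, k):
    """Indices of the k nearest items to the query by cosine similarity."""
    v = item_vecs / np.linalg.norm(item_vecs, axis=1, keepdims=True)
    sims = v @ v[query_idx]
    sims[query_idx] = -np.inf  # never recommend the query item itself
    return np.argsort(-sims)[:k]

def hr_at_k(predicted, next_item):
    """1 if the true next item is among the K predictions, else 0."""
    return int(next_item in predicted)

def ndcg_at_k(predicted, next_item):
    """With a single relevant item, DCG reduces to 1/log2(p + 1), where p is
    the 1-based rank of the next item; the ideal DCG is 1."""
    preds = list(predicted)
    if next_item not in preds:
        return 0.0
    return 1.0 / math.log2(preds.index(next_item) + 2)

# Toy embeddings: item 1 is closest to item 0.
vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
preds = top_k(0, vecs, k=2)
print(hr_at_k(preds, 1), round(ndcg_at_k(preds, 1), 3))  # → 1 1.0
```

Averaging these per-pair scores over the held-out (query item, next item) pairs yields the reported HR@K and NDCG@K.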
3.3. Experimental setup
We use a modified implementation of gensim (Řehůřek and Sojka, 2010) for our experiments, such that the parameter $\alpha$ (Eq. (2)) becomes tunable. We perform an extensive joint hyperparameter search over the number of epochs, the window-size $L$, the sub-sampling parameter $t$ (Eq. (3)), the negative sampling distribution parameter $\alpha$ (Eq. (2)), the embedding size, the number of negative samples and the learning rate. The marginal benefit of including the 3 latter variables in the optimization is not significant (inferior to the confidence interval). Thus, for readability, we only focus on the influence of the 4 first hyperparameters and keep the others fixed to their gensim default values.
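The joint search itself is an exhaustive loop over configurations. A minimal sketch (the grid values and the `evaluate_config` callback below are hypothetical placeholders, not the ranges used in our experiments):

```python
import itertools

# Hypothetical grid mirroring the four hyperparameters kept in the search.
grid = {
    "epochs": [10, 50, 100],
    "window": [3, 5, 10],
    "sample_t": [1e-5, 1e-3],
    "ns_exponent": [-0.5, 0.0, 0.75],
}

def grid_search(evaluate_config):
    """Jointly search all combinations; `evaluate_config` is assumed to train
    SGNS with the given configuration and return the validation HR@10."""
    best_score, best_cfg = float("-inf"), None
    for values in itertools.product(*grid.values()):
        cfg = dict(zip(grid.keys(), values))
        score = evaluate_config(cfg)
        if score > best_score:
            best_score, best_cfg = score, cfg
    return best_cfg, best_score

# Dummy objective for illustration only: it favors a negative ns_exponent.
cfg, score = grid_search(lambda c: -abs(c["ns_exponent"] + 0.5))
print(cfg["ns_exponent"])  # → -0.5
```

In practice `evaluate_config` would train the model and score it on the validation pairs; the dummy objective above merely illustrates the selection mechanics.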
We run the task on the 4 datasets and select the optimal parameters based only on HR@10 performance, given that we observe a strong correlation with NDCG@10 performance. Results (average score over 10 folds) and confidence intervals on the test set are aggregated in Table 1 ("Fully optimized SGNS"). To demonstrate the benefit of performing a hyperparameter search, we also present the results obtained when SGNS is used with the default values of the gensim (Řehůřek and Sojka, 2010) implementation ("Out-of-the-box SGNS" row). As the parameter $\alpha$ is often not tunable in implementations and never discussed in the recommendation setting, we also report the results obtained when optimizing every hyperparameter but $\alpha$ ("Optimized SGNS" row), in order to isolate the benefit of optimizing over this variable.
To compare with recommendation-specific derivations of Prod2Vec, we use Meta-Prod2vec (Vasile et al., 2016) on the two music datasets, with artists as side information, and report results in the "Fully optimized MetaProd2vec" row for optimized models and the "MetaProd2vec (Vasile et al., 2016)" row for models trained with the configuration specified in (Vasile et al., 2016). As Meta-Prod2vec was specifically designed to perform well in a cold-start regime, we also report, in Table 2, results in the cold-start scenario, for (query item, next item) pairs that co-occur zero times, or fewer than 3 times, in the training set.
Table 1. Results on the NEP task (test set, average over 10 folds ± confidence interval).

| Model | Music 1 (HR@10) | Music 1 (NDCG@10) | Music 2 (HR@10) | Music 2 (NDCG@10) | E-commerce (HR@10) | E-commerce (NDCG@10) | Click-stream (HR@10) | Click-stream (NDCG@10) |
|---|---|---|---|---|---|---|---|---|
| Out-of-the-box SGNS | 10.77 ± 0.001 | 0.095 ± 0.0001 | 8.19 ± 0.001 | 0.061 ± 0.0001 | 24.67 ± 0.001 | 0.172 ± 0.0001 | 2.81 ± 0.001 | 0.0157 ± 0.0001 |
| Optimized SGNS | 22.79 ± 0.9 | 0.171 ± 0.0001 | 14.15 ± 0.5 | 0.098 ± 0.0001 | 27.10 ± 0.1 | 0.189 ± 0.0001 | 23.16 ± 0.3 | 0.132 ± 0.0001 |
| Fully optimized SGNS | 24.75 ± 0.4 | 0.180 ± 0.0001 | 15.69 ± 0.3 | 0.107 ± 0.0001 | 27.46 ± 0.2 | 0.191 ± 0.0001 | 24.66 ± 0.3 | 0.141 ± 0.0001 |
| MetaProd2vec (Vasile et al., 2016) | – | ± 0.0001 | – | – | – | – | – | – |
| Fully optimized MetaProd2vec | 15.84 ± 0.3 | 0.107 ± 0.0001 | – | – | – | – | – | – |
Table 2. Cold-start results (HR@10) on (query item, next item) pairs with low co-occurrence counts in the training set.

| Model (dataset) | Pair frequency = 0 | Pair frequency < 3 |
|---|---|---|
| Fully optimized SGNS (Music 1) | 8.36 ± 0.4 | 16.91 ± 0.7 |
| Fully optimized MetaProd2vec (Music 1) | – | – |
| Fully optimized SGNS (Music 2) | 4.85 ± 0.6 | 9.64 ± 0.7 |
| Fully optimized MetaProd2vec (Music 2) | – | – |
On the two Music datasets, performing a hyperparameter search roughly doubles performance over the default values (Table 1). The best configurations for these two datasets are quasi-identical (same $\alpha$, sub-sampling parameter and window-size), possibly a consequence of the similarity of their count distributions observed in Section 3. Hyperparameter optimization increases performance almost ninefold on the Click-Stream dataset (from 2.81 to 24.66 HR@10), and yields substantial performance gains on the E-commerce dataset.
Interestingly, for all datasets, the marginal benefit of including $\alpha$ in the joint hyperparameter search is significant in terms of final performance. This is illustrated in Figure 2, where we select the best performing configurations for different values of $\alpha$ and plot the NEP performance on the Music 1 dataset. The original $\alpha = 3/4$, optimal on linguistic tasks, is clearly suboptimal in this recommendation setting. The optimal $\alpha$ is negative, such that the optimal negative sampling distribution is more likely to sample unpopular items as negative examples. For the Music 2 and Click-Stream datasets, the optimal $\alpha$ is also negative.
We observe that Meta-Prod2vec (Vasile et al., 2016) can also benefit from hyperparameter optimization, once again with a negative $\alpha$. However, we also note that it is outperformed by an optimized SGNS on both Music 1 and Music 2 (Table 1). In the cold-start regime, results indicate that, once optimized, MetaProd2vec is on par with SGNS (Table 2). Hence, it may be worth optimizing standard methods before moving to more specialized ones.
As expected, results confirm that the optimal choice of hyperparameters for SGNS is data-dependent and task-dependent, and that, for the given datasets and the considered task (NEP), jointly optimizing the hyperparameters is highly valuable in terms of final performance. In particular, the optimal negative sampling distribution clearly differs from the one proven optimal for linguistic tasks (Mikolov et al., 2013b; Levy et al., 2015), and optimizing over this additional variable yields significant improvements. From Equations (1) and (2) we can see that when $\alpha$ is negative, popular items are more often pushed away from unpopular items, which could be beneficial for the NEP task as items within a session are often of a similar order of popularity.
Developed first for NLP, SGNS generates word embeddings that help achieve state-of-the-art performance on semantic similarity and analogy tasks. Previous work shows that it can be directly applied to sequences of items to generate item embeddings useful for recommendation. Interestingly, while NLP data and tasks differ in structure and goals from those of recommendation, the hypotheses behind some of the hyperparameters of the algorithm are barely discussed, nor are their default values revised. We show that using different values for some hyperparameters, namely the negative sampling distribution, number of epochs, subsampling parameter and window-size, leads to significantly better performance on classical evaluation tasks on 4 recommendation datasets. While performing a joint hyperparameter search for each type of data and task in a real-life recommendation setting can be time-consuming and computationally costly, we stress the benefits of having better item embeddings to better distinguish, cluster and classify content, which can lead to substantial gains in demanding industries, such as on-demand music streaming services, where a few bad recommendations can quickly lead a user to leave the service.
Comparing several recommendation datasets on the same task, we observe that different distributions result in different optimal hyperparameter values. Tuning the parameters seems to drive the algorithm towards organizing the embedding space in one way or another (e.g. better positioning the top items, or pushing away less demanded ones). Furthermore, the homogeneity of popularity between items of a same sequence, the shape of the popularity distribution, and the heterogeneity of the items in the catalog have a direct impact on the task evaluation. Intuitively, the optimal hyperparameter values depend on the structure of the data. We have yet to determine whether the former can be induced from the latter; this could have a strong impact on improving the current results with SGNS applied to recommendation.
- Barkan and Koenigstein (2016) Oren Barkan and Noam Koenigstein. 2016. Item2vec: neural item embedding for collaborative filtering. In Machine Learning for Signal Processing (MLSP), 2016 IEEE 26th International Workshop on. IEEE, 1–6.
- Ben-Elazar et al. (2017) Shay Ben-Elazar, Gal Lavee, Noam Koenigstein, Oren Barkan, Hilik Berezin, Ulrich Paquet, and Tal Zaccai. 2017. Groove Radio: A Bayesian Hierarchical Model for Personalized Playlist Generation. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. ACM, 445–453.
- Bodon (2003) Ferenc Bodon. 2003. A fast APRIORI implementation. In Frequent Itemset Mining Implementations Workshop, Third IEEE International Conference on Data Mining (ICDM). IEEE.
- Brusamento et al. (2016) Mattia Brusamento, Roberto Pagano, MA Larson, and Paolo Cremonesi. 2016. Explicit Elimination of Similarity Blocking for Session-based Recommendation. RecSys 2016 Poster Proceedings (2016).
- Bybee and Hopper (2001) Joan L Bybee and Paul J Hopper. 2001. Frequency and the emergence of linguistic structure. Vol. 45. John Benjamins Publishing.
- Chen et al. (2012) Daqing Chen, Sai Laing Sain, and Kun Guo. 2012. Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining. Journal of Database Marketing & Customer Strategy Management 19, 3 (2012), 197–208.
- Cover and Hart (1967) Thomas Cover and Peter Hart. 1967. Nearest neighbor pattern classification. IEEE transactions on information theory 13, 1 (1967), 21–27.
- Grbovic et al. (2015) Mihajlo Grbovic, Vladan Radosavljevic, Nemanja Djuric, Narayan Bhamidipati, Jaikit Savla, Varun Bhagwan, and Doug Sharp. 2015. E-commerce in your inbox: Product recommendations at scale. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1809–1818.
- Greer et al. (1973) R Douglas Greer, Laura G Dorow, Gustav Wachhaus, and Elmer R White. 1973. Adult approval and students’ music selection behavior. Journal of Research in Music Education 21, 4 (1973), 345–354.
- Hutter et al. (2014) Frank Hutter, Holger Hoos, and Kevin Leyton-Brown. 2014. An efficient approach for assessing hyperparameter importance. In International Conference on Machine Learning. 754–762.
- Jannach et al. (2017) Dietmar Jannach, Iman Kamehkhosh, and Lukas Lerche. 2017. Leveraging multi-dimensional user models for personalized next-track music recommendation. In Proceedings of the Symposium on Applied Computing. ACM, 1635–1642.
- Järvelin and Kekäläinen (2002) Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS) 20, 4 (2002), 422–446.
- Le et al. (2007) Quoc V Le, Alex Smola, Olivier Chapelle, Choon Hui Teo, and Ralf Herbrich. 2007. Direct optimization of ranking measures. KDD’13.
- Letham et al. (2013) Benjamin Letham, Cynthia Rudin, and David Madigan. 2013. Sequential event prediction. Machine learning 93, 2-3 (2013), 357–380.
- Levy et al. (2015) Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics 3 (2015), 211–225.
- Mikolov et al. (2013a) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).
- Mikolov et al. (2013b) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems. 3111–3119.
- Musto et al. (2015) Cataldo Musto, Giovanni Semeraro, Marco De Gemmis, and Pasquale Lops. 2015. Word Embedding Techniques for Content-based Recommender Systems: An Empirical Evaluation. RecSys 2015 Poster Proceedings.
- Nedelec et al. (2016) Thomas Nedelec, Elena Smirnova, and Flavian Vasile. 2016. Content2vec: Specializing joint representations of product images and text for the task of product recommendation. (2016).
- Ozsoy (2016) Makbule Gulcin Ozsoy. 2016. From word embeddings to item recommendation. arXiv preprint arXiv:1601.01356 (2016).
- Řehůřek and Sojka (2010) Radim Řehůřek and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. ELRA, Valletta, Malta, 45–50. http://is.muni.cz/publication/884893/en.
- Rendle et al. (2010) Steffen Rendle, Christoph Freudenthaler, and Lars Schmidt-Thieme. 2010. Factorizing personalized markov chains for next-basket recommendation. In Proceedings of the 19th international conference on World wide web. ACM, 811–820.
- Sahlgren (2008) Magnus Sahlgren. 2008. The distributional hypothesis. Italian Journal of Disability Studies 20 (2008), 33–53.
- Steck (2011) Harald Steck. 2011. Item popularity and recommendation accuracy. In Proceedings of the fifth ACM conference on Recommender systems. ACM, 125–132.
- Turrin et al. (2015) Roberto Turrin, Massimo Quadrana, Andrea Condorelli, Roberto Pagano, and Paolo Cremonesi. 2015. 30Music Listening and Playlists Dataset. In RecSys 2015 Poster Proceedings.
- Vasile et al. (2016) Flavian Vasile, Elena Smirnova, and Alexis Conneau. 2016. Meta-Prod2Vec: Product Embeddings Using Side-Information for Recommendation. In Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 225–232.