MetaProd2Vec: Product Embeddings Using Side Information for Recommendation
Abstract
We propose MetaProd2Vec, a novel method to compute item similarities for recommendation that leverages existing item metadata. Such scenarios are frequently encountered in applications such as content recommendation, ad targeting and web search. Our method leverages past user interactions with items and their attributes to compute low-dimensional embeddings of items. Specifically, the item metadata is injected into the model as side information to regularize the item embeddings. We show that the new item representations lead to better performance on recommendation tasks on an open music dataset.
Flavian Vasile 
Criteo 
Paris 
f.vasile@criteo.com 
Elena Smirnova 
Criteo 
Paris 
e.smirnova@criteo.com 
Alexis Conneau (work done while interning at Criteo) 
Facebook AI Research 
Paris 
aconneau@fb.com 
Keywords: Recommender systems; Embeddings; Neural Networks
In recent years, online commerce has outpaced the growth of traditional commerce, growing at a rate of 15% in 2015 and accounting for $1.5 trillion in spending in 2014. Research on recommender systems has consequently grown significantly over the last couple of years. As shown by key players in the online e-commerce space, such as Amazon, Netflix and Alibaba, product recommendation is now a key driver of demand, accounting in the case of Amazon for roughly [?] of overall sales.
As of now, state-of-the-art recommendation methods include matrix factorization techniques of the item-item and user-item matrices; these differ in the weighting schemes of the matrix entries, the reconstruction loss functions and the use of additional item content information. Real-world recommender systems face additional constraints that inform their architecture. Major constraints include: scaling to huge amounts of user interaction data, supporting real-time changes in recommendations [?] and handling the cold-start problem [?].
In the last couple of years, a promising new class of neural probabilistic models that generate user and product embeddings has emerged and shown strong results. These methods scale to millions of items and yield good improvements on the cold-start problem. In the context of product recommendation, they have been successfully applied to ad recommendations in Yahoo! Mail [?], to restaurant recommendations at OpenTable [?] and by the first-prize winners of the 2015 RecSys Challenge [?].
In this paper, we present an extension of the Prod2Vec algorithm initially proposed in [?]. The Prod2Vec algorithm uses only the local product co-occurrence information established by the product sequences to create distributed representations of products, and does not leverage their metadata. The authors have proposed an extension [?] of the algorithm that takes into account the textual content information together with the sequential structure, but the approach is specific to textual metadata and the resulting architecture is hierarchical, therefore missing some of the side information terms by comparison with our method. In this work, we make the connection with the work on recommendation using side information and propose MetaProd2Vec, a general approach for adding categorical side information to the Prod2Vec model in a simple and efficient way. The usage of additional item information as side information only, i.e. available only at training time, is motivated by real-world constraints on the number of feature values a recommender system can keep in memory for real-time scoring. In this case, using the metadata only at training time keeps the memory footprint constant (assuming an existing recommendation system that uses item embeddings) while improving online performance.
We show that our approach significantly improves recommendation performance on a subset of the 30Music listening and playlists dataset [?] with a low implementation and integration cost.
In Section 2 we cover previous related work and its relationship with our method. In Section 3 we present the MetaProd2Vec approach. In Section 4 we present the experimental setup and the results on the 30Music dataset. In Section 5 we summarize our findings and conclude with future directions of research.
Existing methods for recommender systems can roughly be categorized into collaborative filtering (CF) methods, content-based (CB) methods and hybrid methods. CF methods [?] are based on users' interactions with items, such as clicks, and do not require domain knowledge. Content-based methods make use of user or product content profiles. In practice, CF methods are more popular because they can discover interesting associations between products without requiring the heavy knowledge collection needed by content-based methods. However, CF methods suffer from the cold-start problem, in which no or few interactions are available for niche or new items in the system. In recent years, more sophisticated methods, namely latent factor models, have been developed to address the data sparsity problem of CF methods, which we discuss below. To further help overcome the cold-start problem, recent work has focused on hybrid methods that combine latent factor models with content information, which we also cover below.
Matrix factorization (MF) methods [?, ?] became popular after their success in the Netflix competition. These methods learn lowrank decompositions of a sparse useritem interaction matrix by minimizing the square loss over the reconstruction error. The dot product between the resulting user and item latent vectors is then used to perform recommendation.
Several modifications have been proposed to better align MF methods with the recommendation objective, for instance, Bayesian Personalized Ranking [?] and Logistic MF [?]. The former learns user and item latent vectors through a pairwise ranking loss that emphasizes the relevance-based ranking of items. The latter models the probability that a user will interact with an item by replacing the square loss in the MF method with the logistic loss [?].
One of the first methods to learn user and item latent representations through a neural network was proposed in [?]. The authors utilized Restricted Boltzmann Machines to explain user-item interactions and perform recommendations. Recently, shallow neural networks have been gaining attention thanks to the success of word embeddings in various NLP tasks, the focus being on the Word2Vec model [?]. An application of Word2Vec to the recommendation task, the Prod2Vec model, was proposed in [?]. It generates product embeddings from sequences of purchases and performs recommendation based on the most similar products in the embedding space. Our work is an extension of Prod2Vec, whose details we present in Section 3.
Many techniques have been used recently to create unified representations from latent factors and content information. One way to integrate user and item content information is to use it to estimate user and item latent factors through regression [?]. Another approach is to learn latent factors for both CF and content features, as in Factorization Machines [?].
Tensor factorization has been suggested as a generalization of MF for incorporating additional information [?]. In this approach, the user-item-content tensor is factorized into a common latent space. The authors of [?] propose a co-factorization approach where the latent user and item factors are shared between factorizations of the user-item matrix and the user and item content matrices. Similar to [?], they also assign weights to negative examples based on user-item content-based dissimilarity.
Graph-based models have also been used to create unified representations. In particular, in [?] user-item interactions and side information are modeled jointly through user and item latent factors. User factors are shared by the user-item interaction component and the side information component. Gunawardana et al. [?] learn the interaction weights between user actions and various features such as user and item metadata, using a unified Boltzmann Machine to make predictions.
In their Prod2Vec paper [?], Grbovic et al. proposed applying the Word2Vec algorithm to sequences of product receipts coming from emails. More formally, given a set $S$ of sequences of products $s = (p_1, p_2, \ldots, p_M)$, $p_i \in P$, the objective is to find a $D$-dimensional real-valued representation $v_p \in \mathbb{R}^D$ of each product $p$ such that similar products are close in the resulting vector space.
The source algorithm, Word2Vec [?], is a highly scalable predictive model for learning word embeddings from text and belongs to the larger class of Neural Net Language Models [?]. Most of the work in this area is based on the Distributional Hypothesis [?], which states that words that appear in the same contexts have close, if not identical, meanings.
A similar hypothesis applies in larger contexts such as online shopping, music and media consumption, and has been the basis of CF methods. In the CF setup, the users of the service provide the distributed context in which products co-occur, leading to the classical item co-occurrence approaches in CF. A further similarity between co-count-based recommendation methods and Word2Vec has been established by Levy and Goldberg in [?]; the authors show that the objective of the embedding method is closely related to the decomposition of the matrix containing as entries the Shifted Positive PMI (SPMI) of the locally co-occurring items (words), where PMI is the point-wise mutual information:

$$SPMI_{ij} = PMI_{ij} - \log k, \qquad PMI_{ij} = \log \frac{X_{ij} \cdot |D|}{X_i \, X_j}$$

where $X_i$ and $X_j$ are item frequencies, $X_{ij}$ is the number of times $i$ and $j$ co-occur, $|D|$ is the size of the dataset and $k$ is the ratio of negatives to positives.
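To make the connection concrete, the SPMI matrix above can be built from a co-occurrence count matrix in a few lines. This is an illustrative sketch, not the paper's code: the function name, the clipping of negative entries at zero (the "positive" part of Shifted Positive PMI) and the zeroing of undefined cells are our assumptions.

```python
import numpy as np

def shifted_positive_pmi(X, k):
    """Shifted Positive PMI from a co-occurrence count matrix X, where
    X[i, j] is the number of times items i and j co-occur and k is the
    negatives-to-positives ratio.  Illustrative sketch only."""
    D = X.sum()                        # size of the dataset |D|
    Xi = X.sum(axis=1, keepdims=True)  # item frequencies X_i (rows)
    Xj = X.sum(axis=0, keepdims=True)  # item frequencies X_j (columns)
    with np.errstate(divide="ignore"):
        pmi = np.log(X * D) - np.log(Xi * Xj)
    spmi = np.maximum(pmi - np.log(k), 0.0)  # shift by log k, clip negatives
    spmi[X == 0] = 0.0                       # undefined cells default to 0
    return spmi
```

With two items that co-occur twice each way, the off-diagonal SPMI is log 2 for k = 1, and drops to 0 once the shift log k matches the PMI.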
In [?] the authors show that the Word2Vec objective (and, similarly, Prod2Vec's) can be rewritten as the optimization problem of minimizing the weighted cross-entropy between the empirical and the modeled conditional distributions of context products given the target product (more precisely, this describes the Word2Vec Skip-Gram model, which usually does better on large datasets). Furthermore, the conditional distribution is modeled as a softmax over the inner products between the target and context product vectors:

$$L_{P2V} = L_{J|I} = \sum_{i} X_i \, H\big(p_{\cdot|i}, q_{\cdot|i}(\theta)\big), \qquad q_{j|i}(\theta) = \frac{\exp(w_i^\top w'_j)}{\sum_{j'} \exp(w_i^\top w'_{j'})}$$

Here, $H(p_{\cdot|i}, q_{\cdot|i})$ is the cross-entropy between the empirical probability $p_{j|i} = X_{ij} / X_i$ of seeing any product $j$ in the output space conditioned on the input product $i$ and the predicted conditional probability $q_{j|i}(\theta)$, where $X_i$ is the input frequency of product $i$ and $X_{ij}$ is the number of times the pair of products $(i, j)$ has been observed in the training data.
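As a sanity check on the notation, the full-softmax objective can be evaluated directly for a toy catalog. This is a hypothetical sketch (real implementations use negative sampling for scalability); `W_in` and `W_out` stand for the input and output embedding matrices and are our naming.

```python
import numpy as np

def prod2vec_softmax_loss(W_in, W_out, X):
    """Weighted cross-entropy sum_i X_i * H(p_.|i, q_.|i), which equals
    sum_ij X_ij * (-log q_j|i), with q_j|i the softmax over w_i . w'_j.
    W_in, W_out: (n_items, dim) embeddings; X: co-occurrence counts."""
    logits = W_in @ W_out.T                                      # w_i . w'_j
    logq = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-(X * logq).sum())
```

With all-zero embeddings the model predicts the uniform distribution, so each observed pair contributes log(vocabulary size) to the loss, which gives a quick correctness check.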
The resulting architecture for Prod2Vec is shown in Figure 1: the product situated in the center of the window is trained to predict the surrounding products using a neural network with a single hidden layer and a softmax output layer.
However, the product embeddings generated by Prod2Vec only take into account the information of the user purchase sequence, that is, the local co-occurrence information. Though richer than the global co-occurrence frequency used in Collaborative Filtering, it does not take into account other types of available item information (the items' metadata).
For example, assuming the inputs are sequences of categorized products, the standard Prod2Vec embedding does not model the following interactions:

- given the current visit of a product $p$ with category $c$, it is more likely that the next visited product will belong to the same category $c$;
- given the current category $c$, it is more likely that the next category is $c$ or one of the related categories (e.g. after a swimwear sale, one is likely to observe a sunscreen sale, which belongs to an entirely different product category);
- given the current product $p$, the next category is more likely to be $c$ or a related category;
- given the current category $c$, the products currently visited are more likely to be $p$ or similar products.
As mentioned in the introduction, the same authors extended the Prod2Vec algorithm in [?] so as to take into account both the product sequence and the product text at the same time. When applied to non-textual metadata, the extended algorithm models, in addition to the product sequence information, the dependency between the product metadata and the product id, but does not link the sequence of metadata and the sequence of product ids together.
As shown in the Related Work section, there has been extensive work on using side information for recommendation, especially in the context of hybrid methods that combine CF and content-based (CB) methods. In the case of embeddings, the closest work is the Doc2Vec model [?], where word and paragraph embeddings are trained jointly but only the paragraph embedding is used for the final task.
We propose a similar architecture that incorporates the side information in both the input and output space of the neural net and parametrizes each of the interactions between the items to be embedded and the metadata separately, as shown in Figure 2.
The MetaProd2Vec loss extends the Prod2Vec loss by taking into account four additional interaction terms involving the items' metadata:

$$L_{MP2V} = L_{J|I} + \lambda \times \big(L_{M|I} + L_{J|M} + L_{M|M} + L_{I|M}\big)$$

where $M$ is the metadata space (for example, artist ids in the case of the 30Music dataset) and $\lambda$ is the regularization parameter. We list the new interaction terms below:
$L_{I|M}$ is the weighted cross-entropy between the observed conditional probability of input product ids given their metadata and the predicted conditional probability. This side information term is slightly different from the next three because it models the item as a function of its own metadata (same index in the sequence). This is because, in most cases, the item's metadata is more general than the id and can partially explain the observation of that specific id.
$L_{J|M}$ is the weighted cross-entropy between the observed conditional probability of surrounding product ids given the input product's metadata and the predicted conditional probability. An architecture where the normal Word2Vec loss is augmented with only this interaction term is very close to the Doc2Vec model proposed in [?], with the document id replaced by a more general type of item metadata.
$L_{M|I}$ is the weighted cross-entropy between the observed conditional probability of surrounding products' metadata values given the input product and the predicted conditional probability.
$L_{M|M}$ is the weighted cross-entropy between the observed conditional probability of surrounding products' metadata values given the input product's metadata and the predicted conditional probability. This models the sequence of observed metadata and in itself represents a Word2Vec-like embedding of the metadata.
To summarize, $L_{J|I}$ and $L_{M|M}$ encode the loss terms coming from modeling the likelihood of the sequences of items and metadata separately, $L_{I|M}$ represents the conditional likelihood of the item id given its metadata, and $L_{J|M}$ and $L_{M|I}$ represent the cross-item interaction terms between the item ids and the metadata. In Figure 3 we show the relationship between the item matrix factorized by Prod2Vec and the one factorized by MetaProd2Vec.
The more general equation for MetaProd2Vec introduces a separate regularization parameter for each of the four types of side information: $\lambda_{im}$, $\lambda_{jm}$, $\lambda_{mi}$ and $\lambda_{mm}$.
In Section 4 we analyze the relative importance of each type of side information. Also, when multiple sources of metadata are used, each source has its own term in the global loss and its own regularization parameter.
In terms of the softmax normalization factor, we have the option of either separating the output spaces of the items and of their metadata or keeping them together. Similarly to the simplifying assumption used in Word2Vec, which allows each pair of co-occurring products to be predicted and fitted independently (thereby adding an implicit mutual exclusivity constraint on the output products given an input product), we embed the products and their metadata in the same space, allowing them to share the normalization constraint.
One of the main attractions of the Word2Vec algorithm is its scalability, which comes from approximating the original softmax loss over the space of all possible words with the Negative Sampling loss [?, ?], which fits the model only on the positive co-occurrences together with a small sample of negative examples, maximizing a modified likelihood function $L_{SG\text{-}NS}(\theta)$:

$$L_{SG\text{-}NS}(\theta) = \sum_{ij} X_{ij}^{POS} \, l(i, j, \theta)$$

and:

$$l(i, j, \theta) = -\log \sigma(w_i^\top w'_j) - k \cdot \mathbb{E}_{j_N \sim P_D}\left[\log \sigma(-w_i^\top w'_{j_N})\right]$$

where $P_D$ is the probability distribution used to sample negative context examples and $k$ is a hyper-parameter specifying the number of negative examples per positive example. The side information loss terms $L_{I|M}$, $L_{J|M}$, $L_{M|I}$ and $L_{M|M}$ are computed according to the same formula, with the $i, j$ indexes ranging over the respective input/output spaces.
In the case of MetaProd2Vec, the impact of the decision to co-embed products and their metadata is that the set of potential negative examples for any positive pair ranges over the union of the item and metadata values.
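A minimal sketch of the per-pair negative-sampling term under this shared output space: the function name, the tiny vocabulary and the fixed negatives are our assumptions, and because items and metadata share the embedding space, the rows of `W_out` (and hence the sampled negatives) may index either.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ns_pair_loss(w_i, W_out, pos_j, neg_ids):
    """-log sigma(w_i . w'_pos) - sum_n log sigma(-w_i . w'_n):
    one positive context plus k sampled negatives.  Since products and
    metadata are co-embedded, neg_ids may point to rows of either kind."""
    loss = -np.log(sigmoid(w_i @ W_out[pos_j]))
    for n in neg_ids:                     # k negatives drawn from P_D
        loss -= np.log(sigmoid(-w_i @ W_out[n]))
    return float(loss)
```

For a positive pair with inner product 1 and a single negative with inner product -1, both terms evaluate to log(1 + e^-1), which is easy to verify by hand.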
Because of the shared embedding space, the training algorithm used for Prod2Vec remains unchanged. The only difference is that, in the new version of the training-pair generation step, the original pairs of items are supplemented with additional pairs that involve metadata. As for the online recommendation system, assuming we are augmenting a solution that already uses item embeddings, the online system does not change at all (the metadata is only used during training) and there is zero additional impact on the online memory footprint.
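The pair-generation change can be sketched as follows. The tuple encoding of sequences and the tagging of ids with their space ("item" vs "meta") are our assumptions, made so that a single shared vocabulary can hold both kinds of ids.

```python
def generate_training_pairs(seq, window=3):
    """Emit (input, output) training pairs from one user sequence of
    (item_id, meta_id) tuples: the original Prod2Vec item-item pairs
    plus the four metadata pair types of MetaProd2Vec.  Items and
    metadata are tagged so they stay distinct in the shared vocabulary."""
    pairs = []
    n = len(seq)
    for t, (it, mt) in enumerate(seq):
        pairs.append((("meta", mt), ("item", it)))       # item given own metadata
        lo, hi = max(0, t - window), min(n, t + window + 1)
        for c in range(lo, hi):
            if c == t:
                continue
            jt, jm = seq[c]
            pairs.append((("item", it), ("item", jt)))   # context item given item
            pairs.append((("meta", mt), ("item", jt)))   # context item given metadata
            pairs.append((("item", it), ("meta", jm)))   # context metadata given item
            pairs.append((("meta", mt), ("meta", jm)))   # context metadata given metadata
    return pairs
```

Each emitted pair then feeds the same negative-sampling update as in plain Prod2Vec, which is why the rest of the training loop is untouched.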
The experimental section is organized as follows. First, we describe the evaluation setup, namely, the evaluation task, success metrics and the baselines. Then, we report results of experiments on the 30Music open dataset.
We evaluate the recommendation methods on the next-event prediction task. We consider time-ordered sequences of user interactions with items and split each sequence into consecutive training, validation and test sets. We fit the Prod2Vec and MetaProd2Vec embedding models on the first (n-2) elements of each user sequence, use the performance on the (n-1)-th element to tune the hyper-parameters, and report our final results by training on the first (n-1) items of each sequence and predicting the n-th item.
We use the last item in the training sequence as the query item and we recommend the most similar products using one of the methods described below.
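Concretely, the retrieval step can be sketched as a cosine nearest-neighbor lookup over the learned item embeddings (an illustrative implementation; the array layout and function name are assumptions):

```python
import numpy as np

def recommend_top_k(query_idx, item_vecs, K=10):
    """Return indices of the K items whose embeddings are most
    cosine-similar to the query item's embedding, excluding the query
    item itself.  item_vecs: (n_items, dim) array of learned vectors."""
    norms = np.linalg.norm(item_vecs, axis=1) + 1e-12
    q = item_vecs[query_idx] / norms[query_idx]
    sims = (item_vecs / norms[:, None]) @ q   # cosine similarity to query
    sims[query_idx] = -np.inf                 # never recommend the query itself
    return np.argsort(-sims)[:K]
```

The same routine serves both Prod2Vec and MetaProd2Vec at prediction time, since only the embeddings differ.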
As mentioned in the introduction, due to the technical constraint of keeping a constant memory footprint, we are interested in the usefulness of item metadata at training time only. Therefore we do not compare against methods where the metadata is used directly at prediction time, such as the supervised content-based embedding model proposed in [?], where both the user and the item are represented as linear combinations of the item content embeddings, or [?], where products are represented by the associated image content embeddings.
We use the following evaluation metrics, averaged over all users:

- Hit ratio at K (HR@K), equal to 1/K if the test product appears in the top-K list of recommended products, and 0 otherwise.
- Normalized Discounted Cumulative Gain (NDCG@K), which favors higher ranks of the test product in the list of recommended products.
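For a single user, the two metrics can be sketched as below. This is our reading of the definitions: we assume HR@K is normalized by K (consistent with the reported numbers, where HR@20 can be lower than HR@10) and that there is exactly one relevant held-out item.

```python
import numpy as np

def hit_ratio_at_k(ranked, test_item, K):
    """1/K if the held-out item appears in the top-K list, else 0
    (the 1/K normalization is an assumption inferred from the results)."""
    return 1.0 / K if test_item in ranked[:K] else 0.0

def ndcg_at_k(ranked, test_item, K):
    """With a single relevant item, NDCG@K reduces to 1 / log2(rank + 1)
    when the item is ranked within the top K, and 0 otherwise."""
    topk = list(ranked[:K])
    if test_item not in topk:
        return 0.0
    rank = topk.index(test_item) + 1   # 1-based position in the list
    return 1.0 / np.log2(rank + 1)
```

Averaging these per-user values over all users yields the figures reported in the tables below.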
Using the aforementioned metrics, we compare the following methods:

- BestOf: retrieves the top products sorted by popularity. This simulates the frequently encountered recommendation solution based strictly on popularity.
- CoCounts: the standard CF method, which uses the cosine similarity between vectors of co-occurrences with other items. This method performs particularly well when the catalog of possible items is small and does not change much over time, thereby avoiding item cold-start problems.
- Standalone Prod2Vec: product recommendation based on cosine similarities of the product vectors obtained by running Word2Vec on product sequences. As with other embedding and matrix factorization solutions, the goal of the method is to address the cold-start problem.
- Standalone MetaProd2Vec: our proposed method, which enhances Prod2Vec with item side information and uses the resulting product embeddings to compute cosine similarities. As in Prod2Vec, the goal is to further address cold-start problems.
- Mix(Prod2Vec, CoCounts): an ensemble method that returns the top items using a linear combination of the Prod2Vec-based and the CoCounts-based item pair similarities. The motivation of ensembles that combine embeddings and CF is to leverage the benefits of both in the cold-start and non-cold-start regimes:

$$Sim(i, j) = \alpha \cdot Sim_{P2V}(i, j) + (1 - \alpha) \cdot Sim_{CoCounts}(i, j) \quad (1)$$

- Mix(MetaProd2Vec, CoCounts): an ensemble method that returns the top items using a linear combination of the MetaProd2Vec-based and the CoCounts-based item pair similarities:

$$Sim(i, j) = \alpha \cdot Sim_{MP2V}(i, j) + (1 - \alpha) \cdot Sim_{CoCounts}(i, j) \quad (2)$$
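The two ensembles reduce to a convex combination of per-pair similarity scores. The sketch below reflects our reading of Equations 1 and 2, with the blending factor alpha tuned on validation data:

```python
def mixed_similarity(sim_embed, sim_cocounts, alpha):
    """Blend of the embedding-based similarity (Prod2Vec or
    MetaProd2Vec) with the CoCounts cosine similarity.  alpha = 1
    recovers the pure embedding method, alpha = 0 pure CoCounts."""
    return alpha * sim_embed + (1.0 - alpha) * sim_cocounts
```

Ranking items by this blended score gives the Mix(*) rows in the tables below.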
| Method | HR@10 | NDCG@10 | HR@20 | NDCG@20 |
| --- | --- | --- | --- | --- |
| BestOf | 0.0003 (0.0002;0.0003) | 0.001 (0.001;0.001) | 0.0003 (0.0002;0.0003) | 0.002 (0.002;0.002) |
| CoCounts | 0.0248 (0.0245;0.0251) | 0.122 (0.121;0.123) | 0.0160 (0.0158;0.0161) | 0.141 (0.139;0.142) |
| Prod2Vec | 0.0170 (0.0168;0.0171) | 0.105 (0.103;0.106) | 0.0101 (0.0100;0.0102) | 0.113 (0.112;0.115) |
| MetaProd2Vec | 0.0191 (0.0189;0.0194) | 0.110 (0.108;0.113) | 0.0124 (0.0123;0.0126) | 0.125 (0.123;0.126) |
| Mix(Prod2Vec,CoCounts) | 0.0273 (0.027;0.0276) | 0.140 (0.139;0.141) | 0.0158 (0.0157;0.0160) | 0.152 (0.151;0.153) |
| Mix(MetaProd2Vec,CoCounts) | 0.0292 (0.0288;0.0297) | 0.144 (0.142;0.145) | 0.0180 (0.0178;0.0182) | 0.161 (0.160;0.162) |

Table 1: Comparison of recommendation performance of MetaProd2Vec and competing models in terms of Hit Rate and NDCG.

| Method | Pair freq = 0 | Pair freq < 3 |
| --- | --- | --- |
| BestOf | 0.0002 | 0.0002 |
| CoCounts | 0.0000 | 0.0197 |
| Prod2Vec | 0.0003 | 0.0078 |
| MetaProd2Vec | 0.0013 | 0.0198 |
| Mix(Prod2Vec,CoCounts) | 0.0002 | 0.0200 |
| Mix(MetaProd2Vec,CoCounts) | 0.0007 | 0.0291 |

Table 2: Recommendation accuracy (HR@20) in the cold-start regime, as a function of training frequency of the pair (query item, next item).

| Side information | % lift HR@20 | % lift NDCG@20 |
| --- | --- | --- |
| MetaProd2Vec with only $L_{I \mid M}$ | 27% | 32% |
| MetaProd2Vec with only $L_{M \mid I}$ | 50% | 52% |
| MetaProd2Vec with only $L_{J \mid M}$ | 55% | 60% |
| MetaProd2Vec without $L_{M \mid M}$ | 61% | 65% |

Table 3: Proportion of the MetaProd2Vec lift over BestOf due to each type of side information.

We perform our evaluation on the publicly available 30Music dataset [?], a collection of listening and playlist data retrieved from Internet radio stations through the Last.fm API. On this dataset we evaluate the recommendation methods on the task of next-song prediction. For the MetaProd2Vec algorithm we make use of track metadata, namely the artist information. We run our experiments on a sample of 100k user sessions, with a resulting vocabulary of 433k songs and 67k artists.
We keep the embedding dimension fixed at 50, the window size at 3 and the side information regularization parameter at 1. We tune the blending factor alpha of Equations 1 and 2, which combines the embedding-based similarity score with the CoCounts-based cosine similarity score, on the validation set. In addition, we vary the number of training epochs and find the best values to be 30 for Prod2Vec and 10 for MetaProd2Vec. As shown in Table 1, MetaProd2Vec performs better than Prod2Vec both standalone and in the ensemble model (results computed at 90% confidence levels).
Most of the gains in performance come from the cold-start traffic: as shown in Figure 4 and Table 2, Standalone MetaProd2Vec outperforms all other methods by a large margin when the true pair (query item, next item) has zero co-occurrences in the training set, and Mix(MetaProd2Vec, CoCounts) performs best when the true pair has low observed counts. Interestingly, we observe that the difference in performance between the ensemble models is bigger than between the standalone MetaProd2Vec and Prod2Vec models. We explain this by the fact that Standalone MetaProd2Vec outperforms Standalone Prod2Vec on the cold-start traffic, thereby complementing CoCounts, which by itself performs very well on head traffic. Indeed, CoCounts memorizes frequent pairs of (query, target) products, while Standalone MetaProd2Vec helps generalize to unseen ones. These results mirror similar findings covered in [?] and motivate the recently introduced approach of Wide and Deep learning.
We posited the relevance of each type of side information; we now want to confirm experimentally that each of the four types brings additional information. We proceed by introducing each type of side information separately and comparing its performance with the original Prod2Vec baseline.
The only exception, for which we test the value by leaving it out of the full MetaProd2Vec model, is the $L_{M|M}$ type of side information; in this case, the input metadata explains the output metadata, a constraint which is not directly valuable in regularizing the product embeddings and needs to be introduced together with the other types of pairs that connect the products to the metadata. Its contribution to the model can therefore be computed as (1 - degraded model performance), which amounts to 39% on HR@20 and 35% on NDCG@20.
In Table 3, we report the proportion of the lift over the BestOf baseline obtained by using each type of side information separately. We observe that each of them alone recovers only part of the performance of the full MetaProd2Vec, confirming that the additional terms in our proposed model are all relevant.
In this paper, we introduced MetaProd2Vec, a new item embedding method that enhances the existing Prod2Vec method with item metadata at training time. This work makes a novel connection between recent embedding-based methods and established Matrix Factorization methods by introducing learning with side information in the context of embeddings. We analyzed the relative value of each type of side information separately and showed that each of the four types is informative. Finally, we have shown that MetaProd2Vec consistently outperforms Prod2Vec on recommendation tasks, both globally and in the cold-start regime, and that, when combined with a standard Collaborative Filtering approach, it outperforms all other tested methods. These results, together with the reduced implementation cost and the fact that our method does not affect the online recommendation architecture, make this solution attractive in cases where item embeddings are already in use. Future work will explore further ways of using item metadata as side information and support for non-categorical information such as images and continuous variables.
 [1] RecSys 2015: Making meaningful restaurant recommendations at OpenTable. http://tinyurl.com/zs9at2t. Accessed: 2016-04-08.
 [2] Venture Beat article. http://venturebeat.com/2006/12/10/aggregate-knowledge-raises-5m-from-kleiner-on-a-roll/. Accessed: 2016-04-08.
 [3] D. Agarwal and B.-C. Chen. Regression-based latent factor models. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '09, pages 19–28, New York, NY, USA, 2009. ACM.
 [4] Y. Bengio, H. Schwenk, J.-S. Senécal, F. Morin, and J.-L. Gauvain. Neural probabilistic language models. In Innovations in Machine Learning, pages 137–186. Springer, 2006.
 [5] B. Chandramouli, J. J. Levandoski, A. Eldawy, and M. F. Mokbel. StreamRec: a real-time recommender system. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, pages 1243–1246. ACM, 2011.
 [6] H.-T. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir, et al. Wide & deep learning for recommender systems. arXiv preprint arXiv:1606.07792, 2016.
 [7] N. Djuric, H. Wu, V. Radosavljevic, M. Grbovic, and N. Bhamidipati. Hierarchical neural language models for joint representation of streaming documents and their content. In Proceedings of the 24th International Conference on World Wide Web, pages 248–255. International World Wide Web Conferences Steering Committee, 2015.
 [8] Y. Fang and L. Si. Matrix co-factorization for recommendation with rich side information and implicit feedback. In Proceedings of the 2nd International Workshop on Information Heterogeneity and Fusion in Recommender Systems, HetRec '11, pages 65–69, New York, NY, USA, 2011. ACM.
 [9] M. Grbovic, V. Radosavljevic, N. Djuric, N. Bhamidipati, J. Savla, V. Bhagwan, and D. Sharp. Ecommerce in your inbox: Product recommendations at scale. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’15, pages 1809–1818, New York, NY, USA, 2015. ACM.
 [10] A. Gunawardana and C. Meek. A unified approach to building hybrid recommender systems. In Proceedings of the Third ACM Conference on Recommender Systems, RecSys ’09, pages 117–124, New York, NY, USA, 2009. ACM.
 [11] C. C. Johnson. Logistic matrix factorization for implicit feedback data. Distributed Machine Learning and Matrix Computations, 2014.
 [12] A. Karatzoglou, X. Amatriain, L. Baltrunas, and N. Oliver. Multiverse recommendation: Ndimensional tensor factorization for contextaware collaborative filtering. In Proceedings of the Fourth ACM Conference on Recommender Systems, RecSys ’10, pages 79–86, New York, NY, USA, 2010. ACM.
 [13] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, Aug. 2009.
 [14] M. Kula. Metadata embeddings for user and item cold-start recommendations. In Proceedings of the 2nd Workshop on New Trends on Content-Based Recommender Systems co-located with the 9th ACM Conference on Recommender Systems (RecSys 2015), Vienna, Austria, September 16-20, 2015, pages 14–21, 2015.
 [15] Q. V. Le and T. Mikolov. Distributed representations of sentences and documents. arXiv preprint arXiv:1405.4053, 2014.
 [16] O. Levy and Y. Goldberg. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems, pages 2177–2185, 2014.
 [17] Y. Li, J. Hu, C. Zhai, and Y. Chen. Improving one-class collaborative filtering by incorporating rich user information. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, CIKM '10, pages 959–968, New York, NY, USA, 2010. ACM.
 [18] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
 [19] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
 [20] A. Mnih and K. Kavukcuoglu. Learning word embeddings efficiently with noise-contrastive estimation. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2265–2273. Curran Associates, Inc., 2013.
 [21] X. Ning and G. Karypis. SLIM: sparse linear methods for top-N recommender systems. In 11th IEEE International Conference on Data Mining, ICDM 2011, Vancouver, BC, Canada, December 11-14, 2011, pages 497–506, 2011.
 [22] R. Pan, Y. Zhou, B. Cao, N. N. Liu, R. Lukose, M. Scholz, and Q. Yang. One-class collaborative filtering. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, ICDM '08, pages 502–511, Washington, DC, USA, 2008. IEEE Computer Society.
 [23] J. Pennington, R. Socher, and C. D. Manning. Glove: Global vectors for word representation. In EMNLP, volume 14, pages 1532–1543, 2014.
 [24] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI '09, pages 452–461, Arlington, Virginia, United States, 2009. AUAI Press.
 [25] S. Rendle, Z. Gantner, C. Freudenthaler, and L. Schmidt-Thieme. Fast context-aware recommendations with factorization machines. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '11, pages 635–644, New York, NY, USA, 2011. ACM.
 [26] P. Romov and E. Sokolov. Recsys challenge 2015: ensemble learning with categorical features. In Proceedings of the 2015 International ACM Recommender Systems Challenge, page 1. ACM, 2015.
 [27] M. Sahlgren. The distributional hypothesis. Italian Journal of Linguistics, 20(1):33–54, 2008.
 [28] R. Salakhutdinov, A. Mnih, and G. Hinton. Restricted boltzmann machines for collaborative filtering. In Proceedings of the 24th International Conference on Machine Learning, ICML ’07, pages 791–798, New York, NY, USA, 2007. ACM.
 [29] A. I. Schein, A. Popescul, L. H. Ungar, and D. M. Pennock. Methods and metrics for cold-start recommendations. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 253–260. ACM, 2002.
 [31] R. Turrin, M. Quadrana, A. Condorelli, R. Pagano, and P. Cremonesi. 30music listening and playlists dataset. In Poster Proceedings of the 9th ACM Conference on Recommender Systems, Sept. 2015.
 [32] A. Veit, B. Kovacs, S. Bell, J. McAuley, K. Bala, and S. Belongie. Learning visual clothing style with heterogeneous dyadic co-occurrences. In Proceedings of the IEEE International Conference on Computer Vision, pages 4642–4650, 2015.
 [33] S.H. Yang, B. Long, A. Smola, N. Sadagopan, Z. Zheng, and H. Zha. Like like alike: Joint friendship and interest propagation in social networks. In Proceedings of the 20th International Conference on World Wide Web, WWW ’11, pages 537–546, New York, NY, USA, 2011. ACM.
