NPE: Neural Personalized Embedding for Collaborative Filtering
Abstract
Matrix factorization is one of the most efficient approaches in recommender systems. However, such algorithms, which rely on the interactions between users and items, perform poorly for "cold users" (users with little history of such interactions) and at capturing the relationships between closely related items. To address these problems, we propose a neural personalized embedding (NPE) model, which improves the recommendation performance for cold users and can learn effective representations of items. It models a user's click on an item in two terms: the personal preference of the user for the item, and the relationships between this item and other items clicked by the user. We show that NPE outperforms competing methods for top-N recommendations, especially for cold-user recommendations. We also performed a qualitative analysis that shows the effectiveness of the representations learned by the model.
Thai-Binh Nguyen, Atsuhiro Takasu
SOKENDAI (The Graduate University for Advanced Studies), Japan
National Institute of Informatics, Japan
{binh,takasu}@nii.ac.jp
1 Introduction
In recent years, recommender systems have become a core component of online services. Given the “historical activities” of a particular user (e.g., product purchases, movie watching, and Web page views), a recommender system suggests other items that may be of interest to that user. Current domains for recommender systems include movie recommendation (Netflix and Hulu), product recommendation (Amazon), and application recommendation (Google Play and Apple Store).
The historical activities of users are often expressed in terms of a user-item preference matrix whose entries are either explicit feedback (e.g., ratings or like/dislike) or implicit feedback (e.g., clicks or purchases). Typically, only a small part of the potential user-item matrix is available, with the remaining entries not having been recorded. Predicting user preferences can be interpreted as filling in the missing entries of the user-item matrix. In this setting, matrix factorization (MF) is one of the most efficient approaches for finding the latent representations of users and items [?; ?; ?]. To address the sparseness of the user-item matrix, additional data are integrated into MF as "side information." This might include textual information for article recommendations [?; ?], product images in e-commerce [?], or music signals for song recommendations [?]. However, there are two major issues with these MF-based algorithms. First, these models are poor at modeling cold users (i.e., users who have only a short history of relevant activities). Second, because these models consider only user-item interactions, the item representations poorly capture the relationships among closely related items [?].
One approach to cold-user recommendation is to exploit user profiles. Models of this kind [?; ?] can learn user representations from profiles (e.g., gender and age). In this way, they can make recommendations to new users who have no historical activities, provided their profiles are available. However, user profiles are often very noisy, and in many cases they are simply not available. Another approach is item-similarity-based models [?; ?], which recommend items based on item-item similarity. The main issue with this approach is that it considers only the most recent click when making a recommendation, ignoring previous clicks. In addition, these models are not personalized.
For item representation learning, Item2Vec [?] is an efficient model that borrows the idea behind word-embedding techniques [?]. However, the main goal of Item2Vec is to learn item representations; it cannot be used directly to predict missing entries of a user-item matrix. Furthermore, in making recommendations, Item2Vec is not personalized: it recommends items based on inter-item similarities computed from the item representations, and ignores users' historical activities.
To address these problems, this paper proposes the neural personalized embedding (NPE) model, which fuses item relationships to learn effective item representations while also improving recommendation quality for cold users. NPE models a user's click on an item by assuming that two signals drive the click: the personal preference of the user for the item, and the relationships between this item and other items that the user has clicked.
To model the personal-preference term, we adopt the same approach as MF, which views the preference of a user for an item as the inner product of the corresponding factor vectors. To model the relationships among items, we propose an item-embedding model that generalizes the idea behind word-embedding techniques to click data. However, our item-embedding model differs from a word-embedding model in that the latter can only learn word representations, whereas our embedding model can learn item representations and fill in the user-item matrix simultaneously.
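As a concrete illustration of the preference term, the sketch below (illustrative names and toy sizes, not the authors' code) scores a user-item pair as the inner product of the corresponding factor vectors, exactly as in MF:

```python
import numpy as np

# Illustrative sketch (not the authors' code): the MF-style preference term
# scores user u's affinity for item i as the inner product of their latent
# factor vectors; sizes and names here are toy choices.
rng = np.random.default_rng(0)
n_users, n_items, d = 4, 6, 8
theta = rng.normal(size=(n_users, d))   # user factor vectors
alpha = rng.normal(size=(n_items, d))   # item factor vectors

def preference(u, i):
    """MF preference term: the inner product theta_u . alpha_i."""
    return float(theta[u] @ alpha[i])
```

NPE adds a second, item-relationship term on top of this score, as described in Section 3.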
2 Related Work
Matrix Factorization
MF [?; ?] is one of the most efficient ways to perform collaborative filtering. An MF-based algorithm associates each user with a latent feature vector of preferences and each item with a latent feature vector of attributes. Given users' prior ratings of items, MF learns the latent feature vectors of users and items and uses these vectors to predict the missing ratings. To address sparseness in the user-item matrix, additional data about items/users are also used [?; ?; ?].
Recently, the CoFactor [?] and CEMF [?] models have been proposed. These models integrate item embedding into the MF model: they simultaneously decompose the preference matrix and the SPPMI matrix (the item-item matrix constructed from co-click information) in a shared latent space. However, CoFactor and CEMF use co-click information to regularize the factorization of the user-item matrix, whereas NPE exploits co-click information to learn effective representations of items. In [?], the author uses co-click information to address the data-sparsity issue in rating prediction.
For cold-user recommendations, [?] and [?] proposed models that learn user representations from user profiles. In [?], user representations are learned from user profiles via a deep convolutional neural network for event recommendation, whereas in [?] they are learned by an autoencoder. Although these models are very useful for new-user recommendations, the main issue remains that user profiles are not always available. Furthermore, many user profiles may be very noisy (e.g., users may not want to publish their real gender, age, or location), which leads to inaccurate user representations.
Embedding Models
Word-embedding techniques [?; ?] have been applied successfully to many tasks in natural language processing. The goal of word embedding is to learn vector representations of words that capture their relationships with surrounding words. The assumption behind word-embedding techniques is that words occurring in the same context are similar. To capture such similarities, words are embedded into a low-dimensional continuous space.
If an item is viewed as a word, and the list of items clicked by a user as a context window, we can map word embedding to recommender systems. Item2Vec [?] was introduced as a neural network-based item-embedding model. However, Item2Vec is not able to predict missing entries of a user-item matrix directly. Furthermore, in its recommendations, Item2Vec relies only on the last item, ignoring the previous items that a user has clicked.
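Under this analogy, building training examples from a user's click list can be sketched as follows (an illustrative helper, not the paper's code): each clicked item becomes a target whose context is the user's remaining clicks.

```python
def item_context_pairs(user_clicks):
    """Yield (target item, context items) pairs from one user's click list,
    mirroring how word embeddings pair a word with its context window."""
    for pos, target in enumerate(user_clicks):
        context = user_clicks[:pos] + user_clicks[pos + 1:]
        if context:                      # users with one click yield no pair
            yield target, context

pairs = list(item_context_pairs(["a", "b", "c"]))
```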
Exponential Family Embeddings (EFE) [?] is a probabilistic embedding model that generalizes the spirit of word embedding to other kinds of data; it can be used to model clicks and to learn item representations. However, EFE does not support side information such as items' rich content. In addition, EFE is not personalized.
Itembased Models
In item-based collaborative filtering [?; ?], an item is recommended to a user based on the similarity between this item and the items that the user clicked in the past. In [?], an item-similarity matrix is constructed and used directly to calculate item similarities for recommendations. Previous work shows that the performance of this method is highly sensitive to the choice of similarity metric and data normalization [?].
SLIM [?] is a recent model that identifies item similarity by learning a sparse item-similarity matrix from the preference matrix. However, the disadvantage of SLIM is that it can only capture relationships between items that are co-clicked by at least one user, which limits the model when applied to extremely sparse datasets. Furthermore, SLIM can only predict the missing entries of the user-item matrix; it cannot be used to learn effective representations of items.
3 NPE: Neural Personalized Embedding
We propose NPE, a factor model that explains users’ clicks by capturing the preferences of users for items and the relationships between closely related items. We will describe the model and how to learn the model parameters.
3.1 Problem Formulation
Each entry $r_{ui}$ in the user-item preference matrix $\mathbf{R}$ has one of two values, 0 or 1, such that $r_{ui} = 1$ if user $u$ has clicked item $i$ and $r_{ui} = 0$ otherwise. We assume that $r_{ui} = 1$ indicates that user $u$ prefers item $i$, whereas $r_{ui} = 0$ indicates that this entry is non-observed (i.e., a missing entry).
Given a user $u$ and the set of items that $u$ previously interacted with, our goal is to predict a list of items that $u$ may find interesting (top-N recommendations).
The notations used in this paper are defined in Table 1.
Notation  Meaning

$N$, $M$  the number of users and items, respectively
$\mathbf{R}$  the user-item matrix (e.g., the click matrix)
$\mathbf{r}_u$  the observation data for user $u$ (i.e., the row of $\mathbf{R}$ corresponding to user $u$)
$d$  the dimensionality of the embedding space
$n$  the dimensionality of the user input vector
$m$  the dimensionality of the item input vector
$\mathbf{x}_u$  the input vector of user $u$, $\mathbf{x}_u \in \mathbb{R}^n$
$\mathbf{y}_i$  the input vector of item $i$, $\mathbf{y}_i \in \mathbb{R}^m$
$\mathbf{U}$  the user embedding matrix, $\mathbf{U} \in \mathbb{R}^{n \times d}$
$\mathbf{V}$  the item-embedding matrix, $\mathbf{V} \in \mathbb{R}^{m \times d}$
$\mathbf{W}$  the item context matrix, $\mathbf{W} \in \mathbb{R}^{m \times d}$
$\boldsymbol{\theta}_u$  the embedding vector of user $u$, $\boldsymbol{\theta}_u \in \mathbb{R}^d$
$\boldsymbol{\alpha}_i$  the embedding vector of item $i$, $\boldsymbol{\alpha}_i \in \mathbb{R}^d$
$\boldsymbol{\gamma}_i$  the context vector of item $i$, $\boldsymbol{\gamma}_i \in \mathbb{R}^d$
$\Theta$  the set of all model parameters
$\Omega(\Theta)$  the regularization term
$\mathcal{C}_u(i)$  the set of items that user $u$ clicked, excluding $i$ (the context items)
$S^+$  the set of positive examples, $S^+ = \{(u, i) : r_{ui} = 1\}$
$S^-$  the set of negative examples, obtained by sampling from the zero entries of matrix $\mathbf{R}$
3.2 Model Formulation
We denote the observations for user $u$ as:
$\mathbf{r}_u = (r_{u1}, r_{u2}, \ldots, r_{uM})$.   (1)
NPE models the probability of each observation $r_{ui}$, conditioned on user $u$ and its context items, as:
$r_{ui} \sim p(r_{ui} \mid u, \mathcal{C}_u(i))$.   (2)
This equation captures the intuition behind the model, namely that the conditional distribution of whether user $u$ clicks on item $i$ is governed by two factors: (1) the personal preference of user $u$ for item $i$, and (2) the set of items that $u$ has clicked (i.e., $\mathcal{C}_u(i)$).
The likelihood function for the entire matrix $\mathbf{R}$ is then formulated as:
$p(\mathbf{R} \mid \Theta) = \prod_{u=1}^{N} \prod_{i=1}^{M} p(r_{ui} \mid u, \mathcal{C}_u(i))$.   (3)
The conditional probability expressed in Eq. 2 is implemented by a neural network. This neural network connects the input vectors of user $u$, item $i$, and the context items to their hidden representations as:

$\boldsymbol{\theta}_u = f(\mathbf{U}^\top \mathbf{x}_u)$,   (4)
$\boldsymbol{\alpha}_i = f(\mathbf{V}^\top \mathbf{y}_i)$,   (5)
$\boldsymbol{\gamma}_i = f(\mathbf{W}^\top \mathbf{y}_i)$,   (6)

where $f(\cdot)$ is an activation function such as ReLU.
Note that there are two hidden representations associated with item $i$: the embedding vector $\boldsymbol{\alpha}_i$ and the context vector $\boldsymbol{\gamma}_i$, which have different roles. Whereas $\boldsymbol{\alpha}_i$ accounts for the attributes of item $i$, $\boldsymbol{\gamma}_i$ accounts for how $i$ behaves when it appears in the context of other items.
We can then define the conditional probability in Eq. 2 via the hidden representations as:
$p(r_{ui} = 1 \mid u, \mathcal{C}_u(i)) = \sigma\Big(\boldsymbol{\theta}_u^\top \boldsymbol{\alpha}_i + \boldsymbol{\alpha}_i^\top \sum_{j \in \mathcal{C}_u(i)} \boldsymbol{\gamma}_j\Big),   (7)
where $\sigma(x) = 1/(1 + e^{-x})$ is the sigmoid function.
Note that the function on the right side of Eq. 7 comprises two terms: the first term, $\boldsymbol{\theta}_u^\top \boldsymbol{\alpha}_i$, accounts for how much user $u$ prefers item $i$, whereas the second term, $\boldsymbol{\alpha}_i^\top \sum_{j \in \mathcal{C}_u(i)} \boldsymbol{\gamma}_j$, accounts for the compatibility between item $i$ and the items that $u$ has already clicked.
From Eq. 7, we can also obtain the probability that $r_{ui} = 0$ as:
$p(r_{ui} = 0 \mid u, \mathcal{C}_u(i)) = \sigma\Big(-\boldsymbol{\theta}_u^\top \boldsymbol{\alpha}_i - \boldsymbol{\alpha}_i^\top \sum_{j \in \mathcal{C}_u(i)} \boldsymbol{\gamma}_j\Big)$.   (8)
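The forward computation above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: with one-hot inputs, each matrix-vector product reduces to selecting a row, and the toy sizes, ReLU choice, and random weights are our assumptions.

```python
import numpy as np

# Minimal NumPy sketch of the hidden representations and click probability.
# With one-hot inputs, U^T x_u reduces to selecting row u of U, so the
# matrices are indexed directly. Toy sizes and random weights throughout.
rng = np.random.default_rng(1)
n_users, n_items, d = 3, 5, 4
U = rng.normal(scale=0.1, size=(n_users, d))   # rows give theta_u before f
V = rng.normal(scale=0.1, size=(n_items, d))   # rows give alpha_i before f
W = rng.normal(scale=0.1, size=(n_items, d))   # rows give gamma_i before f

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_click(u, i, context_items):
    """p(r_ui = 1 | u, C_u(i)): preference term plus item-relationship term."""
    theta_u = relu(U[u])                      # user embedding
    alpha_i = relu(V[i])                      # item embedding
    gammas = relu(W[list(context_items)])     # one context vector per context item
    score = theta_u @ alpha_i + alpha_i @ gammas.sum(axis=0)
    return float(sigmoid(score))              # probability of a click
```

The probability of a non-click is then simply one minus this value.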
3.3 The Model Architecture
The architecture of NPE is shown in Fig. 1 as a multi-layer neural network. The first layer is the input layer, which specifies the input vectors of (1) a user $u$, (2) a candidate item $i$, and (3) the context items. Above this is the second layer (the embedding layer), which connects to the input layer via the connection matrices $\mathbf{U}$, $\mathbf{V}$, and $\mathbf{W}$. Above the embedding layer, two terms are calculated: the personal preference of user $u$ for item $i$, and the relationship between $i$ and the context items. Finally, the model combines these two terms to compute the output, which is the probability that $u$ will click $i$.
Note that the input layer accepts a wide range of vectors that describe users and items, such as one-hot vectors or content feature vectors obtained from side information. With such generic input vectors, our method can address the cold-start problem by using content feature vectors as the input vectors for users and items. Because this work focuses on the pure collaborative filtering setting, we use only the identities of users and items, in the form of one-hot vectors, as input vectors. Investigating the effectiveness of using content feature vectors is left for future work.
3.4 Objective Function
Given an observed matrix $\mathbf{R}$, our goal is to learn the model parameters $\Theta$ that maximize the likelihood function in Eq. 3. However, instead of modeling all zero entries, we model only a small subset of such entries, picked randomly (negative sampling). This gives:
$\mathcal{L}(\Theta) = \prod_{(u,i) \in S^+} p(r_{ui} = 1 \mid u, \mathcal{C}_u(i)) \prod_{(u,i) \in S^-} p(r_{ui} = 0 \mid u, \mathcal{C}_u(i))$.   (10)
Maximizing the likelihood in Eq. 10 is equivalent to minimizing the following loss function (its negative logarithm):
$\ell(\Theta) = -\sum_{(u,i) \in S^+ \cup S^-} \big[ r_{ui} \log \hat{r}_{ui} + (1 - r_{ui}) \log(1 - \hat{r}_{ui}) \big] + \lambda\,\Omega(\Theta)$,   (11)
where $\hat{r}_{ui} = p(r_{ui} = 1 \mid u, \mathcal{C}_u(i))$.
This loss function is known as the binary cross-entropy.
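A minimal sketch of the binary cross-entropy (without the regularization term), over predicted probabilities and 0/1 labels drawn from the positive and negative example sets; the epsilon clip is our numerical-safety choice, not part of the paper's formulation:

```python
import numpy as np

# Binary cross-entropy over a batch of predictions and 0/1 labels.
# Probabilities are clipped away from 0 and 1 to keep the logs finite.
def binary_cross_entropy(p_hat, r):
    p_hat = np.clip(np.asarray(p_hat, dtype=float), 1e-12, 1 - 1e-12)
    r = np.asarray(r, dtype=float)
    return float(-np.mean(r * np.log(p_hat) + (1 - r) * np.log(1 - p_hat)))

loss = binary_cross_entropy([0.9, 0.2, 0.8], [1, 0, 1])
```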
3.5 Model Training
We adopt the Adam optimizer [?], an adaptive variant of mini-batch stochastic gradient descent. We do not perform negative sampling in advance, which would produce a fixed set of negative samples. Instead, we perform negative sampling at each epoch, which enables diverse sets of negative examples to be used. The algorithm is summarized in Algorithm 1.
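The per-epoch resampling can be sketched as follows (an illustrative helper, not the paper's Algorithm 1): every epoch draws a fresh set of (user, item) pairs from the zero entries, so the model sees diverse negative examples over training.

```python
import random

# Draw a fresh negative set each epoch instead of fixing one in advance.
def sample_negatives(positives, n_users, n_items, ratio, rng):
    """Draw ratio * |S+| (user, item) pairs that are not in the positive set."""
    positive_set = set(positives)
    negatives = set()
    while len(negatives) < ratio * len(positive_set):
        pair = (rng.randrange(n_users), rng.randrange(n_items))
        if pair not in positive_set:
            negatives.add(pair)
    return list(negatives)

rng = random.Random(0)
positives = [(0, 1), (1, 2), (2, 0)]
epoch_negatives = [sample_negatives(positives, 4, 5, 2, rng)
                   for _ in range(3)]        # one fresh draw per epoch
```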
3.6 Connections with Previous Models
NPE vs. MF
From the conditional probability in Eq. 7, we can see that the prediction is a combination of two terms: (1) user preference and (2) item relationships. If the second term is removed, NPE reduces to a standard MF model.
NPE vs. Word Embedding
Similarly, if we remove the first term of Eq. 7, NPE models only the relationships among items. If we view each item as a word, and the set of items that a user clicked as a sentence, the model becomes similar to a word-embedding model. However, our embedding model differs in that word-embedding techniques can only learn word (item) representations and cannot fill in the user-item matrix directly. In contrast, our embedding model can learn effective item representations while predicting the missing entries of the user-item matrix.
4 Empirical Study
We studied the effectiveness of NPE both quantitatively and qualitatively. In our quantitative analysis, we compared NPE with state-of-the-art methods on the top-N recommendation task, using real-world datasets. We also performed a qualitative analysis to show the effectiveness of the learned item representations.
4.1 Datasets
We used three real-world datasets from different domains, with sizes varying from small to large-scale. First, Movielens 10M (ML10m) is a dataset of user-movie ratings collected from MovieLens, an online film service. Next, Online Retail [?] is a dataset of online retail transactions, containing all transactions from Dec 1, 2010 to Dec 9, 2011 for an online retailer. Finally, TasteProfile is a dataset of counts of song plays by users, as collected by Echo Nest (http://the.echonest.com/).
ML10m  OnlineRetail  TasteProfile  

#users  58,059  3,705  211,830 
#items  8,484  3,644  22,781 
#clicks  3,502,733  235,472  10,054,204 
% clicks  0.71%  1.74%  0.21% 
4.2 Experiment Setup
Data Preparation
For ML10m, we binarized the ratings by thresholding at 4 or above; for TasteProfile and OnlineRetail, we binarized the data and interpreted them as implicit feedback. Statistical information about the datasets is given in Table 2.
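The binarization step can be sketched as follows, on made-up rating triples: explicit ratings become implicit clicks by thresholding at 4 or above.

```python
# Map explicit ratings to implicit clicks by thresholding (illustrative data).
def binarize(ratings, threshold=4):
    """Map (user, item, rating) triples to (user, item) click pairs."""
    return [(u, i) for u, i, r in ratings if r >= threshold]

clicks = binarize([("u1", "m1", 5), ("u1", "m2", 3), ("u2", "m1", 4)])
```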
We partitioned the data into three subsets, using 70% of the data as the training set, 10% as the validation set, and the remaining 20% as the test set (ground truth).
Evaluation Metrics
After training the models on the training set, we evaluated the accuracy of their top-N recommendations using the test set. We used the rank-based metrics Recall@N and nDCG@N, which are common metrics in information retrieval, for evaluating the accuracy of the top-N recommendations. (We did not use precision because it is difficult to evaluate, given that a zero entry can imply either that the user does not like the item or that the user does not know about the item.)
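For reference, the two metrics can be sketched as below. These are standard textbook definitions with binary relevance; the paper's exact variants may differ slightly.

```python
import math

# Recall@N: fraction of held-out items recovered in the top N.
def recall_at_n(ranked, relevant, n):
    hits = len(set(ranked[:n]) & set(relevant))
    return hits / min(len(relevant), n)

# nDCG@N: DCG of the top N, normalized by the ideal DCG.
def ndcg_at_n(ranked, relevant, n):
    dcg = sum(1.0 / math.log2(rank + 2)       # rank 0 contributes 1/log2(2)
              for rank, item in enumerate(ranked[:n]) if item in relevant)
    ideal = sum(1.0 / math.log2(rank + 2)
                for rank in range(min(len(relevant), n)))
    return dcg / ideal

ranked = ["a", "b", "c", "d"]
relevant = {"a", "c"}
```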
Competing Methods
We compared NPE with the following competing methods:

Bayesian personalized ranking (BPR) [?]: an algorithm that optimizes the MF model with a pairwise ranking loss

Neural collaborative filtering (NeuCF) [?]: a generalization of MF in which the inner product of the user and item feature vectors is replaced by a deep neural network

Sparse linear model (SLIM) [?]: a state-of-the-art method for top-N recommendations, which is based on the similarities between items.
4.3 Implementation Details
Because neural networks are prone to overfitting, we apply dropout after the hidden representation layer; the dropout rate is tuned for each dataset. We use early stopping to terminate training if the loss on the validation set does not decrease for five epochs. The weights of the matrices $\mathbf{U}$, $\mathbf{V}$, and $\mathbf{W}$ are initialized from normal distributions. The size of each mini-batch is 10,000.
4.4 Experimental Results
TopN Recommendations
Table 3 summarizes the Recall@20 and nDCG@20 for each model. Note that NPE significantly outperforms the competing methods across all datasets for both Recall and nDCG. We emphasize that all methods used the same data; NPE benefits from capturing the compatibility between each item and the other items picked by the same users.
Methods  ML10m  OnlineRetail  TasteProfile  

Re@20  nDCG@20  Re@20  nDCG@20  Re@20  nDCG@20  
SLIM  0.1342  0.1289  0.2085  0.1015  0.1513  0.1422 
BPR  0.1314  0.1253  0.2137  0.0943  0.1598  0.1398 
NeuCF  0.1388  0.1337  0.2199  0.0911  0.1609  0.1471 
NPE (ours)  0.1497  0.1449  0.2296  0.1742  0.1788  0.1594 
In Table 4, we summarize the Recall@N values for the four methods when different numbers of items are recommended. From these results, we can see that NPE consistently outperforms the other methods in all settings. The differences between NPE and the other methods are more pronounced for small numbers of recommended items. This is a desirable property because we often consider only a small number of top items (e.g., the top 5 or top 10).
Methods  ML10m  OnlineRetail  TasteProfile  

Re@5  Re@10  Re@20  Re@5  Re@10  Re@20  Re@5  Re@10  Re@20  
SLIM  0.1284  0.1298  0.1342  0.0952  0.1311  0.2085  0.1295  0.1304  0.1513 
BPR  0.1254  0.1261  0.1314  0.0859  0.1222  0.2137  0.1307  0.1311  0.1598 
NeuCF  0.1347  0.1363  0.1388  0.0871  0.1274  0.2199  0.1342  0.1356  0.1609 
NPE (ours)  0.1451  0.1487  0.1497  0.1392  0.1667  0.2296  0.1428  0.1523  0.1788 
The Performance on ColdUsers
We studied the performance of the models for users who had few historical activities. To this end, we partitioned the test cases into three groups according to the number of clicks each user had: the Low group comprised the users with the fewest clicks, the Medium group those with an intermediate number of clicks, and the High group those with the most clicks.
Fig. 2 shows the breakdown of Recall@20 by user activity in the training set for ML10m and OnlineRetail. Although the details vary across datasets, NPE outperformed the other methods for all three groups of users. The differences between NPE and the other methods are much more pronounced for the users with the fewest clicks. This is to be expected because, for such users, NPE can still exploit item relationships when making recommendations.
Effectiveness of the Item Representations
We evaluated the effectiveness of item representations by investigating how well the representations capture the item similarity and items that are often purchased together.
Similar items: The similarity between two items is measured by the cosine similarity between their embedding vectors. Fig. 3 shows three examples of the top-5 most similar items to a given item in the OnlineRetail dataset. We can see that the items' embedding vectors effectively capture item similarity. For example, in the first row, given a red alarm clock, four of its top-5 similar items are also alarm clocks.
Items that are often purchased together: NPE can also identify items that are often purchased together. To assess whether two items $i$ and $j$ are often purchased together, we calculate the inner product of one item's embedding vector $\boldsymbol{\alpha}_i$ and the other's context vector $\boldsymbol{\gamma}_j$. A high value of this inner product indicates that the two items are often purchased together. Fig. 4 shows an example of items that tend to be purchased together with a given item. Here, we see that buying a knitting Nancy, a child's toy, might accompany the purchase of other goods for children or for a household.
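Both qualitative queries can be sketched over learned vectors as follows (random toy embeddings stand in for trained ones): item-item similarity uses cosine over the embedding vectors, while the "often purchased together" score uses the embedding-context inner product described above.

```python
import numpy as np

# Toy stand-ins for trained vectors; in practice these come from the model.
rng = np.random.default_rng(2)
n_items, d = 6, 4
alpha = rng.normal(size=(n_items, d))   # item embedding vectors
gamma = rng.normal(size=(n_items, d))   # item context vectors

def most_similar(i, k=3):
    """Top-k items by cosine similarity of embedding vectors."""
    sims = alpha @ alpha[i] / (np.linalg.norm(alpha, axis=1)
                               * np.linalg.norm(alpha[i]))
    sims[i] = -np.inf                   # exclude the query item itself
    return np.argsort(-sims)[:k].tolist()

def bought_together(i, k=3):
    """Top-k items j ranked by the score alpha_i . gamma_j."""
    scores = gamma @ alpha[i]
    scores[i] = -np.inf
    return np.argsort(-scores)[:k].tolist()
```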
Sensitivity Analysis
We also studied the effect of the hyperparameters on the models’ performance.
Impact of the embedding size: To evaluate the effect of the dimensionality of the embedding space on the top-N recommendations, we varied the embedding size $d$ while fixing the other hyperparameters. Table 5 summarizes the Recall@20 of NPE on the three datasets for embedding sizes $d \in \{8, 16, 32, 64, 128, 256\}$. We can see that larger embedding sizes tend to improve the performance of the models. The optimal embedding size is 64 for ML10m and OnlineRetail, and 128 for TasteProfile.
ML10m  OnlineRetail  TasteProfile  

Re@20  Re@20  Re@20  
8  0.1428  0.1187  0.0987 
16  0.1451  0.1596  0.1142 
32  0.1441  0.1950  0.1509 
64  0.1497  0.2296  0.1788 
128  0.1482  0.2284  0.1992 
256  0.1459  0.2248  0.1985 
Impact of the negative sampling ratio: During the training of NPE, we sampled $k$ negative examples for each positive example. We studied the effect of the negative sampling ratio on the performance of NPE by fixing the embedding size and evaluating Recall@20 for $k \in \{1, 2, 4, 5, 8, 12, 16, 20\}$. From Table 6, we note that as $k$ increases, the performance also increases, up to a certain value of $k$. The optimal negative sampling ratio is 5 for OnlineRetail and 8 for ML10m and TasteProfile. This is reasonable because ML10m and TasteProfile, being larger than OnlineRetail, need more negative examples.
ML10m  OnlineRetail  TasteProfile  

Re@20  Re@20  Re@20  
1  0.1392  0.1608  0.1243 
2  0.1418  0.1795  0.1451 
4  0.1441  0.1950  0.1509 
5  0.1478  0.1952  0.1585 
8  0.1563  0.1941  0.1621 
12  0.1531  0.1937  0.1615 
16  0.1524  0.1925  0.1603 
20  0.1496  0.1908  0.1598 
5 Conclusions and Future Work
We have proposed NPE, a neural personalized embedding model for collaborative filtering that is effective both at making recommendations to cold users and at learning item representations. Our experiments show that NPE outperforms competing methods with respect to top-N recommendations in general, and for cold users in particular. Our qualitative analysis also demonstrated that the learned item representations can effectively capture different kinds of relationships between items.
One future direction is to study the effectiveness of the model when available side information about items is used. We also aim to investigate different negative sampling methods for dealing with the zero values in the user-item matrix.
References
 [Barkan and Koenigstein, 2016] Oren Barkan and Noam Koenigstein. Item2vec: Neural item embedding for collaborative filtering. In RecSys Posters, volume 1688 of CEUR Workshop Proceedings, 2016.
 [Chen et al., 2012] Daqing Chen, Sai Laing Sain, and Kun Guo. Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining. Journal of Database Marketing & Customer Strategy Management, 19(3):197–208, Sep 2012.
 [He and McAuley, 2016] Ruining He and Julian McAuley. VBPR: Visual Bayesian personalized ranking from implicit feedback. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI’16, pages 144–150. AAAI Press, 2016.
 [He et al., 2017] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and TatSeng Chua. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, WWW ’17, pages 173–182, 2017.
 [Herlocker et al., 2002] Jon Herlocker, Joseph A. Konstan, and John Riedl. An empirical analysis of design choices in neighborhoodbased collaborative filtering algorithms. Inf. Retr., 5(4):287–310, October 2002.
 [Hu et al., 2008] Yifan Hu, Yehuda Koren, and Chris Volinsky. Collaborative filtering for implicit feedback datasets. In Proceedings of the Eighth IEEE International Conference on Data Mining, ICDM ’08, pages 263–272. IEEE, 2008.
 [Kingma and Ba, 2014] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
 [Koren, 2008] Yehuda Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In KDD, pages 426–434. ACM, 2008.
 [Koren, 2010] Yehuda Koren. Collaborative filtering with temporal dynamics. Commun. ACM, 53(4):89–97, 2010.
 [Li et al., 2015a] Shaohua Li, Jun Zhu, and Chunyan Miao. A generative word embedding model and its low rank positive semidefinite solution. In EMNLP, pages 1599–1609. The Association for Computational Linguistics, 2015.
 [Li et al., 2015b] Sheng Li, Jaya Kawale, and Yun Fu. Deep collaborative filtering via marginalized denoising autoencoder. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, CIKM ’15, pages 811–820, 2015.
 [Liang et al., 2016] Dawen Liang, Jaan Altosaar, Laurent Charlin, and David M. Blei. Factorization meets the item embedding: Regularizing matrix factorization with item cooccurrence. In RecSys, pages 59–66, 2016.
 [Linden et al., 2003] Greg Linden, Brent Smith, and Jeremy York. Amazon.com recommendations: Itemtoitem collaborative filtering. IEEE Internet Computing, 7(1):76–80, January 2003.
 [Mikolov et al., 2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. In NIPS, pages 3111–3119, 2013.
 [Nguyen and Takasu, 2017] Thai-Binh Nguyen and Atsuhiro Takasu. A probabilistic model for the cold-start problem in rating prediction using click data. In Neural Information Processing, ICONIP ’17, pages 196–205, 2017.
 [Nguyen et al., 2017] Thai-Binh Nguyen, Kenro Aihara, and Atsuhiro Takasu. Collaborative item embedding model for implicit feedback data. In International Conference on Web Engineering, ICWE ’17, pages 336–348, 2017.
 [Ning and Karypis, 2011] Xia Ning and George Karypis. SLIM: Sparse linear methods for top-N recommender systems. In Proceedings of the 2011 IEEE 11th International Conference on Data Mining, ICDM ’11, pages 497–506. IEEE Computer Society, 2011.
 [Ning and Karypis, 2012] Xia Ning and George Karypis. Sparse linear methods with side information for top-N recommendations. In Proceedings of the Sixth ACM Conference on Recommender Systems, RecSys ’12, pages 155–162. ACM, 2012.
 [Oord et al., 2013] Aäron van den Oord, Sander Dieleman, and Benjamin Schrauwen. Deep contentbased music recommendation. In Proceedings of the 26th International Conference on Neural Information Processing Systems  Volume 2, NIPS’13, pages 2643–2651. Curran Associates Inc., 2013.
 [Rendle et al., 2009] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, 2009.
 [Rudolph et al., 2016] Maja Rudolph, Francisco Ruiz, Stephan Mandt, and David Blei. Exponential family embeddings. In Advances in Neural Information Processing Systems 29, pages 478–486. 2016.
 [Salakhutdinov and Mnih, 2008] Ruslan Salakhutdinov and Andriy Mnih. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems, volume 20, 2008.
 [Sarwar et al., 2001] Badrul M. Sarwar, George Karypis, Joseph A. Konstan, and John Reidl. Itembased collaborative filtering recommendation algorithms. In World Wide Web, pages 285–295, 2001.
 [Tang and Liu, 2017] L. Tang and E. Y. Liu. Joint userentity representation learning for event recommendation in social network. In 2017 IEEE 33rd International Conference on Data Engineering (ICDE), pages 271–280, April 2017.
 [Wang and Blei, 2011] Chong Wang and David M. Blei. Collaborative topic modeling for recommending scientific articles. In KDD, pages 448–456, 2011.
 [Wang et al., 2015] Hao Wang, Naiyan Wang, and DitYan Yeung. Collaborative deep learning for recommender systems. In KDD, pages 1235–1244, 2015.