Collaborative Item Embedding Model for Implicit Feedback Data
Abstract
Collaborative filtering is the most popular approach for recommender systems. One way to perform collaborative filtering is matrix factorization, which characterizes user preferences and item attributes using latent vectors. These latent vectors are good at capturing global features of users and items but are not strong in capturing local relationships between users or between items. In this work, we propose a method to extract the relationships between items and embed them into the latent vectors of the factorization model. This combines two worlds: matrix factorization for collaborative filtering and item embedding, a concept similar to word embedding in language processing. Our experiments on three real-world datasets show that our proposed method outperforms competing methods on top-$n$ recommendation tasks.
Keywords:
Recommender system, collaborative filtering, matrix factorization, item embedding
1 Introduction
Modern recommender systems (RSs) are a core component of many online services. An RS analyzes users’ behavior and provides them with personalized recommendations for products or services that meet their needs. For example, Amazon recommends products to users based on their shopping histories; an online newspaper recommends articles to users based on what they have read.
Generally, RSs can be classified into two categories: content-based approaches and collaborative filtering-based (CF-based) approaches. The content-based approach creates a description for each item and builds a profile for each user’s preferences. In other words, the content-based approach recommends items that are similar to items for which the user has expressed interest in the past. In contrast, the CF-based approach relies on the past behavior of each user, without requiring any information about the items that the users have consumed. An advantage of the CF-based approach is that it does not require collection of item contents or analysis. In this work, we focus on the CF-based approach.
The input to CF-based methods is the user–item interaction matrix, in which each entry is the feedback of a user on an item. The feedback can be explicit (e.g., rating scores/stars, like/dislike) or implicit (e.g., click, view, purchase). Early work mainly focused on explicit feedback data, with models such as SVD++ [1], timeSVD [2], or probabilistic matrix factorization [3]. One advantage of explicit feedback is that it is easy to interpret because it directly expresses the preferences of users for items. However, explicit feedback is not always available and is extremely scarce, as few users provide explicit feedback.
Implicit feedback, in contrast, is generated in abundance while users interact with the system. However, interpreting the implicit feedback is difficult, because it does not directly express users’ opinions about items. For example, a user’s click on an item does not mean that he or she likes it; rather, the user may click and then find that he or she does not like the item. On the other hand, even though a user does not interact with an item, this does not imply that the user dislikes it; it may be because the user does not know that the item exists.
Hu et al. proposed the weighted matrix factorization (WMF) [4], a special case of the matrix factorization technique targeted to implicit datasets. The model maps each user and item into a low-dimensional vector in a shared latent space, which encodes all the information that describes the user’s preference or the item’s characteristics. Locations of users and items in the space show their relationships. If two items are close together in the space, they are considered to be similar. On the other hand, if a user and an item are close in the space, that user is considered to like that item.
Detecting the relationships between items is crucial to the performance of the RS. We consider two kinds of relationships, the global relationship and a local one. The former indicates the global structure that relates simultaneously to most or all items, and is captured from the overall information encompassed in all user–item interactions. The latter, in contrast, indicates the relationships between a small set of closely related items [1, 5]. Detecting the local relationship will benefit the RS in recommending correlated items. One example of correlated items in the movie domain is the three volumes of the film series “Lord of the Rings.” Usually, a user who watches one of them will watch the others. The detection of local relationships gives the system the ability to capture such correlations and recommend one of these volumes when it knows that the user has watched the others. However, while WMF as well as other MF-based algorithms are strong at capturing the global relationships, they are poor at detecting the local relationships [1, 5].
In this work, we propose a model that can capture both global and local relationships between items. The idea is to extract the relationships between items that frequently occur in the context of each other, and embed these relationships into the factorization model of WMF [4, 6]. The “context” can be the items in a user’s interaction list (i.e., the items that the user interacts with), or the items in a transaction. Two items are assumed to be similar if they often appear in a context with each other, and their representations should be located close to each other in the space. The proposed model identifies such relationships and reflects them into WMF. This was inspired by word-embedding techniques in natural language processing that represent words by vectors that can capture the relationships between each word and its surrounding words [7, 8, 9, 10].
In detail, we build an item–item matrix containing the context information and embed information from this matrix into the factorization model. The embedding is performed by factorizing the user–item matrix and the item–item matrix simultaneously. In the model, the role of the item–item matrix factorization is to adjust the latent vectors of items to reflect item–item relationships.
The rest of this paper is organized as follows. In Sect. 2, we present the background knowledge related to this work. Section 3 presents the details of our idea and how we add item embedding to the original factorization model. In Sect. 4, we explain our empirical study and the experimental results. After reviewing some related work in Sect. 5, we discuss the results of this work and show some directions for future work in Sect. 6.
2 Preliminary
2.1 Weighted Matrix Factorization
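Suppose we have $N$ users and $M$ items. For each user $u$ and item $i$, we denote by $r_{ui}$ the number of times user $u$ has interacted with item $i$. We assume that user $u$ likes item $i$ if he or she has interacted with item $i$ at least once. For user $u$ and item $i$, we define a reference value $p_{ui}$ indicating whether user $u$ likes item $i$ (i.e., $p_{ui} = 1$ if $r_{ui} > 0$ and $p_{ui} = 0$ otherwise), and a confidence level $c_{ui}$ to represent how confident we are about the value of $p_{ui}$. Following [4], we define $c_{ui}$ as:
$$c_{ui} = 1 + \alpha r_{ui}, \tag{1}$$
where $\alpha$ is a positive number.
Weighted matrix factorization (WMF) [4, 6] is a factorization model that learns the latent representations of all users and items in the dataset. The objective function of the model is:
$$\mathcal{L}(X, Y) = \sum_{u,i} c_{ui}\left(p_{ui} - \mathbf{x}_u^\top \mathbf{y}_i\right)^2 + \lambda\left(\sum_{u} \|\mathbf{x}_u\|_F^2 + \sum_{i} \|\mathbf{y}_i\|_F^2\right), \tag{2}$$
where $X \in \mathbb{R}^{d \times N}$ and $Y \in \mathbb{R}^{d \times M}$ are matrices with columns $\mathbf{x}_u$ and $\mathbf{y}_i$ that are the latent vectors of users and items, respectively; $d$ is the number of latent factors; $\lambda$ is the regularization parameter; and $\|\cdot\|_F$ is the Frobenius norm of a vector. This optimization problem can be efficiently solved using the alternating least squares (ALS) method as described in [4].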
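As a rough illustration of this objective, the sketch below evaluates the WMF loss on a small dense count matrix. It assumes the confidence form $c_{ui} = 1 + \alpha r_{ui}$ from [4]; the function name `wmf_objective`, the toy matrix, and the values of `alpha` and `lam` are hypothetical and chosen only for the example.

```python
import numpy as np

def wmf_objective(R, X, Y, alpha=1.0, lam=0.1):
    """Weighted squared error plus L2 penalty for a dense count matrix.

    R: (N, M) interaction counts; X: (d, N) user vectors; Y: (d, M) item vectors.
    """
    P = (R > 0).astype(float)      # preference: 1 if the user interacted at all
    C = 1.0 + alpha * R            # confidence (assumed form, following [4])
    err = P - X.T @ Y              # (N, M) residuals p_ui - x_u^T y_i
    penalty = lam * (np.sum(X ** 2) + np.sum(Y ** 2))
    return float(np.sum(C * err ** 2) + penalty)

# Toy data: 2 users, 3 items, 2 latent factors (all values are illustrative).
R = np.array([[3, 0, 1],
              [0, 2, 0]])
X = np.zeros((2, 2))
Y = np.zeros((2, 3))
loss_at_zero = wmf_objective(R, X, Y)
```

With all-zero factors, only the observed entries contribute, each with its own confidence weight, which is why frequently repeated interactions dominate the loss.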
2.2 Word Embedding
Word embedding models [11, 7, 8, 10] have gained success in many natural language processing tasks. Their goal is to find vector representations of words that can capture their relationship with their context words (i.e., the surrounding words in a sentence or paragraph).
Given a corpus and a word $w$, a context word $c$ of $w$ is a word that occurs within a specific-size window around $w$ (the context window) in the corpus. Let $\mathcal{D}$ denote the set of all word–context pairs, i.e., $\mathcal{D} = \{(w, c) \mid w \in V_W, c \in V_C\}$, where $V_W$ and $V_C$ are the set of words and set of context words, respectively. Word embedding models represent a word $w \in V_W$ and a context word $c \in V_C$ by vectors $\mathbf{w} \in \mathbb{R}^d$ and $\mathbf{c} \in \mathbb{R}^d$, respectively, where $d$ is the embedding’s dimensionality.
Mikolov et al. proposed an efficient model for learning word vectors [7], which is performed by maximizing the log-likelihood function for every word–context pair $(w, c) \in \mathcal{D}$:
$$\log \sigma(\mathbf{w}^\top \mathbf{c}) + k \cdot \mathbb{E}_{c_N \sim P_{\mathcal{D}}}\left[\log \sigma(-\mathbf{w}^\top \mathbf{c}_N)\right], \tag{3}$$
where $\sigma$ is the sigmoid function: $\sigma(x) = 1/(1 + e^{-x})$, $P_{\mathcal{D}}$ is a distribution for sampling false context words (hence, negative sampling) and $k$ is a hyperparameter specifying the number of negative samples. This model is called skip-gram negative sampling (SGNS) [7]. Based on this model, Mikolov et al. released a well-known open-source package named word2vec (https://code.google.com/archive/p/word2vec/).
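Levy et al. [9] showed that the optimal solutions of Eq. (3) satisfy:
$$\mathbf{w}^\top \mathbf{c} = \mathrm{PMI}(w, c) - \log k, \tag{4}$$
where $\mathrm{PMI}(w, c)$ is the pointwise mutual information between word $w$ and context word $c$. The symbol $k$, again, is the number of negative samples.
The PMI [12] of a word–context pair $(w, c)$ is a measure that quantifies the association between a word $w$ and a context word $c$. It is defined as:
$$\mathrm{PMI}(w, c) = \log \frac{P(w, c)}{P(w)\,P(c)}, \tag{5}$$
where $P(w, c)$ is the probability that $c$ appears in the context of $w$; $P(w)$ and $P(c)$ are the probabilities that word $w$ and context word $c$ appear in the corpus, respectively. Empirically, PMI can be estimated using the actual number of observations in a corpus:
$$\mathrm{PMI}(w, c) = \log \frac{\#(w, c)\,|\mathcal{D}|}{\#(w)\,\#(c)}, \tag{6}$$
where $|\mathcal{D}|$ is the size of $\mathcal{D}$; $\#(w, c)$ is the number of times the pair $(w, c)$ appears in $\mathcal{D}$; and $\#(w)$ and $\#(c)$ are the numbers of times $w$ and $c$ appear in $\mathcal{D}$, respectively.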
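A short derivation shows where the shift by $\log k$ in Eq. (4) comes from. It follows the argument in [9] and assumes that negative contexts are drawn from the empirical unigram distribution $P_{\mathcal{D}}(c) = \#(c)/|\mathcal{D}|$ and that each product $x = \mathbf{w}^\top \mathbf{c}$ can be optimized independently. The symbols $\ell$ and $x$ are introduced here only for this sketch; the counts are those used in Eq. (6).

```latex
\begin{align*}
\ell(w,c) &= \#(w,c)\,\log\sigma(x)
            + k\,\#(w)\,\frac{\#(c)}{|\mathcal{D}|}\,\log\sigma(-x),\\
\frac{\partial \ell}{\partial x} &= \#(w,c)\,\sigma(-x)
            - k\,\frac{\#(w)\,\#(c)}{|\mathcal{D}|}\,\sigma(x) = 0,\\
e^{x} &= \frac{\#(w,c)\,|\mathcal{D}|}{k\,\#(w)\,\#(c)}
            \quad\text{(using } \sigma(x)/\sigma(-x) = e^{x}\text{)},\\
x &= \log\frac{\#(w,c)\,|\mathcal{D}|}{\#(w)\,\#(c)} - \log k
   = \mathrm{PMI}(w,c) - \log k .
\end{align*}
```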
Levy et al. [9] then proposed a word embedding model by factorizing the matrix $S$, which has elements $s_{wc}$ that are defined in Eq. (7). This matrix is called the shifted positive pointwise mutual information matrix (SPPMI matrix).
$$s_{wc} = \max\left(\mathrm{PMI}(w, c) - \log k,\ 0\right). \tag{7}$$
In other words, the SPPMI matrix is obtained by shifting the PMI matrix by $\log k$ and then replacing all negative values with zeroes (hence, shifted positive pointwise mutual information).
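The following sketch estimates PMI from raw pair counts and applies the shift and clipping that define SPPMI. The function name `sppmi_from_pairs` and the toy word–context pairs are hypothetical; `k` plays the role of the number of negative samples.

```python
import math
from collections import Counter

def sppmi_from_pairs(pairs, k=1):
    """Return {(w, c): SPPMI value} for every observed word-context pair."""
    pair_counts = Counter(pairs)
    word_counts = Counter(w for w, _ in pairs)
    context_counts = Counter(c for _, c in pairs)
    total = len(pairs)                       # size of the pair collection
    scores = {}
    for (w, c), n_wc in pair_counts.items():
        pmi = math.log(n_wc * total / (word_counts[w] * context_counts[c]))
        scores[(w, c)] = max(pmi - math.log(k), 0.0)   # shift, then clip at zero
    return scores

# Toy pair collection (illustrative only).
pairs = [("a", "b"), ("a", "b"), ("a", "c"), ("d", "c")]
scores = sppmi_from_pairs(pairs, k=1)
```

In this toy collection, the pair `("a", "c")` co-occurs less often than chance would predict, so its PMI is negative and its SPPMI is clipped to zero.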
3 Co-occurrence-based Item Embedding for Collaborative Filtering
3.1 Co-occurrence-based Item Embedding
By considering each item as a word, we aim to extract the relationships between items in the same way as word embedding techniques do. Our motivation is that the representation of an item is governed not only by the users who interact with it but also by the other items that appear in its context. In this work, we define “context” as the items occurring in the interaction list of a user (i.e., the items that the user interacts with). However, other definitions of context can also be used without any problems. We argue that if items co-occur frequently in the interaction lists of some users, they are similar, and their latent vectors should be close in the latent space.
Inspired by the work of Levy et al. [9], which we present in Sect. 2.2, we construct an SPPMI matrix of items based on co-occurrences and embed it into the factorization model.
3.1.1 Constructing the SPPMI matrix for items.
We now show how to construct the SPPMI matrix for items according to their co-occurrences.
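Let $\mathcal{D} = \{(i, j) \mid i, j \in I_u,\ i \neq j,\ u \in U\}$, where $U$ is the set of users and $I_u$ is the set of items with which user $u$ has interacted. We use $\#(i, j)$ to denote the number of times the item pair $(i, j)$ appears in $\mathcal{D}$ and $\#(i)$ to denote the number of times item $i$ appears in $\mathcal{D}$.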
For example, if we have three users , and whose interaction lists are , , and , respectively, we will have:



.
The item–item matrix $S$ has elements:
$$s_{ij} = \max\left(\mathrm{PMI}(i, j) - \log k,\ 0\right), \tag{8}$$
where $\mathrm{PMI}(i, j)$ is the pointwise mutual information of pair $(i, j)$, as mentioned above, and $k$ is a positive integer corresponding to the number of negative samples in the SGNS model [7]. In our experiments, we set $k$ to a fixed value.
Because $S$ defined above is symmetric, instead of factorizing $S$ into two different matrices as in [9], we factorize it into two equivalent matrices. In more detail, we factorize $S$ to the latent vectors of items:
$$s_{ij} \approx \mathbf{y}_i^\top \mathbf{y}_j. \tag{9}$$
In this way, $S$ can also be viewed as a similarity matrix between items, where element $s_{ij}$ indicates the similarity between item $i$ and item $j$.
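The construction can be sketched on a toy example as follows. The interaction lists below are hypothetical (they are not the values of the example in the text), and the sketch assumes that every ordered pair of distinct items in a user's list counts as one co-occurrence and that an item's count is the number of pairs in which it appears as the first element.

```python
import math
from collections import Counter
from itertools import permutations

def item_sppmi(interaction_lists, k=1):
    """Return the nonzero SPPMI entries {(i, j): s_ij} for item pairs."""
    pair_counts = Counter()
    for items in interaction_lists:
        # Ordered pairs (i, j) with i != j, so counts are symmetric by construction.
        pair_counts.update(permutations(set(items), 2))
    item_counts = Counter()
    for (i, _), n in pair_counts.items():
        item_counts[i] += n
    total = sum(pair_counts.values())
    S = {}
    for (i, j), n_ij in pair_counts.items():
        pmi = math.log(n_ij * total / (item_counts[i] * item_counts[j]))
        s = max(pmi - math.log(k), 0.0)
        if s > 0:                            # keep only nonzero entries (sparse form)
            S[(i, j)] = s
    return S

# Hypothetical interaction lists of three users.
lists = [[1, 2, 3], [2, 3], [1, 4]]
S = item_sppmi(lists)
S_shifted = item_sppmi(lists, k=2)
```

Items 2 and 3 appear together in two lists and receive a larger value than items 1 and 2, which co-occur once. With a larger shift (`k=2`), weakly associated pairs such as (1, 2) drop out of the matrix entirely.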
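3.2 Co-occurrence-based Item Embedded Matrix Factorization (CEMF)
We can now show how to incorporate the co-occurrence information of items into the factorization model. The SPPMI matrix will be factorized to obtain the latent vectors of items. The learned latent factor vectors of items should minimize the objective function:
$$\mathcal{L}_{\mathrm{item}}(Y) = \sum_{i<j:\ s_{ij} > 0} \left(s_{ij} - \mathbf{y}_i^\top \mathbf{y}_j\right)^2, \tag{10}$$
where the sum runs over the unordered item pairs with a nonzero SPPMI value.
Combining with the original objective function in Eq. (2), we obtain the overall objective function:
$$\mathcal{L}(X, Y) = \sum_{u,i} c_{ui}\left(p_{ui} - \mathbf{x}_u^\top \mathbf{y}_i\right)^2 + \sum_{i<j:\ s_{ij} > 0} \left(s_{ij} - \mathbf{y}_i^\top \mathbf{y}_j\right)^2 + \lambda\left(\sum_{u} \|\mathbf{x}_u\|_F^2 + \sum_{i} \|\mathbf{y}_i\|_F^2\right). \tag{11}$$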
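3.2.1 Learning method.
This function is not convex with respect to $\mathbf{x}_u$ and $\mathbf{y}_i$ jointly, but it is convex in each of them when all other vectors are kept fixed. Therefore, it can be solved using the alternating least squares method, similar to the method described in [4].
For each user $u$, at each iteration, we calculate the partial derivative of $\mathcal{L}$ with respect to $\mathbf{x}_u$ while fixing other entries. By setting this derivative to zero, $\partial \mathcal{L} / \partial \mathbf{x}_u = 0$, we obtain the update rule for $\mathbf{x}_u$:
$$\mathbf{x}_u = \left(\sum_{i} c_{ui}\, \mathbf{y}_i \mathbf{y}_i^\top + \lambda I_d\right)^{-1} \sum_{i} c_{ui}\, p_{ui}\, \mathbf{y}_i. \tag{12}$$
Similarly, for each item $i$, we calculate the partial derivative of $\mathcal{L}$ with respect to $\mathbf{y}_i$ while fixing other entries, and set the derivative to zero. We obtain the update rule for $\mathbf{y}_i$:
$$\mathbf{y}_i = \left(\sum_{u} c_{ui}\, \mathbf{x}_u \mathbf{x}_u^\top + \sum_{j:\ s_{ij} > 0} \mathbf{y}_j \mathbf{y}_j^\top + \lambda I_d\right)^{-1} \left(\sum_{u} c_{ui}\, p_{ui}\, \mathbf{x}_u + \sum_{j:\ s_{ij} > 0} s_{ij}\, \mathbf{y}_j\right), \tag{13}$$
where $I_d$ is the $d \times d$ identity matrix (i.e., the matrix with ones on the main diagonal and zeros elsewhere).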
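The alternating updates can be sketched with dense NumPy arrays as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the item–item term is summed over unordered pairs with a nonzero SPPMI value, the SPPMI matrix has a zero diagonal, and the sparse speed-ups of [4] are omitted. The function names and the toy data are hypothetical.

```python
import numpy as np

def cemf_objective(P, C, S, X, Y, lam):
    """User-item weighted error + item-item error over pairs i < j with s_ij > 0."""
    user_item = np.sum(C * (P - X.T @ Y) ** 2)
    mask = np.triu(S > 0, k=1)                 # each unordered pair counted once
    item_item = np.sum(((S - Y.T @ Y) ** 2)[mask])
    return float(user_item + item_item + lam * (np.sum(X ** 2) + np.sum(Y ** 2)))

def als_sweep(P, C, S, X, Y, lam):
    """One in-place pass: every user vector, then item vectors one at a time."""
    d = X.shape[0]
    I = np.eye(d)
    for u in range(X.shape[1]):
        A = (Y * C[u]) @ Y.T + lam * I         # sum_i c_ui y_i y_i^T + lam I
        b = Y @ (C[u] * P[u])                  # sum_i c_ui p_ui y_i
        X[:, u] = np.linalg.solve(A, b)
    for i in range(Y.shape[1]):
        nbr = np.flatnonzero(S[i] > 0)         # items j with s_ij > 0
        Yn = Y[:, nbr]
        A = (X * C[:, i]) @ X.T + Yn @ Yn.T + lam * I
        b = X @ (C[:, i] * P[:, i]) + Yn @ S[i, nbr]
        Y[:, i] = np.linalg.solve(A, b)

# Toy problem: 6 users, 5 items, 3 factors (all values are illustrative).
rng = np.random.default_rng(0)
R = rng.integers(0, 3, size=(6, 5))
P, C = (R > 0).astype(float), 1.0 + 2.0 * R    # alpha = 2 is an arbitrary choice
S = np.triu(rng.random((5, 5)), k=1)
S[S < 0.5] = 0.0
S = S + S.T                                    # symmetric with zero diagonal
X = 0.1 * rng.standard_normal((3, 6))
Y = 0.1 * rng.standard_normal((3, 5))
losses = [cemf_objective(P, C, S, X, Y, lam=0.1)]
for _ in range(5):
    als_sweep(P, C, S, X, Y, lam=0.1)
    losses.append(cemf_objective(P, C, S, X, Y, lam=0.1))
```

Because each vector update exactly minimizes a convex quadratic while the other vectors are held fixed, the recorded losses should not increase from one sweep to the next.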
3.2.2 Computational complexity.
For user vectors, as analyzed in [4], the complexity for updating $N$ users in an iteration is $\mathcal{O}(d^2 |P| + d^3 N)$, where $|P|$ is the number of nonzero entries of the preference matrix $P$. Since $|P| \gg N$, if $d$ is small, this complexity is linear in the size of the input matrix. For item vector updating, we can easily show that the running time for updating $M$ items in an iteration is $\mathcal{O}(d^2 (|P| + |S|) + d^3 M)$, where $|S|$ is the number of nonzero entries of matrix $S$. For systems in which the number of items is not very large, this complexity is not a big problem. However, the computations become significantly expensive for systems with very large numbers of items. Improving the computational complexity of updating item vectors will be part of our future work.
4 Empirical Study
In this section, we study the performance of CEMF. We compare CEMF with two competing methods for implicit feedback data: WMF [4, 6] and CoFactor [13]. Across three real-world datasets, CEMF outperformed these competing methods for almost all metrics.
4.1 Datasets, Metrics, Competing Methods, and Parameter Setting
4.1.1 Datasets.
We studied datasets from different domains: movies, music, and location, with varying sizes from small to large. The datasets are:

MovieLens20M (ML20M) [14]: a dataset of users’ movie ratings collected from MovieLens, an online film service. It contains 20 million ratings in the range 1–5 of 27,000 movies by 138,000 users. We binarized the ratings by thresholding at 4 or above. The dataset is available at GroupLens (https://grouplens.org/datasets/movielens/20m/).

TasteProfile: a dataset of counts of song plays by users collected by Echo Nest (http://the.echonest.com/). After removing songs that were listened to by fewer than 50 users, and users who listened to fewer than 20 songs, we binarized play counts and used them as implicit feedback data.

Online Retail Dataset (OnlineRetail) [15]: a dataset of online retail transactions provided at the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Online+Retail). It contains all the transactions from December 1, 2010 to December 9, 2011 for a UK-based online retailer.
For each user, we selected 20% of interactions as ground truth for testing. The remaining portions from each user were divided in two parts: 90% for a training set and 10% for validation. The statistical information of the training set of each dataset is summarized in Table 1.
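Table 1: Statistical information of the training set of each dataset.

| | ML20M | TasteProfile | OnlineRetail |
|---|---|---|---|
| # of users | 138,493 | 629,113 | 3,704 |
| # of items | 26,308 | 98,486 | 3,643 |
| # of interactions | 18M | 35.5M | 235K |
| Sparsity (%) | 99.5 | 99.94 | 98.25 |
| Sparsity of SPPMI matrix (%) | 75.42 | 76.34 | 66.24 |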
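The per-user split described above can be sketched as follows. The text does not say how the 20% of interactions were selected, so uniform random selection is an assumption here; the function name `split_user_items` and the seed are likewise hypothetical.

```python
import random

def split_user_items(items, rng, test_frac=0.2, val_frac=0.1):
    """Split one user's interactions into train, validation, and test lists."""
    items = list(items)
    rng.shuffle(items)                       # assumed: random selection per user
    n_test = int(round(test_frac * len(items)))
    test, rest = items[:n_test], items[n_test:]
    n_val = int(round(val_frac * len(rest)))  # 10% of the remaining interactions
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test

rng = random.Random(0)
train, val, test = split_user_items(range(50), rng)
```

For a user with 50 interactions this yields 10 test items, 4 validation items, and 36 training items, so the training set holds 72% of that user's interactions.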
4.1.2 Evaluation metrics.
The performance of the learned model was assessed by comparing the recommendation list with the ground-truth items of each user. We used Recall@$n$ and Precision@$n$ as the measures for evaluating the performance.
Recall@$n$ and Precision@$n$ are usually used as metrics in information retrieval. The former metric indicates the percentage of relevant items that are recommended to the users, while the latter indicates the percentage of relevant items in the recommendation lists. They are formulated as:
$$\text{Recall@}n = \frac{1}{N} \sum_{u=1}^{N} \frac{|R_u(n) \cap T_u|}{|T_u|}, \qquad \text{Precision@}n = \frac{1}{N} \sum_{u=1}^{N} \frac{|R_u(n) \cap T_u|}{n}, \tag{14}$$
where $R_u(n)$ is the list of top-$n$ items recommended to user $u$ by the system and $T_u$ is the list of ground-truth items of user $u$.
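A minimal sketch of how these two metrics can be computed is given below. It assumes the per-user values are averaged over users with at least one ground-truth item; the function name and the toy recommendation lists are hypothetical.

```python
def precision_recall_at_n(recommended, ground_truth, n):
    """Average Precision@n and Recall@n over users with ground-truth items.

    recommended: {user: ranked list of items}; ground_truth: {user: list of items}.
    """
    users = [u for u in ground_truth if ground_truth[u]]
    total_precision = total_recall = 0.0
    for u in users:
        hits = len(set(recommended[u][:n]) & set(ground_truth[u]))
        total_precision += hits / n
        total_recall += hits / len(ground_truth[u])
    return total_precision / len(users), total_recall / len(users)

# Toy example with two users (illustrative values).
recommended = {"u1": [1, 2, 3, 4], "u2": [5, 6, 7, 8]}
ground_truth = {"u1": [2, 9], "u2": [5, 6, 7]}
precision, recall = precision_recall_at_n(recommended, ground_truth, n=2)
```

Here user `u1` has one hit in the top two and user `u2` has two, so the averaged precision is 0.75 while recall is lower because `u2` has three ground-truth items.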
4.1.3 Competing methods.
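We compared CEMF with the following competing methods.

WMF [4, 6]: the weighted matrix factorization model for implicit feedback data described in Sect. 2.1.

CoFactor [13]: a model that factorizes the user–item and item–item matrices at the same time in a shared latent space and represents each item by two latent vectors.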
4.1.4 Parameters.

Number of factors $d$: we trained the model with the number of factors ranging from small to large values: {10, 20, 30, 40, 50, 60, 70, 80, 90, 100}.

Regularization term: we set the regularization parameter for the Frobenius norm of user and item vectors as .

Confidence matrix: we set $c_{ui}$ as in Eq. (1). We varied the value of $\alpha$ and chose the one that gave the best performance.
4.2 Results
We evaluated CEMF by considering its overall performance and its performance for different groups of users. Results for Precision@$n$ and Recall@$n$ show that our method outperformed the competing methods.
4.2.1 Overall performance.
Overall prediction performance with respect to Precision and Recall is shown in Table 2 and Table 3, respectively. These results are for a fixed number of factors $d$; larger values of $d$ produce higher accuracy, but the differences in performance between the methods do not change much. The results show that CEMF improves performance on the three datasets for almost all metrics, except for some metrics with larger $n$ on TasteProfile (Precision@50, Precision@100, and Recall@20 through Recall@100). If we use only small values of $n$, say $n = 5$ or $n = 10$, CEMF outperforms all competing methods on the three datasets.
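Table 2: Precision@$n$ of the competing methods on the three datasets.

| Dataset | Model | Pre@5 | Pre@10 | Pre@20 | Pre@50 | Pre@100 |
|---|---|---|---|---|---|---|
| ML20M | WMF | 0.2176 | 0.1818 | 0.1443 | 0.0974 | 0.0677 |
| | CoFactor | 0.2249 | 0.1835 | 0.1416 | 0.0926 | 0.0635 |
| | CEMF | 0.2369 | 0.1952 | 0.1523 | 0.1007 | 0.0690 |
| TasteProfile | WMF | 0.1152 | 0.0950 | 0.0755 | 0.0525 | 0.0378 |
| | CoFactor | 0.1076 | 0.0886 | 0.0701 | 0.0487 | 0.0353 |
| | CEMF | 0.1181 | 0.0966 | 0.0760 | 0.0523 | 0.0373 |
| OnlineRetail | WMF | 0.0870 | 0.0713 | 0.0582 | 0.0406 | 0.0294 |
| | CoFactor | 0.0927 | 0.0728 | 0.0552 | 0.0381 | 0.0273 |
| | CEMF | 0.0959 | 0.0779 | 0.0619 | 0.0425 | 0.0302 |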
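Table 3: Recall@$n$ of the competing methods on the three datasets.

| Dataset | Model | Recall@5 | Recall@10 | Recall@20 | Recall@50 | Recall@100 |
|---|---|---|---|---|---|---|
| ML20M | WMF | 0.2366 | 0.2601 | 0.3233 | 0.4553 | 0.5788 |
| | CoFactor | 0.2420 | 0.2550 | 0.3022 | 0.4101 | 0.5194 |
| | CEMF | 0.2563 | 0.2750 | 0.3331 | 0.4605 | 0.5806 |
| TasteProfile | WMF | 0.11869 | 0.1148 | 0.1377 | 0.2129 | 0.2960 |
| | CoFactor | 0.1106 | 0.1060 | 0.1256 | 0.1947 | 0.2741 |
| | CEMF | 0.1215 | 0.1159 | 0.1369 | 0.2092 | 0.2891 |
| OnlineRetail | WMF | 0.1142 | 0.1463 | 0.2136 | 0.3428 | 0.4638 |
| | CoFactor | 0.1160 | 0.1384 | 0.1891 | 0.3020 | 0.4159 |
| | CEMF | 0.1232 | 0.1550 | 0.2191 | 0.3466 | 0.4676 |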
4.2.2 Performance for different groups of users.
We divided the users into groups based on the number of items they had interacted with so far, and evaluated the performance for each group. There were three groups in our experiments:

low: users who had interacted with fewer than 20 items

medium: users who had interacted with 20 to 100 items

high: users who had interacted with more than 100 items.
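The grouping can be expressed as a small helper. The boundaries of the medium group are an assumption inferred from the other two groups (20 to 100 items inclusive), and the function name and toy counts are hypothetical.

```python
def interaction_group(n_items, low_max=20, high_min=100):
    """Map a user's interaction count to 'low', 'medium', or 'high'."""
    if n_items < low_max:
        return "low"
    if n_items > high_min:
        return "high"
    return "medium"                          # assumed: 20 to 100 items inclusive

# Toy interaction counts per user (illustrative values).
counts = {"u1": 5, "u2": 20, "u3": 100, "u4": 250}
groups = {u: interaction_group(n) for u, n in counts.items()}
```

Metrics such as Precision@$n$ can then be averaged separately over the users in each group.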
The Precision@$n$ and Recall@$n$ for these groups are presented in Fig. 1. The results show that CEMF outperforms the competing methods for almost all groups of users. For users with small numbers of interactions, CEMF is slightly better than WMF and much better than CoFactor. For users with many items in their interaction lists, CEMF shows much better performance than WMF and better than CoFactor.
A system usually has both users with few interactions and users with many interactions; therefore, CEMF is a more effective choice than either WMF or CoFactor.
5 Related Work
Standard techniques for implicit feedback data include weighted matrix factorization [4, 6], a special case of matrix factorization targeted at implicit feedback data, in which the weights are defined from the interaction counts and reflect how confident we are about the preference of a user for an item. Gopalan et al. [16] introduced a Poisson distribution-based factorization model that factorizes the user–item matrix. The common point of these matrix factorization methods is that they assume that the user–item interactions are independent; thus, they cannot capture the relationships between strongly related items in the latent representations.
Collective matrix factorization (CMF) [17] proposes a framework for factorizing multiple related matrices simultaneously, to exploit information from multiple sources. This approach can incorporate the side information (e.g., genre information of items) into the latent factor model.
In [18], the authors present a factorization-based method that uses item–item similarity to predict drug–target interactions. While this model uses the item–item similarity from additional sources as side information, we do not require side information in this work. Instead, we exploit the co-occurrence information that is drawn from the interaction matrix.
The CoFactor [13] model is based on CMF [17]. It factorizes the user–item and item–item matrices at the same time in a shared latent space. The main difference between our method and CoFactor is how we factorize the item–item co-occurrence matrix. Instead of representing each item by two latent vectors as in [13], where it is difficult to interpret the second one, we represent each item by a single latent vector.
6 Discussion and Future Work
We have examined the effect of co-occurrence on the performance of recommendation systems. We proposed a method that combines the power of two worlds: collaborative filtering by MF and item embedding, with the items in the interaction lists of users serving as item context. Our goal is a latent factor model that reflects the strong associations of closely related items in their latent vectors. Our proposed method improved the recommendation performance on top-$n$ recommendation for three real-world datasets.
We plan to explore several ways of extending or improving this work. The first direction is to consider different definitions of “context items”. One approach is to define context items as items that co-occur in the same transactions as the given items. In this way, we can extract relationships between items that frequently appear together in transactions and can recommend the next item given the current one, or recommend a set of items.
The second direction we are planning to pursue is to reduce the computational complexity of the current algorithm. As we mentioned in Sect. 3, updating item vectors becomes significantly expensive for systems with large numbers of items. We hope to develop a new algorithm that can improve this complexity. We also plan to develop an online learning algorithm, which updates user and item vectors when new data are collected without retraining the model from the beginning.
Acknowledgments.
This work was supported by a JSPS Grant-in-Aid for Scientific Research (B) (15H02789).
References
 [1] Y. Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 426–434. ACM, 2008.
 [2] Yehuda Koren. Collaborative filtering with temporal dynamics. Commun. ACM, 53(4):89–97, 2010.
 [3] Ruslan Salakhutdinov and Andriy Mnih. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems, volume 20, 2008.
 [4] Y. Hu, Y. Koren, and C. Volinsky. Collaborative filtering for implicit feedback datasets. In Data Mining, 2008. ICDM’08. Eighth IEEE International Conference on, pages 263–272. IEEE, 2008.
 [5] Robert M. Bell and Yehuda Koren. Lessons from the Netflix prize challenge. SIGKDD Explorations, 9(2):75–79, 2007.
 [6] Rong Pan, Yunhong Zhou, Bin Cao, Nathan Nan Liu, Rajan M. Lukose, Martin Scholz, and Qiang Yang. One-class collaborative filtering. In IEEE International Conference on Data Mining (ICDM 2008), pages 502–511, 2008.
 [7] Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. In NIPS, pages 3111–3119, 2013.
 [8] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
 [9] Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems 27, pages 2177–2185. Curran Associates, Inc., 2014.
 [10] Quoc V Le and Tomas Mikolov. Distributed representations of sentences and documents. In ICML, volume 14, pages 1188–1196, 2014.
 [11] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155, 2003.
 [12] K. W. Church and P. Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29, 1990.
 [13] Dawen Liang, Jaan Altosaar, Laurent Charlin, and David M. Blei. Factorization meets the item embedding: Regularizing matrix factorization with item co-occurrence. In RecSys, pages 59–66. ACM, 2016.
 [14] F. Maxwell Harper and Joseph A. Konstan. The MovieLens datasets: History and context. TiiS, 5(4):19, 2016.
 [15] Daqing Chen, Sai Laing Sain, and Kun Guo. Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining. Journal of Database Marketing & Customer Strategy Management, 19(3):197–208, 2012.
 [16] Prem Gopalan, Jake M. Hofman, and David M. Blei. Scalable recommendation with Poisson factorization. CoRR, abs/1311.1704, 2013.
 [17] Ajit Paul Singh and Geoffrey J. Gordon. Relational learning via collective matrix factorization. In KDD, pages 650–658. ACM, 2008.
 [18] Xiaodong Zheng, Hao Ding, Hiroshi Mamitsuka, and Shanfeng Zhu. Collaborative matrix factorization with multiple similarities for predicting drug-target interactions. In KDD, pages 1025–1033. ACM, 2013.