Fast Adaptively Weighted Matrix Factorization for Recommendation with Implicit Feedback

Abstract

Recommendation from implicit feedback is a highly challenging task due to the lack of reliable observed negative data. A popular and effective approach for implicit recommendation is to treat unobserved data as negative but downweight their confidence. Naturally, how to assign confidence weights and how to handle the large number of unobserved data are two key problems for implicit recommendation models. However, existing methods either pursue fast learning by manually assigning simple confidence weights, which lacks flexibility and may create empirical bias in evaluating users' preferences, or adaptively infer personalized confidence weights but suffer from low efficiency.

To achieve both adaptive weight assignment and efficient model learning, we propose a fast adaptively weighted matrix factorization (FAWMF) based on the variational auto-encoder. The personalized data confidence weights are adaptively assigned with a parameterized neural network (function), and the network can be inferred from the data. Further, to support fast and stable learning of FAWMF, a new batch-based learning algorithm, fBGD, has been developed, which trains on all feedback data but whose complexity is linear in the number of observed data. Extensive experiments on real-world datasets demonstrate the superiority of the proposed FAWMF and its learning algorithm fBGD.

Introduction

Recommender systems play an important role in many internet services. Since in practice most recommender systems only have implicit feedback (e.g. items consumed), research attention has recently shifted from explicit feedback (e.g. rating prediction) to implicit feedback. However, learning a recommender system from implicit feedback is more challenging. In implicit feedback scenarios, only positive feedback is observed, and the unobserved user-item feedback data (e.g. a user has not bought an item yet) are a mixture of real negative feedback (i.e. the user does not like it) and missing values (i.e. the user just does not know about it). Existing methods address this problem by treating all the unobserved data as negative (dislike) but downweighting their confidence. Although this is reasonable, it poses two important questions: (1) How to assign a confidence weight to each datum? (2) How to handle the massive volume of unobserved data efficiently?

While there is a rich literature on recommendation from implicit feedback, to our knowledge, all existing methods lack one or more desiderata. Most of these methods rely on the meticulous assignment of confidence weights to the data. For example, WMF [10] and eALS [7] use uniform or popularity-based confidence weights; BPR [22], CDAE [28], and other sophisticated models downweight the contributions of the negative data through heuristic (e.g. uniform) negative sampling strategies. However, choosing these weights usually involves heuristic alterations to the data and needs an expensive exhaustive grid search via cross-validation. Furthermore, it is unrealistic for researchers to manually set flexible and diverse weights for millions of data points. In practical scenarios, the data confidence weights may change across user-item combinations: some unobserved data can be attributed to a user's preference, while others are the result of the user's limited scope. To this end, exposure-based matrix factorization (EXMF) [17] has been proposed to downweight negative data automatically by predicting whether a user knows an item. However, EXMF involves iterative inference of the exposure for each datum, which is computationally expensive and also potentially suffers from overfitting. More recently, SamWalker [4] adaptively assigns confidence weights from users' social contexts, but such social network information is hard to obtain in many recommender systems.

To address these problems, we propose a fast adaptively weighted matrix factorization (FAWMF) based on the variational auto-encoder [13] for both adaptive weight assignment and efficient learning. We first analyze EXMF from a variational perspective and find that the variational posterior of a user's exposure acts as a data confidence weight when learning users' preferences with MF. This is consistent with intuition: only if a user is exposed to an item can he decide whether to consume it based on his preference, so data with larger exposure are more reliable for deriving the user's preference. Further, we propose to replace the confidence weights (the variational posterior of a user's exposure) with a parameterized inference neural network (function), which reduces the number of inferred parameters and is capable of capturing latent correlations between users' exposures. In fact, the assumption that exposures are independent is not realistic. Recent literature [19, 31] in social science claims that each of us belongs to some information-sharing communities. Naturally, users are exposed to these communities, and thus their exposures exhibit correlations when they belong to common communities. Motivated by this point, a specific community-based inference neural network has been designed, which explores latent communities among users and infers users' exposure from the communities that they belong to. It captures more precise exposure and can mitigate the overfitting problem. By optimizing the variational lower bound, both the inference neural network and the MF model can be learned from the data.

Efficiently learning a recommendation model from implicit feedback is also challenging, since it requires accounting for the large-scale unobserved data. Although a stochastic gradient descent optimizer (SGD) with a sampling strategy can be employed to speed up inference, it usually suffers from slow convergence and high gradient instability. Especially in the implicit recommendation task, the optimizer often samples uninformative data, which have low confidence and make limited contributions to the gradient update [4]; the final recommendation performance suffers. Instead, we employ a batch gradient descent optimizer (BGD), which computes stable gradients from all feedback data and usually converges to a better optimum. Unfortunately, a naive implementation of BGD suffers from low efficiency caused by the expensive full-batch gradient computation over all data. To address this problem, we develop a fast BGD-based learning algorithm, fBGD, for our FAWMF. With rigorous mathematical reasoning, the massive repeated computations of the original BGD can be avoided and the learning process accelerated. Notably, although fBGD computes gradients over all data, its actual complexity is linear in the number of observed positive data. Due to the sparsity of implicit feedback data, fBGD achieves a significant acceleration.

We summarize our key contributions as follows:

  • We propose a FAWMF model based on variational auto-encoder to achieve both adaptive weights assignment and efficient model learning.

  • A batch-based learning algorithm fBGD has been developed to learn FAWMF efficiently and stably.

  • Extensive experimental evaluations on three well-known benchmark datasets demonstrate the superiority of our FAWMF over the existing implicit MF methods.

Related work

Recommendation from implicit feedback data. Most existing methods manually assign coarse-grained confidence weights. For example, the classic weighted matrix factorization model (WMF) [10], eALS [7], and ICD-MF [1] use a simple heuristic where the confidence weights are assigned uniformly or based on item popularity; logistic matrix factorization [12], BPR [22], and neural recommendation models (e.g. CDAE [28], NCF [6]) downweight the contribution of negative data implicitly through their heuristic (e.g. uniform) negative sampling strategies. More recently, the probabilistic model EXMF [17] was proposed to incorporate users' exposure to items into CF methods. When inferring a user's preference, the user's exposure can be interpreted as data confidence. However, this method suffers from efficiency problems. Also, some sophisticated models have been proposed to learn confidence weights from users' social contexts [27, 11, 4], but the social information is not available in many cases.

Efficient learning algorithms for implicit recommendation models. To handle the large-scale unobserved data, two types of strategies have been proposed for efficient learning: sample-based learning and whole-data-based learning. The first type achieves fast learning with stochastic gradient descent (SGD) and negative sampling. The most popular sampling strategy is to draw unobserved feedback data uniformly, which is widely adopted in LMF [12], BPR [22], CDAE [28], NCF [6], Mult-DAE [18], etc. Also, [29], [5] and [8] further propose item-popularity-based and item-user co-bias sampling strategies. However, in the recommendation task SGD usually suffers from slow convergence and gradient instability when the number of items is large [4].

Thus, other dynamic sampling strategies have been proposed to improve convergence and accuracy. Several studies [23, 30, 21, 26] propose to oversample the “difficult” negative instances that are hard for the models to discriminate. However, these strategies incur efficiency issues in sampling; also, the stochastic gradient estimator is biased and may amplify the natural noise in users' feedback data [16]. More recently, SamWalker [4] conducts random walks along the social network to select informative instances. Although effective, it has two weaknesses: (1) SamWalker requires additional social information, which is not available in many recommender systems; (2) SamWalker still employs a uniform sampling strategy to update its dynamic sampler, leading to insufficient training of the sampler.

A more effective and stable way is to update the model from the whole data, but this faces an efficiency challenge due to the large number of negative data. Thus, memorization strategies (e.g. ALS, eALS) have been proposed [10, 7, 1, 3] to speed up learning. However, these algorithms are only suitable for MF with simple manual confidence weights, which lack flexibility and may create empirical bias; the recommendation performance suffers.

Adaptive weights assignment and efficient learning are both important in recommendation from implicit feedback. Existing methods are not able to provide both of these, which motivates the approach described in this paper.

Preliminaries

In this section, we first give the problem definition of implicit recommendation. Then, we introduce the exposure-based matrix factorization (EXMF) [17] framework from a variational perspective to provide useful insight into the relation between users' exposure and data confidence.

Problem definition

Suppose we have a recommender system with a user set and an item set. The implicit feedback data are represented as a matrix whose entries denote whether or not the user has consumed the item. The task of a recommender system can be stated as follows: recommend to each user the items that he is most likely to consume.

Exposure-based matrix factorization (EXMF)

Note that the unobserved feedback data contain both real negative data (dislike) and missing values (unknown). EXMF introduces a Bernoulli variable to model users' exposure: one value denotes that the user knows the item, and the other denotes that he does not. Then, EXMF models a user's consumption conditioned on the exposure as follows:

(1)
(2)
(3)

where the prior probability of exposure appears above. Here we relax the indicator function with a small constant (e.g. 1e-5) to make the model more robust. When the user does not know the item, he cannot consume it. When the user knows the item, he decides whether or not to consume it based on his preference. Thus, the consumption probability is modeled with the classic matrix factorization model, where each user has a D-dimensional latent factor (preference) and each item has a D-dimensional latent factor (attribute).
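The two-step generative story above can be sketched in a few lines (a minimal sketch; the names U, V, eta, eps and the sigmoid consumption score are our assumptions, not the paper's exact notation):

```python
import numpy as np

rng = np.random.default_rng(0)

def exmf_generate(U, V, eta, eps=1e-5):
    """Sample implicit feedback under the EXMF generative story.

    U:   (n_users, D) user preference factors
    V:   (n_items, D) item attribute factors
    eta: (n_users, n_items) prior exposure probabilities
    """
    # Step 1: exposure is a Bernoulli draw from the prior.
    a = rng.binomial(1, eta)
    # Step 2: consumption. An unexposed user (almost) never consumes the
    # item (probability eps); an exposed user consumes it according to his
    # preference score from matrix factorization.
    score = 1.0 / (1.0 + np.exp(-U @ V.T))   # sigmoid(u_u . v_i)
    p_consume = np.where(a == 1, score, eps)
    y = rng.binomial(1, p_consume)
    return a, y

U = rng.normal(size=(4, 3))
V = rng.normal(size=(5, 3))
eta = np.full((4, 5), 0.3)
a, y = exmf_generate(U, V, eta)   # exposure and consumption matrices
```

Only y is observed in practice; the exposure a is the latent variable EXMF must infer.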

Analyses of EXMF from variational perspective

The marginal likelihood of EXMF is composed of a sum over each datapoint , which can be rewritten as:

(4)

where a variational distribution approximates the posterior of the exposure. Since the second term, a KL divergence, is non-negative, the first term is the evidence lower bound (ELBO) on the marginal likelihood. Classic variational methods [9] usually employ a conjugate variational distribution with individual variational parameters for each datum. Then, optimizing EXMF can be transformed into minimizing the following objective function:

(5)

Exposure as data confidence. The objective function consists of three parts: (1) a weighted matrix factorization loss to learn users' preferences; (2) the loss when the data are predicted as unknown; (3) a regularization term, the KL divergence between the prior and the variational posterior. A useful property emerges: the variational parameters, which characterize the probability that a user is exposed to an item, act as the confidence of the corresponding data in learning the user's preference. This is consistent with intuition. Only if the user is exposed to the item can he decide whether to consume it based on his preference; data with larger exposure are more reliable for deriving the user's preference.
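In code, the first part of this objective is simply a per-datum weighted MF reconstruction loss (a sketch with a squared loss and our own names gamma, lam; the paper's exact loss terms may differ):

```python
import numpy as np

def weighted_mf_loss(Y, U, V, gamma, lam=0.1):
    """Weighted MF part of the objective: the exposure posteriors gamma_ui
    act as per-datum confidence weights on the reconstruction error.

    Y:     (n_users, n_items) implicit feedback matrix
    U, V:  latent user and item factors
    gamma: (n_users, n_items) confidence weights (exposure posteriors)
    """
    pred = U @ V.T
    weighted_err = gamma * (Y - pred) ** 2   # low-confidence data count less
    reg = lam * (np.sum(U ** 2) + np.sum(V ** 2))
    return weighted_err.sum() + reg
```

With gamma fixed to 1 everywhere this reduces to ordinary unweighted MF; EXMF instead infers gamma jointly with U and V.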

Inefficiency problem. EXMF suffers from efficiency problems. The number of inferred variational parameters (confidence weights) grows with the product of the numbers of users and items, which can easily reach the billions. This not only risks overfitting but also becomes the time and space bottleneck for practical recommender systems. Further, the inference of EXMF requires summing over the terms for each datum, which is time-consuming.

Fast adaptively weighted matrix factorization

To address these problems, we propose a fast adaptively weighted matrix factorization (FAWMF) based on the variational auto-encoder to achieve both adaptive weight assignment and efficient model learning. To reduce the number of inferred parameters, FAWMF replaces the individual variational parameters with a parameterized function that maps users' consumption of items into their exposure. This is also more reasonable because there are interactions between users: the assumption that exposures are independent does not hold in the real world. Recent literature [19] in social science claims that each of us belongs to some information-sharing communities. Users are exposed to these communities, and thus their exposures exhibit commonality when they belong to similar communities. Motivated by this point, we summarize communities as collections of similar users and infer users' exposure from the communities that they belong to. Concretely, the inference function can be modeled as follows:

(6)

where the membership vector allocates each user to a fixed number of communities, with non-negative entries summing to one. We accumulate the consumptions of the users in a community to infer how items are exposed to that community, with per-user influence weights capturing the heterogeneous roles of users; different users may have different influence strengths on the communities [24, 25]. A logistic linear function then maps the accumulated consumptions into the exposure. Intuitively, the more influential users in a common community have consumed an item, the more likely the user is to know it. Finally, a user's exposure is depicted by how the user belongs to the communities and how items are exposed to these communities. Overall, inferring the data confidence is translated into learning the parameters of the inference function, which captures latent correlations between exposures for better estimation and greatly reduces the number of inferred parameters. Different from existing clustering-based methods (e.g. [2, 15]), which cluster users into communities/submatrices based on users' preferences, our FAWMF deduces implicit communities based on users' exposure.
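A minimal sketch of such a community-based inference function (our own parameterization; the names S, pi, w, b are assumptions and the paper's Eq. (6) may differ in detail):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def community_exposure(Y, S, pi, w, b):
    """Infer per-user, per-item exposure from community structure.

    Y:    (n_users, n_items) binary consumption matrix
    S:    (n_users, K) community memberships, each row sums to 1
    pi:   (n_users,) per-user influence strengths
    w, b: (K,) slope and bias of the logistic linear map per community
    """
    # How strongly each item is exposed to each community: accumulate the
    # influence-weighted consumptions of the community's members.
    signal = S.T @ (pi[:, None] * Y)                 # (K, n_items)
    comm_exposure = sigmoid(w[:, None] * signal + b[:, None])
    # A user's exposure to an item mixes the exposures of the communities
    # the user belongs to, weighted by his memberships.
    return S @ comm_exposure                         # (n_users, n_items)
```

Because each row of S sums to one and the community exposures lie in (0, 1), the returned exposures are valid probabilities by construction.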

FAWMF can also be understood from a neural network perspective. As shown in Figure 1, an outer product between each user's consumption vector and his membership vector is first employed to model their interactions. The resulting tensor can be regarded as the influence of the user's consumptions along different dimensions (communities). Then, we use a specific striped CNN [14] layer and a linear layer to encode these influences (consumptions) into the “kernel maps” of users' exposure. We choose a striped CNN because adjacent elements merely correspond to adjacent user (item) ids and do not imply greater commonality. Further, the exposure for different user-item pairs can be depicted as a combination of the “kernel maps”. Finally, the exposure-based MF acts as a probabilistic decoder to predict users' consumption. Overall, FAWMF forms an auto-encoder that learns both the data confidence weights and the latent factors of MF.

Figure 1: Illustration of the proposed FAWMF

Discussion

How does FAWMF mitigate overfitting? To better understand this effect, let us draw an analogy with floating balls in water, as illustrated in Figure 2. Learning the EXMF model by optimizing equation (5) applies a force that pushes up the positive balls (instances) and pushes down the unobserved balls (instances). Thus, without strong priors, the confidence weights of the data easily reach extreme values (near one for the positive instances and near zero for the unobserved instances), so the unobserved data make little contribution to training. In this situation, all instances will be predicted as positive by MF and the model will overfit. In our FAWMF, this overfitting effect is mitigated by the inference network: the correlations between users' exposures can be viewed as elastic lines linking the balls. Typically, users tend to have similar exposures if they share more common communities. Naturally, as training proceeds, the unobserved data that are correlated with positive instances are pulled up by the force from the lines. This way, once the model has fitted the data well, the positive and unobserved instances settle at different depths in the water; the unobserved instances with stronger correlations to positive instances usually reach higher positions than the others.

Figure 2: Illustration of how FAWMF mitigates over-fitting

Fast learning algorithm fBGD

Learning a recommendation model from implicit feedback is computationally expensive due to the massive volume of unobserved data. The popular solution is stochastic gradient descent (SGD) with negative sampling. However, SGD usually suffers from slow convergence and gradient instability, which hurts model performance; in the recommendation task especially, the sampler usually selects uninformative instances that have small confidence weights and contribute little to the update. Thus, we turn to the batch gradient descent optimizer (BGD), which computes stable gradients from all feedback data. Unfortunately, the original BGD suffers from low efficiency due to the full-batch computation of the gradient over all instances. To address this problem, we develop a new learning algorithm, fBGD, for our FAWMF. We speed up BGD by caching specific intermediate variables to avoid massive repeated computations. Here we detail the derivation of one representative gradient; the counterparts for the other parameters are derived likewise and presented in the supplemental materials.

For ease of derivation, we introduce a shorthand for the per-datum loss. Also, we drop the regularization term in the objective function (Eq. (5)), since it usually makes no contribution to the recommendation accuracy of FAWMF and hinders fast training; equivalently, this can be regarded as setting a large value of the prior parameter. Then, the gradient of the objective function (Eq. (5)) for each user can be derived as follows:

(7)
(8)
(9)

Clearly, the computational cost lies in two parts: (1) the calculation of the per-item gradients, which requires a summation over the terms of each user; and (2) the summation over each item to get the final result. The first part dominates the cost, since it requires a traversal of all users repeated for every item, while the second part can be accelerated by iterating only over the positive instances. The resulting cost of the update scales with the full number of user-item pairs, which is generally infeasible since this number can easily reach the billion level or higher in practice.

To speed up the learning, we rewrite the computational bottlenecks – Eq. (8) – by isolating item-independent terms:

(10)

By this reformulation, we can see that the major computation, the sums over all users, is independent of the item. Thus, we can achieve a significant speed-up by caching these terms once per update: one cached term is a tensor and the other a D-dimensional vector. Then, the gradient can be calculated as follows:

(11)

This rearrangement of nested sums is the key transformation that enables fast optimization: the time complexity drops from being proportional to the full number of user-item pairs to being linear in the number of observed positive data. We spend one pass over the data on computing the cached terms, a constant per-item cost on the first and second terms in equation (11), and for the third term we only sum over the positive entries. A similar strategy accelerates the updates of the other parameters. With this algorithm, learning the FAWMF model is efficient and stable: it trains on all the data, but its complexity is linear in the number of observed data.
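The caching idea can be illustrated with a toy version of the rearrangement (a sketch; the actual fBGD caches additional terms). In a quantity like a per-item sum over all users of (u_u·v_i)·u_u, the user-only part Σ_u u_u u_uᵀ does not depend on the item, so it can be computed once and reused:

```python
import numpy as np

rng = np.random.default_rng(1)
n_users, n_items, D = 50, 40, 8
U = rng.normal(size=(n_users, D))
V = rng.normal(size=(n_items, D))

# Naive BGD-style computation: for every item, loop over all users.
naive = np.zeros_like(V)
for i in range(n_items):
    for u in range(n_users):
        naive[i] += (U[u] @ V[i]) * U[u]

# fBGD-style computation: the user sum  S_U = sum_u u_u u_u^T  is
# item-independent, so cache it once (cost O(n_users * D^2)) and reuse it
# for every item (cost O(D^2) each).
S_U = U.T @ U
fast = V @ S_U

assert np.allclose(naive, fast)   # identical result, far fewer operations
```

The same exchange-and-cache pattern underlies the eALS-style memoization cited in the related work.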

Experiments and analyses

In this section, we conduct experiments to evaluate the performance of FAWMF. Our experiments are intended to address the following major questions:

  1. Does FAWMF outperform state-of-the-art implicit MF methods?

  2. How does the proposed batch-based learning algorithm fBGD perform?

Experimental protocol

Datasets. Three benchmark datasets, Movielens, Amazon (food reviews), and Douban, are used in our experiments. These datasets contain users' feedback on items; the dataset statistics are presented in Table 1. Similar to [4], we preprocess the datasets so that all items have at least three interactions, and we “binarize” users' feedback into implicit feedback. That is, as long as there is some user-item interaction (a rating or a review), the corresponding implicit feedback is assigned a value of 1. Grid search and 5-fold cross-validation are used to find the best parameters. In our FAWMF, the same hyperparameter settings (with the relaxation constant set to 1e-5) and learning rate are used across all datasets.
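The preprocessing described above can be sketched as follows (a pure-Python sketch; the (user, item, rating) record format is our assumption):

```python
from collections import Counter

def binarize_feedback(interactions, min_item_interactions=3):
    """Keep items with at least `min_item_interactions` interactions, then
    turn every remaining (user, item, rating/review) record into an
    implicit feedback value of 1.
    """
    counts = Counter(item for _, item, _ in interactions)
    return [
        (user, item, 1)
        for user, item, _ in interactions
        if counts[item] >= min_item_interactions
    ]
```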

Datasets #Users #Items #Observed positive feedback
Movielens 6,040 3,952 1,000,209
Amazon 10,619 37,762 256,287
Douban 123,480 20,029 16,624,937
Table 1: Statistics of three datasets.
Methods | Adaptive weights? | Without sampling? | Complexity
WMF(ALS)
eALS
BPR
CDAE
EXMF
FAWMF
Table 2: Characteristics of the compared methods.
Methods Movielens Amazon Douban
Pre@5 Rec@5 NDCG@5 MRR Pre@5 Rec@5 NDCG@5 MRR Pre@5 Rec@5 NDCG@5 MRR
Item-pop 0.2092 0.0400 0.2201 0.8958 0.0027 0.0048 0.0055 0.0191 0.1409 0.0332 0.1582 0.6308
WMF(ALS) 0.3841 0.0924 0.4059 1.5751 0.0789 0.0406 0.0858 0.3269 0.2400 0.0656 0.2598 1.0113
eALS 0.3955 0.0917 0.4175 1.5998 0.0984 0.0348 0.1051 0.3951 0.2329 0.0646 0.2520 0.9880
BPR 0.3613 0.0798 0.3794 1.5023 0.0988 0.0469 0.1060 0.3969 0.2371 0.0582 0.2570 1.0093
CDAE 0.3786 0.0860 0.3950 1.5454 0.0948 0.0472 0.0947 0.3994 0.2377 0.0589 0.2573 1.0162
EXMF 0.3871 0.0936 0.4071 1.5720 0.0847 0.0418 0.0928 0.3683 0.2353 0.0666 0.2588 1.0016
FAWMF 0.4054 0.0949 0.4275 1.6279 0.1129 0.0441 0.1285 0.4470 0.2661 0.0680 0.2915 1.0984
Impv% 2.49% 1.41% 2.40% 1.76% 14.34% -6.42% 21.18% 11.93% 10.88% 2.11% 12.19% 8.09%
Table 3: The performance metrics of the compared methods. The boldface font denotes the winner in that column. The row ‘Impv’ indicates the relative performance gain of our FAWMF compared to the best result among the baselines. The marker indicates that the improvement is significant under a t-test.

Compared methods. We compare FAWMF with the following baselines; Table 2 summarizes their characteristics.

  • Item-pop: This is a simple baseline which recommends items based on global item popularity.

  • WMF(ALS) [10, 20]: The classic weighted matrix factorization model for implicit feedback data. The corresponding ALS-based [10] algorithm can reduce inference complexity.

  • eALS [7]: The improved weighted matrix factorization model, where the data confidence weights are assigned heuristically based on item-popularity.

  • BPR [22]: The classic pairwise method for recommendation, coupled with matrix factorization. BPR implicitly downweights the unobserved data through its uniform negative sampling strategy.

  • CDAE [28]: An advanced recommendation method based on auto-encoders, which generalizes WMF with more flexible components. CDAE also employs a uniform negative sampling strategy for learning.

  • EXMF [17]: A probabilistic model that directly incorporates user’s exposure to items into traditional matrix factorization.

Evaluation Metrics. We adopt four well-known metrics Precision@K (Pre@K), Recall@K (Rec@K), Normalized Discounted Cumulative Gain (NDCG@K) and Mean Reciprocal Rank (MRR) to evaluate recommendation performance: Recall@K (Rec@K) quantifies the fraction of consumed items that are in the top-K ranking list; Precision@K (Pre@K) measures the fraction of the top-K items that are indeed consumed by the user; NDCG@K and MRR evaluate ranking performance of the methods. Refer to our supplemental material for more details about these metrics.

Performance comparison (Q1)

Table 3 presents the performance of the compared methods on the evaluation metrics. The boldface font denotes the winner in each column. For clarity, the last row of Table 3 also shows the relative improvement achieved by our FAWMF over the best baseline. Generally speaking, with one exception, FAWMF outperforms all compared methods on all datasets for all metrics.

The improvement of FAWMF over these baselines can be attributed to three aspects. (1) In the real world, users have personalized communities and are thus exposed to diverse information. Correspondingly, the data confidence (exposure) varies across user-item combinations: some unobserved feedback is more likely attributable to the user's preference, while other unobserved feedback results from the user's limited awareness. By adaptively learning fine-grained data confidence weights from the data, FAWMF achieves better performance than the baselines with manually set coarse-grained confidence weights. (2) FAWMF models confidence weights with a community-based inference network, which captures latent interactions between users and mitigates the overfitting problem; this can be seen from the better performance of FAWMF over EXMF. (3) Instead of a sampling-based stochastic gradient descent optimizer (SGD), FAWMF employs a fast batch-based learning algorithm, which has better convergence and more stable results. We conduct specific experiments in the next subsection to validate this point.

Running time comparisons. Figure 3 depicts the running time of the six compared recommendation methods. As we can see, the speed-up of our FAWMF over EXMF is significant. Especially on the largest dataset, Douban, EXMF requires 56 hours for training, while FAWMF takes only 1.8 hours. The acceleration of FAWMF over EXMF can be attributed to two aspects: (1) the confidence weight of each datum is modeled with a parameterized function, which reduces the number of learned parameters; (2) fBGD speeds up the gradient calculation. This way, FAWMF achieves an analytical time complexity similar to the other compared methods, which aim at fast learning but sacrifice the flexibility of the confidence weights; their actual running times are also of the same magnitude. Also, we observe that the BGD-based methods (WMF, eALS, FAWMF) are relatively more efficient than the SGD-based methods (BPR, CDAE), although they have similar analytical time complexity. This is caused by the slow convergence of SGD, which usually requires more iterations.

Effect of batch-based learning algorithm (Q2)

In this subsection, we compare our fBGD with two popular SGD-based learning strategies: (1) Uniform-1X (Uniform-5X, Uniform-25X), which uniformly samples a subset of unobserved instances to update the model. Note that the training time and prediction accuracy are largely determined by the number of negative samples, so we test this strategy with three sampling sizes, where Uniform-25X (Uniform-5X, Uniform-1X) means the number of sampled instances is 25 (5, 1) times the number of positive instances. (2) Itempop-1X (Itempop-5X, Itempop-25X), where the probability of sampling an unobserved datum is proportional to the item's popularity. We also present the performance of the original BGD algorithm for running time comparisons. We do not compare with existing dynamic sampling strategies, since they either suffer from efficiency problems or require additional side information.
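The two baseline samplers can be sketched as follows (a sketch; the function name, the rejection loop, and the popularity smoothing are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_negatives(Y, n_samples, strategy="uniform"):
    """Draw unobserved (user, item) pairs to use as negatives.

    Y: (n_users, n_items) binary implicit feedback matrix.
    strategy: "uniform" draws items uniformly; "itempop" draws items
    proportionally to their (smoothed) popularity.
    """
    n_users, n_items = Y.shape
    if strategy == "uniform":
        item_p = np.full(n_items, 1.0 / n_items)
    else:  # "itempop"
        pop = Y.sum(axis=0) + 1.0            # smoothed item popularity
        item_p = pop / pop.sum()
    samples = []
    while len(samples) < n_samples:
        u = rng.integers(n_users)
        i = rng.choice(n_items, p=item_p)
        if Y[u, i] == 0:                     # keep only unobserved pairs
            samples.append((u, i))
    return samples
```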

Figure 4 presents Pre@5 of FAWMF on the Movielens dataset with different learning strategies, versus the number of iterations and the running time. There is an efficiency-effectiveness trade-off for the SGD-based learning algorithm as the sampling size varies. A smaller sampling size causes insufficient learning and gradient instability, as seen from the poor final performance and heavy fluctuations of Uniform-1X (Itempop-1X). Conversely, when the sampling size becomes larger, SGD performs better but spends much more time. However, even with a large sampling size, the stochastic optimizer cannot match the performance of the BGD-based learning algorithm. We can also observe the inefficiency of the original BGD: our proposed fBGD accelerates it by a factor of 16 and is even more efficient than Itempop-1X. Overall, our proposed fBGD outperforms the others in convergence, speed, and recommendation performance.

Effect of the parameter

Figure 3: Running time comparisons on (a) Movielens, (b) Amazon, (c) Douban.
Figure 4: Recommendation accuracy for different learning strategies versus the number of iterations and running time.
Figure 5: Impact of the parameter on (a) Movielens, (b) Amazon, (c) Douban.

Figure 5 shows the performance of FAWMF with varying latent dimension in MF. We also present the results of two closely related MF methods for comparison. First, with few exceptions, FAWMF consistently outperforms eALS and WMF across the latent dimensions, demonstrating the effectiveness of adaptively assigning personalized confidence weights. Second, all methods improve with a larger latent dimension, but too large a dimension carries a risk of overfitting, as seen from the worse results on the Movielens dataset.

Case study

Figure 6: Case study of two communities: the top rows show the top-5 items that have the highest exposure to the users in the two communities; the bottom rows show the male/female ratio and the average age of the users in each community. Note that this side information is not used in model training.

We also conduct a case study of the learned communities. We performed statistical analyses on the communities, two of which are presented in Figure 6. We find that the genres of the exposed items in community 1 are mainly “Horror” and “Thriller”, while the genres in community 2 are mainly “Comedy”. These results can be explained by the different kinds of users in the communities. Most users in community 1 are young men, who are more likely to enjoy exciting movies and share them with each other, while comedy is a popular topic for the elderly, who are the main constituents of community 2. These results validate the effectiveness of our FAWMF: although the genre information of users and items is not used in model training, the latent correlations between users/items are captured. Also, different users belong to different communities and thus have different exposures, confirming that personalized confidence weights are necessary for implicit recommendation models.

Conclusion

In this paper, we present a novel recommendation method, FAWMF, based on the variational auto-encoder, which achieves both adaptive weight assignment and efficient model learning. On the one hand, FAWMF models data confidence weights with a community-based inference neural network, which reduces the number of inferred parameters and is capable of capturing latent interactions between users. On the other hand, a specific batch-based learning algorithm, fBGD, has been developed to learn FAWMF quickly and stably. Experimental results on three real-world datasets demonstrate the superiority of FAWMF over existing implicit MF methods.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (Grant No: U1866602) and the National Key Research and Development Project (Grant No: 2018AAA0101503).

Appendix

Details of fBGD

Here we present the detailed derivation of our fBGD. Note that the constraint on the parameters ( , ) complicates the optimization problem. We re-parameterize them with a softmax function over new unconstrained parameters. The gradients of the objective function w.r.t. the parameters can then be derived as follows:

(12)
(13)
(14)
(15)
(16)
(17)
(18)
(19)
(20)
(21)
(22)
(23)
(24)
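The symbols in the gradient expressions above are lost in this extract, but the softmax re-parameterization itself can be sketched. Below is a minimal illustration (all names are hypothetical, not the paper's notation), assuming the constrained weights must be non-negative and sum to one:

```python
import numpy as np

# Illustrative sketch: weights constrained to be non-negative and sum to
# one are replaced by free parameters phi via pi = softmax(phi), so plain
# unconstrained gradient descent can be applied to phi.
def softmax(phi):
    e = np.exp(phi - phi.max())  # subtract max for numerical stability
    return e / e.sum()

phi = np.array([0.5, -1.0, 2.0])   # free parameters
pi = softmax(phi)                  # constrained weights
assert np.all(pi > 0) and np.isclose(pi.sum(), 1.0)

# Jacobian of the softmax, d pi_k / d phi_j = pi_k * (delta_kj - pi_j),
# which is what the chain rule multiplies in when deriving the gradients.
J = np.diag(pi) - np.outer(pi, pi)
```

The Jacobian structure is what makes the re-parameterized gradients cheap: it is a rank-one correction of a diagonal matrix, so no constraint projection step is needed during learning.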

As we can see, the computational bottleneck lies in computing (Eq.(7)), (Eq.(8)), (Eq.(5)) and (Eq.(6)), since each requires a traversal of all users (items) repeated over all items (users); the other terms can be accelerated by iterating only over the positive instances (). The overall computational complexity of an update is , which is generally infeasible since it can easily reach the billion level or even higher in practice.

To speed up the learning, we first rewrite the computational bottlenecks, Eq.(5)(6)(7)(8), by isolating item(user)-independent terms as follows:

(25)
(26)
(27)
(28)

By this reformulation, we can see that the major computation, the sums , over all users, is independent of item ; and the sums , over all items are independent of user . Thus we can achieve a significant speed-up by caching these terms. That is, we cache:

(29)
(30)
(31)
(32)

for each and:

(33)
(34)

for each . Here , are -dimensional tensors and , are -dimensional vectors; , are -dimensional tensors and , are -dimensional vectors. In this way, the computational bottlenecks Eq.(5)(6)(7)(8) can be calculated efficiently as follows:

(35)
(36)
(37)
(38)

The rearrangement of nested sums and the caching strategy avoid massive repeated computation. The time complexity is reduced from to . That is, although our fBGD trains on all feedback data, its complexity is linear in the number of observed data. Due to the sparsity of implicit feedback data, our FAWMF with the fBGD learning algorithm is efficient; our experimental results also validate this point. Overall, our fBGD algorithm is presented in Algorithm 1.

1:  Initialize parameters randomly;
2:  while not converged do
3:     Calculate intermediate tensors based on Eq.(18)(20)(22)(23);
4:     Calculate intermediate vectors based on Eq.(19)(21);
5:     for each user do
6:         Calculate gradients w.r.t. based on Eq.(1)(4)(5)(7-13)(24-27);
7:     end for
8:     for each item do
9:         Calculate gradients w.r.t. based on Eq.(2)(3)(6)(7-13)(24-27);
10:     end for
11:     Update the parameters based on the gradients;
12:  end while
Algorithm 1 Inference of FAWMF based on fBGD
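As a concrete, much-simplified stand-in for Algorithm 1, the sketch below runs full-batch gradient descent on a plain weighted-MF objective with one uniform weight w0 on all entries (the adaptive per-user weights and the inference network of FAWMF are omitted; objective and all names are hypothetical). The dense gradient term goes through cached Gram matrices, so each step iterates only over observed pairs:

```python
import numpy as np

# Toy stand-in objective: w0 * sum_{all (u,i)} s_ui^2
#                       + sum_{observed (u,i)} (1 - s_ui)^2 + L2, s_ui = p_u . q_i
rng = np.random.default_rng(1)
M, N, K = 30, 25, 6
w0, lam, lr = 0.05, 0.01, 0.01
obs = [(u, u % N) for u in range(M)]          # toy observed (user, item) pairs
P = 0.1 * rng.normal(size=(M, K))
Q = 0.1 * rng.normal(size=(N, K))

def loss():
    S = P @ Q.T                                # dense only for this small check
    return (w0 * np.sum(S ** 2)
            + sum((1.0 - S[u, i]) ** 2 for u, i in obs)
            + lam * (np.sum(P ** 2) + np.sum(Q ** 2)))

def grads():
    S_q, S_p = Q.T @ Q, P.T @ P                # cached K x K Gram matrices
    gP = 2 * w0 * P @ S_q + 2 * lam * P        # dense part via cache, no M*N scan
    gQ = 2 * w0 * Q @ S_p + 2 * lam * Q
    for u, i in obs:                           # sparse part: observed pairs only
        s = P[u] @ Q[i]
        gP[u] += 2 * (s - 1.0) * Q[i]
        gQ[i] += 2 * (s - 1.0) * P[u]
    return gP, gQ

before = loss()
for _ in range(200):
    gP, gQ = grads()
    P -= lr * gP
    Q -= lr * gQ
assert loss() < before                         # full-batch descent reduces loss
```

Per step, the cached dense part costs O((M + N) K^2) and the sparse part O(|obs| K), mirroring the complexity argument above.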

Evaluation Metrics

We adopt the following metrics to evaluate recommendation performance:

  • Recall@K (Rec@K): This metric quantifies the fraction of consumed items that appear in the top-K ranking list sorted by estimated preference scores. For each user u, we define R_u as the set of recommended items in the top-K and T_u as the set of consumed items in the test data for user u. Then we have:

    Rec@K = \frac{1}{|U|} \sum_{u \in U} \frac{|R_u \cap T_u|}{|T_u|} (39)
  • Precision@K (Pre@K): This measures the fraction of the top-K items that are indeed consumed by the user:

    Pre@K = \frac{1}{|U|} \sum_{u \in U} \frac{|R_u \cap T_u|}{K} (40)
  • Normalized Discounted Cumulative Gain@K (NDCG@K): This metric is widely used in information retrieval; it measures the quality of a ranking through position-discounted importance in the top-K recommendation list. In recommendation, NDCG is computed as follows:

    NDCG@K = \frac{1}{|U|} \sum_{u \in U} \frac{DCG_u@K}{IDCG_u@K} (41)

    where DCG_u@K is defined as follows and IDCG_u@K is the ideal value of DCG_u@K coming from the best possible ranking:

    DCG_u@K = \sum_{i \in R_u \cap T_u} \frac{1}{\log_2(rank_u(i) + 1)} (42)

    where rank_u(i) represents the rank of item i in the recommended list of user u.

  • Mean Reciprocal Rank (MRR): Given the ranking lists, MRR is defined as follows:

    MRR = \frac{1}{|U|} \sum_{u \in U} \frac{1}{|T_u|} \sum_{i \in T_u} \frac{1}{rank_u(i)} (43)

    MRR can be interpreted as the ease of finding all consumed items: higher values indicate that the consumed items rank higher in the list.
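A compact per-user reference implementation of these four metrics (illustrative code, not the authors'; ties and edge cases are ignored) might look like:

```python
import numpy as np

def user_metrics(ranked, consumed, k):
    """Illustrative per-user metrics.
    ranked: all items sorted by predicted score, best first;
    consumed: set of this user's test items; k: cutoff."""
    rec_k = ranked[:k]                                    # R_u
    n_hit = sum(1 for i in rec_k if i in consumed)
    recall = n_hit / len(consumed)                        # Rec@K
    precision = n_hit / k                                 # Pre@K
    dcg = sum(1.0 / np.log2(pos + 2)                      # pos is 0-based rank
              for pos, i in enumerate(rec_k) if i in consumed)
    idcg = sum(1.0 / np.log2(pos + 2)                     # best achievable DCG
               for pos in range(min(k, len(consumed))))
    ndcg = dcg / idcg                                     # NDCG@K
    mrr = float(np.mean([1.0 / (ranked.index(i) + 1) for i in consumed]))
    return recall, precision, ndcg, mrr

rec, pre, ndcg, mrr = user_metrics([3, 1, 4, 0, 2], {1, 2}, k=3)
# rec = 0.5 (one of two consumed items in the top-3), pre = 1/3,
# mrr = (1/2 + 1/5) / 2 = 0.35
```

The reported numbers in the paper average these per-user values over all test users.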

Footnotes

  1. Note that the EM algorithm presented in [17] is a special case of the classic variational inference.
  2. https://grouplens.org/datasets/movielens/
  3. https://www.kaggle.com/snap/amazon-fine-food-reviews
  4. https://www.cse.cuhk.edu.hk/irwin.king.new/pub/data/douban

References

  1. I. Bayer, X. He, B. Kanagal and S. Rendle (2017) A generic coordinate descent framework for learning from implicit feedback. In International Conference on World Wide Web, pp. 1341–1350.
  2. C. Chen, D. Li, Y. Zhao, Q. Lv and L. Shang (2015) WEMAREC: accurate and scalable recommendation through weighted and ensemble matrix approximation. In SIGIR, pp. 303–312.
  3. J. Chen, C. Wang, Q. Shi, Y. Feng and C. Chen (2019) Social recommendation based on users’ attention and preference. Neurocomputing 341, pp. 1–9.
  4. J. Chen, C. Wang, S. Zhou, Q. Shi, Y. Feng and C. Chen (2019) SamWalker: social recommendation with informative sampling strategy. In The World Wide Web Conference, pp. 228–239.
  5. T. Chen, Y. Sun, Y. Shi and L. Hong (2017) On sampling strategies for neural network-based collaborative filtering. In International Conference on Knowledge Discovery and Data Mining, pp. 767–776.
  6. X. He, L. Liao, H. Zhang, L. Nie, X. Hu and T. Chua (2017) Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, pp. 173–182.
  7. X. He, H. Zhang, M. Kan and T. Chua (2016) Fast matrix factorization for online recommendation with implicit feedback. In SIGIR, pp. 549–558.
  8. J. M. Hernández-Lobato, N. Houlsby and Z. Ghahramani (2014) Stochastic inference for scalable probabilistic modeling of binary matrices. In International Conference on Machine Learning, pp. 379–387.
  9. M. D. Hoffman, D. M. Blei, C. Wang and J. Paisley (2013) Stochastic variational inference. The Journal of Machine Learning Research 14 (1), pp. 1303–1347.
  10. Y. Hu, Y. Koren and C. Volinsky (2008) Collaborative filtering for implicit feedback datasets. In International Conference on Data Mining, pp. 263–272.
  11. J. Chen, Y. Feng, M. Ester, S. Zhou, C. Chen and C. Wang (2018) Modeling users’ exposure with social knowledge influence and consumption influence for recommendation. In International Conference on Information and Knowledge Management, pp. 953–962.
  12. C. C. Johnson (2014) Logistic matrix factorization for implicit feedback data. Advances in Neural Information Processing Systems 27.
  13. D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
  14. A. Krizhevsky, I. Sutskever and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
  15. J. Lee, S. Kim, G. Lebanon, Y. Singer and S. Bengio (2016) LLORMA: local low-rank matrix approximation. The Journal of Machine Learning Research 17 (1), pp. 442–465.
  16. D. Li, C. Chen, Q. Lv, H. Gu, T. Lu, L. Shang, N. Gu and S. M. Chu (2018) AdaError: an adaptive learning rate method for matrix approximation-based collaborative filtering. In The World Wide Web Conference, pp. 741–751.
  17. D. Liang, L. Charlin, J. McInerney and D. M. Blei (2016) Modeling user exposure in recommendation. In International Conference on World Wide Web, pp. 951–961.
  18. D. Liang, R. G. Krishnan, M. D. Hoffman and T. Jebara (2018) Variational autoencoders for collaborative filtering. In The World Wide Web Conference, pp. 689–698.
  19. G. Palla, I. Derényi, I. Farkas and T. Vicsek (2005) Uncovering the overlapping community structure of complex networks in nature and society. Nature 435 (7043), pp. 814–818.
  20. R. Pan, Y. Zhou, B. Cao, N. N. Liu, R. Lukose, M. Scholz and Q. Yang (2008) One-class collaborative filtering. In ICDM, pp. 502–511.
  21. D. H. Park and Y. Chang (2019) Adversarial sampling and training for semi-supervised information retrieval. In The World Wide Web Conference, pp. 1443–1453.
  22. S. Rendle, C. Freudenthaler, Z. Gantner and L. Schmidt-Thieme (2009) BPR: bayesian personalized ranking from implicit feedback. In Conference on Uncertainty in Artificial Intelligence, pp. 452–461.
  23. S. Rendle and C. Freudenthaler (2014) Improving pairwise learning for item recommendation from implicit feedback. In International Conference on Web Search and Data Mining, pp. 273–282.
  24. Q. Shi, C. Wang, J. Chen, Y. Feng and C. Chen (2019) Location driven influence maximization: online spread via offline deployment. Knowledge-Based Systems 166, pp. 30–41.
  25. Q. Shi, C. Wang, J. Chen, Y. Feng and C. Chen (2019) Post and repost: a holistic view of budgeted influence maximization. Neurocomputing 338, pp. 92–100.
  26. J. Wang, L. Yu, W. Zhang, Y. Gong, Y. Xu, B. Wang, P. Zhang and D. Zhang (2017) IRGAN: a minimax game for unifying generative and discriminative information retrieval models. In SIGIR, pp. 515–524.
  27. M. Wang, X. Zheng, Y. Yang and K. Zhang (2018) Collaborative filtering with social exposure: a modular approach to social recommendation. In AAAI.
  28. Y. Wu, C. DuBois, A. X. Zheng and M. Ester (2016) Collaborative denoising auto-encoders for top-N recommender systems. In International Conference on Web Search and Data Mining, pp. 153–162.
  29. H. Yu, M. Bilenko and C. Lin (2017) Selection of negative samples for one-class matrix factorization. In Proceedings of the 2017 SIAM International Conference on Data Mining, pp. 363–371.
  30. L. Yu, C. Zhang, S. Pei, G. Sun and X. Zhang (2018) WalkRanker: a unified pairwise ranking model with multiple relations for item recommendation. In Conference on Artificial Intelligence.
  31. T. Zhou (2011) Understanding online community user participation: a social influence perspective. Internet Research 21 (1), pp. 67–81.