Fast Adaptively Weighted Matrix Factorization for Recommendation with Implicit Feedback
Abstract
Recommendation from implicit feedback is a highly challenging task due to the lack of reliable negative feedback. A popular and effective approach for implicit recommendation is to treat unobserved data as negative but down-weight their confidence. Naturally, how to assign confidence weights and how to handle the large number of unobserved data are two key problems for implicit recommendation models. However, existing methods either pursue fast learning by manually assigning simple confidence weights, which lacks flexibility and may introduce empirical bias into the evaluation of users' preferences, or adaptively infer personalized confidence weights but suffer from low efficiency.
To achieve both adaptive weight assignment and efficient model learning, we propose a fast adaptively weighted matrix factorization (FAWMF) based on variational autoencoder. The personalized data confidence weights are adaptively assigned with a parameterized neural network (function), and the network can be inferred from the data. Further, to support fast and stable learning of FAWMF, a new batch-based learning algorithm, fBGD, has been developed, which trains on all feedback data but whose complexity is linear in the number of observed data. Extensive experiments on real-world datasets demonstrate the superiority of the proposed FAWMF and its learning algorithm fBGD.
Introduction
Recommender systems play an important role in many internet services. Since in practice most recommender systems only have implicit feedback (e.g. items consumed), research attention has recently been shifting from explicit feedback (e.g. rating prediction) to implicit feedback. However, learning a recommender system from implicit feedback is more challenging. In implicit feedback scenarios, only positive feedback is observed, and the unobserved user-item feedback data (e.g. a user has not bought an item yet) are a mixture of real negative feedback (i.e. a user does not like it) and missing values (i.e. a user just does not know it). Existing methods address this problem by treating all the unobserved data as negative (dislike) but down-weighting their confidence. Although this is reasonable, it poses two important questions: (1) How to assign confidence weights for each data point? (2) How to handle the massive volume of unobserved data efficiently?
While there is a rich literature on recommendation from implicit feedback, to our knowledge, all existing methods lack one or more desiderata. Most of these methods rely on a meticulous assignment of confidence weights to the data. For example, WMF [10] and eALS [7] assign uniform or popularity-based confidence weights; BPR [22], CDAE [28], and other sophisticated models down-weight the contributions of the negative data through heuristic (e.g. uniform) negative sampling strategies. However, choosing these weights usually involves heuristic alterations to the data and needs expensive exhaustive grid search via cross-validation. Furthermore, it is unrealistic for researchers to manually set flexible and diverse weights for millions of data points. In practical scenarios, the data confidence weights may change for various user-item combinations. Some unobserved data can be attributed to users' preferences while others are the result of users' limited scopes. To this end, exposure-based matrix factorization (EXMF) [17] has been proposed to down-weight negative data automatically by predicting whether a user knows an item. However, EXMF involves iterative inference of the exposure for each data point, which is computationally expensive and also potentially suffers from overfitting. More recently, SamWalker [4] adaptively assigns confidence weights from users' social contexts. However, social network information is hard to obtain in many recommender systems.
To address these problems, we propose a fast adaptively weighted matrix factorization (FAWMF) based on variational autoencoder [13] for both adaptive weight assignment and efficient learning. We first analyze EXMF from a variational perspective and find that the variational posterior of a user's exposure acts as the data confidence weight when learning users' preferences with MF. This is consistent with our intuition: only if a user is exposed to an item can he decide whether to consume it based on his preference, so the data with larger exposure are more reliable for deriving users' preferences. Further, we propose to replace the confidence weights (variational posterior of users' exposure) with a parameterized inference neural network (function), which reduces the number of inferred parameters and is capable of capturing latent correlations between users' exposures. In fact, the independence assumption on exposure is not practical in the real world. Typically, recent literature [19, 31] in social science claims that each of us belongs to some information-sharing communities. Naturally, users are exposed to these communities, and thus their exposures exhibit correlations when they belong to common communities. Motivated by this point, a specific community-based inference neural network has been designed, which explores latent communities among users and infers users' exposures from the communities they belong to. It captures more precise exposure and can mitigate the overfitting problem. By optimizing the variational lower bound, both the inference neural network and MF can be learned from the data.
Efficiently learning a recommendation model from implicit feedback is also challenging since it requires accounting for the large-scale unobserved data. Although a stochastic gradient descent optimizer (SGD) with a sampling strategy can be employed to speed up inference, it usually suffers from slow convergence and high gradient instability. Especially in the implicit recommendation task, the optimizer often samples uninformative data, which have low confidence and make limited contributions to the gradient update [4]; the final recommendation performance suffers. Instead, we turn to the batch gradient descent optimizer (BGD), which computes stable gradients from all feedback data and usually converges to a better optimum. Unfortunately, a naive implementation of BGD suffers from low efficiency caused by the expensive full-batch gradient computation over all data. To address this problem, we develop a fast BGD-based learning algorithm, fBGD, for our FAWMF. With rigorous mathematical reasoning, massive repeated computations of the original BGD can be avoided and the learning process can be accelerated. Notably, although fBGD computes gradients over all data, its actual complexity is linear in the number of observed positive data. Due to the sparsity of implicit feedback data, fBGD achieves a significant acceleration.
We summarize our key contributions as follows:

We propose a FAWMF model based on variational autoencoder to achieve both adaptive weight assignment and efficient model learning.

A batch-based learning algorithm fBGD has been developed to learn FAWMF efficiently and stably.

Extensive experimental evaluations on three well-known benchmark datasets demonstrate the superiority of our FAWMF over existing implicit MF methods.
Related work
Recommendation from implicit feedback data. Most existing methods manually assign coarse-grained confidence weights. For example, the classic weighted matrix factorization model (WMF) [10], eALS [7], and ICDMF [1] use simple heuristics where the confidence weights are assigned uniformly or based on item popularity; logistic matrix factorization [12], BPR [22], and neural-based recommendation models (e.g. CDAE [28], NCF [6]) down-weight the contribution of negative data implicitly through their heuristic (e.g. uniform) negative sampling strategies. More recently, a new probabilistic model, EXMF [17], was proposed to incorporate users' exposure to items into CF methods. When inferring users' preferences, users' exposure can be translated into data confidence. However, this method suffers from an efficiency problem. Also, some sophisticated models have been proposed to learn confidence weights from users' social contexts [27, 11, 4]. However, the social information is not available in many cases.
Efficient learning algorithms for implicit recommendation models. To handle the large-scale unobserved data, two types of strategies have been proposed for efficient learning: sample-based learning and whole-data-based learning. The first type achieves fast learning with stochastic gradient descent (SGD) and negative sampling. The most popular sampling strategy is to draw unobserved feedback data uniformly, which is widely adopted in LMF [12], BPR [22], CDAE [28], NCF [6], Mult-DAE [18], etc. Also, [29], [5] and [8] further propose item-popularity-based and item-user co-bias sampling strategies. However, in the recommendation task SGD usually suffers from slow convergence and gradient instability when the number of items is large [4].
Thus, other dynamic sampling strategies have been proposed to improve convergence and accuracy. Several works [23, 30, 21, 26] propose to oversample the "difficult" negative instances that are hard for the models to discriminate. However, these raise efficiency issues in sampling. Also, the stochastic gradient estimator is biased and may amplify the natural noise in users' feedback data [16]. More recently, SamWalker [4] conducts random walks along the social network to select informative instances. Although effective, it has two weaknesses: (1) SamWalker requires additional social information, which is not available in many recommender systems. (2) SamWalker still employs a uniform sampling strategy to update its dynamic sampler, leading to insufficient training of the sampler.
A more effective and stable way is to update the model from the whole data, but this faces an efficiency challenge due to the large number of negative data. Thus, memorization strategies (e.g. ALS, eALS) have been proposed [10, 7, 1, 3] to speed up learning. However, these algorithms are only suitable for MF with simple manual confidence weights, which lack flexibility and may create empirical bias; the recommendation performance suffers.
Adaptive weights assignment and efficient learning are both important in recommendation from implicit feedback. Existing methods are not able to provide both of these, which motivates the approach described in this paper.
Preliminaries
In this section, we first give the problem definition of implicit recommendation. Then, we introduce the exposure-based matrix factorization (EXMF) [17] framework from a variational perspective to provide useful insight into the relation between users' exposure and data confidence.
Problem definition
Suppose we have a recommender system with a user set U (including n users) and an item set I (including m items). The implicit feedback data is represented as a binary matrix X with entries x_ui denoting whether or not the user u has consumed the item i. The task of a recommender system can be stated as follows: for each user, recommend the items that he is most likely to consume.
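As a concrete illustration of this setup, the following sketch builds the binary feedback matrix from a hypothetical consumption log (the event list and sizes are made up for illustration):

```python
import numpy as np

# Hypothetical toy log of (user, item) consumption events; in a real
# system these would come from click/purchase logs.
events = [(0, 1), (0, 3), (1, 0), (2, 2), (2, 3)]

n_users, n_items = 3, 4

# Binary implicit-feedback matrix X: x_ui = 1 iff user u consumed item i.
X = np.zeros((n_users, n_items), dtype=np.int8)
for u, i in events:
    X[u, i] = 1

# All zero entries are "unobserved": a mixture of true dislikes and
# items the user simply never saw -- the core ambiguity of implicit feedback.
```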
Exposure-based matrix factorization (EXMF)
Note that unobserved feedback data contain both real negative data (dislike) and missing values (unknown). EXMF introduces a Bernoulli variable a_ui to model users' exposure: a_ui = 1 denotes that the user u knows the item i, and a_ui = 0 denotes that he does not. Then, EXMF models user's consumption x_ui based on a_ui as follows:
(1)  a_ui ~ Bernoulli(η_ui)
(2)  x_ui | a_ui = 1 ~ Bernoulli(f(p_u, q_i))
(3)  x_ui | a_ui = 0 ~ δ_0
where x_ui denotes the consumption of user u on item i; η_ui is the prior probability of exposure. Here we relax the delta function δ_0 as Bernoulli(ε) to make the model more robust, where ε is a small constant (e.g. ε = 1e-5). When a_ui = 0, we have p(x_ui = 1) ≈ 0, since the user does not know the item and cannot consume it. When a_ui = 1, i.e., when the user knows the item, he will decide whether or not to consume it based on his preference. Thus, p(x_ui = 1 | a_ui = 1) = f(p_u, q_i) is modeled with the classic matrix factorization model, where p_u denotes the D-dimensional latent factor (preference) of user u, and q_i denotes the D-dimensional latent factor (attribute) of item i.
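The generative story above can be sketched as a toy sampler. This is an illustrative sketch only: the concrete preference model f (a logistic MF score) and the constant prior η are assumptions made here for the example, not necessarily the paper's exact choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_users, n_items, D = 100, 50, 8
eps = 1e-5                       # relaxed delta: tiny consumption prob. when unexposed

P = rng.normal(scale=0.5, size=(n_users, D))   # user preference factors p_u
Q = rng.normal(scale=0.5, size=(n_items, D))   # item attribute factors q_i
eta = np.full((n_users, n_items), 0.3)         # prior exposure probability (assumed constant)

# Generative process of (relaxed) EXMF:
A = rng.random((n_users, n_items)) < eta       # a_ui ~ Bernoulli(eta_ui)
pref = sigmoid(P @ Q.T)                        # preference-based consumption probability
prob = np.where(A, pref, eps)                  # unexposed users almost never consume
X = (rng.random((n_users, n_items)) < prob).astype(np.int8)
```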
Analyses of EXMF from variational perspective
The marginal likelihood of EXMF is composed of a sum over the datapoints x_ui, and each term can be rewritten as:
(4)  log p(x_ui) = E_{q(a_ui)}[ log p(x_ui, a_ui) − log q(a_ui) ] + KL( q(a_ui) ‖ p(a_ui | x_ui) )
where q(a_ui) is defined as an approximated variational posterior of a_ui. Since the second term, the KL-divergence, is non-negative, the first term is the evidence lower bound (ELBO) on the marginal likelihood. Classic variational methods [9] usually employ a conjugate variational distribution with individual variational parameters, i.e., q(a_ui) = Bernoulli(γ_ui), which yields the objective:
(5)  L = Σ_{u,i} [ γ_ui log p(x_ui | a_ui = 1) + (1 − γ_ui) log p(x_ui | a_ui = 0) − KL( Bernoulli(γ_ui) ‖ Bernoulli(η_ui) ) ]
Exposure as data confidence. The objective function consists of three parts: (1) weighted matrix factorization to learn users' preferences; (2) the loss when the data are predicted as unknown; (3) the regularization term, the KL-divergence between the prior and the variational posterior. A good property is observed: the parameters γ_ui, which characterize the probability that user u is exposed to item i, act as the confidence weights of the corresponding data in learning users' preferences. This is consistent with our intuition. Only if a user is exposed to an item can he decide whether to consume it based on his preference; the data with larger exposure are more reliable for deriving users' preferences.
Inefficiency problem. EXMF suffers from efficiency problems. The number of inferred variational parameters (confidence weights) grows quickly with the numbers of users and items (n × m), which easily scales to the billion level or even larger. It potentially suffers from overfitting and becomes the time and space bottleneck for practical recommender systems. Further, the inference of EXMF requires summing over the terms for each data point, which is time-consuming.
Fast adaptively weighted matrix factorization
To address these problems, we propose a fast adaptively weighted matrix factorization (FAWMF) based on variational autoencoder to achieve both adaptive weight assignment and efficient model learning. To reduce the number of inferred parameters, FAWMF replaces the individual parameters γ_ui with a parameterized function that maps users' consumptions of items into their exposures. This is more reasonable since there exist interactions between users; the independence assumption on exposure is not practical in the real world. Typically, recent literature [19] in social science claims that each of us belongs to some information-sharing communities. Users are exposed to these communities, and thus their exposures exhibit commonality when they belong to similar communities. Motivated by this point, we summarize communities as collections of similar users and infer users' exposures from the communities they belong to. Concretely, the inference function can be modeled as follows:
(6)  γ_ui = Σ_k s_uk · σ( w · ( Σ_v s_vk h_v x_vi ) + b )
where the vector s_u denotes the membership that allocates each user u to a fixed number K of communities, satisfying Σ_k s_uk = 1 and s_uk ≥ 0. We accumulate the consumptions of the users in each community to infer how items are exposed to that community, where h_v captures the heterogeneous roles of users: different users may have different influence strengths on the communities [24, 25]. Then, a logistic linear function with parameters (w, b) is employed to map the accumulated consumptions into exposure. Intuitively, the more influential users in a common community have consumed an item, the more likely the user is to know it. Finally, a user's exposure can be depicted by how the user belongs to the communities and how items are exposed to these communities. Overall, the inference of the data confidence is translated into learning the parameters of the inference function, which captures latent correlations for better estimation and reduces the number of inferred parameters from O(nm) to O(nK). Different from existing clustering-based methods (e.g. [2, 15]), which cluster users into communities/sub-matrices based on users' preferences, our FAWMF deduces implicit communities based on users' exposure.
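A minimal sketch of this community-based inference function is given below. The exact parameterization (a scalar logistic link and a membership-weighted mixture over communities) is our reading of the prose and may differ in detail from the paper's implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def infer_exposure(X, S, h, w, b):
    """Illustrative sketch of the community-based inference function.

    X : (n, m) binary consumption matrix
    S : (n, K) community memberships, rows on the simplex
    h : (n,)   per-user influence strengths
    w, b : scalars of the logistic link
    Returns gamma : (n, m) inferred exposure probabilities.
    """
    # How strongly each item is exposed to each community: (K, m)
    community_exposure = sigmoid(w * (S * h[:, None]).T @ X + b)
    # User exposure = membership-weighted mixture over communities: (n, m)
    return S @ community_exposure

n, m, K = 6, 5, 2
rng = np.random.default_rng(1)
X = (rng.random((n, m)) < 0.3).astype(float)
S = rng.dirichlet(np.ones(K), size=n)     # rows sum to 1, entries >= 0
h = rng.random(n)
gamma = infer_exposure(X, S, h, w=1.0, b=-1.0)
```

Since each γ_ui is a convex combination of sigmoid outputs, the inferred exposures stay in (0, 1) by construction.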
Also, FAWMF can be well understood from a neural network perspective. As shown in Figure 1, first an outer product between each user's consumption vector and his membership vector is employed to model interactions. The obtained tensor can be regarded as the influence of the user's consumptions along different dimensions (communities). Then, we use a specific striped CNN [14] layer and a linear layer to encode the influences (consumptions) into the "kernel maps" of users' exposure. Here we choose a striped CNN since adjacent elements of the tensor merely correspond to adjacent user (item) ids and do not imply that they share more commonality. Further, the exposure for different user-item pairs can be depicted as a combination of the "kernel maps". Finally, the exposure-based MF acts as a probabilistic decoder to predict users' consumptions. Overall, FAWMF forms an autoencoder that learns both the data confidence weights and the latent factors of MF.
Discussion
How does FAWMF mitigate overfitting? To better understand this effect, let us draw an analogy with floating balls in water, as illustrated in Figure 2. Learning the EXMF model by optimizing equation (5) gives a force that pushes up the positive balls (instances) and pushes down the unobserved balls (instances). Thus, without strong priors, the confidence weights of the data easily reach extreme values (γ_ui ≈ 1 for the positive instances and γ_ui ≈ 0 for the unobserved instances), where the unobserved data make few contributions to training. In this situation, all instances will be predicted as positive by MF and the model will suffer from overfitting. In our FAWMF, this overfitting effect is mitigated by introducing the inference network of the confidence weights. There exist correlations between users' exposures, which can be pictured as elastic lines linking the balls. Typically, users tend to have similar exposures if they share more common communities. Naturally, as the model trains, the unobserved data that are correlated with positive instances will be pulled up by the force from the lines. This way, when the model has fitted the data well, the positive and the unobserved instances settle at different depths in the water. The unobserved instances that have stronger correlations with positive instances usually reach a higher position than other unobserved instances.
Fast learning algorithm fBGD
Learning a recommendation model from implicit feedback is computationally expensive due to the massive volume of unobserved data. The popular solution to this problem is stochastic gradient descent (SGD) with negative sampling. However, SGD usually suffers from slow convergence and gradient instability, which hurts the performance of the model. Especially in the recommendation task, the sampler often selects uninformative instances that have small confidence weights and make limited contributions to the update. Thus, we turn to the batch gradient descent optimizer (BGD), which computes stable gradients from all feedback data. Unfortunately, the original BGD suffers from low efficiency due to the full-batch computation of the gradient over all instances. To address this problem, we develop a new learning algorithm, fBGD, specific to our FAWMF. We speed up the learning of BGD by caching certain intermediate variables to avoid massive repeated computations. Here we detail the derivation of the gradient w.r.t. the user-side parameters; the counterparts for the other parameters are derived likewise and presented in the supplemental materials.
For ease of derivation, we first introduce some shorthand notation. Also, we drop the regularization term from the objective function (Eq. (5)), since it usually makes no contribution to the recommendation accuracy of FAWMF and hinders fast training; this can also be regarded as setting a large value for the corresponding prior parameter. Then, the gradient of the objective function (Eq. (5)) for each user can be derived as follows:
(7)  
(8)  
(9) 
Clearly, the computational cost lies in two parts: (1) the calculation of the per-item gradient terms, each of which requires a summation over all users; and (2) the summation over the items to obtain the final result. In fact, the first part produces the main cost, since it requires a traversal of all users repeated over all items, while the second part can be accelerated by iterating only over the positive instances (x_ui = 1). The overall computational complexity of the update is O(mnD), which is generally infeasible since mn can easily reach the billion level or even higher in practice.
To speed up the learning, we rewrite the computational bottleneck, Eq. (8), by isolating item-independent terms:
(10) 
By this reformulation, we can see that the major computation, the summations over all users, is independent of the item i. Thus, we can achieve a significant speedup by caching these terms once per iteration: a D×D matrix-like term and a D-dimensional vector term, shared across all items. Then, the gradient can be calculated as follows:
(11) 
The rearrangement of nested sums is the key transformation that enables the fast optimization. The time complexity is reduced from O(mnD) to a cost linear in the number of observed positive data: the cached terms are computed once per iteration independently of the items, the first and second terms in equation (11) are then computed for each item from the caches, and the third term only sums over the positive instances (x_ui = 1). A similar strategy can be used to accelerate the updates of the other parameters. With this algorithm, learning the FAWMF model is efficient and stable: it trains on all data, but its complexity is linear in the number of observed data.
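The flavor of the rearrangement can be demonstrated on a generic user-side summation of the kind that dominates the cost; this is a simplified stand-in for Eq. (10), not the exact FAWMF gradient:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, D = 200, 150, 8
P = rng.normal(size=(n, D))   # user factors
Q = rng.normal(size=(m, D))   # item factors

# Naive: for every item i, sum over ALL users -> O(n * m * D) work.
naive = np.stack([sum(P[u].dot(Q[i]) * P[u] for u in range(n)) for i in range(m)])

# fBGD-style: the user-side sum S = sum_u p_u p_u^T is item-independent,
# so cache it once (O(n * D^2)) and reuse it for every item (O(m * D^2)).
S = P.T @ P                   # (D, D) cached term
fast = Q @ S.T                # row i equals S @ q_i

assert np.allclose(naive, fast)
```

The same numbers come out of both paths, but the cached version never touches the n×m cross product, which is exactly why the cost drops to being linear in the sparse part of the data.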
Experiments and analyses
In this section, we conduct experiments to evaluate the performance of FAWMF. Our experiments are intended to address the following major questions:

Does FAWMF outperform state-of-the-art implicit MF methods?

How does the proposed batch-based learning algorithm fBGD perform?
Experimental protocol
Datasets. Three benchmark datasets, Movielens, Amazon, and Douban, are used in our experiments; Table 1 summarizes their statistics.
Table 1: Statistics of the datasets.

| Datasets | #Users | #Items | #Observed positive feedback |
|---|---|---|---|
| Movielens | 6,040 | 3,952 | 1,000,209 |
| Amazon | 10,619 | 37,762 | 256,287 |
| Douban | 123,480 | 20,029 | 16,624,937 |
Table 2: Characteristics and analytical complexities of the compared methods.

| Methods | Complexity |
|---|---|
| WMF(ALS) | |
| eALS | |
| BPR | |
| CDAE | |
| EXMF | |
| FAWMF | |
Table 3: Performance comparison. The four metric columns are reported for Movielens, Amazon, and Douban in turn.

| Methods | Pre@5 | Rec@5 | NDCG@5 | MRR | Pre@5 | Rec@5 | NDCG@5 | MRR | Pre@5 | Rec@5 | NDCG@5 | MRR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Itempop | 0.2092 | 0.0400 | 0.2201 | 0.8958 | 0.0027 | 0.0048 | 0.0055 | 0.0191 | 0.1409 | 0.0332 | 0.1582 | 0.6308 |
| WMF(ALS) | 0.3841 | 0.0924 | 0.4059 | 1.5751 | 0.0789 | 0.0406 | 0.0858 | 0.3269 | 0.2400 | 0.0656 | 0.2598 | 1.0113 |
| eALS | 0.3955 | 0.0917 | 0.4175 | 1.5998 | 0.0984 | 0.0348 | 0.1051 | 0.3951 | 0.2329 | 0.0646 | 0.2520 | 0.9880 |
| BPR | 0.3613 | 0.0798 | 0.3794 | 1.5023 | 0.0988 | 0.0469 | 0.1060 | 0.3969 | 0.2371 | 0.0582 | 0.2570 | 1.0093 |
| CDAE | 0.3786 | 0.0860 | 0.3950 | 1.5454 | 0.0948 | 0.0472 | 0.0947 | 0.3994 | 0.2377 | 0.0589 | 0.2573 | 1.0162 |
| EXMF | 0.3871 | 0.0936 | 0.4071 | 1.5720 | 0.0847 | 0.0418 | 0.0928 | 0.3683 | 0.2353 | 0.0666 | 0.2588 | 1.0016 |
| FAWMF | 0.4054 | 0.0949 | 0.4275 | 1.6279 | 0.1129 | 0.0441 | 0.1285 | 0.4470 | 0.2661 | 0.0680 | 0.2915 | 1.0984 |
| Impv% | 2.49% | 1.41% | 2.40% | 1.76% | 14.34% | 6.42% | 21.18% | 11.93% | 10.88% | 2.11% | 12.19% | 8.09% |
Compared methods. We compare FAWMF with the following baselines. Table 2 summarizes their characteristics.

Itempop: This is a simple baseline which recommends items based on global item popularity.

WMF(ALS) [10]: The classic weighted matrix factorization model, where all unobserved data share a uniform confidence weight and the model is learned with the ALS algorithm.

eALS [7]: The improved weighted matrix factorization model, where the data confidence weights are assigned heuristically based on item popularity.

BPR [22]: The classic pairwise method for recommendation, coupled with matrix factorization. BPR implicitly down-weights the unobserved data through its uniform negative sampling strategy.

CDAE [28]: An advanced recommendation method based on AutoEncoders, which is a generalization of WMF with more flexible components. CDAE also employs a uniform negative sampling strategy to learn the model.

EXMF [17]: A probabilistic model that directly incorporates user’s exposure to items into traditional matrix factorization.
Evaluation Metrics. We adopt four well-known metrics, Precision@K (Pre@K), Recall@K (Rec@K), Normalized Discounted Cumulative Gain (NDCG@K), and Mean Reciprocal Rank (MRR), to evaluate recommendation performance: Recall@K quantifies the fraction of consumed items that are in the top-K ranking list; Precision@K measures the fraction of the top-K items that are indeed consumed by the user; NDCG@K and MRR evaluate the ranking performance of the methods. Refer to our supplemental material for more details on these metrics.
Performance comparison (Q1)
Table 3 presents the performance of the compared methods in terms of the evaluation metrics. The boldface font denotes the winner in each column. For the sake of clarity, the last row of Table 3 also shows the relative improvements achieved by our FAWMF over the baselines. Generally speaking, with one exception, FAWMF outperforms all compared methods on all datasets for all metrics.
The improvement of FAWMF over these baselines can be attributed to three aspects: (1) In the real world, users have personalized communities and thus are exposed to diverse information. Correspondingly, the data confidence (exposure) varies across user-item combinations: some unobserved feedback is more likely attributable to users' preferences, while other unobserved feedback results from users' limited awareness. By adaptively learning fine-grained data confidence weights from the data, FAWMF achieves better performance than the baselines with manual coarse-grained confidence weights. (2) FAWMF models confidence weights with a community-based inference network, which is capable of capturing latent interactions between users and mitigating the overfitting problem. This can be seen from the better performance of FAWMF over EXMF. (3) Instead of employing a sampling-based stochastic gradient descent optimizer (SGD), FAWMF employs a specific fast batch-based learning algorithm, which has better convergence and more stable results. We also conduct specific experiments in the next subsection to validate this point.
Running time comparisons. Figure 3 depicts the running time of the six compared recommendation methods. As we can see, the speedup of our FAWMF over EXMF is significant. Especially on the largest dataset, Douban, EXMF requires 56 hours for training, while FAWMF only takes 1.8 hours. The acceleration of FAWMF over EXMF can be attributed to two aspects: (1) The confidence weights have been modeled with a parameterized function, which reduces the number of learned parameters. (2) fBGD speeds up the gradient calculations. This way, FAWMF achieves an analytical time complexity similar to the other compared methods, which aim at fast learning but sacrifice the flexibility of the confidence weights; their actual running times are also of the same magnitude. Also, we observe that the BGD-based methods (WMF, eALS, FAWMF) are relatively more efficient than the SGD-based methods (BPR, CDAE), although they have similar analytical time complexity. This is caused by the slow convergence of SGD, which usually requires more iterations.
Effect of batchbased learning algorithm (Q2)
In this subsection, we compare our fBGD with two popular SGD-based learning strategies: (1) Uniform1X (Uniform5X, Uniform25X), which uniformly samples a subset of the unobserved instances to update the model. Note that the training time and prediction accuracy are largely determined by the number of negative samples, so we test this strategy with three different sampling sizes, where Uniform25X (Uniform5X, Uniform1X) denotes that the number of sampled instances is 25 (5, 1) times the number of positive instances. (2) Itempop1X (Itempop5X, Itempop25X), whose probability of sampling an unobserved data point is proportional to item popularity. We also present the performance of the original BGD algorithm for running-time comparisons. We do not compare with existing dynamic sampling strategies, since they either suffer from efficiency problems or require additional side information.
Figure 4 presents Pre@5 of FAWMF on the Movielens dataset with different learning strategies, versus the number of iterations and the running time. There exists an efficiency-effectiveness trade-off for the SGD-based learning algorithms with varying sampling size. A smaller sampling size causes insufficient learning and gradient instability, as can be seen from the poor final performance and heavy fluctuations of Uniform1X (Itempop1X). Conversely, as the sampling size becomes larger, SGD performs better but spends much more time. However, even with a large sampling size, the stochastic optimizer cannot match the performance of the BGD-based learning algorithm. We can also observe the inefficiency of the original BGD. Our proposed fBGD accelerates the original BGD by a factor of 16 and even runs more efficiently than Itempop1X. Overall, our proposed fBGD outperforms the alternatives in convergence, speed, and recommendation performance.
Effect of the parameter D
Figure 5 shows the performance of FAWMF with varying latent dimension D in MF. We also present the results of two closely related MF methods for comparison. First, with few exceptions, FAWMF consistently outperforms eALS and WMF across D, demonstrating the effectiveness of adaptively assigning personalized confidence weights. Second, all methods can be improved with a larger D. But a large D carries the risk of overfitting, as can be seen from the worse results when D grows large on the Movielens dataset.
Case study
We also conduct a case study to examine the learned communities. We perform statistical analyses on the communities, two of which are presented in Figure 6. We find that the genres of the exposed items in community 1 are mainly "Horror" and "Thriller", while the genres in community 2 are mainly "Comedy". These results can be explained by the different kinds of users in the communities. Most of the users in community 1 are young men, who are more likely to enjoy exciting movies and share them with each other. Meanwhile, comedy is a popular topic for elderly people, who are the main constituents of community 2. These results validate the effectiveness of our FAWMF: although the genre information of users and items is not used in model training, the latent correlations between users/items can still be captured. Also, different users belong to different communities and thus have different exposures, so personalized confidence weights are necessary for implicit recommendation models.
Conclusion
In this paper, we present a novel recommendation method, FAWMF, based on variational autoencoder to achieve both adaptive weight assignment and efficient model learning. On the one hand, FAWMF models data confidence weights with a community-based inference neural network, which reduces the number of inferred parameters and is capable of capturing latent interactions between users. On the other hand, a specific batch-based learning algorithm, fBGD, has been developed to learn FAWMF quickly and stably. The experimental results on three real-world datasets demonstrate the superiority of FAWMF over existing implicit MF methods.
Acknowledgments
This work is supported by National Natural Science Foundation of China (Grant No: U1866602) and National Key Research and Development Project (Grant No: 2018AAA0101503).
Appendix
Details of fBGD
Here we present the detailed derivation of our fBGD. Note that the simplex constraint on the memberships (Σ_k s_uk = 1, s_uk ≥ 0) complicates the optimization. We therefore reparameterize s_u with a softmax function over new unconstrained parameters ŝ_u: s_uk = exp(ŝ_uk) / Σ_j exp(ŝ_uj). The gradients of the objective function w.r.t. the model parameters can then be derived as follows:
(12)  
(13)  
(14)  
(15)  
(16)  
(17)  
(18)  
(19)  
(20)  
(21)  
(22)  
(23)  
(24) 
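The softmax reparameterization of the membership constraint described above can be sketched as follows (the sizes are arbitrary, and the variable names are our own):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Free, unconstrained parameters s_hat for each user's community membership.
s_hat = np.random.default_rng(3).normal(size=(4, 3))   # 4 users, K = 3 communities
S = softmax(s_hat)

# The simplex constraints now hold by construction, so plain gradient
# descent on s_hat needs no projection step.
assert np.allclose(S.sum(axis=1), 1.0)
assert (S >= 0).all()
```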
As we can see, the computational bottlenecks lie in computing the terms of Eq. (7), Eq. (8), Eq. (5) and Eq. (6), since each requires a traversal of all users (items) repeated over all items (users); the other terms can be accelerated by iterating only over the positive instances (x_ui = 1). The overall computational complexity of the update is O(mnD), which is generally infeasible since mn can easily reach the billion level or even higher in practice.
To speed up the learning, we first rewrite the computational bottlenecks, Eqs. (5)–(8), by isolating item- (user-) independent terms as follows:
[Eqs. (25)–(28)]
From this reformulation, we can see that the major computation, namely the summations over all users, is independent of the item, and likewise the summations over all items are independent of the user. Thus, we can achieve a significant speedup by caching these terms. That is, we cache:
[Eqs. (29)–(32)]
and:
[Eqs. (33)–(34)]
where the user-side cached terms are tensors and vectors over the latent dimensions, and likewise for the item-side terms. In this way, the computational bottlenecks in Eqs. (5)–(8) can be efficiently calculated as follows:
[Eqs. (35)–(38)]
The rearrangement of nested sums and the caching strategy avoid massive repeated computations, reducing the time complexity from a cost proportional to the number of all user-item pairs to one linear in the number of observed interactions. That is, although our fBGD trains on all feedback data, its complexity is linear in the number of observed data. Owing to the sparsity of implicit feedback data, our FAWMF with the fBGD learning algorithm is therefore efficient; our experimental results also validate this point. The overall fBGD procedure is presented in Algorithm 1.
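The decoupling above follows the same idea as other whole-data MF learners: a sum of squared scores over all user-item pairs collapses once a small Gram-style cache over the user factors is built. The sketch below illustrates the trick in its simplest form (fBGD caches richer, community-weighted terms; this is not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 200, 150, 8
P = rng.normal(size=(n_users, k))   # user latent factors
Q = rng.normal(size=(n_items, k))   # item latent factors

# Naive: O(|U| * |I| * k) -- iterate over every user-item pair.
naive = sum((P[u] @ Q[i]) ** 2
            for u in range(n_users) for i in range(n_items))

# Cached: O((|U| + |I|) * k^2), using
#   sum_{u,i} (p_u^T q_i)^2 = sum_i q_i^T (sum_u p_u p_u^T) q_i
G = P.T @ P                              # k x k Gram matrix, cached once
fast = np.einsum('ik,kl,il->', Q, G, Q)  # reuse G for every item

assert np.isclose(naive, fast)
```

Since the latent dimension k is small (tens), the cached form makes the cost of the unobserved part negligible compared with iterating over the observed interactions, which is exactly why fBGD stays linear in the number of observed data.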
Evaluation Metrics
We adopt the following metrics to evaluate recommendation performance:

Recall@K (Rec@K): This metric quantifies the fraction of consumed items that appear in the top-K ranking list sorted by the estimated preference scores. For each user $u$, we define $R_u^K$ as the set of items recommended in the top-K and $T_u$ as the set of items consumed by user $u$ in the test data. Then we have:
$$\mathrm{Rec@K} = \frac{1}{|\mathcal{U}|}\sum_{u \in \mathcal{U}} \frac{|R_u^K \cap T_u|}{|T_u|} \tag{39}$$
where $\mathcal{U}$ denotes the set of users.
Precision@K (Pre@K): This measures the fraction of the top-K items that are indeed consumed by the user:
$$\mathrm{Pre@K} = \frac{1}{|\mathcal{U}|}\sum_{u \in \mathcal{U}} \frac{|R_u^K \cap T_u|}{K} \tag{40}$$
Normalized Discounted Cumulative Gain@K (NDCG@K): This metric is widely used in information retrieval and measures the quality of a ranking through position-discounted importance over the top-K recommendation list. In recommendation, NDCG is computed as follows:
$$\mathrm{NDCG@K} = \frac{1}{|\mathcal{U}|}\sum_{u \in \mathcal{U}} \frac{\mathrm{DCG}_u\mathrm{@K}}{\mathrm{IDCG}_u\mathrm{@K}} \tag{41}$$
where $\mathrm{DCG}_u\mathrm{@K}$ is defined as follows and $\mathrm{IDCG}_u\mathrm{@K}$ is the ideal value of $\mathrm{DCG}_u\mathrm{@K}$ obtained from the best possible ranking:
$$\mathrm{DCG}_u\mathrm{@K} = \sum_{i \in R_u^K \cap T_u} \frac{1}{\log_2(\mathrm{rank}_{ui} + 1)} \tag{42}$$
where $\mathrm{rank}_{ui}$ represents the rank of item $i$ in the recommended list of user $u$.

Mean Reciprocal Rank (MRR): Given the ranking lists, MRR is defined as follows:
$$\mathrm{MRR} = \frac{1}{|\mathcal{U}|}\sum_{u \in \mathcal{U}} \frac{1}{|T_u|}\sum_{i \in T_u} \frac{1}{\mathrm{rank}_{ui}} \tag{43}$$
MRR can be interpreted as the ease of finding all consumed items: higher values indicate that the consumed items appear higher in the list.
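The four metrics above are straightforward to compute given each user's ranked list and test set. The following is an illustrative sketch (not the authors' evaluation code), using binary relevance for NDCG and averaging MRR over all of a user's consumed items:

```python
import numpy as np

def evaluate(ranked_lists, test_sets, K):
    """Compute Rec@K, Pre@K, NDCG@K and MRR averaged over users.

    ranked_lists: {user: full ranked list of item ids, best first}
    test_sets:    {user: set of consumed item ids in the test data}
    """
    rec, pre, ndcg, mrr = [], [], [], []
    for u, ranking in ranked_lists.items():
        T_u = test_sets[u]
        if not T_u:
            continue
        topk = ranking[:K]
        n_hits = sum(1 for i in topk if i in T_u)
        rec.append(n_hits / len(T_u))
        pre.append(n_hits / K)
        # Binary-relevance DCG@K; enumerate() gives 0-based ranks,
        # so position r contributes 1/log2(r + 2).
        dcg = sum(1.0 / np.log2(r + 2)
                  for r, i in enumerate(topk) if i in T_u)
        idcg = sum(1.0 / np.log2(r + 2)
                   for r in range(min(K, len(T_u))))  # ideal ranking
        ndcg.append(dcg / idcg)
        # MRR averaged over all consumed items of the user (1-based ranks).
        rank = {i: r + 1 for r, i in enumerate(ranking)}
        mrr.append(sum(1.0 / rank[i] for i in T_u) / len(T_u))
    return np.mean(rec), np.mean(pre), np.mean(ndcg), np.mean(mrr)
```

For example, with one user whose full ranking is `['a', 'b', 'c', 'd']`, test set `{'a', 'c'}`, and K = 2, this yields Rec@2 = Pre@2 = 0.5 and MRR = (1/1 + 1/3)/2 = 2/3.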
Footnotes
 Note that the EM algorithm presented in [17] is a special case of the classic variational inference.
 https://grouplens.org/datasets/movielens/
 https://www.kaggle.com/snap/amazon-fine-food-reviews
 https://www.cse.cuhk.edu.hk/irwin.king.new/pub/data/douban
References
 (2017) A generic coordinate descent framework for learning from implicit feedback. In International Conference on World Wide Web, pp. 1341–1350.
 (2015) WEMAREC: accurate and scalable recommendation through weighted and ensemble matrix approximation. In SIGIR, pp. 303–312.
 (2019) Social recommendation based on users' attention and preference. Neurocomputing 341, pp. 1–9.
 (2019) SamWalker: social recommendation with informative sampling strategy. In The World Wide Web Conference, pp. 228–239.
 (2017) On sampling strategies for neural network-based collaborative filtering. In International Conference on Knowledge Discovery and Data Mining, pp. 767–776.
 (2017) Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, pp. 173–182.
 (2016) Fast matrix factorization for online recommendation with implicit feedback. In SIGIR, pp. 549–558.
 (2014) Stochastic inference for scalable probabilistic modeling of binary matrices. In International Conference on Machine Learning, pp. 379–387.
 (2013) Stochastic variational inference. The Journal of Machine Learning Research 14 (1), pp. 1303–1347.
 (2008) Collaborative filtering for implicit feedback datasets. In International Conference on Data Mining, pp. 263–272.
 (2018) Modeling users' exposure with social knowledge influence and consumption influence for recommendation. In International Conference on Information and Knowledge Management, pp. 953–962.
 (2014) Logistic matrix factorization for implicit feedback data. Advances in Neural Information Processing Systems 27.
 (2013) Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
 (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
 (2016) LLORMA: local low-rank matrix approximation. The Journal of Machine Learning Research 17 (1), pp. 442–465.
 (2018) AdaError: an adaptive learning rate method for matrix approximation-based collaborative filtering. In The World Wide Web Conference, pp. 741–751.
 (2016) Modeling user exposure in recommendation. In International Conference on World Wide Web, pp. 951–961.
 (2018) Variational autoencoders for collaborative filtering. In The World Wide Web Conference, pp. 689–698.
 (2005) Uncovering the overlapping community structure of complex networks in nature and society. Nature 435 (7043), pp. 814–818.
 (2008) One-class collaborative filtering. In ICDM, pp. 502–511.
 (2019) Adversarial sampling and training for semi-supervised information retrieval. In The World Wide Web Conference, pp. 1443–1453.
 (2009) BPR: Bayesian personalized ranking from implicit feedback. In Conference on Uncertainty in Artificial Intelligence, pp. 452–461.
 (2014) Improving pairwise learning for item recommendation from implicit feedback. In International Conference on Web Search and Data Mining, pp. 273–282.
 (2019) Location driven influence maximization: online spread via offline deployment. Knowledge-Based Systems 166, pp. 30–41.
 (2019) Post and repost: a holistic view of budgeted influence maximization. Neurocomputing 338, pp. 92–100.
 (2017) IRGAN: a minimax game for unifying generative and discriminative information retrieval models. In SIGIR, pp. 515–524.
 (2018) Collaborative filtering with social exposure: a modular approach to social recommendation. In AAAI, New Orleans, Louisiana, USA, 2018.
 (2016) Collaborative denoising auto-encoders for top-N recommender systems. In International Conference on Web Search and Data Mining, pp. 153–162.
 (2017) Selection of negative samples for one-class matrix factorization. In Proceedings of the 2017 SIAM International Conference on Data Mining, pp. 363–371.
 (2018) WalkRanker: a unified pairwise ranking model with multiple relations for item recommendation. In Conference on Artificial Intelligence.
 (2011) Understanding online community user participation: a social influence perspective. Internet Research 21 (1), pp. 67–81.