Addressing the Item Cold-start Problem by Attribute-driven Active Learning
In recommender systems, cold-start issues are situations where no previous events, e.g. ratings, are known for certain users or items. In this paper, we focus on the item cold-start problem. Both content information (e.g. item attributes) and initial user ratings are valuable for capturing users’ preferences on a new item. However, previous methods for the item cold-start problem either 1) incorporate content information into collaborative filtering to perform hybrid recommendation, or 2) actively select users to rate the new item without considering content information and then perform collaborative filtering. In this paper, we propose a novel recommendation scheme for the item cold-start problem by leveraging both active learning and items’ attribute information. Specifically, we design useful user selection criteria based on items’ attributes and users’ rating history, and combine the criteria in an optimization framework for selecting users. By exploiting the feedback ratings, users’ previous ratings and items’ attributes, we then generate accurate rating predictions for the other, unselected users. Experimental results on two real-world datasets show the superiority of our proposed method over traditional methods.
Recommender systems (RS) have become extremely common in recent years, and are applied in a variety of domains, from virtual community web sites like movielens.org to electronic commerce companies like amazon.com. In spite of the widespread application of RS, one difficult and common problem is the cold-start problem, where no prior events, like ratings or clicks, are known for certain users or items. The user cold-start problem may lead to the loss of new users due to the low accuracy of recommendations in the early stage. The item cold-start problem may make the new item miss the opportunity to be recommended and remain “cold” all the time. In this paper, we focus on the item cold-start problem, where recommendations are required for items that no one has yet rated.
Content information, such as item attributes, was exploited to address such issues in previous methods [hong:2013co, gantner:2010learning, hauger:2008comparison]. However, items with similar attributes may be of different interest to the same user. As shown in Figure 1 (data collected from IMDB, http://www.imdb.com/), the movie Taken was favored by many people after release, with a mean rating of 8.0. When the follow-up Taken 3 was first released in 2014, it could be seen as a “cold” film. Since the genres, screenwriters and many actors of the two films are the same, if we exploited film attributes to perform hybrid recommendations, we would recommend this “cold” film to users who favored Taken before. However, as can be seen from the figure, the peak of Taken 3’s overall ratings moves down to rating 6, which means that many users who favored Taken would give low ratings to Taken 3. One reason could be that, although Taken 3 and Taken have many attributes in common, Taken 3 has a lower quality than Taken; thus users who favored Taken before may dislike Taken 3, i.e. recommending Taken 3 to users who favored Taken might not be accurate. Therefore, it is not safe to handle the cold-start issue based on film attributes only. A natural solution is to select a small set of users to watch this “cold” film first, whose feedback gives us more understanding of users’ preferences on it. Then we can perform more accurate recommendations. Interestingly, this is similar to the key idea of active learning in the machine learning literature [rubens:2015active].
Most works that apply active learning to recommender systems focus on the user cold-start problem [elahi:2014active, rashid:2008learning, golbandi:2011adaptive]. New users’ preferences are typically obtained by directly interviewing new users about their interests, or by asking them to rate several items from carefully constructed seed sets. Seed sets may be constructed based on popularity, contention and coverage [rashid:2008learning], and the items in them are rated by every new user. However, the item cold-start problem is different: items cannot be interviewed, and typically no user is willing to rate every new item. Thus we need to construct different user sets to rate different new items, ensuring that the same users are not always selected for rating requests. In addition, the user set for each new item must be carefully constructed so that we can learn as much as possible about the new item given a limited number of rating requests. However, little work has been done to address the item cold-start problem by active learning. [aharon:2015excuseme, anava:2015budget] use the active learning idea but ignore items’ attribute information; meanwhile, they select users based on limited criteria. In fact, the new item’s attributes give us some understanding of this item and can be exploited to improve the user selection strategy. For example, we tend to select users who favor attributes present in the new item, since these users are more willing to give ratings.
In this paper, we propose a novel recommendation framework for the item cold-start problem, where items’ attributes are exploited to improve active learning methods in recommender systems. The attribute-driven active learning scheme has the following characteristics:
Explicitly distinguishing 1) whether a user will rate the new item and 2) what rating the user will give to the new item. The former helps us to select users who are willing to give ratings to the new item (feedback ratings). The latter allows us to exploit the rating distribution to improve the selection strategy. For example, we expect to select users who give diverse ratings in order to generate unbiased predictions. This is easy to understand: if we select users who all give high ratings, the trained prediction model will generate highly biased (high) ratings for all other users, even though those users may not favor the new item at all.
Personalized selection strategy to ensure fairness. We construct our selection strategy based on four criteria, of which two are personalized criteria. The personalized criteria ensure that for new items with different attributes, users selected by our method would be different. This can avoid selecting the same user to rate every new item, which will negatively influence the user experience. These criteria are uniformly modeled as an integer quadratic programming (IQP) problem, which can be efficiently solved by some relaxation.
Dynamic active learning budget. In previous active learning works [huang:2007selectively, anava:2015budget], the budget of active learning (i.e. the number of users selected for rating requests) for a new item is fixed. However, in real-world applications, 1) some new items attract the attention of only a small set of users (not popular), e.g. films with unpopular actors and directors, 2) some will obviously be favored by almost all users (popular and not controversial), e.g. Harry Potter and the Deathly Hallows: Part 2 (http://www.imdb.com/title/tt1201607/), while 3) others are popular but controversial, and the recommender is not sure about users’ preferences on them, e.g. although Taken 3 is famous, the quality of the films its main actors previously appeared in varies a lot, so it is difficult to predict users’ preferences on Taken 3. It is the items in the third case that need more feedback ratings, so that more can be learned about them. In this paper, we are the first to propose a dynamic active learning budget, so that the limited active learning resources are properly distributed, which improves the overall prediction accuracy.
Considering exploitation, exploration and their trade-off. Traditional active learning methods aim at maximizing the performance measured on unselected instances in the prediction phase [chattopadhyay:2012batch, chakraborty:2015batchrank], regardless of the cost in the active learning phase, since they assume the labeling cost for each instance is the same. However, in our active learning phase, we prefer a rating request to a user who is willing to rate the item rather than to one who is not, because the latter negatively influences the user experience. Our solutions are inspired by [rubens:2007influence, rokach:2008pessimistic], which try to maximize the sum of rewards by balancing the trade-off between exploitation and exploration. The rewards in our task contain two parts, i.e. the user experience in the active learning phase and in the prediction phase, respectively. By exploiting “existing knowledge” (exploitation) from the model trained in Figure 3 (b), we are able to select willing users and thus obtain good user experience in the active learning phase. For the user experience in the prediction phase, we want to learn as much “new knowledge” (exploration) about unselected users’ preferences as possible, so as to generate accurate rating predictions for them. Note that the users that best satisfy exploitation may not be the most helpful for exploration. Therefore, the “exploitation-exploration trade-off” in our task lies in how we optimize our user selection strategy in order to obtain relatively good user experience in both the active learning phase and the prediction phase. Our method considers both goals and can further balance their trade-off by adjusting the parameter setting.
2 Related Work
2.1 The Item Cold-start Problem
To address the item cold-start problem, a common solution is to perform hybrid recommendations by combining content information and collaborative filtering [agarwal:2009regression, gunawardana:2008tied, park:2009pairwise, nasery:2016recommendations]. A regression-based latent factor model is proposed in [agarwal:2009regression] to address both cold and warm item recommendations in the presence of item features. Items’ latent factors are obtained by low-rank matrix decomposition. [park:2009pairwise] improve this work by solving a convex optimization problem instead of the matrix decomposition. Another approach based on Boltzmann machines is proposed in [gunawardana:2008tied, gunawardana:2009unified] to solve the item cold-start problem, which also combines content and collaborative information. LCE [saveski:2014item] exploits the manifold structure of the data to improve the performance of hybrid recommendations. Other works operate under a different setting where a few ratings of new items exist but no item attribute information is known. [aharon:2012dynamic, aizenberg:2012build] use a linear combination of raters’ latent factors weighted by their ratings to estimate new items’ latent factors.
2.2 Active Learning in Recommender Systems
Most active learning methods in recommender systems focus on the user cold-start problem, where they select items to be rated by newly-signed users [rubens:2015active, elahi:2016survey]. We briefly introduce these methods since most of them can also be adapted to our new item task. The Popularity strategy [golbandi:2010bootstrapping, golbandi:2011adaptive] and the Coverage strategy [golbandi:2010bootstrapping] are two representative attention-based methods, where the former one selects items that have been frequently rated by users and the latter one selects items that have been highly co-rated with other items. Uncertainty reduction methods aim at reducing the uncertainty of rating estimates [golbandi:2010bootstrapping, rubens:2007influence, rokach:2008pessimistic], model parameters [hofmann:2003collaborative, jin:2004bayesian] and decision boundaries [danziger:2007choosing]. Error reduction methods try to reduce the prediction error on the testing set by either 1) optimizing the performance measure (e.g. minimizing RMSE) on the training set [golbandi:2010bootstrapping, golbandi:2011adaptive], or 2) directly controlling the factors that influence the prediction error on the testing set [rubens:2009output, settles:2008multiple]. [harpale:2008personalized] uses some initial ratings to perform personalized active learning in a non-attribute context. There are also combined strategies [rubens:2007influence, mello:2010active, elahi:2012adapting] considering several objectives at the same time. When applied to our new item task, some of these works require a few initial ratings on new items, which are not available in our task. The other works do not need initial ratings, but perform active learning regardless of the content information. However, the new item’s content information gives us some understanding of the new item and we can exploit it to better perform active learning. 
In addition, methods such as the Popularity strategy [golbandi:2010bootstrapping, golbandi:2011adaptive] and the Coverage strategy [golbandi:2010bootstrapping] always select the same set of users, which negatively influences the user experience.
[anava:2015budget, aharon:2015excuseme] are works which also address the item cold-start problem in an active learning scheme. However, they both focus on the pure collaborative filtering model and do not consider the content information either.
2.3 The Exploitation-exploration Trade-off
Some works also consider the exploitation-exploration trade-off [rubens:2007influence, rokach:2008pessimistic]. Many of the promising solutions come from the study of the multi-armed bandit problem [feldman:2015recommendations]. The key idea of these solutions is to simultaneously optimize one’s decisions based on existing knowledge (i.e. exploitation) and new knowledge which would be acquired through these decisions (i.e. exploration), in order to maximize the sum of rewards earned through a sequence of actions. The $\epsilon$-Greedy algorithm [watkins:1989learning] selects the arm with the best estimated mean reward with probability $1-\epsilon$, and otherwise selects another arm uniformly at random. UCB-like (UCB refers to Upper Confidence Bound) algorithms [auer:2002using, dani:2008stochastic, abbasi:2011improved] first calculate the confidence bound of all arms and then select the arm with the largest upper confidence bound. The insight is that arms with a large mean reward (exploitation) and high uncertainty (exploration) have a large upper confidence bound. Thompson Sampling algorithms [thompson:1933likelihood, agrawal:2012analysis, russo:2014learning] first compute the probability distribution of the mean reward for each arm, then draw a value from each distribution and finally select the arm with the largest drawn value.
Our task and the multi-armed bandit problem share some common features, e.g. both considering the exploitation-exploration trade-off. However, they have some key differences. In the multi-armed bandit problem, the arms are selected one by one and a reward is generated immediately after each arm is selected. Hence, many solutions (e.g. UCB-like algorithms, Thompson Sampling algorithms) design their selecting strategies based on previous rewards. However, in our setting, a batch of users are selected at the same time, without knowing other users’ feedback (reward), thus many solutions for the multi-armed bandit problem cannot be applied to our task.
3 Preliminaries and Model
3.1 Task Definition and Solution Overview
We use $\mathcal{U}$, $\mathcal{I}$ and $\mathcal{A}$ to denote the user, item and attribute sets, respectively. Our task is: given a user-item rating matrix $R$, an item-attribute matrix $A$ and a new item $v$ whose attributes are denoted as a vector $\mathbf{a}_v$, predict users’ ratings on the new item $v$. $R$, $A$ and $\mathbf{a}_v$ are shown in Figure 2. In this paper, the task is solved in two phases. The first is the active learning phase, which solves: which users should be selected to rate $v$, so as to learn as much about $v$ as possible. The second is the prediction phase, which solves: given the selected users’ feedback, how to accurately predict the other users’ ratings on $v$. For the active learning phase, we carefully select users based on four useful criteria, which involve both classification tasks and regression tasks. For the prediction phase, we model it as a pure regression task. We use Factorization Machines [rendle:2012factorization] to model all classification and regression tasks in these two phases.
3.2 Factorization Machines
Factorization Machines (FM) is a state-of-the-art framework for latent factor models which can incorporate rich features. Here we briefly introduce it; please refer to [rendle:2012factorization] for a more detailed description. The prediction problem is described by a matrix $X$ and a vector $\mathbf{y}$, where each row $\mathbf{x}$ of $X$ is one instance with $p$ real-valued features and each entry of $\mathbf{y}$ is the label of one instance. FM can model nested feature interactions up to an arbitrary order between the features of $X$. Feature interactions up to second order are modeled as:

$\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{p} w_i x_i + \sum_{i=1}^{p} \sum_{j=i+1}^{p} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j \qquad (1)$

where $k$ is the dimensionality of the factorization and the model parameters are $w_0 \in \mathbb{R}$, $\mathbf{w} \in \mathbb{R}^{p}$ and $V \in \mathbb{R}^{p \times k}$; $w_i$ and $\mathbf{v}_i$ denote the entries of $\mathbf{w}$ and the rows of $V$, respectively. The form of FM (Equation (1)) is very general and can be applied to many applications.
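As an illustration, Equation (1) can be evaluated in $O(pk)$ time using the well-known reformulation of the pairwise term, $\sum_{i<j} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j = \frac{1}{2}\big(\|V^{T}\mathbf{x}\|^2 - \sum_i \|\mathbf{v}_i\|^2 x_i^2\big)$. The following NumPy sketch (not the implementation used in the paper) computes a single FM prediction this way:

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Second-order FM prediction from Equation (1).

    x : (p,) feature vector; w0 : global bias; w : (p,) linear weights;
    V : (p, k) factor matrix. The pairwise term uses the O(p*k) identity
    sum_{i<j} <v_i, v_j> x_i x_j
        = 0.5 * (||V^T x||^2 - sum_i ||v_i||^2 x_i^2).
    """
    linear = w0 + w @ x
    interaction = 0.5 * (np.sum((V.T @ x) ** 2) - np.sum((V ** 2).T @ (x ** 2)))
    return linear + interaction
```

The result matches the naive double loop over all feature pairs, but scales linearly in the number of features.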
In our task, we need to predict 1) what rating a user will give to an item and 2) whether a user will rate an item.
For the regression task to predict what rating a user will give to an item, features can contain users, items and attributes of items, and labels are ratings. The first term on the right-hand side of Equation (1) is a bias of the system. If $w_0$ is large, then there is a bias towards high ratings, which may be due to the good user experience of the system design. The second term is a bias of unary features: some optimistic users tend to give high ratings to every item, and some popular items, or items with popular attributes, always gain high ratings. The last term is a bias of feature interactions: many users only give high ratings to certain items (or items with certain attributes) which they are really interested in.
For the classification task to predict whether a user will rate an item, the analysis of each term in Equation (1) is similar, except that labels now represent whether users will rate items or not.
4 Our Method
4.1 Select Users to Rate the New Item
As described in Section 3.1, we first need to carefully select users to rate the new item $v$, so as to learn as much about $v$ as possible. Users are selected based on the following four criteria.
(1) Selected users should have a high probability of rating $v$. This can be modeled as a classification task. To achieve this, we first transform the user-item rating matrix $R$ into a 0-1 matrix [koren:2008factorization], where all entries with ratings are assigned 1 and all entries with no ratings are assigned 0 (see Figure 3 (a)). Then we use FM to model them [deldjoo:2016using], where all entries equal to 1 are regarded as positive instances, and an equal number of negative instances are sampled from the entries equal to 0. Features contain users and attributes (without items). Labels are 1 for positive instances and 0 for negative instances. The classification model is trained based on $R$ and $A$. The general process is shown in Figure 3 (b).
Finally, a vector $\mathbf{p} = (p_1, \ldots, p_n)$ is defined, where $p_u$ is the probability that user $u$ will rate item $v$, as predicted by our learned classification model. We tend to select $u$ if $p_u$ is large.
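As a toy illustration of this construction (hypothetical data; the FM classifier itself is omitted, and any probabilistic classifier could be trained on the resulting instances), the binarization and negative sampling can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy rating matrix R (0 = no rating) and item-attribute matrix A.
# Both are hypothetical data for illustration only.
R = np.array([[5, 0, 3],
              [0, 4, 0],
              [1, 0, 0]])
A = np.array([[1, 0],   # item 0 carries attribute 0
              [0, 1],   # item 1 carries attribute 1
              [1, 1]])  # item 2 carries both

def build_classification_set(R, A, rng):
    """Binarize R and pair each positive (user, item) cell with one sampled
    negative cell; each instance is [user one-hot ++ item attributes]
    (users and attributes only, no item ids), labeled 1/0."""
    n_users, _ = R.shape
    pos = np.argwhere(R > 0)          # rated cells -> label 1
    neg_pool = np.argwhere(R == 0)    # unrated cells -> candidates for label 0
    neg = neg_pool[rng.choice(len(neg_pool), size=len(pos), replace=False)]
    X, y = [], []
    for (u, i), label in [(c, 1) for c in pos] + [(c, 0) for c in neg]:
        X.append(np.concatenate([np.eye(n_users)[u], A[i]]))
        y.append(label)
    return np.array(X), np.array(y)

X, y = build_classification_set(R, A, rng)
```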
(2) Selected users’ potential ratings should be diverse. Potential ratings are users’ ratings on $v$ estimated purely from $v$’s attributes (there is no feedback yet). We expect the selected users’ potential ratings to be diverse, so that: 1) the selected users tend to have different interests, and their ratings provide more information than the ratings of similar users, and 2) the final prediction model trained on their feedback is not biased towards a fixed region of ratings. To choose users with diverse potential ratings, we first train a regression model based on $R$ and $A$, as shown in Figure 3 (c). In this regression model, features contain users and attributes (without items), and labels are users’ ratings. Once the regression model is learned, all users’ potential ratings on the new item can be estimated. Secondly, pair-wise diverse values among all these ratings are calculated to form the diversity matrix $D$. The diverse value $d_{ij}$ between the potential ratings $\hat{r}_i$ and $\hat{r}_j$ of $u_i$ and $u_j$ is defined as their absolute difference:

$d_{ij} = |\hat{r}_i - \hat{r}_j|$
Calculating $D$ is computationally expensive. However, the calculations of the diverse values are independent of each other. Therefore, when applied to real-world recommender systems, they can be performed in parallel with acceleration techniques such as GPU acceleration [tarditi:2006accelerator], distributed computing [bokhari:2012assignment], etc. We tend to select $u_i$ and $u_j$ together if $d_{ij}$ is large.
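Assuming the pairwise diverse value is the absolute difference between two potential ratings (an assumption of this sketch), the whole matrix $D$ can be computed without an explicit double loop via vectorized broadcasting, one simple form of the acceleration mentioned:

```python
import numpy as np

def diversity_matrix(potential):
    """Pairwise diversity D[i, j] = |r_i - r_j| over all users' potential
    ratings, computed in one vectorized shot instead of a double loop."""
    r = np.asarray(potential, dtype=float)
    return np.abs(r[:, None] - r[None, :])

# Three hypothetical potential ratings on the new item.
D = diversity_matrix([4.5, 2.0, 3.0])
```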
(3) Selected users’ generated ratings should be objective. A rating on an item is objective if it approximates the average of all ratings on this item, which is a good estimation of this item’s quality [duan:2008online, chevalier:2006effect, dellarocas:2004exploring]. We favor selecting users who have always generated objective ratings in the past, since they are then expected to also generate objective ratings for $v$. We form a vector $\mathbf{o}$, which consists of all users’ objective values. The objective value of user $u$ is defined as:

$o_u = \dfrac{1}{|I_u|} \sum_{i \in I_u} |r_{u,i} - \bar{r}_i| + \dfrac{\beta}{|I_u|}$

where $I_u$ is the set of items that $u$ has rated, $r_{u,i}$ is $u$’s rating on $i$, and $\bar{r}_i$ is the mean rating on $i$. The term $\beta / |I_u|$ is a penalty for users who have rated few items, since a user may generate a rating that approximates $\bar{r}_i$ by coincidence. Note that a smaller $o_u$ indicates that $u$ is more objective, so we tend to select $u$ if $o_u$ is small.
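A minimal sketch of the objective value, assuming the deviation from item means is averaged over the user's rated items and the small-history penalty has the form $\beta / |I_u|$ (both are assumptions of this sketch, with `beta` a hypothetical hyperparameter):

```python
def objectivity(user_ratings, item_means, beta=1.0):
    """Average absolute deviation of a user's ratings from the item mean
    ratings, plus a beta/|I_u| penalty for short rating histories (the exact
    penalty form is an assumption); smaller values mean a more objective user.

    user_ratings : {item_id: rating} for this user
    item_means   : {item_id: mean rating over all users}
    """
    devs = [abs(r - item_means[i]) for i, r in user_ratings.items()]
    return sum(devs) / len(devs) + beta / len(devs)

# Hypothetical user history: two rated items, each 0.5 away from its mean.
o_u = objectivity({0: 4.0, 1: 2.0}, {0: 4.5, 1: 2.5})
```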
Users selected with this criterion give ratings that better reflect the quality of items, i.e. higher for items with better quality and vice versa. Therefore, with this criterion, the re-trained prediction model generates overall higher/lower predicted ratings for new items with better/worse quality, which is more reasonable. This criterion is a complement to Criterion (2). Criterion (2) encourages the feedback ratings of selected users to have a large variance, which increases the model’s power to differentiate between users. With Criterion (3), we want the average of the feedback ratings to be higher/lower for items with better/worse quality, which increases the model’s power to differentiate between new items.
(4) Selected users should be representative. A selected user is representative if this user is similar to unselected users. Selected users should be representative so that, from their feedback, we can learn more about the preferences of unselected users. To achieve this, we first construct a similarity matrix $S$ from users’ rating history. That is, each user $u$ is represented as the corresponding row vector $\mathbf{r}_u$ of the user-item matrix $R$, and the similarity between two users is measured on their vectors. In this paper, cosine similarity is used, so the entry $s_{uu'}$ of $S$ is defined as:

$s_{uu'} = \cos(\mathbf{r}_u, \mathbf{r}_{u'}) = \dfrac{\mathbf{r}_u \cdot \mathbf{r}_{u'}}{\|\mathbf{r}_u\| \, \|\mathbf{r}_{u'}\|}$

We tend to select one of $u$ and $u'$ if $s_{uu'}$ is large. The acceleration techniques described under Criterion (2) can also be applied to calculating $S$. Criterion (2) and Criterion (4) are highly related to the avoiding redundancy and ensuring representativeness described in [chattopadhyay:2012batch], which can be interpreted from the perspective of minimizing the distribution difference between the labeled and unlabeled data.
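The cosine-similarity matrix $S$ can be computed in one shot by row-normalizing the rating matrix; a minimal NumPy sketch:

```python
import numpy as np

def cosine_similarity_matrix(R):
    """S[i, j] = cosine similarity between the rating-history row vectors of
    users i and j; rows with no ratings are left with zero similarity."""
    R = np.asarray(R, dtype=float)
    norms = np.linalg.norm(R, axis=1, keepdims=True)
    norms[norms == 0] = 1.0   # avoid division by zero for empty histories
    U = R / norms             # unit row vectors
    return U @ U.T
```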
We now formulate the user selection task as an explicit mathematical optimization problem, where the objective is to select a batch of users based on the above criteria. Specifically, we define a binary vector $\mathbf{q}$ with $n$ entries ($q_u \in \{0, 1\}$), where each entry denotes whether $u$ will be included in the batch ($q_u = 1$) or not ($q_u = 0$). Our user selection strategy (with given batch size $k$) can then be expressed as the following integer quadratic programming (IQP) problem:

$\max_{\mathbf{q}} \; \lambda_1 \mathbf{p}^{T} \mathbf{q} + \lambda_2 \mathbf{q}^{T} D \mathbf{q} - \lambda_3 \mathbf{o}^{T} \mathbf{q} + \lambda_4 \mathbf{q}^{T} S (\mathbf{1} - \mathbf{q}), \quad \text{s.t. } \mathbf{q} \in \{0, 1\}^{n}, \; \textstyle\sum_{u} q_u = k \qquad (6)$

The first term satisfies Criterion (1): supposing $u$ has a high probability of rating $v$ ($p_u$ is large), then in order to optimize the objective function, $u$ is encouraged to be selected ($q_u$ is encouraged to be 1). The second term satisfies Criterion (2): supposing the potential ratings of $u_i$ and $u_j$ are very diverse ($d_{ij}$ is large), then $u_i$ and $u_j$ are encouraged to be selected together ($q_i$ and $q_j$ are encouraged to be 1 together). The third term satisfies Criterion (3); its analysis is similar to that of the first term, except that we subtract this term, since we want to select $u$ when $o_u$ is small. The last term encourages selected users to be similar to unselected users, ensuring representativeness, which satisfies Criterion (4); this can be analyzed similarly to the second term. $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$ are trade-off parameters.
Equation (6) can be reformulated into a pure quadratic form. Since $q_u \in \{0, 1\}$, we have $q_u^2 = q_u$, so any linear term $\mathbf{c}^{T} \mathbf{q}$ can be folded into the quadratic term as $\mathbf{q}^{T} \mathrm{diag}(\mathbf{c}) \mathbf{q}$, where $\mathrm{diag}(\mathbf{c})$ is the diagonal matrix whose $u$-th diagonal entry equals the $u$-th entry of $\mathbf{c}$. Expanding $\mathbf{q}^{T} S (\mathbf{1} - \mathbf{q}) = \mathbf{q}^{T} S \mathbf{1} - \mathbf{q}^{T} S \mathbf{q}$ and folding all linear terms in this way, the matrix $Q$ is defined as:

$Q = \lambda_1 \, \mathrm{diag}(\mathbf{p}) + \lambda_2 D - \lambda_3 \, \mathrm{diag}(\mathbf{o}) + \lambda_4 \left( \mathrm{diag}(S \mathbf{1}) - S \right)$

Finally, the objective function is transformed to:

$\max_{\mathbf{q}} \; \mathbf{q}^{T} Q \mathbf{q}, \quad \text{s.t. } \mathbf{q} \in \{0, 1\}^{n}, \; \textstyle\sum_{u} q_u = k$
Directly solving this integer quadratic programming (IQP) problem is NP-hard. However, it can be relaxed to 1) a convex quadratic programming (QP) problem by relaxing the constraint $q_u \in \{0, 1\}$ to $0 \le q_u \le 1$ [chattopadhyay:2012batch], or 2) a convex linear programming (LP) problem of two types, one from [chattopadhyay:2012batch] and the other from [chakraborty:2015batchrank]. Here we briefly describe the convex LP solution from [chakraborty:2015batchrank], which proceeds in two steps. In step 1, compute a vector containing the column sums of $Q$ and identify its $k$ largest entries to derive the initial solution $\mathbf{q}^{(0)}$ (replace the $k$ largest entries with value 1 and the other entries with value 0). In step 2, an iterative process is performed, as shown in Algorithm 1. Starting from the initial solution $\mathbf{q}^{(0)}$, we generate a sequence of solutions until convergence. Finally, we obtain the solution $\mathbf{q}^{*}$, whose 1-valued entries indicate the set of users selected to rate $v$.
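A compact sketch of this two-step procedure, initializing from the column sums of $Q$ and then iterating a top-$k$ rounding of $Q\mathbf{q}$ until the selection stops changing (this follows the spirit of Algorithm 1 rather than reproducing it exactly):

```python
import numpy as np

def select_users(Q, k, max_iter=100):
    """Iterative top-k rounding for max q^T Q q with exactly k ones in q:
    start from the k largest column sums of Q (step 1), then repeatedly
    move the batch toward the k largest entries of Q @ q until the
    selection stops changing (step 2)."""
    n = Q.shape[0]
    q = np.zeros(n)
    q[np.argsort(Q.sum(axis=0))[-k:]] = 1.0   # step 1: initial solution
    for _ in range(max_iter):                 # step 2: iterate to convergence
        q_new = np.zeros(n)
        q_new[np.argsort(Q @ q)[-k:]] = 1.0
        if np.array_equal(q_new, q):
            break
        q = q_new
    return q
```

Only one row of $Q$ is needed at a time to form $Q\mathbf{q}$, which matches the memory-friendly variant discussed below.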
If $Q$ is positive semi-definite, Algorithm 1 has guaranteed monotonic convergence. If $Q$ is not positive semi-definite, a positive scalar can be added to its diagonal elements, and the algorithm can still be run on the shifted quadratic function with guaranteed monotonic convergence [yuan:2013truncated]. Due to the monotonic convergence, the quality of the solution can only improve over iterations. The iterative process converges fast, so there is only a marginal increase in running time. Therefore, the complexity of our algorithm is $O(n^2)$ per iteration, where $n$ is the number of users. Refer to [chakraborty:2015batchrank] for a more detailed complexity analysis. In real recommender systems, $Q$ may be too large for the memory to load. Our algorithm still works well in this situation. For step 1, the memory only needs to load one column of $Q$ at a time to calculate the column sums of $Q$. For each iteration of step 2, the memory only needs to load one row (equal to the corresponding column, since $Q$ is symmetric) of $Q$ at a time to calculate $Q \mathbf{q}$.
4.2 Active Learning for a Batch of Items
As described in the introduction section, the budget of active learning is fixed for each new item in previous active learning works. In this paper, we propose a dynamic active learning budget so that the limited active learning resources can be properly distributed. We use $v_1, \ldots, v_m$ to denote the new items. The total budget is denoted as $B$, and the budgets for the new items are denoted as $b_1, \ldots, b_m$, where $b_j$ is the number of users selected for new item $v_j$; thus $\sum_{j=1}^{m} b_j = B$. We propose that more budget be distributed to new items with the following two features.
Firstly, these items are popular, which means many people would be willing to rate them. In the active learning phase, since popular items tend to be rated by more of the selected users, we will get more feedback ratings if we request ratings on popular items rather than on unpopular ones. In the prediction phase, since popular items also tend to receive more ratings from unselected users, learning more about popular items, rather than unpopular ones, will influence and generate accurate predictions for more ratings. This is a problem of whether users will rate items (described in Criterion (1)). We use the mean of all users’ willingness scores to measure it:

$\mathrm{Pop}(v_j) = \dfrac{1}{n} \sum_{u=1}^{n} p_u^{(j)}$

where $p_u^{(j)}$ is the score $p_u$ defined in Section 4.1, computed with respect to $v_j$.
Secondly, these items are controversial, which means we are uncertain whether they will be liked or disliked by users. For items which will obviously be favored by almost all users, we already have high confidence about what ratings users tend to give them. On the contrary, it is the controversial items that we need to learn more about. This is a problem of what ratings users will give to items (described in Criterion (2)). We use the standard deviation of the potential ratings to measure it:

$\mathrm{Con}(v_j) = \sqrt{\dfrac{1}{n} \sum_{u=1}^{n} \left( \hat{r}_u^{(j)} - \bar{r}^{(j)} \right)^2}$

where $\hat{r}_u^{(j)}$ is the potential rating defined in Section 4.1 and $\bar{r}^{(j)}$ is the average of the potential ratings on $v_j$. A budget score for each new item is defined as:

$s_j = \mu \, \mathrm{Pop}(v_j) + (1 - \mu) \, \mathrm{Con}(v_j)$

where $\mu$ is a parameter balancing the importance of the two features. Finally, the budget is distributed as:

$b_j = B \cdot \dfrac{s_j}{\sum_{j'=1}^{m} s_{j'}}$

where the $b_j$ are rounded to integers. This equation ensures that more popular and controversial items get more budget, while each item still has the opportunity to gain some budget.
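Assuming the budget score mixes popularity and controversy linearly and the budget is allocated proportionally to the scores (both are assumptions of this sketch, with `mu` playing the role of the balancing parameter), the distribution can be sketched as:

```python
import numpy as np

def distribute_budget(pop, con, B, mu=0.5):
    """Split a total budget B over new items proportionally to a score
    mixing each item's popularity and controversy (mu balances the two);
    the exact mixing and normalization forms are assumptions here."""
    pop, con = np.asarray(pop, float), np.asarray(con, float)
    scores = mu * pop + (1 - mu) * con
    budget = B * scores / scores.sum()
    return np.rint(budget).astype(int)   # round to integer budgets
```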
4.3 Rating Prediction Based on Feedback
Once the selected users’ feedback is obtained, we use another regression model to predict the unselected users’ ratings. Features in this model contain not only users and items’ attributes, but also items. The instances contain both previous ratings and the newly obtained feedback ratings. Again, this is modeled by Factorization Machines. To reduce the number of iterations and accelerate convergence, we first pre-train the regression model using previous ratings to obtain pre-trained parameters. Then, when feedback ratings are obtained, we use these pre-trained parameters as initial parameters, and all previous ratings plus the feedback ratings as training data, to re-train the model. Finally, the ratings of all unselected users are predicted. Figure 4 shows the detailed procedure.
The procedure of first pre-training and then re-training is similar in spirit to Bayesian analysis [berger:2013statistical]. That is, by pre-training on previous items, we learn users’ preferences on attributes and gain a “prior” understanding of users’ preferences on $v$, estimated from $v$’s attributes. The users’ feedback on $v$ then enhances our understanding of $v$ and allows us to give “posterior” estimates of users’ preferences on $v$.
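The pre-train/re-train idea can be illustrated with a stand-in linear model trained by SGD (the paper uses FM; the point here is warm-starting the re-training from the pre-trained weights, and all data below is hypothetical):

```python
import numpy as np

def train_sgd(X, y, w=None, epochs=500, lr=0.01):
    """Plain SGD for a linear model; passing previously learned weights `w`
    warm-starts the re-training instead of starting from scratch."""
    rng = np.random.default_rng(0)
    if w is None:
        w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for idx in rng.permutation(len(y)):
            err = X[idx] @ w - y[idx]
            w = w - lr * err * X[idx]
    return w

# Pre-train on previous ratings, then re-train with feedback appended.
X_prev = np.array([[1.0, 0.0], [0.0, 1.0]])
y_prev = np.array([4.0, 2.0])
w_pre = train_sgd(X_prev, y_prev)                 # "prior" model
X_all = np.vstack([X_prev, [[1.0, 1.0]]])         # + one feedback rating
y_all = np.append(y_prev, 3.0)
w_post = train_sgd(X_all, y_all, w=w_pre.copy())  # warm-started "posterior"
```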
4.4 Exploitation-exploration Analysis
As described in the introduction section, there are two goals in our task, i.e. exploitation (exploiting “existing knowledge” to select users who are willing to rate new items, in order to obtain good user experience in the active learning phase) and exploration (selecting users whose feedback provides as much “new knowledge” about unselected users’ preferences as possible and generates accurate rating predictions for them, in order to obtain good user experience in the prediction phase). 1) The dynamic budget strategy distributes more budget to popular items, on which people are more willing to give ratings, and Criterion (1) encourages users who are more willing to rate a certain new item to be selected; both contribute to improving the user experience in the active learning phase. 2) The dynamic budget strategy and the four criteria all help us learn more about unselected users’ preferences and generate more accurate rating predictions; thus they all contribute to improving the user experience in the prediction phase. Therefore, our method considers both goals (exploitation and exploration). In addition, we are able to adjust the parameter setting to further balance their trade-off. Specifically, once the best prediction accuracy is obtained with all parameters assigned appropriate values, if we want to attach more importance to the user experience in the active learning phase, we simply increase $\lambda_1$, the weight of Criterion (1). The reason is that putting more weight on Criterion (1) results in a higher rate of feedback ratings. However, increasing $\lambda_1$ breaks the parameter setting optimized for rating prediction, so the prediction accuracy decreases.
Our proposed algorithm is evaluated on two datasets, Movielens-IMDB and Amazon. For the Movielens-IMDB dataset, ratings are collected from Movielens (http://grouplens.org/datasets/movielens/) and attributes of movies are collected with IMDbPY (http://imdbpy.sourceforge.net/). For the Amazon dataset, ratings (http://jmcauley.ucsd.edu/data/amazon/links.html) and attributes (https://developer.amazon.com/) are collected analogously. The statistics of the two datasets are shown in Table I. In Movielens-IMDB, the number of attributes is the total number of directors, actors, genres, etc. In Amazon, the number of attributes is the total number of authors, publishers, etc. We collect ratings from the Movielens dataset rather than the official Netflix dataset (https://www.kaggle.com/netflix-inc/netflix-prize-data) because our setting requires items' detailed attributes: the attributes of movies in the Movielens dataset can be collected with IMDbPY by accurately linking the movie ids in Movielens and IMDB, whereas we did not find a way to accurately obtain the attributes of movies in the Netflix dataset. For each dataset, 20% of the items are randomly chosen as "new items" (i.e. testing items) in our experiments; the ratings and attributes of the other 80% of the items are used to train the different models. Our goal is to generate accurate rating predictions on the testing items. Following [aharon:2015excuseme, anava:2015budget], the training-testing experiments are done once (also called holdout [tan:2006introduction]). Inspired by [harpale:2008personalized], we randomly select half of all users as the active-selection set; the remaining users form the prediction set. For all testing items, users are selected from the active-selection set in the active learning phase, while rating prediction and evaluation are performed for users in the prediction set.
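The holdout protocol described above (20% of items held out as cold-start testing items, users split in half into an active-selection set and a prediction set) can be sketched as follows. This is an illustrative sketch with our own function names, not the authors' code:

```python
import random

def split_cold_start(item_ids, user_ids, test_frac=0.2, seed=7):
    """Hold out `test_frac` of items as cold-start testing items, and
    split users into an active-selection half and a prediction half."""
    rng = random.Random(seed)
    items = list(item_ids)
    rng.shuffle(items)
    n_test = int(len(items) * test_frac)
    test_items, train_items = items[:n_test], items[n_test:]

    users = list(user_ids)
    rng.shuffle(users)
    half = len(users) // 2
    active_set, prediction_set = users[:half], users[half:]
    return train_items, test_items, active_set, prediction_set
```

Models are then trained on `train_items` only; ratings on `test_items` are hidden and used solely to simulate feedback and to evaluate predictions.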
Table I: Statistics of the two datasets

| | Movielens-IMDB | Amazon |
| Number of users | 5,000 | 973 |
| Number of items | 9,998 | 5,000 |
| Number of ratings | 5,154,925 | 97,967 |
| Number of attributes | 255,942 | 3,840 |
5.2 Compared Algorithms
Since we aim to handle new items with no ratings at all, many previous active learning recommendation methods [hofmann:2003collaborative, jin:2004bayesian, schohn:2000less], which require at least a small number of initial ratings, are not applicable to our task. [anava:2015budget, aharon:2015excuseme] are the works most related to ours, which also address the item cold-start problem in an active learning scheme. However, the approach proposed by [aharon:2015excuseme] assumes an online setting in which users arrive one by one and the system decides on the fly whether to give each of them a rating request, so it is clearly not applicable to our task. [anava:2015budget] assumes that selected users always rate the new item (a feedback rate of 100%), while our task adopts the more realistic setting in which only a subset of selected users give feedback ratings; it would thus be unfair to compare the method in [anava:2015budget] with our method and the other baselines. We denote our method without dynamic budget as FMFC (Factorization Machines with Four Criteria) and our method with dynamic budget as FMFC-DB. We adapt the following baselines from previous literature for comparison with our proposed methods. HBRNN, LCE and FM are hybrid methods that combine both content and collaborative information; the remaining algorithms all exploit active learning to perform recommendations. The pre-training schedule in Figure 4 is the same for all active learning methods, but the re-training differs since different active learning methods obtain different feedback ratings.
Hybrid-based Recommendation with Nearest Neighbor (HBRNN): This method [iaquinta:2007hybrid] combines content-based recommendation and item-based collaborative filtering. The similarity between two items (including training items and testing items) is computed from the corresponding row vectors of the item-attribute matrix, which represent the items in terms of their attributes. Once the similarities between items are obtained, all users' ratings on the new item are predicted in an item-based collaborative filtering manner: each user's predicted rating on the new item is the similarity-weighted average of the ratings that the user has given to the items he/she has rated.
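As a concrete illustration, the HBRNN scheme can be sketched in a few lines of NumPy. Function names are ours, and cosine similarity over attribute rows is an assumption (the paper's exact similarity function was lost in extraction):

```python
import numpy as np

def attribute_cosine_sim(F):
    """Cosine similarity between items, based on the rows of the
    item-attribute matrix F (items x attributes)."""
    norms = np.linalg.norm(F, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # guard against items with no attributes
    G = F / norms
    return G @ G.T

def predict_new_item(sim, R, new_item):
    """Item-based CF prediction of every user's rating on `new_item`:
    a similarity-weighted average of each user's existing ratings.
    R is a users x items rating matrix with 0 meaning 'unrated'."""
    preds = np.zeros(R.shape[0])
    for u in range(R.shape[0]):
        rated = np.nonzero(R[u])[0]
        rated = rated[rated != new_item]
        w = sim[new_item, rated]
        preds[u] = (w @ R[u, rated]) / w.sum() if w.sum() > 0 else 0.0
    return preds
```

In this sketch a user with no rated items similar to the new item simply receives a default prediction of 0; a real implementation would fall back to a global or user mean.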
Local Collective Embeddings (LCE): This method [saveski:2014item] also combines content-based recommendation and collaborative filtering. Unlike HBRNN, which is a hybrid recommendation method from the nearest-neighbor perspective, LCE is a hybrid recommendation method from the matrix factorization perspective. In addition, it exploits the manifold structure of the data to improve performance. We use the publicly available Matlab implementation of the LCE algorithm (https://github.com/msaveski/LCE). Parameters are set and tuned as recommended in [saveski:2014item].
Factorization Machines without Active Learning phase (FM): This method uses Factorization Machines [rendle:2012factorization] to model user behaviors. We directly use the pre-trained model in Figure 4 to predict users' ratings on the new item.
Factorization Machines with Random Sampling in the Active Learning phase (FMRSAL): In this baseline, for each new item, users are randomly selected from the active-selection set for rating requests. Since these users are randomly selected and ratings are sparse in our datasets, the rate of feedback ratings is expected to be low, so the performance improvement over FM may be limited. However, rating requests are given without bias toward any type of user, so no one is repeatedly selected for rating requests under this selection strategy.
Factorization Machines with ε-Greedy in the Active Learning phase (FMGAL): The ε-Greedy algorithm comes from the study of the multi-armed bandit problem [feldman:2015recommendations]. We adapt it to our task as follows. For the new item, we select users by sequential actions. In each action, with probability ε we select the user with the highest estimated probability of rating the item, and otherwise we randomly select another user. Once a user is selected in one action, he/she does not participate in the following actions. This selection strategy introduces one more parameter, ε, to be tuned. When ε = 1, it is equal to our FMFC with only Criterion (1); when ε = 0, it reduces to FMRSAL. When 0 < ε < 1, the randomness distributes rating requests over a wide range of users while still ensuring a rate of feedback ratings higher than FMRSAL. In our experiments, we find that no matter how ε varies, the performance in terms of all metrics always lies between that of FMFC with only Criterion (1) and that of FMRSAL, so for simplicity we only report results with ε equal to 0.5.
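A minimal sketch of this adapted ε-Greedy selection follows. The function name and the `rate_prob` mapping (user → estimated probability of rating the new item) are our illustrative assumptions:

```python
import random

def epsilon_greedy_select(rate_prob, budget, epsilon=0.5, seed=0):
    """Select `budget` users for rating requests. With probability epsilon,
    pick the remaining user with the highest estimated probability of
    rating the new item; otherwise pick a remaining user at random."""
    rng = random.Random(seed)
    remaining = set(rate_prob)
    selected = []
    for _ in range(min(budget, len(remaining))):
        if rng.random() < epsilon:
            u = max(remaining, key=lambda x: rate_prob[x])  # exploit
        else:
            u = rng.choice(sorted(remaining))               # explore
        selected.append(u)
        remaining.remove(u)
    return selected
```

With `epsilon=1.0` this degenerates to pure exploitation (users ranked by willingness to rate), and with `epsilon=0.0` to pure random sampling, matching the two extremes described above.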
Factorization Machines with Popular Sampling in the Active Learning phase (FMPSAL): Inspired by [golbandi:2011adaptive, rubens:2009output], for the new item, the users who have given the most ratings to the training items are selected for rating requests. Since these are "frequently" rating users, they also tend to rate the new item, which ensures a high rate of feedback ratings. Note that, unlike our Criterion (1), which is "personalized" for different new items, the users selected by this strategy are always the same.
Factorization Machines with Coverage Sampling in the Active Learning phase (FMCSAL): Inspired by [golbandi:2010bootstrapping, rubens:2009output], for the new item, users who have highly co-rated items with other users are selected for rating requests. We define a user's Coverage value as the total number of co-rated items summed over all other users, i.e. the sum, over every other user, of the number of items rated by both of them. Users with high Coverage values are then selected. The heuristic behind this strategy is that users who co-rate the same items with many other users can better reflect other users' interests, and thus their rating behaviors are more helpful for predicting the rating behaviors of other users.
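The Coverage score can be computed as follows; this is an illustrative sketch with our own naming, not the authors' code:

```python
from collections import defaultdict
from itertools import combinations

def coverage_scores(user_items):
    """Coverage(u) = sum over all other users u' of the number of items
    rated by both u and u'. `user_items` maps each user to the set of
    items he/she has rated."""
    cov = defaultdict(int)
    for u, v in combinations(list(user_items), 2):
        co = len(user_items[u] & user_items[v])  # co-rated items
        cov[u] += co
        cov[v] += co
    return dict(cov)
```

Users are then ranked by their Coverage values and the top ones receive rating requests.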
Factorization Machines with Exploration Sampling in the Active Learning phase (FMESAL): As described in [rubens:2015active], exploration is important for recommendation, especially for new items as in our task. Inspired by studies of exploration in [rubens:2007influence, chattopadhyay:2012batch], for the new item, users are selected for rating requests such that the selected users are representative of the unselected users and, at the same time, exhibit high diversity among themselves. This can be achieved by optimizing the following objective function:
where the notations are defined as in Equation (7). The first term ensures "diversity" (selected users are dissimilar to each other) and the second term ensures "representativeness" (selected users are similar to unselected users). This integer quadratic programming (IQP) problem can be relaxed to a standard quadratic program (QP) and solved with many existing solvers.
In the active learning phase, we use the following two metrics to measure the user experience of the selected users.
Percentage of feedback ratings: the ratio of users who give feedback ratings among all users who receive rating requests, i.e. the number of feedback ratings divided by the number of rating requests.
Users who give feedback ratings are more likely to be willing to rate the item than those who do not. A higher percentage of feedback ratings means that more selected users give feedback, which indicates a better user experience.
Average selecting times: the average number of times each user is selected after a given selection strategy has been applied to all testing items, i.e. the total number of selections across all testing items divided by the number of users.
Higher average selecting times mean that some users are repeatedly selected for rating requests, which will quickly annoy them and indicates a poorer user experience. For algorithms without active learning, i.e. HBRNN, LCE and FM, these two metrics are not measured.
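Both user-experience metrics are straightforward to compute; the following sketch (function names are ours) makes the definitions concrete:

```python
def feedback_percentage(requested, responded):
    """Share (in %) of rating requests that received feedback ratings."""
    return 100.0 * len(responded) / len(requested) if requested else 0.0

def average_selecting_times(selections_per_item, num_users):
    """Average number of times each user in the pool was selected,
    aggregated over the selections made for all testing items."""
    total = sum(len(s) for s in selections_per_item)
    return total / num_users
```

A strategy that always queries the same few heavy raters maximizes the first metric but also inflates the second, which is exactly the trade-off discussed above.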
In the prediction phase, we use Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) to measure the user experience of the unselected users. They are defined as follows:

$\mathrm{RMSE} = \sqrt{\frac{1}{|\mathcal{T}|}\sum_{(u,v)\in\mathcal{T}}(r_{uv}-\hat{r}_{uv})^2}, \quad \mathrm{MAE} = \frac{1}{|\mathcal{T}|}\sum_{(u,v)\in\mathcal{T}}|r_{uv}-\hat{r}_{uv}|$

where $\mathcal{T}$ is the set of (user, item) pairs in which users give ratings to new items, $r_{uv}$ is the rating that user $u$ actually gives to item $v$, and $\hat{r}_{uv}$ is the predicted rating.
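The two error metrics can be computed directly from (actual, predicted) rating pairs; a minimal sketch with our own function names:

```python
import math

def rmse(pairs):
    """Root mean square error; `pairs` is an iterable of
    (actual_rating, predicted_rating) tuples."""
    se = [(r - p) ** 2 for r, p in pairs]
    return math.sqrt(sum(se) / len(se))

def mae(pairs):
    """Mean absolute error over the same (actual, predicted) pairs."""
    ae = [abs(r - p) for r, p in pairs]
    return sum(ae) / len(ae)
```

RMSE penalizes large individual errors more heavily than MAE, which is why both are commonly reported together.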
For methods without active learning, i.e. HBRNN, LCE and FM, models are trained directly on the training items. For the remaining methods, given a new testing item, we select users from the active-selection set in the active learning phase and check whether they have actual ratings on the testing item; if so, we regard these actual ratings as feedback ratings. In the prediction phase, we exploit all feedback ratings to re-train the model. For all methods, RMSE and MAE are evaluated on the prediction set.
Apart from rating prediction, we can mimic a top-N recommendation setting as follows. First, for all testing items, we select users from the active-selection set to obtain feedback and predict ratings for users in the prediction set (for HBRNN, LCE and FM, we directly predict users' ratings). Second, for each user, we select the N testing items with the largest predicted ratings (i.e. the top-N items) as the recommendation list. Finally, we regard new items with actual ratings larger than 3 as users' preferred items [guan:2016weakly]. Performance is evaluated based on how many preferred items appear in the recommendation list, their actual ratings, and their ranking positions. The following ranking metrics are used to evaluate top-N recommendation performance.
Precision and Recall: Precision is defined as the number of correctly recommended items (i.e. the number of preferred items appearing in the recommendation list) divided by the number of all recommended items. Recall is defined as the number of correctly recommended items divided by the total number of items that should be recommended (i.e. the number of preferred items). Precision and Recall at a given ranking position are the corresponding values computed over the top of the list up to that position. In our setting, there are only N items in the recommendation list while the number of preferred items is relatively large, so the original Recall is too small; we therefore multiply it by an appropriate factor to obtain a modified Recall [zhu:2016heterogeneous].
Normalized Discounted Cumulative Gain (NDCG): NDCG at a given position is the discounted cumulative gain (DCG) at that position, in which the relevance rating of the item at each position is discounted logarithmically with the position, normalized by the DCG of the perfect ranking so that the perfect ranking has a value of 1. In our case, the relevance is set to the actual rating for preferred items and 0 for the other items.
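The metric can be sketched as below. We use the common exponential-gain form of DCG, $(2^{rel}-1)/\log_2(pos+1)$; the paper's exact gain function was lost in extraction and may differ:

```python
import math

def ndcg_at_k(rels, k):
    """NDCG@k. `rels` holds the relevance of the recommended items in
    ranked order (actual rating for preferred items, 0 otherwise)."""
    def dcg(scores):
        # position i (0-based) is discounted by log2(i + 2)
        return sum((2 ** s - 1) / math.log2(i + 2)
                   for i, s in enumerate(scores[:k]))
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0
```

A perfectly ordered list scores 1.0; pushing preferred items toward the bottom of the list lowers the score.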
5.4 Parameter Setting
Before setting the parameters, we first calibrate the four criteria. The calibration consists of the following two steps.
Step 1: we normalize the scores of the four criteria by standardization. Specifically, every entry of a criterion's score vector (Criteria (1) and (3)) or score matrix (Criteria (2) and (4)) is replaced by its difference from the mean of all entries of that vector or matrix, divided by their standard deviation.
Step 2: we further down-weight the standardized scores of Criteria (2) and (4). These two criteria are pairwise, so their score matrices contain far more entries than the score vectors of Criteria (1) and (3); without this calibration, Eq. (6) would be vastly influenced by its second and fourth terms. The trade-off weights in Eq. (6) are tuned based on the calibrated scores, and thanks to these two steps the tuned weights have similar orders of magnitude.
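The calibration can be sketched as follows. Note the assumption flagged in the comments: we divide the pairwise criteria by the number of users, whereas the paper's exact divisor was lost in extraction:

```python
import numpy as np

def standardize(x):
    """Z-score normalization over all entries of a score vector/matrix."""
    x = np.asarray(x, dtype=float)
    std = x.std()
    return (x - x.mean()) / std if std > 0 else x - x.mean()

def calibrate(c1, c2, c3, c4):
    """Step 1: standardize all four criteria. Step 2: down-weight the
    pairwise (matrix-valued) criteria (2) and (4); dividing by the number
    of users n is our assumption about the exact divisor."""
    n = np.asarray(c2).shape[0]
    return (standardize(c1), standardize(c2) / n,
            standardize(c3), standardize(c4) / n)
```

After calibration, the four terms of the selection objective contribute on comparable scales, so the tuned trade-off weights remain of similar magnitude.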
The main parameters of our method are the number of users selected for active learning and the four weights that trade off the importance of the different criteria. We empirically fix the number of selected users per testing item while tuning the other parameters. Since the four weights can be multiplied by an arbitrary scaling factor, only three of them are free; we fix one and tune the remaining three by grid search, using the prediction error measured by cross-validation on the training data as the tuning metric (users are again split into an active-selection set and a prediction set). For the Movielens-IMDB dataset, we regard the performance measured on all testing items under the tuned parameter setting as the performance of FMFC. Furthermore, with the tuned weights fixed, we implement FMFC-DB, i.e. our method with a dynamic active learning budget; the total budget (see Section 4.2) is set to the per-item number of selected users multiplied by the number of testing items. We regard the performance measured in this setting as the performance of FMFC-DB. For the other active learning baselines, the performance is measured with the same number of selected users for each testing item. The Amazon dataset is tuned and evaluated in the same way.
5.5 Results and Analysis
5.5.1 Algorithm Comparison
We now compare our methods with all baselines; all performance is measured on the testing items. As shown in Table II and Table III, our methods (FMFC and FMFC-DB) outperform the other baselines in terms of RMSE and MAE, which indicates that they achieve the highest prediction accuracy in the prediction phase. Our methods also perform best in the top-N recommendation task according to Table IV and Table V. Factorization Machines with different active learning strategies perform better than Factorization Machines without active learning (FM). This is easy to understand, since feedback ratings give us a better understanding of the testing items, which can be exploited to enhance the prediction model. Although they use no active learning, HBRNN and LCE perform better than FM and even better than the active learning methods FMRSAL and FMESAL; the reason may be that HBRNN and LCE make better use of both content and collaborative information than FM. LCE performs slightly better than HBRNN, possibly because it exploits the manifold structure of the data to improve hybrid recommendation. FMPSAL, FMCSAL, FMFC and FMFC-DB perform better than the other three active learning methods, mainly because these four methods ensure a high rate of feedback ratings, which is the dominant factor influencing the prediction accuracy (we analyze this in the next section). Our methods outperform FMPSAL and FMCSAL because 1) they achieve higher rates of feedback ratings, and 2) they consider not only the rate of feedback ratings (Criterion (1)) but also three other factors (Criteria (2), (3) and (4)) that improve the prediction accuracy. The percentage of feedback ratings and the average selecting times both measure the user experience in the active learning phase; since HBRNN, LCE and FM have no active learning phase, we compare only the remaining methods on these two metrics.
In terms of the percentage of feedback ratings, FMRSAL and FMESAL obtain rather few feedback ratings, which indicates that they keep sending rating requests to users who do not really want to rate the item; these two methods thus hurt the user experience. FMGAL achieves a relatively higher percentage, and the other active learning methods all obtain a considerable number of feedback ratings. In terms of the average selecting times, owing to its inherent randomness, FMRSAL undoubtedly performs best with the lowest value, and FMGAL performs second best. FMPSAL, FMCSAL and FMESAL select the same user set to rate all testing items, which will certainly annoy those users. Overall, our methods are the best when all these metrics are considered together. Comparing FMFC with FMFC-DB shows that our proposed dynamic active learning budget further improves the performance in terms of all metrics.
Significance tests are performed to examine whether the differences in our experimental results are statistically significant. Paired t-tests are conducted to compare (1) our two proposed methods (FMFC and FMFC-DB), (2) FMFC against all baselines, and (3) FMFC-DB against all baselines. Specifically, for each dataset, we repeat the training-testing experiments several times, with the testing item set chosen independently each time, so that each method yields one metric value per repetition. The inputs of a paired t-test are the two sets of metric values of the compared methods. Since we want to validate that one method is better than the other, we use a one-tailed hypothesis with a fixed significance threshold. The results show that, in terms of all metrics except one, (1) FMFC-DB performs significantly better than FMFC, and (2) both FMFC and FMFC-DB perform significantly better than all baselines.
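The statistic underlying this comparison can be sketched as below (a minimal illustration of the paired t statistic on per-repetition metric values; in practice one would use a statistics library and compare the statistic against the t distribution with n-1 degrees of freedom):

```python
import math

def paired_t_stat(a, b):
    """Paired t statistic for per-repetition metric values of two methods,
    paired by repetition. For an error metric such as RMSE, a negative t
    supports the hypothesis that method `a` has lower error than `b`."""
    d = [x - y for x, y in zip(a, b)]          # per-repetition differences
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)
```

For a one-tailed test, the resulting statistic is compared against the one-sided critical value at the chosen significance level.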
5.5.2 Criteria Analysis
Table VI: Percentage of feedback ratings when Criterion (1) is removed

| | Percentage of feedback ratings on Movielens-IMDB | Percentage of feedback ratings on Amazon |
| No Criterion (1) | 9.91, 10.16 | 2.25, 2.57 |

Table VII: Average diverse value when Criterion (2) is removed

| | The Average Diverse Value on Movielens-IMDB | The Average Diverse Value on Amazon |
| No Criterion (2) | 1.21, 1.23 | 1.13, 1.16 |

Table VIII: Difference from the global average rating when Criterion (3) is removed

| | The Difference on Movielens-IMDB | The Difference on Amazon |
| No Criterion (3) | 1.22, 1.19 | 1.43, 1.39 |

Table IX: Average similarity value when Criterion (4) is removed

| | The Average Similarity Value on Movielens-IMDB | The Average Similarity Value on Amazon |
| No Criterion (4) | 0.54, 0.56 | 0.46, 0.49 |
As mentioned in Section 4.1, to generate more accurate rating predictions for the new item, our methods select users based on four criteria. In this section, we first validate whether each criterion works as claimed, and then validate their contributions to the final prediction performance.
Criterion (1): Selected users have a high probability of rating the new item
We remove Criterion (1) from FMFC and FMFC-DB to see how the percentage of feedback ratings varies. The results are shown in Table VI. Without Criterion (1), the percentage of feedback ratings decreases dramatically for both FMFC and FMFC-DB. Since a higher percentage means that more selected users rate the new item, this result validates the effectiveness of Criterion (1).
Criterion (2): Selected users’ potential ratings are diverse
The purpose of selecting users with diverse potential ratings is to ensure that these users' actual ratings also tend to be diverse, so that the final prediction model is not biased toward a fixed region of ratings. We remove Criterion (2) from FMFC and FMFC-DB to see how the average diverse value (defined in Eq. (3)) of selected users' actual ratings varies. As shown in Table VII, without Criterion (2), the average diverse value decreases for both FMFC and FMFC-DB. The result validates the effectiveness of Criterion (2).
Criterion (3): Selected users’ generated ratings are objective
The insight of Criterion (3) is to make the average rating of the users selected by our methods approximate the average rating of all users. We measure the absolute difference between these two averages, and likewise the absolute difference between the average rating of users selected without Criterion (3) and the average rating of all users. The results are shown in Table VIII. For both FMFC and FMFC-DB, the difference is smaller with Criterion (3) than without it, which indicates that Criterion (3) indeed brings the average rating of the selected users closer to that of all users.
Criterion (4): Selected users are representative
The insight of Criterion (4) is to make the selected users similar to the unselected users. We measure the average similarity between selected and unselected users with and without Criterion (4) to validate this criterion. The results are shown in Table IX. Without Criterion (4), the average similarity value declines, which indicates that Criterion (4) indeed makes the selected users more similar to the unselected users.
We further validate the contribution of each criterion to the final prediction performance. The results are shown in Figure 5. The prediction error increases when we remove any one criterion, which indicates that each criterion contributes to the prediction improvement. The error increases the most when we remove Criterion (1), which indicates that this criterion is the dominant factor in the prediction improvement. The error also decreases as the number of selected users increases. This is easy to understand: selecting more users leads to more feedback ratings, which gives us a better understanding of the new item and thus generates more accurate predictions.
5.5.3 Dynamic Budget Analysis
As mentioned in Section 4.2, we use the dynamic budget strategy to properly distribute limited active learning resources. As shown in Figure 6, for all tested values of the total budget, the dynamic budget contributes to performance improvements in terms of both RMSE and MAE. The improvement narrows as the total budget increases. This is because the dynamic budget is designed to address the problem of limited active learning resources; when the total budget is sufficient, the strategy provides less help.
5.5.4 Exploitation-exploration Analysis
The results in Table II and Table III have shown that our method achieves high performance for both exploitation (a high percentage of feedback ratings in the active learning phase) and exploration (low RMSE and MAE in the prediction phase). We now analyze how the weight of Criterion (1) can further balance their trade-off. We vary this weight while fixing the other tuned parameters to see how the performance changes. As shown in Figure 7, the RMSE of FMFC first decreases and then increases as the weight grows, reaching its best value at an intermediate setting, while the percentage of feedback ratings keeps increasing. We certainly will not set the weight to zero, which yields poor performance in terms of both metrics. Beyond that, if we attach more importance to the active learning phase, we assign a larger value to the weight; similarly, if we pay more attention to the prediction phase, we assign a smaller value. These experimental results are consistent with the analysis in Section 4.4.
In this paper, we propose a novel recommendation scheme for the item cold-start problem by leveraging both active learning and items' attribute information. We first pre-train the rating prediction model with users' historical ratings and items' attributes. Second, given a new item, a small portion of users are selected to rate this item based on four useful criteria. Third, the prediction model is re-trained with the feedback ratings added. Finally, unselected users' ratings are predicted by the re-trained model. We further propose a dynamic active learning budget to properly distribute active learning resources, which contributes to better recommendation performance; the idea of a dynamic active learning budget can also be applied to other active learning tasks. Our methods ensure a relatively good user experience both for selected users in the active learning phase and for unselected users in the prediction phase. For future work, we will explore additional criteria to improve our user selection strategy. In addition, libFM as used in this paper is a regression model suited to rating prediction; we will try to extend our method with a ranking model to better address the top-N recommendation task.
This work was supported in part by the National Basic Research Program of China (973 Program) under Grant 2013CB336500, National Nature Science Foundation of China (Grant Nos: 61522206, 61379071, 61373118), and National Youth Top-notch Talent Support Program.
Yu Zhu received the B.S. degree in Computer Science from Zhejiang University, China, in 2013. He is currently a Ph.D. student in computer science at Zhejiang University. His research interests include machine learning, data mining and recommender systems.
Jinghao Lin is currently a master candidate in the State Key Lab of CAD&CG, College of Computer Science at Zhejiang University, China. He received the BS degree from Zhejiang University, China in 2015. His research interests are data mining and recommendation systems.
Shibi He is a third-year undergraduate at Zhejiang University. He is currently a research scholar at the University of Illinois at Urbana-Champaign, supervised by Prof. Jian Peng. Before that, he was a member of the social network group in the State Key Lab of CAD&CG under the supervision of Prof. Deng Cai. His research interests include deep learning, social networks, bioinformatics and computer vision.
Beidou Wang received his BS degree from Zhejiang University, China, in 2011. He is currently in a duo PhD program of Zhejiang University, China and Simon Fraser University, Canada. His research interests include social network mining and recommender systems.
Ziyu Guan received the BS and PhD degrees in Computer Science from Zhejiang University, China, in 2004 and 2010, respectively. He had worked as a research scientist in the University of California at Santa Barbara from 2010 to 2012. He is currently a full professor in the School of Information and Technology of Northwest University, China. His research interests include attributed graph mining and search, machine learning, expertise modeling and retrieval, and recommender systems.
Haifeng Liu is an Associate Professor in the College of Computer Science at Zhejiang University, China. She received her Ph.D. degree from the Department of Computer Science at the University of Toronto in 2009. She received her Bachelor's degree in Computer Science from the Special Class for the Gifted Young at the University of Science and Technology of China. Her research interests lie in the fields of machine learning, pattern recognition, and web mining.
Deng Cai is a Professor in the State Key Lab of CAD&CG, College of Computer Science at Zhejiang University, China. He received the PhD degree in computer science from the University of Illinois at Urbana-Champaign in 2009. Before that, he received his Bachelor's and Master's degrees from Tsinghua University in 2000 and 2003 respectively, both in automation. His research interests include machine learning, data mining and information retrieval. He is a member of the IEEE.