User Bias Removal in Review Score Prediction




Review score prediction from text reviews has recently gained a lot of attention in recommendation systems. A major problem in models for review score prediction is the presence of noise due to user-bias in review scores. We propose two simple statistical methods to remove such noise and improve review score prediction. Compared to other methods that use multiple classifiers, one for each user, our model uses a single global classifier to predict review scores. We empirically evaluate our methods on two major categories (Electronics and Movies and TV) of the SNAP-published Amazon e-Commerce Reviews dataset and on the Amazon Fine Food Reviews dataset. We obtain improved review score prediction for three commonly used text feature representations.


1 Introduction

1.1 User Bias Problem

Different users generally do not rate food or e-commerce products on the same scale. Every user has their own preferred scale for rating a product. Some users are generous and rate an item as 4 or 5 (on a scale of 1 to 5), thus introducing a positive bias. At the other extreme, some users tend to give 1 or 2, thus introducing a negative bias in the scores. These preferred rating choices of particular users make it difficult to learn a general model for review score prediction. Thus, user-bias removal is a problem that must be handled for accurate review score prediction. Figure 1(L) shows the score distributions of three users with negative (user 1), neutral (user 2) and positive (user 3) biases. Figure 1(R) shows the score distributions of the same three users after bias removal. User-bias removal methods try to discount spurious good or bad scores given by users who tend to upvote or downvote irrespective of item quality.

Past works that perform user-bias modelling and use review text for score prediction focus on building user-specific regressors that predict ratings for a single user only [Seroussi et al.(2010)Seroussi, Zukerman, and Bohnert] [Li et al.(2011)Li, Liu, Jin, Zhao, Yang, and Zhu]. Most users do not review often, as indicated in Figure 2, which leads to insufficient training instances per user. Furthermore, these models are computationally expensive to train and evaluate and have large storage requirements. Matrix factorization techniques such as those described in [Koren et al.(2009)Koren, Bell, and Volinsky] and [Ricci et al.(2011)Ricci, Rokach, and Shapira] model user-bias as a matrix factorization problem and rely only on collaborative filtering to predict review scores. They do not make use of the review text for score prediction. In our approach, we build two simple yet universal statistical models (UBR-I and UBR-II) that estimate user-bias for all users. We then learn a single regressor on review text and unbiased review scores to predict unknown scores given the corresponding reviews. The main contributions of our work are:

  1. A simple, global user-bias model and a single linear regressor to predict review scores from review text.

  2. The model is computationally efficient, with reduced sample, time and space requirements during training and testing.

In Section 2, we present the UBR-I and UBR-II techniques in detail, followed by thorough experiments in Section 3 and a discussion of relevant previous work in Section 4.



Figure 1: User Bias Problem


Figure 2: Heavy-tailed distribution for the Amazon Electronics dataset, where the x-axis is the number of reviews written and the y-axis is the number of users

2 Our User-Bias Removal (UBR) Methods

This section explains the proposed model for rectifying user-bias to improve score prediction. We remove user-bias in the scores of each user ($u_i$) by learning a statistical mapping from a user-specific scale to a general scale common to all users. We propose two methods to learn such a mapping, as described in detail below.

2.1 User-Bias Removal-I (UBR-I)

We develop a user-specific statistical mapping for user-bias removal by normalizing each review score with respect to the mean and standard deviation of the scores of all products rated by that user. During prediction we use the same user-specific mean and standard deviation (the statistical mapping) to revert to the original scale.

Let $R(u_i, p_j)$ represent the review score of user $u_i$ for product $p_j$. We calculate the normalised score $NR(u_i, p_j)$ for training, and predict the score $PR(u_i, p_j)$ for a new review of user $u_i$ for product $p_j$ during prediction, as follows:

  1. For each user, calculate and store the mean $\mu_{u_i}$ of all scores given by user $u_i$:

    $$\mu_{u_i} = \frac{1}{N_{u_i}} \sum_{j} R(u_i, p_j)$$

    Here, $N_{u_i}$ represents the number of products reviewed by user $u_i$, and the sum runs over those products.

  2. Similarly, for every user calculate and store the standard deviation $\sigma_{u_i}$ of all the scores given by user $u_i$:

    $$\sigma_{u_i} = \sqrt{\frac{1}{N_{u_i}} \sum_{j} \left(R(u_i, p_j) - \mu_{u_i}\right)^2}$$

  3. For every review score, calculate the normalised user-bias removed score as follows:

    $$NR(u_i, p_j) = \frac{R(u_i, p_j) - \mu_{u_i}}{\sigma_{u_i}}$$

    Here, $NR(u_i, p_j)$ represents the normalised score (after user-bias removal) for user $u_i$ and product $p_j$. In the trivial case when $\sigma_{u_i}$ is zero (i.e. all reviews by the user have the same rating) we set $NR(u_i, p_j)$ equal to zero.

  4. We use this normalised score $NR(u_i, p_j)$ as the label and the review text as input features (one of tf-idf, LDA or Doc2Vec) to train a least-squares linear regressor [Galton(1886)].

  5. During prediction, the regressor is used to predict a normalised review rating $PNR(u_i, p_j)$ for a new review of user $u_i$ for product $p_j$. We recover the original user-biased score by the equation:

    $$PR(u_i, p_j) = PNR(u_i, p_j) \cdot \sigma_{u_i} + \mu_{u_i}$$

    Here, $PR(u_i, p_j)$ and $PNR(u_i, p_j)$ are the predicted score and the predicted normalised score (user-bias removed) respectively for user $u_i$ and product $p_j$. Since the true ratings are integers, we floor or cap $PR(u_i, p_j)$ to the nearest integer in [1, 5] to get the final predicted rating. This rating is used for the final error calculation.


Instead of normalising over all reviews, as previous work does, we perform user-specific normalisation in order to implicitly identify user-specific bias.
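The UBR-I steps can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names (`fit_user_stats`, `normalise`, `recover`) are hypothetical, the population form of the standard deviation is assumed, and the "floor or cap" step is read as rounding to the nearest integer and clipping to [1, 5]. The regressor is omitted; `pred_norm` stands in for its output.

```python
from collections import defaultdict
from math import floor, sqrt


def fit_user_stats(reviews):
    """Return {user: (mean, std)} from (user, product, score) triples."""
    scores_by_user = defaultdict(list)
    for user, _product, score in reviews:
        scores_by_user[user].append(score)
    stats = {}
    for user, scores in scores_by_user.items():
        mean = sum(scores) / len(scores)
        # Assumption: population standard deviation (divide by N).
        std = sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))
        stats[user] = (mean, std)
    return stats


def normalise(score, user, stats):
    """Step 3: user-bias removed score; zero when the user's std is zero."""
    mean, std = stats[user]
    if std == 0:
        return 0.0
    return (score - mean) / std


def recover(pred_norm, user, stats):
    """Step 5: map a predicted normalised score back to the 1..5 scale."""
    mean, std = stats[user]
    raw = pred_norm * std + mean
    # Assumption: round half up, then clip to the valid rating range.
    return min(5, max(1, floor(raw + 0.5)))


# Toy data (made up for illustration only).
toy_reviews = [("u1", "p1", 1), ("u1", "p2", 3), ("u2", "p1", 5), ("u2", "p3", 5)]
toy_stats = fit_user_stats(toy_reviews)
labels = [normalise(s, u, toy_stats) for u, _p, s in toy_reviews]
```

Only one mean and one standard deviation are stored per user, which is what keeps the storage cost low compared with one model per user.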



Figure 3: Architecture of UBR-I (L) and UBR-II (R) for bias removal for a specific user

2.2 User-Bias Removal-II (UBR-II)

A product has ratings given by multiple users with positive, negative or no bias. Hence, we assume that the average rating for the product is unbiased. The difference between a specific user's score and this average rating can then be considered that user's bias for the product. These individual biases, averaged over all products the user has reviewed, give us the net bias for the user. This bias can then be used in a manner similar to that in UBR-I. The details are as follows:

  1. For each product $p_j$, calculate the mean $\mu_{p_j}$ of the scores given by all the users:

    $$\mu_{p_j} = \frac{1}{N_{p_j}} \sum_{i} R(u_i, p_j)$$

    Here, $N_{p_j}$ is the number of reviews for product $p_j$, and the sum runs over the users who reviewed it.

  2. For a user $u_i$ and product $p_j$, $R(u_i, p_j) - \mu_{p_j}$ is the bias of that user for product $p_j$. Now calculate the net bias $B(u_i)$ of that user:

    $$B(u_i) = \frac{1}{N_{u_i}} \sum_{j} \left(R(u_i, p_j) - \mu_{p_j}\right)$$

    Here, $N_{u_i}$ is the number of products reviewed by user $u_i$.

  3. For each review score, calculate the user-bias removed score as follows:

    $$NR(u_i, p_j) = R(u_i, p_j) - B(u_i)$$

    Here, $NR(u_i, p_j)$ represents the normalised score (after user-bias removal) for user $u_i$ and product $p_j$.

  4. We use this normalised score $NR(u_i, p_j)$ as the label and the review text as input features (one of tf-idf, LDA or Doc2Vec) to train a least-squares linear regressor [Galton(1886)].

  5. During prediction, the regressor is used to predict the normalised review rating $PNR(u_i, p_j)$ for a new review of user $u_i$ for product $p_j$. We recover the original user-biased score by the equation:

    $$PR(u_i, p_j) = PNR(u_i, p_j) + B(u_i)$$

    Here, $PR(u_i, p_j)$ and $PNR(u_i, p_j)$ are the predicted score and the predicted normalised score (user-bias removed) respectively for user $u_i$ and product $p_j$. Since the true ratings are integers, we floor or cap $PR(u_i, p_j)$ to the nearest integer in {1, 2, 3, 4, 5} to get the final predicted score. This score is used to calculate the final error.


Note that, instead of normalising over all reviews, we perform product-specific zero-mean normalisation and thus consider only the review scores of products that the user has reviewed to gauge their bias. In both UBR-I and UBR-II, we assume that the user has reviewed at least one product. This is a fair assumption, since a new user cannot provide any information or cues to model their bias.
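A corresponding sketch for UBR-II is given below, again as an illustration rather than the authors' code. The names `fit_user_bias`, `remove_bias` and `restore` are hypothetical, and the final step assumes the same round-and-clip reading of "floor or cap" to the 1 to 5 range.

```python
from collections import defaultdict
from math import floor


def fit_user_bias(reviews):
    """Return {user: net bias} from (user, product, score) triples."""
    scores_by_product = defaultdict(list)
    for _user, product, score in reviews:
        scores_by_product[product].append(score)
    # Step 1: per-product mean, assumed to be unbiased.
    product_mean = {p: sum(s) / len(s) for p, s in scores_by_product.items()}

    # Step 2: average each user's deviation from the product means.
    deviations = defaultdict(list)
    for user, product, score in reviews:
        deviations[user].append(score - product_mean[product])
    return {u: sum(d) / len(d) for u, d in deviations.items()}


def remove_bias(score, user, bias):
    """Step 3: subtract the user's net bias to obtain the training label."""
    return score - bias[user]


def restore(pred_norm, user, bias):
    """Step 5: add the bias back, then round and clip to 1..5 (assumption)."""
    raw = pred_norm + bias[user]
    return min(5, max(1, floor(raw + 0.5)))


# Toy data (made up): user "a" rates one point above each product mean,
# user "b" one point below.
toy_reviews = [("a", "p1", 5), ("b", "p1", 3), ("a", "p2", 4), ("b", "p2", 2)]
toy_bias = fit_user_bias(toy_reviews)
```

Unlike UBR-I, this variant stores a single number per user and only shifts the scores; it does not rescale them.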

3 Experiments

3.1 Dataset Description

We use the Amazon Food Review Dataset [McAuley and Leskovec(2013)], consisting of 568,454 reviews by Amazon users up to October 2012. The dataset is publicly available for download from the Kaggle site as Amazon Fine Food Reviews [Kaggle(2012)]. Each review has a ReviewId, UserId, Score, Text and a brief Summary of the review.

We also experiment on two major categories, Electronics and Movies and TV, of the Amazon e-commerce dataset. These are described in detail in [McAuley and Leskovec(2013)] and [McAuley et al.(2015)McAuley, Targett, Shi, and van den Hengel]. We use a 4:1 train-test split on all datasets. We uniquely identify each user by the UserId provided in the dataset. Similarly, each product is identified by the ProductId field.

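We measure prediction error with the root mean square error (RMSE):

$$RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left(y_i - \hat{y}_i\right)^2}$$

Here, $y_i$ and $\hat{y}_i$ are the true and predicted values respectively for the test dataset and $N$ is the total number of samples in the test set.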
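As a small illustration of the error metric reported in Section 3.3 (RMSE over true and predicted test scores), the computation can be sketched as follows. The function name `rmse` and the example values are ours, chosen only for illustration.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean square error between two equal-length score sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(y_true))


# Made-up example: one prediction is off by 1 and one by 2.
example_error = rmse([5, 3, 1], [4, 3, 3])
```

Because the errors are squared before averaging, a prediction that is off by two stars is penalised four times as much as one that is off by one, which plain accuracy would not distinguish.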

3.2 Baselines

We compare our methods to 5 statistical methods and 4 classification methods that don’t model user-bias:

  • Majority Voting: Predict the score for a review as the mode of all review scores.

  • User Mean / User Mode: Predict the score for a review by user $u_i$ as the mean/mode of all review scores of user $u_i$.

  • Product Mean / Product Mode: Predict the score for a review of product $p_j$ as the mean/mode of all review scores of product $p_j$.

  • LinearSVM: Train a multi-class classifier (Linear SVM one-vs-rest ) with text features from text+summary field to predict scores (class) for a given review.

  • Naive Bayes: Train a multi-class classifier (Bernoulli/Multinomial Naive Bayes) with text features from text+summary field to predict scores (class) for a given review.

  • Decision Tree: Train a multi-class classifier (Decision Tree) with text features from text+summary field to predict scores (class) for a given review.

The majority voting method is independent of the specific product and the specific user. The first five baseline methods are independent of the extracted review text features. We evaluate our models, along with all baselines, on both unigram (uni) and bigram (bi) vocabularies. All baselines give integer ratings. The implementation of both methods (UBR-I and UBR-II) and of the baselines is available on GitHub.
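The five statistical baselines need nothing beyond grouped means and modes of the training scores. The sketch below shows one way to compute them; it is an assumption-laden illustration, with a hypothetical function name (`fit_statistical_baselines`) and a tie-breaking rule for the mode (first score to reach the highest count) that the text does not specify.

```python
from collections import Counter, defaultdict


def _mode(scores):
    # Ties go to the score encountered first (an assumption).
    return Counter(scores).most_common(1)[0][0]


def _mean(scores):
    return sum(scores) / len(scores)


def fit_statistical_baselines(train):
    """Compute lookup tables from (user, product, score) training triples."""
    by_user, by_product = defaultdict(list), defaultdict(list)
    for user, product, score in train:
        by_user[user].append(score)
        by_product[product].append(score)
    return {
        "majority": _mode([score for _u, _p, score in train]),
        "user_mean": {u: _mean(s) for u, s in by_user.items()},
        "user_mode": {u: _mode(s) for u, s in by_user.items()},
        "product_mean": {p: _mean(s) for p, s in by_product.items()},
        "product_mode": {p: _mode(s) for p, s in by_product.items()},
    }


# Toy training data (made up for illustration only).
toy_train = [
    ("u1", "p1", 5), ("u1", "p2", 5), ("u1", "p3", 2),
    ("u2", "p1", 5), ("u2", "p2", 1), ("u2", "p3", 1),
]
baselines = fit_statistical_baselines(toy_train)
```

A test review by user `u` for product `p` is then scored by a dictionary lookup, for example `baselines["user_mean"][u]`, with no use of the review text.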

3.3 Results

We compare the baselines with both our methods, i.e. User-Bias Removal-I (UBR-I) and User-Bias Removal-II (UBR-II), using the standard root mean square error (RMSE), as it is a more relevant measure of performance for relative scoring than accuracy is. We evaluate our approach with three feature formation techniques, tf-idf [Salton and McGill(1986)], LDA [Blei et al.(2003)Blei, Ng, and Jordan] (ntopics = 100) and Doc2Vec (PV-DBOW) [Le and Mikolov(2014)], to check the effect of the feature formation technique. For tf-idf we compare our approach on both unigrams (25K vocabulary) and bigrams (25K vocabulary). All hyper-parameters are tuned and the best performance is reported.

| Methods | tf-idf | LDA | PV-DBoW |
|---|---|---|---|
| Majority Voting | 1.535 | 1.535 | 1.535 |
| User Mean | 0.599 | 0.599 | 0.599 |
| User Mode | 2.557 | 2.557 | 2.557 |
| Product Mean | 1.140 | 1.140 | 1.140 |
| Product Mode | 1.746 | 1.746 | 1.746 |
| LinearSVM | 0.888 | 1.494 | 1.06 |
| LinearSVM (bi) | 0.737 | - | - |
| MultinomialNB | 1.360 | 1.535 | 1.535 |
| MultinomialNB (bi) | 1.047 | - | - |
| BernoulliNB | 1.173 | 1.535 | 1.182 |
| BernoulliNB (bi) | 1.041 | - | - |
| Decision Tree | 1.042 | 1.259 | 1.485 |
| Decision Tree (bi) | 1.015 | - | - |
| UBR-I | 0.546 | 0.597* | 0.56* |
| UBR-I (bi) | 0.529* | - | - |
| UBR-II | 0.669 | 0.778 | 0.71 |
| UBR-II (bi) | 0.642 | - | - |

Table 1: RMSE results for the Amazon Fine Food dataset (unigram vocabulary unless marked (bi); values marked with * show the best performance, obtained by the UBR methods of this paper)

| Methods | tf-idf | LDA | PV-DBoW |
|---|---|---|---|
| Majority Voting | 1.417 | 1.417 | 1.417 |
| User Mean | 1.022 | 1.022 | 1.022 |
| User Mode | 1.278 | 1.278 | 1.278 |
| Product Mean | 1.095 | 1.095 | 1.095 |
| Product Mode | 1.358 | 1.358 | 1.358 |
| LinearSVM | 0.932 | 1.434 | 1.1 |
| LinearSVM (bi) | 0.805 | - | - |
| MultinomialNB | 1.299 | 1.417 | 1.417 |
| MultinomialNB (bi) | 1.045 | - | - |
| BernoulliNB | 1.225 | 1.417 | 1.1706 |
| BernoulliNB (bi) | 1.137 | - | - |
| Decision Tree | 1.237 | 1.434 | 1.480 |
| Decision Tree (bi) | 1.199 | - | - |
| UBR-I | 0.815 | 0.988* | 0.86* |
| UBR-I (bi) | 0.763 | - | - |
| UBR-II | 0.821 | 1.011 | 0.9 |
| UBR-II (bi) | 0.761* | - | - |

Table 2: RMSE results for the Amazon e-Commerce Electronics dataset (unigram vocabulary unless marked (bi); values marked with * show the best performance, obtained by the UBR methods of this paper)

| Methods | tf-idf | LDA | PV-DBoW |
|---|---|---|---|
| Majority Voting | 1.494 | 1.494 | 1.494 |
| User Mean | 1.005 | 1.005 | 1.005 |
| User Mode | 1.258 | 1.258 | 1.258 |
| Product Mean | 1.066 | 1.066 | 1.066 |
| Product Mode | 1.347 | 1.347 | 1.347 |
| LinearSVM | 0.936 | 1.273 | 1.08 |
| LinearSVM (bi) | 0.853 | - | - |
| MultinomialNB | 1.271 | 1.494 | 1.494 |
| MultinomialNB (bi) | 1.041 | - | - |
| BernoulliNB | 1.264 | 1.494 | 1.098 |
| BernoulliNB (bi) | 1.206 | - | - |
| Decision Tree | 1.294 | 1.445 | 1.466 |
| Decision Tree (bi) | 1.270 | - | - |
| UBR-I | 0.818 | 0.959* | 0.87* |
| UBR-I (bi) | 0.783 | - | - |
| UBR-II | 0.814 | 0.982 | 0.87* |
| UBR-II (bi) | 0.775* | - | - |

Table 3: RMSE results for the Amazon e-Commerce Movies and TV dataset (unigram vocabulary unless marked (bi); values marked with * show the best performance, obtained by the UBR methods of this paper)

Table 1 shows results for Amazon Fine Food Reviews. It is clear from Table 1 that UBR-I and UBR-II generally outperform the baselines for all feature formation techniques (tf-idf, LDA and Doc2Vec). Tf-idf with bigram features outperforms tf-idf with unigram features, possibly because bigrams capture negations in the text. We also experiment with the Amazon e-Commerce Electronics and Movies and TV datasets. The corresponding results are shown in Table 2 and Table 3 respectively. Again, UBR-I and UBR-II generally outperform the baselines for all feature formation techniques (tf-idf, LDA and Doc2Vec). Note that "-" in all tables denotes not applicable.

4 Related Work

The most relevant work that handles user-bias and is similar to our approach is described in [Seroussi et al.(2010)Seroussi, Zukerman, and Bohnert]. It is based on memory-based collaborative filtering for score prediction. Multiple score prediction models are trained, one for each user, using the reviews and scores corresponding to that user. In contrast to our approach, [Seroussi et al.(2010)Seroussi, Zukerman, and Bohnert] requires multiple models, one for each user. In general, many users have very few review texts, which results in poor user-specific classifiers/regressors due to sparsity in the training data. Another problem is the large training and prediction time, along with the space requirements, since multiple models have to be learned. For good generalization performance, their models have large sample, space and time complexity. Since we use a single regressor in our model, the sample complexity needed is much lower. In addition, our model is fast in both training and prediction. [Li et al.(2011)Li, Liu, Jin, Zhao, Yang, and Zhu] also use multiple user-product specific models, one for each user-product pair, to handle bias. They use coordinate descent or alternate minmax to learn parameters. Similar to [Seroussi et al.(2010)Seroussi, Zukerman, and Bohnert], the method requires large sample complexity and has high space and time complexity for good generalization performance.

Other approaches, described in [Tang et al.(2015b)Tang, Qin, Liu, and Yang], [Tang et al.(2015a)Tang, Qin, and Liu] and [Chen et al.(2016)Chen, Xu, He, Xia, and Wang], use deep learning models to incorporate user-bias, product bias or both. All these models have a large number of training parameters, i.e. the weights of the deep network. Model complexity also increases because of the user- and product-specific parameters. Compared to other approaches, these models require large training datasets as well. Other approaches are generally described for binary, i.e. 0-1, score prediction, where the user-bias problem is not as germane as for ordinal rating prediction.

In real-world datasets there are not enough reviews per user to train a separate model for each user. In all three Amazon datasets, the distribution of the number of reviews per user has a long tail, as shown in Figure 2. Training a separate classifier for each user is also intractable due to the large sample requirement, large training and prediction time, and large storage requirements.

5 Conclusion

We consider the problem of user-bias in review score prediction and suggest two simple statistical approaches to reduce prediction error (RMSE). We experimented with three popular feature vector representations, tf-idf, LDA and Doc2Vec, on the Amazon Fine Food Reviews dataset and on two major categories of the Amazon e-commerce dataset (Electronics and Movies and TV). Our approach showed improved RMSE performance compared to baseline approaches that do not remove user-bias. Compared to other methods, which use a separate classifier for every user, our proposed methods use only a single global regressor for predicting scores. Our proposed methods have lower sample, space and time complexity than other methods described in the literature.

6 Future Work

Currently, we use the review text only to predict scores and not for user-bias removal. As a future direction, bias could be defined jointly over different types of feedback (such as text sentiment and review scores). We plan to extend the proposed model to take the sentiment of positive and negative terms into account along with the scores for more accurate bias modelling. We could also jointly model individual user-bias (UBR-I) and collective user-bias for a given product, i.e. product bias (UBR-II), in a combined model (UBR-III).


The authors want to thank Nagarajan Natarajan (Post-Doc, Microsoft Research, India), Janish Jindal (Student, IIT Kanpur) and Bhargavi Paranjape (Research Fellow, Microsoft Research, India) for their encouragement and valuable feedback.




  1. David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent dirichlet allocation. J. Mach. Learn. Res. 3:993–1022.
  2. Tao Chen, Ruifeng Xu, Yulan He, Yunqing Xia, and Xuan Wang. 2016. Learning user and product distributed representations using a sequence model for sentiment analysis. IEEE Computational Intelligence Magazine 11(3):34–44.
  3. Francis Galton. 1886. Regression towards mediocrity in hereditary stature. The Journal of the Anthropological Institute of Great Britain and Ireland 15:246–263.
  4. Kaggle. 2012. Amazon fine food reviews, kaggle - initial analysis and word clouds.
  5. Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 42(8).
  6. Quoc V Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In ICML. volume 14, pages 1188–1196.
  7. Fangtao Li, Nathan Liu, Hongwei Jin, Kai Zhao, Qiang Yang, and Xiaoyan Zhu. 2011. Incorporating reviewer and product information for review rating prediction. In IJCAI. volume 11, pages 1820–1825.
  8. Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton van den Hengel. 2015. Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, SIGIR ’15.
  9. Julian John McAuley and Jure Leskovec. 2013. From amateurs to connoisseurs: Modeling the evolution of user expertise through online reviews. In Proceedings of the 22nd International Conference on World Wide Web. ACM, pages 897–908.
  10. Francesco Ricci, Lior Rokach, and Bracha Shapira. 2011. Introduction to recommender systems handbook. Springer.
  11. Gerard Salton and Michael J McGill. 1986. Introduction to modern information retrieval.
  12. Yanir Seroussi, Ingrid Zukerman, and Fabian Bohnert. 2010. Collaborative inference of sentiments from texts. In International Conference on User Modeling, Adaptation, and Personalization. Springer, pages 195–206.
  13. Duyu Tang, Bing Qin, and Ting Liu. 2015a. Learning semantic representations of users and products for document level sentiment classification. In ACL (1). pages 1014–1023.
  14. Duyu Tang, Bing Qin, Ting Liu, and Yuekui Yang. 2015b. User modeling with neural network for review rating prediction. In IJCAI. pages 1340–1346.