Estimating the Rating of Reviewers Based on the Text
User-generated texts such as reviews and social media posts are valuable sources of information. Online reviews help users decide whether to buy a product, see a movie, or make another choice. The rating attached to a review is therefore one of the signals users rely on when deciding whether to read and trust it. This paper analyzes the text of reviews to evaluate and predict their ratings. We also study how lexical features generated from the text, as well as sentiment words, affect the accuracy of rating prediction. Our analysis shows that words with a high information gain score are more effective than words with a high TF-IDF value. In addition, we explore the best number of features for predicting the ratings of the reviews.
Keywords: Review Mining, Natural Language Processing, Machine Learning, Big Data.
With the rapid growth of the Internet and online shopping systems, customers share their opinions on online platforms to help other customers make wiser decisions. These contributions have produced active communities that are valuable sources for both scholars and industry owners. Online reviews help users decide whether to buy a product or see a movie. Moreover, the ratings of the reviews are important indicators of the quality of a product, and as a result they can serve as a reliable signal for users deciding whether to read and trust the reviews or purchase a product. Review ratings are also used in recommender systems to automatically suggest products to users, based on the choices and attributes they share with other users. Due to their importance, business owners and academic scholars have studied user-generated reviews to find efficient techniques for estimating ratings based on the content of the text [1, 37, 20, 7, 6]. Research on review mining is vast and is not limited to rating analysis. Opinion extraction and sentiment analysis, building recommendation systems, and summarizing texts are among the domains that have been extensively explored in this area [30, 4, 24, 25, 16, 18, 22].
A number of studies have focused on predicting the ratings of different products on Amazon (please refer to section 2 for more information). In this paper, we focus on predicting the ratings of films (documentary and non-documentary) based on the text of the reviews. Reviews of documentary films have richer texts that focus on the themes of the films, while reviews of non-documentary films often discuss attributes such as a famous cast. We believe that combining these two types of films and using textual features will provide good insights into the underlying characteristics of the texts.
The rest of the paper is organized as follows. Section 2 discusses related work. Section 3 discusses the data collection. In the method section (4), we first explain the feature selection and the classifiers that we chose for this study, and then report the results of the classifiers. Section 5 presents the conclusion and future directions.
2 Related Work
As mentioned earlier, prior work on review mining is vast. Researchers in this area have tried to find efficient algorithms for predicting the rating, helpfulness, and sentiment of reviews [13, 9, 15, 35, 32, 25, 26, 17]. In this section, we explore the most closely related work in review mining and analyze the papers that used the text of the reviews to extract information. Liu and Zhang (2012) reviewed the most efficient algorithms in opinion mining and sentiment analysis of reviews. Supervised and unsupervised approaches are employed to extract the sentiment of reviews at the sentence or document level. Since a positive or negative opinion about a product (e.g., a camera) does not show the feeling of the opinion holder about every specific feature of that product (e.g., the picture quality of a camera), aspect-based sentiment analysis was introduced to find the opinions related to each attribute at the sentence level.
De Albornoz et al. used both the topic and the sentiment of reviews at the sentence level to assess the impact of text-driven information on predicting the rating of reviews in recommendation systems. Their article aims to predict the overall rating of a product review based on the user's opinion about the different product features that are evaluated in the review. After identifying the features that are relevant to consumers, they extracted users' opinions about those product features. The salience of the different product features and the values that quantify the users' opinions are used to construct the feature vector that represents the review. This vector was used as the input of the machine learning model to classify the review into different rating categories. As a result, the models achieved 84 percent for Logistic, 83 percent for LibSVM and 81.9 percent for FT. The errors are assumed to be the result of: (1) mislabeled instances in the training set, (2) frequent spelling errors in the reviews, and (3) the presence of neutral sentences which do not express any opinion but are necessarily classified as positive or negative. In another study, Ghose et al. created a dataset of products from the Amazon website. The dataset consisted of product-specific characteristics and the details of the product reviews. They generated a training set with two classes of documents: a) a set of "objective" documents that contained the product descriptions of each of 1,000 products and b) a set of "subjective" documents that contained randomly retrieved reviews. As reported, this model successfully identified the reviews most helpful to the users. The most "helpful" reviews can be displayed on top to improve users' reviewing experience on electronic marketplaces [3, 12, 28, 27, 29].
Kim et al. suggested an automated assessment of review helpfulness, finding the length of the review to be the most useful feature. The main contributions of their paper are: a) using the helpfulness score to implement an automatic computational model to rank the reviews, and b) leveraging various features of the reviews, such as structural, lexical, syntactic, and semantic features and the reviews' star ratings, to predict the helpfulness score of the reviews. For products such as MP3 players and cameras, the model achieved correlation coefficient scores of 0.656 and 0.604. The detailed analysis of features showed that the length of the reviews, unigrams, and product rating were the most helpful in the prediction, while structural and syntactic features had no significant impact.
Rezapour and Diesner studied movie reviews from a perspective that, as claimed by the authors, is new in the area of review and opinion mining. In this work, the authors captured the impact of movies from the reviews by first creating a novel dictionary of impact and then annotating the reviews at the sentence level with various types of impact, such as change in behavior and change in cognition. They used three different classifiers with three sets of features to classify the impact of each film on the review authors. The results showed that the SVM classifier with the combination of all features is the best predictor of impact in reviews. In another work, the alignment of social media (Facebook and reviews), news articles, and the transcript of the film was studied. It was found that social media are more aligned with the main subject of the films. The results of these two studies show that films are important sources that are capable of influencing people's behavior and cognition.
The study presented in this paper builds upon this previous research. We study the influence of lexical features as well as the impact of feature size on the classification task. Moreover, our analysis highlights the importance of using the correct feature size for rating classification. Note that this study is a small-scale rating prediction. Using the insights of this paper, more sophisticated algorithms based on MapReduce can be used for data-parallel processing of big data to speed up the processing time; see, e.g., [36, 39, 5, 40, 19].
We used the reviews of eight films from Amazon. Table 1 shows the names and the number of reviews for each film. Around 65 percent of the reviews are 5-star reviews and 20 percent are 1-, 2-, or 3-star reviews. To have more distinct classes, 4-star reviews were excluded from the dataset. For preprocessing, all the reviews were first divided into two classes: High (5 stars) and Low (1, 2, and 3 stars). We then removed the stop words and tokenized the sentences. All the preprocessing was performed using Python NLTK and custom programs. The resulting dataset consisted of 39,802 sentences and 307,138 words. Words in a collection of documents mostly follow Zipf's law, with rare and very common words concentrated in the two tails of the distribution. To address this problem, we removed the words that occur fewer than 10 times. The words with high counts are downscaled during feature selection, using TF-IDF or information gain. More details can be found in the following section.
Table 1: Number of reviews per film and star rating.
| Name | Reviews | 5 Star | 4 Star | 3 Star | 2 Star | 1 Star |
| The Imitation Game | 829 | 577 | 158 | 54 | 14 | 26 |
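The preprocessing steps described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the paper relied on NLTK and custom programs, whereas the sketch uses a regular-expression tokenizer and a tiny hypothetical stop-word list (`STOP_WORDS`). The function names and the `min_count` parameter are also introduced here for illustration only.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; the study used NLTK's resources.
STOP_WORDS = {"the", "a", "an", "and", "of", "is", "it", "this", "was", "to", "in"}


def label_for(stars):
    """Map a star rating to a class; 4-star reviews are excluded (None)."""
    if stars == 5:
        return "High"
    if stars in (1, 2, 3):
        return "Low"
    return None


def tokenize(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]


def preprocess(reviews, min_count=10):
    """Return (tokens, label) pairs, keeping only words seen at least min_count times."""
    labeled = [(tokenize(text), label_for(stars)) for text, stars in reviews]
    labeled = [(toks, lab) for toks, lab in labeled if lab is not None]
    counts = Counter(tok for toks, _ in labeled for tok in toks)
    return [([t for t in toks if counts[t] >= min_count], lab) for toks, lab in labeled]


# Toy input (invented for illustration): the 4-star review is dropped,
# and only "film" survives the frequency threshold of 2.
toy_reviews = [("A moving and powerful film", 5), ("Boring film", 2), ("Good film", 4)]
print(preprocess(toy_reviews, min_count=2))
```

With the threshold of 10 used in the paper, the same function would discard the rare tail of the vocabulary before feature selection.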
Reviews, as user-generated texts, convey the feelings and personal ideas of their authors. Unlike the products usually studied in opinion mining and sentiment analysis, documentary films do not benefit from special attributes such as a famous cast or director. Since the aim of a documentary film is to raise awareness about a social issue or introduce a new topic, individuals tend to focus on these areas in their reviews as well. To find the ratings of the reviews, some prior studies leveraged the helpfulness level as one of the features in rating classification. The helpfulness level also plays a major role in building recommendation systems. We did not consider this feature in our study for two reasons: 1) Documentary films are not among the popular film genres, and therefore the number of reviews written and viewed by customers is limited. In some cases, this lack of interest results in a limited number of helpfulness ratings. Moreover, only around 1,400 reviews were viewed more than once and also had a helpfulness rating, which is very little input data for the classifiers. 2) In this study, we focus only on textual features with no external input.
4.1 Feature Selection
Term frequency-inverse document frequency (TF-IDF) is a numerical score that highlights the importance of a word in a document collection or corpus. Areas such as information retrieval and text mining use this score as a weighting factor. The score increases with the number of times a word appears in a document and is offset by the number of documents in the collection that contain the word. This lets us downscale words that occur frequently across the collection, such as stop words. We considered the top 500 and top 900 words with the highest TF-IDF scores to train and test the classifiers.
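As a rough illustration of TF-IDF-based selection, the sketch below scores each word by its raw term frequency times the logarithm of the inverse document frequency and keeps the `k` best words. The paper does not state how per-document scores were combined into one ranking, so summing them over the corpus is an assumption of this sketch, as is the function name `top_tfidf_words`.

```python
import math
from collections import Counter


def top_tfidf_words(docs, k):
    """Rank words by their TF-IDF score summed over all documents.

    docs is a list of token lists. Summing per-document scores into one
    corpus-level score per word is an assumption of this sketch.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    scores = Counter()
    for doc in docs:
        for word, tf in Counter(doc).items():
            # Raw term frequency times the inverse document frequency.
            scores[word] += tf * math.log(n_docs / df[word])
    return [word for word, _ in scores.most_common(k)]


# Toy corpus (invented): "film" occurs in every document, so its score is zero.
toy_docs = [["great", "film", "great"], ["boring", "film"], ["film", "moving"]]
print(top_tfidf_words(toy_docs, 1))
```

In the toy corpus the word that appears in every review receives a score of zero, which is the downscaling effect described above.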
Information gain leverages the presence or absence of a term in the documents to measure how much information the term provides for predicting the category. To use this feature, we calculated the information gain of all words and chose the top 200, 600, 900 and 1000 words as features. We compare the results of the classifiers using the different feature sets to find the ones that help the most in predicting the ratings.
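Information gain for a term can be written as the class entropy minus the entropy that remains once the presence or absence of the term is known. The following sketch computes it directly from that definition; the function names are hypothetical, and the exact variant used in the study (for example, the implementation in the toolkit the authors used) may differ in detail.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(docs, labels, word):
    """Reduction in class entropy from knowing whether `word` occurs in a review."""
    with_word = [lab for doc, lab in zip(docs, labels) if word in doc]
    without_word = [lab for doc, lab in zip(docs, labels) if word not in doc]
    n = len(labels)
    conditional = (len(with_word) / n) * entropy(with_word) \
        + (len(without_word) / n) * entropy(without_word)
    return entropy(labels) - conditional


def top_info_gain_words(docs, labels, k):
    """Return the k words with the highest information gain."""
    vocab = sorted({w for doc in docs for w in doc})
    return sorted(vocab, key=lambda w: information_gain(docs, labels, w), reverse=True)[:k]


# Toy data (invented): "great" and "boring" separate the classes perfectly,
# while "film" occurs equally in both classes and carries no information.
toy_docs = [["great", "film"], ["great", "story"], ["boring", "film"], ["boring", "plot"]]
toy_labels = ["High", "High", "Low", "Low"]
print(top_info_gain_words(toy_docs, toy_labels, 2))
```

Unlike the TF-IDF ranking, this score uses the class labels, so it favors words that discriminate between High and Low reviews.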
One of the popular features for estimating the ratings of reviews is the sentiment of the words. In sentiment analysis, each word is tagged with a polarity: positive, negative or neutral. There are two well-known approaches for analyzing the sentiment of texts. In this paper, we used the lexicon-based approach, which relies on a predefined lexicon of words and their polarity (as a tag or a ratio). To get the sentiment words, we used the MPQA subjectivity lexicon, and extracted and tagged the matching words in the reviews. In total, 2,055 words were extracted.
After extracting the words of each feature set, we created the feature vectors using the Python scikit-learn library. Table 2 shows the list of features that were used in this study. In addition, the top 10 words of each feature set (Info-Gain, TF-IDF and sentiment) are listed in Table 3.
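A lexicon-based extraction of sentiment words might look like the sketch below. The four-entry `LEXICON` is a hypothetical stand-in: the MPQA subjectivity lexicon is much larger and ships in its own file format, which is not parsed here. The `min_count` default of 5 follows the frequency threshold listed for the sentiment feature set in Table 2.

```python
from collections import Counter

# Hypothetical miniature lexicon; the real MPQA lexicon is far larger.
LEXICON = {
    "great": "positive",
    "moving": "positive",
    "boring": "negative",
    "awful": "negative",
}


def sentiment_features(docs, lexicon, min_count=5):
    """Return lexicon words occurring at least min_count times, with their polarity."""
    counts = Counter(tok for doc in docs for tok in doc if tok in lexicon)
    return {word: lexicon[word] for word, n in counts.items() if n >= min_count}


# Toy data (invented), with a lower threshold so that something survives.
toy_docs = [["great", "film"], ["great", "boring"], ["boring", "plot"], ["awful"]]
print(sentiment_features(toy_docs, LEXICON, min_count=2))
```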
Table 2: Feature sets used in this study.
| TF-IDF | Top 500 words |
| TF-IDF | Top 900 words |
| Info-Gain | Top 200 words |
| Info-Gain | Top 600 words |
| Info-Gain | Top 900 words |
| Info-Gain | Top 1000 words |
| Sentiment words | Sentiment words with a count of 5 or more |
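Once a feature set has been chosen, each review is mapped to a vector over the selected words. The study built these vectors with scikit-learn; the plain-Python function below is only a stand-in that shows the idea. Whether the original vectors held counts, binary indicators, or weighted values is not stated, so the use of raw counts here is an assumption.

```python
def to_feature_vector(tokens, vocabulary):
    """Count how often each selected word occurs in one review."""
    index = {word: i for i, word in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for tok in tokens:
        if tok in index:  # words outside the selected feature set are ignored
            vector[index[tok]] += 1
    return vector


# Toy example (invented): a three-word feature set and one tokenized review.
selected_words = ["great", "boring", "story"]
print(to_feature_vector(["great", "film", "great", "story"], selected_words))
```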
We used two well-known classifiers, Support Vector Machines (SVM) and Naïve Bayes (NB), to classify the ratings [10, 30]. These two algorithms are among the most common in this area of research. Support vector machines are universal learners based on the structural risk minimization principle from computational learning theory. In general, SVMs are highly accurate and work best with an appropriate kernel, especially when the data is not linearly separable. They also work well in high-dimensional spaces, so on theoretical grounds SVMs should perform well for text categorization. Naïve Bayes is a simple classifier that performs well in both semi-supervised and fully supervised settings.
After creating the feature vectors, we randomly selected 90 percent of the data for training the classifiers. The remaining 10 percent was held out for testing the classifiers with the highest accuracy and the most effective feature sets. We used WEKA to implement the two algorithms. Table 4 shows the average accuracy, precision, recall, and F-score of the two classifiers, SVM and NB, for the selected features. Based on the results in Table 4, the highest accuracy was obtained with the top 600 and top 900 Info-Gain words for SVM and the top 200 Info-Gain words for NB. The average accuracy of 82.0 percent and the F-score of 90 percent are the highest among all feature sets. The SVM classifier achieved better performance than NB.
We can also see that information gain is more effective than sentiment and TF-IDF. The precision of TF-IDF is very high, but the classifier simply predicts the High class, which is the larger class. The average accuracy of both the top 500 and top 900 TF-IDF feature sets is the lowest among all feature sets, and, as Table 5 shows, the classifiers do not show enough confidence in predicting the classes. Based on the average accuracy, prediction confidence and F-score values in Table 4 and Table 5, the top 600 and top 900 information gain words are the best features. Sentiment words did not help rating prediction as much as we expected, although they still performed better than TF-IDF.
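To make the classification setup concrete, the sketch below shows a random 90/10 split and a small multinomial Naïve Bayes classifier with add-one smoothing. It is a plain-Python stand-in, not the WEKA configuration used in the study, and it omits the SVM for brevity. The names `split_data` and `NaiveBayes`, the fixed random seed, and the smoothing choice are all assumptions made for illustration.

```python
import math
import random
from collections import Counter, defaultdict


def split_data(samples, train_fraction=0.9, seed=0):
    """Shuffle and split (tokens, label) pairs into training and test sets."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, samples):
        self.class_counts = Counter(label for _, label in samples)
        self.word_counts = defaultdict(Counter)
        for tokens, label in samples:
            self.word_counts[label].update(tokens)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, tokens):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            score = math.log(n / total)  # log prior of the class
            for tok in tokens:
                if tok in self.vocab:  # ignore words unseen in training
                    score += math.log((self.word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data (invented for illustration).
toy_train = [(["great", "film"], "High"), (["great", "story"], "High"), (["boring", "film"], "Low")]
model = NaiveBayes().fit(toy_train)
print(model.predict(["great"]), model.predict(["boring"]))
```

In practice the tokens passed to `fit` and `predict` would first be restricted to the selected feature set, so that the classifier sees only the chosen words.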
Table 5: Prediction confidence of the classifiers for each feature set.
| Top 500 TF-IDF | 1.17 | 0.8 |
| Top 900 TF-IDF | 2.3 | 0.96 |
| Top 200 Info-Gain | 62.1 | 44.6 |
| Top 600 Info-Gain | 63.9 | 42.11 |
| Top 900 Info-Gain | 63.3 | 42 |
| Top 1000 Info-Gain | 55 | 43 |
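The caution about a classifier that only predicts the larger class can be illustrated with a small worked example. The sketch below computes accuracy, precision, recall, and F-score for one class treated as positive, and applies them to a hypothetical predictor that always answers High on a class mix of 65 High to 20 Low reviews, which roughly mirrors the proportions reported for the dataset. The paper reports averaged metrics without stating the averaging scheme, so the single-class definitions used here are an assumption.

```python
def evaluate(gold, predicted, positive="High"):
    """Accuracy, precision, recall, and F-score for one class treated as positive."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    accuracy = sum(1 for g, p in pairs if g == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f_score


# Hypothetical class mix resembling the dataset: 65 High reviews per 20 Low reviews.
gold = ["High"] * 65 + ["Low"] * 20
always_high = ["High"] * len(gold)
print(evaluate(gold, always_high))
```

Under these assumptions the trivial predictor already reaches an accuracy of about 0.76 and an F-score of about 0.87 for the High class without ever identifying a Low review, which is why the scores in Table 4 are best read together with the confidence values in Table 5.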
Comparing the results of this paper with other papers, and especially with De Albornoz's work (83 percent for SVM), the results of this research are comparable with other work in this area. One important note is that the reviewed works leveraged various types of features as well as sentiment, while the results presented in this work are based solely on lexical features. To take a step further, we tested the trained classifiers on the 10 percent test set (as explained before). The results are slightly different from what we achieved earlier. Table 6 shows the features and the results of the classifiers. Since we have a small dataset, we chose (1) the top 200 words, to avoid overfitting, and (2) the top 600 words, as one of the best feature sets. As in training, SVM resulted in higher accuracy than NB. However, the top 200 Info-Gain words performed better than the top 600.
Table 6: Overall accuracy of the classifiers on the test set.
| SVM (Overall Accuracy) | NB (Overall Accuracy) |
Based on the results in Tables 4 and 5, increasing the number of attributes in a feature set did not help the prediction of the ratings. One assumption is that feature sets whose size is proportional to the size of the data work better than sets with too many or too few attributes. A larger number may result in overfitting of the classifier, and a smaller number may not capture the information needed to classify the reviews. The best features were the top 600 and top 900 information gain words, with the highest average accuracy (82 percent), classifier confidence (64 percent), and F-score (90 percent). Running the test data showed that the words of the largest group or domain can sometimes dominate selected features such as information gain and may result in biased classifiers. In addition, we found that TF-IDF is not always the best metric for extracting the most salient words from a document. We showed that words with the highest information gain scores perform better. As noted in the methodology and results, we did not combine the features to increase the accuracy. We found that the words in the different feature sets overlap heavily, which would result in overfitting the classifiers.
We plan to explore other algorithms and approaches in the future. With the popularity of deep learning algorithms, we can test the same approach using word embeddings and an LSTM or CNN. In addition, we will consider adding other features, such as syntactic features, the length of the reviews, and word n-grams, to our analysis. Finally, in future work we plan to add social media texts as new features for rating prediction. Like reviews, social media such as Twitter consist of user-generated texts. We believe that this new feature can substantially help the rating prediction of reviews, especially in sparse matrix situations. Tweets are rich sources of user-specific features such as sentiments, hashtags, location, and texts. We plan to extract the text and hashtags related to films and add them as sentiment and/or topical words to our features to expand and improve the prediction models. Hashtags were used in previous social media studies for topic modeling, sentiment analysis, and opinion mining. A number of studies have leveraged social media information to predict a movie's success. However, combining these two kinds of user-generated texts is not well explored in the area of rating prediction.
-  Ramin Bagherzadeh and Rohullah Bayat. Investigating online consumer behavior in iran based on the theory of planned behavior. Modern Applied Science, 10(4):21, 2016.
-  Steven Bird, Ewan Klein, and Edward Loper. Natural language processing with Python: analyzing text with the natural language toolkit. O'Reilly Media, Inc., 2009.
-  Maziar Isapour Chehardeh, Mishari Metab Almalki, and Constantine J Hatziadoniu. Remote feeder transfer between out-of-phase sources using sts. In Power and Energy Conference at Illinois (PECI), 2016 IEEE, pages 1–5. IEEE, 2016.
-  Hang Cui, Vibhu Mittal, and Mayur Datar. Comparative experiments on sentiment classification for online product reviews. In AAAI, volume 6, pages 1265–1270, 2006.
-  Amirali Daghighi and Mohammadamir Kavousi. Scheduling for data centers with multi-level data locality. In Electrical Engineering (ICEE), 2017 Iranian Conference on, pages 927–936. IEEE, 2017.
-  Cristian Danescu-Niculescu-Mizil, Gueorgi Kossinets, Jon Kleinberg, and Lillian Lee. How opinions are received by online communities: a case study on amazon.com helpfulness votes. In Proceedings of the 18th international conference on World wide web, pages 141–150. ACM, 2009.
-  Jorge Carrillo De Albornoz, Laura Plaza, Pablo Gervás, and Alberto Díaz. A joint model of feature mining and sentiment analysis for product review rating. In European conference on information retrieval, pages 55–66. Springer, 2011.
-  Jana Diesner, Rezvaneh Rezapour, and Ming Jiang. Assessing public awareness of social justice documentary films based on news coverage versus social media. IConference 2016 Proceedings, 2016.
-  Mohammad Fahim, Haewon Jeong, Farzin Haddadpour, Sanghamitra Dutta, Viveck Cadambe, and Pulkit Grover. On the optimal recovery threshold of coded matrix multiplication. In Communication, Control, and Computing (Allerton), 2017 55th Annual Allerton Conference on, pages 1264–1270. IEEE, 2017.
-  Gayatree Ganu, Noemie Elhadad, and Amélie Marian. Beyond the stars: improving rating predictions using review text content. In WebDB, volume 9, pages 1–6. Citeseer, 2009.
-  Anindya Ghose and Panagiotis G Ipeirotis. Designing novel review ranking systems: predicting the usefulness and impact of reviews. In Proceedings of the ninth international conference on Electronic commerce, pages 303–310. ACM, 2007.
-  Anindya Ghose and Panagiotis G Ipeirotis. Estimating the helpfulness and economic impact of product reviews: Mining text and reviewer characteristics. IEEE Transactions on Knowledge and Data Engineering, 23(10):1498–1512, 2011.
-  Farzin Haddadpour, Mahdi Jafari Siavoshani, and Morteza Noshad. Low-complexity stochastic generalized belief propagation. In Information Theory (ISIT), 2016 IEEE International Symposium on, pages 785–789. IEEE, 2016.
-  Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H Witten. The weka data mining software: an update. ACM SIGKDD explorations newsletter, 11(1):10–18, 2009.
-  Yu Hong, Jun Lu, Jianmin Yao, Qiaoming Zhu, and Guodong Zhou. What reviews are satisfactory: novel features for automatic helpfulness voting. In Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval, pages 495–504. ACM, 2012.
-  Minqing Hu and Bing Liu. Mining and summarizing customer reviews. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 168–177. ACM, 2004.
-  Farhad Imani, Changqing Cheng, Ruimin Chen, and Hui Yang. Nested gaussian process modeling for high dimensional imputation in healthcare systems. In Institute of Industrial and Systems Engineers Annual Conference & Expo. IISE, 2018.
-  Nitin Jindal and Bing Liu. Opinion spam and analysis. In Proceedings of the 2008 International Conference on Web Search and Data Mining, pages 219–230. ACM, 2008.
-  Mohammadamir Kavousi. Affinity scheduling and the applications on data center scheduling with data locality. arXiv preprint arXiv:1705.03125, 2017.
-  Soo-Min Kim, Patrick Pantel, Tim Chklovski, and Marco Pennacchiotti. Automatically assessing review helpfulness. In Proceedings of the 2006 Conference on empirical methods in natural language processing, pages 423–430. Association for Computational Linguistics, 2006.
-  Steven Lehrer and Tian Xie. Box office buzz: Does social media data steal the show from model uncertainty when forecasting for hollywood? Review of Economics and Statistics, 99(5):749–755, 2017.
-  Fangtao Li, Chao Han, Minlie Huang, Xiaoyan Zhu, Ying-Ju Xia, Shu Zhang, and Hao Yu. Structure-aware review mining and summarization. In Proceedings of the 23rd international conference on computational linguistics, pages 653–661. Association for Computational Linguistics, 2010.
-  Kar Wai Lim and Wray Buntine. Twitter opinion topic model: Extracting product opinions from tweets by leveraging hashtags and sentiment lexicon. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, pages 1319–1328. ACM, 2014.
-  Bing Liu and Lei Zhang. A survey of opinion mining and sentiment analysis. In Mining text data, pages 415–463. Springer, 2012.
-  Chien-Liang Liu, Wen-Hoar Hsaio, Chia-Hoang Lee, Gen-Chi Lu, and Emery Jou. Movie rating and review summarization in mobile environment. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(3):397–407, 2012.
-  Yue Lu, Panayiotis Tsaparas, Alexandros Ntoulas, and Livia Polanyi. Exploiting social context for review quality prediction. In Proceedings of the 19th international conference on World wide web, pages 691–700. ACM, 2010.
-  Seyed Nima Mozaffari, Spyros Tragoudas, and Themistoklis Haniotakis. A new method to identify threshold logic functions. In 2017 Design, Automation & Test in Europe Conference & Exhibition (DATE), pages 934–937. IEEE, 2017.
-  Seyed Nima Mozaffari, Spyros Tragoudas, and Themistoklis Haniotakis. A generalized approach to implement efficient cmos-based threshold logic functions. IEEE Transactions on Circuits and Systems I: Regular Papers, 65(3):946–959, 2018.
-  Susan M Mudambi and David Schuff. Research note: What makes a helpful online review? a study of customer reviews on amazon.com. MIS quarterly, pages 185–200, 2010.
-  Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd annual meeting on association for computational linguistics, pages 115–124. Association for Computational Linguistics, 2005.
-  Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in python. Journal of machine learning research, 12(Oct):2825–2830, 2011.
-  Babak Ravandi, Ioannis Papapanagiotou, and Baijian Yang. A black-box self-learning scheduler for cloud block storage systems. In Cloud Computing (CLOUD), 2016 IEEE 9th International Conference on, pages 820–825. IEEE, 2016.
-  Rezvaneh Rezapour and Jana Diesner. Classification and detection of micro-level impact of issue-focused documentary films based on reviews. In Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing, pages 1419–1431. ACM, 2017.
-  Rezvaneh Rezapour, Lufan Wang, Omid Abdar, and Jana Diesner. Identifying the overlap between election result and candidates' ranking based on hashtag-enhanced, lexicon-based sentiment analysis. In Semantic Computing (ICSC), 2017 IEEE 11th International Conference on, pages 93–96. IEEE, 2017.
-  Hongning Wang, Yue Lu, and Chengxiang Zhai. Latent aspect rating analysis on review text data: a rating regression approach. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 783–792. ACM, 2010.
-  Weina Wang, Kai Zhu, Lei Ying, Jian Tan, and Li Zhang. Maptask scheduling in mapreduce with data locality: Throughput and heavy-traffic optimality. IEEE/ACM Transactions on Networking (TON), 24(1):190–203, 2016.
-  Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of the conference on human language technology and empirical methods in natural language processing, pages 347–354. Association for Computational Linguistics, 2005.
-  Shuang-Hong Yang, Alek Kolcz, Andy Schlaikjer, and Pankaj Gupta. Large-scale high-precision topic modeling on twitter. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1907–1916. ACM, 2014.
-  Ali Yekkehkhany. Near data scheduling for data centers with multi levels of data locality. (Dissertation, University of Illinois at Urbana-Champaign).
-  Ali Yekkehkhany et al. Gb-pandas: Throughput and heavy-traffic optimality analysis for affinity scheduling. ACM SIGMETRICS Performance Evaluation Review.
-  Rong Zhang, Wenzhe Yu, Chaofeng Sha, Xiaofeng He, and Aoying Zhou. Product-oriented review summarization and scoring. Frontiers of Computer Science, 9(2):210–223, 2015.
-  Li Zhuang, Feng Jing, and Xiao-Yan Zhu. Movie review mining and summarization. In Proceedings of the 15th ACM international conference on Information and knowledge management, pages 43–50. ACM, 2006.