Lutz, Pröllochs, and Neumann
Review Length and Argumentation Changes
The Longer the Better? The Interplay Between Review Length and Line of Argumentation in Online Consumer Reviews
Bernhard Lutz, University of Freiburg, bernhard.email@example.com
Nicolas Pröllochs, University of Giessen, nicolas.firstname.lastname@example.org
Dirk Neumann, University of Freiburg, dirk.email@example.com
Review helpfulness serves as a focal point in understanding customers’ purchase decision-making process on online retailer platforms. An overwhelming majority of previous works find longer reviews to be more helpful than short reviews. In this paper, we propose that longer reviews should not be assumed to be uniformly more helpful; instead, we argue that the effect depends on the line of argumentation in the review text. To test this idea, we use a large dataset of customer reviews from Amazon in combination with a state-of-the-art approach from natural language processing that allows us to study argumentation lines at sentence level. Our empirical analysis suggests that the frequency of argumentation changes moderates the effect of review length on helpfulness. Altogether, we disprove the prevailing narrative that longer reviews are uniformly perceived as more helpful. Our findings allow retailer platforms to improve their customer feedback systems and to feature more useful product reviews.
Consumer reviews, word-of-mouth, decision-making, text analysis, e-commerce
Online product reviews provide a valuable source of information for customers before making purchase decisions (Yin.2016). An interesting feature of modern retailer platforms is that they also allow customers to rate the perceived helpfulness of a product review (Mudambi.2010). Previous research has shown that more helpful customer reviews also have a greater influence on retail sales (Dhanasobhon.2007). The question of what constitutes a helpful review has received increasing attention lately, mainly because review helpfulness serves as a focal point for the study of human decision-making (Mudambi.2010). For example, previous works have found that the review rating is an important determinant of review helpfulness (e. g. Pavlou.2006b). In addition to meta-data, online customer reviews typically contain review texts detailing customer opinions or user experiences (Zimmermann.2018). An overwhelming majority of previous works identify the length of the review text, e. g. the number of sentences, as a key explanatory variable and unanimously find longer reviews to be more helpful than short reviews (e. g. Mudambi.2010, Pan.2011, Yin.2016). A plausible explanation is that longer reviews tend to be more diagnostic as they can provide more arguments about product quality and previous experiences (Korfiatis.2012).
In this paper, however, we propose that longer reviews should not be assumed to be uniformly more helpful. Instead, we argue that the effect depends on the line of argumentation in the review text. Specifically, we suggest that frequent changes between positive and negative arguments require greater cognitive effort and may result in situations of information overload (Jacoby.1977). As a result, it may become difficult for customers to comprehend the review, and thus the review is unlikely to facilitate the purchase decision-making process. For example, long reviews that jump excessively between positive and negative arguments are unlikely to be particularly helpful for customers. In contrast, a review providing a clear-cut, one-sided opinion or a support-then-refute order of positive and negative arguments may be easier to comprehend and also more persuasive. Therefore, we expect a higher frequency of argumentation changes in reviews to decrease perceived helpfulness. Moreover, given increased complexity and consumers’ limited cognitive capacities, the (positive) effect of review length on perceived review helpfulness should be moderated by the frequency of argumentation changes in the review text.
To test these ideas, this paper examines the effects of review length and argumentation changes on review helpfulness. For this purpose, we use a frequently employed dataset of Amazon customer reviews together with a state-of-the-art approach from natural language processing that allows us to study the line of argumentation on the basis of individual sentences. The method uses distributed text representations in combination with multi-instance learning to infer sentence polarity labels given only the review label. Specifically, the model learns to assign similar sentences in reviews to the same polarity label and differing sentences to an opposite polarity label. The order in which positive and negative sentences appear then allows us to detect argumentation changes. Concordant with our propositions, our analyses suggest that the frequency of argumentation changes moderates the effect of review length on helpfulness.
Our findings have important implications for Information Systems research and practice: we challenge the prevalent narrative in IS research that longer reviews are perceived as more helpful in general. To the best of our knowledge, our paper is the first study demonstrating that argumentation patterns and review length are closely intertwined. From a practical perspective, our findings can directly assist retailers in optimizing their customer feedback systems and in featuring more useful product reviews.
We now derive our research hypotheses, all of which are based on the notion that seeking helpful pre-purchase information plays an important role in consumers’ decision-making processes (Engel.1982). The goal of this information search is to reduce risk and uncertainty to make better purchase decisions (Murray.1991).
A product review usually consists of a star rating and a textual description (Willemsen.2011). The review text is commonly used to describe the product quality and previous experiences with the product (Zimmermann.2018), where longer review texts are likely to contain more information (Mudambi.2010). Tversky.1974 find that decision-makers are more confident when there are more justifications in favor of a decision. Similarly, Schwenk.1986 shows that managers’ arguments are more persuasive if they provide more information in support of the advocated position. There are multiple factors contributing to this preference for diagnostic information. For example, a consumer may be inclined to purchase a product, but he/she has not yet made the cognitive effort to identify the pros and cons of a product (Mudambi.2010). In this scenario, a detailed review that provides a wide range of convincing arguments is likely to help the consumer make the purchase decision. Furthermore, the length of a review may reflect the reviewer’s expertise. The more effort the reviewer puts into writing the review, the more likely it is that he/she will provide high quality information that aids others in making their purchase decisions (Pan.2011). Longer and detailed reviews are also harder to fabricate as a reviewer must have a certain degree of knowledge and experience to accurately describe different aspects of a product (Jensen.2013). Hence, it is reasonable to assume that longer reviews contain more elaborate arguments presented by better-informed reviewers that are more helpful to other customers. A positive effect of review length on the helpfulness of a review has been demonstrated by a vast number of previous works. Our first hypothesis thus simply tests this link as discussed in the existing literature:
Hypothesis 1 (H1). Longer consumer reviews are perceived as more helpful.
A particularly relevant aspect of reviews is the extent to which they are written in favor of or against the product. Reviews can be one-sided, i. e., arguing strictly for or against a product, or two-sided, enumerating pros and cons of a product. Existing literature has found that two-sided reviews are perceived as more credible (Jensen.2013) and more helpful (e. g. Lutz.2018). Yet Crowley.1994 note that the persuasiveness of two-sided argumentation is likely to depend on the mixture of positive and negative information. In a similar vein, Jackson.1987 argue that a two-sided message can be structured in three ways: (i) by starting with supporting arguments followed by opposing arguments, (ii) by starting with opposing arguments and then providing supportive arguments, or (iii) by interweaving supportive and opposing arguments. Hence, we expect that a relevant feature of two-sided reviews is the rate of argumentation changes, i. e. how often the reviewer changes the line of argumentation from positive to negative and vice versa. Jackson.1987 find that a “support-then-refute order” is more persuasive than providing supporting and opposing arguments in an alternating manner. Providing arguments in an alternating manner also increases information entropy, i. e. messages are not sufficiently organized as to be easily recognized as significant (Hiltz.1985). Altogether, we expect a higher rate of argumentation changes to present a less organized structure, which may make the review less helpful.
Hypothesis 2 (H2). A higher rate of argumentation changes decreases perceived review helpfulness.
Following the above reasoning, an important question is whether review length and the rate of argumentation changes exhibit isolated effects on review helpfulness or rather depend on each other. Most consumer reviews are very one-sided in favor of or against a particular product (Jensen.2013). Strictly one-sided reviews do not change their line of argumentation from positive to negative or vice versa. Since a higher number of arguments in favor of a position makes a message more persuasive (e. g. OKeefe.1998), we expect longer reviews to be more helpful in situations in which the line of argumentation does not change between positive and negative arguments. In contrast, two-sided reviews enumerating pros and cons of a product change their argumentation at least once. We expect that processing a review with a high rate of argumentation changes requires greater cognitive effort than processing a review in which arguments are provided in clearly separated parts. A vast number of previous studies found that consumers’ cognitive capacities are limited (e. g. Bettman.1979). Information overload theory suggests that consumers can process a certain amount and complexity of information, and that information which exceeds these capacities leads to poorer purchase decisions (Jacoby.1977). Hence, we expect that frequent changes between positive and negative arguments in long reviews can make it more difficult for customers to comprehend the review, thus moderating the positive effect of review length on helpfulness.
Hypothesis 3 (H3). The (positive) effect of review length on perceived review helpfulness is moderated by the rate of argumentation changes in the review text.
Dataset and Methodology
This section presents our dataset. Subsequently, we make use of state-of-the-art methods from natural language processing for sentence-level polarity classification of texts. The order in which positive and negative sentences appear then allows us to determine argumentation changes in reviews.
To test our hypotheses, we use a large dataset of Amazon consumer reviews (He.2016). Compared to alternative review sources, this dataset exhibits several favorable characteristics. For example, the reviews are verified by Amazon and the reviewers must have actually purchased the product. Amazon also features a high number of retailer-hosted reviews per product due to a particularly active user base (Gu.2012). In addition, Amazon reviews are the prevalent choice in the related literature when studying review helpfulness (see e. g. Gu.2012, Mudambi.2010). Our dataset (see Footnote 1) contains product reviews, ratings, and reviewer meta-data for different product categories. In order to reduce our dataset to a reasonable size, we follow previous research (e. g. Mudambi.2010, Ghose.2011) by restricting our analysis to a subset of product categories. We include all reviews from low-involvement products listed in the categories Groceries, Music CDs, and Videos (Kannan.2001). These products feature a lower perceived risk of poor purchase decisions due to a lower price and lesser durability (Gu.2012). In addition, we include high-involvement product reviews listed in the categories Cell phones, Digital cameras, and Office electronics. These products feature a higher price and greater durability, and hence a higher perceived risk (Gu.2012).
Footnote 1. We use the Amazon 5-core dataset available from http://jmcauley.ucsd.edu/data/amazon/. To account for possible imbalances, and to mitigate the effects of spammers, we limit the data to at most five reviews per reviewer. Moreover, we restrict our analysis to reviews that were created after 2010 and for which the helpfulness has been assessed at least once by other customers.
Our complete dataset consists of 51,837 customer reviews for 4647 low-involvement products and 2335 high-involvement products with the following information: (i) the numerical star rating assigned to the product, (ii) the number of helpful and unhelpful votes for the review, (iii) the date on which the review was posted. Our reviews received between 0 and 4531 helpful votes, with a mean of 8.37. The mean star rating is 4.23. In addition, the corpus contains a textual description (the review text), which undergoes several preprocessing steps. First, we use the sentence-splitting tool from Stanford CoreNLP (Manning.2014) to split the review texts into sentences. The length varies between one and 384 sentences, with a mean of 10.9 sentences. Second, we use doc2vec (Le.2014) to create numerical representations of all sentences. This allows us to overcome the drawbacks of the bag-of-words approach (e. g. Prollochs.2016b, Prollochs.2019), such as missing context (Prollochs.2018). The doc2vec library uses a deep learning model to create numeric feature representations of text, which capture semantic information. We use the settings as recommended by Lau.2016 and initialize the word vectors of the doc2vec model with the pre-trained word vectors from the Google News dataset (Lutz.2019). The pretrained Google News dataset is a common choice when generating vector representations of Amazon reviews (e. g. Kim.2015) and has several advantages (Lau.2016, Kim.2015): (1) tuning vector representations to a given dataset requires a large amount of training data; (2) the results are particularly robust and more reproducible.
Sentence-Level Polarity Classification Using Multi-Instance Learning
We are facing a multi-instance learning problem (Dietterich.1997, Kotzias.2015), in which we have to predict the polarity label for all sentences in a set of reviews. Let $\mathcal{D}$ denote the set of reviews, $N$ the number of reviews, $M$ the number of sentences, and $\mathcal{X} = \{x_1, \ldots, x_M\}$ the set of all sentences. Each review $k$ is represented by a multiset of sentences $G_k \subseteq \mathcal{X}$ with label $l_k$, which equals 1 for positive reviews and 0 for negative reviews. The learning task is to train a classifier $y_\theta$ with parameters $\theta$ to predict the polarity labels of individual sentences given only the review labels.
The above problem can be solved by optimizing a loss function consisting of two components: first, a term punishing different labels for similar sentences; second, a term punishing misclassifications at the document (review) level. The loss function is then minimized with respect to the classifier parameters $\theta$,
$$L(\theta) = \frac{1}{M^2} \sum_{i=1}^{M} \sum_{j=1}^{M} W(x_i, x_j) \, \big(y_\theta(x_i) - y_\theta(x_j)\big)^2 + \frac{\lambda}{N} \sum_{k=1}^{N} \big(A(G_k, \theta) - l_k\big)^2, \qquad (1)$$
where $\lambda$ is a free parameter which scales the prediction error at review-level. In Equation 1, $W(x_i, x_j)$ denotes a similarity measure between two sentence representations $x_i$ and $x_j$, $(y_\theta(x_i) - y_\theta(x_j))^2$ denotes the squared error between the predictions for sentences $i$ and $j$, and $A(G_k, \theta)$ denotes the predicted label for review $k$. We adapt $L(\theta)$ to our problem of predicting sentence-level polarity labels by specifying the placeholders as follows: We use a radial basis function to measure the similarity between two sentence representations, i. e. $W(x_i, x_j) = \exp(-\lVert x_i - x_j \rVert_2^2)$. In addition, we use a logistic regression model to predict $y_\theta(x_i)$ due to its simplicity and interpretability. Finally, we define $A(G_k, \theta)$ as the average polarity label of the sentences in $G_k$, i. e. $A(G_k, \theta) = \frac{1}{|G_k|} \sum_{x_i \in G_k} y_\theta(x_i)$. This results in a specific loss function which is to be minimized by the parameter of the logistic regression $\theta$.
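The two-part loss described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the averaging over sentence pairs and over reviews, the unit bandwidth of the radial basis function, and the default weight `lam=1.0` are all hypothetical choices made for the example.

```python
import math


def rbf_similarity(a, b):
    """Radial basis function similarity between two sentence vectors.

    Assumption: unit bandwidth, i.e. exp(-squared Euclidean distance).
    """
    return math.exp(-sum((p - q) ** 2 for p, q in zip(a, b)))


def predict_sentence(theta, x):
    """Logistic regression score in (0, 1) for one sentence vector."""
    z = sum(t * v for t, v in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))


def mil_loss(theta, reviews, labels, lam=1.0):
    """Multi-instance loss: sentence-similarity term plus review-level term.

    reviews: list of reviews, each a list of sentence vectors.
    labels:  list of review labels (1 = positive, 0 = negative).
    lam:     hypothetical weight of the review-level error.
    """
    sentences = [x for review in reviews for x in review]
    preds = [predict_sentence(theta, x) for x in sentences]
    m = len(sentences)

    # Term 1: similar sentences should receive similar predictions.
    similarity_term = sum(
        rbf_similarity(sentences[i], sentences[j]) * (preds[i] - preds[j]) ** 2
        for i in range(m)
        for j in range(m)
    ) / (m * m)

    # Term 2: the average sentence prediction should match the review label.
    review_term = 0.0
    for review, label in zip(reviews, labels):
        average = sum(predict_sentence(theta, x) for x in review) / len(review)
        review_term += (average - label) ** 2
    review_term /= len(reviews)

    return similarity_term + lam * review_term


# Toy data: one positive and one negative review with 2-d sentence vectors.
toy_reviews = [[[1.0, 0.0], [2.0, 0.0]], [[-1.0, 0.0], [-2.0, 0.0]]]
toy_labels = [1, 0]
uninformed = mil_loss([0.0, 0.0], toy_reviews, toy_labels)
separating = mil_loss([5.0, 0.0], toy_reviews, toy_labels)
```

In this toy setting, a parameter vector that separates the two reviews yields a lower loss than the all-zero vector, which predicts 0.5 for every sentence. In practice the loss would be minimized with a gradient-based optimizer rather than evaluated by hand.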
Determining Argumentation Changes in Reviews
We use the aforementioned multi-instance learning approach to train a classifier for out-of-sample prediction of the polarity labels of sentences in reviews. For training the model, we use a disjoint training dataset consisting of 5,000 positive and 5,000 negative reviews. The resulting classifier then allows us to predict a polarity label for each sentence in the dataset that is used in our later empirical analysis. As previously mentioned, each sentence in the corpus is first transformed into its vector representation $x_i$. Subsequently, we use the logistic regression model to calculate the predicted polarity score $y_\theta(x_i)$. If $y_\theta(x_i)$ is greater than or equal to the decision threshold, sentence $i$ is assigned a positive label, and a negative label otherwise. Evaluated on a manually labeled, out-of-sample dataset of sentences, this approach reaches an accuracy that can be regarded as sufficient in the context of our study.
Based on the polarity labels for each sentence, we then determine the rate of argumentation changes $\mathit{RAC}_k$ for review $k$ as follows. If the review consists of only a single sentence, then $\mathit{RAC}_k$ is defined as 0. For reviews that consist of at least two sentences, $\mathit{RAC}_k$ is defined as the number of argumentation changes divided by the length of the review in sentences minus 1,
$$\mathit{RAC}_k = \frac{1}{n_k - 1} \sum_{i=2}^{n_k} \mathbb{1}\big(\hat{y}_{k,i} \neq \hat{y}_{k,i-1}\big), \qquad (2)$$
where $n_k$ denotes the number of sentences of review $k$, $\hat{y}_{k,i}$ denotes the polarity label of the $i$-th sentence of review $k$, and $\mathbb{1}(c)$ is an indicator function which equals 1 if $c$ is true and 0 otherwise. Hence, $\mathit{RAC}_k$ is zero for one-sided reviews, and one for reviews in which the line of argumentation changes between each sentence. For example, a review consisting of five positive sentences followed by two negative sentences is mapped to the value $1/6 \approx 0.17$.
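The rate of argumentation changes defined above is easy to compute once each sentence has a polarity label. The following sketch shows one way to do so; the function names are hypothetical, and the threshold of 0.5 used to turn classifier scores into labels is an assumption made for illustration.

```python
def to_labels(scores, threshold=0.5):
    """Map sentence-level scores to polarity labels (1 = positive, 0 = negative).

    Assumption: 0.5 is used as the decision threshold for illustration.
    """
    return [1 if score >= threshold else 0 for score in scores]


def argumentation_change_rate(labels):
    """Number of polarity changes between consecutive sentences,
    divided by the number of sentences minus one.

    Reviews with fewer than two sentences are mapped to 0.
    """
    n = len(labels)
    if n < 2:
        return 0.0
    changes = sum(1 for prev, cur in zip(labels, labels[1:]) if prev != cur)
    return changes / (n - 1)


# Five positive sentences followed by two negative ones: one change in six transitions.
support_then_refute = argumentation_change_rate([1, 1, 1, 1, 1, 0, 0])
# Strictly alternating review: the line of argumentation changes at every transition.
alternating = argumentation_change_rate([1, 0, 1, 0])
```

The first call reproduces the worked example from the text (one change over six sentence transitions), while the alternating review reaches the maximum value of one.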
The target variable of our analysis is $\mathit{Helpful}$. This variable denotes the number of users who voted Yes in response to the question “Was this review helpful to you?”. The total number of users who responded to this question is denoted by $\mathit{Votes}$. Following Pan.2011 and Yin.2016, we model review helpfulness as a binomial variable with $\mathit{Votes}$ trials.
Concordant with previous works (e. g. Mudambi.2010, Korfiatis.2012, Pan.2011, Yin.2016), we incorporate the following variables to explain review helpfulness. First, we include the star rating of the review between 1 and 5 stars ($\mathit{Rating}$) and the average rating of the product ($\mathit{AvgRating}$). Second, we control for the product type by adding a dummy that equals 1 for high-involvement products and 0 for low-involvement products ($\mathit{HighInv}$). Third, we control for multiple characteristics of the review text that may influence review helpfulness. Specifically, we calculate the fraction of cognitive and emotive words ($\mathit{Cog}$ and $\mathit{Emo}$) using LIWC 2015 and control for readability using the Gunning-Fog index (Gunning.1968) ($\mathit{Read}$). The key explanatory variables for our research hypotheses are review length ($\mathit{Length}$) and the rate of argumentation changes ($\mathit{RAC}$). To examine the interaction between review length and the rate of argumentation changes, we additionally incorporate an interaction term $\mathit{Length} \times \mathit{RAC}$ into our model. Altogether, we model the number of helpful votes, $\mathit{Helpful}$, as a binomial variable with probability parameter $\pi$ and $\mathit{Votes}$ trials,
$$\mathrm{logit}(\pi) = \beta_0 + \beta_1 \mathit{Rating} + \beta_2 \mathit{AvgRating} + \beta_3 \mathit{HighInv} + \beta_4 \mathit{Cog} + \beta_5 \mathit{Emo} + \beta_6 \mathit{Read} + \beta_7 \mathit{Length} + \beta_8 \mathit{RAC} + \beta_9 \, \mathit{Length} \times \mathit{RAC} + \alpha_{\mathit{product}} + \varepsilon, \qquad \mathit{Helpful} \sim \mathrm{Binomial}(\mathit{Votes}, \pi), \qquad (3)$$
with intercept $\beta_0$, a random intercept $\alpha_{\mathit{product}}$ for each product, and error term $\varepsilon$.
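To make the model structure concrete, the following sketch computes the probability of a helpful vote and the binomial log-likelihood of one review. It is an illustration under stated assumptions rather than the estimation code used in the study: the logit link, the dictionary keys, and the function names are hypothetical, and the product-level random intercept is passed in as a plain number instead of being estimated.

```python
import math


def helpful_probability(length, rac, controls, beta, product_effect=0.0):
    """Probability of a helpful vote under a logit link (an assumption).

    length, rac:     standardized review length and rate of argumentation changes.
    controls:        dict mapping control-variable names to their values.
    beta:            dict of coefficients with hypothetical keys 'intercept',
                     'length', 'rac', 'length_x_rac', plus one key per control.
    product_effect:  product-level random intercept, supplied as a fixed number.
    """
    eta = (
        beta["intercept"]
        + beta["length"] * length
        + beta["rac"] * rac
        + beta["length_x_rac"] * length * rac
        + product_effect
    )
    eta += sum(beta[name] * value for name, value in controls.items())
    return 1.0 / (1.0 + math.exp(-eta))


def binomial_log_likelihood(helpful, votes, p):
    """Log-likelihood of observing `helpful` Yes votes out of `votes` in total."""
    log_binomial_coefficient = (
        math.lgamma(votes + 1)
        - math.lgamma(helpful + 1)
        - math.lgamma(votes - helpful + 1)
    )
    return (
        log_binomial_coefficient
        + helpful * math.log(p)
        + (votes - helpful) * math.log(1.0 - p)
    )


# Placeholder coefficients (all zero) purely to exercise the functions.
zero_beta = {"intercept": 0.0, "length": 0.0, "rac": 0.0,
             "length_x_rac": 0.0, "rating": 0.0}
p_null = helpful_probability(1.0, 0.5, {"rating": 1.0}, zero_beta)
ll_null = binomial_log_likelihood(1, 2, p_null)
```

Maximum likelihood estimation would search for the coefficients that maximize the sum of these log-likelihood contributions over all reviews; a mixed-effects routine additionally integrates over the product-level intercepts.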
We estimate our model using mixed effects generalized linear models and maximum likelihood estimation (Wooldridge.2010). The regression results are reported in Table 1. To facilitate the interpretability of our findings, we z-standardize all variables, so that we can compare the effects of regression coefficients on the dependent variable measured in standard deviations. Column (a) of Table 1 presents a baseline model in which we only include the control variables from previous works. We find that more recent reviews, higher star ratings, and reviews with a higher readability index are perceived as more helpful. In contrast, higher average ratings and higher shares of cognitive and emotive words have a negative effect. In addition, we find that high-involvement products tend to receive more helpful reviews.
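Table 1. Regression Linking Review Length and Argumentation Changes to Helpfulness
Column groups: All Reviews (columns (a) to (d)); Review Subsets (columns (e) and (f)).
Notes: Stated are standardized coefficients with standard deviations in parentheses. Significance levels: * p < 0.05; ** p < 0.01; *** p < 0.001. Product-level effects are included.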
To test H1, we additionally include the review length ($\mathit{Length}$) in our model. Column (b) of Table 1 reports the results. The coefficient of $\mathit{Length}$ is positive and significant. This suggests that a one standard deviation increase in the length of the review text increases the probability of a helpful vote. All other model coefficients remain stable. Therefore, H1 is supported. We now test H2. For this purpose, we add the rate of argumentation changes ($\mathit{RAC}$) to our model. The results are shown in column (c) of Table 1. The coefficient of $\mathit{RAC}$ is not significant. Hence, H2 is rejected.
Next, we examine whether there is a significant interaction between review length and argumentation changes. For this purpose, we add the interaction $\mathit{Length} \times \mathit{RAC}$ to our model. Column (d) of Table 1 shows the results. The coefficient of the interaction term is negative and statistically significant, and the coefficient of $\mathit{RAC}$ became negative and significant. This suggests that the effects of review length and argumentation changes depend on each other. To shed light on the interaction, we plot the marginal effects of review length along with the confidence intervals. Figure 1 shows that (i) longer customer reviews are perceived as more helpful if the rate of argumentation changes is small, and (ii) longer reviews are even perceived as less helpful if the rate of argumentation changes is very high. We thus find support for H3, which states that the positive effect of review length is moderated by the rate of argumentation changes.
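The logic behind the marginal-effects plot can be illustrated with a short sketch. In a model with an interaction term, the effect of review length on the linear predictor equals the length coefficient plus the interaction coefficient times the rate of argumentation changes. The function names are hypothetical, and the coefficient values below are placeholders chosen for illustration only; they are not the estimates reported in Table 1.

```python
def marginal_effect_of_length(beta_length, beta_interaction, rac):
    """Change in the linear predictor per unit increase in review length,
    evaluated at a given (standardized) rate of argumentation changes."""
    return beta_length + beta_interaction * rac


def sign_flip_point(beta_length, beta_interaction):
    """Rate of argumentation changes at which the effect of length is zero.

    Returns None when there is no interaction and hence no flip point.
    """
    if beta_interaction == 0:
        return None
    return -beta_length / beta_interaction


# Placeholder coefficients (NOT the paper's estimates): a positive length
# effect combined with a negative interaction.
example_length, example_interaction = 0.5, -0.25
effect_at_mean = marginal_effect_of_length(example_length, example_interaction, 0.0)
effect_when_high = marginal_effect_of_length(example_length, example_interaction, 4.0)
flip = sign_flip_point(example_length, example_interaction)
```

With a positive length coefficient and a negative interaction coefficient, the effect of length is positive at low rates of argumentation changes, shrinks as the rate grows, and turns negative beyond the flip point, which matches the pattern described for Figure 1.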
Finally, we perform several checks and complementary analyses. First, we estimate two separate regressions for low- and high-involvement products. The results are shown in columns (e) and (f) of Table 1. Concordant with our previous findings, we find that the effect of review length is moderated by the rate of argumentation changes. Interestingly, we further observe that the coefficient of the rate of argumentation changes ($\mathit{RAC}$) is only significant for low-involvement products. A possible explanation is that customers prefer clear-cut opinions for low-involvement products as these products typically exhibit a relatively low amount of perceived risk. Second, we tested an alternative variant for measuring the rate of argumentation changes that additionally accounts for neutral sentences (Ghose.2011). This approach yields qualitatively identical results. Third, we repeat our analysis using a mixed-effects tobit model as suggested by Mudambi.2010. All regression estimates support our findings.
Discussion and Future Research
This work makes several contributions to research on electronic commerce and online word-of-mouth. Most importantly, we disprove the prevailing narrative in previous research (e. g. Mudambi.2010, Yin.2016) that longer reviews are uniformly perceived as more helpful. Instead, we propose that frequent changes between positive and negative arguments require greater cognitive effort which can lead to information overload. This can make it less likely for customers to perceive longer reviews as helpful. Our work thereby extends the experimental study from Park.2008 which indicates that information overload can occur at product level such that consumers’ involvement with a product is reduced if confronted with too many reviews. Our study provides evidence that information overload can also occur at the review level. Specifically, to the best of our knowledge, our paper is the first study demonstrating that, given increased complexity and consumers’ limited cognitive capacities, the (positive) effect of review length on perceived helpfulness is moderated by the frequency of argumentation changes in the review text.
In addition, our findings have important implications for practitioners in the field of electronic commerce. Retailers need to understand the determinants of review helpfulness in order to gain a better understanding of consumer information search behavior and purchase decision-making. Our findings and the proposed method for measuring the line of argumentation in reviews can help retailers to optimize their information systems towards a better shopping experience, e. g. by improving the ranking of the most helpful reviews. The order in which reviews appear plays a crucial role, since most online platforms prominently display the most helpful positive and negative reviews, before presenting other reviews (Yin.2016). Our findings are also relevant for reviewers on retailer platforms, who can use our conclusions to write more helpful product reviews. Specifically, our study suggests that reviewers should avoid excessive alternation between positive and negative arguments, as this may make it more difficult to comprehend the review.
Overall, this study allows for a better understanding of the roles of review length and argumentation changes in the reception of consumer reviews. In future work, we will expand this study in three directions. First, we plan to study the interplay between review length and argumentation changes in the context of refutational and non-refutational reviews. Second, we will conduct further analysis to better understand potential differences regarding the role of argumentation changes in the context of low-involvement and high-involvement products. Third, we intend to validate our findings with reviews from other recommendation platforms, such as hotel or restaurant reviews.