Crowdsourcing Argumentation Structures in Chinese Hotel Reviews
Argumentation mining aims to automatically extract premise-claim discourse structures from natural language texts. There is great demand for argumentation corpora for customer reviews. However, due to the controversial nature of the argumentation annotation task, very few large-scale argumentation corpora exist for customer reviews. In this work, we take the novel approach of using crowdsourcing to collect argumentation annotations in Chinese hotel reviews. As the first Chinese argumentation dataset, our corpus includes 4814 argument component annotations and 411 argument relation annotations, and its annotation quality is comparable to that of some widely used argumentation corpora in other languages.
1 Introduction
In customer reviews, users usually not only give their opinions on the products/services, but also provide reasons supporting their opinions. For example, consider the following review excerpt posted on Tripadvisor.com:
Example 1: ① 房间的电器设施让人很失望。 ② 有一台很老很小的黑白电视。 ③ 空调也是坏的。 ① Appalling in-room electrical facilities. ② There was an old, small, black-and-white TV. ③ The air conditioner did not work.
Clause ① gives the customer’s opinion (or claim) on the appliances in the room, and clauses ② and ③ are reasons or evidence (or premises) supporting the claim. Such discourse structures are known as arguments, and the techniques for automatically extracting arguments and their relations (e.g. support/attack) from natural language texts are known as argumentation mining. Performing argumentation mining on customer reviews can reveal the reasons behind users’ opinions; it can thus help product producers and service providers identify their weaknesses, and hence has considerable commercial potential.
There exists a great demand for reliably annotated corpora of customer reviews, since they are required for training supervised argumentation mining techniques. Existing argumentation corpora are mostly constructed from highly professional genres, e.g. legal documents, persuasive essays, newspapers and court cases. Compared to these genres, customer reviews are written by ordinary people in casual scenarios; their linguistic complexity is therefore usually lower, and they do not contain much domain knowledge. As a result, we believe that even novices are able to identify the argumentation structures in customer reviews. Crowdsourcing has been widely recognised as a reliable and economical method for some annotation tasks. In this work, we investigate the applicability of crowdsourcing to argumentation annotation in Chinese hotel reviews. Specifically, the contributions of this work are threefold:
We propose a novel argumentation model for hotel reviews (an argumentation model gives the definition of arguments, e.g. what components an argument consists of, and what kinds of relations are allowed between different arguments and argument components). The model extends the classic “premise-claim” model, and can potentially be used for defining argumentation structures in other types of customer reviews and in other languages.
We employ crowdsourcing to annotate argumentation structures (i.e. argument components and their relations) in Chinese hotel reviews, design mechanisms that help the workers avoid mistakes, and use a clustering algorithm to aggregate the collected annotations.
The aggregated annotations are published as a publicly available corpus, whose annotation quality is comparable to that of state-of-the-art English argumentation corpora. Furthermore, because of the controversial nature of the argumentation annotation task, we attach a confidence score to each label, so as to help users understand how controversial each annotation is. To the best of our knowledge, this is the first Chinese argumentation corpus, and the first use of confidence scores in argumentation corpora.
2 Related Work
We first review existing argumentation corpora for customer reviews; in particular, we highlight the argumentation models they used to define arguments. After that, we review works on crowdsourcing for argumentation annotation and some related tasks, e.g. annotating discourse structures.
2.1 Argumentation Corpora
A comprehensive review on argumentation corpora is beyond the scope of this paper; good overviews can be found in e.g. [10, 5, 16]. Here we only review argumentation corpora constructed from customer reviews.
Wachsmuth et al. [19] built the ArguAna corpus, consisting of 2.1k hotel reviews posted on Tripadvisor.com. Instead of directly labelling arguments, they annotate statements and sentiment polarities. A statement is “at least a clause and at most a sentence that is meaningful on its own”. They designed a rule-based tool to segment statements, and employed crowdsourcing to annotate the sentiment and features (e.g. location, services, facilities) in each statement. Results suggested that crowdsourcing workers can reliably identify the sentiments of statements (approval rate 72.8%) but disagree considerably when identifying features (rejection rate 43.3%). We view ArguAna as an intermediate resource for building argumentation corpora, because each argument usually contains several statements (e.g. one positive/negative statement serving as the claim, and several neutral statements serving as premises). Thus, ArguAna cannot be directly used for training argumentation mining techniques.
Garcia Villalba and Saint-Dizier [18] investigated suitable argumentation models for customer reviews. They viewed many different types of expressions (e.g. illustrations, elaborations and reformulations) as argument components, and built a corpus consisting of 50 customer reviews in French and English in the domains of hotels and restaurants, hifi products, and the French political campaign. Wyner et al. [20] built a corpus consisting of 84 reviews (posted on Amazon.com) for one specific camera model. They considered one specific argumentation model: an argument consists of two premises (premise 1 gives “camera X has property P” and premise 2 gives “property P promotes value V”) and one claim (the customer should perform action ACT; possible ACT include “buy the camera”, “avoid using the flashlight”, etc.). However, for both of these corpora, the inter-rater agreement (IRA) was not reported, and the corpora were not made publicly available. (IRA is a widely used metric for evaluating annotation quality. There exist multiple methods for computing the IRA score; in this paper, for each IRA score, we will point out the computation method it uses. Larger IRA values suggest higher agreement between the annotators, and thus higher reliability of the obtained annotations.)
2.2 Crowdsourcing for Argumentation Annotation and Related Tasks
Ghosh et al. [3] proposed an annotation mechanism to annotate arguments and their relations in blog comments: they hired crowdsourcing workers to label claim-premise relations. Note that the argument segments were provided a priori (annotated by domain experts), and the crowdsourcing workers were asked only to label the argument component types and relations. They reported that crowdsourcing workers achieved an IRA score of 0.45-0.55 (in terms of a multi-rater agreement coefficient), suggesting moderate agreement; they also reported that the agreement scores of the crowdsourcing workers are highly correlated with those of the expert annotators.
Crowdsourcing has also been used to annotate discourse structures. Kawahara et al. [7] designed a two-stage crowdsourcing mechanism to annotate two levels of discourse relations in Japanese texts crawled from multiple online genres. In their work, discourse relations include contrast, concession, cause-effect, etc., and these relations closely resemble some relations in argumentation structures (e.g. the attack relation between arguments can be viewed as contrast, and the premise-claim relation is closely related to the cause-effect relation). They did not report the IRA of the crowdsourcing workers; instead, they validated the quality of their discourse corpus by training a discourse parser on it. Results suggested that the quality of their corpus is comparable to that of the state-of-the-art English discourse corpus, indicating that crowdsourcing can be reliably used for annotating relations between clauses.
3 Annotation Guideline
In this section, we describe how we design our annotation guideline. In particular, we perform some preliminary annotation experiments to decide i) how to segment clauses (e.g. by rule-based automatic methods or by crowdsourcing workers), and ii) which argument model to use, i.e. which argument components constitute an argument, and what relations are permitted between argument components. Since most existing annotation guidelines are for English texts, we annotate twenty English and ten Chinese hotel reviews (without titles) from Tripadvisor.com to draft our guideline. Five annotators participate in the experiments; they are all native Chinese speakers fluent in English. The annotation is performed on the brat [17] open-source annotation platform.
As for clause segmenting, we test two approaches: sub-sentence-based segmenting (i.e. viewing each sub-sentence as a clause), and free segmenting (i.e. any span of text can be viewed as a clause). We find that the free segmenting strategy is more suitable for Chinese hotel reviews, because punctuation marks are often missing or misused. In addition, the free segmenting strategy enables the annotators to label the exact boundary of each argument component, and so to avoid including connecting words (e.g. “而且” (in addition)) in argument components.
As for the argumentation model, we consider three candidate models: the Premises-Claim-MajorClaim (PCM) model for persuasive essays, the extended Claim-Premises (ECP) model for long Web documents and the extended Toulmin’s (ETM) model for short Web documents. Our experiments suggest that:
In both Chinese and English hotel reviews, users often give their overall impression of hotels (e.g. 强烈推荐 (strongly recommend), 我觉得是很不错的酒店 (I think the hotel is quite good), 很美好的回忆 (staying here left me with beautiful memories)). We believe the MajorClaim component in the PCM model is the most suitable argument component type for labelling these clauses.
In hotel reviews, users often give evidence supporting some implicit claims. Consider the clause “离市中心走路不到五分” (less than 5 minutes’ walk to the city center): we believe this clause not only states the location of the hotel, but also supports some claims (e.g. the location of the hotel is good), although these claims are not explicitly presented in the hotel review. We thus believe that the argument component premise supporting implicit claim (PSIC) should be used in our argumentation model.
Distinguishing different kinds of premises (e.g. grounds, warrants, refutation) in hotel reviews is error-prone even for expert annotators; thus we decide not to use the ETM model, although ETM was proposed for short user-generated documents.
Our argumentation model is an extension and integration of the PCM and ECP models; it includes four argument components: MajorClaim, Claim, Premise and PSIC; texts not labelled as any argument component are non-argumentative (NA for short). Premises are allowed to support/attack claims, but claims are not allowed to support other claims, because this would lead to cascading support, which overcomplicates the annotation process. In addition, annotators are asked to annotate the sentiment polarity (positive, negative or neutral) of MajorClaims and Claims, because these two argument components are subjective. Fig. 1 illustrates our argumentation model, and some examples are given in Table I. Note that each premise must support/attack some claim, but a claim may have no premises supporting/attacking it.
The final annotation guideline is in Chinese, and includes detailed explanations of the argument model, segmenting rules and numerous illustrative examples. To test the readability and applicability of our annotation guideline, we ask two additional students (native Chinese speakers) to independently annotate five Chinese hotel reviews using the guideline. The average IRA in terms of Krippendorff’s α for their annotations is 0.715, suggesting that the agreement is substantial.
| Type | Example |
|---|---|
| Major Claim | “总之，[这个酒店很好](Major Claim1)，[我很满意](Major Claim2)。” “In short, [this hotel is very good] (Major Claim1), [I am very satisfied with this hotel] (Major Claim2).” |
| Claim | “[舒适的环境](Claim1)和[周到的服务](Claim2)，[房间设施也很齐全](Claim3)。” “[Comfortable environment] (Claim1) and [thoughtful service] (Claim2), and [the room has all necessary appliances and equipment] (Claim3).” |
| Premise | ① “[服务员很好](Claim)，[会主动帮我们拿行李](Premise)。” “[The staff are very nice] (Claim), [they came out to help us take our luggage] (Premise).” |
| | ② “[酒店餐厅的美食还挺不错](Claim)，就是[食物的价格都比较贵](Premise)。” “[The food in the dining hall of this hotel is pretty good] (Claim), despite [the steep price] (Premise).” |
| PSIC | “[房间的冰箱里面就有水，还有一些其它的吃的东西，都是可以先享受再付费这样的](PSIC)。” “[There is water and some other food in the room’s refrigerator, and you can pay after you use them] (PSIC).” |
| Support | In example ①, the Premise supports the Claim |
| Attack | In example ②, the Premise attacks the Claim |
4 Crowdsourcing Experiments
Our crowdsourcing experiment is performed as an optional assignment in the social media mining course at the University of Chinese Academy of Sciences in 2017. Over four hundred MSc students are registered for this course, and over 90% of them are native Chinese speakers. We call for volunteers to participate in our experiment, and inform them that participating students can obtain the corpus and its statistics in return, which they can use in their final project to train machine learning algorithms. In the end, 388 students participate, and we give a one-hour tutorial to walk them through the guideline and to illustrate some examples.
The crowdsourcing is performed on the brat [17] platform. Each student can only view the hotel reviews assigned to him/herself. To help the annotators reduce mistakes, we customise and extend the original brat system so that i) only legal relations (see Fig. 1) can be annotated, ii) the annotator is reminded if the sentiment of some MajorClaim or Claim has not been labelled, and iii) the annotator is reminded if there exist premises that do not support/attack any claim (this is not allowed; see Sect. 3).
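The three checks described above can be illustrated with a minimal Python sketch. It is not the actual brat customisation: the class and function names are hypothetical, and the sketch assumes that the only legal relation is a Premise supporting or attacking a Claim, which is our reading of the model description rather than a reproduction of Fig. 1.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumption: the only legal relation is Premise -> Claim (support or attack).
LEGAL_RELATIONS = {("Premise", "Claim")}
RELATION_LABELS = {"support", "attack"}
SENTIMENT_REQUIRED = {"MajorClaim", "Claim"}


@dataclass
class Component:
    cid: str                         # hypothetical component identifier
    ctype: str                       # MajorClaim, Claim, Premise or PSIC
    sentiment: Optional[str] = None  # positive, negative or neutral


@dataclass
class Relation:
    source: str  # cid of the source component
    target: str  # cid of the target component
    label: str   # support or attack


def check_annotation(components: List[Component],
                     relations: List[Relation]) -> List[str]:
    """Return human-readable warnings for one annotated review."""
    by_id = {c.cid: c for c in components}
    problems = []
    linked_premises = set()
    for r in relations:
        pair = (by_id[r.source].ctype, by_id[r.target].ctype)
        if pair not in LEGAL_RELATIONS or r.label not in RELATION_LABELS:
            problems.append(f"illegal relation {r.source}->{r.target}")
        else:
            linked_premises.add(r.source)
    for c in components:
        if c.ctype in SENTIMENT_REQUIRED and c.sentiment is None:
            problems.append(f"missing sentiment on {c.cid}")
        if c.ctype == "Premise" and c.cid not in linked_premises:
            problems.append(f"premise {c.cid} has no claim")
    return problems
```

In this sketch an empty list means that the annotation passes all three checks; an annotation tool could block submission or merely display the warnings, depending on how strict the reminder should be.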
Each student receives twenty-five hotel reviews to annotate; one of these twenty-five reviews is a gold standard review, which has been annotated by our expert annotators in our pre-study and is used to evaluate the devotedness of the annotator. Each non-gold-standard hotel review is allocated to 4 students, who annotate it independently. All hotel reviews are crawled from Tripadvisor.com.cn, and their lengths are all between 100 and 150 words. Students are asked to finish all labelling in one week. In the end, we collect annotations for 2332 hotel reviews. Because around 10% of the students (38 students) fail to annotate all documents assigned to them, some documents receive only one student’s annotation, and we remove these documents. In addition, we remove the annotations that violate our annotation guideline. In total, 7 hotel reviews and their annotations are removed.
We compute an agreement score for each hotel review so as to evaluate the quality of the annotations. (At the moment, the score for a hotel review considers only the annotations for component types, and ignores the agreement on sentiment and relation annotations, because we observe that the agreement scores for sentiment and relations are highly correlated with the score for argument components. Detailed agreement for sentiment and relations will be presented in Sect. 5.2.) The distribution of the scores is presented in Fig. 2. We find that only 21% of the documents receive a sufficiently high agreement score, suggesting that the annotations for most documents diverge considerably and cannot be directly used to build the corpus. We believe that two factors may have resulted in the low annotation agreement: i) the low devotedness of some annotators, and ii) the controversial nature of argumentation annotation in some hotel reviews. In the next section, we analyse these two factors and attempt to improve the annotation quality along these two lines, i.e. by removing less-devoted students’ annotations and by identifying controversial sentences, so as to produce a high-quality Chinese hotel review corpus.
5 Post-Processing and Corpora Generation
We identify the less-devoted students and remove their annotations in Sect. 5.1, compute the confidence score for annotations and generate the final corpora in Sect. 5.2, and perform the error analysis in Sect. 5.3.
5.1 Remove Less-Devoted Students’ Annotations
We use the gold standard annotations to evaluate each student’s devotedness. The gold standard texts are similar to the examples in the guideline; we therefore believe that students whose annotations diverge widely from the gold standard annotations are less devoted. For each sentence in a gold standard text, we compute every student’s agreement score against the gold standard annotation on this sentence, and rank these agreement scores. If a student’s agreement score on this sentence falls into the bottom 10%, we increase this student’s less-devotedness degree by 1. Students whose less-devotedness degrees are equal to or larger than 2 are labelled as less-devoted, and all of their annotations are removed. In total, we find 38 less-devoted students, and we additionally remove 39 hotel reviews because they retain fewer than two students’ annotations after the less-devoted annotations are deleted.
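The removal rule can be made concrete with a short sketch. It assumes that the per-sentence agreement scores against the gold standard have already been computed; the function name, the input layout and the way ties and fractional cut-offs are handled (the bottom 10% is rounded down) are our own assumptions, not details given in the text.

```python
from collections import Counter
from typing import Dict, List, Set


def find_less_devoted(scores_by_sentence: List[Dict[str, float]],
                      bottom_fraction: float = 0.10,
                      min_degree: int = 2) -> Set[str]:
    """scores_by_sentence[i] maps student id -> agreement with the gold
    annotation on gold sentence i (higher means closer to gold)."""
    degree = Counter()  # less-devotedness degree per student
    for scores in scores_by_sentence:
        ranked = sorted(scores, key=scores.get)  # lowest agreement first
        n_bottom = int(len(ranked) * bottom_fraction)  # assumption: round down
        for student in ranked[:n_bottom]:
            degree[student] += 1
    return {s for s, d in degree.items() if d >= min_degree}
```

With ten students per sentence, for instance, exactly one student per sentence falls into the bottom 10%, and a student is flagged only after landing there on at least two gold sentences.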
We find that removing the less-devoted students’ annotations does improve the annotation quality as a whole. For example, after removal, the percentage of reviews with a sufficiently high agreement score increases from 21% to 24%. In addition, the average agreement score for each gold standard text has also increased thanks to the removal (see Table II). However, even after the removal, a considerable number of controversial annotations remain, due to the controversial nature of our guideline and of the argument annotation task itself. Next, we identify the controversial texts and remove their annotations, so as to obtain high-quality corpora.
[Table II: average agreement score for each gold standard text (gold 1 to gold 5).]
5.2 Dealing With Controversial Annotations and Obtaining the Final Corpora
By manually reading and analysing the annotations, we find that the hotel reviews generally fall into two categories: i) easy reviews, in which the argument component type of each sentence is quite clear and their relations are easy to identify; and ii) controversial reviews, in which a high percentage (over 30%) of sentences meet multiple argument component types’ definitions, and their labels are heavily dependent on their contexts. For the easy reviews, we can obtain the annotations for the argument component, sentiments and relations; for the controversial reviews, although many sentences have controversial annotations, we may still be able to find some less-controversial sentences and obtain their annotations. Thus, we build two corpora based on the annotations we have collected: one consists of easy reviews, and the other consists of the relatively less-controversial sentences in controversial reviews, so as to extract as much useful information from the annotations as possible.
Reviews whose agreement scores are equal to or larger than 0.6 are marked as easy reviews; we find that the agreement for relation and sentiment annotations is also high among these reviews (details are given in Sect. 5.2). In total, 316 hotel reviews are marked as easy reviews. The remaining 1911 hotel reviews are marked as controversial, and we segment them into sentences, so as to find the less-controversial sentences therein. In total, the controversial hotel reviews include 5212 sentences; among them, sentences whose agreement score is equal to or larger than 0.7 are marked as less-controversial sentences, accounting for around one-fourth of the sentences (1452/5212).
5.2.1 Easy Reviews Corpus
We first need to aggregate the annotations for argument components, which lays the foundation for aggregating sentiments and relations. As a concrete example, consider a sentence and its annotations presented in Fig. 3. We can see that the three annotations differ in their boundaries as well as in their argument component types; to obtain a converged annotation from these diverging annotations, two sub-tasks are involved: i) resolving the conflicts between argument component boundaries, so as to obtain clause texts, and ii) deciding the argument component type for the obtained clause.
We employ the K-means clustering technique [6] to perform the argument component aggregation, which handles the above two sub-tasks jointly. Specifically, we vectorise each student’s annotation in a one-hot manner: each character is represented by 5 digits, which encode the student’s annotation for this character. We perform 1-cluster clustering on the annotation vectors, and the centroid of the cluster gives the boundary as well as the argument component type. Consider again the example in Fig. 3: the first 5 digits of the centroid are (0.33,0,0,0,0.67), where the first digit corresponds to MajorClaim and the last digit corresponds to Non-Argumentative; we thus label the character as Non-Argumentative, and the confidence of this label is 0.67. By computing the centroid, we not only aggregate the component annotations, but also obtain a confidence score for each label. Some statistics of the argument component annotations in easy reviews are presented in Table III; each review is annotated by three students on average. As for IRA metrics, besides the agreement score used above, we also report percentage agreement, a multi-rater agreement coefficient and Krippendorff’s α.
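Since the centroid of a single K-means cluster is simply the mean of the points assigned to it, the aggregation step can be illustrated without a clustering library. The sketch below is a minimal illustration under stated assumptions: the function names are hypothetical, and only the first (MajorClaim) and last (Non-Argumentative) positions of the label order come from the text; the order of the three middle labels is assumed.

```python
from typing import List, Tuple

# Assumption: only the first and last positions are given in the text.
LABELS = ["MajorClaim", "Claim", "Premise", "PSIC", "NA"]


def one_hot(label: str) -> List[float]:
    """Encode one character-level label as 5 digits."""
    return [1.0 if label == name else 0.0 for name in LABELS]


def aggregate(annotations: List[List[str]]) -> List[Tuple[str, float, List[float]]]:
    """annotations[k][i] is annotator k's label for character i.
    Returns (label, confidence, centroid digits) for every character."""
    n_annotators = len(annotations)
    n_chars = len(annotations[0])
    result = []
    for pos in range(n_chars):
        vectors = [one_hot(a[pos]) for a in annotations]
        # Centroid of a single cluster = component-wise mean.
        centroid = [sum(col) / n_annotators for col in zip(*vectors)]
        confidence = max(centroid)
        label = LABELS[centroid.index(confidence)]
        result.append((label, confidence, centroid))
    return result
```

For a character labelled MajorClaim by one annotator and NA by two others, the centroid is (0.33, 0, 0, 0, 0.67) after rounding, so the aggregated label is NA with confidence 0.67, as in the example above. Consecutive characters that share the same aggregated label then form one clause, which is how the boundary is recovered.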
Based on the converged annotations for components, we first evaluate the quality of the annotations for sentiments and relations in the easy hotel reviews. Table IV presents the agreement scores for sentiments and relations. We can see that almost all scores are over 0.5, suggesting that the agreement is substantial. Thus, we directly employ majority voting to aggregate the annotations for relations and sentiments, and obtain the easy reviews corpus.
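Majority voting itself is straightforward; a minimal sketch is given below. The function name is hypothetical, and returning no label when the top two labels tie is our own assumption, since the text does not say how ties are resolved.

```python
from collections import Counter
from typing import List, Optional


def majority_vote(labels: List[str]) -> Optional[str]:
    """Return the most frequent label in a non-empty list of annotator
    labels, or None when the two most frequent labels are tied
    (assumption: ties are left unresolved)."""
    ranked = Counter(labels).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]
```

For example, sentiment labels positive, positive and neutral aggregate to positive, whereas one support label against one attack label yields no aggregated relation label.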
5.2.2 Less-Controversial Sentences Corpus
Again, we use K-means to aggregate the annotations for argument components in the less-controversial sentences. Some statistics of the argument component annotations are presented in Table V; each sentence is annotated by three students on average. We can see that the less-controversial sentences corpus is even larger than the easy reviews corpus, indicating that much useful information can be extracted even from highly diverging annotations. As for the relation annotations in controversial hotel reviews, among all pairs of less-controversial sentences, only 5% have been annotated (by at least one annotator) as having a relation; thus, we do not aggregate the relation annotations for the less-controversial sentences. The sentiment scores for these sentences are presented in Table VI.
5.3 Error Analysis
In order to study the disagreements in the annotations, we create confusion probability matrices (CPM) [1] for the argument component annotations. A CPM contains the conditional probabilities that an annotator gives a certain label (column) given that another annotator has chosen the label in the row for a specific item. The CPMs for the easy reviews corpus and the less-controversial sentences corpus are presented in Table VII. For example, the upper-left cell in Table VII means that, when some other annotator has labelled a clause as MajorClaim, an annotator has a probability of 0.481 of also labelling this clause as MajorClaim.
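One straightforward way to estimate such a matrix is to count, for every item, all ordered pairs of labels given by two different annotators, and then to normalise each row. The sketch below follows this reading of the definition; the function name and input layout are hypothetical, and the estimation procedure of the cited work may differ in its details.

```python
from itertools import permutations
from typing import Dict, List


def confusion_probability_matrix(items: List[List[str]],
                                 labels: List[str]) -> Dict[str, Dict[str, float]]:
    """items[i] holds the labels that different annotators gave to item i.
    cpm[row][col] estimates P(an annotator chooses col | another chose row)."""
    counts = {r: {c: 0 for c in labels} for r in labels}
    for item_labels in items:
        # Ordered pairs of labels coming from two distinct annotators.
        for row, col in permutations(item_labels, 2):
            counts[row][col] += 1
    cpm = {}
    for r in labels:
        total = sum(counts[r].values())
        cpm[r] = {c: (counts[r][c] / total if total else 0.0) for c in labels}
    return cpm
```

Each row of the resulting matrix sums to 1 whenever its label was used at least once on an item with two or more annotators, so a large off-diagonal entry directly signals a pair of labels that annotators tend to confuse.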
From Table VII we can see that there exists no significant confusion between annotations in the less-controversial sentences corpus, and we believe the reason is the strict criterion for selecting less-controversial sentences (an agreement score of at least 0.7; see Sect. 5.2). However, in the easy reviews corpus (see Table VII), we find that the confusion between NA and PSIC is significant. As a concrete example, consider the following sentence from the easy reviews corpus: “酒店旁边有个很有特色的酒吧，非常喜欢那里。” (“There is a very special bar next to the hotel; I like the bar very much.”). Some annotators label this sentence as a PSIC, as they believe that it supports the implicit claim “the hotel is located in a convenient place”; other annotators label it as NA, because the sentence says nothing about the hotel itself. As we do not provide a candidate list of implicit claims, different annotators naturally have different understandings of PSIC, resulting in the high confusion between PSIC and NA. A possible solution is to provide a candidate list of implicit claims; we leave this as future work.
6 Conclusion
In this work, we present the first Chinese argumentation corpus, together with the crowdsourcing technique we used to build it. The argumentation model used in the corpus extends some classic models, and we believe it is suitable for product reviews in general. In particular, we use a clustering technique to aggregate annotations, which not only resolves annotation conflicts but also provides a confidence score at the same time. The annotation quality of our corpus is comparable to that of some widely used argumentation corpora in other languages. To stimulate further research, we make the corpus publicly available (download website: 184.108.40.206).
References
[1] Silvie Cinková, Martin Holub, and Vincent Kríž. Managing uncertainty in semantic tagging. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 840–850. Association for Computational Linguistics, 2012.
[2] J. L. Fleiss. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378, 1971.
[3] D. Ghosh, S. Muresan, N. Wacholder, M. Aakhus, and M. Mitsui. Analyzing argumentative discourse units in online interactions. In Proc. of Workshop on Argumentation Mining, pages 39–48, 2014.
[4] I. Habernal, J. Eckle-Kohler, and I. Gurevych. Argumentation mining on the web from information seeking perspective. In Proc. of the Workshop on Frontiers and Connections between Argumentation Theory and Natural Language Processing, 2014.
[5] I. Habernal and I. Gurevych. Argumentation mining in user-generated web discourse. Computational Linguistics, 2016.
[6] John A. Hartigan and Manchek A. Wong. Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1):100–108, 1979.
[7] D. Kawahara, Y. Machida, T. Shibata, and S. Kurohashi. Rapid development of a corpus with discourse annotations using two-stage crowdsourcing. In Proc. of COLING, pages 269–278, 2014.
[8] Klaus Krippendorff. Content analysis: An introduction to its methodology, 1980.
[9] Klaus Krippendorff. Measuring the reliability of qualitative text analysis data. Quality & Quantity, 38(6):787–800, 2004.
[10] M. Lippi and P. Torroni. Argumentation mining: State of the art and emerging trends. ACM Transactions on Internet Technology, 2015.
[11] Marie-Francine Moens. Argumentation mining: Where are we now, where do we want to be and how do we get there? In Proc. of Forum on Information Retrieval Evaluation, 2013.
[12] R. M. Palau and M.-F. Moens. Argumentation mining: the detection, classification and structure of arguments in text. In Proc. of ICAIL, 2009.
[13] C. Reed, R. Mochales-Palau, G. Rowe, and M.-F. Moens. Language resources for studying argument. In Proc. of LREC, 2008.
[14] R. Snow, B. O’Connor, D. Jurafsky, and A. Y. Ng. Cheap and fast—but is it good?: evaluating non-expert annotations for natural language tasks. In Proc. of EMNLP, pages 254–263, 2008.
[15] C. Stab and I. Gurevych. Annotating argument components and relations in persuasive essays. In Proc. of COLING, 2014.
[16] C. Stab and I. Gurevych. Parsing argumentation structures in persuasive essays. arXiv preprint, arXiv:1604.07370, 2016.
[17] P. Stenetorp, S. Pyysalo, G. Topic, T. Ohta, S. Ananiadou, and J. Tsujii. brat: a web-based tool for NLP-assisted text annotation. In Proc. of EACL, pages 102–107, 2012.
[18] M. P. Garcia Villalba and P. Saint-Dizier. Some facets of argument mining for opinion analysis. In Proc. of COMMA, 2012.
[19] H. Wachsmuth, M. Trenkmann, B. Stein, G. Engels, and T. Palakarska. A review corpus for argumentation analysis. In Computational Linguistics and Intelligent Text Processing, pages 115–127. 2014.
[20] A. Wyner, J. Schneider, K. Atkinson, and T. J. M. Bench-Capon. Semi-automated argumentative analysis of online product reviews. In Proc. of COMMA, 2012.