An Annotated Corpus of Relational Strategies in Customer Service

Under review in Language Resources and Evaluation

Ian Beaver, Cynthia Freeman, Abdullah Mueen

Ian Beaver
Next IT Corporation
Spokane Valley, WA USA
(509) 242-0767
email: ibeaver@nextit.com

Cynthia Freeman
Next IT Corporation
Spokane Valley, WA USA
(509) 242-0767
email: cfreeman@nextit.com

Abdullah Mueen
Department of Computer Science
University of New Mexico, USA
(505) 277-1914

Received: 16 August 2017 / Accepted: date

We create and release the first publicly available commercial customer service corpus with annotated relational segments. Human-computer data from three live customer service Intelligent Virtual Agents (IVAs) in the domains of travel and telecommunications were collected, and reviewers marked all text that was deemed unnecessary to the determination of user intention. After merging the selections of multiple reviewers to create highlighted texts, a second round of annotation was done to determine the classes of language present in the highlighted sections such as the presence of Greetings, Backstory, Justification, Gratitude, Rants, or Emotions. This resulting corpus is a valuable resource for improving the quality and relational abilities of IVAs. As well as discussing the corpus itself, we compare the usage of such language in human-human interactions on TripAdvisor forums. We show that removal of this language from task-based inputs has a positive effect on IVA understanding by both an increase in confidence and improvement in responses, demonstrating the need for automated methods of its discovery.

Keywords: Relational Strategies, Intelligent Virtual Agents, Data Collection, Annotation, Natural Language Understanding, Multi-Intent Detection
journal: Language Resources and Evaluation

1 Introduction

Intelligent personal assistants such as Apple’s Siri, Microsoft’s Cortana, or Google Now are commonly used for answering questions and task optimization. Many companies are deploying specialized automated assistants, known as Intelligent Virtual Agents (IVAs), for efficient problem resolution, cutting costs in call centers, and also as the first layer of technical and product support (Marois, 2013). In these business domains, IVA accuracy and efficiency directly impact customer satisfaction and company support costs. In one case study (Caramenico, 2013), a Fortune 50 insurance company experienced a reduction in contact center volume within five months of deploying an IVA on their website. Domino’s Pizza reported that product order time was reduced through their IVA (Frearson, 2015).

To better assist humans, IVA designers strive to support human-like interactions. Take, for example, Amazon’s Alexa Prize competition, where student developers attempt to build IVAs that can carry on meaningful, coherent, and engaging conversations for 20 minutes (Levy, 2016). As IVAs become more human-like, we theorize that users will increasingly use relational strategies (e.g. self-exposure and justification) with IVAs, much as they do when conversing with humans. There is a large body of work on the development of trust between humans engaged in virtual dialog (Gibson & Cohen, 2003; Ballantyne, 2004; Holton, 2001; Coppola et al., 2004). The focus of these works is on how relational strategies contribute to trust between human speakers. From this literature, we predict the types of strategies humans may employ with IVAs as they relate to them in an increasingly human manner.

In customer service and personal assistant domains, trust is necessary between the human agent and customer. The customer’s issues must be viewed by the agent as legitimate for proper attention to be given. Likewise, customers must trust that the agent is capable of assisting them and will not mistreat their information. Current research shows that human-like virtual agents are associated with not only greater user trust but also trust resilience when the agent makes mistakes (de Visser et al., 2016). To build trust with the agent, customers may establish credibility through small talk, self-exposure, and by providing justification of their requests (Bickmore & Cassell, 2001).

In interactive question answering, such as dialogs with an IVA, understanding user intent is essential for the success of the IVA (Chai et al., 2006). The intent can be defined as the interpretation of a user input that allows an agent to formulate the best response. However, when relational strategies are applied to IVAs, the additional language introduced is often unnecessary and can even obfuscate user intent. Such language can lead to confusion in the IVA and a degradation of user experience in the form of clarification questions and wrong information.

Example 1
I need a ticket to Boston this Saturday, my son is graduating!

In Example 1, the fact that the customer’s son is graduating is unnecessary for determining the user’s intent to purchase a ticket. By including unnecessary background information, the IVA may incorrectly deduce that the customer is booking a ticket for his or her son instead. Thus, the identification of relational segments is a useful feature for an IVA; unfortunately, no corpus of annotated relational segments exists to develop identification techniques (Serban et al., 2015).

This lack inspired us to create such a corpus. Within this corpus, we needed to not only identify the location of relational language but also label its type (Gratitude, Greetings, etc.) so that automated methods to determine the relational strategy in use can be explored.

If these strategies are practiced by users of IVAs, it is important to identify them; enabling IVAs to separate such language can help better clarify the users’ main intention. For IVAs to become more human-like, determining which segments of a request are relational is necessary to allow these IVAs to both understand the user intent correctly and to include empathetic or reciprocal relational strategies.

The identification of relational strategies in a single conversational turn can be structured as a multi-intent detection problem. The user not only wants the task completed (the primary intent); they may also attempt to build credibility or some common ground with the IVA (the secondary intent). Segments of text such as justification or backstory can be annotated as secondary intent and ignored while determining the primary intent. Once relational language is isolated, a separate classification can determine what relational strategies are in use and how to properly respond.

Multi-intent detection within dialog systems is still an emerging field; in recent work, only one intent is assumed to be present per turn (Sarikaya et al., 2016). A few methods exist, such as Xu & Sarikaya (2013), which uses multi-label learning, and Kim et al. (2016), which employs a two-stage intent detection strategy. However, Xu & Sarikaya (2013) provided no explanation of how data was annotated nor any mention of annotator agreement. In Kim et al. (2016), multi-intent data was fabricated by concatenating all combinations of single-intent sentences.

In this article, we provide several contributions. Most importantly, we create the first publicly available customer service corpus with annotated relational segments. We propose an evaluation measure and set a baseline by comprehensive human annotation, ultimately confirming that the addition of relational language can obfuscate the user’s intention to IVAs not designed to recognize it. Along with annotated relational segments, our corpus includes multi-intent requests to further research in multi-intent detection. We analyze human agreement in determining the presence of multiple intents so that future research on multi-intent detection can be evaluated in the light of prior human performance. Through these contributions, we hope to encourage further research and ultimately aid in the design of more intelligent IVAs.

In the following sections, we discuss in detail how the data was collected, annotated, and merged to create highlighted sections. Another round of review was then done on these highlighted sections to determine the class of language present in them (e.g. Greeting, Gratitude). We then measure and compare the frequency of relational strategies when users present their requests to IVAs versus humans. Finally, we conduct an experiment with three commercial IVAs, demonstrating that removal of relational strategies lowers confusion and leads to improved responses.

2 Data Collection

Next IT Corporation designs and builds IVAs on behalf of other companies and organizations, typically for customer service automation. This unique position allows access to a large number of IVA-human conversations that vary widely in scope and language domain. We selected IVAs for data collection based on the volume of conversations engaged in, the scope of knowledge, and the diversity of the customer base.

For diversity, we considered whether the target user base of the IVA was local, regional, national, or international and mapped the locations of the users engaging in conversations to verify this visually. We only considered IVAs that had a national or international target user base and did not appear to have a dominant regional clustering, to ensure that conversations were well distributed across users from different regions. This was to control for relational styles that may differ between regions.

IVAs deployed in domains that were highly sensitive, such as human resource management or health care, were not considered. As a result, human-computer data was collected from three live customer service IVAs in the language domains of airline, train travel, and telecommunications. Each agent met our criteria of a broad knowledge base, sufficient conversation volume, and a very diverse user base.

The selected IVAs are implemented as mixed-initiative dialog systems, each understanding more than 1,000 unique user intentions. The IVAs have conversational interfaces exposed through company websites and mobile applications. In addition, the IVAs are multi-modal, accepting both speech and textual inputs, and also have human-like qualities with simulated personalities and interests. A random sample of 2,000 conversations was taken from each domain. The samples originate from conversation logs during November 2015 for telecommunications and train travel and March 2013 for airline travel. There were 127,379 conversations available in the logs for the airline IVA. The telecommunications and train travel logs contained 837,370 and 694,764 conversations, respectively. The first user turn containing the problem statement was extracted. We focus on the initial turn as a user’s first impression of an IVA is formed by its ability to respond accurately to his or her problem statement, and these impressions persist once formed (, 2017; Madhavan et al., 2006). Therefore, it is imperative that any relational language present does not interfere with the IVA’s understanding of the problem statement.

Finding a large mixed-initiative human-human customer service dataset for comparison with our human-computer dialogs proved difficult. Despite mentions of suitable data in Vinyals & Le (2015) and Roy et al. (2016), the authors did not release their data. Inspecting the human-human chat corpora among those surveyed by Serban et al. (2015) revealed only one candidate: the Ubuntu Dialogue Corpus (Lowe et al., 2017). The corpus originates from an Internet Relay Chat (IRC) channel where many users discuss issues relating to the Ubuntu operating system. After a user posts a query on the channel, all following threads between the querying user and each responding user are isolated to create two-way task-specific dialogs. However, we want to study the initial problem statements to compare their composition with those extracted from our data. In the Ubuntu corpus, these are posed to a large unpaid audience in the hopes that someone will respond. The observed relational language and behavior was, therefore, no different than problem statements inspected in other IRC or forum datasets, and, for our purposes, was no more fitting than any other forum or open IRC dataset.

In addition, we desire to not just measure relational language content but also feed the problem statements into an IVA and measure the effect of any relational language on its understanding of the user intent. To do this, we needed requests that were very similar to those already handled by one of the selected IVAs to have any hope of the user intent already existing in the agent’s knowledge base. Unsatisfied with the Ubuntu dataset, we instead focused on publicly visible question and answering data in domains similar to those of the selected IVAs.

Upon searching publicly visible chat rooms and forums in the domains of travel and telecommunications support, we found the TripAdvisor airline forum to be the closest in topic coverage. This forum includes discussions of airlines and policies, flight pricing and comparisons, flight booking websites, airports, and general flying tips and suggestions. We observed that the intentions of requests posted by users were very similar to those of requests handled by our airline travel IVA. While a forum setting is a different type of interaction than chatting with a customer service representative (user behavior is expected to differ when the audience is not paid to respond), it was the best fit that we could obtain for our study and subsequent release. A random sample of 2,000 threads from the 62,736 present during August 2016 was taken, and the initial post containing the problem statement was extracted. We use request hereafter to refer to the complete text of an initial turn or post extracted as described.

2.1 Annotation

From our four datasets of 2,000 requests each, we formed two equally-sized partitions of 4,000 requests with 1,000 pulled from every dataset. Each partition was assigned to four reviewers; thus, all 8,000 requests had exactly four independent annotations. All eight reviewers were employees of Next IT Corporation who volunteered to do the task in their personal time. As payment, each reviewer received a $150 gift card.

Dataset      All Requests  Multi-Intent  Single Intent  Unnecessary  Avg. Length
TripAdvisor  2000          734           1266           94.1%        93.26
Telecom      2000          149           1851           77.3%        19.81
Airline      2000          157           1843           68.6%        21.64
Train        2000          201           1799           55.3%        20.07
Table 1: Dataset statistics. The Multi-Intent column represents the count of Requests where one or more reviewers flagged it as containing more than one user intention. The Unnecessary column represents the percentage of Single Intent requests where one or more reviewers selected any text as being unnecessary in determining user intent. Avg. Length is the number of words present in All Requests, on average.

The reviewers were instructed to read each request and mark all text that appeared to be additional to the user intention. The reviewers were given very detailed instructions, shown in Appendix B, and were required to complete a tutorial demonstrating different types of relational language use before working on the actual dataset. As the data was to be publicly released, we ensured that the task was clear. If more than one user intention was observed, the reviewer was instructed to flag it for removal. This was a design decision to simplify the problem of determining language necessary for identifying the user intention. Furthermore, as mentioned in section 1, IVAs with the ability to respond to multiple intentions are not yet commonplace. Although flagged requests were not used for further analysis, they are included in the released data to enable future research. After discarding all multi-intent requests, 6,759 requests remained. Per-dataset statistics are given in Table 1.

A request from the TripAdvisor data is given in Example 2 below. A reviewer first read over the request and determined that the user intent was to gather suggestions on things to do during a long layover in Atlanta. The reviewer then selected all of the text that they felt was not required to determine that intent. This unnecessary text in Example 2 is shown in gray. Each of the four reviewers performed this task independently, and we discuss in the next sections how we compared their agreement and merged the annotations.

Example 2
Original Request: Hi My daughter and I will have a 14 hour stopover from 20.20 on Sunday 7th August to 10.50 on Monday 8th August. Never been to Atlanta before. Any suggestions? Seems a very long time to be doing nothing. Thanks

Determine User Intent: Things to do during layover in Atlanta

Annotated Request: Hi My daughter and I will have a 14 hour stopover from 20.20 on Sunday 7th August to 10.50 on Monday 8th August. Never been to Atlanta before. Any suggestions? Seems a very long time to be doing nothing. Thanks

Reviewers averaged 1 request per minute over 1,000 requests on TripAdvisor data and 4 per minute over 3,000 requests from the three IVA datasets. We observed that each of the eight reviewers required 29 hours on average to complete their 4,000 assigned requests.

3 Annotation Alignment

To compare the raw agreement of annotations between two reviewers, we use a modification of alignment scores, a concept in speech recognition from hypothesis alignment to a reference transcript (Zechner & Waibel, 2000). We modify this procedure as insertions and deletions do not occur. Reviewers mark sequences of text as being unnecessary in determining user intention. When comparing annotations between two reviewers, an error ($e$) is considered to be any character position in the text where this binary determination does not match between them. The alignment score can be calculated as:

$$\mathit{align} = \frac{n - e}{n}$$

where $n$ is the total number of characters. Thus, $\mathit{align} \in [0, 1]$, where $1$ is perfect alignment. Reviewers may or may not include whitespace and punctuation on the boundaries of their selections, which can lead to variations in $e$. Therefore, when two selections overlap, we ignore such characters on the boundaries while determining $e$. Figure 1 shows a fabricated example of alignment between two annotations. In the first selection, the trailing whitespace and punctuation are ignored as they occur within overlapping selections. Notice, however, that whitespace and punctuation count in the last selections as there is no overlapping selection with the other reviewer; therefore, there is no possibility of disagreement on the boundaries.

$A$: [Hi, ]I need a new credit card[, my old doesn’t work any more.] Can you help?

$B$: [Hi], I need a new credit card, my old doesn’t work any more.[ Can you help?]

Figure 1: Example alignment scoring between two fabricated annotations $A$ and $B$. Text between “[” and “]” was marked as unnecessary for intent determination. Positions with an alignment error are underlined.
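The scoring procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the scoring code used for the corpus: the score is taken to be the fraction of character positions on which two reviewers agree, selections are represented as hypothetical `(start, end)` character spans, and the boundary rule is read as "strip leading and trailing whitespace and ASCII punctuation from any selection that overlaps a selection by the other reviewer". The helper names (`span_of`, `alignment_score`, and so on) are ours.

```python
import string

# Characters ignored on the edges of overlapping selections. Using ASCII
# whitespace and punctuation here is an assumption of this sketch.
BOUNDARY_CHARS = set(string.whitespace + string.punctuation)


def span_of(text, fragment):
    """Return the (start, end) span of the first occurrence of fragment."""
    start = text.index(fragment)
    return start, start + len(fragment)


def trim(text, start, end):
    """Shrink a span so it neither starts nor ends on a boundary character."""
    while start < end and text[start] in BOUNDARY_CHARS:
        start += 1
    while end > start and text[end - 1] in BOUNDARY_CHARS:
        end -= 1
    return start, end


def overlaps(span, others):
    """True if span shares at least one character with any span in others."""
    start, end = span
    return any(start < o_end and o_start < end for o_start, o_end in others)


def to_mask(text, spans, other_spans):
    """Binary per-character mask; overlapping selections are trimmed first."""
    mask = [False] * len(text)
    for span in spans:
        start, end = trim(text, *span) if overlaps(span, other_spans) else span
        for i in range(start, end):
            mask[i] = True
    return mask


def alignment_score(text, spans_a, spans_b):
    """Fraction of character positions where the two annotations agree."""
    n = len(text)
    if n == 0:
        return 1.0
    mask_a = to_mask(text, spans_a, spans_b)
    mask_b = to_mask(text, spans_b, spans_a)
    errors = sum(a != b for a, b in zip(mask_a, mask_b))
    return (n - errors) / n


# The two fabricated annotations from Figure 1.
text = "Hi, I need a new credit card, my old doesn't work any more. Can you help?"
spans_a = [span_of(text, "Hi, "), span_of(text, ", my old doesn't work any more.")]
spans_b = [span_of(text, "Hi"), span_of(text, " Can you help?")]
print(round(alignment_score(text, spans_a, spans_b), 3))
```

Under this reading, the two reviewers in Figure 1 agree on the greeting once its trailing comma and space are ignored, and disagree on the 45 characters covered by the two non-overlapping selections, out of 73 characters in total.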

The alignment score was calculated for every request between all four annotations and then averaged. For example, an alignment score was calculated for each request between reviewers $A$ and $B$, $A$ and $C$, and $A$ and $D$. The same process was repeated between reviewers $B$ and $C$, $B$ and $D$, then $C$ and $D$. Finally, alignment scores between all unique pairs of reviewers over all requests were averaged per dataset. The distribution of average scores per dataset is shown in Figure 2 (a). It may appear, at first, that two annotators could inflate the dataset alignment score by simply making annotations infrequently. However, as each request had four annotators, the average alignment score would actually be lower as those reviewers would have large error compared to the other two. The per dataset alignment averages can, in fact, be higher if a dataset has a large number of requests where no reviewer selected any text.
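As a rough illustration of the pairwise averaging, the sketch below scores every unique pair among the four annotations of one request and takes the mean. It assumes each annotation has already been converted to a binary per-character mask and, for brevity, omits the boundary handling described earlier; `agreement` and `mean_pairwise_alignment` are hypothetical names.

```python
from itertools import combinations


def agreement(mask_a, mask_b):
    """Fraction of character positions where two binary masks match."""
    n = len(mask_a)
    if n == 0:
        return 1.0
    errors = sum(a != b for a, b in zip(mask_a, mask_b))
    return (n - errors) / n


def mean_pairwise_alignment(masks):
    """Average the score over every unique pair of reviewers."""
    pairs = list(combinations(masks, 2))
    return sum(agreement(a, b) for a, b in pairs) / len(pairs)


# Four reviewers, one four-character request: two reviewers select the
# first two characters, one selects only the first, one selects nothing.
masks = [
    [True, True, False, False],
    [True, True, False, False],
    [True, False, False, False],
    [False, False, False, False],
]
print(mean_pairwise_alignment(masks))
```

With four reviewers there are six unique pairs, so a reviewer who rarely selects text lowers three of the six pairwise scores whenever the others do select something.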

(a) Overall alignment scores
(b) Alignment scores when reviewers agree that additional language is present
Figure 2: The distribution of average alignment scores between all four annotations per dataset is shown in (a). We compute average alignment scores where all reviewers agree that additional language is present in (b).

Therefore, it is interesting to remove the effect of these cases and compare the ability of reviewers to agree on the selection boundaries given they both agree that selection is necessary. To measure this, we compute average alignment scores where both reviewers agree that additional language is present, shown in Figure 2 (b). Observe that although the Train dataset has the highest overall alignment in both cases, it is lower when the reviewers both select text, indicating it has many cases where no reviewers selected anything (which is in agreement with Table 1). In the case of TripAdvisor, it appears that there are a significant number of requests where one or more reviewers do not select text, but the others do, lowering the overall alignment score in Figure 2 (a).

Alignment based on word-level instead of character-level agreement was also considered. For each word, if the reviewer selected at least 50% of the word, it was considered to be marked. This resolves situations where a reviewer accidentally missed the first or last few characters of a word in their selection. However, this may introduce errors where two-letter words have only one character selected. In this case, it is impossible to automatically decide if the reviewer meant to select the word or not, as always selecting such words will be susceptible to the same error. Besides this ambiguous case, we felt it was safe to assume that words of longer length with less than half of the word selected were not intended to be marked.

Selected words were then used in place of selected characters in calculating the alignment scores between the reviewers in the same manner as Figure 1. We discovered that the alignment scores were only 0.2% different on average across the datasets than the character level alignment scores shown in Figure 2. This indicates that reviewers are rarely selecting partial words, and any disagreement is over which words to include in the selections. Therefore, in the released corpus and in this article, we consider selections using absolute character position which retains the reviewers’ original selection boundaries.
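The word-level rule can be sketched as follows. The tokenization is an assumption of this example (words are maximal runs of non-whitespace characters, so attached punctuation counts toward word length), as is the function name `selected_words`; the 50% threshold is the one described above.

```python
import re


def selected_words(text, mask, min_fraction=0.5):
    """Return (word, is_selected) pairs under the at-least-half rule.

    mask[i] is True when character i of text was selected by the reviewer.
    """
    result = []
    for match in re.finditer(r"\S+", text):
        start, end = match.span()
        covered = sum(mask[start:end])
        result.append((match.group(), covered / (end - start) >= min_fraction))
    return result


# The reviewer missed the first character of "Hi," but the word still counts.
text = "Hi, I need help"
mask = [False] * len(text)
for i in range(1, 4):
    mask[i] = True
print(selected_words(text, mask))

# The ambiguous case: one of two characters selected is exactly 50%.
print(selected_words("is it", [True, False, False, False, False]))
```

The second call shows why two-letter words are ambiguous: a single selected character sits exactly on the threshold, so the word is treated as marked whether or not that was the reviewer's intention.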

3.1 Agreement Between Reviewers

Agreement  TripAdvisor  Train  Airline  Telecom
$\kappa$   0.270        0.450  0.405    0.383
1          1192         995    1264     1431
2          1092         709    948      1154
3          863          458    644      795
4          534          205    292      410
Table 2: Reviewer agreement on if any text should be selected. For example, row 3 is the number of requests with selections by at least three reviewers.

As it is difficult to determine how often all reviewers agree additional language is present from alignment scores alone, we measured reviewer agreement on the presence of additional language and multiple user intentions. For additional language presence, we calculated Fleiss’ $\kappa$ over the annotations where the classes compared were if a reviewer did or did not select text. As demonstrated in Table 2, regardless of domain, this is a subjective task. While there is moderate agreement in the Train and Airline sets, the TripAdvisor set, in particular, is lower in agreement, which reinforces our previous observations in Figures 2 (a) and (b). Due to the sensitivity of $\kappa$ measurements (Feinstein & Cicchetti, 1990; Guggenmoos-Holzmann, 1993), these values must be interpreted in light of the task. Despite the lower $\kappa$ values, we are only measuring presence or absence of unnecessary language, and these two categories did not necessarily occur in equal frequencies. Under these conditions, according to Bakeman et al. (1997), a $\kappa$ between 0.2 and 0.45 may suggest reviewer reliabilities between 80% and 90%, respectively. Therefore, despite the lower values for $\kappa$, the individual reviewer annotations appear reliable and can be further improved when merged based on agreement as discussed in the following section.
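For readers who wish to reproduce this kind of agreement measurement on the released annotations, the standard Fleiss' kappa computation can be sketched as below. The input layout is an assumption of this example: one row per request, holding the number of reviewers who did and did not select text. The function names are ours.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of
    raters who assigned item i to category j (same rater total per row).

    Undefined (division by zero) when every rating falls in one category.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    # Observed agreement for each item, then averaged over items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    # Agreement expected by chance from the overall category proportions.
    p_cats = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    p_e = sum(p * p for p in p_cats)
    return (p_bar - p_e) / (1 - p_e)


def counts_from_flags(flags_per_request):
    """Convert per-reviewer booleans (selected text or not) to count rows."""
    return [[sum(flags), len(flags) - sum(flags)] for flags in flags_per_request]


# Toy data: four reviewers, three requests.
flags = [
    [True, True, True, True],
    [False, False, False, False],
    [True, True, False, False],
]
print(round(fleiss_kappa(counts_from_flags(flags)), 3))
```

Because chance agreement depends on the overall category proportions, datasets in which nearly every request receives a selection can show a low kappa even when raw agreement is high, which is the sensitivity noted above.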

Example 3
Reviewer 1: Our tv reception is horrible. [is there an outage in my area?]
Reviewer 7: [Our tv reception is horrible.] is there an outage in my area?

We did observe situations where two reviewers disagreed on the real intent of the user, thereby causing conflict in the selection of unnecessary text. While these were rare, Example 3 demonstrates how even humans sometimes struggle with determining the intention of written requests. Reviewer 1 appears to believe that the primary intent of the user is to notify the agent about poor television reception, and the query about the outage in the area is out of curiosity. However, reviewer 7 appears to believe the primary intent is to discover if a cable outage is present in the area, and the complaint about reception justifies the query. The effects of these disagreements on intent can be mitigated by merging the annotations based on the number of reviewers who agreed on a selected character.

Next, we considered the reviewers’ determination of multiple intentions. A $\kappa$ was calculated over how reviewers flagged requests containing more than one user intention. As shown in Table 3, we see somewhat similar performance in this task as in the previous selection task. This table demonstrates the difficulty of multi-intent detection, even for humans. The domain does not seem to be a factor as $\kappa$ is similar across datasets. It is apparent, however, that in the forum setting, users are much more likely to insert multiple intentions in a single request than in a chat setting.

Agreement  TripAdvisor  Train  Airline  Telecom
$\kappa$   0.415        0.374  0.434    0.386
1          734          201    157      149
2          480          85     69       56
3          275          50     38       32
4          71           8      15       11
Table 3: Reviewer agreement on multi-intent detection. For example, row 3 is the number of requests flagged as containing multiple intentions by at least three reviewers.
(a) Alignment between group 1 reviewers.
(b) Alignment between group 2 reviewers.
Figure 3: Alignment scores between each reviewer and the other three members of their group, averaged across the four datasets.

How reviewers compare to the rest in their selections is another aspect to be considered. Figure 3 (a) compares how each reviewer agreed with the other three in the first group. We can see that, overall, the mean is very close. However, reviewer 7, in particular, had more variation in his or her selections. Similarly, Figure 3 (b) compares how each reviewer agreed with the other three in the second group. In the second group, we see slightly more disagreement, particularly with reviewer 6. This could be because he or she did not interpret the user intention the same as others or because the reviewer was more generous or conservative in selections compared to the others in the group.

3.2 Merging Selections By Agreement

Figure 4: Mean number of words highlighted per request by dataset. Agreement is the number of reviewers who marked the same word for removal, where 0 is the original request length.

The four annotations per request were merged using the following strategy: for every character position in the request, if at least a threshold of two annotations contained that position, highlight it. To quantify the average reduction of request size, we count the number of words highlighted for each level of reviewer agreement. In Figure 4, we can see that as the agreement required increases, the size of the highlight decreases significantly.
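The merging strategy amounts to a per-character vote. A minimal sketch, assuming each annotation is a list of hypothetical `(start, end)` character spans and that the cleaned request is formed by deleting the highlighted characters; `merge_selections` is our name, not one from the released corpus:

```python
def merge_selections(text, annotations, threshold=2):
    """Highlight characters selected by at least `threshold` reviewers.

    annotations holds one list of (start, end) spans per reviewer.
    Returns the per-character highlight mask and the cleaned request.
    """
    votes = [0] * len(text)
    for spans in annotations:
        marked = set()
        for start, end in spans:
            marked.update(range(start, end))
        # A reviewer contributes at most one vote per character position.
        for i in marked:
            votes[i] += 1
    highlight = [v >= threshold for v in votes]
    cleaned = "".join(ch for ch, h in zip(text, highlight) if not h)
    return highlight, cleaned


text = "Hi, I need a ticket. Thanks"
greeting = (0, 4)                              # "Hi, "
thanks = (text.index(" Thanks"), len(text))    # " Thanks"
annotations = [[greeting, thanks], [greeting], [thanks], []]

for level in (1, 2, 3, 4):
    _, cleaned = merge_selections(text, annotations, threshold=level)
    print(level, repr(cleaned))
```

Raising the threshold shrinks the highlight: in this toy case both selections survive at agreement levels one and two, and nothing is highlighted at levels three and four.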

4 Annotating Relational Content

To determine the use of relational strategies, a second round of manual analysis was performed. An increase in agreement corresponds to a significant removal of remaining annotations as can be seen in Figure 4. Therefore, in order to have sufficient data for analysis given the sample size, an agreement of two is used. A comparison of relational annotation using all agreement levels is left for future work.

Once merged, highlighted sections were analyzed by the authors to determine the classes of language present. Each such section was evaluated and given one or more of the following tags: Greeting, Backstory, Justification, Gratitude, Rant, Express Emotion, Other. See Figure 7 for an overview of the entire process.

Greetings are a common relational strategy humans use to build rapport with other humans and machines (Lee et al., 2010).

Backstory is a method of self-exposure that may be employed by the customer. In Example 1 given in section 1, the customer included the fact that he or she is attending a graduation as a means of self-exposure. This may be an attempt to build common ground with the agent or it may indicate the importance of the trip and motivate the agent to help the customer succeed.

Justification is used by the customer to argue why the agent should take some action on the part of the customer. For instance, when trying to replace a defective product, a customer may explain how the product failed to establish credibility that the product was at fault.

Gratitude, like greetings, is also used by humans to build rapport with humans and machines (Lee et al., 2010).

Ranting is a means of expressing dissatisfaction when a customer feels frustrated, ignored, or misunderstood. In computer-mediated conversations, the non-verbal emotional cues present in face-to-face conversations are missing; thus, humans resort to such negative strategies to convey their emotions (Laflen & Fiorenza, 2012). For tagging purposes, we define a Rant to encompass any excessive complaining or negative narrative.

Expressing emotions can be a means of showing displeasure when a customer feels a conversation is not making adequate progress or in reaction to an unexpected or disagreeable agent response. This can also indicate joking or other positive emotional expression. The tag Express Emotion is used as a catch-all for any emotional statement that is not covered by Rant. Examples would be: “i love that!”, “UGH!”, “WHY???”.

The Other tag indicates that some or all of the section does not contain any relational language. This is commonly a restatement of the primary intent or facts that reviewers marked as unnecessary.

4.1 Analysis of Relational Tags

Figure 5: Incidence of relational language per dataset. An incidence of 0.5 means the tag is present in 50% of all requests.

As shown in Figure 5, we see that backstory is more common in human-to-human forum posts. However, both Airline and Telecom IVAs also have a significant amount of backstory. Although minimal, ranting and justification were present in Telecom. The Train dataset appeared to contain the least amount of relational language. It is difficult to speculate why without deeper analysis of the user demographic, the presentation of the IVA on the website, and the IVA knowledge base.

Figure 6: Pearson coefficients of tag correlation across datasets.

The correlation between tags is shown in Figure 6. When greetings are present, it appears that there is a likelihood there will also be gratitude expressed, which agrees with the findings in Lee et al. (2010) and Makatchev et al. (2009). Also interesting is the apparent correlation between backstory and gratitude. Those that give background on themselves and their situations appear more likely to thank the listener. Ranting appears to be slightly negatively correlated with greetings, which is understandable assuming frustrated individuals are not as interested in building rapport as they are in venting their frustrations.
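One way such tag correlations could be computed is sketched below. The exact procedure behind Figure 6 is not described beyond the use of Pearson coefficients, so this example assumes each request is reduced to a 0/1 indicator per tag and that the coefficient is taken between two indicator vectors; the toy data and helper names are hypothetical.

```python
from math import sqrt


def indicator(tagged_requests, tag):
    """1 where the tag was assigned to a request's highlight, else 0."""
    return [1 if tag in tags else 0 for tags in tagged_requests]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Undefined (division by zero) if either sequence is constant.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical tag sets for four requests.
tagged = [
    {"Greeting", "Gratitude"},
    {"Greeting", "Gratitude", "Backstory"},
    {"Rant"},
    {"Backstory"},
]
greeting = indicator(tagged, "Greeting")
print(pearson(greeting, indicator(tagged, "Gratitude")))
print(pearson(greeting, indicator(tagged, "Rant")))
```

On binary indicators the Pearson coefficient coincides with the phi coefficient, so a positive value means two tags co-occur more often than their individual frequencies would suggest.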

Figure 7: An overview of the review and merging process. In this example from the TripAdvisor corpus, reviewers 2, 3, and 4 all agree on which text is unnecessary. Selections are merged to form highlighted text that is then removed from the original text to create a cleaned request. A second round of annotation was done on highlighted texts to determine the classes of language present. The colors of the text correspond to the class present.

5 Experiments and Results

To measure the effect on IVA performance and determine what level of reviewer agreement is acceptable, we first constructed highlights for the 6,759 requests using all four levels of reviewer agreement. Next, four cleaned requests were generated from each original request by removing the highlighted portion for each level of agreement resulting in 27,036 requests with various amounts of relational language removed.
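The construction of cleaned requests at each agreement level can be sketched as follows. This is a minimal illustration that assumes each reviewer's selections are stored as character spans and that agreement is counted per character; the span format, the whitespace normalization, and the function name `clean_request` are hypothetical and may differ from the actual merging procedure.

```python
def clean_request(text, selections, threshold):
    """Remove characters selected by at least `threshold` reviewers.

    `selections` holds one list per reviewer of (start, end) character
    spans, with `end` exclusive (an assumed format for this sketch).
    """
    votes = [0] * len(text)
    for reviewer_spans in selections:
        marked = set()
        for start, end in reviewer_spans:
            marked.update(range(max(start, 0), min(end, len(text))))
        # Each reviewer contributes at most one vote per character.
        for i in marked:
            votes[i] += 1
    kept = "".join(ch for ch, v in zip(text, votes) if v < threshold)
    # Collapse the whitespace left behind by removed segments.
    return " ".join(kept.split())


text = "Hi, how can I change controls to allow R rated. I have no kids."
greeting = (0, 3)  # "Hi,"
backstory = (text.index("I have no kids."), len(text))
selections = [
    [greeting, backstory],  # reviewer 1
    [greeting],             # reviewer 2
    [greeting, backstory],  # reviewer 3
    [],                     # reviewer 4
]
cleaned_by_threshold = {t: clean_request(text, selections, t) for t in (1, 2, 3, 4)}
```

In this toy case the greeting has three votes and the backstory has two, so a threshold of 2 removes both, a threshold of 3 removes only the greeting, and a threshold of 4 leaves the request unchanged.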

Every unaltered request was fed through its originating IVA, and the intent confidence score and response were recorded. We then fed each of the four cleaned requests to the IVA and recorded the confidence and response. The TripAdvisor data was fed through the Airline IVA as it provided the most similar domain. This was also a test to see if lengthy human-to-human forum posts could be condensed and fed into an existing IVA to generate acceptable responses. We observed an increase in confidence in all domains with an average of 4.1%. The Telecom set, which had the highest incidence of backstory outside of TripAdvisor, gained 5.8%.

In addition to intent confidence, we measured the effect of relational language removal on overall IVA understanding. An A-B test was conducted where four reviewers were shown the user’s original request along with the IVA response from the original request and the IVA response from a cleaned request. They were asked to determine which, if any, response they believed better addressed the request. If the original IVA response was preferred, it was assigned the value -1. If the response to the cleaned request was preferred, it was assigned the value 1. Finally, if neither response even remotely addressed the user’s request or if both responses were comparable, it was given the value 0.

Figure 8: Results of the A-B test on IVA response to original request versus cleaned request. Black bars indicate 95% confidence intervals.

This A-B test was done only on responses that changed as a result of the cleaned request (3,588 IVA responses changed out of the 27,036 total responses). The result of this analysis is shown in Figure 8. Note that the lower bound is -1, indicating the original IVA response is preferred. If language is removed, the IVA response to the cleaned request is more likely preferred as made evident by the significantly positive skew. 95% confidence intervals are included, and although they may seem large, this is expected; recall that a 0 was assigned if both IVA responses address the user request comparably or neither did. In 10 of the 16 cases, the skew is towards the cleaned response within the 95% confidence interval.
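The scoring scheme above can be summarized per dataset and threshold as a mean preference with a confidence interval. The sketch below assumes a normal-approximation interval (mean plus or minus 1.96 standard errors); the text does not state how the intervals in Figure 8 were computed, so this choice and the function name `mean_with_ci` are assumptions.

```python
import math


def mean_with_ci(scores, z=1.96):
    """Mean of A-B scores in {-1, 0, 1} with a normal-approximation interval.

    Returns (mean, lower, upper). The default z of 1.96 corresponds to an
    approximate 95% interval.
    """
    if len(scores) < 2:
        raise ValueError("need at least two scores")
    if any(s not in (-1, 0, 1) for s in scores):
        raise ValueError("scores must be -1, 0, or 1")
    n = len(scores)
    mean = sum(scores) / n
    # Sample variance with Bessel's correction.
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)
    half_width = z * math.sqrt(variance / n)
    return mean, mean - half_width, mean + half_width


# Toy scores, not taken from the study: 1 = cleaned response preferred,
# -1 = original response preferred, 0 = comparable or neither adequate.
toy_scores = [1, 1, 0, -1, 1, 0, 1, 1]
mean, lower, upper = mean_with_ci(toy_scores)
```

A positive mean indicates that responses to cleaned requests were preferred on balance; many 0 scores pull the mean toward zero relative to the spread of the scores, which is one reason the intervals can appear wide.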

This is evidence that the current usage of unnecessary language has a measurable negative effect on live commercial IVAs. TripAdvisor is an interesting exception, especially when the threshold is 4. However, this can be somewhat expected as it is a human-to-human forum where user inputs are significantly longer, and primary intent can be difficult to identify even for a human.

Although removing language is generally preferred, how much should be removed? This is another question addressed in Figure 8. The higher the threshold, the more reviewers need to agree on the removal of the same segment of text. Thus, although language may still be removed, less is removed at a high threshold than at a lower one because of the low kappa (see Section 3.1). In effect, the higher thresholds may remove less unneeded language, but the language that is removed is more likely to be actually unnecessary, which appears to improve IVA understanding. However, using a threshold of 4 seems to offer limited improvement over 3 due to reviewer disagreement.

6 Conclusion

Through the collection of this corpus and the annotation of relational segments, we have shown that users of commercial IVAs are already applying relational strategies to these IVAs. It is our prediction that these strategies will only increase as IVAs become more ubiquitous and human-like. We have also shown that the removal of unnecessary language during intent determination not only increases intent classifier confidence but also improves response by reviewer standards. It is our hope that by providing this data to the community, others can work on the automation of both the separation of business content from relational content and the classification of relational strategies.

The fact that it is possible to improve IVA responses to noisy forum data by the removal of relational language gives hope that automated methods of relational language detection may allow IVAs to contribute to human-to-human forum settings without substantial modifications to their underlying language models. For example, an IVA could be employed on a commercial airline website while also monitoring and contributing to airline forum threads related to its company. This saves substantial effort and cost compared to deploying two special-purpose IVAs for each task.

Given the problematic presence of relational language in task-based inputs and our promising preliminary results, we encourage the research community to investigate ways of automating this annotation using our publicly available data. There are many applications for such automation. Determining if turns contain ranting in automatic quality assurance monitoring systems like the one presented by Roy et al. (2016) could help surface poor customer service more efficiently. In systems for automating IVA refinement such as the one described by Beaver & Freeman (2016), automatic detection of the presence of backstory or justification can be used as an indicator of possible missed intention. In live IVAs, simplifying inputs before determining user intention as in our experiment can increase intent classification accuracy. Finally, IVAs can appear more human-like by classifying relational language to explicitly deal with relational content and respond in a relational manner. Such applications would greatly improve the quality and relational abilities of customer service IVAs.

Appendix A Annotation tools and process

Figure 9: Screenshot from the web-based annotation tool used by the eight reviewers for the first round of annotations.

The reviewers used a special purpose web application for the annotation tasks. A screenshot of the interface used for the first task is shown in Figure 9. Each reviewer was required to complete a tutorial on tool usage and example selection tasks before they were allowed to start on the actual datasets. A comprehensive explanation of the task was given to all reviewers before they began, and the authors were available by email to address questions or comments during the process. As all reviewers were Next IT Corporation employees and knew the authors, there was ongoing communication through the task to ensure that annotations were correctly applied. In addition, at any time they could click on the Help link at the top right to see the instructions given in Appendix B.

For the second round of relational tagging done by the authors, each highlighted section created by merging selections with an agreement of at least two reviewers was displayed. An author then selected all relational tags that appeared within that highlighted section. A screenshot of this interface is shown in Figure 10.

Figure 10: Screenshot from the web-based annotation tool used by the authors for the tagging of relational segments.

Appendix B Instructions for Reviewers


First, determine the purpose of the user query or statement. Then, using the mouse, select all subsections of the text that do not contribute to determining the purpose. Only select text that is clearly unneeded. If more than one selection is required click the Save button between each selection. If only one selection is required click the Next button to save the current selection and load the next text. If no action is necessary click Next to load the next text. If you need to return to the previous text click Previous (Note: this does not work if the last input was marked as Multiple Intent). To remove a selection or edit a selection click Undo and reselect the intended text.

It is possible many texts will not need any markup. Do not worry about including spaces on either side of your selection as they will be ignored. Greetings and expressions of gratitude should always be marked as they are unnecessary to determining the intention, unless the entire text consists only of a greeting or thanks in which case they are the intention. If a text does not appear to have any clear intention or point no markup beyond greetings or emotion is needed. In the following examples the gray text indicates selections that are unnecessary for determining the user’s intent.


  • “I will be traveling to see my husband before he leaves on a deployment with my child under the age of 2 in March and I am looking for the cheapest price with her having a seat. Can you help.

  • Hi, how can I change controls to allow R rated. I have no kids so I don’t know why I don’t have permission for this.

  • I did not get a reservation number. What number may I call to get my reservation number?”

Multiple Intentions

If an input clearly contains two or more user intentions, click the Report Multiple button to flag it for removal and load the next text.


  • “Hi, we are traveling with our Grandson and need to know what kind of identification we will need for him? Also, when we arrive in Penn station, will we be able to stop our one bag in the baggage area?”

  • “hi I have a ticket ###### when must I book the ticket and when must I finish my trip. How much is the ticket worth? How much is the change fee?”

Personal Identifiable Information (PII)

If a text contains any information that could be considered personally identifiable it needs to be flagged for cleanup. PII includes usernames and proper names, credit card numbers, ticket numbers, confirmation numbers, phone numbers, account numbers, email addresses, zipcodes or mailing addresses, and any other information that could be used to identify an individual.

If you see anything you suspect of being sensitive click on the Report PII button and a red exclamation mark will appear next to the input. It is better to report something you are unsure of than not report data that is actually sensitive. However, there is no need to report data that is already sanitized such as “My account number is xxx-xxxx”. After reporting the text, mark it up as usual and continue to the next user input. All inputs reported with PII will be later reviewed and cleaned manually.

The authors would like to thank Next IT Corporation for sponsoring this work and sharing their data with the research community to further advance the human-like properties of IVAs.


  • Bakeman et al.  (1997) Bakeman, Roger, McArthur, Duncan, Quera, Vincenç, & Robinson, Byron F. 1997. Detecting sequential patterns and determining their reliability with fallible observers. Psychological Methods, 2(4), 357.
  • Ballantyne (2004) Ballantyne, David. 2004. Dialogue and its role in the development of relationship specific knowledge. Journal of Business & Industrial Marketing, 19(2), 114–123.
  • Beaver & Freeman (2016) Beaver, Ian, & Freeman, Cynthia. 2016. Prioritization of Risky Chats for Intent Classifier Improvement. Pages 167–172 of: Proceedings of the Twenty-Ninth International Florida Artificial Intelligence Research Society Conference. AAAI.
  • (2017) 2017. Bentley Graduate Students Reveal How Users Respond To ChatBots Compared to Human Agents. Available online at
  • Bickmore & Cassell (2001) Bickmore, Timothy, & Cassell, Justine. 2001. Relational agents: a model and implementation of building user trust. Pages 396–403 of: Proceedings of the SIGCHI conference on Human factors in computing systems. ACM.
  • Caramenico (2013) Caramenico, Alicia. 2013. Aetna improves virtual customer experience. Fierce Healthcare. Available online at
  • Chai et al.  (2006) Chai, Joyce Y, Zhang, Chen, & Baldwin, Tyler. 2006. Towards conversational QA: automatic identification of problematic situations and user intent. Pages 57–64 of: Proceedings of the COLING/ACL on Main conference poster sessions. Association for Computational Linguistics.
  • Coppola et al.  (2004) Coppola, Nancy W, Hiltz, Starr Roxanne, & Rotter, Naomi G. 2004. Building trust in virtual teams. IEEE transactions on professional communication, 47(2), 95–104.
  • de Visser et al.  (2016) de Visser, Ewart J, Monfort, Samuel S, McKendrick, Ryan, Smith, Melissa AB, McKnight, Patrick E, Krueger, Frank, & Parasuraman, Raja. 2016. Almost human: Anthropomorphism increases trust resilience in cognitive agents. Journal of Experimental Psychology: Applied, 22(3), 331.
  • Feinstein & Cicchetti (1990) Feinstein, Alvan R, & Cicchetti, Domenic V. 1990. High agreement but low kappa: I. The problems of two paradoxes. Journal of clinical epidemiology, 43(6), 543–549.
  • Frearson (2015) Frearson, Joanne. 2015. From New York: How intelligent assistants are cutting customer waiting times. Business Reporter. Available online at
  • Gibson & Cohen (2003) Gibson, Cristina B, & Cohen, Susan G. 2003. Virtual teams that work. JosseyBass, San Francisco.
  • Guggenmoos-Holzmann (1993) Guggenmoos-Holzmann, Irene. 1993. How reliable are chance-corrected measures of agreement? Statistics in Medicine, 12(23), 2191–2205.
  • Holton (2001) Holton, Judith A. 2001. Building trust and collaboration in a virtual team. Team performance management: an international journal, 7(3/4), 36–47.
  • Kim et al.  (2016) Kim, Byeongchang, Ryu, Seonghan, & Lee, Gary Geunbae. 2016. Two-stage multi-intent detection for spoken language understanding. Multimedia Tools and Applications, 1–14.
  • Laflen & Fiorenza (2012) Laflen, Angela, & Fiorenza, Brittany. 2012. “Okay, My Rant is Over”: The Language of Emotion in Computer-Mediated Communication. Computers and Composition, 29(4), 296–308.
  • Lee et al.  (2010) Lee, Min Kyung, Kiesler, Sara, & Forlizzi, Jodi. 2010. Receptionist or information kiosk: How do people talk with a robot? Pages 31–40 of: Proceedings of the 2010 ACM conference on Computer supported cooperative work. ACM.
  • Levy (2016) Levy, Stephen. 2016. Alexa, Tell Me Where You’re Going Next. Available online at
  • Lowe et al.  (2017) Lowe, Ryan Thomas, Pow, Nissan, Serban, Iulian Vlad, Charlin, Laurent, Liu, Chia-Wei, & Pineau, Joelle. 2017. Training end-to-end dialogue systems with the ubuntu dialogue corpus. Dialogue & Discourse, 8(1), 31–65.
  • Madhavan et al.  (2006) Madhavan, Poornima, Wiegmann, Douglas A, & Lacson, Frank C. 2006. Automation failures on tasks easily performed by operators undermine trust in automated aids. Human Factors: The Journal of the Human Factors and Ergonomics Society, 48(2), 241–256.
  • Makatchev et al.  (2009) Makatchev, Maxim, Lee, Min Kyung, & Simmons, Reid. 2009. Relating initial turns of human-robot dialogues to discourse. Pages 321–322 of: Proceedings of the 4th ACM/IEEE international conference on Human robot interaction. ACM.
  • Marois (2013) Marois, Erica. 2013. Using Intelligent Virtual Agents to Improve the Customer Experience: Brains Before Beauty. ICMI Blog. Available online at
  • Roy et al.  (2016) Roy, Shourya, Mariappan, Ragunathan, Dandapat, Sandipan, Srivastava, Saurabh, Galhotra, Sainyam, & Peddamuthu, Balaji. 2016. QART: A System for Real-Time Holistic Quality Assurance for Contact Center Dialogues. Pages 3768–3775 of: AAAI.
  • Sarikaya et al. (2016) Sarikaya, Ruhi, Crook, Paul, Marin, Alex, Jeong, Minwoo, Robichaud, Jean-Philippe, Celikyilmaz, Asli, Kim, Young-Bum, Rochette, Alexandre, Khan, Omar Zia, Liu, Xiaohu, et al. 2016. An overview of end-to-end language understanding and dialog management for personal digital assistants. In: IEEE Workshop on Spoken Language Technology.
  • Serban et al.  (2015) Serban, Iulian Vlad, Lowe, Ryan, Henderson, Peter, Charlin, Laurent, & Pineau, Joelle. 2015. A Survey of Available Corpora for Building Data-Driven Dialogue Systems. CoRR, abs/1512.05742.
  • Vinyals & Le (2015) Vinyals, Oriol, & Le, Quoc. 2015. A neural conversational model. arXiv preprint arXiv:1506.05869.
  • Xu & Sarikaya (2013) Xu, Puyang, & Sarikaya, Ruhi. 2013. Exploiting shared information for multi-intent natural language sentence classification. Pages 3785–3789 of: INTERSPEECH.
  • Zechner & Waibel (2000) Zechner, Klaus, & Waibel, Alex. 2000. Minimizing word error rate in textual summaries of spoken language. Pages 186–193 of: Proceedings of the 1st North American chapter of the Association for Computational Linguistics conference. Association for Computational Linguistics.