Different Types of Voice User Interface Failures May Cause Different Degrees of Frustration
We report on an investigation into how different types of failures in a voice user interface (VUI) affect user frustration. To this end, we conducted a pilot user study () and a main user study (), both with a simple voice-operated calendar application that we built using the Alexa Skills Kit. In our pilot study, we identified three major failure types as perceived by the users, namely, Reason Unknown, Speech Misrecognition, and Utterance Pattern Match Failure, along with more fine-grained failure types from the developer’s viewpoint such as Intent Pattern Match Failure and Intent Misclassification. Then, in our main study, we set up three user tasks that were designed to each induce a specific failure type, and collected user frustration ratings for each task. Our main findings are: (a) Users may be relatively tolerant of user-perceived Speech Misrecognition, but not of user-perceived Reason Unknown and Utterance Pattern Match Failures; (b) Regarding the relationship between developer-perceived and user-perceived failure types, 68.8% of developer-perceived Intent Misclassification instances caused user-perceived Reason Unknown failures. From (a) and (b), a practical design implication would be to try to prevent Intent Misclassification from happening by carefully crafting the utterance patterns for each intent.
Voice User Interfaces (VUIs) for dialogue systems have started to penetrate our daily lives, in the form of smart speakers such as Amazon Alexa and smartphone features such as Siri. While many consumers regard these services as Artificial Intelligence and may have various expectations of them, current VUI interactions often fail for trivial reasons. Moreover, VUIs are generally not good at communicating the reasons for failures to the user: when they fail to properly process the user’s utterance for whatever reason, they often just say “I’m sorry, I don’t understand.” or something similarly uninformative. Hence, how users perceive VUI failures may differ from the actual failures as categorised from the developer’s point of view. We argue that it is important to understand the different failure types from both developer and user perspectives, how the two perspectives align, and how each failure type affects user frustration, so that VUI and dialogue system developers can focus on improving the failure types that matter most. In the present study, we primarily focus on the users’ perception of VUI failures, and investigate how the different failure types affect user frustration.
The present study consists of a pilot user study () and a main study (); both leveraged a simple voice-operated calendar application that we built for the purpose of this study using the Alexa Skills Kit. Through the pilot study, we identified the following failure types from the developer’s perspective:
D1: Intent Pattern Match Failure;
D2: Slot Value Extraction Failure;
D3: System Not In Listen Mode;
D4: Intent Misclassification;
D5: Utterance Not Directed To System;
D6: Partial Speech Misrecognition;
D7: Complete Speech Misrecognition.
In contrast, the user-perceived failure types that we identified through interviews are naturally more coarse-grained:
U1: Reason Unknown;
U2: Speech Misrecognition;
U3: Utterance Pattern Match Failure.
More explanations of these failure types will be given in Section 4.
The objective of our main study is to investigate how the different failure types affect user frustration. While we cannot directly control how our participants will perceive failures, we gave them three simple calendar manipulation tasks (voice-operated in Japanese) that were designed to each induce a specific developer-perceived failure type:
Create a new event using a series of voice commands, by uttering the date and time (December 25, 18:30-20:00), name of the event (Drinking party with part-time workers), and the venue (Ginza Station) separately (not designed specifically to induce any failures);
Create a new event by specifying all required information in one utterance (date and time: December 31, 9:00-11:00, name: Study session, venue: Takadanobaba Station) (designed to induce D1: Intent Pattern Match Failure, since the utterance pattern for this intent requires the user to specify all of these slot values in one go in a specific syntax);
Modify the name of the event created in T1 from Drinking party to Christmas (designed to induce D4: Intent Misclassification: we discovered in our pilot study that Alexa automatically converts Christmas into December 25, which causes our VUI to map the utterance to an intent other than “Modify Event Name”).
As we shall explain later in Section 5, T1 induced many U2 instances, T2 induced many U3 instances, and T3 induced many U1 instances, and hence we managed to emphasise different user-perceived failures with the three different tasks. Our main findings are:
Users may be relatively tolerant of what they perceive as speech recognition errors (U2), but not when they do not understand the failures (U1), or when they feel that their wordings were not understood by the VUI (U3).
Regarding the relationship between developer-perceived and user-perceived failure types, 68.8% of D4 (Intent Misclassification) instances caused U1 (Reason Unknown).
This paper concludes with a design implication based on (a) and (b).
2 Related Work
There is a body of research that tries to gain insight into problems with VUIs and conversational agents (CAs) by observing and/or interviewing users who use such systems regularly. For example, Luger and Sellen  interviewed regular users of conversational agents such as Siri and concluded that users had poor mental models of how their CA worked and that these were reinforced through a lack of meaningful feedback regarding system capability and intelligence. Porcheron et al.  report on an analysis of audio data from month-long deployments of Amazon Echo and stress the importance of system response design as the design of interactional resources for users. Pyae and Joelsson  conducted a web-based survey to which Google Home users responded; the study lists problems that the users encountered (from the users’ viewpoints), such as “Non-English words are not correctly captured by the device,” “Commands have to be repeated to accomplish a task,” and “Multiple commands in a single transaction cannot be captured.” Sciuto et al.  reported on a study of Amazon Alexa users which involved both log analysis and in-home interviews.
In contrast to the above line of research that involves real users of commercial VUIs, our study involves user studies in a controlled environment with a very simple VUI application; the two approaches are clearly complementary. Below, we also discuss some existing studies based on controlled studies.
The present study offers VUI failure types from both the developer’s and the user’s perspectives, as well as an analysis of how the two are related. In previous work, there have also been a few studies that listed different failure types based on controlled studies. For example, in the context of conversational search, Jiang, Jeng, and He  identified the following voice input error types: Speech Recognition Error, System Interruption (the participant was interrupted before her voice command was complete), and Query Suggestion (Google voice search generated a query not uttered by the user). The work of Myers et al. , which was the direct inspiration of the present study, identified the following four obstacles in VUIs through an experiment () with their calendar application called DiscoverCal: Unfamiliar Intent (the VUI cannot parse the utterance for an existing intent, or the participant tries to execute an intent not supported by the VUI), NLP error (the VUI maps the user utterance to an incorrect intent), Failed Feedback (for example, the VUI did not make it clear to the user that the date and time must be uttered in one go to make an entry into a calendar), and System Error (e.g. bugs). We note that the above taxonomy is based on the developer’s point of view, i.e., what is really happening in the system internally.
In a follow-up study with participants, Myers et al.  investigated the impact of user characteristics on VUI task performance; they concluded that while programming experience did not have a widespread impact on their performance metrics (e.g., total time spent on the tasks, number of times the user had to repeat an intent, etc.), assimilation bias (i.e., prejudice due to prior VUI experience) did. They remark that, while Luger and Sellen  reported that participants with more technical knowledge self-reported being more patient and willing to utter more to accomplish a VUI task, their own results based on the total number of words uttered indicated otherwise. Furthermore, based on the above studies with DiscoverCal, Myers  proposes that the system’s guidance to users should differ according to the user’s proficiency.
While we argue that our analysis of VUI failures from both developer and user perspectives is a strength compared to prior art, we acknowledge that our participants are not representative of the general consumers: most of them are Computer Science (CS) students. This limitation is also discussed in Section 6. To the best of our knowledge, however, the present study is the first to show that different VUI failure types affect user frustration differently.
3 Calendar Application
Inspired by the work of Myers et al. , we built a voice-operated calendar application using Alexa Skills Kit in order to conduct our pilot and main user studies. Our application consists of a VUI and a calendar GUI for visual feedback, and can let the user create a new event on the calendar, delete it, or modify it. Internally, the system maps a user’s utterance to intents (e.g., create an event, delete an event) based on rule-based utterance patterns, and then fills in slots required in that intent wherever necessary (e.g., name of the event, date and time of the event). Whenever the system fails to process the user request, it returns the generic message: “Sorry, I cannot understand your request” (in Japanese).
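The rule-based mapping just described can be sketched as follows. Note that the intent names, utterance patterns, and slot names below are illustrative stand-ins, not our actual Alexa interaction model (which, moreover, operates in Japanese rather than English):

```python
import re

# Illustrative utterance patterns per intent (hypothetical, not the actual model).
INTENT_PATTERNS = {
    "CreateEvent": [r"create an event called (?P<name>.+) at (?P<venue>.+)"],
    "ModifyEventName": [r"rename (?P<old_name>.+) to (?P<new_name>.+)"],
    "DeleteEvent": [r"delete (?P<name>.+)"],
}

def handle_utterance(utterance):
    """Map an utterance to an intent and extract its slot values."""
    for intent, patterns in INTENT_PATTERNS.items():
        for pattern in patterns:
            match = re.fullmatch(pattern, utterance.strip())
            if match:
                return intent, match.groupdict()
    # No pattern matched: this is where the generic failure message is returned.
    return None, {}

print(handle_utterance("delete Study session"))
# -> ('DeleteEvent', {'name': 'Study session'})
```

Under this scheme, an utterance that matches no pattern of any intent corresponds to what we later call an Intent Pattern Match Failure, while a pattern whose captured text cannot be interpreted as a valid slot value corresponds to a Slot Value Extraction Failure.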
4 Pilot Study
This section briefly describes our pilot study with participants. Its main objective was to identify different failure types from both developer and user perspectives. The user study design is similar to that of our main study, so here we shall focus on the parts specific to the pilot study.
All of our pilot study participants were Japanese male students from the CS department of Waseda University; 7 of them owned a smart speaker. Each participant was instructed to conduct four tasks: the first three were similar to T1-T3 of the main study; the fourth task was to delete an existing event. (We obtained very few VUI failures from the delete task in the pilot study; hence it was dropped for the main study.) The instructions given to the participants were similar to those for the main study: see T1-T3 described earlier. All participants were asked to retry at least twice whenever they encountered a failure during each task.
After completing the tasks, the first author interviewed each participant in a face-to-face session; the interviews were recorded on a smartphone and later manually transcribed for analysis. On average, participants spent 11.0 minutes to complete the tasks, and 9.1 minutes for the interview. In the interviews, participants were asked how they felt and what they thought about the failures they encountered during each task. By manually analysing the actual dialogues and the interviews, we identified 7 failure types from the developer perspective, and 3 from the user perspective, as we have described in Section 1.
Here, we briefly explain each failure type that we have identified. As was mentioned in Section 3, our VUI application was developed by setting up several intents, where each intent is associated with several predefined utterance patterns. D1 means that the user’s utterance did not match any of the utterance patterns of any intent; D2 means that the mapping to an intent was successful, but that at least one required slot value could not be extracted from the user utterance; D3 means that the user uttered a command when the VUI was not listening; D4 means that the user’s utterance was mapped to an incorrect intent; D5 means that the user’s utterance was not meant for the VUI (i.e., the user was talking to herself); D6 means that the user’s utterance was only partially successfully recognised (e.g., when the user says “from 9 to 11 o’clock” and the system only recognises “11 o’clock”); D7 means complete speech recognition error. On the other hand, the user’s diagnosis of failures is naturally less specific. U1 means that the user has no idea why the VUI fails to respond properly to the user’s command; U2 means that the user suspects that speech recognition is the problem (and therefore a possible action she might take next would be to repeat her previous utterance more slowly and clearly). In contrast, when the user detects a U3, this means that she assumes that the VUI cannot accept the particular syntax of the uttered sentence (and therefore possibly try to rephrase the same request).
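For ease of reference, the two taxonomies can be written down as simple enumerations, a direct transcription of the labels above:

```python
from enum import Enum

class DevFailure(Enum):
    """Developer-perceived VUI failure types."""
    D1 = "Intent Pattern Match Failure"
    D2 = "Slot Value Extraction Failure"
    D3 = "System Not In Listen Mode"
    D4 = "Intent Misclassification"
    D5 = "Utterance Not Directed To System"
    D6 = "Partial Speech Misrecognition"
    D7 = "Complete Speech Misrecognition"

class UserFailure(Enum):
    """User-perceived VUI failure types (coarser-grained)."""
    U1 = "Reason Unknown"
    U2 = "Speech Misrecognition"
    U3 = "Utterance Pattern Match Failure"
```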
5 Main Study
Having identified the developer-perceived and user-perceived VUI failure types, we proceeded with our main study (), where the objective was to investigate how different failure types affect user frustration. All of our participants were Japanese in their 20s and had a science background, 23 of them with a CS background; 20 were male and 10 female; 27 were students at Waseda University and the other three were recent graduates of the same university. Regarding experience with smart speakers, 8 were regular users, 20 had used them a few times before, and 2 had no experience. Each participant was asked to conduct tasks T1-T3 discussed in Section 1, and was then interviewed by the first author after completing all three tasks to see what types of failures were perceived during each task session. On average, participants spent 9.1 minutes to complete the tasks, and 10.3 minutes for the interview. Thus, for each participant-task pair, we analysed the VUI dialogues and the post-hoc interviews to manually identify D1-D7 as well as U1-U3. Moreover, each participant was asked to rate her overall frustration for each of the three tasks on a Likert scale (1-5).
As was mentioned in Section 1, it turned out that T1 induced many U2 (Speech Misrecognition) instances, T2 induced many U3 (Utterance Pattern Match Failure) instances, and T3 induced many U1 (Reason Unknown) instances. Hence, even though we have the user frustration ratings at the task level and not for each failure within the task, the task-level ratings can serve as rough indicators of how each dominant user-perceived failure type affects frustration.
Table 3 shows the mean frustration (MF) scores over the participants for each task, along with the results of the paired Tukey HSD (Honestly Significant Difference) test : it can be observed that the MFs for T2 and T3 are statistically significantly higher than that for T1, while the difference between T3 and T2 is not statistically significant.
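This kind of pairwise comparison can be reproduced with SciPy’s implementation of the Tukey HSD test (`scipy.stats.tukey_hsd`, available in SciPy 1.8+). The Likert ratings below are fabricated for illustration only and are not our study data:

```python
from scipy.stats import tukey_hsd

# Fabricated 1-5 Likert frustration ratings for three tasks (illustration only).
t1 = [1, 2, 1, 2, 2, 1, 3, 2, 1, 2]  # low frustration
t2 = [4, 3, 4, 5, 3, 4, 4, 5, 3, 4]  # high frustration
t3 = [4, 4, 5, 3, 4, 5, 4, 3, 5, 4]  # high frustration

result = tukey_hsd(t1, t2, t3)
# result.pvalue[i][j] holds the p-value for the pairwise comparison
# of groups i and j (here 0 = T1, 1 = T2, 2 = T3).
print(result)
```

On data like the above, the T1-vs-T2 and T1-vs-T3 comparisons come out significant while T2-vs-T3 does not, mirroring the pattern we observed.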
Table 3 shows the distribution of user-perceived failures within each task of our main study. The dominant user-perceived failure type for each task is shown in bold. Since participants were more frustrated with T3 than with T1, and U1 constitutes 68.3% of the user-perceived failures in T3, the results suggest that U1 (Reason Unknown) may have a strong negative impact on user frustration. Similarly, since participants were more frustrated with T2 than with T1, and U3 constitutes 56.7% of the user-perceived failures in T2, the results suggest that U3 (Utterance Pattern Match Failure) may also have a negative impact on user frustration. In contrast, it appears that the participants were relatively tolerant of what they perceived as speech recognition errors (U2), which constitute 46.2% of the user-perceived failures in T1.
From the rightmost column of Table 3, it can be observed that, of the 481 user-perceived failures across all tasks, U1 constituted 42.0%, U3 constituted 31.0%, and U2 constituted 27.0%. In fact, from the developer’s viewpoint, there were more failures during the user experiments: from the dialogues, we identified 586 developer-perceived failures, which means that 17.9% of the actual failures were not detected by the participants (or at least, the participants did not mention them in their interviews). Among the 586 developer-perceived failures, the dominant types were D4 (39.9%), D1 (29.2%), D2 (13.1%), D3 (10.6%), and D7 (5.6%).
Figure 1 visualises how the developer-perceived failures were perceived by the participants; note that only the aforementioned 481 failure instances that were detected by the participants are shown here. The most remarkable feature of this figure is that D4 (Intent Misclassification) often translates to U1 (Reason Unknown); more specifically, of the 189 D4 instances observed across the tasks, as many as 130 (68.8%) were perceived by the participants as U1. Since we have observed that U1 may have a strong negative impact on user frustration, one practical approach to reducing user frustration in VUI applications would be to try to minimise D4 incidents. This can probably be achieved to some extent by carefully crafting the utterance patterns for each intent. However, note that the present study does not show that Intent Misclassification is the main cause of VUI failures: recall that we intentionally induced Intent Misclassifications just to verify that different failure types affect user frustration differently. All we claim is that Intent Misclassification is something that VUI application developers should try to avoid.
The second strongest signal from Figure 1 is that D1 (Intent Pattern Match Failure) is often perceived by the participant as U3 (Utterance Pattern Match Failure): of the 150 D1 instances observed, 72 (48.0%) were perceived by the participants as U3. That is, the diagnosis by the participants was correct in these cases, although they may not necessarily have been aware that our VUI application is composed of a set of intents, each associated with a set of utterance patterns.
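Cross-tabulations like the one behind Figure 1 boil down to counting (developer-perceived, user-perceived) label pairs over the annotated failure instances. The pairs below are toy placeholders, not our annotated data:

```python
from collections import Counter

# Toy (developer-perceived, user-perceived) annotations for observed failures.
pairs = [("D4", "U1"), ("D4", "U1"), ("D4", "U2"),
         ("D1", "U3"), ("D1", "U3"), ("D1", "U1")]

counts = Counter(pairs)
d4_total = sum(n for (dev, _), n in counts.items() if dev == "D4")
share = 100 * counts[("D4", "U1")] / d4_total
print(f"{share:.1f}% of D4 instances were perceived as U1")  # 66.7% on this toy data
```

Applying the same computation to our 481 annotated instances yields the 68.8% figure reported above for D4 being perceived as U1.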
6 Conclusions and Future Work
Through our pilot study with a simple voice-operated calendar application, we identified both developer-perceived and user-perceived failure types; then, in our main study, we collected participants’ frustration scores for each of our three tasks (T1-T3) that were designed to induce specific failure types. Our main findings are:
The mean frustration score of T1 (in which U2: Speech Misrecognition constituted 46.2%) was statistically significantly lower than that of T2 (in which U3: Utterance Pattern Match Failure constituted 56.7%) and that of T3 (in which U1: Reason Unknown constituted 68.3%). Hence, users may be relatively tolerant of what they perceive as speech recognition errors (U2), but not when they do not understand the failures (U1), or when they feel that their wordings were not understood by the VUI (U3).
Regarding the relationship between developer-perceived and user-perceived failure types, 68.8% of D4 (Intent Misclassification) instances caused U1 (Reason Unknown).
One design implication based on the above two findings would be: try to avoid Intent Misclassification, since this very likely leads to failures for reason unknown from the user’s point of view, which in turn are likely to cause user frustration. For current rule-based VUI applications, Intent Misclassifications can be suppressed to some extent by carefully crafting the utterance patterns for each intent. However, as we have pointed out earlier, our study does not show that Intent Misclassification is the main cause of user frustration.
The participants we recruited were mostly CS students in their 20s, and therefore arguably closer to developers than to general consumers who have no knowledge of how VUI applications are implemented. Hence our failure type taxonomies may not apply to such consumers: as an extreme case, for some of them, all failures may be of the Reason Unknown type. As future work, we would therefore like to conduct a follow-up experiment that covers a wider variety of user backgrounds. Moreover, while the present study collected user frustration scores at the task level, we would like to explore non-intrusive ways to collect frustration signals for each user-perceived failure that occurs during each task, so that we can directly discuss the relationship between user-perceived failure types and user frustration. Furthermore, we would like to establish a diagram similar to Figure 1 based on real VUI failure distributions as opposed to our induced failures: this should be useful for designing VUI applications that provide informative failure responses for avoiding or recovering from dialogue breakdowns.
- User-perceived failures were identified by a post-hoc analysis of the interviews; therefore, it was not possible for the participants to provide a frustration rating for each failure during the interviews.
- (2019) Voice input interface failures and frustration: developer and user perspectives. In Adjunct Publication of the 32nd Annual ACM Symposium on User Interface Software and Technology, pp. 24–26. Cited by: §1, §4.
- (2013) How do users respond to voice input errors?: lexical and phonetic query reformulation in voice search. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 143–152. Cited by: §2.
- (2016) “Like having a really bad PA”: the gulf between user expectation and experience of conversational agents. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pp. 5286–5297. Cited by: §2.
- (2019) The impact of user characteristics and preferences on performance with an unfamiliar voice user interface. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. Cited by: §2.
- (2018) Patterns for how users overcome obstacles in voice user interfaces. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pp. 1–7. Cited by: §2, §3.
- (2019) Adaptive suggestions to increase learnability for voice user interfaces. In Proceedings of the 24th International Conference on Intelligent User Interfaces: Companion, pp. 159–160. Cited by: §2.
- (2018) Voice interfaces in everyday life. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pp. 1–12. Cited by: §2.
- (2018) Investigating the usability and user experiences of voice user interface: a case of Google Home smart speaker. In Proceedings of the 20th International Conference on Human-Computer Interaction with Mobile Devices and Services Adjunct, pp. 127–131. Cited by: §2.
- (2018) Laboratory Experiments in Information Retrieval: Sample Sizes, Effect Sizes, and Statistical Power. Cited by: §5.
- (2018) “Hey Alexa, what’s up?”: a mixed-methods study of in-home conversational agent usage. In Proceedings of the 2018 Designing Interactive Systems Conference, pp. 857–868. Cited by: §2.