Towards Personalized Dialog Policies for Conversational Skill Discovery
Many businesses and consumers are extending the capabilities of voice-based services such as Amazon Alexa, Google Home, Microsoft Cortana, and Apple Siri to create custom voice experiences (also known as skills). As the number of these experiences increases, a key problem is the discovery of skills that can be used to address a user’s request. In this paper, we focus on conversational skill discovery and present a conversational agent which engages in a dialog with users to help them find the skills that fulfill their needs. To this end, we start with a rule-based agent and improve it by using reinforcement learning. In this way, we enable the agent to adapt to different user attributes and conversational styles as it interacts with users. We evaluate our approach in a real production setting by deploying the agent to interact with real users, and show the effectiveness of the conversational agent in helping users find the skills that serve their request.
1 Introduction
Modern speech-based assistants, such as Amazon Alexa, Google Home, Microsoft Cortana, and Apple Siri, enable users to complete daily tasks such as shopping, setting reminders, and playing games using voice commands. Such human-like interfaces create a rich experience for users by enabling them to complete many tasks hands- and eyes-free in a conversational manner. Furthermore, these services offer tools to enable developers and customers to create custom voice experiences (skills) and as a result extend the capabilities of the assistant. Amazon’s Alexa Skills Kit Kumar2017, Google’s Actions and Microsoft’s Cortana Skills Kit are examples of such tools. As the number of skills (with potentially overlapping functionality) increases, it becomes more difficult for end users to find the skills that can address their request.
To mitigate the skill discovery problem, researchers have recently proposed solutions for personalized domain selection and continuous domain adaptation in speech-based assistants Kim2018; Han2019. Although such solutions help users find skills, in scenarios such as searching for a game, where many different skills exist and users’ preferences change, routing the user to a particular experience would not be satisfactory. In such cases, the assistant should initiate a conversation with the user, making recommendations, asking for preferences, and allowing the user to browse through different options. Similar to other search problems, personalization is important for conversational skill discovery and can be achieved at two levels: 1) personalization of skill recommendations, and 2) personalization of the interaction. Users have evolving attributes (e.g., first-time vs. returning user) and different conversational styles and preferences (e.g., brief vs. verbose communication), which affect how they respond to the agent’s proposals and recommendations. By personalizing the interaction according to user attributes, conversational styles, and preferences, the speech-based assistant can help speed up the conversation process Mo2018 and increase user satisfaction. However, existing works are limited in how they account for users’ evolving attributes and diverse multi-aspect preferences Rafailidis2019, such as preferences regarding how the conversational agent interacts with them.
In this paper, we focus on conversational discovery of skills to guide customers from an intent to a specific skill or set of skills that can serve their request. To this end, we start with a rule-based agent and improve it by using reinforcement learning (RL), enabling the agent to adapt to different conversational styles as it interacts with users. In summary, the contributions of this paper are as follows: 1) We introduce the problem of conversational skill discovery for large-scale virtual assistants. 2) We describe a solution which enables the assistant to adapt to users’ attributes (e.g., first-time vs. returning user) and conversational styles (e.g., brief vs. verbose). 3) We conduct experiments in a real production setting by deploying the agent to interact with real users at large scale, showing that the personalized policy learned using RL significantly outperforms a one-size-fits-all rule-based agent in terms of success rate (measured as the number of dialogs which result in launching a skill) with significantly shorter dialogs.
2 Conversational Skill Discovery
Conversational skill discovery is the task of initiating a dialog with the user in order to help them find the skills that address their needs when interacting with a speech-based assistant. More specifically, a conversational skill discovery agent receives a natural language input from the user, understands it using its automatic speech recognition (ASR) and natural language understanding (NLU) components, and decides how to respond based on user-provided and contextual information in order to help the user find the needed skill. Skills can often be grouped into categories and subcategories based on functionality (e.g., ride-sharing skills or trivia games). These categories help customers explore with much more specificity and relevance. As such, a key functionality of a skill discovery system is to allow users to browse through existing categories. Additionally, it is important for the agent to be able to adapt to the user’s conversational style, shifting over time to more and more personalized conversations with the user.
Table 1 shows an example of a dialog between a user and an agent. Here, in each turn of the dialog, the user can either ask for a particular category or skill, select from the list of recommendations, accept or reject a recommendation, ask for other (sub)categories or skills, ask for details or rating of a skill, or perform some general action such as asking for help, asking the agent to repeat the previous prompt, going over a list of recommendations, going back in the conversation, or asking the agent to stop. The agent, on the other hand, can suggest a skill, provide information or help, offer a few different types of categories to choose from, stop the conversation if it is not going well, or launch a selected skill.
2.1 Problem Formulation
Conversational skill discovery, similar to other goal-oriented dialog systems, can be formalized as a Markov Decision Process (MDP) Levin2000. An MDP is a tuple $(S, A, P, R, \gamma)$, where $S$ is the state space, $A$ is the action space, $P$ is the transition probability function, $R$ is the reward function, and $\gamma$ is the discount factor. In this framework, at each time step $t$, the agent observes state $s_t$ and selects action $a_t$ according to its policy $\pi$. After performing the selected action, the agent receives the next state $s_{t+1}$ and a scalar reward $r_t$. The trajectory restarts after the agent reaches a terminal state. RL solvers have been used to find the optimal dialog policy (e.g., Singh2002; Williams2007; Georgila2011; Lee2012). In this context, at each turn the agent acts based on its understanding of what the user said, and the reward function is modeled in terms of various dimensions of the interaction such as per-interaction user satisfaction, accomplishment of the task, efficiency of interaction, and dialog duration. Recently, deep RL has also been applied to the problem of dialog management and has shown improvements over rule-based systems Cuayahuitl2016; Zhao2016; Fatemi2016; Fazel2017; Liu2018.
In this paper, we adopt the above formalism with the goal of training a dialog policy which allows the agent to take actions that maximize its success rate (measured as the number of dialogs which result in launching a skill) while providing a flexible and natural way for the user to navigate throughout various dialog states. In each turn of the dialog, the agent makes its decisions based on available information such as the user’s intent (e.g., asking for a particular skill), the category the user has selected, whether the user is a first-time user, etc. In order to make the agent adapt to different conversational styles, when making recommendations, we focus on 1) whether to recommend skills or categories, 2) how many skills or categories to recommend, and 3) what type of metadata to provide to the user. Examples of metadata include popularity, star rating, number of reviews, or a short description of the skill. The agent can proactively provide metadata to the user at certain points in the experience. Depending on their conversational style, users may prefer brief conversations with the agent (i.e., no metadata) or verbose conversations with different types of metadata.
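As a small illustration of the role the discount factor plays in this formulation, the sketch below computes the discounted return of a single dialog trajectory. The function name, the reward values, and the discount value are assumptions chosen for illustration; they are not taken from the paper.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for one dialog trajectory.

    `rewards` holds the per-turn scalar rewards; `gamma` is the discount
    factor. Both are illustrative inputs, not values from the paper.
    """
    total = 0.0
    # Accumulate from the last turn backwards: G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical three-turn dialog that is rewarded only at the final turn.
example_return = discounted_return([0.0, 0.0, 1.0], gamma=0.5)
```

With a terminal-only reward like this, shorter successful dialogs yield a larger discounted return whenever the discount factor is below one.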
An important challenge in using RL for learning dialog policies is creating realistic user simulators that can generate natural conversations similar to those of human users Schatzmann2006. As such, previous work has focused on the development of different types of user simulators (e.g., Eckert1997; Scheffler2002; Cuayahuitl2005; Georgila2006; Schatzmann2006; ElAsri2016; XiujunLi2016). We take a data-driven approach to user simulation: we start with a rule-based policy to gather data and then improve the agent by using RL.
2.2 Rule-based Agent
The rule-based agent selects from the following actions depending on the user’s intent in each turn of the dialog: 1) offering k categories, 2) offering n skills, 3) offering a skill or asking for category, 4) providing information about skill rating, 5) providing details about a skill, 6) ending the conversation, and 7) launching a skill. When multiple actions are possible, the rule-based agent randomly selects among them. For example, at the beginning of the dialog, the agent randomly selects among different offer actions. If all skills in a category have been exhausted, the agent will inform the user that no additional skills are available for the selected category. Furthermore, each action is mapped to a specific prompt template. For example, offering a skill or asking for category can be mapped to "Would you like to launch skill or try a different type of skill?", where the specific skill is provided by a skill recommendation system. Additionally, in cases where the agent does not understand what the user has said (e.g., out-of-domain requests), it will first repeat the previous prompt. If the user’s utterance is again misunderstood, it will give a new prompt, and if that also fails, it will stop the conversation.
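The random choice among candidate actions and the escalating handling of misunderstood utterances can be sketched as follows. This is a minimal sketch under stated assumptions: the function names and action labels are hypothetical, and the production rules are not given in the text.

```python
import random


def choose_rule_based_action(candidate_actions, rng=random):
    """Pick uniformly at random among the actions the rules allow."""
    if not candidate_actions:
        raise ValueError("the rules must allow at least one action")
    return rng.choice(list(candidate_actions))


def fallback_action(consecutive_misunderstandings):
    """Escalate as the agent keeps failing to understand the user.

    First failure: repeat the previous prompt. Second failure: give a new
    prompt. Any further failure: stop the conversation.
    The action labels are hypothetical names used only in this sketch.
    """
    if consecutive_misunderstandings < 1:
        raise ValueError("called without a misunderstanding")
    if consecutive_misunderstandings == 1:
        return "repeat-previous-prompt"
    if consecutive_misunderstandings == 2:
        return "give-new-prompt"
    return "end-conversation"
```

Passing a seeded `random.Random` instance as `rng` makes the random choice reproducible when testing.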
2.3 User Simulation
We deployed the rule-based system to gather dialogs with real users and used them to train a user simulator similar to that of Fazel2017. Note that the collected dialogs are not annotated and may include understanding errors. Figure 1 illustrates the interaction between the user simulator (left) and the conversational agent (right). More specifically, the user simulator first generates the next user intent based on dialog context. Intent generation is modeled as a language modeling problem. In this formulation, each possible intent forms a token in the vocabulary, and every training dialog becomes a training intent sequence. For example, the sequence for the conversation in Table 1 is [Start, CategoryName, CategoryName, GetRating, Yes, End].
We used a recurrent neural network with gated recurrent units (GRU) Cho2014 to predict the next user intent, and used the following for dialog context: 1) previous user intent, 2) previous agent action, 3) previous agent prompt, 4) whether the user is a first-time user, 5) whether the user has already selected an item (skill or category) from a list, and 6) number of user turns so far in the conversation. The optimal parameters were found using Hyperopt hyperopt, and the model with the lowest perplexity Serban2016 was chosen. Given the predicted intent, the user simulator uniformly samples one utterance from the combination of available templates and user turns in the collected dialogs.
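To make the language-modeling view of intent generation concrete, the sketch below turns an intent sequence into next-intent training pairs and samples an utterance uniformly for a predicted intent. It covers only the intent history; the other context features listed above and the GRU itself are omitted. All function names and the example utterances are assumptions made for illustration.

```python
import random


def build_intent_vocab(dialogs):
    """Assign an integer id to each distinct intent, in order of first use."""
    vocab = {}
    for dialog in dialogs:
        for intent in dialog:
            vocab.setdefault(intent, len(vocab))
    return vocab


def next_intent_pairs(dialog, vocab):
    """Return (history ids, next intent id) pairs for one dialog."""
    ids = [vocab[intent] for intent in dialog]
    return [(ids[:t], ids[t]) for t in range(1, len(ids))]


def sample_utterance(intent, utterances_by_intent, rng=random):
    """Uniformly sample one utterance among those known for `intent`."""
    return rng.choice(utterances_by_intent[intent])


# Intent sequence quoted in the text for the dialog in Table 1.
table1_dialog = ["Start", "CategoryName", "CategoryName",
                 "GetRating", "Yes", "End"]
vocab = build_intent_vocab([table1_dialog])
pairs = next_intent_pairs(table1_dialog, vocab)

# Hypothetical utterance pool; real pools would come from templates and logs.
pool = {"GetRating": ["what is its rating", "how many stars does it have"]}
utterance = sample_utterance("GetRating", pool, random.Random(0))
```

Each pair gives a model the intents seen so far as input and the intent that followed as the prediction target.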
2.4 RL-based Agent
The components used to learn dialog policies using RL are as follows.
State Space S: The input state is composed of 1) the user’s intent, 2) the previous action the agent took, 3) the previous prompt and metadata it gave the user, 4) the category the user has selected, if any, 5) whether the agent has proposed a skill, 6) whether the user is a first-time user, and 7) the number of user turns so far in the dialog. This set of features was selected using a forward feature selection approach based on the correlation between the new feature and the feature set, with the goal of achieving a higher Expected Cumulative Reward (ECR) Shen2016. This set can be augmented with user preferences regarding skills, the last skill launched by the user, or the frequency of skill launches.
Action Space A: We constrain the action space of the agent to a set of composite actions: 1) offering k categories (e.g., offer-one-category, offer-two-category), 2) offering n skills (e.g., offer-one-skill, offer-two-skill), 3) offering a skill or asking for category (e.g., offer-one-skill-or-category), 4) executing a user request, 5) ending the conversation, and 6) launching a skill. The execute action refers to delivering information such as providing skill ratings or more details about a skill, repeating the previous prompt, or handling out-of-domain requests. At run-time, the RL policy falls back on the rule-based policy for the execute action.
Reward R: We use a simple reward function based on goal completion, where the environment gives a reward of at the end of the dialog if the user launches a skill, and gives a reward of if the user or agent end the dialog.
| Feature | Example values |
|---|---|
| State - User Intent | start, category-name, skill-name, stop, etc. |
| State - Previous Agent Action | offer-one-skill, offer-one-skill-or-category, |
| State - Previous Prompt | first-time-user-offer-three-categories, etc. |
| State - Previous Metadata | no-metadata, short-description, trending, |
| State - Target Category | adventure, adventure-kids, family, etc. |
| State - First Time User | true/false |
| State - Turn Depth | |
Policy: We use DQN Mnih2013; Mnih2015 with action masking for the RL agent, with a fully-connected MLP to represent the deep Q-network. The hidden layers use a rectifier nonlinearity, and the output layer is a fully connected layer with linear activation function and a single output for each valid action. The action mask suppresses impossible actions in any particular dialog state, such as launching a skill before the user has selected one.
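Action masking at decision time can be sketched as a greedy choice restricted to the valid actions. This is a minimal sketch, not the authors' implementation: the function name, the action ordering, and the Q-values are assumptions chosen for illustration.

```python
def masked_greedy_action(q_values, valid_mask):
    """Return the index of the highest-valued action among valid ones.

    `q_values[i]` is the network's estimate for action i and
    `valid_mask[i]` is True when action i is possible in the current
    dialog state. Invalid actions are never selected, whatever their value.
    """
    if len(q_values) != len(valid_mask):
        raise ValueError("q_values and valid_mask must have the same length")
    valid = [i for i, is_valid in enumerate(valid_mask) if is_valid]
    if not valid:
        raise ValueError("at least one action must be valid")
    return max(valid, key=lambda i: q_values[i])


# Hypothetical state: action 1 (say, launching a skill) is not yet possible
# because the user has not selected a skill, so it is masked out.
chosen = masked_greedy_action([1.0, 5.0, 2.0], [True, False, True])
```

Here action 1 has the highest estimated value but is masked, so the choice falls to action 2.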
3 Experimental Results
We focused on the use case of a user searching for a game to play among skills belonging to game categories. Each category may also have subcategories, resulting in total categories. Examples of categories are adventure, trivia, choose your own story, family, and kids. Based on the results of internal user studies, the number of categories to offer, k, is set to one, three, or five, and the number of skills to offer, n, is set to one. Table 2 summarizes the state and action spaces. For all agents, we randomly sample from the set of possible prompts and metadata for the selected action. Furthermore, we used the Alexa Skill portal to train the NLU model from a set of sample utterances.
3.1 Simulation Results
We trained the DQN agent using an $\epsilon$-greedy policy with $\epsilon$ decreasing linearly from to over steps. Additionally, we tuned a window size to include previous dialog turns as input and set to . We ran the method times for steps, and in each run, after every 10,000 steps, we sampled dialog episodes with no exploration to evaluate the performance. The optimal parameters were found using Hyperopt hyperopt (see Appendix B). Figure 2 shows the simulation results during training. The Y-axis in the figure is the success rate of the agent (measured as the number of dialogs that resulted in launching a skill divided by the total number of dialogs), and the X-axis is the number of learning steps. Given our choice of reward function, the increase in success rate is indicative of the agent learning to improve its policy over time. Furthermore, the RL agent outperformed the rule-based agent with average success rate of ) in simulation.
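The linearly decaying exploration rate and the success-rate metric described above can be sketched as below. The start value, end value, and number of decay steps are placeholders chosen for illustration; they are not the settings used in the paper, and the function names are hypothetical.

```python
def linear_epsilon(step, eps_start, eps_end, decay_steps):
    """Exploration rate after `step` training steps.

    Moves linearly from `eps_start` to `eps_end` over `decay_steps`
    steps and stays at `eps_end` afterwards.
    """
    if decay_steps <= 0:
        raise ValueError("decay_steps must be positive")
    fraction = min(max(step, 0) / decay_steps, 1.0)
    return eps_start + fraction * (eps_end - eps_start)


def success_rate(skill_launched):
    """Fraction of dialogs that ended with a skill launch."""
    if not skill_launched:
        raise ValueError("need at least one dialog")
    return sum(1 for launched in skill_launched if launched) / len(skill_launched)


# Placeholder schedule: 1.0 down to 0.0 over 1,000 steps.
halfway = linear_epsilon(500, eps_start=1.0, eps_end=0.0, decay_steps=1000)
# Four hypothetical evaluation dialogs, three of which launched a skill.
rate = success_rate([True, False, True, True])
```

Evaluating with no exploration, as in the text, corresponds to always taking the greedy action when sampling the evaluation dialogs.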
3.2 Human Evaluation
To evaluate the performance of the skill discovery agent, we deployed the dialog policies and evaluated them with real users (see Appendix A for examples of dialogs). We first conducted a test with a baseline policy of recommending up to five skills based on popularity and allowing the user to either accept or reject the recommendation. The success rate of this simple policy was %, illustrating the importance of providing flexible search to the user. We then conducted an A/B test on the rule-based and RL policies to compare their effects on skill launches in a production environment. Both policies were tested on randomly sampled users, with the additional constraint of using the same policy for returning users. The results are reported in Table 3. Both policies significantly outperform the baseline policy, indicating the importance of providing flexible search and navigation to users. Additionally, the difference between the success rate of the rule-based (%) and RL (%) policies is statistically significant ($p$-value ) and the RL policy has significantly shorter dialogs ($p$-value ), showing the importance of optimizing for the entire interaction with the user.
In order to understand the effect of adapting to user attributes, we investigated the difference in success rate between first-time and returning users for the two policies. First-time users make up and of the population for the rule-based and RL policies, respectively. Table 4 shows the results. The RL policy significantly outperforms the rule-based policy for both first-time ($p$-value ) and returning users ($p$-value ), indicating that the RL model has learned and adapted to user attributes. Additionally, the RL policy has a similar performance for both groups of users. The difference for the rule-based policy between the two groups, on the other hand, is significant ($p$-value ), indicating that this policy is more tuned to returning users. This highlights the difficulty of authoring personalized dialog policies with rules, and shows the advantage of using RL for this problem.
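The text does not state which statistical test was used to compare success rates. As one common option for comparing two proportions in an A/B test, the sketch below implements a pooled two-proportion z-test with a two-sided p-value. The function name and all counts are hypothetical and are not the paper's data.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value).

    `success_*` are counts of dialogs that launched a skill and `n_*`
    are total dialog counts for policies A and B.
    """
    rate_a = success_a / n_a
    rate_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if std_err == 0.0:
        raise ValueError("pooled rate of 0 or 1 gives an undefined statistic")
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of a standard normal: erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical counts: policy A launches in 300 of 1000 dialogs, B in 400.
z_stat, p_val = two_proportion_z_test(300, 1000, 400, 1000)
```

A positive z here means policy B has the higher observed success rate; the p-value says how surprising a gap of that size would be if the two policies had the same underlying rate.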
4 Related Work
Conversational search and recommendation, especially in the context of e-commerce, have been explored by researchers Christakopoulou2016; Radlinski2017; Aggarwal2018; Zhang2018; Sun2018. Christakopoulou2016 introduced an interactive recommendation protocol and studied whether to ask absolute or relative questions when gathering user preferences. Their dialog system collects like/dislike and pairwise comparison feedback from users, and does not include actions typically present in a dialog system Sun2018. Radlinski2017 proposed a theoretical framework for conversational search. Kenter2017 framed the problem as a machine reading task and applied it to question answering. Aggarwal2018 developed an RL-based conversational search assistant, in which state and action spaces are domain specific and may require a significant amount of time to develop. Sun2018 proposed a unified framework to integrate recommender and dialog systems, in which, instead of just returning the top-ranking results for a given user query, the agent attempts to optimize for long-term reward by asking the user for the value of an attribute. In their work, the action space is limited to two types of actions, namely, requesting the value of an attribute or making a recommendation. Zhang2018 proposed a multi-memory network architecture and applied it to search and recommendation in e-commerce. Compared to previous works, our formulation of the search problem is domain independent, accounts for user attributes and conversational preferences, and includes actions typically present in a dialog system. Additionally, whereas existing works have not been evaluated in a real production setting, we conduct experiments with real users at large scale.
5 Conclusion
In this paper, we introduced the problem of conversational skill discovery in speech-based assistants and presented an approach to enable users to find skills. To this end, we started with a rule-based agent and improved it by using RL, enabling the agent to adapt to different user attributes and conversational styles. We compared popularity-based, rule-based, and RL-based conversational agents by deploying them in a real production setting and showed that the RL agent learns to adapt its policy to achieve a higher success rate with shorter dialogs. For future work, we plan to further personalize the dialog policy based on user attributes and conversational preferences, and investigate richer state representations. Furthermore, we plan to explore the impact of evolving attributes and preferences on the learned policies.
We would like to thank the Alexa Machine Learning Platform team for making the customer experiments possible. We would also like to thank Jared Casale, Jason Pazis, Longshaokan Wang, and Spyros Matsoukas for their feedback and support.
A Examples of Dialogs
Dialog with the Rule-based Agent
Dialog with RL-based Agent
B Learning Parameters for the RL Agent
|Number of Hidden Layers|
|Number of Hidden Nodes per Layer|
|Experience Replay Size|
|Target Model Update Interval|