Subgoal Discovery for Hierarchical Dialogue Policy Learning

Da Tang Xiujun Li Jianfeng Gao Chong Wang Lihong Li Tony Jebara
Microsoft Research, Redmond, WA, USA
Columbia University, NY, USA
Google Inc., Kirkland, WA, USA

Developing conversational agents to engage in complex dialogues is challenging partly because the dialogue policy needs to explore a large state-action space. In this paper, we propose a divide-and-conquer approach that discovers and exploits the hidden structure of the task to enable efficient policy learning. First, given a set of successful dialogue sessions, we present a Subgoal Discovery Network (SDN) to divide a complex goal-oriented task into a set of simpler subgoals in an unsupervised fashion. We then use these subgoals to learn a hierarchical policy which consists of 1) a top-level policy that selects among subgoals, and 2) a low-level policy that selects primitive actions to accomplish the subgoal. We exemplify our method by building a dialogue agent for the composite task of travel planning. Experiments with simulated and real users show that an agent trained with automatically discovered subgoals performs competitively against an agent with human-defined subgoals, and significantly outperforms an agent without subgoals. Moreover, we show that learned subgoals are human comprehensible.

1 Introduction

Personal assistants, often in the form of dialogue agents, are emerging in today’s market including Amazon’s Echo, Apple’s Siri, Google’s Home, and Microsoft’s Cortana. Most of these dialogue agents can only accomplish simple tasks, such as scheduling meetings and making calls, but not complex tasks that contain multiple subtasks, such as travel planning (Peng et al., 2017b) and restaurant-hotel booking (Budzianowski et al., 2017). Building agents that can tackle complex tasks remains a challenging problem.

Recently, there has been growing interest in using reinforcement learning (RL) to optimize dialogue policies for task-completion dialogues (Su et al., 2016; Lipton et al., 2016; Zhao and Eskenazi, 2016; Williams et al., 2017; Dhingra et al., 2017; Li et al., 2017; Liu and Lane, 2017). Compared to these dialogue agents developed for simple tasks, building an agent for complex tasks is more difficult, largely due to reward sparsity. Dialogue policy learning for complex tasks requires exploration in a much larger state-action space, and it often takes many more conversation turns between user and agent to fulfill a task, leading to a much longer trajectory. Thus, the reward signal (usually provided by users at the end of a conversation) is substantially delayed and sparse.

A general strategy for solving a complex task is divide and conquer. For example, one can decompose a complex task into several subtasks and solve them separately. These subtasks are often simpler with smaller state-action spaces, and are therefore easier to solve. Peng et al. (2017b) used this strategy to learn a hierarchical dialogue policy based on human-defined subgoals for the composite task of travel planning, which contains subtasks like flight booking and hotel reservation. In many cases, however, the structure of a task is hidden and the subgoals are unknown. Manually defining the task structure (in the form of subgoals) requires extensive domain knowledge. This limits the applicability of hierarchical approaches in practice.

In this paper, we propose a Subgoal Discovery Network (SDN) to automatically learn the task structure. Given a set of successful dialogue sessions, SDN learns to identify “hub” states (i.e., the intersecting states of many state trajectories) as subgoals. Then, reinforcement learning is applied to learn a hierarchical dialogue policy which consists of 1) a top-level policy that selects among subgoals, and 2) a low-level policy that selects primitive actions to accomplish the subgoal selected by the top-level policy. To the best of our knowledge, this is the first approach that strives to automatically learn subgoals in the dialogue setting. We demonstrate the effectiveness of our approach by building a composite task-completion dialogue agent for travel planning. Experiments with both simulated and real users show that an agent learned with automatically discovered subgoals performs competitively against an agent learned using expert-defined subgoals, and significantly outperforms an agent learned without subgoals. We also find that the subgoals discovered by the SDN are often human comprehensible.

2 Composite Task-Completion Dialogue

The task-completion dialogue can be formulated as a Markov decision process (MDP) (Sutton and Barto, 1998), in which the agent interacts with its environment over a sequence of discrete steps. At each step $t$, the agent observes state $s_t$ and chooses action $a_t$ according to a policy $\pi$. Then, the agent receives reward $r_t$ and observes new state $s_{t+1}$. The cycle continues until the episode terminates.

A composite task consists of a set of subgoals. For example, to make a travel plan, an agent has to accomplish a set of subtasks such as booking flight tickets and reserving a hotel. This process has a natural hierarchy: a top-level process selects which subgoal to complete next, and a low-level process chooses primitive actions to complete the selected subgoal. This hierarchical decision-making process can be formulated in the options framework (Sutton et al., 1999), where options generalize primitive actions to higher-level actions. Unlike the traditional MDP setting, where an agent can only choose a primitive action at each time step, with options the agent can choose a “multi-step” action, which is a sequence of primitive actions for completing a subtask.

2.1 Hierarchical Policy Learning

Based on the options framework, Peng et al. (2017b) developed a composite task-completion dialogue agent via hierarchical reinforcement learning (HRL) using human-defined subgoals. The agent uses a hierarchical dialogue policy that consists of (1) a top-level dialogue policy that selects among subgoals, and (2) a low-level dialogue policy that selects primitive actions to accomplish the subgoal given by the top-level policy.

The top-level policy $\pi_g$ perceives state $s$ from the environment and selects a subgoal $g$ for the low-level policy $\pi_{a,g}$ to execute. Then the agent receives an extrinsic reward $r^e$ and reaches the state $s'$. The low-level policy $\pi_{a,g}$ is shared by all options. It takes as input a state $s$ and a subgoal $g$, and selects a primitive action $a$ to execute. Then the agent receives an intrinsic reward $r^i$ provided by the internal critic of the dialogue agent and updates the state. The subgoal $g$ remains a constant input to $\pi_{a,g}$ until a terminal state is reached to terminate $g$.

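The goal of HRL is to find optimal policies, $\pi_g^*$ and $\pi_{a,g}^*$, that maximize the expected cumulative discounted extrinsic and intrinsic rewards, respectively. This can be achieved by approximating their corresponding optimal Q-value functions using DQN (Mnih et al., 2015). Specifically, we use deep neural networks to approximate the two optimal Q-value functions (Kulkarni et al., 2016): $Q_g^*(s, g; \theta_e)$ for the top-level policy and $Q_{a,g}^*(s, g, a; \theta_i)$ for the low-level policy. The parameters $\theta_e$ and $\theta_i$ are optimized to minimize the following quadratic loss functions:

$$\mathcal{L}_e(\theta_e) = \frac{1}{2|\mathcal{D}^e|} \sum_{(s, g, r^e, s') \in \mathcal{D}^e} \left(y^e - Q_g^*(s, g; \theta_e)\right)^2, \qquad y^e = r^e + \gamma \max_{g'} Q_g^*(s', g'; \theta_e) \tag{1}$$

$$\mathcal{L}_i(\theta_i) = \frac{1}{2|\mathcal{D}^i|} \sum_{(s, g, a, r^i, s') \in \mathcal{D}^i} \left(y^i - Q_{a,g}^*(s, g, a; \theta_i)\right)^2, \qquad y^i = r^i + \gamma \max_{a'} Q_{a,g}^*(s', g, a'; \theta_i) \tag{2}$$

Here, $\gamma \in [0, 1]$ is a discount factor, and $\mathcal{D}^e$, $\mathcal{D}^i$ are the replay buffers storing dialogue experience for training the top-level and low-level policies, respectively. The gradients of the two loss functions with respect to their parameters, treating the targets $y^e$ and $y^i$ as constants, are

$$\nabla_{\theta_e} \mathcal{L}_e = -\frac{1}{|\mathcal{D}^e|} \sum_{(s, g, r^e, s') \in \mathcal{D}^e} \left(y^e - Q_g^*(s, g; \theta_e)\right) \nabla_{\theta_e} Q_g^*(s, g; \theta_e)$$

$$\nabla_{\theta_i} \mathcal{L}_i = -\frac{1}{|\mathcal{D}^i|} \sum_{(s, g, a, r^i, s') \in \mathcal{D}^i} \left(y^i - Q_{a,g}^*(s, g, a; \theta_i)\right) \nabla_{\theta_i} Q_{a,g}^*(s, g, a; \theta_i)$$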
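As a minimal sketch of the quadratic loss used for both policies, the snippet below computes bootstrapped targets (reward plus discounted maximum next-step Q-value) and the resulting mean half-squared error for a toy batch of top-level transitions. The function names, the toy numbers, and the `done` mask that disables bootstrapping on terminal transitions are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np


def td_targets(rewards, next_q_values, gamma, done):
    """y = r + gamma * max_x' Q(s', x'); no bootstrap when the transition is terminal."""
    bootstrap = next_q_values.max(axis=1)
    return rewards + gamma * bootstrap * (1.0 - done)


def quadratic_loss(q_taken, targets):
    """Mean over the batch of 0.5 * (y - Q)^2."""
    return 0.5 * np.mean((targets - q_taken) ** 2)


# Toy batch of two top-level transitions (all values are made up).
gamma = 0.95
rewards = np.array([-1.0, 10.0])              # extrinsic rewards
done = np.array([0.0, 1.0])                   # the second transition ends the dialogue
next_q = np.array([[1.0, 3.0], [5.0, 2.0]])   # Q(s', g') for two candidate subgoals
q_taken = np.array([2.0, 8.0])                # Q(s, g) for the subgoal actually chosen

y = td_targets(rewards, next_q, gamma, done)
loss = quadratic_loss(q_taken, y)
```

The same two functions would apply to the low-level policy by replacing the maximum over subgoals with a maximum over primitive actions and using intrinsic rewards.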

The only remaining problem is how to define the extrinsic and intrinsic rewards.

2.2 Extrinsic and Intrinsic Rewards

To train the hierarchical policy of a composite task-completion dialogue agent, the extrinsic and intrinsic rewards are defined as follows.

Let $L$ be the maximum number of turns of a dialogue ($L = 60$ in our experiments), and $K$ the number of subgoals. At the end of a dialogue, the agent receives a positive extrinsic reward of $2L$ for a successful dialogue, or $-L$ for a failed dialogue; for each turn, the agent receives an extrinsic reward of $-1$ as a penalty.

When the end of an option is reached, the agent receives a positive intrinsic reward of $2L/K$ if a subgoal is completed successfully, or a negative intrinsic reward of $-1$ otherwise; for each turn, the agent receives an intrinsic reward of $-1$ to discourage longer dialogues.

The combination of the extrinsic and intrinsic rewards defined above encourages the agent to accomplish a composite task as fast as possible while minimizing the number of switches between subgoals. In the cases where the subgoals of a composite task are manually defined, as in Peng et al. (2017b), it is straightforward to detect whether an option is about to terminate, and thus to generate the intrinsic reward. (Assuming that a subgoal is defined by a set of slots, a simple heuristic is to check whether all the slots of the subgoal are captured in the dialogue state.) In this paper, we generalize this to the cases where either the subgoals are unknown or the human-defined subgoals are sub-optimal, and thus the subgoals have to be discovered or refined automatically.
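The slot-based termination heuristic and the way it feeds the intrinsic reward can be sketched as follows. This is only an illustration: the slot names, the dictionary representation of the dialogue state, and the reward magnitudes passed in the usage lines are hypothetical, and adding the per-turn reward on top of the end-of-option reward is an assumption about how the two combine.

```python
def option_terminated(subgoal_slots, dialogue_state):
    """Heuristic: an option is complete once every slot of its subgoal is filled."""
    return all(dialogue_state.get(slot) is not None for slot in subgoal_slots)


def intrinsic_reward(subgoal_slots, dialogue_state, option_ended,
                     success_reward, failure_reward, turn_reward):
    """Per-turn intrinsic reward, plus a bonus or penalty when the option ends."""
    reward = turn_reward
    if option_ended:
        completed = option_terminated(subgoal_slots, dialogue_state)
        reward += success_reward if completed else failure_reward
    return reward


# Hypothetical slot names and reward magnitudes, chosen only for illustration.
flight_slots = ["depart_city", "dest_city", "depart_date"]
state = {"depart_city": "Fort Lauderdale", "dest_city": "Pittsburgh", "depart_date": None}

mid_option = intrinsic_reward(flight_slots, state, False, 10.0, -1.0, -1.0)
state["depart_date"] = "2016-09-12"
end_option = intrinsic_reward(flight_slots, state, True, 10.0, -1.0, -1.0)
```

When subgoals are not defined by known slot sets, this check is unavailable, which is the situation the rest of the paper addresses.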

Figure 1: Illustration of “subgoals”. Suppose there are three state trajectories. Then the red states, where these trajectories intersect, could be good candidates for “subgoals”.

3 HRL with Discovered Subgoals

Assume that we have collected a set of successful state trajectories of a task, as shown in Figure 1. We want to find subgoal states, such as the three red states in the figure, which form the “hubs” of these trajectories. These hub states indicate the ends of subgoals, and thus divide a state trajectory into several segments, one for each subgoal. We formulate subgoal discovery as a state trajectory segmentation problem, and propose to address it using the Subgoal Discovery Network (SDN), which is an extension of the sequence segmentation model originally proposed by Wang et al. (2017).
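To make the notion of a “hub” concrete, the sketch below counts, for a handful of toy trajectories over discrete state labels, which states are shared by all of them. This is a naive frequency count, not the SDN itself; the state labels and the `min_fraction` parameter are hypothetical. Exact matches of this kind may be rare in real dialogue state spaces, which is one reason to prefer a learned segmentation model.

```python
from collections import Counter


def hub_candidates(trajectories, min_fraction=1.0):
    """Return states (other than the start) visited by at least min_fraction of trajectories."""
    counts = Counter()
    for trajectory in trajectories:
        # Count each state once per trajectory and skip the shared start state.
        counts.update(set(trajectory[1:]))
    needed = min_fraction * len(trajectories)
    return sorted(state for state, count in counts.items() if count >= needed)


# Three toy trajectories that all pass through h1 and end in h2.
toy_trajectories = [
    ["s0", "a1", "h1", "a2", "h2"],
    ["s0", "b1", "h1", "b2", "h2"],
    ["s0", "c1", "h1", "c2", "h2"],
]
hubs = hub_candidates(toy_trajectories)
```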

3.1 Subgoal Discovery Network

Suppose we have a state trajectory $(s_0, \ldots, s_5)$ that represents a successful dialogue, as shown in Figure 2. The (candidate) subgoal states $s_2$, $s_4$ and $s_5$ divide the trajectory into 3 segments $(s_0, s_1, s_2)$, $(s_2, s_3, s_4)$ and $(s_4, s_5)$. We can view each segment as being generated by a multi-step action, known as an option (Sutton et al., 1999).

Figure 2: An illustration of an SDN for trajectory $(s_0, \ldots, s_5)$ with $s_2$, $s_4$ and $s_5$ as subgoals. Symbol # is the termination symbol. The top-level RNN (RNN1) models single segments and the low-level RNN (RNN2) provides information about previous states to RNN1. The embedding matrix $M$ maps the outputs of RNN2 to low-dimensional representations so as to be consistent with the input dimensionality of RNN1. Note that state $s_5$ is associated with two termination symbols #; one is for the termination of the last segment and the other is for the termination of the entire trajectory. These are needed to make a fully generative model.

As illustrated in Figure 2, we model the likelihood of each segment using an RNN, denoted by RNN1. At each time step, RNN1 outputs the next state given the current state until it reaches the option termination symbol #. Since different options are reasonable under different conditions, it is not plausible to apply a fixed initial input to different segments. Therefore, we use another RNN (RNN2) to encode all previous states, and we transform this information into low-dimensional representations that serve as the initial inputs for the RNN1 instances. This is based on the causality assumption of the options framework (Sutton et al., 1999): the agent should be able to determine the next option given all previous information, and this should not depend on information related to any later state. The low-dimensional representations are obtained via a global subgoal embedding matrix $M \in \mathbb{R}^{d \times D}$, where $d$ and $D$ are the dimensionality of RNN1’s input layer and RNN2’s output layer, respectively. Mathematically, if the output of RNN2 at time step $t$ is $o_t$, then the RNN1 instance starting from time $t$ has $M \cdot \mathrm{softmax}(o_t)$ as its initial input, where $\mathrm{softmax}(o_t)_i = \exp(o_{t,i}) / \sum_{i'=1}^{D} \exp(o_{t,i'})$ for $o_t = (o_{t,1}, \ldots, o_{t,D})$. $D$ is the number of subgoals we aim to learn. Ideally, we expect the vector $\mathrm{softmax}(o_t)$ in a well-trained SDN to be close to some one-hot vector. Therefore, $M \cdot \mathrm{softmax}(o_t)$ should be close to a column of $M$, and we can view $M$ as providing at most $D$ different “embedding vectors” for RNN1 as inputs, indicating different subgoals. Even when $\mathrm{softmax}(o_t)$ is not close to any one-hot vector, choosing a small $D$ helps avoid overfitting.
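The following sketch illustrates how an initial input for RNN1 could be formed from RNN2's output through the embedding matrix. The names `M`, `o_t`, `d` and `D`, the random matrix, and the use of a softmax normalization are assumptions adopted here for illustration. The example shows that when the normalized vector is nearly one-hot, the result is nearly a single column of the matrix.

```python
import numpy as np


def softmax(o):
    """Numerically stable softmax over a 1-D vector."""
    z = np.exp(o - np.max(o))
    return z / z.sum()


def rnn1_initial_input(M, o_t):
    """M has shape (d, D) and o_t has shape (D,); the result has shape (d,)."""
    return M @ softmax(o_t)


rng = np.random.default_rng(0)
d, D = 3, 4                                # toy dimensions, not the paper's settings
M = rng.normal(size=(d, D))                # stand-in for a learned embedding matrix
peaked = np.array([0.0, 20.0, 0.0, 0.0])   # softmax is almost one-hot on index 1
x0 = rnn1_initial_input(M, peaked)
```

With a flat `o_t`, the same function returns the average of the columns instead, which is the case where the selected subgoal is ambiguous.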

Under the SDN assumption, we model the conditional likelihood of a proposed segmentation $\sigma = ((s_0, s_1, s_2), (s_2, s_3, s_4), (s_4, s_5))$ as $p(\sigma \mid s_0) = p(s_{0:2} \mid s_0) \cdot p(s_{2:4} \mid s_{0:2}) \cdot p(s_{4:5} \mid s_{0:4})$, where $s_{i:j}$ denotes the sub-trajectory $(s_i, \ldots, s_j)$ and each probability term is based on an RNN1 instance. However, this conditional likelihood is only valid when $s_2$, $s_4$ and $s_5$ are known to be the subgoal states, whereas we only observe the whole trajectory. To address this challenge, we model the likelihood of the input trajectory as the sum over all possible segmentations. This is akin to the assumption in Wang et al. (2017).

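For an input state trajectory $\mathbf{s} = (s_0, \ldots, s_T)$, we model its likelihood as follows (for notational convenience, we include $s_0$ in the observational sequence, though $s_0$ is always conditioned upon):

$$L_S(\mathbf{s}) = \sum_{\sigma \in \mathcal{S}(\mathbf{s}),\ \mathrm{length}(\sigma) \le S} \ \prod_{i=1}^{\mathrm{length}(\sigma)} p\left(\sigma_i \mid \tau(\sigma_{1:i-1})\right) \tag{3}$$

where $\mathcal{S}(\mathbf{s})$ is the set of all possible segmentations of the trajectory $\mathbf{s}$, $\sigma_i$ denotes the $i$-th segment in the segmentation $\sigma$, and $\tau$ is the concatenation operator, so that $\tau(\sigma_{1:i-1})$ is the history preceding the $i$-th segment (for $i = 1$, the history is just $s_0$). $S$ is an upper limit on the number of segments we allow. This parameter is important for learning subgoals in our setting since we usually prefer a small number of subgoals. This is different from Wang et al. (2017), where a maximum segment length is enforced.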

We use maximum likelihood estimation with Eq. (3) for training. However, there are exponentially many possible segmentations in $\mathcal{S}(\mathbf{s})$, so simple enumeration is intractable. Fortunately, dynamic programming can be used to compute the likelihood in Eq. (3) efficiently: for a trajectory $\mathbf{s} = (s_0, \ldots, s_T)$, if we denote the sub-trajectory $(s_i, \ldots, s_t)$ of $\mathbf{s}$ as $s_{i:t}$, then these likelihood values obey the following recursive relation:

$$L_m(s_{0:t}) = \begin{cases} \mathbb{I}[t = 0], & m = 0 \text{ or } t = 0, \\ \sum_{i=0}^{t-1} L_{m-1}(s_{0:i}) \, p(s_{i:t} \mid s_{0:i}), & \text{otherwise.} \end{cases}$$

Here, $L_m(s_{0:t})$ denotes the likelihood of sub-trajectory $s_{0:t}$ with no more than $m$ segments, and $\mathbb{I}[\cdot]$ is the indicator function. $p(s_{i:t} \mid s_{0:i})$ is the likelihood of segment $s_{i:t}$ given the previous history $s_{0:i}$, where RNN1 models the segment and RNN2 models the history as shown in Figure 2. With this recursive relation, we can compute the likelihood $L_S(\mathbf{s})$ for the trajectory $\mathbf{s} = (s_0, \ldots, s_T)$ in $O(ST^2)$ time.
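The dynamic program can be sketched in a few lines. In this illustration the segment likelihood is supplied as a plain function `seg_prob(i, t)`, standing in for what RNN1 and RNN2 would compute; that interface, the function names, and the toy likelihood used in the check are assumptions. A brute-force enumeration over cut points is included only to confirm that the table computes the sum over segmentations with a bounded number of segments.

```python
import itertools
import math


def trajectory_likelihood(T, max_segments, seg_prob):
    """Sum over all segmentations of s_0..s_T that use at most max_segments segments.

    seg_prob(i, t) is the likelihood of segment s_i..s_t given the history s_0..s_i.
    """
    # table[m][t]: likelihood of the prefix s_0..s_t using no more than m segments.
    table = [[0.0] * (T + 1) for _ in range(max_segments + 1)]
    for m in range(max_segments + 1):
        table[m][0] = 1.0  # the prefix consisting of s_0 alone needs no segment
    for m in range(1, max_segments + 1):
        for t in range(1, T + 1):
            table[m][t] = sum(table[m - 1][i] * seg_prob(i, t) for i in range(t))
    return table[max_segments][T]


def brute_force_likelihood(T, max_segments, seg_prob):
    """Enumerate every choice of interior cut points (exponential; for checking only)."""
    total = 0.0
    for k in range(min(max_segments, T)):  # k cut points give k + 1 segments
        for cuts in itertools.combinations(range(1, T), k):
            bounds = (0, *cuts, T)
            total += math.prod(seg_prob(a, b) for a, b in zip(bounds, bounds[1:]))
    return total
```

The table has on the order of S times T entries and each entry sums over at most T earlier positions, which gives the quadratic dependence on trajectory length.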

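We denote by $\theta_s$ the model parameters of SDN, which include the parameters of the embedding matrix $M$, RNN1 and RNN2. Given a set of $N$ state trajectories $(\mathbf{s}^{(1)}, \ldots, \mathbf{s}^{(N)})$, we optimize $\theta_s$ by minimizing the negative mean log-likelihood with an $L_2$-regularization term weighted by $\lambda > 0$, using stochastic gradient descent:

$$\mathcal{L}(\theta_s, \lambda) = -\frac{1}{N} \sum_{n=1}^{N} \log L_S(\mathbf{s}^{(n)}) + \frac{\lambda}{2} \|\theta_s\|^2$$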

3.2 Hierarchical Policy Learning with SDN

After SDN is learned, we use it to learn the dialogue policy with hierarchical reinforcement learning (HRL) as follows. The agent starts from the initial state $s_0$ and keeps sampling the output from the distribution produced by RNN1 until a termination symbol # is generated, which indicates that the agent has reached a subgoal; it then selects a new option and repeats this process. This type of naive sampling may allow the option to terminate at places where termination has a low probability. To stabilize the HRL training, we introduce a threshold $p \in (0, 1)$, which directs the agent to terminate an option if and only if the probability of outputting # is at least $p$. We found that this modification leads to better behavior of the HRL agent than the naive sampling method, since it normally has a smaller variance. In HRL training, the agent only uses the probability of outputting # to decide subgoal termination. Algorithm 1 outlines the full procedure of one episode for hierarchical dialogue policies with SDN in the composite task-completion dialogue system.

Algorithm 1: HRL episode with a trained SDN

Input: A trained SDN $\mathcal{M}$, the initial state $s_0$ of an episode, a threshold $p$, and the HRL agent $\mathcal{A}$.
Initialize an RNN2 instance $R_2$ with parameters from $\mathcal{M}$ and $s_0$ as the initial input.
Initialize an RNN1 instance $R_1$ with parameters from $\mathcal{M}$ and $M \cdot \mathrm{softmax}(o_0)$ as the initial input, where $M$ is the embedding matrix (from $\mathcal{M}$) and $o_0$ is the initial output of $R_2$.
Set the current state $s \leftarrow s_0$.
Select an option $g$ using the agent $\mathcal{A}$.
while the final goal has not been reached do
   Select an action $a$ according to $s$ and $g$ using the agent $\mathcal{A}$. Get the reward $r$ and the next state $s'$ from the environment.
   Feed $s'$ to $R_2$, denote by $o_t$ the latest output of $R_2$, and take $s'$ as the new input of $R_1$. Let $p_{s'}$ be the probability that $R_1$ outputs the termination symbol #.
   if $p_{s'} \ge p$ then
      Select a new option $g$ using the agent $\mathcal{A}$.
      Re-initialize $R_1$ using the latest output $o_t$ from $R_2$ and the embedding matrix $M$.
   end if
   Set $s \leftarrow s'$.
end while
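The control flow of Algorithm 1 can be sketched with toy stand-ins. Everything below is hypothetical scaffolding: `ToySDN` replaces the two RNNs with a lookup table of termination probabilities, `ToyAgent` and `ToyEnv` are trivial, and the method names are invented for this example. The extra check that skips option selection once the episode has ended is also an assumption. Only the loop structure and the threshold test are meant to mirror the algorithm.

```python
class ToySDN:
    """Stand-in for a trained SDN: returns a termination probability per state."""

    def __init__(self, term_prob):
        self.term_prob = term_prob
        self.restarts = 0

    def reset(self, state):
        self.restarts = 0  # a real SDN would initialize RNN2 and the first RNN1 here

    def observe(self, state):
        return self.term_prob.get(state, 0.0)  # probability of emitting '#'

    def restart_segment(self):
        self.restarts += 1  # a real SDN would re-initialize RNN1 from RNN2's output


class ToyAgent:
    def select_option(self, state):
        return "option_after_%d" % state

    def select_action(self, state, option):
        return "advance"


class ToyEnv:
    """Chain of integer states 0..goal; every action moves one step forward."""

    def __init__(self, goal):
        self.goal = goal
        self.state = 0

    def step(self, action):
        self.state += 1
        return -1.0, self.state, self.state == self.goal


def run_episode(sdn, agent, env, s0, threshold):
    sdn.reset(s0)
    state = s0
    option = agent.select_option(state)
    options_used = [option]
    done = False
    while not done:
        action = agent.select_action(state, option)
        reward, next_state, done = env.step(action)
        p_term = sdn.observe(next_state)
        state = next_state
        # Terminate the option only when '#' is likely enough (assumed: not at episode end).
        if p_term >= threshold and not done:
            option = agent.select_option(state)
            sdn.restart_segment()
            options_used.append(option)
    return options_used


sdn = ToySDN({2: 0.9, 4: 0.15, 5: 0.95})
options_used = run_episode(sdn, ToyAgent(), ToyEnv(goal=5), s0=0, threshold=0.2)
```

In this toy run the option switches once, at state 2, because the termination probability at state 4 stays below the threshold.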

4 Experiments and Results

We evaluate the proposed model on a travel planning scenario for composite task-completion dialogues (Peng et al., 2017b). Over the course of the conversation, the agent gathers information about the user’s intention before booking a trip. The environment then assesses a binary outcome (success or failure) at the end of the conversation, based on (1) whether a trip is booked, and (2) whether the trip satisfies the user’s constraints.


Dataset.

The raw dataset in our experiments is from a publicly available multi-domain dialogue corpus (El Asri et al., 2017). Following Peng et al. (2017b), a few changes were made to introduce dependencies among subtasks. For example, the hotel check-in date should be the same as the arrival date of the departure flight. The data was mainly used to create simulated users, and to build the knowledge bases for the subtasks of booking flights and reserving hotels.

User Simulator.

In order to learn good policies, RL algorithms typically need an environment to interact with. In the dialogue research community, it is common to use simulated users for this purpose (Schatzmann et al., 2007; Li et al., 2017; Liu and Lane, 2017). In this work, we adapted a publicly available user simulator (Li et al., 2016) to the composite task-completion dialogue setting with the dataset described above. During training, the simulator provides the agent with an (extrinsic) reward signal at the end of the dialogue. A dialogue is considered to be successful only when a travel plan is booked successfully and the information provided by the agent satisfies the user’s constraints. During each turn, the agent receives an extrinsic reward, as introduced in Section 2.2.

Baseline Agents.

We benchmarked the proposed agent (denoted as the m-HRL Agent) against three baseline agents:

  • A Rule Agent uses a sophisticated, hand-crafted dialogue policy, which requests and informs a hand-picked subset of necessary slots, and then confirms with the user about the reserved trip before booking the ticket and hotel.

  • A flat RL Agent is trained with a standard deep reinforcement learning method (DQN) which learns a flat dialogue policy using extrinsic rewards only.

  • An h-HRL Agent is trained with hierarchical deep reinforcement learning (HDQN), which learns a hierarchical dialogue policy given human-defined subgoals (Peng et al., 2017b).

Collecting State Trajectories.

Recall that our subgoal discovery approach takes as input a set of state trajectories which lead to successful outcomes. In practice, one can collect a large set of successful state trajectories, either by asking human experts to demonstrate (e.g., in a call center), or by rolling out a reasonably good policy (e.g., a policy designed by human experts). These trajectories need not be optimal, but they should not be mere random walks either: they should contain at least some useful demonstrations. In this paper, we obtain dialogue state trajectories from a rule-based agent handcrafted by a domain expert. This rule-based agent achieves a success rate of 32.2%, as shown in Figure 3 and Table 1. We only collect the successful dialogue sessions from the roll-outs of the rule-based agent, and try to learn the subgoals from these dialogue state trajectories.

Experiment Settings.

To train SDN, we use RMSprop (Tieleman and Hinton, 2012) to optimize the model parameters. For both RNN1 and RNN2, we use LSTM cells (Hochreiter and Schmidhuber, 1997) as hidden units and set the RNN hidden size to . The embedding matrix $M$ has $D$ columns. As discussed in Section 3.1, $D$ captures the maximum number of subgoals that the model is expected to learn. Again, to prevent SDN from learning many unnecessary subgoals, we only allow segmentations with at most $S$ segments during subgoal training. The values of $D$ and $S$ are usually set to be slightly larger than the expected number of subgoals (e.g., 2 or 3 for this task), since we expect a large proportion of the subgoals that SDN learns to be useful, but not necessarily all of them. As long as SDN discovers useful subgoals that guide the agent to learn policies faster, it is beneficial for HRL training, even if some imperfect subgoals are found. During the HRL training, we use the learned SDN to propose subgoal-completion queries.

The training data for SDN was collected from a rule-based policy. We collected imperfect but successful dialogue episodes from this policy and randomly chose a subset of these dialogue state trajectories for training SDN. The remaining trajectories were used as a validation set.

As described in Section 3.2, SDN starts a new RNN1 instance and issues a subgoal-completion query when the probability of outputting the termination symbol # is above a certain threshold $p$ (as in Algorithm 1). In our experiments, $p$ is set to 0.2, which was manually picked according to the termination probability values observed during SDN training.

In dialogue policy learning, for the baseline RL agent, we set the size of the hidden layer to . For the HRL agents, both top-level and low-level dialogue policies have a hidden layer size of . RMSprop was applied to optimize the parameters. We set the batch size to be . During training, we used an $\epsilon$-greedy strategy for exploration with annealing and set . For each simulation epoch, we simulated dialogues and stored these state transition tuples in an experience replay buffer. At the end of each simulation epoch, the model was updated with all the transition tuples in the buffer in a batch manner.
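For readers unfamiliar with annealed exploration, the sketch below shows one common form of it: an exploration rate that decays linearly from a start value to an end value, and an action selector that explores with that probability. The linear schedule, the function names, and all numeric values in the usage lines are assumptions, since the text above only states that an annealed exploration strategy was used.

```python
import random


def epsilon_at(step, eps_start, eps_end, anneal_steps):
    """Linearly interpolate the exploration rate, then hold it at eps_end."""
    fraction = min(step / anneal_steps, 1.0)
    return eps_start + fraction * (eps_end - eps_start)


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


# Illustrative values only.
eps_mid = epsilon_at(50, eps_start=1.0, eps_end=0.1, anneal_steps=100)
greedy_action = epsilon_greedy([0.1, 0.7, 0.2], epsilon=0.0)
```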

Figure 3: Learning curves of agents under simulation.

Agent    Success Rate    Turns    Reward
Rule     .3220           46.23    -24.02
RL       .4440           45.50    -1.834
h-HRL    .6485           44.23    35.32
m-HRL    .6455           44.85    34.77

Table 1: Performance of different agents with simulated user.

4.1 Simulated User Evaluation

In the composite task-completion dialogue scenario, we compared the proposed m-HRL agent with three baseline agents in terms of three metrics: success rate (the fraction of dialogues that accomplished the task successfully within a maximum number of turns), average reward, and average number of turns per dialogue session.
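The three metrics are simple averages over a set of dialogue sessions. A minimal sketch, assuming each session is recorded as a tuple of success flag, number of turns, and total reward (a record format chosen here for illustration, with made-up numbers):

```python
def summarize(dialogues):
    """dialogues: list of (success: bool, turns: int, reward: float) tuples."""
    n = len(dialogues)
    return {
        "success_rate": sum(1 for ok, _, _ in dialogues if ok) / n,
        "avg_turns": sum(turns for _, turns, _ in dialogues) / n,
        "avg_reward": sum(reward for _, _, reward in dialogues) / n,
    }


# Four made-up sessions.
sessions = [(True, 20, 80.0), (False, 60, -90.0), (True, 30, 75.0), (False, 50, -85.0)]
summary = summarize(sessions)
```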

Figure 3 shows the learning curves of all four agents trained against the simulated user. Each learning curve was averaged over runs. Table 1 shows the test performance where each number was averaged over runs and each run generated simulated dialogues. We find that the HRL agents generated higher success rates and needed fewer conversation turns to achieve the users’ goals than the rule-based agent and the flat RL agent. The performance of the m-HRL agent is tied with that of the h-HRL agent, even though the latter requires high-quality subgoals designed by human experts.

Figure 4: Performance of three different agents tested with real users: success rate, number of dialogues and p-values are indicated on each bar (difference in mean is significant with $p < 0.05$).

4.2 Human Evaluation

We further evaluated the agents that were trained on simulated users against real users, who were recruited from the authors’ organization. We conducted a study using the one RL agent and two HRL agents {RL, h-HRL, m-HRL}, and compared two pairs: {RL, m-HRL} and {h-HRL, m-HRL}. In each dialogue session, one agent was randomly selected from the pool and allowed to interact with a user. The user was not aware of which agent was selected to avoid systematic bias. The user was presented with a goal sampled from a user-goal corpus, then was instructed to converse with the agent to complete the given task. At the end of each dialogue session, the user was asked to give a rating on a scale from to based on the naturalness and coherence of the dialogue; here, is the worst rating and the best. In total, we collected dialogue sessions from human users.

Figure 4 summarizes the performance of these agents against real users in terms of success rate. Figure 5 shows the distribution of user ratings for each agent. On both metrics, both HRL agents were significantly better than the flat RL agent. Another interesting observation is that the m-HRL agent performs similarly to the h-HRL agent in terms of success rate, as shown in Figure 4. In Figure 5, however, the h-HRL agent is significantly better than the m-HRL agent in terms of real user ratings. This may be caused by the probabilistic termination of subgoals: we used a threshold strategy to decide whether to terminate a subgoal. This could introduce variance, so the agent might behave less reasonably than one with human-defined subgoals, which terminate deterministically.

Figure 5: Distribution of user ratings for three different agents in human evaluation

4.3 Subgoal Visualization

Table 2 shows the subgoals discovered by SDN in a sample dialogue between a rule-based agent and the user simulator. The rule-based agent is equipped with a human-defined subtask structure: the flight subtask comes first (turns 1 to 15), followed by the hotel subtask (turns 16 to 23), as shown in the first column. At turn 10, the user starts to talk about the hotel subtask while the rule-based agent keeps working on the pre-defined, unfinished flight subtask until it is finished at turn 15. At turn 16, the user switches to the hotel subtask, and the rule-based agent also switches to the hotel subtask until the end of the dialogue. For this rule-based agent, the human-defined subgoal (flight subtask) terminates at turn 15. Meanwhile, our SDN model detected two subgoals (in addition to the final goal): one terminating at turn 9 (Subgoal 1), and another terminating at turn 15 (Subgoal 2). Subgoal 2 is consistent with the human-defined subgoal. Subgoal 1 is also reasonable, since the user tries to switch to the hotel subtask at turn 10. In Appendix A, Table 3 shows a sample dialogue session between the m-HRL agent and a real user.

h-Task    ID    Text
Flight     1    agt: Hi, may I know your departure city?
Flight     2    usr: I depart from Fort Lauderdale.
Flight     3    agt: And what is your destination city?
Flight     4    usr: I need economy seat to Pittsburgh.
Flight     5    agt: What is your departure date?
Flight     6    usr: 2016-09-12, what time is available for the flight?
Flight     7    agt: There is one flight at 21:00:00.
Flight     8    usr: What is the price for that flight?
Flight     9    agt: $423.
                Subgoal 1: A detected subgoal terminates!
Flight    10    usr: I need to reserve a hotel for five people.
Flight    11    agt: What is your return date for the trip?
Flight    12    usr: 2016-09-18, I need to book 5 tickets for the flight.
Flight    13    agt: What time do you prefer for return flight?
Flight    14    usr: What time is available that day?
Flight    15    agt: There is one flight departing at 11:00:00AM.
                Subgoal 2: A detected subgoal terminates!
Hotel     16    usr: What hotel is available?
Hotel     17    agt: There is one hotel called Winter Hill Inn available.
Hotel     18    usr: What is the price for hotel?
Hotel     19    agt: $410.
Hotel     20    usr: Great, could you book the trip for me?
Hotel     21    agt: Okay. Trip tickets are issued.
Hotel     22    usr: Thanks very much!
Hotel     23    agt: You are welcome!

Table 2: Discovered subgoals (excluding the final goal) in a sample dialogue between a rule-based agent and the user simulator. The left column (h-Task) shows the human-defined subtasks for the rule-based agent. SDN detects two subgoals that terminate at turns 9 and 15, respectively. (h-Task: human-defined subtask, ID: turn ID, agt: Agent, usr: User)

5 Related Work

Task-completion dialogue systems have attracted numerous research efforts, and there is growing interest in leveraging reinforcement learning for policy learning. One line of research is on single-domain task-completion dialogues with flat deep reinforcement learning algorithms such as DQN (Zhao and Eskenazi, 2016; Li et al., 2017; Peng et al., 2018), Actor-Critic (Peng et al., 2017a; Liu and Lane, 2017) and policy gradients (Williams et al., 2017; Liu et al., 2017). Another line of research addresses multi-domain dialogues where each domain is handled by a separate agent (Gašić et al., 2015b, a; Cuayáhuitl et al., 2016). Recently, Peng et al. (2017b) presented a composite task-completion dialogue system. Unlike multi-domain dialogue systems, composite task systems introduce the inter-subtask constraints. Thus, the completion of a set of individual subtasks does not guarantee the solution of the entire task.

Budzianowski et al. (2017) applied HRL to dialogue policy optimization in multi-domain dialogue systems. Peng et al. (2017b) first presented an HRL agent with a global state tracker to learn the dialogue policy in composite task-completion dialogue systems. Both works are built on subgoals that were pre-defined with human domain knowledge about the specific tasks. The only job of the policy learner is to learn a hierarchical dialogue policy, which leaves the subgoal discovery problem unsolved. Beyond their application in dialogue systems, subgoals are also widely studied in the linguistics research community (Allwood, 2000; Linell, 2009).

In the machine learning community, one line of research has sought algorithms that discover subgoals automatically. One approach is to attempt to identify the bottleneck states that the agent passes frequently, and these bottleneck states can be treated as subgoals (Stolle and Precup, 2002; McGovern and Barto, 2001). Another approach is to cluster the states or graphs as subgoals (Mannor et al., 2004; Entezari et al., 2011; Bacon, 2013).

Segmental structures are common in human languages. In the NLP community, some related research on segmentation includes word segmentation (Gao et al., 2005; Zhang et al., 2016) to divide the words into meaningful units. Alternatively, topic detection and tracking (Allan et al., 1998; Sun et al., 2007) segment a stream of data and identify stories or events in news or social text. In this work, we formulate subgoal discovery as a trajectory segmentation problem. Section 3 presents our approach to subgoal discovery which is inspired by a probabilistic sequence segmentation model (Wang et al., 2017).

6 Discussion and Conclusion

We have proposed the Subgoal Discovery Network to learn subgoals automatically in an unsupervised fashion without human domain knowledge. Based on the discovered subgoals, we learn the dialogue policy for complex task-completion dialogue agents using HRL. Our experiments on a composite task of travel planning, with both simulated and real users, show that an agent trained with automatically discovered subgoals performs competitively against an agent with human-defined subgoals, and significantly outperforms an agent without subgoals. Through visualization, we find that SDN discovers reasonable, comprehensible subgoals given only a small number of successful but suboptimal dialogue state trajectories.

These promising results suggest several directions for future research. First, we want to integrate subgoal discovery into dialogue policy learning rather than treat them as two separate processes. Second, we would like to extend SDN to identify multi-level hierarchical structure among subgoals so that we can handle much more complex tasks than the composite task studied in this paper. Third, we want to generalize SDN to a wide range of complex goal-oriented tasks beyond dialogue, such as the ATARI “Montezuma’s Revenge” game.

References


  • Allan et al. (1998) James Allan, Jaime G Carbonell, George Doddington, Jonathan Yamron, and Yiming Yang. 1998. Topic detection and tracking pilot study final report.
  • Allwood (2000) Jens Allwood. 2000. An activity based approach to pragmatics. Abduction, belief and context in dialogue: Studies in computational pragmatics, pages 47–80.
  • Bacon (2013) Pierre-Luc Bacon. 2013. On the bottleneck concept for options discovery.
  • Budzianowski et al. (2017) Pawel Budzianowski, Stefan Ultes, Pei-Hao Su, Nikola Mrksic, Tsung-Hsien Wen, Inigo Casanueva, Lina Rojas-Barahona, and Milica Gasic. 2017. Sub-domain modelling for dialogue management with hierarchical reinforcement learning. arXiv preprint arXiv:1706.06210.
  • Cuayáhuitl et al. (2016) Heriberto Cuayáhuitl, Seunghak Yu, Ashley Williamson, and Jacob Carse. 2016. Deep reinforcement learning for multi-domain dialogue systems. arXiv preprint arXiv:1611.08675.
  • Dhingra et al. (2017) Bhuwan Dhingra, Lihong Li, Xiujun Li, Jianfeng Gao, Yun-Nung Chen, Faisal Ahmed, and Li Deng. 2017. Towards end-to-end reinforcement learning of dialogue agents for information access. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 484–495.
  • El Asri et al. (2017) Layla El Asri, Hannes Schulz, Shikhar Sharma, Jeremie Zumer, Justin Harris, Emery Fine, Rahul Mehrotra, and Kaheer Suleman. 2017. Frames: A corpus for adding memory to goal-oriented dialogue systems. arXiv preprint arXiv:1704.00057.
  • Entezari et al. (2011) Negin Entezari, Mohammad Ebrahim Shiri, and Parham Moradi. 2011. Subgoal discovery in reinforcement learning using local graph clustering.
  • Gao et al. (2005) Jianfeng Gao, Mu Li, Andi Wu, and Chang-Ning Huang. 2005. Chinese word segmentation and named entity recognition: A pragmatic approach. Computational Linguistics, 31(4):531–574.
  • Gašić et al. (2015a) M Gašić, N Mrkšić, Pei-hao Su, David Vandyke, Tsung-Hsien Wen, and Steve Young. 2015a. Policy committee for adaptation in multi-domain spoken dialogue systems. In ASRU 2015, pages 806–812. IEEE.
  • Gašić et al. (2015b) Milica Gašić, Dongho Kim, Pirros Tsiakoulis, and Steve Young. 2015b. Distributed dialogue policies for multi-domain statistical dialogue management. In ICASSP 2015, pages 5371–5375.
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.
  • Kulkarni et al. (2016) Tejas D Kulkarni, Karthik Narasimhan, Ardavan Saeedi, and Josh Tenenbaum. 2016. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in Neural Information Processing Systems, pages 3675–3683.
  • Li et al. (2016) Xiujun Li, Zachary C Lipton, Bhuwan Dhingra, Lihong Li, Jianfeng Gao, and Yun-Nung Chen. 2016. A user simulator for task-completion dialogues. arXiv preprint arXiv:1612.05688.
  • Li et al. (2017) Xiujun Li, Yun-Nung Chen, Lihong Li, and Jianfeng Gao. 2017. End-to-end task-completion neural dialogue systems. arXiv preprint arXiv:1703.01008.
  • Linell (2009) Per Linell. 2009. Rethinking language, mind, and world dialogically. IAP.
  • Lipton et al. (2016) Zachary C Lipton, Jianfeng Gao, Lihong Li, Xiujun Li, Faisal Ahmed, and Li Deng. 2016. Efficient exploration for dialogue policy learning with bbq networks & replay buffer spiking. arXiv preprint arXiv:1608.05081.
  • Liu and Lane (2017) Bing Liu and Ian Lane. 2017. Iterative policy learning in end-to-end trainable task-oriented neural dialog models. arXiv preprint arXiv:1709.06136.
  • Liu et al. (2017) Bing Liu, Gokhan Tur, Dilek Hakkani-Tur, Pararth Shah, and Larry Heck. 2017. End-to-end optimization of task-oriented dialogue model with deep reinforcement learning. arXiv preprint arXiv:1711.10712.
  • Mannor et al. (2004) Shie Mannor, Ishai Menache, Amit Hoze, and Uri Klein. 2004. Dynamic abstraction in reinforcement learning via clustering. In Proceedings of the twenty-first international conference on Machine learning, page 71. ACM.
  • McGovern and Barto (2001) Amy McGovern and Andrew G Barto. 2001. Automatic discovery of subgoals in reinforcement learning using diverse density. Computer Science Department Faculty Publication Series, page 8.
  • Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. 2015. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.
  • Peng et al. (2017a) Baolin Peng, Xiujun Li, Jianfeng Gao, Jingjing Liu, Yun-Nung Chen, and Kam-Fai Wong. 2017a. Adversarial advantage actor-critic model for task-completion dialogue policy learning. arXiv preprint arXiv:1710.11277.
  • Peng et al. (2018) Baolin Peng, Xiujun Li, Jianfeng Gao, Jingjing Liu, and Kam-Fai Wong. 2018. Integrating planning for task-completion dialogue policy learning. arXiv preprint arXiv:1801.06176.
  • Peng et al. (2017b) Baolin Peng, Xiujun Li, Lihong Li, Jianfeng Gao, Asli Celikyilmaz, Sungjin Lee, and Kam-Fai Wong. 2017b. Composite task-completion dialogue policy learning via hierarchical deep reinforcement learning. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2221–2230.
  • Schatzmann et al. (2007) Jost Schatzmann, Blaise Thomson, Karl Weilhammer, Hui Ye, and Steve Young. 2007. Agenda-based user simulation for bootstrapping a pomdp dialogue system. In NAACL 2007; Companion Volume, Short Papers, pages 149–152. Association for Computational Linguistics.
  • Stolle and Precup (2002) Martin Stolle and Doina Precup. 2002. Learning options in reinforcement learning. In International Symposium on Abstraction, Reformulation, and Approximation, pages 212–223. Springer.
  • Su et al. (2016) Pei-Hao Su, Milica Gasic, Nikola Mrksic, Lina Rojas-Barahona, Stefan Ultes, David Vandyke, Tsung-Hsien Wen, and Steve Young. 2016. Continuously learning neural dialogue management. arXiv preprint arXiv:1606.02689.
  • Sun et al. (2007) Bingjun Sun, Prasenjit Mitra, C Lee Giles, John Yen, and Hongyuan Zha. 2007. Topic segmentation with shared topic detection and alignment of multiple documents. In SIGIR 2007, pages 199–206. ACM.
  • Sutton and Barto (1998) Richard S. Sutton and Andrew G. Barto. 1998. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA.
  • Sutton et al. (1999) Richard S Sutton, Doina Precup, and Satinder Singh. 1999. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2):181–211.
  • Tieleman and Hinton (2012) Tijmen Tieleman and Geoffrey Hinton. 2012. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31.
  • Wang et al. (2017) Chong Wang, Yining Wang, Po-Sen Huang, Abdelrahman Mohamed, Dengyong Zhou, and Li Deng. 2017. Sequence modeling via segmentations. arXiv preprint arXiv:1702.07463.
  • Williams et al. (2017) Jason D Williams, Kavosh Asadi, and Geoffrey Zweig. 2017. Hybrid code networks: Practical and efficient end-to-end dialog control with supervised and reinforcement learning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL).
  • Zhang et al. (2016) Meishan Zhang, Yue Zhang, and Guohong Fu. 2016. Transition-based neural word segmentation. In ACL (1).
  • Zhao and Eskenazi (2016) Tiancheng Zhao and Maxine Eskenazi. 2016. Towards end-to-end learning for dialog state tracking and management using deep reinforcement learning. arXiv preprint arXiv:1606.02560.

Appendix A Sample Dialogue

User Goal

reserve-hotel subtask:
“request_slots”: {
  “hotel_price”: “?”
  “hotel_date_checkout”: “?”
  “hotel_name”: “?”
  “hotel_amenity_wifi”: “?”
}
“inform_slots”: {
  “hotel_date_checkin”: “2016-09-22”
  “hotel_city”: “Curitiba”
  “hotel_numberofpeople”: “4”
}

book-flight-ticket subtask:
“request_slots”: {
  “price”: “?”
  “return_time_dep”: “?”
  “return_date_dep”: “?”
  “depart_time_dep”: “?”
  “seat”: “?”
}
“inform_slots”: {
  “or_city”: “Lima”,
  “dst_city”: “Curitiba”,
  “numberofpeople”: “4”,
  “depart_date_dep”: “2016-09-22”
}

m-HRL agent
agt: Hi, may I know your departure city and destination city?
usr: I depart from Lima to Curitiba.
agt: What is your departure date?
usr: I am leaving on 2016-09-22.
agt: There is one flight available at 8:00:00AM.
usr: What date is available for return flight?
agt: There is one return flight on 2016-09-27.
usr: What is the time for the return flight?
agt: It is 00:00:00AM.
usr: What seat is available?
agt: Only Economy seat now.
usr: We need 4 tickets for the flight.
agt: Total is 1201 dollars for flight.
inform(subtask_complete): Subgoal reached!
usr: What hotel is available there?
agt: There is one hotel called Starlight.
usr: Is the wifi available for the hotel?
agt: Yes, it is free.
usr: What’s the price for the hotel?
agt: It is 1541 dollars for hotel.
usr: Could you please book the trip for me?
agt: Okay. Trip tickets are issued.
usr: Thanks!
agt: Thanks!
Table 3: Sample dialogue by the m-HRL agent interacting with a real user: bolded slots are the joint constraints between two subtasks. (agt: Agent, usr: User)
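As an illustration of the user goal in Table 3, the sketch below writes it as a Python dictionary and checks that values shared across the two subtasks agree. The slot names and values are copied from the table. The pairing in `JOINT_SLOTS` and the helper `violated_constraints` are assumptions made for this example only, since the table does not spell out which hotel slot corresponds to which flight slot.

```python
from typing import Dict, List, Tuple

# User goal from Table 3; "?" marks a slot the user wants the agent to fill.
user_goal: Dict[str, Dict[str, Dict[str, str]]] = {
    "reserve-hotel": {
        "request_slots": {
            "hotel_price": "?",
            "hotel_date_checkout": "?",
            "hotel_name": "?",
            "hotel_amenity_wifi": "?",
        },
        "inform_slots": {
            "hotel_date_checkin": "2016-09-22",
            "hotel_city": "Curitiba",
            "hotel_numberofpeople": "4",
        },
    },
    "book-flight-ticket": {
        "request_slots": {
            "price": "?",
            "return_time_dep": "?",
            "return_date_dep": "?",
            "depart_time_dep": "?",
            "seat": "?",
        },
        "inform_slots": {
            "or_city": "Lima",
            "dst_city": "Curitiba",
            "numberofpeople": "4",
            "depart_date_dep": "2016-09-22",
        },
    },
}

# Assumed (hotel slot, flight slot) pairs that should carry the same value.
JOINT_SLOTS: List[Tuple[str, str]] = [
    ("hotel_city", "dst_city"),
    ("hotel_numberofpeople", "numberofpeople"),
    ("hotel_date_checkin", "depart_date_dep"),
]


def violated_constraints(goal: Dict[str, Dict[str, Dict[str, str]]]) -> List[Tuple[str, str]]:
    """Return the assumed slot pairs whose informed values differ across subtasks."""
    hotel = goal["reserve-hotel"]["inform_slots"]
    flight = goal["book-flight-ticket"]["inform_slots"]
    return [(h, f) for h, f in JOINT_SLOTS if hotel.get(h) != flight.get(f)]


print(violated_constraints(user_goal))  # prints []
```

An empty result means the hotel and flight constraints are mutually consistent under the assumed pairing, which is the case for this sample goal.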