Deep Hierarchical Reinforcement Learning Algorithm in Partially Observable Markov Decision Processes
Abstract
In recent years, reinforcement learning has achieved many remarkable successes due to the growing adoption of deep learning techniques and the rapid growth in computing power. Nevertheless, it is well-known that flat reinforcement learning algorithms often fail to learn well and data-efficiently in tasks that have hierarchical structures, e.g., tasks consisting of multiple subtasks. Hierarchical reinforcement learning is a principled approach that can tackle such challenging tasks. On the other hand, many real-world tasks are only partially observable, with imperfect and partial state measurements. Learning in such settings can be formulated as a partially observable Markov decision process (POMDP). In this paper, we study hierarchical RL in POMDPs, in which the tasks are only partially observable and possess hierarchical properties. We propose a hierarchical deep reinforcement learning approach for learning in hierarchical POMDPs. The proposed deep hierarchical RL algorithm applies to both MDP and POMDP learning. We evaluate the proposed algorithm on various challenging hierarchical POMDPs.
Hierarchical Deep Reinforcement Learning, Partially Observable MDP (POMDP), Semi-MDP, Partially Observable Semi-MDP (POSMDP)
1 Introduction
Reinforcement Learning (RL) [[1]] is a subfield of machine learning focused on learning a policy that maximizes the total cumulative reward in an unknown environment. RL has been applied to many areas such as robotics [[2]][[3]][[4]][[5]], economics [[6]][[7]][[8]], computer games [[9]][[10]][[11]] and other applications [[12]][[13]][[14]]. RL is divided into two approaches: the value-based approach and the policy-based approach [[15]]. A typical value-based approach tries to obtain an optimal policy by finding optimal value functions. The value functions are updated using the immediate reward and the discounted value of the next state. Some methods based on this approach are Q-learning, SARSA, and TD-learning [[1]]. In contrast, the policy-based approach directly learns a parameterized policy that maximizes the cumulative discounted reward. Some techniques used for searching for optimal parameters of the policy are policy gradient [[16]], expectation maximization [[17]], and evolutionary algorithms [[18]]. In addition, a hybrid of the value-based and policy-based approaches is called actor-critic [[19]]. Recently, RL algorithms integrated with deep neural networks (DNNs) [[10]] have achieved strong performance, even surpassing human performance on some tasks, such as Atari games [[10]] and the game of Go [[11]]. However, obtaining good performance on a task consisting of multiple subtasks, such as Mazebase [[20]] and Montezuma's Revenge [[9]], is still a major challenge for flat RL algorithms. Hierarchical reinforcement learning (HRL) [[21]] was developed to tackle these domains.
HRL decomposes an RL problem into a hierarchy of subproblems (subtasks), each of which can itself be an RL problem. Identical subtasks can be gathered into one group and controlled by the same policy. As a result, hierarchical decomposition represents the problem in a compact form and reduces the computational complexity. Various approaches to decomposing the RL problem have been proposed, such as options [[21]], HAMs [[22]][[23]], MAXQ [[24]], Bayesian HRL [[25], [26], [27]] and other advanced techniques [[28]]. All of these approaches take the semi-Markov decision process [[21]] as their base theory. Recently, many studies have combined HRL with deep neural networks in different ways to improve performance on hierarchical tasks [[29]][[30]][[31]][[32]][[33]][[34]][[35]]. Bacon et al. [[30]] proposed an option-critic architecture, which has a fixed number of intra-options, each of which is followed by a "deep" policy. At each time step, only one option is activated, selected by another policy called the "policy over options". DeepMind [[33]] also proposed a deep hierarchical framework inspired by a classical framework called feudal reinforcement learning [[36]]. Similarly, Kulkarni et al. [[37]] proposed a two-level deep hierarchical framework in which the high-level controller produces a subgoal and the low-level controller performs primitive actions to reach that subgoal. This framework is useful for solving problems with multiple subgoals, such as Montezuma's Revenge [[9]] and games in Mazebase [[20]]. Other studies have tried to tackle more challenging problems in HRL, such as discovering subgoals [[38]] and adaptively finding the number of options [[39]].
Though many studies have made great efforts on this topic, most of them rely on an assumption of full observability, where the learning agent can observe environment states fully. In other words, the environment is represented as a Markov decision process (MDP). This assumption does not reflect the nature of real-world applications, in which the agent only observes partial information about the environment states. The environment, in this case, is therefore represented as a POMDP. For an agent to learn in such a POMDP environment, more advanced techniques are required to predict environment responses well, such as maintaining a belief distribution over unobservable states or, alternatively, using a recurrent neural network (RNN) [[40]] to summarize the observation history. Recently, there have been some studies using deep RNNs to deal with learning in POMDP environments [[41]][[42]].
In this paper, we develop a deep HRL algorithm for POMDP problems inspired by the deep HRL framework of [[37]]. The agent in this framework makes decisions through a two-level hierarchical policy. The top policy determines the subgoal to be achieved, while the lower-level policy performs primitive actions to achieve the selected subgoal. To learn in POMDPs, we represent all policies using RNNs. The RNN in the lower-level policies constructs an internal state that summarizes the whole sequence of observations made during the course of interaction with the environment. The top policy is an RNN that encodes the sequence of observations made during the execution of a selected subgoal. We highlight our contributions as follows. First, we exploit the advantages of RNNs, integrating them with hierarchical RL to handle learning on challenging and complex tasks in POMDPs. Second, this integration leads to a new hierarchical POMDP learning framework that is more scalable and data-efficient.
The rest of the paper is organized as follows. Section 2 reviews the underlying background: the semi-Markov decision process, the partially observable Markov decision process, and deep reinforcement learning. Our contributions are described in Section 3, which consists of two parts: the deep model part describes all elements of our algorithm, and the learning algorithm part summarizes the algorithm in a generalized form. The usefulness of the proposed algorithms is demonstrated through POMDP benchmarks in Section 4. Finally, Section 5 concludes the paper.
2 Background
In this section, we briefly review the underlying theories on which this paper relies: the Markov decision process, the semi-Markov decision process for hierarchical RL, the partially observable Markov decision process for RL, and deep reinforcement learning.
2.1 Markov Decision Process
A discrete MDP models a sequence of decisions of a learning agent interacting with an environment at discrete time steps t = 0, 1, 2, …. Formally, an MDP consists of a tuple of five elements (S, A, T, R, γ), where S is a discrete state space, A is a discrete action space, T(s′ | s, a) is a transition function that measures the probability of obtaining a next state s′ given a current state-action pair (s, a), R(s, a) defines the immediate reward achieved at each state-action pair, and γ ∈ [0, 1) denotes a discount factor. The MDP relies on the Markov property, i.e., the next state depends only on the current state-action pair:

P(s_{t+1} | s_t, a_t, s_{t−1}, a_{t−1}, …, s_0, a_0) = P(s_{t+1} | s_t, a_t)
A policy of an RL algorithm is denoted by π(a | s), the probability of taking action a given state s. The goal of an RL algorithm is to find an optimal policy π* that maximizes the expected total discounted reward:

(1) π* = argmax_π E[ Σ_{t=0}^{∞} γ^t R(s_t, a_t) ]
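As a small numerical illustration of the objective in Eq. (1), the following sketch computes the discounted return of a sampled reward sequence; the function name and values are ours, not from the paper.

```python
# Hypothetical sketch: the discounted return of one sampled trajectory,
# i.e. the quantity inside the expectation of Eq. (1).
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    # Accumulate backwards so each reward is discounted by gamma^t.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Three steps of reward 1 with gamma = 0.5 give 1 + 0.5 + 0.25 = 1.75.
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```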
2.2 Semi-Markov Decision Process for Hierarchical Reinforcement Learning
Learning over different levels of policy is the main challenge in hierarchical tasks. The semi-Markov decision process (SMDP) [[21]], an extension of the MDP, was developed to deal with this challenge. In this theory, the concept of "options" is introduced as a type of temporal abstraction. An "option" o is defined by three elements: an option policy π_o, a termination condition β_o, and an initiation set I_o denoting the set of states in which the option can start. In addition, a policy over options μ is introduced to select options. An option is executed as follows. First, when option o is taken, actions are selected based on π_o. The environment then transitions to state s_{t+1}. The option either terminates or continues according to the termination probability β_o(s_{t+1}). When the option terminates, the agent selects a next option based on μ. The total discounted reward received by executing option o from state s is defined as

(2) R(s, o) = E[ r_{t+1} + γ r_{t+2} + … + γ^{k−1} r_{t+k} ]

where k is the (random) number of steps for which the option executes.
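The option-execution cycle described above can be sketched as follows; `env_step`, `mu`, and the option object with `pi` and `beta` methods are hypothetical stand-ins for the environment, the policy over options, and an option.

```python
import random

# Hedged sketch of the SMDP option-execution cycle: an option runs its own
# policy pi until its termination condition beta fires, then the policy
# over options mu selects the next option. All names are illustrative.
def run_options(env_step, mu, options, s, max_steps=100):
    """env_step(s, a) -> (s', r); mu(s) -> option index;
    each option has .pi(s) -> action and .beta(s) -> termination prob."""
    total, t = 0.0, 0
    o = options[mu(s)]                     # initial option choice
    while t < max_steps:
        a = o.pi(s)                        # option's internal policy acts
        s, r = env_step(s, a)
        total += r
        t += 1
        if random.random() < o.beta(s):    # option terminates here
            o = options[mu(s)]             # select the next option via mu
    return total, s
```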
The multi-time state transition model of option o [[43]], which is initiated in state s and terminates at state s′ after k steps, has the formula

(3) P(s′ | s, o) = Σ_{k=1}^{∞} p(s′, k | s, o) γ^k
where p(s′, k | s, o) is the joint probability of ending up at s′ after k steps when taking option o at state s. Given this, we can write the Bellman equation for the value function of a policy over options μ:

(4) V^μ(s) = Σ_o μ(o | s) [ R(s, o) + Σ_{s′} P(s′ | s, o) V^μ(s′) ]
and the option-value function:

(5) Q^μ(s, o) = R(s, o) + Σ_{s′} P(s′ | s, o) Σ_{o′} μ(o′ | s′) Q^μ(s′, o′)
Similarly, the corresponding optimal Bellman equations are as follows:

(6) V*(s) = max_o [ R(s, o) + Σ_{s′} P(s′ | s, o) V*(s′) ]

(7) Q*(s, o) = R(s, o) + Σ_{s′} P(s′ | s, o) max_{o′} Q*(s′, o′)
The optimal Bellman equation can be computed using synchronous value iteration (SVI) [[21]], which iterates the following step for every state:

(8) V_{k+1}(s) ← max_o [ R(s, o) + Σ_{s′} P(s′ | s, o) V_k(s′) ]
When the option model is unknown, Q(s, o) can be estimated using a Q-learning algorithm with the update formula:

(9) Q(s, o) ← Q(s, o) + α [ R + γ^k max_{o′} Q(s′, o′) − Q(s, o) ]

where α denotes the learning rate, k denotes the number of time steps elapsing between s and s′, and R denotes the immediate reward if o is a primitive action; otherwise, R is the total discounted reward collected while executing option o.
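A minimal sketch of the SMDP Q-learning update of Eq. (9), with a nested dictionary standing in for the tabular Q-function; all names and values are illustrative.

```python
# Sketch of the SMDP Q-learning update in Eq. (9): after executing option o
# for k steps from state s, collecting total discounted reward R, the value
# Q(s, o) is moved toward R + gamma**k * max_o' Q(s', o').
def smdp_q_update(Q, s, o, R, s_next, k, alpha=0.1, gamma=0.99):
    target = R + gamma**k * max(Q[s_next].values())  # k-step bootstrap target
    Q[s][o] += alpha * (target - Q[s][o])            # move toward the target
    return Q[s][o]
```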
2.3 Partially Observable Markov Decision Process in Reinforcement Learning
In many real-world tasks, the agent might not have full observability of the environment. In principle, such tasks can be formulated as a POMDP, which is defined as a tuple of six components (S, A, T, R, Ω, O), where S, A, T, and R are the state space, action space, transition function, and reward function, respectively, as in a Markov decision process. Ω and O are the observation space and observation model, respectively. If the POMDP model is known, the optimal approach is to maintain a hidden state called the belief state. The belief b(s) defines the probability of being in state s given the history of actions and observations. Given a new action a and observation z, the belief update is performed using Bayes' rule [[44]] as follows:

(10) b′(s′) = O(z | s′, a) Σ_s T(s′ | s, a) b(s) / P(z | a, b)
However, exact belief updates require high computational cost and expensive memory [[40]]. Another approach is a Q-learning agent with function approximation, which uses the Q-learning algorithm to update the policy. Because updating the Q-value from a single observation can be less accurate than estimating it from a complete state, a better option is for a Q-learning POMDP agent to use a window of the most recent observations as input to the policy. Nevertheless, the problem with using a finite number of observations is that key-event observations far in the past are neglected in future decisions. For this reason, an RNN can be used to maintain a long-term state, as in [[41]]. Our model, which uses RNNs at different levels of the hierarchical policy, is expected to have this advantage in POMDP environments.
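The Bayes-rule belief update of Eq. (10) can be sketched directly; `T` and `O` below are assumed to be dictionaries of transition and observation probabilities, and all names are hypothetical.

```python
# Illustrative belief update of Eq. (10): given belief b over states,
# action a and observation obs, the new belief weights each s' by
# O(obs | s', a) * sum_s T(s' | s, a) * b(s), then normalizes.
def belief_update(b, a, obs, T, O, states):
    b_new = {}
    for s2 in states:
        # Predicted probability of reaching s2 under action a.
        pred = sum(T[(s, a, s2)] * b[s] for s in states)
        b_new[s2] = O[(s2, a, obs)] * pred
    z = sum(b_new.values())            # normalizer P(obs | a, b)
    return {s2: v / z for s2, v in b_new.items()}
```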
2.4 Deep Reinforcement Learning
Recent advances in deep learning [[45]] are widely applied to reinforcement learning, forming deep reinforcement learning. Until a few years ago, reinforcement learning still used "shallow" policies such as tabular representations, linear functions, radial basis networks, or neural networks with few layers. Shallow policies have many limitations, especially in representing highly complex behaviors and in the computational cost of updating parameters. In contrast, the deep neural networks used in deep reinforcement learning can extract more information from raw inputs by pushing them through multiple neural layers, such as multilayer perceptron (MLP) layers and convolutional (CONV) layers. The many layers of a DNN can hold a large number of parameters, allowing it to represent highly nonlinear problems. The Deep Q-Network (DQN), proposed by Google DeepMind [[10]], opened a new era of deep reinforcement learning and has influenced most later studies in the field. In terms of architecture, the Q-network, parameterized by θ, e.g., Q(s, a; θ), is built on a convolutional neural network (CNN) that receives an input of four stacked frames processed by three hidden CONV layers. The final hidden layer is an MLP layer with 512 rectifier units. The output layer is an MLP layer with one output per action. In terms of the learning algorithm, DQN learns Q-value functions iteratively by updating the Q-value estimation via the temporal difference error:

(11) Q(s, a; θ) ← Q(s, a; θ) + α [ r + γ max_{a′} Q(s′, a′; θ) − Q(s, a; θ) ]
In addition, the stability of DQN relies on two mechanisms. The first is the experience replay memory D, which stores transition data in the form of tuples (s, a, r, s′). It allows the agent to uniformly sample from and train on previous data (off-policy) in batches, thus reducing the variance of learning updates and breaking the temporal correlation among data samples. The second is the target Q-network, parameterized by θ⁻, e.g., Q(s, a; θ⁻), which is a copy of the main Q-network. The target Q-network is used to estimate the loss function as follows:

(12) L(θ) = E_{(s,a,r,s′) ∼ D} [ (y − Q(s, a; θ))² ]

where y = r + γ max_{a′} Q(s′, a′; θ⁻). Initially, the parameters of the target Q-network are the same as those of the main Q-network. During learning, however, they are updated only every fixed number of steps. This update rule decouples the target Q-network from the main Q-network and improves the stability of the learning process.
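The loss of Eq. (12) can be sketched on a batch of array-valued network outputs; `q_main_sa` and `q_target_next` below are plain NumPy arrays standing in for the two networks' outputs, and all names are illustrative.

```python
import numpy as np

# Hedged sketch of the DQN loss in Eq. (12): the target network supplies
# the bootstrap value y = r + gamma * max_a' Q(s', a'; theta^-), and the
# loss is the mean squared TD error over the batch.
def dqn_loss(q_main_sa, rewards, q_target_next, dones, gamma=0.99):
    """q_main_sa: Q(s, a; theta) for the taken actions, shape (B,);
    q_target_next: Q(s', .; theta^-) for all actions, shape (B, A);
    dones: 1.0 where the episode ended (no bootstrap), else 0.0."""
    y = rewards + gamma * (1.0 - dones) * q_target_next.max(axis=1)
    return float(np.mean((y - q_main_sa) ** 2))
```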
Many other models based on DQN have been developed, such as Double DQN [[46]], Dueling DQN [[47]], and Prioritized Experience Replay [[48]]. Moreover, deep neural networks can be integrated into methods other than Q-value estimation, such as representing the policy in policy search algorithms [[5]], estimating the advantage function [[49]], or combining an actor network and a critic network [[50]].
Recently, researchers have used RNNs in reinforcement learning to deal with POMDP domains. RNNs have been successfully applied to domains such as natural language processing and speech recognition, and they are expected to be advantageous in POMDP domains, which require processing a sequence of observations rather than a single input. Our proposed method uses RNNs not only to solve POMDP domains but also to solve them in a hierarchical form of reinforcement learning.
3 Hierarchical Deep Recurrent Q-Learning in Partially Observable Semi-Markov Decision Processes
In this section, we describe the hierarchical deep recurrent Q-learning algorithm (hDRQN), our proposed framework. First, we describe the model of hDRQN and explain how to learn it. Second, we describe the components of our algorithm, such as the sampling strategies, the subgoal definition, and the extrinsic and intrinsic reward functions. We rely on the partially observable semi-Markov decision process (POSMDP) setting, where the agent follows a hierarchical policy to solve POMDP domains. The POSMDP setting [[43], [26]] is described as follows. The domain is decomposed into multiple subdomains. Each subdomain is equivalent to an option in the SMDP framework and has a subgoal g that needs to be achieved before switching to another option. Within an option, the agent receives an observation z_t of the underlying state s_t, takes an action a_t, receives an extrinsic reward r_t, and the environment transitions to state s_{t+1} (of which the agent again observes only a part, z_{t+1}). The agent executes the option until it terminates (either the subgoal is obtained or a termination signal is raised). Afterward, the agent selects and executes another option. This sequence of options is repeated until the agent reaches the final goal. In addition, for obtaining the subgoal within each option, an intrinsic reward function is maintained. While the objective within a subdomain is to maximize the cumulative discounted intrinsic reward, the objective for the whole domain is to maximize the cumulative discounted extrinsic reward.
Specifically, the belief update given a taken option o, an observation z, and the current belief b is defined as

b′(s′) ∝ Σ_s p(s′, z | s, o) b(s)

where p(s′, z | s, o) is a joint transition and observation function of the underlying POSMDP model of the environment.
We adopt notation similar to that of the MAXQ [[24]] and Options [[21]] frameworks to describe our problem. We denote by Q(h_t, o; θ_m) the Q-value function of the meta-controller at history h_t (where an RNN, with past observations encoded in its weights θ_m, maintains the history) and option (macro-action) o (each option o is assumed to have a corresponding subgoal g). We note that the observation and hidden-state pair represents the belief or history at time t; we will use the terms belief and history interchangeably.
The multi-time observation model of option o [[43]], which is initiated in belief b and yields observation z, has the following formula:

(13) P(z | b, o) = Σ_{k=1}^{∞} γ^k p(z, k | b, o)
Given the above, we can write the Bellman equation for the value function of the meta-controller policy μ over options as follows:

(14) V^μ(b) = Σ_o μ(o | b) [ R(b, o) + Σ_z P(z | b, o) V^μ(b′) ]
and the option-value function:

(15) Q^μ(b, o) = R(b, o) + Σ_z P(z | b, o) Σ_{o′} μ(o′ | b′) Q^μ(b′, o′)

where b′ is the belief updated from b after taking option o and observing z.
Similarly to the MAXQ framework, the reward term R(b, o) is the total discounted reward collected by executing option o, defined as R(b, o) = E[ Σ_{t=0}^{k−1} γ^t r_t ]. Its corresponding Q-value function, Q(h_t, a; θ_s, g), is the value function for the sub-controllers. In the use of RNNs, θ_s denotes the weights of the sub-controller network that encodes previous observations, and z_t denotes the observation input to the sub-controller.
Our hierarchical frameworks are illustrated in Fig. 1. The framework in Fig. 1a is inspired by a related idea in [[37]]. The difference is that our framework is built on two deep recurrent neural policies: a meta-controller and a sub-controller. The meta-controller is equivalent to a "policy over subgoals" that receives the current observation and determines a new subgoal. The sub-controller is equivalent to the option's policy, which directly interacts with the environment by performing actions. The sub-controller receives both the observation and the subgoal as inputs to its deep neural network. Each controller in Fig. 1 contains an arrow pointing to itself, indicating that the controller employs a recurrent neural network. In addition, an internal component called the "critic" is used to determine whether the agent has achieved the subgoal and to produce an intrinsic reward that is used to train the sub-controller. In contrast to the framework in Fig. 1a, the framework in Fig. 1b does not use the current observation to determine the subgoal in the meta-controller but instead uses the last hidden state of the sub-controller. This last hidden state, inferred from the sub-controller's sequence of observations, is expected to help the meta-controller correctly determine the next subgoal.


As mentioned in the previous section, RNNs are used in our framework to enable learning in POMDPs. Particularly, CNNs are used to learn a low-dimensional representation of image inputs, while the RNN captures the temporal relations among observations. Without a recurrent layer, CNNs cannot accurately approximate the state feature from observations in POMDP domains. The procedure of the agent is illustrated in Fig. 2. The meta-controller and sub-controller use the Deep Recurrent Q-Network (DRQN) described in [[41]]. Particularly, at step t, the meta-controller takes an observation z_t from the environment (framework 1) or the last hidden state of the sub-controller generated by the previous sequence of observations (framework 2), extracts state features through several deep neural layers, internally constructs a hidden state h^m_t, and produces the subgoal values. The subgoal values are then used to determine the next subgoal g_t. Similarly, the sub-controller receives both the observation z_t and the subgoal g_t, extracts their features, constructs a hidden state h^s_t, and produces the action values, which are used to determine the next action a_t. These steps are formalized in the following equations:
(16) f^m_t = C^m(z_t)

(17) (h^m_t, Q^m(h^m_t, ·)) = R^m(f^m_t, h^m_{t−1})

(18) f^s_t = C^s(z_t, g_t)

(19) (h^s_t, Q^s(h^s_t, ·)) = R^s(f^s_t, h^s_{t−1})

where f^m_t and f^s_t are the features produced by the extraction layers C^m and C^s of the meta-controller and sub-controller, respectively, and R^m and R^s are the respective recurrent networks of the meta-controller and sub-controller. Each recurrent network receives the state features and a hidden state, and then provides the next hidden state and the value function.
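The two-level forward pass of Eqs. (16)-(19) can be sketched as follows; the `meta` and `sub` objects, with `extract` and `recur` methods, are stand-in callables for the real networks, and greedy selection replaces the ε-greedy rule used during training.

```python
# Illustrative two-level forward pass matching Eqs. (16)-(19): the
# meta-controller encodes its input into features and feeds them, with its
# previous hidden state, into a recurrent cell to obtain subgoal values;
# the sub-controller does the same on (observation, subgoal) to obtain
# action values. All objects here are hypothetical stand-ins.
def hierarchical_step(obs, h_meta, h_sub, meta, sub):
    f_meta = meta.extract(obs)                    # Eq. (16): feature extraction
    h_meta, q_goals = meta.recur(f_meta, h_meta)  # Eq. (17): recurrent update
    g = max(range(len(q_goals)), key=q_goals.__getitem__)  # greedy subgoal
    f_sub = sub.extract((obs, g))                 # Eq. (18)
    h_sub, q_acts = sub.recur(f_sub, h_sub)       # Eq. (19)
    a = max(range(len(q_acts)), key=q_acts.__getitem__)    # greedy action
    return g, a, h_meta, h_sub
```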
The networks for the controllers are illustrated in Fig. 3. For framework 1, we use the network shown in Fig. 3a for both the meta-controller and the sub-controller. A sequence of four CONV layers and ReLU layers, interleaved together, is used to extract information from raw observations. An RNN layer, specifically an LSTM, is employed on top of the extracted features to memorize information from previous observations. The output of the RNN layer is split into an Advantage stream and a Value stream before being unified into the Q-value. This architecture inherits from the Dueling architecture [[47]], which learns the value of states effectively without having to learn the effect of every action on every state. For framework 2, we use the network in Fig. 3a for the sub-controller and the network in Fig. 3b for the meta-controller. The meta-controller in framework 2 takes as input the internal hidden state from the sub-controller. Its features are extracted by passing this state through four fully connected layers and ReLU layers. The rest of the network is the same as in framework 1.
3.1 Learning Model
We use the state-of-the-art Double DQN to learn the parameters of the networks. Particularly, the controllers estimate the following value functions:

(20) Q^m(h_t, g) = E[ R_t + γ^k max_{g′} Q^m(h_{t+k}, g′) ]

and

(21) Q^s(h_t, a) = E[ r_t + γ max_{a′} Q^s(h_{t+1}, a′) ]
where h_t can encode a direct observation or an internal hidden state generated by the sub-controller. Let θ_m and θ_s be the parameters (weights and biases) of the deep neural networks that parameterize the networks Q^m and Q^s, respectively, e.g., Q^m(h, g; θ_m) and Q^s(h, a; θ_s). Then, θ_m and θ_s are trained by minimizing the loss functions L(θ_m) and L(θ_s), respectively. L(θ_m) can be formulated as:
(22) L(θ_m) = E_{(h, g, R, h′) ∼ D_m} [ (y^m_i − Q^m(h, g; θ_m))² ]

where the expectation is taken over a batch of data uniformly sampled from an experience replay D_m, i is the iteration number within the batch of samples, and

(23) y^m_i = R + γ^k Q^m(h′, argmax_{g′} Q^m(h′, g′; θ_m); θ⁻_m)
Similarly, the formula of L(θ_s) is

(24) L(θ_s) = E_{(h, a, r, h′) ∼ D_s} [ (y^s_i − Q^s(h, a; θ_s))² ]

where

(25) y^s_i = r + γ Q^s(h′, argmax_{a′} Q^s(h′, a′; θ_s); θ⁻_s)
D_s is the experience replay that stores transition data from the sub-controller, and θ⁻_s parameterizes the target network of Q^s. Intuitively, in contrast to DQN, which uses the maximum operator for both selecting and evaluating an action, Double DQN uses the main Q-network to greedily select the next action and the target Q-network to evaluate it. This method has been shown to achieve better performance than DQN on Atari games [[46]].
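The select-with-main, evaluate-with-target decomposition used in the targets of Eqs. (23) and (25) can be sketched on arrays; the array names stand in for per-batch network outputs and are illustrative.

```python
import numpy as np

# Sketch of the Double DQN target: the main network chooses the argmax
# action, and the target network evaluates that chosen action, avoiding
# the overestimation of a single max operator.
def double_dqn_target(rewards, q_main_next, q_target_next, dones, gamma=0.99):
    a_star = q_main_next.argmax(axis=1)                    # select with main net
    q_eval = q_target_next[np.arange(len(a_star)), a_star] # evaluate with target net
    return rewards + gamma * (1.0 - dones) * q_eval
```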
3.2 Mini-batch Sampling Strategy
To update the RNNs in our model, we need to pass them a sequence of samples. Particularly, some episodes are uniformly sampled from the experience replay and processed from the beginning of the episode to its end. This strategy, called Bootstrapped Sequential Updates [[41]], is an ideal method for updating RNNs because their hidden state can carry all information through the whole episode. However, this strategy is computationally expensive for long episodes, which can contain many time steps. Another approach proposed in [[41]] has been shown to achieve the same performance as Bootstrapped Sequential Updates while reducing the computational complexity. This strategy, called Bootstrapped Random Updates, proceeds as follows. It also randomly selects a batch of episodes from the experience replay; then, for each episode, it begins at a random transition and selects a sequence of transitions. The effect of the sequence length on the performance of our algorithm is analyzed in Section 4. We apply the same procedure of Bootstrapped Random Updates in our algorithm.
In addition, the mechanism explained in [[51]] is also applied. That study describes a problem when updating DRQN: using the first observations in a sequence of transitions to update the value function might be inaccurate. The solution is therefore to use only the last observations of each sequence to update DRQN. Particularly, our method uses only the last transitions of each sampled sequence to update the Q-value.
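The Bootstrapped Random Updates sampling described above can be sketched as follows; the replay layout (a list of episodes, each a list of transitions) and the parameter names are assumptions of this sketch.

```python
import random

# Hedged sketch of Bootstrapped Random Updates: sample a batch of episodes
# from replay and take a random contiguous window of L transitions from
# each; per the mechanism above, only the final transitions of each window
# would then contribute to the loss.
def sample_sequences(replay, batch_size, L):
    batch = []
    for ep in random.sample(replay, min(batch_size, len(replay))):
        if len(ep) <= L:
            batch.append(ep)              # short episode: take it whole
        else:
            start = random.randrange(len(ep) - L + 1)
            batch.append(ep[start:start + L])
    return batch
```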
3.3 Subgoal Definition
Our model is based on the "options" framework. Learning an option means using flat deep RL algorithms to achieve that option's subgoal. However, discovering subgoals among the existing states of the environment is still one of the open challenges in hierarchical reinforcement learning. To simplify the model, we assume that a set of predefined subgoals is provided in advance. The predefined subgoals are based on object-oriented MDPs [[52]], where entities or objects in the environment are encoded as subgoals.
3.4 Intrinsic and Extrinsic Reward Functions
Traditional RL accumulates all rewards and penalties in a single reward function, which makes specific tasks in complex domains hard to learn. In contrast, hierarchical RL introduces the notions of an intrinsic reward function and an extrinsic reward function. Intrinsic motivation originates in psychology, cognitive science, and neuroscience [[53]] and has been applied to hierarchical RL [[54]][[55]][[56]][[57]][[58]][[59]]. Our framework follows the model of intrinsic motivation in [[55]]. Particularly, within an option (or skill), the agent learns the option's policy (the sub-controller in our framework) to obtain a subgoal (a salient event) reinforced by an intrinsic reward, while for the overall task, a policy over options (the meta-controller) is learned to generate a sequence of subgoals reinforced by an extrinsic reward. Defining "good" intrinsic and extrinsic reward functions is still an open question in reinforcement learning, and it is difficult to find a reward function model that generalizes to all domains. To illustrate these notions, Fig. 4 describes the domain of multiple goals in four-rooms, which is used to evaluate our algorithm in Section 4. The four-rooms map contains a number of objects: an agent (in black), two obstacles (in red), and two goals (in blue and green). These objects are randomly located on the map. At each time step, the agent takes one of four actions: top, down, left, or right, and it must move to the goal locations in the proper order: the blue goal first and then the green goal. If the agent obtains all goals in the right order, it receives a big reward; otherwise, it receives only a small reward. In addition, the agent has to learn to avoid the obstacles if it does not want to be penalized. In this example, a salient event corresponds to reaching a subgoal or hitting an obstacle. There are two skills the agent should learn: moving to the goals while correctly avoiding the obstacles, and selecting which goal to reach first. The intrinsic reward for each skill is generated based on the salient events encountered while exploring the environment. Particularly, for reaching a goal, the intrinsic reward includes a reward for reaching the goal successfully and a penalty if the agent encounters an obstacle. For reaching the goals in order, the reward includes a big reward if the agent reaches the goals in the proper order and a small reward if it reaches them in an improper order. A detailed explanation of the intrinsic and extrinsic rewards for this domain is included in Section 4.
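The intrinsic/extrinsic split described above can be sketched for the four-rooms task; the exact magnitudes below are our assumptions for the sketch, since the text only states that obstacles are penalized, that goals reached in the proper order pay a reward, and that out-of-order goals pay less.

```python
# Illustrative intrinsic/extrinsic reward shaping for the four-rooms task.
# All specific values are assumed for this sketch, not taken from the paper.
def intrinsic_reward(reached_subgoal, hit_obstacle):
    if hit_obstacle:
        return -1.0          # penalty for hitting an obstacle
    return 1.0 if reached_subgoal else 0.0

def extrinsic_reward(goal_reached_in_order, reached_out_of_order):
    if reached_out_of_order:
        return 0.1           # small reward for an improper order (assumed value)
    return 1.0 if goal_reached_in_order else 0.0
```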
3.5 Learning Algorithm

The inputs and initialization of the algorithm are as follows: two experience replay memories D_m and D_s; the parameters θ_m and θ_s, randomly initialized and assigned to the target networks θ⁻_m and θ⁻_s; and the exploration rates ε₁ and ε₂, each decreasing from 1.0 to 0.1.
In this section, our contributions are summarized in the pseudocode of Algorithm 1. The algorithm learns four neural networks: two for the meta-controller (the main and target networks of Q^m) and two for the sub-controller (the main and target networks of Q^s), parameterized by θ_m, θ⁻_m, θ_s, and θ⁻_s. The architectures of the networks are described in Section 3. In addition, the algorithm separately maintains two experience replay memories, D_m and D_s, to store transition data from the meta-controller and the sub-controller, respectively. Before the algorithm starts, the parameters of the main networks are randomly initialized and copied to the target networks. The exploration rates ε₁ and ε₂ are annealed from 1.0 to 0.1, gradually shifting control to the learned controllers. The algorithm loops through a specified number of episodes (Line 9), and each episode is executed until the agent reaches the terminal state. To start an episode, a starting observation is first obtained (Line 10). Next, the hidden states, which are inputs to the RNNs, are initialized with zero values (Lines 11 and 13) and updated during the episode (Lines 15 and 17). Each subgoal is determined by passing the observation or hidden state (depending on the framework) to the meta-controller (Line 15). Following an ε-greedy strategy, the subgoal is selected by the meta-controller if a random number is greater than ε₁; otherwise, a random subgoal is selected. The sub-controller is trained to reach the subgoal; when the subgoal is reached, a new subgoal is selected. The process is repeated until the final goal is obtained. The intrinsic reward is evaluated by the critic module and stored in D_s (Line 19) for updating the sub-controller. Meanwhile, the extrinsic reward is received directly from the environment and stored in D_m for updating the meta-controller (Line 24). Updating the controllers (Lines 20 and 21) is described in Section 3.1 and summarized in the corresponding update procedures.
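The linear annealing of the exploration rates mentioned above can be sketched as follows; the decay horizon is an assumed parameter of the sketch.

```python
# Sketch of linear epsilon annealing: the exploration rate starts at 1.0
# and decays to 0.1 over a fixed number of steps, gradually shifting
# control from random choices to the learned controllers.
def anneal_epsilon(step, start=1.0, end=0.1, decay_steps=10000):
    frac = min(step / decay_steps, 1.0)   # clamp after the decay horizon
    return start + frac * (end - start)
```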
4 Experiments
In this section, we evaluate two versions of the hierarchical deep recurrent network algorithm: hDRQNv1, formed by framework 1, and hDRQNv2, formed by framework 2. We compare them with flat algorithms (DQN, DRQN) and the state-of-the-art hierarchical RL algorithm hDQN. The comparisons are performed on three domains. The domain of multiple goals in a grid-world is used to evaluate many aspects of the proposed algorithms, while the harder domain, multiple goals in four-rooms, is used to benchmark them. Finally, Montezuma's Revenge, one of the most challenging games in the ATARI 2600 suite [[9]], is used to confirm the efficiency of our proposed framework.
4.1 Implementation Details
We use Tensorflow [[60]] to implement our algorithms. The settings differ per domain but share the following commonalities. For the hDRQNv1 algorithm, the inputs to the meta-controller and sub-controller are color images resized from the observation around the agent (either the smaller or the larger local view). The image feature of 256 values, extracted through four CONV and ReLU layers, is fed into an LSTM layer of 256 units to generate 256 output values, and an internal hidden state of 256 values is also constructed. For the hDRQNv2 algorithm, a hidden state of 256 values is fed into the meta-controller network; this state is passed through four fully connected and ReLU layers instead of four CONV layers, and the output is a feature of 256 values. The algorithm uses ADAM [[61]] to learn the neural network parameters, with the same learning rate for both the meta-controller and sub-controller networks. A fixed update rate is applied for updating the target networks, and a fixed discount factor is used. The capacities of D_m and D_s are set separately.
4.2 Multiple Goals in a Gridworld
The domain of multiple goals in a grid-world is a simpler form of multiple goals in four-rooms, which is described in Section 3.4. In this domain, a grid-world map is used instead of the four-rooms map. At each time step, the agent observes only a part of the surrounding environment, either the smaller or the larger local window. The agent chooses one of four deterministic actions (top, left, right, bottom); it cannot move if the action leads it into a wall. The rewards for the agent are defined as follows. If the agent hits an obstacle, it receives a penalty of minus one. If the agent reaches the two goals in the proper order, it receives a reward of one for each goal; otherwise, it receives only a small reward. For the hDRQN algorithms, an intrinsic reward function and an extrinsic reward function are defined as follows
\[
r^{in}(o_t, g_t) = \begin{cases} 1 & \text{if the agent reaches the subgoal } g_t \\ 0 & \text{otherwise} \end{cases} \tag{26}
\]
and
\[
r^{ex}(o_t) = \begin{cases} -1 & \text{if the agent hits an obstacle} \\ 1 & \text{if the agent reaches a goal in the proper order} \\ 0 & \text{otherwise} \end{cases} \tag{27}
\]
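A minimal sketch of these two reward functions, assuming the values implied by the text (a penalty of -1 for obstacles, +1 per goal collected in the proper order, and an intrinsic reward of 1 for reaching the chosen subgoal); the signatures and names are illustrative, not the paper's implementation.

```python
def intrinsic_reward(agent_pos, subgoal_pos):
    """Intrinsic reward for the sub-controller: 1 when the chosen subgoal is reached."""
    return 1.0 if agent_pos == subgoal_pos else 0.0

def extrinsic_reward(agent_pos, goals, goals_reached, hit_obstacle):
    """Extrinsic reward for the meta-controller.

    goals: goal positions in the required order.
    goals_reached: number of goals collected so far.
    """
    if hit_obstacle:
        return -1.0
    if goals_reached < len(goals) and agent_pos == goals[goals_reached]:
        return 1.0  # next goal reached in the proper order
    return 0.0

# Example: the agent reaches the first goal of the ordered pair.
r = extrinsic_reward((2, 3), goals=[(2, 3), (5, 5)], goals_reached=0, hit_obstacle=False)
```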
The first evaluation, reported in Fig. 6, compares different lengths of selected transitions (discussed in Section 3.2). The agent in this evaluation can observe an area of unit. We report the performance averaged over three runs of 20000 episodes, each with 50 steps; this number of steps ensures that the agent can explore any location on the map. In the figures on the left (hDRQNv1) and in the middle (hDRQNv2), we use a fixed length of meta-transitions () and compare different lengths of sub-transitions. In contrast, the figures on the right show the performance of the algorithm using a fixed length of sub-transitions () while comparing different lengths of meta-transitions. With a fixed length of meta-transitions, the algorithm performs well with long sub-transitions ( or ), and the performance decreases as the length of sub-transitions is decreased. Intuitively, the RNN needs a sequence of transitions long enough to raise the probability that the agent reaches the subgoal within that sequence. Another observation is that, with or , there is little difference in performance. This is reasonable because only eight transitions are needed for the agent to reach the subgoals. For a fixed length of sub-transitions (), with the hDRQNv1 algorithm, the setting with has lower performance and higher variance than the setting with . The reason is that, while the sub-controller performs equally well in both settings (Fig. f), the meta-controller with performs better than the meta-controller with . Meanwhile, with the hDRQNv2 algorithm, the performance is the same in both settings and . This suggests that the hidden state from the sub-controller is a better input for determining the subgoal than a raw observation, as it makes the algorithm independent of the length of the meta-transition. The number of steps needed to obtain the two goals in order is around 22.
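The meta-transition and sub-transition lengths compared here amount to the window sizes used when sampling sequences for the recurrent controllers. A sketch of such a sampler, under an assumed buffer layout (a list of episodes, each a list of transition tuples); names and structure are illustrative:

```python
import random

def sample_sequences(episodes, seq_len, batch_size, rng=random):
    """Sample `batch_size` contiguous windows of `seq_len` transitions each.

    episodes: list of episodes, each a list of transition tuples.
    Episodes shorter than seq_len are skipped.
    """
    eligible = [ep for ep in episodes if len(ep) >= seq_len]
    batch = []
    for _ in range(batch_size):
        ep = rng.choice(eligible)
        start = rng.randrange(len(ep) - seq_len + 1)
        batch.append(ep[start:start + seq_len])
    return batch

# Toy buffer: two episodes of dummy (observation, action, reward) transitions.
buffer = [[("obs", a, 0.0) for a in range(10)],
          [("obs", a, 0.0) for a in range(6)]]
windows = sample_sequences(buffer, seq_len=4, batch_size=3)
```

Longer windows give the RNN more context per update, matching the observation above that too-short sub-transitions degrade performance.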



The next evaluation compares different levels of observability. Fig. 7 shows the performance of the hDRQN algorithms with a observable agent compared with a observable agent and a fully observable agent. Clearly, a fully observable agent has more information about its surroundings than a observable agent or a observable agent; thus, an agent with a larger observation area can more quickly explore and localize within the environment. As a result, agents with larger observation areas outperform agents with smaller ones. From the figure, the observable agent using hDRQNv2 seems to converge faster than the fully observable agent; however, the fully observable agent's performance surpasses it in the end.
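Each level of observability above corresponds to cropping a fixed-size window around the agent from the full grid. A minimal sketch; the padding value for out-of-map cells (standing in for walls) is an assumption:

```python
import numpy as np

def local_observation(grid, agent_pos, radius, pad_value=-1):
    """Return the (2*radius+1) x (2*radius+1) window centred on the agent.

    Cells that fall outside the map are filled with `pad_value`.
    """
    h, w = grid.shape
    r, c = agent_pos
    obs = np.full((2 * radius + 1, 2 * radius + 1), pad_value, dtype=grid.dtype)
    r0, r1 = max(0, r - radius), min(h, r + radius + 1)
    c0, c1 = max(0, c - radius), min(w, c + radius + 1)
    # Copy the visible part of the map into the right region of the window.
    obs[r0 - (r - radius):r1 - (r - radius),
        c0 - (c - radius):c1 - (c - radius)] = grid[r0:r1, c0:c1]
    return obs

grid = np.arange(25).reshape(5, 5)
obs = local_observation(grid, agent_pos=(0, 0), radius=1)  # corner: top/left padded
```

A larger `radius` approaches full observability, which is why the fully observable agent eventually dominates in Fig. 7.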
In the last evaluation of this domain, we compare the proposed algorithms with the well-known algorithms DQN, DRQN, and hDQN [[37]]. All algorithms assume that the agent observes only an area of units around it. The results are shown in Fig. 8. For both the two-goal and three-goal domains, the hDRQN algorithms outperform the other algorithms, and hDRQNv2 performs best. The hDQN algorithm, which can operate in hierarchical domains, is better than the flat algorithms but worse than the hDRQN algorithms; this is likely because hDQN is designed for fully observable domains and performs poorly under partial observability.

4.3 Multiple Goals in a Four-rooms Domain
In this domain, we apply the multiple goals domain to a more complex map called four-rooms (Fig. 9). The dynamics of the environment are similar to those of the task in Section 4.2. The agent typically must pass through hallways to obtain goals that are randomly located in the four rooms. The four-rooms domain was originally an environment for testing hierarchical reinforcement learning algorithms [[21]].
The performance shown in Fig. a is averaged over three runs of 50000 episodes, each with 50 time steps, while the performance shown in Fig. b is averaged over three runs of 100000 episodes, each with 100 time steps. As before, the proposed algorithms outperform the other algorithms, especially hDRQNv2.

4.4 Montezuma’s Revenge game in ATARI 2600
Montezuma's Revenge is one of the hardest games in ATARI 2600, on which the DQN algorithm [[10]] can only achieve a score of zero. We use OpenAI Gym [[62]] to simulate this domain. The game is hard because the agent must execute a long sequence of actions before it can visit a state with nonzero reward (delayed reward). In addition, in order to reach states with larger rewards, the agent first needs to reach a special state. This paper evaluates our proposed algorithms on the first screen of the game (Fig. 11). In particular, the agent, which observes only a part of the environment (Fig. b), needs to pass through the doors (the yellow lines in the top-left and top-right corners of the figure) to explore other screens. However, to pass through the doors, the agent first needs to pick up the key on the left side of the screen. Thus, the agent must learn to navigate to the key's location and then navigate back to a door to open the next screens. The agent earns a reward of 100 after obtaining the key and 300 after reaching a door, for a total of 400 on this screen.
The intrinsic reward function is defined to motivate the agent to explore the whole environment. In particular, the agent receives an intrinsic reward of 1 if it reaches a subgoal from another subgoal. The set of subgoals is predefined in Fig. a (the white rectangles). In contrast, the extrinsic reward function gives a reward of 1 when the agent obtains the key or opens a door. Because learning the meta-controller and the sub-controllers simultaneously is highly complex and time-consuming, we separate the learning process into two phases. In the first phase, we train the sub-controllers completely, so that the agent can explore the whole environment by moving between subgoals. In the second phase, we train the meta-controller and the sub-controllers together. The architecture of the meta-controller and the sub-controllers is described in Section 4.1. The lengths of sub-transitions and meta-transitions are and , respectively. In this domain, the agent observes an area of pixels, which is then resized to to fit the input of a controller network. The performance of the proposed algorithms compared to the baselines is shown in Fig. 12. DQN scores zero, matching the result in [[10]]. DRQN, which can perform well in partially observable environments, also scores zero because of the domain's highly hierarchical complexity. Meanwhile, hDQN can achieve a high score on this domain but cannot perform well in the partially observable setting; the performance of hDQN under full observability can be found in the paper by Kulkarni et al. [[37]]. Our proposed algorithms adapt to both the partial observability and the hierarchical structure of the domain. The hDRQNv2 algorithm performs better than hDRQNv1; the difference in the architecture of the two frameworks (described in Section 3) appears to affect their performance.
In particular, using the internal states of a sub-controller as the input to the meta-controller provides more information for prediction than using only a raw observation. To further evaluate the two algorithms, we report the success ratio for reaching the goal "key" in Fig. 13 and the number of time steps the agent spends exploring each subgoal in Fig. 14. Fig. 13 shows that the agent using the hDRQNv2 algorithm almost always picks up the "key" by the end of the learning process. Moreover, Fig. 14 shows that hDRQNv2 tends to explore subgoals on the way to the "key" (e.g. the top-right ladder, bottom-right ladder, and bottom-left ladder) more often, while exploring other subgoals, such as the left door and right door, less often.
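The subgoal statistics in Fig. 14 rely on detecting when the agent enters one of the predefined subgoal rectangles. A hedged sketch of that check; the coordinates below are placeholders, not the actual subgoal locations from Fig. a:

```python
def reached_subgoal(agent_xy, subgoal_boxes):
    """Return the index of the first subgoal rectangle containing the agent, else None.

    subgoal_boxes: list of (x_min, y_min, x_max, y_max) screen rectangles.
    """
    x, y = agent_xy
    for i, (x0, y0, x1, y1) in enumerate(subgoal_boxes):
        if x0 <= x <= x1 and y0 <= y <= y1:
            return i
    return None

def intrinsic_reward(agent_xy, current_subgoal, subgoal_boxes):
    """Intrinsic reward of 1 when the agent enters its current subgoal's rectangle."""
    return 1.0 if reached_subgoal(agent_xy, subgoal_boxes) == current_subgoal else 0.0

# Placeholder rectangles for two subgoals (e.g. a ladder and the key).
boxes = [(10, 10, 20, 30), (100, 40, 110, 60)]
r = intrinsic_reward((15, 20), current_subgoal=0, subgoal_boxes=boxes)
```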
5 Conclusion
We introduced a new hierarchical deep reinforcement learning algorithm that provides a learning framework for both full observability (MDP) and partial observability (POMDP). The algorithm takes advantage of deep neural networks (DNN, CNN, LSTM) to produce hierarchical policies that can solve highly nonlinear hierarchical domains. We showed that the framework performs well when learning in hierarchical POMDP environments. Nevertheless, our approach has the following limitations. First, our framework is built on two levels of hierarchy, and so does not fit domains with more levels of hierarchy. Second, to simplify the learning problem in hierarchical POMDP, we assume that the set of subgoals is predefined and fixed, because discovering a set of subgoals in POMDP remains a hard problem. In the future, we plan to improve our framework by tackling these problems. In addition, we can apply hDRQN to multi-agent problems where the environment is partially observable and the task is hierarchical.
Acknowledgment
The authors are grateful for support from the Basic Science Research Program through the National Research Foundation of Korea (NRF-2017R1D1A1B04036354).
References
 R. S. Sutton and A. G. Barto, Introduction to Reinforcement Learning, 1st ed. Cambridge, MA, USA: MIT Press, 1998.
 M. P. Deisenroth, G. Neumann, J. Peters et al., “A survey on policy search for robotics,” Foundations and Trends® in Robotics, vol. 2, no. 1–2, pp. 1–142, 2013.
 J. Peters and S. Schaal, “Policy gradient methods for robotics,” in Intelligent Robots and Systems, 2006 IEEE/RSJ International Conference on. IEEE, 2006, pp. 2219–2225.
 ——, “Natural actor-critic,” Neurocomputing, vol. 71, no. 7, pp. 1180–1190, 2008, Progress in Modeling, Theory, and Application of Computational Intelligence. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0925231208000532
 J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in Proceedings of the 32nd International Conference on Machine Learning (ICML-15), 2015, pp. 1889–1897.
 J. W. Lee, “Stock price prediction using reinforcement learning,” in Industrial Electronics, 2001. Proceedings. ISIE 2001. IEEE International Symposium on, vol. 1. IEEE, 2001, pp. 690–695.
 J. Moody and M. Saffell, “Learning to trade via direct reinforcement,” IEEE transactions on neural Networks, vol. 12, no. 4, pp. 875–889, 2001.
 Y. Deng, F. Bao, Y. Kong, Z. Ren, and Q. Dai, “Deep direct reinforcement learning for financial signal representation and trading,” IEEE transactions on neural networks and learning systems, vol. 28, no. 3, pp. 653–664, 2017.
 V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602, 2013.
 V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
 D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., “Mastering the game of go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, 2016.
 G. Tesauro, D. Gondek, J. Lenchner, J. Fan, and J. M. Prager, “Simulation, learning, and optimization techniques in watson’s game strategies,” IBM Journal of Research and Development, vol. 56, no. 3.4, pp. 16–1, 2012.
 G. Tesauro, “TD-Gammon: A self-teaching backgammon program,” in Applications of Neural Networks. Springer, 1995, pp. 267–285.
 D. Ernst, M. Glavic, and L. Wehenkel, “Power systems stability control: Reinforcement learning framework,” IEEE transactions on power systems, vol. 19, no. 1, pp. 427–435, 2004.
 J. Kober and J. Peters, “Reinforcement learning in robotics: A survey,” in Reinforcement Learning. Springer, 2012, pp. 579–610.
 R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” in Advances in neural information processing systems, 2000, pp. 1057–1063.
 P. Dayan and G. E. Hinton, “Using expectation-maximization for reinforcement learning,” Neural Computation, vol. 9, no. 2, pp. 271–278, 1997.
 D. E. Moriarty, A. C. Schultz, and J. J. Grefenstette, “Evolutionary algorithms for reinforcement learning,” 1999.
 V. R. Konda and J. N. Tsitsiklis, “Actor-critic algorithms,” in Advances in neural information processing systems, 2000, pp. 1008–1014.
 S. Sukhbaatar, A. Szlam, G. Synnaeve, S. Chintala, and R. Fergus, “Mazebase: A sandbox for learning from games,” arXiv preprint arXiv:1511.07401, 2015.
 R. S. Sutton, D. Precup, and S. Singh, “Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning,” Artificial Intelligence, vol. 112, no. 1–2, pp. 181–211, 1999.
 R. E. Parr, Hierarchical control and learning for Markov decision processes. University of California, Berkeley Berkeley, CA, 1998.
 R. Parr and S. J. Russell, “Reinforcement learning with hierarchies of machines,” in Advances in neural information processing systems, 1998, pp. 1043–1049.
 T. G. Dietterich, “Hierarchical reinforcement learning with the MAXQ value function decomposition,” Journal of Artificial Intelligence Research, vol. 13, no. 1, pp. 227–303, 2000.
 N. A. Vien, H. Q. Ngo, S. Lee, and T. Chung, “Approximate planning for bayesian hierarchical reinforcement learning,” Appl. Intell., vol. 41, no. 3, pp. 808–819, 2014.
 N. A. Vien and M. Toussaint, “Hierarchical Monte-Carlo planning,” in Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, January 25–30, 2015, Austin, Texas, USA, 2015, pp. 3613–3619.
 N. A. Vien, S. Lee, and T. Chung, “Bayes-adaptive hierarchical MDPs,” Appl. Intell., vol. 45, no. 1, pp. 112–126, 2016.
 A. G. Barto and S. Mahadevan, “Recent advances in hierarchical reinforcement learning,” Discrete Event Dynamic Systems, vol. 13, no. 4, pp. 341–379, 2003.
 M. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos, “Unifying count-based exploration and intrinsic motivation,” in Advances in Neural Information Processing Systems, 2016, pp. 1471–1479.
 P.-L. Bacon, J. Harb, and D. Precup, “The option-critic architecture,” 2017.
 R. Fox, S. Krishnan, I. Stoica, and K. Goldberg, “Multi-level discovery of deep options,” arXiv preprint arXiv:1703.08294, 2017.
 S. Lee, S.-W. Lee, J. Choi, D.-H. Kwak, and B.-T. Zhang, “Micro-objective learning: Accelerating deep reinforcement learning through the discovery of continuous subgoals,” arXiv preprint arXiv:1703.03933, 2017.
 A. S. Vezhnevets, S. Osindero, T. Schaul, N. Heess, M. Jaderberg, D. Silver, and K. Kavukcuoglu, “Feudal networks for hierarchical reinforcement learning,” arXiv preprint arXiv:1703.01161, 2017.
 I. P. Durugkar, C. Rosenbaum, S. Dernbach, and S. Mahadevan, “Deep reinforcement learning with macro-actions,” arXiv preprint arXiv:1606.04615, 2016.
 K. Arulkumaran, N. Dilokthanakul, M. Shanahan, and A. A. Bharath, “Classifying options for deep reinforcement learning,” arXiv preprint arXiv:1604.08153, 2016.
 P. Dayan and G. E. Hinton, “Feudal reinforcement learning,” in Advances in neural information processing systems, 1993, pp. 271–278.
 T. D. Kulkarni, K. Narasimhan, A. Saeedi, and J. Tenenbaum, “Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation,” in Advances in Neural Information Processing Systems, 2016, pp. 3675–3683.
 C.-C. Chiu and V.-W. Soo, “Subgoal identifications in reinforcement learning: A survey,” in Advances in Reinforcement Learning. InTech, 2011.
 M. Stolle, “Automated discovery of options in reinforcement learning,” Ph.D. dissertation, McGill University, 2004.
 K. P. Murphy, “A survey of pomdp solution techniques,” environment, vol. 2, p. X3, 2000.
 M. Hausknecht and P. Stone, “Deep recurrent q-learning for partially observable MDPs,” 2015.
 M. Egorov, “Deep reinforcement learning with POMDPs,” 2015.
 C. C. White, “Procedures for the solution of a finite-horizon, partially observed, semi-Markov optimization problem,” Operations Research, vol. 24, no. 2, pp. 348–358, 1976.
 L. P. Kaelbling, M. L. Littman, and A. R. Cassandra, “Planning and acting in partially observable stochastic domains,” Artificial intelligence, vol. 101, no. 1, pp. 99–134, 1998.
 I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
 H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double q-learning,” 2016.
 Z. Wang, T. Schaul, M. Hessel, H. Van Hasselt, M. Lanctot, and N. De Freitas, “Dueling network architectures for deep reinforcement learning,” arXiv preprint arXiv:1511.06581, 2015.
 T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience replay,” arXiv preprint arXiv:1511.05952, 2015.
 S. Gu, T. Lillicrap, I. Sutskever, and S. Levine, “Continuous deep q-learning with model-based acceleration,” in International Conference on Machine Learning, 2016, pp. 2829–2838.
 T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
 G. Lample and D. S. Chaplot, “Playing fps games with deep reinforcement learning.” 2017.
 C. Diuk, A. Cohen, and M. L. Littman, “An object-oriented representation for efficient reinforcement learning,” in Proceedings of the 25th international conference on Machine learning. ACM, 2008, pp. 240–247.
 R. M. Ryan and E. L. Deci, “Intrinsic and extrinsic motivations: Classic definitions and new directions,” Contemporary educational psychology, vol. 25, no. 1, pp. 54–67, 2000.
 A. Stout, G. D. Konidaris, and A. G. Barto, “Intrinsically motivated reinforcement learning: A promising framework for developmental robot learning,” MASSACHUSETTS UNIV AMHERST DEPT OF COMPUTER SCIENCE, Tech. Rep., 2005.
 A. G. Barto, “Intrinsically motivated learning of hierarchical collections of skills,” 2004, pp. 112–119.
 S. Singh, R. L. Lewis, A. G. Barto, and J. Sorg, “Intrinsically motivated reinforcement learning: An evolutionary perspective,” IEEE Transactions on Autonomous Mental Development, vol. 2, no. 2, pp. 70–82, 2010.
 M. Frank, J. Leitner, M. Stollenga, A. Förster, and J. Schmidhuber, “Curiosity driven reinforcement learning for motion planning on humanoids,” Frontiers in neurorobotics, vol. 7, p. 25, 2014.
 S. Mohamed and D. J. Rezende, “Variational information maximisation for intrinsically motivated reinforcement learning,” in Advances in neural information processing systems, 2015, pp. 2125–2133.
 J. Schmidhuber, “Formal theory of creativity, fun, and intrinsic motivation (1990–2010),” IEEE Transactions on Autonomous Mental Development, vol. 2, no. 3, pp. 230–247, 2010.
 M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015, software available from tensorflow.org.
 D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
 G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “Openai gym,” 2016.