Hierarchical Reinforcement Learning in StarCraft II with Human Expertise in Subgoals Selection
This work is inspired by recent advances in hierarchical reinforcement learning (HRL) [recent-hrl-Barto2003, hrl-springer-Hengst2010] and by improvements in learning efficiency from heuristic-based subgoal selection, experience replay [experience-replay-Lin1993, HER-Andrychowicz2017], and task-based curriculum learning [Bengio2009-curriculum-learning, Zaremba2014-curriculum-task-specific]. We propose a new method that integrates HRL, experience replay, and effective subgoal selection through an implicit curriculum design based on human expertise, to support sample-efficient learning and to enhance the interpretability of the agent's behavior. Human expertise remains indispensable in many areas such as medicine [ai-medicine-Buch2018] and law [ai-law-Cath2018], where interpretability, explainability, and transparency are crucial in the decision-making process for ethical and legal reasons. Our method simplifies the complex task sets needed to achieve the overall objectives by decomposing them into subgoals at different levels of abstraction. Incorporating relevant subjective knowledge also significantly reduces the computational resources spent on exploration in RL, especially in fast, changing, and complex environments where the transition dynamics cannot be effectively learned and modelled in a short time. Experimental results on two StarCraft II (SC2) [Vinyals2017] minigames demonstrate that our method achieves better sample efficiency than flat, end-to-end RL methods and provides an effective way to explain the agent's performance.
Reinforcement learning (RL) [RLbook2018] enables agents to learn how to take actions, by interacting with an environment, to maximize a series of rewards received over time. In combination with advances in deep learning and computational resources, the Deep Reinforcement Learning (DRL) [dqn-Mnih2013] formulation has led to dramatic results in acting from perception [human-level-control-Mnih2015], game playing [go-Silver2016a], and robotics [robotics-OpenAI2018]. However, DRL usually requires extensive computation to achieve satisfactory performance. For example, in full-length StarCraft II (SC2) games, AlphaStar [alphaStar] achieves superhuman performance at the expense of huge computational resources.
We argue that learning a new task in general, or SC2 minigames in particular, is a two-stage process, viz., learning the fundamentals and mastering the skills. For SC2 minigames, novice human players learn the minigame fundamentals reasonably quickly by decomposing the game into smaller, distinct and necessary steps. However, to achieve mastery over the minigame, humans take a long time, mainly to practice the precision of their skills. RL agents, on the other hand, may take a long time to learn the fundamentals of the gameplay but achieve mastery (stage two) efficiently. This can be observed from the training progress curves in [Vinyals2017], which show spikes followed by plateaus of reward signals instead of steady and gradual increases.
We want to leverage human expertise to reduce the ‘warm-up’ time required by RL agents. The Hierarchical Reinforcement Learning (HRL) framework [HRL-subgoalsubpolicy-Bakker2004, HER-HRL-Evel2019] comprises a general layered architecture that supports different levels of abstraction, corresponding to human expertise at the high level and the agent's skills at the low-level manoeuvres. Intuitively, HRL provides a way to combine the best of human expertise and the agent by organizing the inputs from humans at a high level (more abstract) and those from the agent at a lower level (more precise). In this work, we extend the HRL framework to incorporate human expertise in subgoal selection. We demonstrate the effects of our methods in mastering SC2 minigames, and present preliminary results on sample efficiency and interpretability over flat RL methods.
The rest of the paper is organized as follows. We briefly outline the background information in the next section. Next, we describe our proposed methodology. Further, we discuss the related works and present our experimental results. We then conclude the paper highlighting opportunities for future work.
Markov decision process and Reinforcement learning:
A Markov decision process (MDP) is a five-tuple $(\mathcal{S}, \mathcal{A}, R, T, \gamma)$, where $\mathcal{S}$ is the set of states the agent can be in; $\mathcal{A}$ is the set of possible actions available to the agent; $R : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function; $T(s' \mid s, a)$ is the transition function; and $\gamma \in [0, 1)$ is the discount factor that denotes the usefulness of future rewards. We consider the standard formalism of reinforcement learning where an agent continuously interacts with a fully observable environment, defined using an MDP. A deterministic policy is a mapping $\pi : \mathcal{S} \to \mathcal{A}$, and we can describe a sequence of actions and reward signals from the environment. Every episode begins with an initial state $s_0$. At each time step $t$, the agent takes an action $a_t = \pi(s_t)$, and gets a reward $r_t = R(s_t, a_t)$. At the same time, $s_{t+1}$ is sampled from $T(\cdot \mid s_t, a_t)$. Over time, the discounted cumulative reward, called return, is calculated as $R_t = \sum_{i=t}^{\infty} \gamma^{i-t} r_i$. The agent's task is to maximize the expected return $\mathbb{E}[R_0]$. Furthermore, the Q-function (or action-value function) is defined as $Q^{\pi}(s, a) = \mathbb{E}[R_t \mid s_t = s, a_t = a]$. Assuming an optimal policy $\pi^*$ exists, all optimal policies have the same Q-function, called the optimal Q-function, denoted $Q^*$, satisfying this Bellman equation: $Q^*(s, a) = \mathbb{E}_{s' \sim T(\cdot \mid s, a)}\left[R(s, a) + \gamma \max_{a'} Q^*(s', a')\right]$.
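As a concrete illustration of the return defined above, a minimal sketch (our own toy code, not from the paper) computes $R_0$ for a finite episode by folding the rewards from the end of the episode backwards:

```python
def discounted_return(rewards, gamma):
    """Discounted return R_0 = sum_i gamma^i * r_i for a finite episode,
    computed by folding from the last reward backwards."""
    ret = 0.0
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret
```

For example, `discounted_return([1, 1, 1], 0.9)` gives $1 + 0.9 + 0.81 \approx 2.71$.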
Q-function Approximators The above definitions enable one possible solution to MDPs: using a function approximator for $Q^*$. Deep-Q-Networks (DQN) [dqn-Mnih2013] and Deep Deterministic Policy Gradients (DDPG) [DDPG-Lillicrap2016] are such approaches tackling model-free RL problems. Typically, a neural network $Q$ is trained to approximate $Q^*$. During training, experiences are generated via an exploration policy, usually an $\epsilon$-greedy policy with respect to the current $Q$. The experience tuples $(s_t, a_t, r_t, s_{t+1})$ are stored in a replay buffer. $Q$ is trained using gradient descent with respect to the loss $\mathcal{L} = \mathbb{E}\left[(Q(s_t, a_t) - y_t)^2\right]$, where $y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a')$, with experiences sampled from the replay buffer.
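A toy tabular stand-in for this target and loss (a dictionary $Q$ in place of the neural network; the function names are ours, not the paper's):

```python
def td_target(Q, r, s_next, actions, gamma):
    """Bellman target y_t = r_t + gamma * max_a' Q(s_{t+1}, a'),
    with Q a dict mapping (state, action) -> value."""
    return r + gamma * max(Q[(s_next, a)] for a in actions)

def td_loss(Q, batch, actions, gamma):
    """Mean squared TD error over a sampled batch of (s, a, r, s_next)."""
    errors = [(Q[(s, a)] - td_target(Q, r, s_next, actions, gamma)) ** 2
              for (s, a, r, s_next) in batch]
    return sum(errors) / len(errors)
```

In DQN the gradient of this loss with respect to the network parameters drives the update; here the tabular version only illustrates the quantity being minimized.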
An exploration policy is a policy that describes how the agent interacts with the environment. For instance, a policy that picks actions randomly encourages exploration. On the other hand, a greedy policy with respect to $Q$, as in $\pi(s) = \arg\max_{a} Q(s, a)$, encourages exploitation. To balance these, the standard approach of $\epsilon$-greedy [RLbook2018] is adopted: with probability $\epsilon$ take a random action, and with probability $1 - \epsilon$ take the greedy action.
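The $\epsilon$-greedy rule itself is a few lines; here is a sketch with a dictionary standing in for the learned $Q$ (illustrative, not the paper's implementation):

```python
import random

def epsilon_greedy(Q, s, actions, epsilon):
    """With probability epsilon take a random action; otherwise take the
    greedy action argmax_a Q(s, a). Q is a dict (state, action) -> value,
    a toy stand-in for the neural-network approximator."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])
```

With `epsilon = 0` the policy is purely greedy; with `epsilon = 1` it is purely exploratory.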
Goal Space \citeauthor{uvfa-Schaul2015} \shortcite{uvfa-Schaul2015} extended DQN to include a goal space $\mathcal{G}$. A (sub)goal $g \in \mathcal{G}$ can be described with specifically selected states, or via functions such as $f_g : \mathcal{S} \to \{0, 1\}$, indicating whether a state is a goal or not. Introducing $\mathcal{G}$ modifies the original reward function slightly: $r_t = R(s_t, a_t, g)$, $\forall t$. At the beginning of each episode, in addition to $s_0$, the initialization includes a fixed $g$ to create a tuple $(s_0, g)$. Other modifications naturally follow: $\pi(s_t, g)$, $Q(s_t, a_t, g)$, and $Q^*(s, a, g)$.
Experience Replay \citeauthor{experience-replay-Lin1993} \shortcite{experience-replay-Lin1993} proposed the idea of using ‘experience buffers’ to help machines learn. Formally, a single time-step experience is defined as a tuple $(s_t, a_t, r_t, s_{t+1})$, and more generally an experience can be constructed by concatenating multiple consecutive experience tuples.
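A minimal replay buffer along these lines (a generic sketch, not tied to the paper's codebase):

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal experience-replay buffer storing (s, a, r, s_next) tuples.
    Old experiences are discarded once capacity is reached."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform sampling without replacement, as in vanilla DQN.
        return random.sample(list(self.buffer), batch_size)
```

During training, mini-batches drawn from such a buffer decorrelate consecutive experiences and allow each experience to be reused multiple times.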
Curriculum Learning Methods in this framework typically explicitly or implicitly design a series of tasks or goals (with gradually increased difficulties) for the agent to follow and learn, i.e., the curriculum [Bengio2009-curriculum-learning, weng2020curriculum-blog].
StarCraft II SC2 is a real-time-strategy (RTS) game, where players command their units to compete against each other. In an SC2 full-length game, typically players start out by commanding units to collect resources (minerals and gas) to build up their economy and army at the same time. When they have amassed a sufficiently large army, they command these units to attack their opponents’ base in order to win. SC2 is currently a very promising simulation environment for RL, due to its high flexibility and complexity and wide-ranging applicability in the fields of game theory, planning and decision making, operations optimization, etc. SC2 minigames, as opposed to full-length games described above, are built-in episodic tutorials where novice players can learn and practice their skills in a controlled and less complex environment. Some relevant skills include collecting resources, building certain army units, etc.
We propose a novel method of integrating the advantages of human expertise and RL agents to facilitate learning the fundamentals and mastering the skills of a learning task. Our method adopts the principle of Curriculum Learning [Bengio2009-curriculum-learning] and follows a task-oriented approach [Zaremba2014-curriculum-task-specific]. The key idea is to leverage human expertise to simplify the complex learning procedure by decomposing it into hierarchical subgoals that form the curriculum for the agent. More specifically, we factorize the learning task into several successive subtasks indispensable for the agent to complete the entire complex learning procedure. The customized reward function in each subtask implicitly captures the corresponding subgoal. Importantly, these successive subgoals are arranged to be gradually more difficult, to improve learning efficiency [Bengio2009-curriculum-learning, Justesen2018-illuminating-increasing-difficulty]. With the subgoals defined, we use the Experience Replay technique to construct the experiences, further improving empirical sample efficiency [HER-Andrychowicz2017, HRL-subgoalsubpolicy-Bakker2004, HER-HRL-Evel2019]. Furthermore, adopting clearly defined subtasks and subgoals enhances the interpretability of the agent's learning progress. In our implementation, we customize SC2 minigames to embed human expertise on subgoal information and on the criteria to identify and select subgoals during learning. The agent thus learns the subpolicies and combines them in a hierarchical way. By following a well-defined decomposition of the original minigame into subtasks, we can choose the desired state of a previous subtask to be the starting condition of the next subtask, thus completing the connection between subtasks.
Hierarchy: Subgoals and Subtasks
Our proposed hierarchy is composed of subgoals, which collectively divide the problem into simpler subtasks that can be solved easily and efficiently. Each subgoal is implicitly captured as the desired state in its corresponding subtask, and we refer to the agent's skill to reach a subgoal as its corresponding subpolicy. The rationale behind this is as follows. First, the advantages of human expertise and of the agents are complementary in terms of learning and mastering the task. Human players are good at seeing the big picture and thus identify the essential and distinct steps/skills very quickly. On the other hand, agents are proficient at honing learned skills and maneuvers to a high degree of precision. Second, a hierarchy helps reduce the complexity of the search space via divide-and-conquer. Lastly, this method enhances the interpretability of the subgoals (and subpolicies).
Figure 1 illustrates the concept of subgoals and subpolicies with a simple navigation agent. The agent is learning to navigate to the flag post from the initial state $s_0$. One possible sequence of states is $s_0, s_1, \ldots, s_n$. Therefore, the entire trajectory can be decomposed into subgoals; for instance, \citeauthor{HER-HRL-Evel2019} \shortcite{HER-HRL-Evel2019} used heuristic-based subgoal selection criteria (in Figure 1 these selected subgoals are denoted by orange circles). On the other hand, the sequence of red nodes denotes the subgoals of our method. We highlight that this sequence constitutes a better guided and more efficient exploration path. In addition, this sequence is better aligned with the game, where some states are prerequisites for other states (illustrated as the black dashed arrows).
Subgoals Selection and Experience Replays
Subgoal Design and Selection. We use a method for constructing experiences with a goal space similar to previous works [HER-Andrychowicz2017, HER-HRL-Evel2019]. However, our method introduces human expertise in constructing the hierarchy and selecting the subgoals. In [HER-Andrychowicz2017], the hindsight experience replay buffer is constructed via random sampling from the goal space and concatenating the sampled goals to an already executed sequence of states, hence the name hindsight. The subgoals are initialized with heuristic-based selection and updated according to hindsight actions. For example, in Figure 1, given a predetermined subgoal $g$, the agent might not successfully reach it and instead ends up in some other state. In this case, the subgoal set in hindsight is updated to contain the state actually reached (in place of $g$).
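The hindsight relabeling idea can be sketched as follows; the function name and the choice of the final reached state as the achieved goal are our illustrative assumptions (HER samples replacement goals in several ways):

```python
def relabel_in_hindsight(trajectory):
    """Sketch of hindsight relabeling: take the state actually reached at the
    end of the episode as the goal, and recompute the sparse rewards against
    it. `trajectory` is a list of (s, a, s_next) tuples; the output tuples
    are (s, a, r, s_next, goal). Illustrative only."""
    achieved = trajectory[-1][2]  # final state stands in for the goal g
    relabeled = []
    for (s, a, s_next) in trajectory:
        r = 1.0 if s_next == achieved else 0.0
        relabeled.append((s, a, r, s_next, achieved))
    return relabeled
```

Relabeling turns an otherwise reward-free failed episode into useful training signal: the final transition always receives a positive reward with respect to the substituted goal.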
Our method differs in that the (sub)goal selection strategy is designed with human expertise to give a fixed but suitable decomposition of the learning task. Furthermore, we exploit the underlying sequential relationship among the subgoals, as in the game some states are prerequisites for others; hence, certain actions are required to be performed in order. Another reason for introducing human expertise rather than using end-to-end learning alone is that, compared with the environments investigated in previous HRL works, SC2 encompasses a significantly larger state-action space that prohibits a sample-efficient end-to-end learning strategy. As a result, our method enjoys the added advantage of interpretability of the selected subgoals.
Subtasks Implementations. We leverage the customizability of SC2 minigames to carefully design subtasks to enable training of the corresponding subpolicies, as suggested in [recent-hrl-Barto2003]. We illustrate with the Collect Minerals and Gas (CMAG) minigame, as shown and described in Figure 2. There are several distinct and sequential actions the player has to perform to score well: 1. commanding the Space Construction Vehicles (SCVs) - basic units of the game - to collect minerals; 2. having collected sufficient minerals, selecting SCVs to build the gas refinery (a prerequisite building for collecting vespene gas) on specific locations with existing gas wells; 3. commanding the SCVs to collect vespene gas from the constructed gas refinery; 4. producing additional SCVs (at a fixed cost) to optimize the mining efficiency. The minigame has a fixed time duration of 900 seconds. The challenge of CMAG is that all these actions/subpolicies should be performed in an optimized sequence for best performance. The optimality depends on the order, timing, and the number of repetitions of these actions. For instance, it is important not to under- or over-produce SCVs at a mineral site for optimal efficiency. Hence, we implemented the following subtasks: BuildRefinery, CollectGasWithRefineries and BuildRefineryAndCollectGas. In the first two subtasks, the agent learns the specific subpolicies to build refineries and to collect gas (from built refineries), respectively, while in the last subtask the agent learns to combine them. Based on the same idea, the complete decomposition for CMAG is given by [CMAG, BuildRefinery, CollectGasWithRefineries, BuildRefineryAndCollectGas, CMAG], where the first CMAG trains the agent to collect minerals, and the last CMAG trains it to combine all subpolicies and also ‘re-introduces’ the reward signal for collecting minerals to avoid forgetting [Zaremba2014-curriculum-task-specific].
Similarly, for the BuildMarines (BM) minigame, shown in Figure 3, the sequential steps/actions are: 1. commanding the SCVs to collect minerals; 2. having collected sufficient minerals, selecting SCVs to build a supply depot (a prerequisite building for barracks and to increase the supplies limit); 3. having both sufficient minerals and a supply depot, selecting SCVs to build barracks; 4. having minerals, a supply depot and barracks, and with the current unit count less than the supplies limit, selecting the barracks to train marines. The fixed time duration for BM is 450 seconds. Therefore, we implemented the corresponding subtasks: BuildSupplyDepots, BuildBarracks, BuildMarinesWithBarracks, and the complete decomposition for BM is [BuildSupplyDepots, BuildBarracks, BuildMarinesWithBarracks, BM]. Note we do not set BM as the first subtask, as we do with CMAG, because CMAG contains reward signals for both minerals and gas, so it is an adequately simple task for the agent to learn to collect minerals. However, BM has only the reward signal for training marines, and is thus too difficult as a first subtask.
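The two decompositions above can be written down directly as ordered curricula of (subtask, reward threshold) pairs; the threshold values below are illustrative placeholders, not the values used in our experiments:

```python
# Ordered subtask curricula from the decompositions described in the text.
# Subtask names are the paper's; the reward thresholds are hypothetical
# placeholders for illustration only.
CURRICULA = {
    "CMAG": [
        ("CMAG", 100),                       # learn to collect minerals
        ("BuildRefinery", 1),
        ("CollectGasWithRefineries", 100),
        ("BuildRefineryAndCollectGas", 100),
        ("CMAG", 500),                       # combine all subpolicies
    ],
    "BM": [
        ("BuildSupplyDepots", 1),
        ("BuildBarracks", 1),
        ("BuildMarinesWithBarracks", 1),
        ("BM", 3),
    ],
}
```

Writing the curriculum as ordered data makes the prerequisite structure explicit: each subtask assumes the skills trained by the ones before it.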
Construct Experience Replay for Each Subtask. With the designed subtasks represented by our customized minigames, constructing experience replays is straightforward. For a subtask, a predetermined subgoal $g_i$ is implicitly captured in its customized minigame (e.g., to build barracks, to manufacture SCVs, etc.) using a corresponding reward signal, so that the agent learns to reach $g_i$. For the immediately subsequent subtask, we set its initial conditions to be the completed subgoal $g_i$. So, the agent learns to continue on the basis of a completed $g_i$. The process is implicit because, when learning to reach the next subgoal $g_{i+1}$, the agent does not see or interact directly with the reward signal corresponding to $g_i$. For example, between two ordered subtasks CollectMinerals and BuildRefinery, the agent learns to collect minerals first and starts with some collected minerals in the latter, with the sole objective of learning to build refineries.
Off-policy learning and PPO. Off-policy learning is a learning paradigm where exploration and learning are decoupled and take place separately. Exploration is mainly used by the agent to collect experiences or ‘data points’ for its policy function or model. Learning is then conducted on these collected experiences, and Proximal Policy Optimization (PPO) [PPO-Schulman2017] is one such method. Its details are not the focus of this work and are omitted here.
Algorithm. We describe the HRL algorithm with human expertise in subgoal selection here; the pseudo-code is given in Algorithm 1. For a learning task, a sequence of subtasks is designed with human expertise to implicitly define the subgoals, and we refer to our customized SC2 minigames as the subtasks for the learning task. We pre-define a reward threshold $\lambda_i$ for each subtask. Once the agent's running average reward is higher than the threshold, the agent is considered to have learnt the corresponding subtask well and moves on to the subsequent subtask. We use learner to denote the agent and $\pi$ to describe how it makes decisions and takes actions; $\pi$ can be represented by a deep neural network parametrized by $\theta$. In addition, we define a sample count $c$ and a sample limit $N$. The sample count refers to the number of samples the agent has used for learning a subtask. The sample limit refers to the total number of samples allowed for the agent for the entire learning task, i.e., for all subtasks combined. $c$ and $N$ together are used to demonstrate empirical sample efficiency.
With these definitions and initializations, the algorithm takes the defined sequence of subtasks with the corresponding thresholds $\lambda_i$ and initiates learning on these subtasks in the same sequence. During the process, a running average of the agent's past achieved rewards is kept for each subtask, represented by the API call test(). For each subtask, either the agent completely exhausts its assigned sample limit or it successfully reaches the threshold $\lambda_i$. If the running average of past rewards exceeds $\lambda_i$, the agent completes learning on the current subtask and starts on the next; the process continues until all subtasks are learned. We follow the exploration policy in the preliminaries and adopt an $\epsilon$-greedy policy, represented by explore() in Algorithm 1.
Experience Replay RL has achieved impressive results in robotics [roboticRL], strategic games such as Go [alphaGo], real-time strategy games [deep-relational-Zambaldi2019, alphaStar], etc. Researchers have attempted in various ways to address the challenges of goal learning and reward shaping, so that the agent learns to master the task and yet does not overfit to the particular instances of the goals or reward signals. Experience Replay [experience-replay-Lin1993] is a technique to store and re-use past records of executions (along with the signals from the environment) to train the agent, achieving efficient sample usage. \citeauthor{dqn-Mnih2013} \shortcite{dqn-Mnih2013} employed this technique together with Deep Q-Learning to produce state-of-the-art results in Atari, and subsequently \citeauthor{human-level-control-Mnih2015} \shortcite{human-level-control-Mnih2015} confirmed the effectiveness of such an approach under the stipulation that the agent only sees what human players would see, i.e., the pixels on the screen and some scoring indices.
Curriculum Learning \citeauthor{Bengio2009-curriculum-learning} \shortcite{Bengio2009-curriculum-learning} hypothesized and empirically showed that introducing gradually more difficult examples speeds up online learning, using a manually designed, task-specific curriculum. \citeauthor{Zaremba2014-curriculum-task-specific} \shortcite{Zaremba2014-curriculum-task-specific} showed experimentally that it is important to mix in easy tasks to avoid forgetting. \citeauthor{Justesen2018-illuminating-increasing-difficulty} \shortcite{Justesen2018-illuminating-increasing-difficulty} demonstrated that training an RL agent over a simple curriculum with gradually increasing difficulty can effectively prevent overfitting and lead to better generalization.
Hierarchical Reinforcement Learning (HRL) HRL and its related concepts such as options [options-Sutton1999], macro-actions [macro-actions-Hauskrecht2013], and tasks [CSRL-zhuoru2017-medcomp] were introduced to decompose a problem, usually a Markov decision process (MDP), into smaller sub-parts that can be solved efficiently. We refer the readers to [recent-hrl-Barto2003, hrl-springer-Hengst2010] for more comprehensive treatments, and describe the two lines of related work most relevant to our problem. \citeauthor{HRL-subgoalsubpolicy-Bakker2004} \shortcite{HRL-subgoalsubpolicy-Bakker2004} proposed a two-level hierarchy, using subgoals and subpolicies to describe the learning taking place at the lower level of the hierarchy. \citeauthor{HER-HRL-Evel2019} \shortcite{HER-HRL-Evel2019} further articulated these ideas and explicitly combined them with Hindsight Experience Replay [HER-Andrychowicz2017] for better sample efficiency and performance. Another similarly inspired approach, context sensitive reinforcement learning (CSRL), introduced by \citeauthor{CSRL-zhuoru2017-medcomp} \shortcite{CSRL-zhuoru2017-medcomp}, employs the hierarchical structure to enable effective re-use of learnt knowledge across similar (sub)tasks in a probabilistic way. In CSRL, instead of Experience Replay, efficient simulations over constructed states are used in learning, which can learn both the tasks and the environment (the transition and reward functions). CSRL scales well with the state space and is relatively easy to parallelize.
StarCraft II In addition to [deep-relational-Zambaldi2019], several works have addressed some of the challenges presented by SC2. In a real-time strategy (RTS) game such as SC2, a hierarchical architecture is an intuitive solution concept for its efficient representation and interpretability. Similar but distinct hierarchies were employed in two other works: \citeauthor{modular-sc2-lee2018} \shortcite{modular-sc2-lee2018} designed the hierarchy with semantic meaning and from an operational perspective, while \citeauthor{SC2Fulllength-Pang2018} \shortcite{SC2Fulllength-Pang2018} forewent explicit semantic meanings for higher flexibility. Both provided promising empirical results on full-length games against built-in AIs. Instead of full-length SC2 games, our investigation targets the minigames, and we propose a way to integrate human expertise, the Curriculum Learning paradigm, and the Experience Replay technique into the learning process.
In contrast to these related works, our work adopts a principle-driven HRL approach with human expertise in subgoal selection, and thus an implicit curriculum formulation for the agent, on SC2 minigames, in order to achieve empirical sample efficiency and to enhance interpretability.
In the experiments, we specifically focus on two minigames, viz., BM and CMAG, to investigate the effectiveness of our method. We choose these two because the discrepancies in performance between trained RL agents and human experts are the most significant, as reported in [Vinyals2017], suggesting that these two are the most challenging for non-hierarchical, end-to-end learning approaches. For both CMAG and BM, we have implemented our customized SC2 minigames (subtasks) as described in the proposed methodology section, and we pair them with pre-defined reward thresholds. In our experiments, the decompositions for BM and CMAG are [BuildSupplyDepots, BuildBarracks, BuildMarinesWithBarracks, BM] and [CMAG, BuildRefinery, CollectGasWithRefineries, BuildRefineryAndCollectGas, CMAG], respectively.
Model Architecture and Hyperparameters. We follow the model architecture of the Fully Convolutional agent in [Vinyals2017], utilizing an open-source implementation by \citeauthor{reaver} \shortcite{reaver}. We use the hyperparameters listed in Table 1.
Training & Testing. To evaluate the empirical sample efficiency of our method, we restrict the total number of training samples to 10 million. Note this is still significantly fewer than the 600 million samples in [Vinyals2017] or the 10 billion in [deep-relational-Zambaldi2019]. Furthermore, we adopt their practice of training multiple agents and reporting the best results attained. After training, average and maximum scores of the trained model over 30 independent episodes are reported.
Computing Resource. CPU: Intel(R) Core(TM) i9-10920X CPU @ 3.50GHz; RAM: 64 GB; GPU: GeForce RTX 2080 SUPER 8GB. The training time for a single model initialization is approximately 1.66 hours for CMAG and 1.5 hours for BM.
Table 1 (hyperparameters): the off-policy learning algorithm is PPO for both minigames.
Our experimental results demonstrate trends similar to those shown in [Vinyals2017]. The variance in the final performance achieved can be quite large over different hyperparameter sets, different or identical model parameter initializations, and other stochasticity involved in learning. For Tables 2 and 3, higher values are better; for Table 4, lower values are better. Among the 5 agents for BM, the best-performing agent achieves an average reward of 6.7 during testing, while the worst-performing agent barely achieves 0.1. Note that the average reward of 6.7 is more than double the average reward (3) of the best-performing agent reported in [Vinyals2017] for BM. In addition, our method allows an in-depth investigation into the agent's learning curves to identify which part of the learning was not effective and led to sub-optimal final performance. We compare the best (average 6.7) and worst (average 0.1) agents based on their subgoal learning curves, and find that the best agent learns effectively across all subgoals. In Figure 5, the learning curves in all subtasks show consistent progress with more samples, whereas the learning curves of the worst agent show substantially less progress, often flat at zero with very rare spikes, as shown in Figure 6. Especially for the BuildBarracks subtask, the worst agent's learning is ineffective: it only occasionally stumbles upon the correct actions of building barracks at random and receives the corresponding reward signal. The comparison between the running average rewards of these two agents clearly demonstrates that learning for the best agent on the BuildBarracks subtask is significantly more effective. The performance on this subtask also affects the final subtask BuildMarines, since without knowing how to build barracks, the agent cannot take the action of producing marines even if it has learnt that subpolicy.
We believe such interpretability and explainability provided by our method are helpful in understanding and improving the learning process and the behavior of the agent.
On the other hand, the experimental results in CMAG show slightly less success. We believe this can be attributed to the difference in the setting of learning. In BM, the agent has to learn distinct skills and how to execute them in sequence in order to perform well, with relatively less emphasis on the degree of mastery of these skills. However, in CMAG the agent’s mastery of the skills including mining minerals and gas directly and critically affects its final score, viz., total amount of minerals and gas collected. It means that the agent has to be able to perform the skills well, i.e., optimize with respect to time and manufacturing cost, which in itself can be a separate and more complex learning task. Another experimental difficulty for CMAG lies in the reward scales because the subtasks for collecting minerals and gas have high reward ceilings (as high as several thousand), while those for building the gas refineries have comparatively low reward ceilings (less than one hundred). Due to this large difference in the scales of the reward signals between subtasks, the learning on the subtasks is even more difficult and can be unbalanced.
Conclusion & Future Work
In this work, we examined the SC2 minigames and proposed a way to introduce human expertise to an HRL framework. By designing customized minigames to facilitate learning and leveraging the effectiveness of hierarchical structures in decomposing complex and large problems, we empirically showed that our approach is sample-efficient and enhances interpretability. This initial work invites several exploration directions: developing more efficient and effective ways of introducing human expertise; a more formal and principled state representation to further reduce the complexity of the state space (goal space) with theoretical analysis on its complexity; and a more efficient learning algorithm to pair with the HRL architecture, Experience Replay and Curriculum Learning.
This work was partially supported by an Academic Research Grant T1 251RES1827 from the Ministry of Education in Singapore and a grant from the Advanced Robotics Center at the National University of Singapore.
- According to [alphaStar], each of their 12 agents was trained on 32 TPUs for 44 days.